I am Ashay Shirwadkar, a prospective Master's student at the University of California, Riverside. While working at Seagate, I heard about the Center for Research in Open Source Software (CROSS) and was intrigued by the variety of work done there. GSoC gave me an excellent opportunity to be part of an open source research group. Given my desire to contribute to storage systems, the SkyhookDM project seemed an ideal choice, one where I will get to learn about programmable storage and the architecture of Ceph. And last but not least, it is an honour to be contributing at the birthplace of Ceph!
GSoC Student Ashay Shirwadkar Extending Current Processing Methods with Column-Oriented Processing via Arrow
By Jeff LeFevre, CROSS Incubator Fellow / Adjunct Professor at UCSC
On May 6, 2019, Google Summer of Code officially announced its list of students for the Summer 2019 program. Among them was Ashay Shirwadkar, who will be working with the Center for Research in Open Source Software (CROSS) on a project he proposed entitled “Extend current processing methods with column-oriented processing via Arrow.” Ashay based his proposal on a project idea provided on the CROSS Ideas Page for 2019 GSoC applicants. His project will directly contribute to my SkyhookDM incubator project, which aims to develop programmable storage for databases.
The goal of SkyhookDM is to bring data management capabilities into the storage system, which helps with scalability for databases in the cloud. In particular, SkyhookDM aims to 'scale out' a single-node database such as PostgreSQL to support much larger data storage and processing, and to enable elasticity through capabilities embedded within the storage layer. SkyhookDM is part of the larger 'programmable storage' research at UC Santa Cruz, which leverages and further develops key functionalities already present in the storage system, and adds new capabilities by combining these in unique ways.
SkyhookDM extends Ceph distributed object storage with a limited set of database functionality. This includes common data processing tasks such as selecting, projecting, and aggregating data, as well as higher-level data management tasks such as indexing, statistics collection, data layout, and formatting. The ability to delegate some tasks to the storage layer allows a database to directly benefit from the properties of a hardened storage system, such as reliability, scalability, and correctness, while also benefitting from these new capabilities.
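To make the idea of delegating select, project, and aggregate tasks more concrete, here is a minimal sketch in plain Python. It is not SkyhookDM's or Ceph's actual interface: the function names, the `OBJECT_ROWS` sample data, and the dictionary-per-row representation are all hypothetical, chosen only to show the kind of work that could run next to the data so that only a small result travels back to the database.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical rows held inside one storage object (illustrative data only).
OBJECT_ROWS: List[Row] = [
    {"id": 1, "region": "west", "price": 10.0},
    {"id": 2, "region": "east", "price": 4.5},
    {"id": 3, "region": "west", "price": 7.25},
]


def select(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Keep only the rows that satisfy the predicate (a filter)."""
    return [row for row in rows if predicate(row)]


def project(rows: List[Row], columns: List[str]) -> List[Row]:
    """Keep only the named columns of each row."""
    return [{name: row[name] for name in columns} for row in rows]


def aggregate_sum(rows: List[Row], column: str) -> float:
    """Sum one numeric column across all rows."""
    return sum(row[column] for row in rows)


# Assumed to run on the storage side: filter, trim columns, then aggregate.
west = select(OBJECT_ROWS, lambda row: row["region"] == "west")
west_prices = project(west, ["id", "price"])
total = aggregate_sum(west_prices, "price")
```

In this sketch only `total`, or at most the trimmed `west_prices` rows, would need to cross the network, rather than every row in the object.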
A requirement of this approach is to partition the data wisely and embed the data's semantics as close to the data as possible. In this way the storage system can 'understand' the data to a limited degree, which is what allows SkyhookDM to offload some work into the storage layer. By embedding data semantics and supporting off-the-shelf data structures such as Google Flatbuffers or Flexbuffers, we enable storage to manipulate data in place (for data layout changes) or to process data locally and return the result to the client (for database queries). SkyhookDM is designed to support both row-based and column-based layouts, as well as conversion between the two.
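The difference between a row-based and a column-based layout, and conversion between the two, can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: it uses ordinary dictionaries and lists rather than Flatbuffers or any real on-disk format, it assumes every row has the same fields, and the function names are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]
Columns = Dict[str, List[object]]


def rows_to_columns(rows: List[Row]) -> Columns:
    """Convert a list of rows into one list of values per column."""
    if not rows:
        return {}
    names = list(rows[0].keys())  # assumes all rows share the same fields
    return {name: [row[name] for row in rows] for name in names}


def columns_to_rows(columns: Columns) -> List[Row]:
    """Convert per-column value lists back into a list of rows."""
    if not columns:
        return []
    names = list(columns.keys())
    length = len(columns[names[0]])  # assumes all columns have equal length
    return [{name: columns[name][i] for name in names} for i in range(length)]


row_layout = [
    {"id": 1, "price": 10.0},
    {"id": 2, "price": 4.5},
]
column_layout = rows_to_columns(row_layout)
round_trip = columns_to_rows(column_layout)
```

With the column layout, a query that touches only `price` can read `column_layout["price"]` directly without visiting the `id` values, which is the usual motivation for column-oriented storage.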
Ashay's work this summer will add support for a different column-based data layout within storage, the Apache Arrow data format. This adds a new on-disk storage format alongside our current format, Google Flatbuffers (row-based). A column-based format can be beneficial for certain types of query workloads, so this can help to improve SkyhookDM performance in some cases. The ability to support and even convert between existing SkyhookDM data formats on the fly greatly enhances the data management functions provided by the Ceph storage layer. Under the guidance of the SkyhookDM mentor team, Jeff LeFevre and Noah Watkins, Ashay will explore and evaluate different possible Arrow formats to benefit query processing directly within the storage layer, as well as to transfer data efficiently over the wire back to the database.
Ashay’s GSoC project will enhance and highlight the benefits SkyhookDM brings for database systems in the cloud. Combined with conversion between multiple data formats and data layouts, the project moves SkyhookDM toward supporting different layouts based on workload requirements, and even switching layouts if needed as workloads change over time. The new layouts will also inform cost models that are used both for traditional query planning at the database layer and for our new query planning options within the storage layer based on current storage system state. The SkyhookDM team is very excited to explore these column-based layouts and looks forward to the results. We are also grateful to Google Summer of Code for helping make this all possible!