Mapping Datasets to Object Storage

Research Fellow: Xiaowei (Aaron) Chu

Advisors: Carlos Maltzahn, Jeff LeFevre


Access libraries such as HDF5 allow users to interact with datasets through a high-level abstraction. However, the implementations of these libraries are based on outdated assumptions about storage system interfaces and generally do not scale. In this proposed research project, we will explore distributed dataset mapping infrastructures that can integrate and scale out important existing access libraries using the programmable storage abstractions available in Ceph, while avoiding reimplementation or even modification of these libraries as much as possible. Such an infrastructure will allow access library operations to be offloaded to storage system servers (or devices) and will fully leverage the load balancing, elasticity, and failure management of distributed storage systems like Ceph.

Distributed dataset mapping will also enable local optimizations that are specific to a particular heterogeneous device, so that global optimization strategies for distributed access can abstract over a multi-tiered storage hierarchy. The project’s research goal is to explore the means by which distributed dataset mapping can be abstracted over particular access libraries. The challenges to be addressed by this project revolve around data partitioning, composability of access operations, and distributed and modular optimizers.
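
To make the data partitioning and offloading ideas concrete, the following is a minimal illustrative sketch, not the project's implementation. It assumes a dataset can be split into fixed-size row ranges, each stored as one object, and that an operation can be applied to each object independently before the partial results are combined. The names `partition` and `offload` and the object naming scheme are hypothetical, and a plain in-memory dictionary stands in for an object store; no Ceph or HDF5 API is used.

```python
from typing import Callable, Dict, List


def partition(dataset_name: str, values: List[float],
              rows_per_object: int) -> Dict[str, List[float]]:
    """Split a dataset into fixed-size row ranges, one object per range.

    The returned dict is a stand-in for an object store: keys are
    hypothetical object names, values are the rows held by that object.
    """
    objects: Dict[str, List[float]] = {}
    for start in range(0, len(values), rows_per_object):
        key = f"{dataset_name}.{start // rows_per_object}"
        objects[key] = values[start:start + rows_per_object]
    return objects


def offload(objects: Dict[str, List[float]],
            op: Callable[[List[float]], float]) -> Dict[str, float]:
    """Apply op to each object independently, as a storage server might.

    Here the loop runs locally; in a real system each call would
    execute next to the data, and only the small results would travel.
    """
    return {key: op(rows) for key, rows in objects.items()}


# Ten rows mapped to three objects of at most four rows each.
values = [float(i) for i in range(10)]
objects = partition("temperature", values, rows_per_object=4)

# Each object computes a partial sum; the client combines them.
partial_sums = offload(objects, sum)
total = sum(partial_sums.values())
```

In this sketch only the per-object partial sums reach the client, which is the property that makes offloading attractive. Choosing `rows_per_object` is one simple instance of the partitioning question, and operations that combine as cleanly as a sum are the easy case of the composability question.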


“Mapping Scientific Datasets to Programmable Storage.” EPJ Web of Conferences, 2020.

Aaron Chu, Ivo Jimenez, Jeff LeFevre, Carlos Maltzahn. “SkyhookDM: Programmable Storage for Datasets.” Poster at IRIS-HEP Poster Session, February 2020.