Current Projects
CROSS is organized around incubator and research projects.
Software related to CROSS projects can be found in the CROSS Software Portal.
PolyPhy: Reconstruction and Visualization of Complex Spatial Networks
Now Funded under the UCSC OSPO
Fellow: Oskar Elek
Advisor: Angus Forbes
Abstract: Sparse datasets with complex, network-like structures originate in many natural processes but also in different areas of human activity. Our mission is to develop a toolkit for identifying and visualizing the underlying structures in such data. The starting point is Polyphorm, an unconventional software package inspired by the growth patterns of Physarum polycephalum, which has already been applied to 3D data in astrophysics and linguistics. PolyPhy will become a domain-independent port of Polyphorm, aimed at a broader community of interdisciplinary data scientists. In addition to the analysis of 3D data, PolyPhy will support high-dimensional datasets, which arise in genomics, neuroscience, and more generally in applications of deep learning (such as autoencoder and transformer neural networks).
SkyhookDM: Programmable Storage for Databases
Project team: Jayjeet Chakraborty
Abstract: The cloud business model requires flexible resource usage, but traditional relational databases strongly couple data to physical resources, making it difficult to add and remove database nodes. While skyhook is not a database itself, it is an enabling technology that takes some of the metadata management and data processing tasks normally handled by the DBMS and delegates them to the storage system. This approach is immediately useful for letting small or single-node databases grow to much larger sizes, and the project team identified this as a point of interest within the Postgres community, which is currently limited to storing database table files on local disk. Their current options are to replace the local disk with, for example, RAID arrays, or to migrate entirely to the cloud, where they can rent Postgres instances. However, both of these approaches still require the single-node Postgres instance to do all of the actual DBMS work. By pushing some of these capabilities from the DBMS into the storage, skyhook enables a single-node Postgres instance to scale (in part) with the amount of storage added. These storage capabilities are the new focus of skyhook (see also skyhookdm.com).
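The idea of delegating processing to storage can be illustrated with a small sketch. It is not skyhook's API: the `StorageNode` class, its `scan` method, and the sample rows are hypothetical, and the example only contrasts reading every row with asking the storage side to filter and project before returning data.

```python
class StorageNode:
    """Hypothetical storage node that can run a filter and projection locally."""

    def __init__(self, rows):
        self.rows = rows

    def read_all(self):
        # Without pushdown: every row travels to the database node.
        return list(self.rows)

    def scan(self, predicate, columns):
        # With pushdown: only matching rows and requested columns leave storage.
        return [
            {name: row[name] for name in columns}
            for row in self.rows
            if predicate(row)
        ]


rows = [
    {"id": 1, "region": "west", "amount": 40},
    {"id": 2, "region": "east", "amount": 75},
    {"id": 3, "region": "west", "amount": 90},
]
node = StorageNode(rows)
pushed_down = node.scan(lambda row: row["amount"] > 50, ["id", "amount"])
```

In this toy setup the database node receives two narrow rows instead of three full ones; the work of filtering has moved to the storage side, which is the part that grows as storage is added.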
Managing Bufferbloat in Storage Systems
Fellow: Esmaeil Mirvakili (Advisor: Carlos Maltzahn)
Abstract: Scalable storage servers consist of multiple parts that communicate asynchronously via queues. There is usually a frontend that queues access requests from storage clients and uses one or more threads to forward queued requests to a backend. The backend queues these forwarded requests and batches them to make efficient use of the storage devices it manages. Storage servers can have multiple kinds of backends with different design assumptions about their underlying storage device technologies. Requests are scheduled in the frontend to ensure different levels of service for different classes of requests. For example, requests that are generated by data scrubbers working in the background generally have a lower priority than requests from an application. Requests that have already been forwarded to a backend queue, however, are beyond the reach of the frontend scheduler, so deep backend queues can undermine its decisions. A common solution to this problem is to move request scheduling from the frontend to the backend. For various reasons, that is not always practical. The scope of this project is to have the scheduler reside in the frontend and to explore designs for backends to dynamically control the admission of requests depending on continually changing workloads and storage device technologies.
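One possible shape for backend admission control is sketched below. This is an illustration under stated assumptions, not the project's design: the `AdmissionController` class, its parameters, and the additive-increase, multiplicative-decrease policy driven by a latency target are all hypothetical choices made for the example.

```python
class AdmissionController:
    """Hypothetical backend admission limit tuned by observed request latency."""

    def __init__(self, target_latency_ms, initial_limit=8, min_limit=1, max_limit=64):
        self.target_latency_ms = target_latency_ms
        self.limit = initial_limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.in_flight = 0

    def try_admit(self):
        # Admit a request only while the backend is below its current limit.
        if self.in_flight < self.limit:
            self.in_flight += 1
            return True
        # Rejected requests stay in the frontend queue, where the frontend
        # scheduler can still reorder them.
        return False

    def complete(self, observed_latency_ms):
        # Record a finished request, then adapt the limit to its latency.
        self.in_flight -= 1
        if observed_latency_ms > self.target_latency_ms:
            self.limit = max(self.min_limit, self.limit // 2)  # back off quickly
        else:
            self.limit = min(self.max_limit, self.limit + 1)   # probe for capacity


controller = AdmissionController(target_latency_ms=10, initial_limit=4)
admitted = [controller.try_admit() for _ in range(6)]
controller.complete(observed_latency_ms=25)  # slower than target: limit 4 -> 2
controller.complete(observed_latency_ms=5)   # faster than target: limit 2 -> 3
```

Keeping the backend queue short in this way leaves most waiting requests in the frontend, so priority decisions made there still take effect when the workload or device behavior changes.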
In-Storage LSM Trees for Real-Time Data Lakehousing
Fellow: Holly Casaletto (Advisor: Peter Alvaro)
Abstract: A common analysis workload consists of a continual, bandwidth-sensitive input stream of event data and a latency-sensitive set of advanced analysis queries across a large accumulation of event data that includes the most recent events. The input stream is commonly represented as a stream of rows, whereas advanced analysis involves ranges over projections of tables and therefore favors data organized by columns. The scope of this project is to address the overall challenge by designing a scalable way to manage the lifecycle of such workload data, so that the most recent data and the still-needed old data coexist in a storage system that provides high performance both when processing incoming events and when answering analysis queries. In the data management community, the organization of data and its adaptation to changing workloads is referred to as design management and typically involves the addition and removal of indices on the logical level of data management. In contrast, this project focuses on organizing and transforming data on the physical layer, i.e., in the storage layer, also known as physical design management. Physical design management makes it possible for the most recent data to coexist with aged data in the same storage system without degrading performance.
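The row-to-column lifecycle described above can be sketched in a few lines. This is a simplified buffer-and-flush illustration, not a full LSM tree (there is no compaction or leveling) and not the project's implementation; the `RowToColumnStore` class and its methods are hypothetical, and all rows are assumed to share the same fields.

```python
class RowToColumnStore:
    """Hypothetical store: recent events as rows, older events as columns."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.row_buffer = []       # most recent events, row-oriented for fast ingest
        self.column_segments = []  # older events, each segment maps column -> values

    def ingest(self, row):
        # Appending a row is cheap; reorganization is deferred until the flush.
        self.row_buffer.append(row)
        if len(self.row_buffer) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Transform the buffered rows into one column-oriented segment.
        columns = {}
        for row in self.row_buffer:
            for name, value in row.items():
                columns.setdefault(name, []).append(value)
        self.column_segments.append(columns)
        self.row_buffer = []

    def scan_column(self, name):
        # A projection query reads flushed segments first, then the recent rows.
        values = []
        for segment in self.column_segments:
            values.extend(segment.get(name, []))
        values.extend(row[name] for row in self.row_buffer if name in row)
        return values


store = RowToColumnStore(flush_threshold=3)
for i in range(4):
    store.ingest({"event_id": i, "latency_ms": 10 * i})
latencies = store.scan_column("latency_ms")
```

After four events, three have been flushed into a column segment and one is still in the row buffer, yet the scan returns all four values, so the newest event is visible to analysis without waiting for a flush.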
Label Noise Detector
Fellow: Jiaheng Wei (Advisor: Yang Liu)
Abstract: A dataset collected from unverified sources often contains numerous noisy labels. Such label noise in real-world datasets encodes wrong correlation patterns and impairs the generalization of deep neural networks (DNNs). LND (Label Noise Detector) is an open-source project that provides efficient ways to detect corrupted patterns, i.e., the label noise transition matrix, which characterizes the probabilities of a training instance being wrongly annotated.
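To make the notion of a label noise transition matrix concrete, here is a minimal sketch. It is not LND's method or API: the function `estimate_transition_matrix` and the toy labels are hypothetical, and the example assumes that clean labels are known for a small sample, which is usually not the case in practice. Entry `T[i][j]` is the fraction of examples whose clean class is `i` but which were annotated as class `j`.

```python
from collections import Counter


def estimate_transition_matrix(clean_labels, noisy_labels, num_classes):
    """Estimate T where T[i][j] approximates P(noisy label = j | clean label = i)."""
    if len(clean_labels) != len(noisy_labels):
        raise ValueError("label lists must have the same length")
    counts = Counter(zip(clean_labels, noisy_labels))
    matrix = []
    for i in range(num_classes):
        row_total = sum(counts[(i, j)] for j in range(num_classes))
        if row_total == 0:
            # No example of class i in the sample; leave the row as zeros.
            matrix.append([0.0] * num_classes)
        else:
            matrix.append([counts[(i, j)] / row_total for j in range(num_classes)])
    return matrix


# Toy sample (hypothetical): class 0 is mislabeled once, class 1 twice.
clean = [0, 0, 0, 0, 1, 1, 1, 1]
noisy = [0, 0, 0, 1, 1, 1, 0, 0]
T = estimate_transition_matrix(clean, noisy, num_classes=2)
```

The off-diagonal entries of `T` are the per-class error rates. The hard part, estimating such a matrix when no clean labels are available, is not covered by this sketch.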