CROSS & Google Summer of Code

Welcome to the Center for Research In Open Source Software (CROSS) Project Ideas page for GSoC participants.

GSOC 2018 SELECTIONS HAVE BEEN MADE. See two chosen CROSS projects here.

Project Ideas

Title: Skyhook: Elastic Databases for the Cloud
Mentors: Jeff LeFevre
Description: The cloud business model requires flexible resource usage, but traditional relational databases strongly couple data to physical resources, making it difficult to add and remove database nodes. The Skyhook project extends PostgreSQL with a data/resource decoupling that allows database clusters to expand and shrink dynamically and enables the query optimizer to leverage this functionality.

 

Title: ZLog entry caching
Difficulty: easy
Skills: C++
Mentors: Noah Watkins
Description: ZLog is a distributed shared-log that uses Ceph for its storage. Entries in the log are written in parallel to the storage system and can be read back in any order. Entries in the log are immutable: once they are written, they never change. This property makes it easy to aggressively cache data throughout the system. This project would involve building a cache layer that stores log entries on fast storage, allowing clients to avoid always reading from Ceph, which can be slower.
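
To make the idea concrete, here is a minimal sketch of such a read-through cache, written in Python for brevity (the project itself is C++). Because entries are immutable, cached data never needs invalidation, only capacity-based eviction; the class and method names are hypothetical, not part of ZLog's API.

```python
from collections import OrderedDict

class EntryCache:
    """Illustrative read-through cache for immutable log entries.

    Since a ZLog entry never changes after it is written, a cached
    entry can never become stale; only capacity-based eviction
    (LRU here) is needed.
    """

    def __init__(self, backend_read, capacity=1024):
        self._read = backend_read      # slower path, e.g. a read from Ceph
        self._cap = capacity
        self._entries = OrderedDict()  # log position -> entry bytes

    def read(self, position):
        if position in self._entries:
            self._entries.move_to_end(position)   # mark as recently used
            return self._entries[position]
        entry = self._read(position)              # miss: fall back to backend
        self._entries[position] = entry
        if len(self._entries) > self._cap:
            self._entries.popitem(last=False)     # evict least recently used
        return entry
```

A real implementation would persist the cache on local fast storage (e.g. an SSD) rather than in memory, but the hit/miss/eviction logic is the same.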

Title: ZLog management commands
Difficulty: medium
Skills: C++
Mentors: Noah Watkins
Description: ZLog is a distributed shared-log that uses Ceph for its storage. Entries in the log are written in parallel to the storage system and can be read back in any order. In practice Ceph may store many logs, but currently there is limited support for management (e.g. log create, delete, rename). This project would (1) design the data schema changes necessary to support basic log management functions, (2) implement algorithms for safely executing these functions, and (3) build a command-line interface that clients can use to run them.
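
As an illustration of parts (1) and (2), the sketch below models the management operations against a toy in-memory registry (Python for brevity; the real implementation would be C++ against Ceph, and all names here are hypothetical). Note how create and rename guard against clobbering an existing log:

```python
class LogRegistry:
    """Toy in-memory stand-in for the log-management metadata
    that the project would store in Ceph."""

    def __init__(self):
        self._logs = {}  # log name -> metadata record

    def create(self, name):
        if name in self._logs:
            raise ValueError(f"log already exists: {name}")
        self._logs[name] = {"head": None}

    def rename(self, old, new):
        if new in self._logs:
            raise ValueError(f"target name already taken: {new}")
        self._logs[new] = self._logs.pop(old)

    def delete(self, name):
        del self._logs[name]

    def list(self):
        return sorted(self._logs)
```

In the actual system these checks must be made atomic against concurrent clients, which is where the "safely executing" part of the project comes in.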

Title: ZLog sequencer failover
Difficulty: hard
Skills: C++
Mentors: Noah Watkins
Description: ZLog is a distributed shared-log that uses Ceph for its storage. Entries in the log are written in parallel to the storage system and can be read back in any order. A component in ZLog called the sequencer is responsible for assigning log positions to clients that wish to append to the log. At any moment there is one current primary sequencer process, but in practice that process may fail, and the system should handle this. This project would implement a failover protocol that allows a new sequencer to take over the role of a failed one.
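
One common way to make such a takeover safe is epoch-based fencing: the storage tier tracks a current epoch, a sequencer taking over first seals the old epoch, and any write tagged with a stale epoch is rejected. The Python sketch below illustrates that idea only; it is not ZLog's actual protocol or API, and all names are hypothetical.

```python
class Storage:
    """Toy storage tier that fences stale sequencers by epoch."""

    def __init__(self):
        self.epoch = 0
        self.entries = {}  # position -> data

    def seal(self):
        # Called by a sequencer taking over: invalidate the old epoch
        # and report the current tail so numbering can resume safely.
        self.epoch += 1
        tail = max(self.entries, default=-1)
        return self.epoch, tail + 1

    def write(self, epoch, pos, data):
        if epoch < self.epoch:
            raise RuntimeError("fenced: a newer sequencer has taken over")
        self.entries[pos] = data

class Sequencer:
    """Hands out monotonically increasing log positions for one epoch."""

    def __init__(self, storage):
        self.epoch, self.next_pos = storage.seal()

    def assign(self):
        pos = self.next_pos
        self.next_pos += 1
        return self.epoch, pos
```

The key property: once a new sequencer seals the epoch, any in-flight writes from the failed (or merely slow) old sequencer are rejected, so two sequencers can never both assign the same position successfully.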

Title: CruzDB tracing and monitoring
Difficulty: easy
Skills: C++
Mentors: Noah Watkins
Description: CruzDB is an append-only key-value database that runs on top of the distributed shared-log ZLog, and allows distributed clients to run transactions using multi-version concurrency control. This project would integrate tracing and monitoring into CruzDB to help with debugging and performance monitoring.

Title: CruzDB garbage collection
Difficulty: hard
Skills: C++
Mentors: Noah Watkins
Description: CruzDB is an append-only key-value database that runs on top of the distributed shared-log ZLog, and allows distributed clients to run transactions using multi-version concurrency control. The database is structured as a copy-on-write binary tree, and over time parts of the tree become unreachable. This project would implement garbage collection to remove unreachable parts of the tree from persistent storage in the underlying log.
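
Because the database is a copy-on-write tree, garbage collection amounts to a mark-and-sweep over node references: mark everything reachable from the live root(s), then remove the rest. Below is a minimal Python sketch of that idea (the node layout is hypothetical, not CruzDB's actual storage format, and the real project works against log-resident data rather than a dict):

```python
def reachable(store, roots):
    """Collect the ids of all nodes reachable from the given root ids."""
    seen = set()
    stack = list(roots)
    while stack:
        nid = stack.pop()
        if nid is None or nid in seen:
            continue
        seen.add(nid)
        node = store[nid]
        stack.append(node["left"])   # follow both child pointers
        stack.append(node["right"])
    return seen

def collect_garbage(store, live_roots):
    """Delete every node not reachable from a live root."""
    live = reachable(store, live_roots)
    for nid in list(store):
        if nid not in live:
            del store[nid]
```

The hard parts of the actual project are deciding which roots are still live (clients may hold old snapshots) and reclaiming space in an append-only log, but the reachability computation is the core of it.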

Title: CruzDB local node caching
Difficulty: easy
Skills: C++
Mentors: Noah Watkins
Description: CruzDB is an append-only key-value database that runs on top of the distributed shared-log ZLog, and allows distributed clients to run transactions using multi-version concurrency control. All data in the database is immutable, which means that it is easy to aggressively cache data. This project would add a client cache that stores evicted data from the database on local, fast storage, allowing cache misses to resolve faster than retrieving data from Ceph.

Title: Cudele integration with Hadoop/Spark
Difficulty: medium
Skills: C++
Mentor: Michael Sevilla
Description: Cudele lets administrators dynamically control the consistency and durability guarantees for subtrees in the file system namespace. The ideas were recently accepted for publication but the prototype (which is implemented on CephFS) is incomplete. This project would bring the prototype to a fully functional file system and connect it with big data processing frameworks in the Apache stack (e.g., Hadoop or Spark). This includes implementing the callbacks for all file system operations. The goal of this project is to facilitate the deployment of real-world workloads on Cudele, thus addressing the biggest criticism of the project to date.

Title: Hardware integration for the hybrid systems software environment
Difficulty: medium
Skills: Java
Mentors: Brendan Short and Ricardo Sanfelice
Description: The hybrid systems software environment is a tool for the simulation of hybrid dynamical systems. The tool can also be expanded to interface with external hardware such as sensors, actuators, microcontrollers, and other digital devices. Achieving this requires developing a software interface that integrates with the simulation engine to create a combined system in which simulated and real components work together.
 
Title: Web interface to run the hybrid systems environment software package
Difficulty: hard
Skills: Java, JavaScript, HTML
Mentors: Brendan Short and Ricardo Sanfelice
Description: The hybrid systems software environment is a tool for the simulation of hybrid dynamical systems. It is currently a Java package run from a local machine to produce numerical results; an improvement would be to run the package from a server via a web interface. This would enable larger and more complex simulations using the resources of online servers, such as UC supercomputers, AWS, etc. It would also provide a convenient platform for researchers to share and analyze the data collected.
 
Title: External data source plugin manager for the hybrid systems environment software package
Difficulty: medium
Skills: Java, SQL
Mentors: Brendan Short, Ricardo Sanfelice
Description: The hybrid systems software environment is a tool for the simulation of hybrid dynamical systems. The environment supports data from external sources, such as Google Earth and USGS, that could be used for a variety of real-world applications. Integrating such sources requires developing a system with data structures that can perform queries, store data efficiently, and provide an API for use within other systems.
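
One possible shape for such an API is a registry that maps data-source names to query handlers, so new sources can be plugged in without changing the simulation engine. The sketch below illustrates this in Python for brevity (the project itself targets Java and SQL, and all names here are hypothetical):

```python
class PluginManager:
    """Hypothetical registry for external-data-source plugins.

    Each plugin is registered under a name together with a query
    function; the rest of the system queries through this one API
    without knowing which backend answers.
    """

    def __init__(self):
        self._sources = {}  # source name -> query function

    def register(self, name, query_fn):
        self._sources[name] = query_fn

    def query(self, name, **params):
        if name not in self._sources:
            raise KeyError(f"no such data source: {name}")
        return self._sources[name](**params)
```

A production version would add result caching and a persistent store (the SQL part of the skills list), but the registration/dispatch pattern is the core design choice.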

 

Title: Port the Popper CLI tool from go to python
Difficulty: medium
Skills: Python (strong); Go (very basic)
Mentors: Ivo Jimenez

Description: The Popper CLI tool is implemented in the Go language. This project consists of porting it to Python with the goal of making the code more accessible to new contributors. The current implementation is available at this location. The new Python port will be implemented using click.
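
To illustrate the kind of subcommand structure the port would expose, here is a minimal stand-in built with the standard library's argparse (the actual port will use click, as stated above; the subcommand names below are only examples, not Popper's definitive interface):

```python
import argparse

def build_parser():
    """Sketch of a subcommand-style CLI layout for the Python port."""
    parser = argparse.ArgumentParser(prog="popper")
    sub = parser.add_subparsers(dest="command", required=True)

    # Example subcommand with a positional argument.
    init = sub.add_parser("init", help="initialize a pipeline")
    init.add_argument("name")

    # Example subcommand with no arguments.
    sub.add_parser("run", help="execute pipelines")
    return parser
```

With click, each subcommand becomes a decorated function (`@cli.command()`) under a `@click.group()`, which is largely why it makes the code more approachable for new contributors.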
 
Title: Add support for Figshare and Zenodo to the Popper CLI tool
Difficulty: Medium
Skills: REST and Python (familiar); Python and Go (strong)
Mentors: Ivo Jimenez
Description: The Popper CLI tool is used to implement experimentation pipelines that, in some cases, are associated with a scholarly article. The code for Popper pipelines is managed in a version-control repository (e.g. with git). Zenodo and Figshare are two popular online repositories that can be used to archive (create a snapshot of) and generate DOIs for a code repository. This project involves implementing new subcommands of the popper CLI tool that communicate with Zenodo and Figshare through their public REST APIs to archive the content of a repository and, optionally, generate DOIs.
GH issues: #198 & #199
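
As a sketch of what such a subcommand would do under the hood, the function below constructs (but does not send) the HTTP request that creates a new deposition via Zenodo's public REST API, using only the standard library. The endpoint and payload shape follow Zenodo's documented API; the token and metadata values are placeholders.

```python
import json
import urllib.request

ZENODO_API = "https://zenodo.org/api/deposit/depositions"

def build_deposition_request(token, metadata):
    """Build the POST request that creates a Zenodo deposition.

    Sending the request (and the later file-upload and publish steps
    that actually produce a DOI) is left to the subcommand itself.
    """
    body = json.dumps({"metadata": metadata}).encode()
    return urllib.request.Request(
        ZENODO_API,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Figshare exposes an analogous REST API (different endpoint and payload), so the subcommand can hide both behind one `popper archive`-style interface.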
 
Title: Homebrew, pip and apt packages for the Popper CLI tool
Difficulty: medium
Skills: Package managers (familiar), Bash (strong), Python (strong), Ruby (familiar)
Mentors: Ivo Jimenez
Description: The Popper CLI tool is currently installed by downloading a binary file from GitHub. This project involves creating Popper packages for at least three package managers: Homebrew (macOS), pip (Python), and apt (Debian/Ubuntu). If time permits, this will also involve creating packages for other Linux distributions.
GH issues: #216 #217 #218
 
Title: Improve CI functionality of Popper
Difficulty: medium
Skills: Python (strong), Continuous integration (familiar)
Mentors: Ivo Jimenez
Description: The Popper CLI tool provides basic CI functionality via its popper run command, which allows experiment pipelines to be automatically executed on multiple CI platforms. This project consists of extending the current functionality to support matrix executions, where the two main axes of the matrix are (1) runtime and (2) environment.
GH issues: #211 #212 #213 #214
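
Expanding such a matrix is simply the Cartesian product of the two axes: each (runtime, environment) pair becomes one execution of the pipeline. A small Python sketch, with hypothetical axis values:

```python
from itertools import product

def expand_matrix(runtimes, environments):
    """Return one execution spec per (runtime, environment) pair."""
    return [
        {"runtime": r, "environment": e}
        for r, e in product(runtimes, environments)
    ]
```

The CI extension would iterate over these specs, running the pipeline once per entry and reporting each result separately.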
 
Title: Validate dependencies of a Popper pipeline
Difficulty: medium
Skills: Python (strong), Docker (familiar)
Mentors: Ivo Jimenez
Description: The Popper CLI tool provides basic CI functionality via its popper run command, which allows experiment pipelines to be automatically executed on multiple CI platforms. This project involves extending this functionality to check dependencies when a --dependencies flag is passed to the popper run subcommand. The convention for checking is the following:

Folders named after a tool (e.g. docker or terraform) have special meaning. For each of these, tests are executed that check the integrity of the associated files. For example, if an experiment is orchestrated with Ansible, the associated files are stored in an ansible folder. When checking the integrity of this experiment, the ansible folder is inspected and the associated files are checked for health. The following is a list of currently supported folder names and their CI semantics (support for others is planned):

  • docker. An image is created for every Dockerfile.
  • datapackages. Availability of every dataset is checked.
  • vagrant. Definition of the VM is verified.
  • terraform. Infrastructure configuration files are checked by running terraform validate.

GH issues: #41
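
The folder-name-to-check dispatch described above can be sketched as a simple table lookup over the pipeline directory. In this illustrative Python sketch the check routines are stand-in descriptions, not real implementations (e.g. the docker check would actually invoke a build for each Dockerfile):

```python
from pathlib import Path

# One (hypothetical) check routine per special folder name,
# mirroring the conventions listed above.
CHECKS = {
    "docker": lambda p: f"build an image for every Dockerfile in {p.name}/",
    "datapackages": lambda p: f"check availability of datasets in {p.name}/",
    "vagrant": lambda p: f"verify the VM definition in {p.name}/",
    "terraform": lambda p: f"run 'terraform validate' in {p.name}/",
}

def validate_dependencies(pipeline_dir):
    """Inspect a pipeline directory, run the check for each recognized
    tool folder, and return a report of what was checked."""
    report = []
    for sub in sorted(Path(pipeline_dir).iterdir()):
        if sub.is_dir() and sub.name in CHECKS:
            report.append(CHECKS[sub.name](sub))
    return report
```

Unrecognized folders are simply skipped, so pipelines are free to contain other content alongside the special tool folders.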

Title: Import pipeline from existing repositories
Difficulty: medium
Skills: Python (strong), REST (familiar)
Mentors: Ivo Jimenez
Description: The Popper CLI tool provides scaffolding capabilities for initializing new Popper pipelines. Currently, if users want to reuse an existing pipeline (implemented by others), they need to go to the corresponding repository (e.g. one of the repos in https://github.com/popperized), download or clone it, and copy the contents of a folder into their own repository. This project involves automating this process by creating an import subcommand that takes an existing GitHub repo URL and imports one or more pipelines into the current repository.
GH issues: #70 #144 #145 #147 #148 #152
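
A first step for such a subcommand is parsing the repository URL into its parts. Below is a sketch of that step, assuming a hypothetical https://github.com/ORG/REPO (optionally followed by a pipeline name) URL form; fetching the folder contents via the GitHub REST API would follow from these parts.

```python
from urllib.parse import urlparse

def parse_pipeline_url(url):
    """Split a GitHub URL into (org, repo, pipeline-or-None)."""
    parts = urlparse(url)
    if parts.netloc != "github.com":
        raise ValueError("expected a github.com URL")
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        raise ValueError("URL must include an organization and a repository")
    org, repo = segments[0], segments[1]
    pipeline = segments[2] if len(segments) > 2 else None
    return org, repo, pipeline
```

When no pipeline name is given, the subcommand could list the repository's pipelines and import all of them (or prompt the user to choose).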