Git-Theta: A Git Extension for Collaborative Development of Machine Learning Models

by Nikhil Kandpal, et al.

Currently, most machine learning models are trained by centralized teams and are rarely updated. In contrast, open-source software development involves the iterative development of a shared artifact through distributed collaboration using a version control system. In the interest of enabling collaborative and continual improvement of machine learning models, we introduce Git-Theta, a version control system for machine learning models. Git-Theta is an extension to Git, the most widely used version control software, that allows fine-grained tracking of changes to model parameters alongside code and other artifacts. Unlike existing version control systems that treat a model checkpoint as a blob of data, Git-Theta leverages the structure of checkpoints to support communication-efficient updates, automatic model merges, and meaningful reporting about the difference between two versions of a model. In addition, Git-Theta includes a plug-in system that enables users to easily add support for new functionality. In this paper, we introduce Git-Theta's design and features and include an example use-case of Git-Theta where a pre-trained model is continually adapted and modified. We publicly release Git-Theta in hopes of kickstarting a new era of collaborative model development.
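To make the idea of tracking a checkpoint by its structure rather than as a single blob more concrete, here is a minimal sketch of a parameter-group-level diff. It is an illustration only and not Git-Theta's actual implementation or API: the `Checkpoint` type, the `group_hash` and `diff_checkpoints` helpers, and the parameter names are all hypothetical, and a checkpoint is assumed to be a plain mapping from parameter-group name to a flat list of values.

```python
import hashlib
import struct
from typing import Dict, List

# Hypothetical representation: parameter-group name -> flat list of values.
Checkpoint = Dict[str, List[float]]


def group_hash(values: List[float]) -> str:
    """Hash one parameter group by packing its values as little-endian 64-bit floats."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


def diff_checkpoints(old: Checkpoint, new: Checkpoint) -> Dict[str, List[str]]:
    """Classify each parameter group as added, removed, or modified."""
    old_hashes = {name: group_hash(v) for name, v in old.items()}
    new_hashes = {name: group_hash(v) for name, v in new.items()}
    return {
        "added": sorted(new_hashes.keys() - old_hashes.keys()),
        "removed": sorted(old_hashes.keys() - new_hashes.keys()),
        "modified": sorted(
            name
            for name in old_hashes.keys() & new_hashes.keys()
            if old_hashes[name] != new_hashes[name]
        ),
    }


# Toy checkpoints: only the bias changes and one adapter group is added.
base = {
    "encoder.weight": [0.1, 0.2],
    "encoder.bias": [0.0],
    "head.weight": [1.0, -1.0],
}
tuned = {
    "encoder.weight": [0.1, 0.2],
    "encoder.bias": [0.05],
    "head.weight": [1.0, -1.0],
    "adapter.weight": [0.3],
}

report = diff_checkpoints(base, tuned)
print(report)
```

Under these assumptions, only the groups listed as added or modified would need to be stored or transmitted for the new version, which is the intuition behind communication-efficient updates and per-group difference reporting.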




Git4Voc: Git-based Versioning for Collaborative Vocabulary Development

Collaborative vocabulary development in the context of data integration ...

Knowledge is at the Edge! How to Search in Distributed Machine Learning Models

With the advent of the Internet of Things and Industry 4.0 an enormous a...

nPrint: A Standard Data Representation for Network Traffic Analysis

Conventional detection and classification ("fingerprinting") problems in...

Aspirations and Practice of Model Documentation: Moving the Needle with Nudging and Traceability

Machine learning models have been widely developed, released, and adopte...

The openCARP CDE – Concept for and implementation of a sustainable collaborative development environment for research software

This work describes the setup of an advanced technical infrastructure fo...

PTMTorrent: A Dataset for Mining Open-source Pre-trained Model Packages

Due to the cost of developing and training deep learning models from scr...

NSML: Meet the MLaaS platform with a real-world case study

The boom of deep learning induced many industries and academies to intro...
