Moving Fast With Broken Data

by   Shreya Shankar, et al.

Machine learning (ML) models in production pipelines are frequently retrained on the latest partitions of large, continually-growing datasets. Due to engineering bugs, partitions in such datasets almost always have some corrupted features; thus, it's critical to detect data issues and block retraining before downstream ML model accuracy decreases. However, it's difficult to identify when a partition is corrupted enough to block retraining. Blocking too often yields stale model snapshots in production; blocking too little yields broken model snapshots in production. In this paper, we present an automatic data validation system for ML pipelines implemented at Meta. We employ what we call a Partition Summarization (PS) approach to data validation: each timestamp-based partition of data is summarized with data quality metrics, and summaries are compared to detect corrupted partitions. We describe how we can adapt PS for several data validation methods and compare their pros and cons. Since none of the methods by themselves met our requirements for high precision and recall in detecting corruptions, we devised GATE, our high-precision and recall data validation method. GATE gave a 2.1x average improvement in precision over the baseline on a case study with Instagram's data. Finally, we discuss lessons learned from implementing data validation for Meta's production ML pipelines.


Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines

Data pipelines are widely employed in modern enterprises to power a vari...

Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes

Complex data pipelines are increasingly common in diverse applications s...

Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities

Machine learning (ML) is now commonplace, powering data-driven applicati...

Operationalizing Machine Learning: An Interview Study

Organizations rely on machine learning engineers (MLEs) to operationaliz...

Picket: Self-supervised Data Diagnostics for ML Pipelines

Data corruption is an impediment to modern machine learning deployments....

Finding Label and Model Errors in Perception Data With Learned Observation Assertions

ML is being deployed in complex, real-world scenarios where errors have ...

Please sign up or login with your details

Forgot password? Click here to reset