Data Invariants: On Trust in Data-Driven Systems

by   Anna Fariha, et al.

The reliability and proper function of data-driven applications hinge on the data's continued conformance to the applications' initial design. When data deviates from this initial profile, system behavior becomes unpredictable. Data profiling techniques such as functional dependencies and denial constraints encode patterns in the data that can be used to detect deviations. But traditional methods typically focus on exact constraints and categorical attributes, and are ill-suited for tasks such as determining whether the prediction of a machine learning system can be trusted or for quantifying data drift. In this paper, we introduce data invariants, a new data-profiling primitive that models arithmetic relationships involving multiple numerical attributes within a (noisy) dataset and which complements the existing data-profiling techniques. We propose a quantitative semantics to measure the degree of violation of a data invariant, and establish that strong data invariants can be constructed from observations with low variance on the given dataset. A concrete instance of this principle gives the surprising result that low-variance components of a principal component analysis (PCA), which are usually discarded, generate better invariants than the high-variance components. We demonstrate the value of data invariants on two applications: trusted machine learning and data drift. We empirically show that data invariants can (1) reliably detect tuples on which the prediction of a machine-learned model should not be trusted, and (2) quantify data drift more accurately than the state-of-the-art methods. Additionally, we show four case studies where an intervention-centric explanation tool uses data invariants to explain causes for tuple non-conformance.


Data-Driven Loop Invariant Inference with Automatic Feature Synthesis

We present LoopInvGen, a tool for generating loop invariants that can pr...

Machine-Learning Arithmetic Curves

We show that standard machine-learning algorithms may be trained to pred...

DetAIL : A Tool to Automatically Detect and Analyze Drift In Language

Machine learning and deep learning-based decision making has become part...

Data-driven Numerical Invariant Synthesis with Automatic Generation of Attributes

We propose a data-driven algorithm for numerical invariant synthesis and...

Similarity Measure Development for Case-Based Reasoning- A Data-driven Approach

In this paper, we demonstrate a data-driven methodology for modelling th...

Discovering conservation laws from trajectories via machine learning

Invariants and conservation laws convey critical information about the u...

Detecting cities in aerial night-time images by learning structural invariants using single reference augmentation

This paper examines, if it is possible to learn structural invariants of...

Please sign up or login with your details

Forgot password? Click here to reset