Learnability of Learning Performance and Its Application to Data Valuation
For most machine learning (ML) tasks, evaluating learning performance on a given dataset requires intensive computation. At the same time, the ability to estimate learning performance efficiently would benefit a wide spectrum of applications, such as active learning, data quality management, and data valuation. Recent empirical studies show that, for many common ML models, a parametric model that predicts learning performance for any given input dataset can be learned accurately from a small number of samples. However, the theoretical underpinning of the learnability of such performance prediction models is still missing. In this work, we develop the first theoretical analysis of the ML performance learning problem. We propose a relaxed notion of submodularity that describes well how learning performance behaves as a function of the input dataset. We give a learning algorithm that achieves a constant-factor approximation under certain assumptions. Further, we give a learning algorithm that achieves arbitrarily small error, based on a newly derived structural result. We then discuss a natural and important use case of performance learning: data valuation, which is known to suffer from computational challenges because it requires estimating learning performance for many data combinations. We show that performance learning can significantly improve the accuracy of data valuation.
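To make the computational burden of data valuation concrete, the following is a minimal sketch, not the method of this work. It assumes the Shapley value as the valuation scheme (the abstract does not name one) and uses a hypothetical toy function, `toy_performance`, with diminishing returns as a stand-in for a trained model's accuracy. The helper names `exact_shapley` and `CountingUtility` are also illustrative. The point is only that an exact valuation queries the performance function on every subset of the data, which is the cost a learned performance predictor is meant to reduce.

```python
from itertools import combinations
from math import factorial


def exact_shapley(n, utility):
    """Exact Shapley values for data points 0..n-1 under a set utility."""
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of the subset that excludes point i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal gain in performance from adding point i to s.
                values[i] += weight * (utility(s | {i}) - utility(s))
    return values


class CountingUtility:
    """Wraps a utility and counts how many distinct subsets are evaluated."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0
        self.cache = {}

    def __call__(self, subset):
        key = frozenset(subset)
        if key not in self.cache:
            self.calls += 1  # each miss stands for one full model training
            self.cache[key] = self.fn(key)
        return self.cache[key]


def toy_performance(subset):
    # Hypothetical stand-in for "train on subset, measure accuracy":
    # performance grows with subset size, with diminishing returns.
    return 1.0 - 0.5 ** len(subset)


utility = CountingUtility(toy_performance)
shapley = exact_shapley(4, utility)
print(shapley, utility.calls)
```

With four data points the wrapped utility is evaluated on all 2^4 = 16 subsets, and the count doubles with every added point. Replacing `toy_performance` with a cheap learned predictor of performance would leave the valuation loop unchanged while avoiding a model training per subset.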