Optimizing Predictions for Very Small Data Sets: a case study on Open-Source Project Health Prediction

by Andre Lustosa, et al.

When learning from very small data sets, the resulting models can make many mistakes. For example, consider learning predictors for open source project health. The training data for this task may be very small (e.g. five years of data, collected every month, means just 60 rows of training data). Using this data, prior work reported unacceptably large errors in its learned predictors. We show that these high error rates can be tamed by better configuration of the control parameters of the machine learners. For example, we present here a landscape analytics method (called SNEAK) that (a) clusters the data to find the general landscape of the hyperparameters; then (b) explores a few representatives from each part of that landscape. SNEAK is both faster and more effective than prior state-of-the-art hyperparameter optimization algorithms (FLASH, HYPEROPT, OPTUNA, and differential evolution). More importantly, the configurations found by SNEAK had far less error than those found by other methods. We conjecture that SNEAK works so well because it finds the most informative regions of the hyperparameters, then jumps to those regions. Other methods (that do not reflect over the landscape) can waste time exploring less informative options. From this, we draw the following conclusions. Firstly, for predicting open source project health, we recommend landscape analytics (e.g. SNEAK). Secondly, and more generally, when learning from very small data sets, we recommend using hyperparameter optimization (e.g. SNEAK) to select learning control parameters. Due to its speed and implementation simplicity, we suggest SNEAK might also be useful in other "data-light" SE domains. To assist other researchers in repeating, improving, or even refuting our results, all our scripts and data are available on GitHub at https://github.com/zxcv123456qwe/niSneak
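To make the "cluster the landscape, then evaluate only a few representatives" idea concrete, here is a minimal Python sketch. It is not the authors' SNEAK implementation (see the linked repository for that). It assumes a simple recursive two-way split around two distant "pole" configurations, and every name in it (`evaluate`, `split`, `landscape_search`, `min_size`) is hypothetical. The `evaluate` function is a toy stand-in for "train a learner with this configuration and measure its error".

```python
import random


def evaluate(config):
    """Toy stand-in (assumption) for training a learner and measuring its error."""
    x, y = config
    return (x - 0.7) ** 2 + (y - 0.2) ** 2


def dist(a, b):
    """Euclidean distance between two configurations."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def split(configs, rng):
    """Partition configs into two groups around two distant 'pole' configurations."""
    anchor = rng.choice(configs)
    east = max(configs, key=lambda c: dist(c, anchor))  # far from a random point
    west = max(configs, key=lambda c: dist(c, east))    # far from the first pole
    left = [c for c in configs if dist(c, east) <= dist(c, west)]
    right = [c for c in configs if dist(c, east) > dist(c, west)]
    return east, west, left, right


def landscape_search(configs, min_size=8, seed=1):
    """Recursively keep the half of the landscape whose pole has lower error.

    Only the two poles are evaluated at each level, so most candidate
    configurations are never evaluated at all.
    """
    rng = random.Random(seed)
    evaluations = 0
    while len(configs) > min_size:
        east, west, left, right = split(configs, rng)
        if not left or not right:  # all remaining configs are identical
            break
        evaluations += 2
        configs = left if evaluate(east) <= evaluate(west) else right
    # Evaluate every survivor in the final small region and return the best.
    evaluations += len(configs)
    return min(configs, key=evaluate), evaluations


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = [(rng.random(), rng.random()) for _ in range(256)]
    best, used = landscape_search(candidates)
    print("best config:", best, "evaluations used:", used)
```

The design point this sketch illustrates is the evaluation budget: each level of the split costs two evaluations regardless of how many configurations it discards, which matters when every evaluation means retraining a model on a 60-row data set.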


