Less, but Stronger: On the Value of Strong Heuristics in Semi-supervised Learning for Software Analytics

by Huy Tu, et al.

In many domains, examples are plentiful but labels for those examples are scarce; e.g. we may have access to millions of lines of source code, but to only a handful of warnings about that code. In those domains, semi-supervised learners (SSL) can extrapolate labels from a small number of examples to the rest of the data. Standard SSL algorithms use “weak” knowledge (i.e. knowledge not based on specific SE knowledge); e.g. they co-train two learners and use good labels from one to train the other. Another approach to SSL in software analytics is to use “strong” knowledge, i.e. knowledge specific to SE. For example, an often-used heuristic in SE is that unusually large artifacts contain undesired properties (e.g. more bugs). This paper argues that such “strong” algorithms perform better than the standard, weaker SSL algorithms. We show this by learning models from labels generated using weak SSL or our “stronger” FRUGAL algorithm. In four domains (distinguishing security-related bug reports; mitigating bias in decision-making; predicting issue close time; and reducing false alarms in static code warnings), FRUGAL required only 2.5% of the data to be labeled yet out-performed standard semi-supervised learners that relied on (e.g.) some domain-independent graph theory concepts. Hence, for future work, we strongly recommend the use of strong heuristics for semi-supervised learning for SE applications. To better support other researchers, our scripts and data are on-line at https://github.com/HuyTu7/FRUGAL.
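
To make the "unusually large artifacts" heuristic concrete, here is a minimal Python sketch. It is not the FRUGAL algorithm and is not taken from the authors' repository; the function names (`label_by_size`, `agreement`), the median cutoff, and the toy numbers are all assumptions chosen for illustration. The idea is to guess a label for every artifact from its size alone, then use a small labeled subset only to check how often the guess agrees.

```python
from statistics import median


def label_by_size(sizes):
    """Guess 1 (suspect) for artifacts larger than the median size, else 0.

    The median cutoff is an illustrative assumption, not a value from the paper.
    """
    cutoff = median(sizes)
    return [1 if size > cutoff else 0 for size in sizes]


def agreement(guesses, known_labels):
    """Fraction of the few known labels that the heuristic guesses match.

    known_labels maps an artifact index to its true label (0 or 1).
    """
    matches = sum(1 for i, label in known_labels.items() if guesses[i] == label)
    return matches / len(known_labels)


# Hypothetical toy data: sizes (e.g. lines of code) for five artifacts,
# with true labels available for only three of them.
sizes = [10, 200, 35, 900, 50]
guesses = label_by_size(sizes)
known = {1: 1, 2: 0, 4: 1}
score = agreement(guesses, known)
```

In this sketch the labeled subset is used only for validation, which is one simple way a domain heuristic can stand in for most of the labeling effort.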


When Less is More: On the Value of "Co-training" for Semi-Supervised Software Defect Predictors

Labeling a module defective or non-defective is an expensive task. Hence...

FRUGAL: Unlocking SSL for Software Analytics

Standard software analytics often involves having a large amount of data...

Can We Achieve Fairness Using Semi-Supervised Learning?

Ethical bias in machine learning models has become a matter of concern i...

SNEAK: Faster Interactive Search-based Software Engineering (using Semi-Supervised Learning)

When reasoning over complex models, AI tools can generate too many solut...

Semi-Supervised Learning with Declaratively Specified Entropy Constraints

We propose a technique for declaratively specifying strategies for semi-...

Simpler Hyperparameter Optimization for Software Analytics: Why, How, When?

How to make software analytics simpler and faster? One method is to matc...

How to Improve AI Tools (by Adding in SE Knowledge): Experiments with the TimeLIME Defect Reduction Tool

AI algorithms are being used with increased frequency in SE research and...
