Using Score Distributions to Compare Statistical Significance Tests for Information Retrieval Evaluation

by Javier Parapar, et al.

Statistical significance tests can provide evidence that the observed difference in performance between two methods is not due to chance. In Information Retrieval, several studies have examined the validity and suitability of such tests for comparing search systems. We argue here that current methods for assessing the reliability of statistical tests suffer from methodological weaknesses, and we propose a novel way to study significance tests for retrieval evaluation. Using Score Distributions, we model the output of multiple search systems, produce simulated search results from these models, and compare them using various significance tests. A key strength of this approach is that we assess statistical tests with perfect knowledge of whether the null hypothesis is true or false. This new method for studying the power of significance tests in Information Retrieval evaluation is formal and innovative. Following this type of analysis, we found that both the sign test and the Wilcoxon signed-rank test have more power than the permutation test and the t-test. The sign test and the Wilcoxon signed-rank test also behave well in terms of type I errors. The bootstrap test produces few type I errors, but it has less power than the other methods tested.
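The simulation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: instead of fitting Score Distributions to document scores, it assumes that per-topic effectiveness values for each system can be drawn directly and independently from a Beta distribution, and it covers only two of the tests mentioned (an exact sign test and a sign-flip permutation test). All function names, Beta parameters, topic counts, and trial counts are hypothetical choices made for illustration. Because the generating distributions are chosen by hand, it is known in every trial whether the null hypothesis holds, so rejection rates can be read as type I error rates (same parameters) or power (different parameters).

```python
import math
import random


def sign_test_p(diffs):
    """Two-sided exact sign test on paired differences (zero differences dropped)."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def permutation_test_p(diffs, rng, rounds=200):
    """Two-sided paired randomization test: flip the sign of each difference at random."""
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(rounds):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (rounds + 1)


def simulate_topics(rng, n_topics, a, b):
    """ASSUMPTION: per-topic scores follow Beta(a, b); a stand-in for a fitted model."""
    return [rng.betavariate(a, b) for _ in range(n_topics)]


def rejection_rate(test, params_x, params_y, rng, trials=200, n_topics=50, alpha=0.05):
    """Fraction of simulated system pairs for which `test` rejects at level alpha."""
    rejections = 0
    for _ in range(trials):
        x = simulate_topics(rng, n_topics, *params_x)
        y = simulate_topics(rng, n_topics, *params_y)
        diffs = [xi - yi for xi, yi in zip(x, y)]
        if test(diffs) < alpha:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    rng = random.Random(0)
    tests = {
        "sign": sign_test_p,
        "permutation": lambda d: permutation_test_p(d, rng),
    }
    for name, test in tests.items():
        # Identical models: the null hypothesis is true, so rejections are type I errors.
        type1 = rejection_rate(test, (2, 5), (2, 5), rng)
        # Different models: the null hypothesis is false, so the rejection rate is power.
        power = rejection_rate(test, (2, 5), (2.6, 5), rng)
        print(f"{name}: type I rate={type1:.3f}, power={power:.3f}")
```

Drawing the two systems independently is a simplification; real systems evaluated on the same topics produce correlated scores, which is one reason a model of each system's output is needed before conclusions like those in the abstract can be drawn.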


Inference at Scale: Significance Testing for Large Search and Recommendation Experiments

A number of information retrieval studies have been done to assess which...

Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors

Statistical significance testing is widely accepted as a means to assess...

Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches

Developing state-of-the-art approaches for specific tasks is a major dri...

A conformal test of linear models via permutation-augmented regressions

Permutation tests are widely recognized as robust alternatives to tests ...

NLPStatTest: A Toolkit for Comparing NLP System Performance

Statistical significance testing centered on p-values is commonly used t...

Size Matters: The Use and Misuse of Statistical Significance in Discrete Choice Models in the Transportation Academic Literature

In this paper we review the academic transportation literature published...

Assessing Keyness using Permutation Tests

We propose a resampling-based approach for assessing keyness in corpus l...
