The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing

Testing for statistical significance is a standard part of research that involves data analysis, for instance in statistics and the social sciences. Here I discuss the paper named in the title, which introduces the concept and argues for the importance of such testing in NLP.

What is significance testing?

At the most basic level, a result is statistically significant if it would be very unlikely to occur were the null hypothesis true (the null hypothesis being a default hypothesis which assumes that some given condition does not hold). An alternative hypothesis, which contradicts the null hypothesis, is stated as well. A suitable test is then chosen, under which a probability called the p-value is computed. Only if the p-value falls below a chosen significance level (usually 5% or 1%) is the null hypothesis rejected in favor of the alternative hypothesis. A good introduction to this can be found here.
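
To make the decision rule concrete, here is a minimal sketch using a toy example that is not from the paper: a coin lands heads 9 times out of 10, and the null hypothesis is that the coin is fair. The function names and the data are my own assumptions for illustration.

```python
from math import comb


def binomial_tail_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided p-value: P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def reject_null(p_value: float, alpha: float) -> bool:
    """Reject the null hypothesis only if the p-value is below the significance level."""
    return p_value < alpha


# Hypothetical data: 9 heads in 10 flips; null hypothesis: the coin is fair.
p = binomial_tail_p_value(successes=9, trials=10)
print(f"p-value = {p:.4f}")                      # 11/1024, about 0.0107
print("reject at 5%:", reject_null(p, 0.05))     # True
print("reject at 1%:", reject_null(p, 0.01))     # False
```

The same result is significant at the 5% level but not at the 1% level, which is why the significance level should be fixed before looking at the data.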

Significance testing in NLP

Significance testing is conspicuously rare in NLP research papers. The paper addresses this issue and explains why such testing matters, along with some common tests that can be used for NLP algorithms. The paper also surveys several papers from the ACL conference proceedings and the TACL journal, and analyzes the prevalence of significance testing in them. Finally, some open questions are discussed, including the validity of significance testing when the data has no discernible distribution, and the issue of dependent data points.

Why is this needed?

Significance testing can be an important tool to verify that a proposed algorithm’s improvement over its baseline is not merely coincidental, and that there is sound statistical evidence to support it. The paper focuses on the problem of comparing one algorithm with one other, rather than multiple comparisons across several datasets. Such a comparison can be validated with the appropriate significance test.

Types of tests

Significance tests fall into two types: parametric (used when the data distribution is assumed to be known beforehand) and non-parametric (used when the distribution is unknown). Among parametric tests, the paper discusses Student’s t-test, to be used when the data for both algorithms are assumed to be normally distributed. Among non-parametric tests, two kinds are given: sampling-free and sampling-based.
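
As an illustration of the parametric case, the sketch below computes a paired t statistic for two systems evaluated on the same test folds. The scores are made up, the helper name is my own, and the critical value 2.262 is the standard two-sided 5% value for 9 degrees of freedom from a t table. It assumes the paired differences are roughly normal, which is exactly the assumption the test relies on.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic for the paired differences a_i - b_i (n - 1 degrees of freedom)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # stdev is the sample standard deviation (n - 1 in the denominator).
    return mean(diffs) / (stdev(diffs) / sqrt(n))


# Hypothetical scores of two systems on the same 10 test folds.
system_a = [0.82, 0.79, 0.85, 0.80, 0.83, 0.81, 0.84, 0.78, 0.82, 0.80]
system_b = [0.80, 0.78, 0.82, 0.80, 0.80, 0.79, 0.83, 0.77, 0.79, 0.79]

t = paired_t_statistic(system_a, system_b)

# Two-sided critical value for alpha = 0.05 and 9 degrees of freedom (t table).
T_CRITICAL_DF9 = 2.262
print(f"t = {t:.3f}, significant at 5%: {abs(t) > T_CRITICAL_DF9}")
```

If SciPy is available, `scipy.stats.ttest_rel` returns the statistic together with an exact p-value, so no table lookup is needed.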

  • Sampling-free tests: these do not sample from the data and do not consider the actual values of the evaluation metric. The paper discusses four such tests: the sign test and two of its variants, and the Wilcoxon signed-rank test.
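
The sign test is the simplest of these, and a rough sketch looks as follows. It only counts on how many examples each system beats the other, ignoring by how much, and computes an exact two-sided p-value from the binomial distribution. The data and the function name are hypothetical, and dropping tied examples is one common convention rather than the only one.

```python
from math import comb


def sign_test_p_value(scores_a, scores_b):
    """Two-sided exact sign test on paired per-example scores; ties are dropped."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # every example is tied: no evidence either way
    k = max(wins_a, wins_b)
    # Under the null, each non-tied example favors either system with probability 1/2.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical per-example correctness (1 = correct, 0 = wrong) for two systems.
system_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
system_b = [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

p = sign_test_p_value(system_a, system_b)
print(f"p-value = {p:.4f}")  # 18/256, about 0.0703
```

Here system A wins on 7 of the 8 non-tied examples, yet the p-value is above 5%, so the difference would not be reported as significant on so few examples.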


Later in the paper, the authors analyze several papers from ACL and TACL and observe that very few report statistical significance results, that the word "significant" is often used in a misleading manner, and that many works assume the data to be i.i.d. and incorrectly use the t-test. They also discuss open questions, such as:

  1. Can significance testing be used in every experiment?

In my opinion, while statistical significance testing is an excellent method to assess the veracity of a paper’s claim, it should be used with caution because (1) it does not suit every kind of work and (2) it cannot be used on data with dependencies between data points. Furthermore, choosing the correct test is especially important.

Part-time graduate student at University of Washington | Software Engineer at Paytm, India | I try not to sweat it. Meanwhile, I write on NLP research!