NLP’s generalization problem, and how researchers are tackling it

According to "NLP's Clever Hans Moment has Arrived":

a model making predictions based on the presence or absence of a handful of words like “not”, “is”, or “do” does not understand anything about argumentation. The authors declare that their SOTA result is meaningless.
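To make the failure mode concrete, here is a minimal sketch (with made-up sentences, not the paper's data) of how a single cue word like "not" can masquerade as understanding when the dataset leaks the label through that word:

```python
# Hypothetical toy data: the label happens to correlate perfectly
# with the presence of "not" -- a spurious cue, not argumentation.
biased_data = [
    ("taxes are not good for growth", "against"),
    ("we should not expand the program", "against"),
    ("the policy is good for everyone", "for"),
    ("the program is worth expanding", "for"),
]

def cue_classifier(sentence):
    # Predict purely from the presence of the cue word "not".
    return "against" if "not" in sentence.split() else "for"

# Perfect accuracy on the biased set, with zero comprehension:
acc = sum(cue_classifier(s) == y for s, y in biased_data) / len(biased_data)
print(acc)  # 1.0

# The rule collapses as soon as the cue stops tracking the label:
print(cue_classifier("it is not unreasonable to support the policy"))
# "against" -- wrong; the sentence supports the policy
```

A model whose accuracy survives only while such cues hold is exactly what the quoted passage is warning about.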

To diagnose this kind of shortcut learning, the authors of the HANS paper write:

We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics.
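The first of these heuristics can be sketched in a few lines. Below is a toy implementation of the lexical overlap heuristic, with example sentence pairs invented in HANS's style (they are illustrations, not items from the actual evaluation set): assume a premise entails any hypothesis built entirely from the premise's words.

```python
def lexical_overlap_heuristic(premise, hypothesis):
    # Treat each sentence as a bag of words; predict entailment
    # whenever the hypothesis words are a subset of the premise words.
    p = set(premise.lower().strip(".").split())
    h = set(hypothesis.lower().strip(".").split())
    return "entailment" if h <= p else "non-entailment"

# The heuristic often happens to give the right answer:
print(lexical_overlap_heuristic(
    "The lawyer and the judge met.", "The lawyer met."))
# entailment (correct)

# But it fails on pairs where word order changes the meaning:
print(lexical_overlap_heuristic(
    "The doctor paid the actor.", "The actor paid the doctor."))
# entailment -- wrong: swapping the arguments reverses who paid whom
```

A model that has internalized this rule will score well on datasets where overlap and entailment coincide, and collapse on a controlled set like HANS where they are deliberately decoupled.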