According to "NLP’s Clever Hans Moment has Arrived":
a model making predictions based on the presence or absence of a handful of words like “not”, “is”, or “do” does not understand anything about argumentation. The authors declare that their SOTA result is meaningless.
To detect this kind of shortcut learning, the HANS paper (https://arxiv.org/abs/1902.01007) proposes a targeted evaluation set:
We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics.
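To make the heuristics concrete, here is a minimal Python sketch of two of them, lexical overlap and subsequence, written as rule-based "classifiers". It is not the paper's code: the function names, the tokenizer, and the three sentence pairs are illustrative assumptions in the style of HANS, and the constituent heuristic is left out because it would need a syntactic parser.

```python
import re

def tokens(sentence):
    """Lowercase word tokens; a deliberately crude tokenizer (assumption)."""
    return re.findall(r"[a-z']+", sentence.lower())

def lexical_overlap(premise, hypothesis):
    """True if every hypothesis word also appears somewhere in the premise."""
    premise_words = set(tokens(premise))
    return all(word in premise_words for word in tokens(hypothesis))

def subsequence(premise, hypothesis):
    """True if the hypothesis is a contiguous run of words in the premise."""
    p, h = tokens(premise), tokens(hypothesis)
    return any(p[i:i + len(h)] == h for i in range(len(p) - len(h) + 1))

def heuristic_label(premise, hypothesis):
    """Predict 'entailment' whenever the lexical overlap heuristic fires."""
    if lexical_overlap(premise, hypothesis):
        return "entailment"
    return "non-entailment"

# Illustrative (premise, hypothesis, gold label) triples, not taken from the dataset files.
examples = [
    ("The doctor was paid by the actor.", "The doctor paid the actor.", "non-entailment"),
    ("The doctor near the actor danced.", "The actor danced.", "non-entailment"),
    ("The actor danced and the doctor slept.", "The actor danced.", "entailment"),
]

if __name__ == "__main__":
    for premise, hypothesis, gold in examples:
        predicted = heuristic_label(premise, hypothesis)
        print(f"{premise!r} -> {hypothesis!r}: predicted={predicted}, gold={gold}")
```

The heuristic labels all three pairs as entailment, so it is right only on the last one. A model that has learned the same shortcut can look strong on a test set where overlap and entailment usually coincide, and then fail on examples built so that they do not.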