Intergenerational Test Generation for Natural Language Processing Applications

Kavli Affiliate: Jia Liu

| First 5 Authors: Pin Ji, Yang Feng, Weitao Huang, Jia Liu, Zhihong Zhao

| Summary:

The development of modern NLP applications often relies on benchmark
datasets containing many manually labeled tests to evaluate performance.
Constructing such datasets consumes substantial resources, yet performance on
the held-out data may not properly reflect a model's capability in real-world
application scenarios, which can lead to serious misunderstanding and monetary
loss. To alleviate this problem, in this paper we propose an automated test
generation method for detecting erroneous behaviors of various NLP
applications. Our method is designed based on the sentence parsing process of
classic linguistics, and thus it is capable of assembling basic grammatical
elements and adjuncts into a grammatically correct test with proper oracle
information. We implement this method in NLPLego, a tool designed to fully
exploit the potential of seed sentences and automate test generation.
NLPLego disassembles the seed sentence into the template and adjuncts and then
generates new sentences by assembling context-appropriate adjuncts with the
template in a specific order. Unlike task-specific methods, the tests
generated by NLPLego have derivation relations and different degrees of
variation, which makes constructing appropriate metamorphic relations easier.
Thus, NLPLego is general, meaning it can meet the testing requirements of
various NLP applications. To validate NLPLego, we experiment with three common
NLP tasks, identifying failures in four state-of-art models. Given seed tests
from SQuAD 2.0, SST, and QQP, NLPLego successfully detects 1,732, 5301, and
261,879 incorrect behaviors with around 95.7% precision in three tasks,
respectively.
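
As a rough illustration of the disassemble-and-reassemble idea described in the summary (not the authors' actual implementation), the sketch below uses spaCy to strip verb-attached adjuncts from a seed sentence into a core template, then re-attaches them one at a time so that consecutive outputs form a derivation chain. The chosen dependency labels, the reassembly order, and all function names are assumptions made for this example.

```python
# Hypothetical sketch of template/adjunct-based test generation.
# The adjunct dependency labels and reassembly policy are illustrative
# assumptions, not NLPLego's actual algorithm.
import spacy

nlp = spacy.load("en_core_web_sm")

# Dependency relations treated as adjuncts when attached to a verb (assumed).
ADJUNCT_DEPS = {"prep", "advmod", "advcl", "npadvmod"}

def disassemble(sentence: str):
    """Split a seed sentence into a core template and a list of adjunct phrases."""
    doc = nlp(sentence)
    adjunct_tokens = set()
    adjuncts = []
    for token in doc:
        if token.dep_ in ADJUNCT_DEPS and token.head.pos_ in {"VERB", "AUX"}:
            span = doc[token.left_edge.i : token.right_edge.i + 1]
            adjuncts.append(span.text)
            adjunct_tokens.update(t.i for t in span)
    # Naive detokenization of the remaining tokens into the template.
    template = " ".join(t.text for t in doc if t.i not in adjunct_tokens)
    return template.replace(" .", ".").replace(" ,", ","), adjuncts

def assemble(template: str, adjuncts):
    """Yield derived tests by appending adjuncts back one at a time,
    so consecutive outputs differ by exactly one adjunct (a derivation chain)."""
    current = template
    yield current
    for adj in adjuncts:
        current = f"{current.rstrip('.')} {adj}."
        yield current

if __name__ == "__main__":
    seed = "The committee approved the proposal after a long debate on Tuesday."
    template, adjuncts = disassemble(seed)
    for test in assemble(template, adjuncts):
        print(test)
```

On top of such a derivation chain, a metamorphic check could assert, for example, that a sentiment classifier's prediction does not flip between adjacent sentences in the chain, since each step only adds an adjunct to an otherwise identical sentence.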

| Search Query: ArXiv Query: search_query=au:"Jia Liu"&id_list=&start=0&max_results=3
