Abstract
In this work, we systematically expose and measure the inconsistency and knowledge gaps of Large Language Models (LLMs). Specifically, we propose an automated testing framework (called KONTEST) which leverages a knowledge graph to construct test cases. KONTEST probes and measures the inconsistencies in the LLM’s knowledge of the world via a combination of semantically equivalent queries and test oracles (a metamorphic or an ontological oracle). KONTEST further mitigates knowledge gaps via a weighted LLM model ensemble. Using four state-of-the-art LLMs (Falcon, Gemini, GPT3.5, and Llama2), we show that KONTEST generates 19.2% error-inducing inputs (1917 errors from 9979 test inputs). It also reveals a 16.5% knowledge gap across all tested LLMs. A mitigation method informed by KONTEST’s test suite reduces the LLM knowledge gap by 32.48%. Our ablation study further shows that GPT3.5 is not suitable for knowledge-based consistency testing because it is only 60%-68% effective in knowledge construction.
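The idea of checking semantically equivalent queries against a metamorphic oracle can be illustrated with a minimal sketch. This is not KONTEST's implementation: the majority-vote comparison, the `normalize` helper, the `toy_model` stand-in, and the example queries are all assumptions made for illustration. The sketch also reproduces the reported error rate from the counts given in the abstract.

```python
from collections import Counter
from typing import Callable, List


def normalize(answer: str) -> str:
    """Reduce a free-text answer to a comparable lowercase token."""
    return answer.strip().lower().rstrip(".")


def inconsistent_queries(ask: Callable[[str], str], queries: List[str]) -> List[str]:
    """Assumed metamorphic check: equivalent queries should get the same answer.

    Returns the queries whose answer disagrees with the majority answer.
    """
    answers = [normalize(ask(q)) for q in queries]
    majority, _ = Counter(answers).most_common(1)[0]
    return [q for q, a in zip(queries, answers) if a != majority]


def error_rate(num_errors: int, num_tests: int) -> float:
    """Share of test inputs that exposed an error, as a percentage."""
    return 100.0 * num_errors / num_tests


# Hypothetical paraphrases of one fact; a real test suite would derive
# such queries from knowledge-graph entities and relations.
QUERIES = [
    "Is Paris the capital of France?",
    "Is the capital of France Paris?",
    "Is it true that France's capital is Paris?",
]


def toy_model(query: str) -> str:
    """Hypothetical stand-in for an LLM call, with one deliberate slip."""
    canned = {
        QUERIES[0]: "Yes.",
        QUERIES[1]: "yes",
        QUERIES[2]: "No",
    }
    return canned[query]


if __name__ == "__main__":
    print(inconsistent_queries(toy_model, QUERIES))
    print(round(error_rate(1917, 9979), 1))  # counts taken from the abstract
```

Replacing `toy_model` with a function that calls an actual model would turn the same check into a small consistency probe; the comparison rule would likely need to be stricter or more semantic for free-form answers.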
| Original language | English |
|---|---|
| Title of host publication | Findings of the Association for Computational Linguistics: EMNLP 2024 |
| Publisher | Association for Computational Linguistics |
| Pages | 10185-10196 |
| Number of pages | 12 |
| Publication status | Published - Nov 2024 |
| Event | The 2024 Conference on Empirical Methods in Natural Language Processing, Hyatt Regency Miami Hotel, Miami, United States. Duration: 12 Nov 2024 → 16 Nov 2024. https://2024.emnlp.org/ |
Conference
| Conference | The 2024 Conference on Empirical Methods in Natural Language Processing |
|---|---|
| Abbreviated title | EMNLP 2024 |
| Country/Territory | United States |
| City | Miami |
| Period | 12/11/24 → 16/11/24 |
| Internet address | https://2024.emnlp.org/ |
Keywords
- consistency testing
- knowledge testing
- LLMs
- Large Language Models
- knowledge tracing/discovering/inducing
- probing
- robustness