Abstract
In this work, we systematically expose and measure the inconsistency and knowledge gaps of Large Language Models (LLMs). Specifically, we propose an automated testing framework (called KONTEST) which leverages a knowledge graph to construct test cases. KONTEST probes and measures the inconsistencies in the LLM’s knowledge of the world via a combination of semantically-equivalent queries and test oracles (metamorphic or ontological oracles). KONTEST further mitigates knowledge gaps via a weighted LLM model ensemble. Using four state-of-the-art LLMs (Falcon, Gemini, GPT3.5, and Llama2), we show that KONTEST generates 19.2% error-inducing inputs (1917 errors from 9979 test inputs). It also reveals a 16.5% knowledge gap across all tested LLMs. A mitigation method informed by KONTEST’s test suite reduces the LLM knowledge gap by 32.48%. Our ablation study further shows that GPT3.5 is not suitable for knowledge-based consistency testing because it is only 60%-68% effective in knowledge construction.
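The metamorphic oracle described in the abstract can be illustrated with a minimal sketch. The snippet below is not KONTEST itself: it assumes a generic `llm` callable standing in for any chat-completion API, and the names `metamorphic_consistency_check` and `toy_llm` are hypothetical. It shows only the core check: semantically-equivalent queries posed to the same model must receive equivalent answers, and any disagreeing pair is counted as an error-inducing input.

```python
import itertools

def metamorphic_consistency_check(llm, paraphrases):
    """Flag inconsistent answers across semantically-equivalent queries.

    Metamorphic oracle: equivalent inputs must yield equivalent outputs,
    so any pair of paraphrases answered differently exposes an error.
    """
    answers = {p: llm(p).strip().lower() for p in paraphrases}
    errors = [(a, b) for a, b in itertools.combinations(paraphrases, 2)
              if answers[a] != answers[b]]
    return answers, errors

# Toy stand-in for a real LLM API call (hypothetical).
def toy_llm(prompt: str) -> str:
    return "No" if "true or false" in prompt else "Yes"

queries = [
    "Is a whale a mammal?",
    "Does a whale belong to the class of mammals?",
    "A whale is a mammal: true or false?",
]
answers, errors = metamorphic_consistency_check(toy_llm, queries)
print(errors)  # a non-empty list marks this fact as error-inducing
```

A real implementation would normalize answers before comparison (e.g., mapping “yes”/“true” to one canonical label); the raw string check here is deliberately simplified.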
| Original language | English |
| --- | --- |
| Title of host publication | Findings of the Association for Computational Linguistics: EMNLP 2024 |
| Publication status | Accepted/In press - 20 Sept 2024 |
| Event | The 2024 Conference on Empirical Methods in Natural Language Processing, Hyatt Regency Miami Hotel, Miami, United States. Duration: 12 Nov 2024 → 16 Nov 2024. https://2024.emnlp.org/ |
Conference

| Conference | The 2024 Conference on Empirical Methods in Natural Language Processing |
| --- | --- |
| Abbreviated title | EMNLP 2024 |
| Country/Territory | United States |
| City | Miami |
| Period | 12/11/24 → 16/11/24 |
| Internet address | https://2024.emnlp.org/ |
Keywords
- consistency testing
- knowledge testing
- LLMs
- Large Language Models
- knowledge tracing/discovering/inducing
- probing
- robustness