Knowledge-based Consistency Testing of Large Language Models

Sai Sathiesh Rajan, Ezekiel Soremekun, Sudipta Chattopadhyay

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this work, we systematically expose and measure the inconsistency and knowledge gaps of Large Language Models (LLMs). Specifically, we propose an automated testing framework (called KONTEST) which leverages a knowledge graph to construct test cases. KONTEST probes and measures the inconsistencies in the LLM's knowledge of the world via a combination of semantically-equivalent queries and test oracles (metamorphic or ontological oracles). KONTEST further mitigates knowledge gaps via a weighted LLM model ensemble. Using four state-of-the-art LLMs (Falcon, Gemini, GPT3.5, and Llama2), we show that KONTEST generates 19.2% error-inducing inputs (1917 errors from 9979 test inputs). It also reveals a 16.5% knowledge gap across all tested LLMs. A mitigation method informed by KONTEST's test suite reduces the LLM knowledge gap by 32.48%. Our ablation study further shows that GPT3.5 is not suitable for knowledge-based consistency testing because it is only 60%-68% effective in knowledge construction.
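The abstract describes two mechanisms: a metamorphic consistency oracle over semantically-equivalent queries, and a weighted LLM ensemble for gap mitigation. The minimal Python sketch below illustrates these ideas only; the `LLM` interface, the query sets, and the weighting scheme are hypothetical stand-ins, not KONTEST's actual knowledge-graph-driven implementation.

```python
# Minimal sketch (not KONTEST itself): a metamorphic consistency check and a
# weighted ensemble vote. All names here are illustrative assumptions.
from collections import Counter
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # assumed interface: prompt in, answer out


def is_consistent(model: LLM, equivalent_queries: List[str]) -> bool:
    """Metamorphic oracle: semantically-equivalent queries about the same
    fact should produce the same (normalized) answer."""
    answers = {model(q).strip().lower() for q in equivalent_queries}
    return len(answers) == 1  # any divergence is flagged as an inconsistency


def ensemble_answer(models: Dict[str, LLM], weights: Dict[str, float],
                    query: str) -> str:
    """Gap-mitigation sketch: return the answer carrying the largest total
    weight across models (weights are assumed inputs here)."""
    votes: Counter = Counter()
    for name, model in models.items():
        votes[model(query).strip().lower()] += weights[name]
    return votes.most_common(1)[0][0]
```

In the paper's setting, such weights could plausibly be derived from each model's measured consistency on the generated test suite; here they are simply assumed inputs.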
Original language: English
Title of host publication: Findings of the Association for Computational Linguistics: EMNLP 2024
Publication status: Accepted/In press - 20 Sept 2024
Event: The 2024 Conference on Empirical Methods in Natural Language Processing - Hyatt Regency Miami Hotel, Miami, United States
Duration: 12 Nov 2024 - 16 Nov 2024
https://2024.emnlp.org/

Conference

Conference: The 2024 Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP 2024
Country/Territory: United States
City: Miami
Period: 12/11/24 - 16/11/24
Internet address: https://2024.emnlp.org/

Keywords

  • consistency testing
  • knowledge testing
  • LLMs
  • Large Language Models
  • knowledge tracing/discovering/inducing
  • probing
  • robustness
