Can LLMs Judge Information Sufficiency?
Martirosyan, Vahan
Author
Supervisor
Department
Natural Language Processing
Embargo End Date
30/05/2025
Type
Thesis
Date
2025
License
Language
English
Abstract
Recent methods in Retrieval-Augmented Generation (RAG) and Large Language Model (LLM) reasoning often use LLMs to judge whether retrieved context is sufficient to answer a user query correctly, iteratively augmenting the context until the LLM infers that it contains enough information to produce a correct response. Assessing information sufficiency for complex tasks is not trivial, and the abilities of LLMs in this regard remain understudied. Moreover, modern state-of-the-art LLMs incur significant computational costs at inference time due to their complexity and scale, so more efficient tools are needed to infer information sufficiency during retrieval. This work aims to make a meaningful contribution toward addressing these issues. Leveraging the Think-on-Graph (ToG) Graph-RAG pipeline, we use GPT-4o-mini to search the Freebase knowledge graph and sequentially retrieve relevant triples in order to answer queries from the GrailQA KGQA dataset. After each hop of knowledge graph exploration, the GPT-4o-mini agent predicts whether the retrieved triples, combined with its internal knowledge, suffice to answer the query. We track this sufficiency inference for approximately 1,500 queries and obtain reasoning traces, which we then systematically ablate by strategically removing sets of critical triples. Using inference results on these ablations, we analyze decision boundaries for successful and unsuccessful sufficiency classification. We find that GPT-4o-mini achieves an overall F1 of about 0.65 and has a strong tendency to overestimate sufficiency, often leading to incorrect final answers. To address costly and often inaccurate sufficiency classification, we finetune Flan-T5-Large and Llama-3.1-8B as dedicated sufficiency classifiers on a dataset constructed from ablations of reasoning traces obtained with ToG. Finetuned Llama-3.1-8B surpasses GPT-4o-mini, achieving 70% F1 and more reliably detecting insufficient context. Finally, we introduce a method to quantify and rank the importance of individual knowledge graph triples, and finetune Llama-3.1-8B to predict triple importance rankings. The finetuned model exhibits promising results, correctly determining the first and second ranks with approximately 82% and 64% accuracy, respectively. Our findings thus contribute to an improved understanding of information sufficiency reasoning and, more broadly, toward more efficient, factually grounded, and trustworthy LLMs.
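The sufficiency-classification evaluation described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's actual evaluation code: the labels and predictions are invented toy data, and F1 is computed from first principles over the binary "context is sufficient" decision (1 = sufficient, 0 = insufficient after ablating critical triples).

```python
# Hypothetical sketch of scoring a binary "context is sufficient" classifier.
# Ground-truth labels: 1 = context truly sufficient, 0 = insufficient
# (e.g. after critical triples were removed in an ablation).
# Predictions: the LLM agent's per-hop sufficiency judgment.

def f1(labels, preds):
    """Precision/recall/F1 for the positive ('sufficient') class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example where the model over-predicts sufficiency (many false positives),
# mirroring the overestimation tendency reported in the abstract.
labels = [1, 0, 0, 1, 0, 1, 0, 0]
preds  = [1, 1, 1, 1, 0, 1, 1, 0]
print(round(f1(labels, preds), 3))  # prints 0.667: recall is perfect, precision only 0.5
```

Note how a classifier that says "sufficient" too eagerly can still post a middling F1 while failing exactly where it matters, on insufficient contexts, which is why the thesis reports reliability on detecting insufficiency separately.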
Citation
Vahan Martirosyan, “Can LLMs Judge Information Sufficiency?,” Master of Science thesis, Natural Language Processing, MBZUAI, 2025.
Source
Conference
Keywords
Large Language Model (LLM), Knowledge Graph, Information Sufficiency, Self-referential reasoning
