Academic Papers

Assessing the Reliability of Large Language Model Knowledge

With the increasing use of large language models (LLMs), it is important to understand their capabilities and limitations. The research paper “Assessing the Reliability of Large Language Model Knowledge” evaluates LLMs to determine their factual reliability.


The paper assesses how factors that cause hallucinations affect LLM accuracy by analysing the different prompts and formats used to probe model knowledge. The researchers test language models of different sizes with a range of prompts and formats to see how well each performs under different conditions.


The authors, including the Head of Aveni Labs, Dr Alexandra Birch, and Dr Barry Haddow, Aveni’s Head of Natural Language Processing, introduce a new metric called MOdel kNowledge relIabiliTy scORe (MONITOR).


MONITOR works by asking the same question in different ways (changing the wording or adding details) and seeing how likely the model is to give the same answer each time. This helps to identify situations where the model might be making something up instead of providing a real fact.


A high MONITOR score means the LLM is more likely to give the correct answer regardless of how you ask the question, suggesting the knowledge the LLM produced is more reliable.
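As a rough illustration of asking one question several ways, the Python sketch below measures how often a set of answers agree. It is not the MONITOR formula from the paper, which works on probability distributions and not on answer strings. The function name `agreement_rate` and the example answers are hypothetical.

```python
from collections import Counter


def agreement_rate(answers):
    """Return the fraction of answers that match the most common answer.

    Illustrative only: a simple string-level consistency check,
    not the MONITOR score defined in the paper.
    """
    if not answers:
        raise ValueError("need at least one answer")
    # Normalise case and surrounding whitespace so "Paris" and " paris " match.
    normalised = [a.strip().lower() for a in answers]
    _, top_count = Counter(normalised).most_common(1)[0]
    return top_count / len(normalised)


# Hypothetical answers a model might give to paraphrases of one question.
consistent = ["Paris", "paris", " Paris "]
inconsistent = ["Paris", "Lyon", "Marseille", "Paris"]

print(agreement_rate(consistent))
print(agreement_rate(inconsistent))
```

A rate near 1.0 means the rewordings barely changed the answer. A lower rate suggests the answer depends on how the question was phrased.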


Key takeaways from the paper:


  • Large language models (LLMs) are influenced by various factors that cause hallucinations, leading to inaccuracies in generated output. Simply measuring accuracy on one set of prompts does not capture this.


  • The novel metric MONITOR is proposed to measure the factual reliability of LLMs. It computes the distance between the probability distribution of a valid output and the distributions the same model produces when the same fact is probed with different prompts and contexts.


  • MONITOR proves effective in evaluating the factual reliability of LLMs while maintaining low computational overhead.


  • Experiments on 12 LLMs show that MONITOR correlates with average accuracy, indicating its suitability for assessing factual knowledge.


  • The FKTC (Factual Knowledge Test Corpus) test set is released to support further research in evaluating the capabilities of LLMs in generating factually correct answers.
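The idea of comparing probability distributions across prompts, mentioned in the takeaways above, can be sketched in Python. This is a minimal sketch under stated assumptions: it uses total variation distance over a small set of candidate answers, which is a choice made here for illustration and not necessarily the distance used in the paper. The function names and all probabilities are made up.

```python
def total_variation(p, q):
    """Total variation distance between two answer distributions.

    p and q map candidate answers to probabilities. The result is 0.0
    for identical distributions and 1.0 for non-overlapping ones.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def mean_distance_from_base(base, variants):
    """Average distance between a base prompt's distribution and each variant's.

    Assumption for this sketch: a plain average of distances.
    This is not the paper's MONITOR definition.
    """
    if not variants:
        raise ValueError("need at least one variant distribution")
    return sum(total_variation(base, v) for v in variants) / len(variants)


# Hypothetical probabilities over candidate answers for one fact.
base = {"Paris": 0.9, "Lyon": 0.1}
variants = [
    {"Paris": 0.9, "Lyon": 0.1},  # reworded prompt, same distribution
    {"Paris": 0.5, "Lyon": 0.5},  # added context shifts the distribution
]

print(mean_distance_from_base(base, variants))
```

In this sketch a smaller average distance means the model's answer distribution is more stable across prompts. How such distances are combined and scaled into the published score is defined in the paper itself.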

