How can we assess the reliability of large language models?
With the increasing use of large language models (LLMs), it’s important to understand their capabilities and limitations. This research paper, “Assessing the Reliability of Large Language Model Knowledge,” focuses on evaluating LLMs to determine their factual reliability.
It assesses how factors that trigger hallucinations affect LLM accuracy by analysing the different prompts and formats used to probe knowledge. The researchers test language models of various sizes across these prompts and formats to see how well each performs under different conditions.
The authors, who include Dr Alexandra Birch, Head of Aveni Labs, and Dr Barry Haddow, Aveni’s Head of Natural Language Processing, introduce a new metric called MOdel kNowledge relIabiliTy scORe (MONITOR).
MONITOR works by asking the same question in different ways (changing the wording or adding details) and seeing how likely the model is to give the same answer each time. This helps to identify situations where the model might be making something up instead of providing a real fact.
A high MONITOR score means the LLM is more likely to give the correct answer regardless of how the question is asked, suggesting that the knowledge it produces is more reliable.
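To make the idea concrete, here is a minimal sketch in Python of a paraphrase-consistency check in this spirit. Everything in it is an illustrative assumption rather than the paper’s implementation: `query_model` is a placeholder for however you call your LLM, and the agreement score is a crude stand-in for the full MONITOR computation, which compares probability distributions rather than raw answer strings.

```python
def query_model(prompt: str) -> str:
    """Placeholder for an LLM call (API or local model); returns
    the model's short answer to the prompt."""
    raise NotImplementedError  # hypothetical: plug in your own model here

def consistency_score(templates: list[str], subject: str, gold: str) -> float:
    """Ask the same factual question through several paraphrased prompt
    templates and return the fraction of answers matching the gold fact.
    1.0 = always correct however the question is phrased, 0.0 = never."""
    answers = [query_model(t.format(subject=subject)) for t in templates]
    return sum(a.strip().lower() == gold.lower() for a in answers) / len(templates)

# The same fact probed through reworded prompts:
templates = [
    "What is the capital of {subject}?",
    "The capital city of {subject} is",
    "Q: Which city is the capital of {subject}? A:",
]
# score = consistency_score(templates, "France", "Paris")
```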
Key takeaways from Assessing the Reliability of Large Language Model Knowledge:
- Large language models (LLMs) are influenced by various factors that cause hallucinations, leading to inaccuracies in generated output. Simply measuring accuracy on one set of prompts does not capture this.
- The novel metric MONITOR is proposed to measure the factual reliability of LLMs by assessing the distance between probability distributions of valid outputs under different prompts and contexts (see the sketch after this list).
- MONITOR proves effective in evaluating the factual reliability of LLMs while maintaining low computational overhead.
- Experiments on 12 LLMs demonstrate the correlation between MONITOR and average accuracy, indicating its suitability for assessing factual knowledge.
- The FKTC (Factual Knowledge Test Corpus) test set is released to support further research in evaluating the capabilities of LLMs in generating factually correct answers.
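For the distributional formulation in the second takeaway, a small numerical sketch may help. The candidate answers, the probabilities, and the choice of total variation distance below are all assumptions made for illustration; the paper defines its own distance over valid outputs.

```python
def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two answer distributions:
    0.0 = identical beliefs, 1.0 = completely disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical probabilities a model assigns to candidate answers for
# "capital of France", read off under two differently worded prompts.
prompt_a = {"Paris": 0.90, "Lyon": 0.05, "Marseille": 0.05}
prompt_b = {"Paris": 0.55, "Lyon": 0.30, "Marseille": 0.15}

# Small distances across many prompt pairs suggest the fact is stored
# reliably; large, unstable distances hint at hallucination risk.
print(f"distance between prompts: {total_variation(prompt_a, prompt_b):.2f}")  # 0.35
```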