The problem with “close enough” in financial advice
When a financial adviser sits down with a client, the conversation covers a lot of ground. Income, savings, debts, spending, assets. All of it needs to be recorded accurately, because it forms the foundation of the advice that follows. Get it wrong, and the advice is wrong too.
Aveni Assist already uses AI to pull this information out of call transcripts and turn it into a structured table, called a Fact Find, that advisers can work from. But even with AI doing the heavy lifting, someone still had to check the output by hand. Every table. Every field. Because in financial services, “probably right” is not good enough.
Manual checking is slow. It does not scale. And it makes it harder to improve the AI over time, because you have no quick, reliable way to measure whether a change made things better or worse.
So we rebuilt how we test and validate the AI.
Instead of relying on human reviewers to spot errors, we built an automated evaluation system that checks the AI’s output against the source material, field by field, and flags anything that looks like a hallucination. In AI, a hallucination is when the model generates something that sounds plausible but is not actually supported by what was said in the call.
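The core idea of checking each extracted field against the source transcript can be sketched in a few lines. This is an illustrative toy, not Aveni's actual system: the function and field names are hypothetical, and a production evaluator would use far fuzzier matching (numeric tolerance, paraphrase detection, entity linking) than the simple token lookup shown here.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so matching is not defeated by formatting."""
    return re.sub(r"[^a-z0-9\s.]", " ", text.lower())

def flag_hallucinations(fact_find: dict[str, str], transcript: str) -> list[str]:
    """Return names of fields whose value is not supported by the transcript.

    A value counts as grounded only if every token of it appears in the
    source text. Real evaluators are more forgiving; this just illustrates
    the field-by-field principle described above.
    """
    source_tokens = set(normalize(transcript).split())
    flagged = []
    for field, value in fact_find.items():
        tokens = normalize(str(value)).split()
        if not all(tok in source_tokens for tok in tokens):
            flagged.append(field)
    return flagged

# "monthly_rent" was never mentioned on the call, so it gets flagged.
transcript = "My salary is 42000 a year and I have about 5000 in savings"
fact_find = {"annual_income": "42000", "savings": "5000", "monthly_rent": "950"}
print(flag_hallucinations(fact_find, transcript))  # → ['monthly_rent']
```

Because each field is judged independently, a check like this yields per-field accuracy numbers, which is what makes it possible to compare two models, or two versions of the same model, quickly and repeatably.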
Then we benchmarked our in-house model, FinLLM, trained specifically for UK financial services, against GPT-4o to see how the two compared.
FinLLM, fine-tuned on realistic synthetic financial conversations, outperformed GPT-4o on the majority of individual fields across income, expenditure, and asset categories. It also produced fewer formatting errors and more consistent output, which matters a lot when the tables feed into downstream systems.
The result is a more accurate Fact Find, a faster way to test and improve the model over time, and less time spent by advisers checking AI output before they can use it.
Read the full case study for the benchmark results, methodology, and what we found when we looked specifically at hallucination rates.