Part of the AI on Trial: The Burden of Proof campaign series
Count V: Building AI on models with no financial services provenance
A firm deployed a generative AI tool across its advice and customer service functions. The model came from a major AI vendor. It looked capable, ran fast, and demoed well. Then the FCA asked the firm to document what the model had been trained on, how it had been tested for regulated work, and what its known limitations were in financial services. The firm could not answer. The vendor had never disclosed the training data. The benchmarks were generic. The audit chain stopped at the firm’s own deployment layer. The model was a black box, and the firm carried the regulatory risk for everything it produced. Nobody had asked the question at procurement.
Most AI in financial services was not built for financial services
The large language models most firms use today were trained on internet text. The training data came from sources like Reddit, Wikipedia, and general web content. Nobody selected that data for financial services, checked it for accuracy, or reviewed it against FCA expectations.
Firms then wrap the model in policies, prompts, and human oversight to make it safe for regulated work. By the time the model arrives at the firm, it has already been shaped by what it learned. Every control built around it works to compensate for that.
This retrofit approach is starting to break down. The gap between what the model knows and what the firm needs it to know keeps widening. A model trained on general internet data has to translate everything into a regulated context. A model trained on FCA guidance, suitability rules, and regulated firm data starts inside that context.
Most Chief AI Officers and Heads of Transformation will recognise the problem when they think about it from the regulator’s side. If the FCA asked you today to document what your AI model was trained on, who tested it, and what its known weaknesses are in your sector, would you have the answer? For most firms running generic LLMs, the honest answer is no.
For a full breakdown of all five governance gaps firms face when deploying AI, see the complete framework → AI Governance in UK Financial Services: The Accountability Framework
What the EU AI Act expects firms to document about their AI
The EU AI Act treats several core financial services uses, including creditworthiness assessment and risk pricing for life and health insurance, as high-risk. The first phase came into effect on 2 February 2025, banning a set of prohibited practices that the Act classes as unacceptable risk. General-purpose AI obligations followed on 2 August 2025, placing transparency and documentation duties on providers, with risk assessment and mitigation added for models with systemic risk. The high-risk obligations begin to apply on 2 August 2026.
The Act puts three concrete obligations on deployers. Firms have to document the training data sources behind the model. They have to provide technical documentation on how the model was built and tested. They have to explain how the system works to authorities when asked.
“We bought it from a major AI vendor” does not satisfy any of these. The Act expects the firm using the model, not just the firm that built it, to be able to explain what the model was trained on, by whom, and with what oversight. Even UK firms that fall outside the EU AI Act face the same standard through SS1/23 and the FCA’s wider direction on AI.
The FCA expects firms to demonstrate effective control of the AI systems they deploy. SS1/23, the PRA’s supervisory statement on model risk management, sets expectations for model identification, governance, validation, and ongoing monitoring. SMCR places personal accountability on senior managers for the outcomes their firm produces, including outcomes produced by third-party AI.
Outsourcing the model does not outsource the responsibility. UK regulators expect the firm to answer for what its AI does, regardless of who built the underlying model.
Why foundation model transparency is going in the wrong direction
The Foundation Model Transparency Index tracks how openly the major AI labs disclose information about their models. The 2025 results show a worrying pattern.
The average score climbed from 37 in 2023 to 58 in 2024, then dropped back to 40 in 2025. The companies building the models are now disclosing less than they were a year ago. The biggest gaps sit upstream, in training data, compute resources, and post-deployment impact. Across 13 assessed foundation models, the average disclosure score for Data Acquisition was 31%. For Data Properties, it was 15%.
The Artificial Analysis Openness Index tells a similar story from a different angle. It scores models on weight access, training methodology, and pre- and post-training data transparency. Most leading models score between 2 and 16 out of 100. Almost every model scores zero on pre-training data transparency.
This creates a structural problem for any regulated firm. The EU AI Act and SS1/23 require firms to evidence what their models were trained on. The vendors are moving in the opposite direction. Firms deploying these models are being asked to produce documentation that the vendors do not provide.
A regulator will not accept a model card as proof. Firms need documented provenance for the model itself. When that provenance does not exist at the vendor layer, the only way to close the gap is to control the training data and documentation directly. That is what domain-specific models are designed to do.
How hallucination rates show the limits of generic models in regulated work
Most people assume the leading AI models are reliable enough for serious work. The benchmark data tells a more complicated story.
Researchers tested 26 leading AI models on the AA-Omniscience benchmark, which covers 6,000 questions across six domains including law and health. Hallucination rates ranged from 22% at the best-performing end to 94% at the worst. A lower rate means the model produces more factual answers, or signals uncertainty rather than guessing with high confidence. Most leading models sit at the higher end of that range.
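The article does not give the benchmark’s scoring formula, so the sketch below assumes one definition that fits the description above: a hallucination is a wrong answer given where the model could have abstained. The function name and the counts are illustrative, not taken from the benchmark.

```python
def hallucination_rate(incorrect, abstained):
    """Share of not-correct responses that were wrong answers rather than abstentions.

    Assumed definition for illustration: incorrect / (incorrect + abstained).
    """
    not_correct = incorrect + abstained
    if not_correct == 0:
        return 0.0  # every question was answered correctly
    return incorrect / not_correct


# Two hypothetical models, each with 60 correct answers out of 100 questions.
always_guesses = hallucination_rate(incorrect=40, abstained=0)
often_abstains = hallucination_rate(incorrect=10, abstained=30)
print(always_guesses, often_abstains)
```

Under this assumed definition, both models have the same accuracy, but the one that admits uncertainty has a far lower hallucination rate.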
A separate benchmark, KaBLE, tests something even more relevant to advice work. It measures whether models can tell what is known apart from what is merely believed. GPT-4o scores 98.2% on tasks involving true beliefs. When a false statement is framed as something the user personally believes, rather than something a third party believes, accuracy drops to 64.4%. DeepSeek R1 falls from over 90% to 14.4% on the same task.
This distinction shows up constantly in financial advice. A customer says they believe they qualify for a particular tax relief. They say they believe their pension is on track. They say they believe a product matches their risk profile. If the AI accepts those beliefs as facts and builds a recommendation on top of them, the guidance comes out flawed. The model has no built-in instinct to push back.
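One way to make this distinction operational is to tag each customer statement with how it was established before any recommendation is built on it. The sketch below is illustrative only: `Basis`, `Statement`, and `split_for_recommendation` are hypothetical names, not part of any product or regulatory standard.

```python
from dataclasses import dataclass
from enum import Enum


class Basis(Enum):
    """How a piece of customer information was established (hypothetical labels)."""
    VERIFIED = "verified"            # confirmed against a record or document
    SELF_REPORTED = "self_reported"  # the customer believes it; nobody has checked


@dataclass(frozen=True)
class Statement:
    text: str
    basis: Basis


def split_for_recommendation(statements):
    """Separate statements that can be relied on from those needing a check first."""
    usable = [s for s in statements if s.basis is Basis.VERIFIED]
    to_verify = [s for s in statements if s.basis is Basis.SELF_REPORTED]
    return usable, to_verify


statements = [
    Statement("Pension valuation taken from the provider statement", Basis.VERIFIED),
    Statement("Customer believes they qualify for a tax relief", Basis.SELF_REPORTED),
    Statement("Customer believes the product matches their risk profile", Basis.SELF_REPORTED),
]

usable, to_verify = split_for_recommendation(statements)
print(f"{len(usable)} usable, {len(to_verify)} to verify before advising")
```

The design choice is that a belief never silently becomes a fact: anything self-reported is routed to verification or a human before it can support guidance.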
The market has started to catch on. The share of organisations rating inaccuracy as a relevant AI risk climbed from 60% in 2024 to 74% in 2025. Recognising the risk and structurally addressing it sit on opposite sides of a wide gap. Generic models perform well on benchmarks designed for general knowledge. They fall behind on tasks specific to regulated environments because nobody trained them for that work in the first place.
→ Read how AI safety in financial services depends on domain-specific design, not generic guardrails
Why training data provenance is the foundation of every other AI control
The deepest problem with generic models has nothing to do with their performance on a single benchmark. The controls firms build around them all rest on something the firm cannot actually verify.
Every governance step a firm puts in place (oversight, audit trails, monitoring, coaching) assumes the model produces outputs the firm can review and explain. That assumption falls apart when the firm cannot explain what the model knows, where it learned it, or how it was tested for the work it is doing. Provenance is the evidence chain that makes every other governance step defensible. Without it, every layer above starts to look thinner.
A regulator or auditor will ask four questions, and the firm needs an answer to each.
- What was this model trained on? Firms need documented training data sources, including sector relevance and known limitations. Phrases like “publicly available data” and “internet text” do not hold up under scrutiny. Specific sources, with documented selection criteria, do.
- How was it tested for the work we are using it for? The benchmarks need to match the firm’s actual use case. A model that scores well on general knowledge tests has not been tested for suitability assessment, vulnerability identification, or COBS 9 compliance. Those tests need to happen separately, and the firm needs to see the results.
- What does it know, and what does it not know? Firms need documented knowledge boundaries, including topics where the model should refuse or escalate to a human. A model that cannot articulate the limits of its own knowledge becomes a liability the moment it gets deployed in regulated advice.
- Who is accountable for changes to the model? Firms need versioning, change control, and clear responsibility for retraining or replacement. When the vendor updates the model without notice, the firm has lost control of its own deployment.
A firm that cannot answer these four questions has no defence to the EU AI Act’s documentation requirements or SS1/23’s validation expectations. Domain-specific models are structured to provide those answers from the start.
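The four questions can be held as a simple record that a firm checks for gaps. This is a minimal sketch under stated assumptions: `ProvenanceRecord`, its field names, and `provenance_gaps` are hypothetical, and the example values are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    """Hypothetical record covering the four provenance questions."""
    training_sources: list = field(default_factory=list)      # What was it trained on?
    use_case_benchmarks: list = field(default_factory=list)   # How was it tested for this work?
    knowledge_boundaries: list = field(default_factory=list)  # Where should it refuse or escalate?
    change_owner: str = ""                                    # Who is accountable for changes?


# Descriptions the article says do not hold up under scrutiny.
VAGUE_SOURCES = {"publicly available data", "internet text"}


def provenance_gaps(record):
    """Return the questions this record cannot answer."""
    gaps = []
    specific = [s for s in record.training_sources if s.lower() not in VAGUE_SOURCES]
    if not specific:
        gaps.append("training data")
    if not record.use_case_benchmarks:
        gaps.append("use-case testing")
    if not record.knowledge_boundaries:
        gaps.append("knowledge boundaries")
    if not record.change_owner.strip():
        gaps.append("change accountability")
    return gaps


generic = ProvenanceRecord(training_sources=["Internet text"])
print(provenance_gaps(generic))
```

A record that lists only “internet text” fails all four checks, which mirrors the position of a firm relying on an undocumented generic model.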
Why AI sovereignty matters for UK financial services firms
AI sovereignty has moved out of the policy fringe and into the centre of national AI strategy. The 2026 AI Index Report from Stanford frames sovereignty across five dimensions: infrastructure, data, model, application, and talent. Three of those dimensions matter immediately for UK financial services: model (where the model was built), data (where regulated data is processed), and infrastructure (where the compute lives).
The data shows how concentrated the situation has become. The United States leads cumulative model releases with 1,618, followed by China at 849. Europe and Central Asia together account for 666 releases, and the United Kingdom leads the European total with 229.
Data localisation tells a different story. By 2024, Europe and Central Asia had adopted 66 data localisation measures. North America had adopted 3. That gap reflects fundamentally different views on cross-border data flows. For a regulated UK or EU firm, the implications come through clearly. A model trained in one jurisdiction, hosted in another, and serving customers in a third creates layers of regulatory risk that domain-specific, in-jurisdiction models avoid altogether.
The UK’s wider direction reinforces the point. At the 2025 Paris AI Action Summit, the UK declined to sign the declaration on inclusive, ethical AI, citing a lack of emphasis on national security. That signals a UK posture more focused on security and sovereignty in AI policy. For regulated firms operating under FCA and PRA expectations, the direction matters. Domestic, documented, and sector-specific has become the regulatory baseline.
In financial services, where data governance and customer protection sit at the centre of how firms operate, sovereignty has become an operational requirement that firms need to plan for, not a debate they can sit out.
→ Explore why Sovereign AI is the foundation for secure, compliant AI in UK financial services
What domain-specific AI models for financial services actually look like
Moving from generic to domain-specific does not mean building a model from scratch. It means deploying a model that meets four specific tests.
The first test is the training data. The model needs a curated financial services corpus behind it: FCA handbook, regulatory guidance, suitability rules, sector-specific terminology, vulnerability frameworks. The sources need to be documented and selected for relevance to the work the model will do, rather than scraped from the open web.
The second test is the benchmarking. Generic benchmarks measure performance on quiz-show trivia. The benchmarks that matter for a regulated firm cover suitability assessments, vulnerability identification, COBS 9 compliance checks, and complaint handling. The model needs to be tested on the actual work, with results that hold up to scrutiny.
The third test is auditability. The architecture needs versioning, change logs, and documented retraining cycles. Someone has to be accountable for when the model changes and what changed. Without that record, the firm loses visibility over its own deployment.
The fourth test is regulatory alignment. Models built for regulated environments treat FCA, PRA, and EU AI Act expectations as design constraints from the start, rather than retrofitting them later. The alignment becomes part of the model rather than a layer bolted on at deployment.
The market has started to move in this direction. Organisations now cite ISO/IEC 42001, the AI management system standard, in 36% of cases as a regulatory influence on their responsible AI practices. The NIST AI Risk Management Framework appears in 33% of cases, and the EU AI Act in 43%. Domain-specific models can demonstrably align with all three. Generic models cannot.
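The auditability test above can be sketched as a change log that the deployed version is checked against. Everything here is hypothetical: the `ChangeLogEntry` fields, the version strings, the dates, and the approver are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChangeLogEntry:
    version: str
    approved_on: date
    approved_by: str  # the accountable owner for this change
    summary: str


def is_approved(deployed_version, change_log):
    """True only if the version serving customers has a signed-off entry."""
    return any(entry.version == deployed_version for entry in change_log)


# Illustrative entries only.
change_log = [
    ChangeLogEntry("1.0.0", date(2025, 1, 15), "Head of Model Risk", "Initial release"),
    ChangeLogEntry("1.1.0", date(2025, 6, 30), "Head of Model Risk", "Retrained on updated guidance"),
]

print(is_approved("1.1.0", change_log))  # a version with a recorded approval
print(is_approved("1.2.0", change_log))  # a silent update with no record
```

A check like this only works if the firm can see which version is actually running, which is why notification of material changes matters as much as the log itself.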
→ Discover how task-specific AI is reshaping financial services before general AI gets there
How vendor evaluation needs to change before procurement
Most AI procurement processes were designed for software, not for models. They focus on features, integrations, and pricing. The provenance questions that determine regulatory defensibility often go missing entirely.
There are five questions to ask any AI vendor before deployment.
- Can you provide documented training data sources, including sector and date coverage?
- What benchmarks have you tested the model against that match our regulated use cases?
- How is the model versioned, and how will we be notified of material changes?
- Where is the model hosted, and does our customer data leave UK or EU jurisdiction?
- Can you provide documentation that supports our obligations under the EU AI Act, SS1/23, and Consumer Duty?
The transparency data shows why most major vendors will struggle to answer these questions in any depth. Across 13 leading model developers, the average disclosure score for Data Acquisition is 31%. For Data Properties, it drops to 15%.
When a vendor cannot answer these questions, the compliance burden shifts onto the buyer. The buyer is the regulated firm, and the buyer carries the liability. Procurement processes that skip these questions before contract end up storing regulatory risk for later.
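The five questions can also work as a gate in the procurement workflow, so a contract cannot proceed while evidence is missing. The sketch below is one possible shape, not a prescribed process: the question keys, `unanswered`, `ready_to_contract`, and the file names are all hypothetical.

```python
# One key per vendor question listed above (names are illustrative).
VENDOR_QUESTIONS = (
    "training_data_sources",
    "use_case_benchmarks",
    "versioning_and_change_notice",
    "hosting_jurisdiction",
    "regulatory_documentation",
)


def unanswered(evidence):
    """Return the questions with no supporting document attached."""
    return [q for q in VENDOR_QUESTIONS if not evidence.get(q)]


def ready_to_contract(evidence):
    """The gate passes only when every question has evidence behind it."""
    return not unanswered(evidence)


# A hypothetical vendor response covering two of the five questions.
vendor_response = {
    "hosting_jurisdiction": "hosting-statement.pdf",
    "versioning_and_change_notice": "release-policy.pdf",
}

print(unanswered(vendor_response))
print(ready_to_contract(vendor_response))
```

Recording a document against each answer, rather than a yes or no, keeps the evidence available if a regulator later asks how the vendor was assessed.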
How FinLLM was built for financial services from the start
The charge in Count V is building AI on models with no financial services provenance. The defence is deploying a model that was built for the work, with the documentation to prove it.
This is what FinLLM does. The first iteration was released in May 2025 and is in live testing with a tier-one UK bank. Aveni co-developed the model with major UK financial services partners including Nationwide and Lloyds Banking Group. The training corpus draws on curated UK financial services materials, including FCA guidance, regulatory documents, and the content financial professionals actually use in their work. FinLLM was built in the UK, hosted in the UK, and designed to keep regulated data inside UK jurisdiction.
The benchmarking matches the work. AveniBench-Safety tests FinLLM against financial services-specific safety risks, rather than generic benchmarks. The architecture is documented. Versioning and change control are clear. Ongoing model risk management aligns with SS1/23 expectations.
Think of FinLLM as Exhibit E: DNA evidence. In a courtroom, DNA evidence answers the question of origin. It establishes where something came from, what it is made of, and whether it matches the conditions of the case. FinLLM provides the same kind of evidence for financial services AI: documented training data, regulated-task benchmarks, in-jurisdiction hosting, and an architecture built around FCA and EU AI Act expectations.
When a regulator or an EU AI Act assessor asks the four provenance questions, FinLLM has the answers on file. Most generic LLMs do not.
The previous four charges (oversight, audit trail, monitoring, guidance quality) all rest on this one. When the model itself was not built for the work, every other control ends up compensating for a problem that should have been solved at the source.
Where does your firm stand?
The five governance gaps outlined in the AI on Trial series cover the questions regulators, boards, and senior managers are asking now. The provenance gap determines whether your firm can defend its AI deployment from the model up, or whether the burden of proof falls on a black box you cannot inspect.
See how Aveni helps firms deploy AI that meets the regulatory standard from the model up →
This article is part of Aveni’s AI on Trial: The Burden of Proof campaign series.
Read the full series:
- AI Governance in UK Financial Services: The Accountability Framework
- Count I: SMCR Compliance and AI Agent Oversight
- Count II: AI Advice Without an Audit Trail
- Count III: Why Sampling 3% Falls Short of Consumer Duty
- Count IV: When AI Guidance Goes Wrong at Scale
- Count V: Why Financial Services AI Needs Domain-Specific Models