AI agent stress testing for financial services: what good looks like

On 20 January 2026, the UK Treasury Committee published a report on AI in financial services that did something unusual for a parliamentary document. It used the word “harm” without softening. It said the regulators were too slow. And it made a specific recommendation that has since reshaped the AI agent governance conversation across UK financial services.

The Bank of England and the FCA, the Committee said, should conduct AI-specific stress testing. Not generic model risk management. Not annual scenario analysis. Stress testing built for the specific failure modes of AI systems. By April 2026, the Bank of England had agreed, confirming work on simulation methods to investigate AI agents demonstrating correlated behaviour, or “herding,” in financial markets.

That was the macroprudential conversation. The microprudential one, the conversation happening inside individual firms scoping individual agentic deployments, has the same shape and a different urgency. Every regulated firm putting an AI agent into a customer journey now needs an answer to a question most second-line teams cannot yet answer: how do we stress test this thing before it goes live?

Stress testing is the part of AI agent governance that most firms still do not have a clear framework for. Capability gets tested in the demo. Performance gets benchmarked. But the question of whether an agent will behave inside its risk envelope when the conditions get adversarial, edge-case, or commercially inconvenient? That part tends to be skipped. Or hoped for. Or assumed away.

This piece is for the firms that have stopped assuming. What follows: what AI agent stress testing actually involves, the scenarios that matter for regulated firms, and how the work differs from generic AI red teaming.

What AI agent stress testing is, and why it is different from QA

Stress testing in financial services is not new. UK banks have been running capital and liquidity stress tests under the Bank of England’s regime since 2014. The logic is straightforward. Take a complex system, hit it with conditions it was not designed for, and see whether it survives.

AI agent stress testing applies the same logic to a different object. The system being tested is not a balance sheet. It is a model, sitting inside a workflow, talking to a customer. The conditions are not a macroeconomic scenario. They are prompts, edge cases, vulnerability indicators, and adversarial inputs. The output is not a capital ratio. It is whether the agent did the right thing, or something close enough to it, when the conditions stopped being benign.

The discipline borrows from three adjacent fields and is not the same as any of them.

It is not model validation. SS1/23, the Prudential Regulation Authority’s model risk management principles for banks, was written for the deterministic models that sit behind credit decisions and capital calculations. Generative AI agents are not deterministic. The same input can produce different outputs. The validation discipline assumed by SS1/23 is necessary, but it does not cover the failure modes that matter most for an agentic deployment.

It is not generic AI red teaming. The OWASP Top 10 for LLM Applications lists prompt injection at number one for the second year running. The list is essential reading. It is also, for a regulated firm, only half the story. Red teaming finds security and safety failures. Stress testing in a regulated context has to find the conduct and regulatory failures too. The agent that gives a confident answer about a complex product to a vulnerable customer is not failing a security test. It is failing Consumer Duty.

It is not traditional QA. QA samples interactions retrospectively. Stress testing happens before deployment, against simulated interactions designed specifically to provoke the agent in directions a live conversation would not, at least not reliably. We covered why retrospective sampling falls short in Count III of the AI on Trial: The Burden of Proof series. The conclusion was that sampling 3% of live interactions tells you almost nothing about the other 97%. Stress testing is what fills the gap. It is the deliberate, structured attempt to find the 3% of failure modes that QA will not see in time.

The four scenarios that matter for UK financial services

A useful stress test is not a generic security exercise dressed up with a financial services badge. It is a structured attempt to provoke the failure modes the FCA is most likely to ask about. Four scenario categories matter most for UK regulated firms scoping an agentic deployment in 2026.

Regulatory edge cases. The agent is given a customer query that sits on the boundary of an advised and a non-advised journey. Or a Consumer Duty question dressed up as a casual one. Or a request for product information that, answered confidently, would constitute a personal recommendation. The test is whether the agent recognises the boundary and behaves accordingly. Most off-the-shelf foundation models do not, because they were not trained to see those boundaries in the first place. This is the case we made in Count V of the series, on why specialist financial services models matter more than generic ones.

Vulnerable customer indicators. The customer’s language signals vulnerability: financial distress, recent bereavement, cognitive impairment, English as a second language. The agent has to recognise the indicator, adjust its approach, and route or escalate where appropriate. A stress test for this scenario does not just check whether the agent uses the word “vulnerable” in a reply. It checks whether the agent’s actual behaviour, across a full interaction, meets the standard set out in the FCA’s vulnerable customer guidance. Count IV of the AI on Trial series sets out why this is the regulator’s most likely first enforcement target.

Prompt injection in regulated contexts. A customer pastes a contractual term into the chat and asks the agent to “agree” to it. A customer asks the agent to “ignore previous instructions” and confirm a transaction the agent is not authorised to confirm. A document uploaded as part of a journey contains an instruction designed to compromise the agent’s behaviour. The OWASP guidance covers the technical mechanics. The regulated-firm question is whether the agent, when subjected to these inputs, produces an output the FCA would call non-compliant. Indirect prompt injection (instructions hidden inside documents the model reads) is the version most firms have not yet tested for.

Consistent failure modes across volume. Generic AI red teaming tends to look for whether a model can be made to fail at least once. The interesting question for an agentic deployment in financial services is different. When the model fails, does it fail consistently? Does the same hallucinated product feature appear in every interaction with that trigger? Does the same vulnerability indicator get missed every time? Consistent failure at scale is the regulatory exposure that matters. A human adviser making the same mistake ten times is a training issue. An AI agent making it ten thousand times is a Final Notice.

What good AI agent stress testing produces

The output of a stress testing programme is not a slide. It is an evidence pack that the second line of defence can file alongside an SMCR sign-off. We covered the components of that evidence pack in detail in our guide to evidencing AI agent compliance to the FCA, but three elements specifically come from stress testing.

A failure mode inventory. Every category of failure the testing identified, with frequency, severity, and the conditions under which it occurred. This is the document the Chief Risk Officer reads before signing.

A remediation log. For each failure mode, what was changed, when, and how it was re-tested. The regulator’s interest is not whether the agent ever failed in testing. The regulator’s interest is whether the failure was found, fixed, and verified.

A residual risk statement. Honest documentation of what remains untestable, what conditions the agent has not been exposed to, and what monitoring is therefore required post-deployment. We covered why audit trails matter in Count II of the series. The residual risk statement is the bridge between pre-deployment stress testing and the real-time monitoring that has to take over once the agent goes live.

This is also where SMCR accountability lands. The named senior manager signing off on the agent’s deployment is signing off on the failure mode inventory, the remediation log, and the residual risk statement specifically. The agent does not take the call from the FCA. They do. Count I of the series sets out what reasonable steps look like in practice.

How Aveni approached this in the FCA Supercharged Sandbox

Aveni was selected for the inaugural FCA Supercharged Sandbox cohort, which ran from October 2025 to January 2026. We used the programme to pilot Agent Assure, our AI agent governance product. A large part of the work focused on exactly the stress testing question this piece covers.

Three things came out of it that should change how firms scope their own approach.

Pre-deployment validation is more defensible when it uses an independent assessor model. Asking the primary LLM to evaluate its own outputs creates a structural conflict. Aveni piloted an approach using small language models trained specifically on UK financial services data to act as independent assessors of primary model behaviour. The assessor sits outside the agent it is testing, which is the configuration the second line of defence will eventually require.

The test scenarios have to come from real interaction data, not synthetic prompts alone. Generic adversarial prompts find generic failure modes. Stress testing that finds the specific failures a UK regulated firm would be exposed to needs scenarios drawn from real customer interactions, labelled by compliance experts, and refined over multiple rounds of testing.

Evidence packs are the deliverable, not the agent. The Sandbox confirmed what we had suspected from earlier client work: the firms moving fastest on agentic deployment are the ones that define the assurance requirement first and design the agent build around it.

Where to start

If you are scoping a stress testing programme for an AI agent deployment in 2026, three actions in the next 60 days.

First, define your failure mode taxonomy. The four categories above are a starting point. Adapt them to your specific deployment and document the version your second line will accept.

Second, agree your evidence pack template before testing starts. The pack format defines what you will look for. Working backwards from the document the senior manager will sign is more efficient than testing first and writing it up after.

Third, make stress testing repeatable. A one-off test before launch is not a programme. The model will drift. The product surface will change. The threat landscape will evolve. The stress testing infrastructure has to be standing capability, not a project.

The Treasury Committee called for AI-specific stress testing because the absence of it leaves consumers and the system exposed. The same logic applies inside individual firms. The agent will be tested either way. The choice is whether it is tested before deployment, by the firm that built it, or after deployment, by the customer it failed.


Read the full AI on Trial: The Burden of Proof series


Frequently Asked Questions

What is AI agent stress testing in financial services? AI agent stress testing is a structured pre-deployment process that exposes an AI agent to adversarial, edge-case, and regulatory-sensitive scenarios to identify how it will behave outside benign conditions. For UK financial services, it covers regulatory edge cases, vulnerable customer indicators, prompt injection, and consistent failure modes at scale. The output is an evidence pack the second line of defence can use to sign off on the deployment.

How is AI agent stress testing different from AI red teaming? Generic AI red teaming, such as the OWASP Top 10 for LLM Applications, focuses on security and safety failures: prompt injection, data leakage, excessive agency. AI agent stress testing in a regulated context covers all of that plus the conduct and regulatory failures that matter for FCA compliance, including Consumer Duty outcomes, vulnerable customer handling, and the boundary between advised and non-advised journeys.

Does the FCA require AI agent stress testing? The FCA does not currently require AI agent stress testing as a named regulatory obligation, but the Bank of England and FCA have committed to introducing AI-specific stress testing following the UK Treasury Committee’s January 2026 report. In practice, firms deploying AI agents already need pre-deployment evidence under Consumer Duty and SMCR, and stress testing is the standard way to produce that evidence.

What scenarios should AI agent stress testing cover for a UK regulated firm? Four scenario categories matter most for UK financial services: regulatory edge cases (advised vs non-advised journey boundaries, Consumer Duty triggers), vulnerable customer indicators (recognition and appropriate behavioural response), prompt injection in regulated contexts (including indirect injection via documents), and consistent failure modes at scale (whether the agent fails the same way every time conditions are met).

What is the difference between AI agent stress testing and model validation under SS1/23? SS1/23, the Prudential Regulation Authority’s model risk management principles, was written for deterministic models behind credit and capital decisions. Generative AI agents are not deterministic and produce variable outputs from the same input. SS1/23 validation remains necessary but does not cover the conduct, prompt injection, and conversational failure modes that matter most for an agentic deployment.

Who is accountable for AI agent stress testing in a UK regulated firm? Under SMCR Senior Manager Conduct Rule 2, the named senior manager responsible for the business area in which the AI agent operates is accountable for the controls relied on at deployment, including the stress testing evidence. The agent does not transfer the liability. The senior manager signs off on the failure mode inventory, the remediation log, and the residual risk statement that come out of stress testing.

Can AI agent stress testing be done with general-purpose foundation models? General-purpose foundation models can be stress tested, but the test scenarios and assessor models need to be specific to UK financial services to find the failure modes that matter for regulatory compliance. Generic adversarial prompts find generic failure modes. Stress testing that produces an evidence pack a UK regulator will accept typically requires specialist financial services data and assessor models trained on UK regulatory materials.

Share with your community!

In this article

Related Articles

Join our newsletter

Be the first to hear about new features, releases, and best-practice guides.

Aveni AI Logo