The pharmaceutical industry has spent 18 months validating AI systems against historical test cases. The errors that will generate your next 483 are in territory no test case anticipated. You don't validate your QA manager. You train them, supervise their work, and review their outputs. That framework already exists, and it is what 21 CFR 211.22(c) actually requires.
Consider a hypothetical scenario. A validation manager presents a completed GAMP 5 package with the confidence of someone who has done everything right. IQ/OQ/PQ complete. 147 UAT test cases. All passed. The AI-assisted deviation root cause classification system has been put through the full qualification protocol, tested against historical deviations drawn from the site’s prior five years and covering equipment failures, environmental excursions, process parameter drifts, and personnel errors across every major product line. The validation report is signed. The quality director accepts the package, signs the release memo, and the system goes live. The compliance record is clean.
Six weeks later, the AI misclassifies a contamination event as a minor process deviation. The event involves a novel combination of gowning behaviour and environmental condition — a pattern that has never appeared together in this facility’s deviation history. No prior deviation combined those two signals. Because no prior deviation combined them, no test case did either. The AI encounters the pattern for the first time at runtime, produces a misclassification with high apparent confidence, and a QA reviewer clears it without escalation. The contamination is caught downstream, at significant cost, in a re-review triggered by an unrelated audit flag.
Here is the thought experiment: the validation team did their job perfectly. All 147 UAT cases passed because all 147 UAT cases represented patterns the AI had seen before — that was the selection criterion. The contamination event was not a failure of the validation process. It was a demonstration of what validation does not cover. And the question this raises is not whether the validation was competent. It is whether the validation was asking the right question in the first place.
This is not a story about a bad validation. It may be a story about the right tool for the wrong job. The pharmaceutical industry is applying GAMP 5, a framework designed for deterministic, testable software, to a category of system that is fundamentally probabilistic. If that mismatch has a consequence, it may be that compliance records grow thicker while governance confidence concentrates in exactly the wrong place: on territory that was already tested, rather than on the territory where the AI’s next error will come from.
What does a test library actually prove? It proves the AI handles the past correctly. The more interesting question is what the past proves about the future — and whether that is ever the question pharmaceutical quality most needs answered.
The GAMP guidance dates back to the mid-1990s; GAMP 5 itself was published in 2008 and most recently updated with its Second Edition in 2022. Its IQ/OQ/PQ qualification framework was built around a foundational assumption: that validated software is deterministic. Given the same input, a compliant LIMS returns the same output. A validated MES executes the same logic. A qualified laboratory instrument performs within the same tolerance band. This determinism is what makes test-case-based qualification meaningful: you run the test, observe the output, and document the result. Because the system behaves tomorrow the way it behaved during testing, the qualification record carries real evidential weight.
AI systems are not deterministic in this sense. They are probabilistic: their outputs are the product of learned statistical patterns across training data, not fixed logical rules. Given inputs near the edge of their training distribution, their reported confidence can remain high while their error rate rises without warning. The 2022 edition of GAMP 5 addresses AI and machine learning explicitly. What it does not fully resolve is whether the distinction between deterministic and probabilistic behaviour is a difference in degree or a difference in kind, and what follows for governance if it is the latter.
There is a philosophical tension at the centre of this that is worth sitting with rather than resolving quickly. The validation paradigm asks: does this system perform correctly? It answers that question by testing known inputs against expected outputs. This is a coherent approach for systems where the space of relevant inputs is knowable and bounded — where the past is a reliable guide to what the system will encounter.
The probabilistic nature of AI introduces a different relationship between past and future. An AI system’s errors are not distributed randomly across all possible inputs. They concentrate at the boundary of the training distribution — the inputs that resemble what the system has seen least, the scenarios it is least prepared for. In pharmaceutical quality operations, those edge cases are disproportionately the ones that matter most: novel contamination patterns, first-occurrence equipment failures, rare combinations of environmental and process conditions. The question the validation framework is designed to answer — does this system perform correctly on known cases? — may simply be a different question than the one quality governance most needs answered.
One way to think about this tension is that it is not unique to AI. It resembles the question of how any organisation governs a judgment-maker operating in a complex, changing environment. And pharmaceutical companies already have a framework for exactly that. It is just not called validation.
The problem is not that AI validation projects are poorly executed. The 147 UAT cases from the opening scenario represent competent, careful work. What is worth examining is whether the qualification methodology produces a misleading sense of risk coverage — not because the validation team made errors, but because the framework was designed for a different category of system. Consider the following four dimensions of that possible mismatch.
Each raises a different question: what the framework tests, how it ages, where it locates risk, and what compliance requirement it does and does not satisfy. The aim here is not to declare the framework wrong, but to follow each thread and see where it leads.
Consider what it would mean if a UAT library, however comprehensive, shares a fundamental limitation with the AI's training data: both were built from history. A well-constructed test library covers the range of scenarios the system is expected to encounter, and the AI's performance on those scenarios determines its qualified status. But AI systems fail primarily on scenarios outside that range — first-occurrence events, novel combinations, patterns not represented in the training data. The test cases confirm that the AI handles the past correctly. The more open question is what that confirmation implies about the future — and whether there is a point at which the answer is: less than it might appear.
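The limitation can be made concrete with a small sketch. The Python below is illustrative only: the signal names and the `history` list are hypothetical, and a real coverage analysis would be far richer. It shows that a test library drawn from history can only cover the signal combinations history contains, so a first-occurrence pairing like the gowning and environmental combination in the opening scenario falls outside it by construction.

```python
from itertools import combinations

# Hypothetical deviation history: each record is the set of signals observed.
history = [
    {"equipment_failure"},
    {"environmental_excursion"},
    {"gowning_breach"},
    {"environmental_excursion", "parameter_drift"},
    {"personnel_error", "parameter_drift"},
]


def signal_pairs(record):
    """Return every unordered pair of signals present in one record."""
    return {frozenset(pair) for pair in combinations(sorted(record), 2)}


# Pairs a history-derived test library can cover.
covered_pairs = set().union(*(signal_pairs(r) for r in history))


def is_covered(record):
    """True if every signal pair in the record has appeared together before."""
    return signal_pairs(record) <= covered_pairs


novel_event = {"gowning_breach", "environmental_excursion"}
print(is_covered({"environmental_excursion", "parameter_drift"}))  # True
print(is_covered(novel_event))  # False
```

Each signal in `novel_event` appears in the history on its own. Only the combination is new, which is why no amount of sampling from that history would have produced a test case for it.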
The GAMP 5 framework produces a validated state that reflects the system's performance at the time of validation. AI systems embedded in live quality operations continue to encounter new data, new deviation patterns, and new process conditions. One way to think about this is that a validated AI that has been in production for 12 months has accumulated 12 months of novel inputs — edge cases, first occurrences, pattern combinations — that the validation report did not address. The question this raises is not whether the validation was wrong when it was completed. It is whether a point-in-time qualification is a different kind of document for a probabilistic system than it is for a deterministic one — and whether that distinction has practical consequences for governance.
GAMP 5 identifies system risk at the level of the software category — Category 4 or Category 5 — and scales validation effort accordingly. Consider what it would mean if AI risk is not located at the software category level at all, but at the input level: concentrated in the inputs the system is least prepared to handle. If that is true, then the GAMP 5 framework puts the analytical focus on the system, while the actual risk lives in the interaction between the system and the inputs it encounters. That is a dimension the framework has no mechanism to monitor. Whether this is a gap that matters depends on how often a pharmaceutical quality system encounters the kind of novel inputs where edge-case errors concentrate — and in a manufacturing environment that changes continuously, that may be more often than the validation paradigm assumes.
A completed GAMP 5 validation package satisfies the industry expectation that AI systems deployed in GMP activities be validated. Here is what is worth examining: 21 CFR 211.22(c) requires the quality unit to review and approve the outputs those systems produce. A validated AI whose outputs are not meaningfully reviewed — where the QU representative reads the conclusion without assessing the reasoning — may satisfy the validation expectation without satisfying 211.22(c). A non-validated AI whose outputs are thoroughly reviewed — with full access to the AI's reasoning, and a documented basis for QU approval — may satisfy 211.22(c) regardless of its validation status. The two cases look identical from the outside. The question this raises is whether governance and qualification are the same thing, or whether the industry has been treating them as equivalent when they are not.
When a pharmaceutical company hires a new QA analyst, it does not validate her. There is no test library of quality scenarios, no IQ/OQ/PQ, no validation report issued before she is permitted to review batch records. What happens instead is recognisably different and recognisably effective. She is given the site’s SOPs and deviation history to study. She is paired with a senior reviewer for her first months. Her assessments are checked before they govern anything. Her error patterns are noted and addressed. She is given more independence as her record demonstrates she has earned it. The governance model scales with demonstrated performance on real work, not with test case results. Consider what this suggests about the nature of governing probabilistic judgment-making, and whether the absence of a validation requirement reflects a gap in the framework or something important about what validation can and cannot cover.
Confidence is established through UAT against historical test cases. The system is qualified if the test cases pass, and the qualification report is the primary evidence of governance readiness. Once the validation report is signed, the confidence it establishes is essentially fixed — it reflects what the system handled during testing, not what it has encountered since. One way to think about this is that governance confidence and production performance become different things from the moment the system goes live, with the qualification record unable to track the divergence.
Validation paradigm: point-in-time, established at validation sign-off
Confidence is established through supervised deployment starting at go-live. Every AI output is reviewed before it affects anything. Error patterns are tracked from the first day of production. Autonomy expands as the performance record demonstrates it is warranted — in the areas and scenario types where the record is strong. The interesting feature of this approach is that confidence is never higher than the evidence that supports it, and it is always evidence about actual production behaviour rather than historical test cases.
Management paradigm: continuous, built from the performance record and updated in real time
Errors on in-distribution inputs are caught during UAT and corrected before go-live. Errors on out-of-distribution inputs surface at runtime, potentially after the AI's output has affected a quality decision. The qualification framework has no mechanism to catch novel failure modes — by definition, novel failure modes are outside the test library. The reviewer has no signal that the system is operating outside its competence when it is, because the framework does not create one. Whether this matters depends on how often pharmaceutical quality operations encounter genuinely novel inputs — which may be the most important empirical question in AI governance.
Validation paradigm: at runtime, after quality decision impact, for novel scenarios
Every output is reviewed before it governs anything during supervised deployment — the same oversight that prevents a new hire from making an ungoverned quality decision in their first weeks. The AI is configured to surface uncertainty on inputs that fall outside its experience, triggering heightened reviewer scrutiny rather than cleared-through processing. The idea here is that the governance obligation intensifies precisely where the AI is least certain — which is a different distribution of review effort than reviewing a fixed percentage of outputs uniformly.
Management paradigm: before quality decision impact, with intensified review for flagged uncertainty
The compliance record consists of an IQ/OQ/PQ package, UAT results, and a signed validation report — evidence that the system performed correctly on a defined set of test cases at the time of validation. The record is static: it shows the system's qualified state at sign-off and does not reflect what has happened since. An inspector reviewing it sees that qualification occurred. Whether the inspector can see that governance is currently active — that the QU representative is exercising judgment on each output, not clearing through conclusions — is a different question.
Validation paradigm: static, reflecting the system state at validation, not its production performance
The compliance record consists of review records showing QU assessment of each AI output, reasoning records showing what the AI considered and where it was uncertain, and performance tracking showing error patterns over time. The record is dynamic — it grows with each production output and reflects active, continuous governance. One way to think about this is that the record answers a different question than the qualification document: not whether the system was once proven, but whether someone is exercising judgment on it right now.
Management paradigm: dynamic, reflecting active governance and growing with each production output
Novel scenarios outside the test library produce outputs with the same confidence profile as scenarios the system handles well. The qualification framework does not distinguish between a case within the AI's competence and a case outside it — both produce an output, and both outputs enter the QU review queue without a signal that one requires more scrutiny than the other. Consider what this means for the reviewer: she has no mechanism to know the system is operating outside its experience, because the framework never created one.
Validation paradigm: no differentiation; novel and familiar cases enter the review queue identically
The AI is configured to flag uncertainty on inputs that differ significantly from its training distribution — novel pattern combinations, first-occurrence scenarios, inputs that fall outside the process context it was oriented to. The idea is that the governance obligation should intensify for the cases where the AI is least certain, rather than remaining constant across all outputs regardless of the AI's confidence in its own competence. Whether this is achievable in practice — whether AI systems can reliably signal their own uncertainty — is a genuinely open technical question.
Management paradigm: differentiated; novel scenarios trigger heightened review and uncertainty is surfaced
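One simple way to approximate the uncertainty flag described above is a distance check against previously seen inputs. The sketch below is an assumption, not a recommended method: it treats each deviation as a numeric feature vector (the features and values are hypothetical) and routes any input that sits unusually far from everything in the reference set to heightened review. It does not settle the open question of whether a model can reliably signal its own uncertainty; it only shows what a differentiated review queue could look like.

```python
import math


def nearest_distance(x, reference):
    """Euclidean distance from x to its closest point in the reference set."""
    return min(math.dist(x, r) for r in reference)


def novelty_threshold(reference, quantile=0.95):
    """Nearest-neighbour distance that most reference points fall within."""
    dists = sorted(
        nearest_distance(r, reference[:i] + reference[i + 1:])
        for i, r in enumerate(reference)
    )
    index = min(len(dists) - 1, int(quantile * len(dists)))
    return dists[index]


def route(x, reference, threshold):
    """Send inputs far from anything seen before to heightened review."""
    if nearest_distance(x, reference) > threshold:
        return "heightened_review"
    return "standard_review"


# Hypothetical two-feature encodings of historical deviations.
reference = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.1)]
threshold = novelty_threshold(reference)

print(route((0.05, 0.05), reference, threshold))  # standard_review
print(route((3.0, 3.0), reference, threshold))    # heightened_review
```

The design choice worth noting is that the flag depends on the input, not on the model's own confidence score, so it can fire even when the model reports high confidence.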
The new hire analogy is worth taking seriously not as a rhetorical flourish but as a structural guide. If the analogy holds, it maps onto a concrete governance architecture — and one way to think about each element is to ask what its equivalent is in how pharmaceutical companies govern new QA analysts. Each of the following reflects an existing GxP principle, not a new requirement invented for AI. The question is whether those principles have been applied to the right category of system.
Before deployment, the AI is oriented to the specific process environment it will operate in — the site's deviation taxonomy, product portfolio, historical quality events, and the specific failure modes that have mattered here. The idea is not to perform a qualification exercise but to give the system the context it needs to be useful, the way you give a new QA analyst the site SOPs and five years of deviation history. One way to think about this capability is that it establishes a known starting point — one whose contextual gaps can be actively monitored rather than assumed away by a test library.
Initial deployment operates with full QU review of every AI output — the same oversight applied to a new hire in her first weeks on the floor, before she signs anything independently. The review is not a checkbox. The QU representative sees what the AI considered, assesses whether the reasoning was sound given the process context, and signs with a documented basis for her conclusion. The idea here is that autonomy expands as the performance record supports it: in areas where errors have been rare and reasoning has been consistently sound, the review requirement can become periodic rather than universal. Where the record is mixed, full review continues. The governance model responds to the demonstrated record rather than to a qualification event.
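The expansion rule can be sketched as a small decision function. Everything numeric here is a hypothetical placeholder: the minimum record length and the tolerated error rate would be set by the quality unit, not by this example. The sketch only shows the shape of the logic, in which review stays universal until the record for a scenario type is both long enough and clean enough.

```python
from dataclasses import dataclass


@dataclass
class ScenarioRecord:
    reviewed: int = 0  # outputs the quality unit has fully reviewed
    errors: int = 0    # outputs the reviewer had to correct


def review_mode(record, min_reviewed=200, max_error_rate=0.01):
    """Return 'full' or 'periodic' review for one scenario type.

    Thresholds are illustrative assumptions, not regulatory values.
    """
    if record.reviewed < min_reviewed:
        return "full"  # not enough evidence yet
    if record.errors / record.reviewed > max_error_rate:
        return "full"  # record is mixed, so full review continues
    return "periodic"


print(review_mode(ScenarioRecord(reviewed=50, errors=0)))    # full
print(review_mode(ScenarioRecord(reviewed=500, errors=2)))   # periodic
print(review_mode(ScenarioRecord(reviewed=500, errors=20)))  # full
```

Because the function is evaluated per scenario type, autonomy can expand for well-understood deviation classes while remaining fully supervised elsewhere.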
AI error patterns are tracked from the first day of deployment — which scenario types produced correct outputs, which produced errors, where the system's confidence was high when it should have been low. The AI's operational scope is adjusted in response: expanded where performance is demonstrated, constrained where patterns of error are identified. One way to think about this capability is that it makes the governance model dynamic in the same way a QA manager's response to a new hire's error pattern is dynamic — identifying the knowledge gap and adjusting accordingly, rather than treating the initial qualification as permanent.
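A minimal sketch of such a performance log follows. The class name, scenario labels, and thresholds are all assumptions made for illustration. It tallies reviewer disagreement per scenario type, counts separately the errors made at high stated confidence, and lists the scenario types whose error rate suggests the AI's scope should be constrained.

```python
from collections import defaultdict


class PerformanceLog:
    """Per-scenario tally of AI outputs and reviewer corrections."""

    def __init__(self):
        self.counts = defaultdict(
            lambda: {"total": 0, "errors": 0, "confident_errors": 0}
        )

    def record(self, scenario, ai_confidence, reviewer_agreed, high_conf=0.9):
        row = self.counts[scenario]
        row["total"] += 1
        if not reviewer_agreed:
            row["errors"] += 1
            # Confidence was high when it should have been low.
            if ai_confidence >= high_conf:
                row["confident_errors"] += 1

    def scenarios_to_constrain(self, max_error_rate=0.05):
        """Scenario types whose observed error rate exceeds the limit."""
        return sorted(
            name
            for name, row in self.counts.items()
            if row["errors"] / row["total"] > max_error_rate
        )


log = PerformanceLog()
for _ in range(39):
    log.record("equipment_failure", 0.8, reviewer_agreed=True)
log.record("equipment_failure", 0.6, reviewer_agreed=False)
log.record("gowning_environmental", 0.97, reviewer_agreed=False)
log.record("gowning_environmental", 0.7, reviewer_agreed=True)

print(log.scenarios_to_constrain())  # ['gowning_environmental']
```

The `confident_errors` count is kept apart from the overall error count because confidently wrong outputs are the ones a reviewer is most likely to clear without escalation.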
Every AI output is accompanied by a retrievable reasoning record — what the AI consulted, which prior cases it compared against, where it registered uncertainty, and what it concluded. The QU representative reviews both the output and the reasoning before signing. Their approval attests to their assessment of the AI's reasoning, not only their reading of its conclusion. The principle here is that 21 CFR 211.22(c) is satisfied by the governance structure — by an act of documented human judgment — not by the validation package that preceded deployment. The human is accountable. The system is auditable. One way to think about this is that the distinction matters precisely in the scenarios where the AI's reasoning is least reliable.
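As a rough illustration, the reasoning record and the sign-off can be modelled as two immutable data structures. The field names below are hypothetical and simply mirror the elements listed above. The one rule enforced is that an approval cannot be created without a documented basis, which is the difference between an attested judgment and a click-through.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ReasoningRecord:
    output: str                  # what the AI concluded
    sources_consulted: tuple     # what the AI consulted
    prior_cases_compared: tuple  # which prior cases it compared against
    uncertainty_notes: tuple     # where it registered uncertainty


@dataclass(frozen=True)
class QUApproval:
    record: ReasoningRecord
    reviewer: str
    basis: str                   # reviewer's documented basis for signing
    signed_at: datetime


def approve(record, reviewer, basis):
    """Create an approval only if the reviewer documents a basis."""
    if not basis.strip():
        raise ValueError("approval requires a documented basis")
    return QUApproval(record, reviewer, basis.strip(), datetime.now(timezone.utc))


record = ReasoningRecord(
    output="minor process deviation",
    sources_consulted=("SOP-hypothetical-001",),
    prior_cases_compared=("DEV-hypothetical-17",),
    uncertainty_notes=("no prior case combines these two signals",),
)
approval = approve(record, "qu.reviewer", "Reasoning checked against batch record.")
print(approval.reviewer, approval.basis)
```

Freezing both dataclasses keeps the record and the approval from being altered after signing, in keeping with the auditability the paragraph describes.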
The new hire analogy is imperfect, as all analogies are. But the imperfections are interesting too — in what ways is an AI system like a new hire, and in what ways is it not? A new hire has intentions, learns from feedback in a continuous and embodied way, and develops judgment that generalises in ways that are hard to specify in advance. An AI system does none of those things in quite the same way. The governance gaps may live exactly at the points where the analogy breaks down.
The industry’s instinct to validate AI reflects something genuine — a recognition that AI is consequential enough to require governance. The question worth sitting with is whether the governance framework inherited from a different category of system is the one that fits. That is not an easy question, and this piece does not pretend to resolve it. It is, though, a question worth asking before the validation report is signed.
The validation paradigm asks “does this system perform correctly?” The management paradigm asks “does this system perform correctly right now, and how would we know if it stopped?” Which question is more useful for probabilistic systems operating in a changing environment is not obvious. But it is worth knowing which question your governance framework is designed to answer.