2026-05-19 · 8 min read · Strategy · Adoption · Risk

Where Pilots Die: The Three Reasons AI Deployments Stall Before Production

Most AI pilots in professional firms produce a demo, a report, and a quiet ending. They do not stall because the technology failed. They stall because the engagement was structured for a demo, not for a system. Three failure modes account for almost all of them.

The shape of a stalled pilot

Every senior practitioner reading this has seen the pattern at least once. A respected vendor or internal team is brought in for a three- to six-month pilot. There is a kickoff. There are weekly check-ins. There is, at month two, a confident demo — something is summarised, something is drafted, the partners present nod, somebody says “interesting.” A final report is delivered. Then, gradually and without anybody calling it a failure, the pilot stops being mentioned. Six months later the system is no longer running. A year later it is unclear who even had the login.

We have seen this in Korean law firms, in policy institutes, and in university laboratories that received AI grants. The technology in each case was not the bottleneck. The same underlying models are running production systems elsewhere. What failed was the shape of the engagement — how the pilot was specified, what it was attached to inside the firm, and what was supposed to survive its end.

Three failure modes account for almost every stalled pilot we have looked at carefully. They are independent of one another, but they often occur together. They are worth naming because they are hard to see from inside an engagement that is suffering from them.

Failure mode #1 — No domain specification, only a use case

The most common failure starts at the very beginning. The pilot is framed by a use case — “contract review,” “literature synthesis,” “evaluation report drafting” — and the team begins building against that label. What is missing is a specification of the field in which that use case actually lives: what an acceptable contract review looks like in this firm, which clauses are evaluated against which precedents, which kinds of hedging the senior partner expects to see and which they expect to see absent.

Without that specification — the kind of document we call a domain axiom set — the pilot produces outputs that look structurally reasonable and read as wrong to anyone with authority. The senior partner glances at the demo output and says, in some politer phrasing, “I am not sure this is right.” That is the sentence that kills the pilot. Not a failed test. A failed reading by the person whose endorsement the project depended on. The trust never forms because the ground on which trust would have been built — an explicit standard of correctness — was never written down.

The remedy is unglamorous. The first two to three weeks of a pilot should be spent not on code but on extracting the implicit standards of the field from the senior practitioners and writing them down. The deliverable of that phase is a short document that the senior practitioner signs. Pilots that skip this phase are running on borrowed time from week one.

Failure mode #2 — No knowledge base, only a model

The second failure is architectural. The pilot is wired up to a general model through an API, with prompts that gesture at the firm's practice but with no retrieval against the firm's actual corpus. The outputs are coherent. They sound like confident professional writing. They do not sound like the firm.

The first time a senior reads such an output, the reaction is mild surprise that the machine can write that well. The third time, the reaction is irritation. The fifth time, the reaction is to stop opening the demo. The pattern that surfaces in review after review is the sentence “we do not approach it this way.” That sentence is fatal to a pilot, and it is the inevitable consequence of running on a generic substrate. The model has nothing to draw on except the average of the internet's professional writing, and the firm is not the internet's average.

Even a small private corpus (twenty or thirty real cases, de-identified, embedded, and retrieved against) changes this behaviour completely. The retrieved fragments anchor the output in the firm's actual practice. The model is doing the same generation it was doing before, but the context it is generating from has shifted from the average of the internet's professional writing to the firm's own material. A pilot that cannot afford this shift cannot afford to be a pilot.
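To make the mechanism concrete, here is a minimal sketch of retrieval against a small corpus. It is an illustration under stated assumptions, not a description of any particular deployment: the three corpus entries are invented, the names `retrieve` and `build_prompt` are hypothetical, and a plain word-count cosine similarity stands in for the learned embeddings a real system would use.

```python
import math
import re
from collections import Counter

# Hypothetical, already de-identified corpus entries (invented for illustration).
CORPUS = {
    "case-001": "Indemnity clause capped at contract value; partner rejected uncapped liability.",
    "case-002": "Termination for convenience requires ninety days notice in supply agreements.",
    "case-003": "Governing law clause set to Korean law with Seoul arbitration.",
}

def vectorize(text):
    """Bag-of-words vector; a stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, corpus, k=2):
    """Return up to k document ids, most similar first, skipping zero-score matches."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(doc)), doc_id) for doc_id, doc in corpus.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]

def build_prompt(query, corpus, k=2):
    """Prepend the retrieved fragments so generation is anchored in the firm's material."""
    ids = retrieve(query, corpus, k)
    context = "\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id in ids)
    return f"Firm precedents:\n{context}\n\nTask: {query}"

if __name__ == "__main__":
    print(build_prompt("review the indemnity clause liability cap", CORPUS))
```

The model call itself is left out on purpose. The point of the sketch is only that the prompt reaching the model now carries the firm's own fragments, which is the shift described above.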

Failure mode #3 — No owner inside the firm

The third failure is the quietest one. The external team builds something competent. The demo is good. The senior watches the apprenticeship phase attentively. And then the external team departs, and there is nobody inside the firm whose job it is to decide which outputs pass review, which new axiom gets written, which document is added to the corpus this month and which is held back.

The system does not stop working when this happens. It begins to drift. A new edge case appears and nobody adds a constraint. An old workflow shifts and the system continues to support the old one. After three months of unattended drift, the gap between what the system does and what the firm needs has widened enough that the next senior to look at the output says, again, “I am not sure this is right.” The system is then quietly retired. Nobody is to blame because nobody owned it.

The owner does not have to be senior in the corporate sense. They do have to be senior in the domain sense — a partner, a principal investigator, a research director — or a trusted lieutenant of one. They do not write code. They make the standing decisions that keep the axiom document and the corpus current. The integrator continues to operate the infrastructure; the owner operates the judgement layer. This division is the arrangement we described in the previous piece, and it is what allows the system to outlive the engagement that produced it.

What a pilot designed to survive looks like

Read negatively, the three failure modes describe what a pilot needs to avoid. Read positively, they describe what a pilot needs to be.

A pilot designed to survive has, at its start, a short signed axiom document. The senior practitioner has spent several sessions in the room while it was written and they have put their name to it. The document is not a wishlist; it is the explicit standard against which outputs will be judged.

It has, by week four or five, a real corpus connected. The corpus may be small — twenty cases, fifty papers, two years of evaluation reports — but it is the firm's own material, de-identified at the gate and retrieved against on every query. The outputs draw from it. They begin to sound like the firm well before the pilot ends.

It has, by month two, a named internal owner. That person has watched the integrator make domain decisions on real material for several weeks. They know which signals indicate a clean output and which indicate a structural failure. They are already proposing additions to the axiom document. By the time the external team steps back, the judgement layer of the system is already in their hands.

And it has, finally, a definition of success that is not a demo. The end-of-pilot evaluation is not whether the system performed well at a single set-piece presentation. It is whether, after six months of operation, the senior is spending measurably less time on the routine portion of their review work, and whether the corpus and axiom document have grown in ways that reflect real use. Pilots evaluated on those terms either survive or yield an honest negative answer. Pilots evaluated on the demo invariably yield the slow disappearance described at the start of this piece.
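One way to keep that definition honest is to write it down as a check that can be run at month six. The sketch below is only an illustration: the `Snapshot` fields, the `evaluate_pilot` name, and the 10% time-saving threshold are all assumptions chosen for the example, and a real firm would set its own measures with the senior practitioner.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """State of the pilot at one point in time (hypothetical fields)."""
    routine_review_hours_per_week: float  # senior's time on routine review
    corpus_documents: int                 # documents in the retrieval corpus
    axioms: int                           # entries in the signed axiom document

def evaluate_pilot(start, later, min_time_saving=0.10):
    """Compare two snapshots; min_time_saving is an assumed threshold, not a standard."""
    if start.routine_review_hours_per_week <= 0:
        raise ValueError("baseline review hours must be positive")
    time_saving = 1 - later.routine_review_hours_per_week / start.routine_review_hours_per_week
    corpus_grew = later.corpus_documents > start.corpus_documents
    axioms_grew = later.axioms > start.axioms
    return {
        "time_saving": time_saving,
        "corpus_grew": corpus_grew,
        "axioms_grew": axioms_grew,
        # The pilot passes only if review time fell and both artefacts show real use.
        "survives": time_saving >= min_time_saving and corpus_grew and axioms_grew,
    }

if __name__ == "__main__":
    baseline = Snapshot(routine_review_hours_per_week=10.0, corpus_documents=20, axioms=12)
    month_six = Snapshot(routine_review_hours_per_week=8.0, corpus_documents=35, axioms=15)
    print(evaluate_pilot(baseline, month_six))
```

A result of `survives: False` is still a useful outcome under this framing: it is the honest negative answer, reached on evidence rather than on the memory of a demo.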

The pilots that survive in our field, across Korean firms and laboratories and elsewhere, share these four traits with a consistency that is almost dull. A signed specification. A real corpus. An internal owner. A definition of success that is not a presentation. The pilots that die share their absence. That is the pattern, stated as plainly as we can state it.
