← Back to blog

The Warrant Is the Bug: Every "The Data Shows X, So Do Y" Hides an Unstated Bridge

In 1999 a courtroom convicted a mother on a number that hid one unstated premise. Every "the data shows X, so do Y" you make this week hides the same kind of bridge, and the bridge is where the bug lives.

Published June 2026 · 11 min read

In November 1999, at Chester Crown Court, a jury heard a number so large it ended the argument. Sally Clark, a solicitor, had lost two infant sons, Christopher in 1996, Harry in 1998, and stood accused of murdering them. The prosecution's expert witness, the eminent paediatrician Sir Roy Meadow, told the court that the chance of two babies dying of sudden infant death syndrome in a family like hers was one in seventy-three million. One cot death in an affluent, non-smoking household, he explained, occurs about once in 8,543 births. Two, therefore: 8,543 times 8,543. Roughly one in seventy-three million. Rarer than winning the lottery. The jury convicted.

Look closely at what Meadow did with that multiplication, because it is the subject of this essay. The 1-in-8,543 figure was data, drawn from a real epidemiological study, and nobody seriously disputed it. The claim, these deaths were not natural, was for the jury. But the step in between, the squaring, smuggled in a premise that was never spoken aloud in the courtroom: that two SIDS deaths in one family are independent events, like two spins of a roulette wheel. You may only multiply probabilities like that when the events have nothing to do with each other. Are two unexplained infant deaths in the same family, same genes, same home, same environment, unrelated? That is not a mathematical question. It is a biological one, and the answer turned out to be no. But it was never asked at trial, because it was never visible at trial. The number stood up, the assumption stayed seated.

The data was fine. The logic, multiply independent probabilities, was textbook. What failed was the bridge between them. And the terrible, useful lesson of the case is that nobody examined the bridge because nobody noticed they were standing on it.

The part of the argument that doesn't appear in the argument

Forty-one years before the trial, the philosopher Stephen Toulmin had built a model of exactly this failure. The Uses of Argument (Cambridge University Press, 1958) was Toulmin's rebellion against formal logic, which he considered a poor description of how people actually argue, in courtrooms, notably, which is where he drew his examples. Real arguments, he observed, have a structure that the tidy syllogism hides. You assert a claim. You offer data in support. And connecting them, almost always unstated, is the warrant: the general principle that makes this data relevant to that claim, in his metaphor, the bridge that lets you cross from one to the other.

His example has become the standard one. "Harry was born in Bermuda, so Harry is a British subject." Data, then claim. The warrant, a man born in Bermuda will generally be a British subject, goes without saying. That's the point: warrants usually go without saying. They are the parts of the argument that don't appear in the argument.

Toulmin's deepest observation is the one engineers need: an argument is only as strong as its weakest warrant, and in genuine disputes, the disagreement is almost never about the data. Two teams stare at the same dashboard and reach opposite conclusions; they don't disagree about the telemetry, they disagree about what the telemetry licenses, and since neither has said its warrant out loud, the dispute presents as one side being dense. We have written before about Toulmin's model as a tool for feature proposals, the warrant as the load-bearing member of "we should build X." This essay is about the other place the model bites, the place with dashboards in it: arguments of the form the data shows X, so do Y.

Persuasive because it's missing

There is an older name for an argument with a missing premise, and it comes with a warning label. Aristotle called it the enthymeme, in the Rhetoric, "the body of proof" itself: a syllogism in which a premise is left unstated because the audience already accepts it. The speaker doesn't suppress the premise to deceive. He omits it because stating it would be redundant; the audience supplies it from the stock of things everyone believes.

Now notice what that implies. The listener completes the argument with a premise drawn from her own head, and we do not cross-examine what comes out of our own heads. The enthymeme is not persuasive despite the gap. It is persuasive because of it. An argument whose every premise is written down can be attacked at every premise; an argument with a hole in it gets the hole filled by the audience, instantly, invisibly, charitably. The most convincing arguments you will hear this quarter are the ones whose weakest link you never see, because you are the one who supplied it.

"Latency dropped 20%, so users are happier." Read it slowly and find the word that isn't there: because. The sentence works only if latency drives satisfaction, a premise you brought with you. Meadow's seventy-three million was an enthymeme. The jury supplied the independence of the deaths, free of charge, and convicted a grieving mother on the strength of a premise nobody uttered.

Three bridges that aren't there

Data-driven engineering culture has audited the wrong things with admirable rigor. We check the data: is it clean, is it significant, is the sample big enough? We check the logic: does the math follow? The warrant, the bridge, gets walked on, not looked at. Here are three crossings most of us make weekly.

"Coverage is 95%, so the code is well-tested." The warrant: covered means tested. It is false, and provably so. Code coverage measures whether a line was executed under test, not whether anything about its behavior was asserted. The degenerate case is perfectly legal: a test suite that calls every function and asserts nothing achieves 100% coverage with a 0% mutation score, mutation testing being the discipline that actually checks the warrant, by seeding deliberate bugs and counting how many your suite notices. Coverage data is real data. It just supports a much weaker claim than the one we lean on it for: it can tell you what you definitely haven't tested, and almost nothing about what you have.

"The A/B test won, so ship it." Two warrants under this one. First, the test population represents production, threatened any time your experiment over-samples heavy users, early adopters, or whoever happened to be awake during the test window. Second, and nastier: the measured effect is durable and causal. The experimentation literature has a name for the way this one fails, the novelty effect: users engage with a thing because it is new, not because it is better, producing a lift that is entirely real, entirely significant, and entirely temporary. The test did everything a test can do. It measured a true effect. The warrant, that the effect belongs to the feature rather than to its newness, is exactly the part the p-value cannot see.

"Latency dropped 20%, so users are happier." The warrant: satisfaction varies with latency, smoothly, the way the graph in your head looks. The perception research says otherwise: the relationship is threshold-shaped, flat across wide ranges of delay, then changing sharply when a perceptual boundary is crossed. Cut latency from 800ms to 640ms and you may have bought precisely nothing a human can feel; the engineering was real, the dashboard moved, and the claim it was marshaled to support never followed. (The mirror-image error is worse: a "negligible" 100ms regression that happens to cross the threshold.)

Stack these next to the courtroom and the family resemblance is exact. Real data, sound arithmetic, and a silent premise doing all the work: covered=tested, test=production, latency=happiness, deaths=independent. In every case the weak link was the one component of the argument that was never written down anywhere, not in the ticket, not in the experiment doc, not in the indictment.

One question

The fix is not more data. The fix is a question, and it is the single highest-impact sentence I know for a review meeting:

What would have to be true for this data to support this claim?

That's the whole discipline. The question mechanically surfaces the warrant, you cannot answer it without stating the bridge out loud, and a stated warrant is a checkable warrant. Then comes the redirect that feels wrong the first dozen times: go verify that, not the data. Your instinct under challenge is to re-pull the numbers, tighten the confidence interval, re-run the query. Resist it. The data was rarely the weak link. Spend the skepticism on the premise you got for free.

This is precisely what eventually happened to Meadow's number. In October 2001, the Royal Statistical Society did, in public, what no one had done at trial: it stated the warrant and examined it. There was "no statistical basis" for the 73 million figure, the Society wrote, because the independence assumption was not merely unproven but likely false, there are "very strong reasons" to suppose genetic and environmental factors predispose some families to SIDS, which means a second death in an affected family is far more likely than the first. The squaring wasn't conservative arithmetic; it was the error itself. In January 2003 the Court of Appeal quashed Sally Clark's conviction; the immediate trigger was the discovery that the prosecution's own pathologist had withheld microbiology results pointing to a natural death for Harry, with the statistical evidence condemned alongside it. Note what the statisticians did not do: they did not produce better data. They read the bridge. Note also who had built it: a distinguished paediatrician, testifying outside his field, expert data, borrowed authority, unexamined warrant, the full kit.

Operationally, the question wants to live in three places. In experiment writeups, as a required field: this result supports this decision only if ___, you will be startled how often the blank refuses to fill convincingly. In design reviews, as the reviewer's opening move: attack the warrant first, because the author has already stress-tested the data and has usually never once looked at the bridge. And in your own postmortems of decisions that went sideways on "good data", hunt for the premise you supplied without noticing, because the enthymemes that fool you are by definition the ones built from your own beliefs.

One bug, filed under many names

Here is the part that reframed it for me. Run back through the canon of famous data pathologies and notice that every one of them is a named warrant failure, the same bug, filed under different CVEs.

Goodhart's law, in Marilyn Strathern's formulation, "when a measure becomes a target, it ceases to be a good measure", is the warrant the metric captures the goal, failing at the precise moment optimization pressure arrives. The ordinal-scale error, averaging ratings, adding ranks, is the warrant these intervals are meaningful, applied to numbers that only ever promised order. Correlation-to-causation is a warrant so notorious we teach it to undergraduates, then deploy it anyway every time a metric moves after a launch. Trend extrapolation is the warrant the generating process continues, flat until the regime shifts, then catastrophically wrong. Even the subtle ones fit: trusting a model because its outputs agree with each other is the warrant coherence indicates correspondence, the assumption that a system in agreement with itself is in agreement with the world.

"The data shows X, so do Y"The unstated warrant
Coverage is 95%, so it's well-testedcovered = tested (0% mutation score says no)
The A/B test won, so ship itthe effect is durable and causal (novelty effect)
Latency dropped, so users are happiersatisfaction varies smoothly with latency (it's threshold-shaped)
The metric improved, so we hit the goalthe metric captures the goal (Goodhart)
Two cot deaths, so murder (1 in 73 million)the deaths are independent events (they are not)

You can keep going down the stack. The deepest warrants aren't even in the inference, they're in the categories. By the time a number reaches your dashboard, someone has already decided what counts as an "active user," a "resolved incident," an "unexplained death," and each of those definitions is a little frozen argument with its own unexamined bridge. James C. Scott built a whole theory of state failure on this: the map's categories silently warrant the conclusion drawn from the map, and the territory declines comment.

This is why the question generalizes so well. You do not need to memorize a bestiary of fallacies. Goodhart, ordinal arithmetic, novelty effects, survivorship, extrapolation: what would have to be true for this data to support this claim? surfaces each of them, because each of them lives in the same place: the gap between what was measured and what was concluded, papered over by a premise the audience supplied.

Bridges are engineered to be unnoticed. That is what a good bridge is, infrastructure you cross without thinking about it, attention free to stay on the destination. The warrant earns the same invisibility right up until the day it doesn't hold, and then everything on it goes down at once: a feature bet, a quarter's roadmap, or eight years of a woman's life spent in prison and appeal because nobody in a courtroom asked what would have to be true for an actuarial table to prove a murder. The data shows X, so do Y. Before you do Y, walk out onto the sentence and check what it's resting on.


Sources: Toulmin, "The Uses of Argument" (Cambridge University Press, 1958); Aristotle, "Rhetoric" I.1 on the enthymeme; R v Clark, conviction November 1999, Royal Statistical Society public statement October 2001, conviction quashed January 2003; mutation testing vs. coverage (see Codecov's "ensure coverage isn't a vanity metric" and the standard mutation-testing literature); novelty effects and external validity in online experiments (Analytics Toolkit); threshold models of latency perception (user-perception studies); Strathern, "'Improving Ratings'" (European Review 5:3, 1997); Scott, "Seeing Like a State" (Yale University Press, 1998).

The data and the claim get written down. The warrant is the part that doesn't.

An agent that emits "the data shows X, so I did Y" is making exactly this argument, and the warrant, the why between the evidence and the action, is the part that usually goes unrecorded. Chain of Consciousness records it: a tamper-evident trace of an agent's reasoning, not just its outputs, so the bridge between what it saw and what it did is stated out loud and checkable, instead of supplied later by whoever reads the result.

pip install chain-of-consciousness · npm install chain-of-consciousness
Hosted Chain of Consciousness → · See it in action