What 95% Really Means

Published March 9, 2026

In my last post on selection bias (https://gforchini.github.io/posts/selection_bias/) I argued that who gets into your dataset matters more than how big the dataset is. If chance did not choose them, your statistics do not mean what you think they mean. Today I want to make the same point for confidence intervals. The phrase “95% confidence” is not a property of your one neatly printed interval; it’s a property of a procedure under a mechanism. Without the right mechanism, the same “chance over convenience” theme, the “95%” is just ink on the page.

Here is the core statement, as plainly as I can put it. A 95% confidence interval does not say there is a 95% chance that the true value lies in the interval you have calculated. It says that if you repeat your data‑generating mechanism (the same sampling design in a survey, or the same randomisation protocol in a trial) and rebuild the interval each time with the same recipe, then about 95 out of 100 of those intervals would cover the true value. The 95% belongs to the procedure, not to this interval.
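
To make that loop concrete, here is a minimal simulation sketch in Python (numpy only; the lognormal wage population and every number in it are invented for illustration). We redraw the sample by the same rule many times, rebuild the interval with the same recipe each time, and count how often the intervals cover the truth.

```python
import numpy as np

rng = np.random.default_rng(0)

true_mean = 30.0                   # invented population mean wage
n, n_repeats, z = 200, 10_000, 1.96

covered = 0
for _ in range(n_repeats):
    # The "mechanism": a fresh sample drawn by the same rule every time.
    sample = rng.lognormal(np.log(true_mean) - 0.125, 0.5, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - z * se, sample.mean() + z * se
    covered += (lo <= true_mean <= hi)

print(f"coverage: {covered / n_repeats:.3f}")  # close to 0.95
```

The printed coverage lands near 0.95 (a touch below, since wages are skewed and the recipe is the crude normal one). The number describes the loop, not any single interval inside it.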

Everything turns on the mechanism. In a labour‑force survey, the mechanism is the probability design: a list of eligible households, random selection with known chances, callbacks, weights, and a way to account for nonresponse. If you could redraw samples by that same rule, the usual 95% coverage statement would apply. In a randomised trial, the mechanism is the pre‑registered assignment rule: a lottery that sends some applicants to treatment and others to control. If you could re‑run the lottery, the coverage statement has the same interpretation, now across hypothetical assignments rather than samples.
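
As a sketch of what “known chances” buy you, suppose, purely for illustration, that a design deliberately oversamples above‑median households but records the inclusion probability it used. Weighting each observation by the inverse of that probability undoes the unequal chances; all numbers below are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented population of household wages.
population = rng.lognormal(3.4, 0.5, size=50_000)

# Known, unequal inclusion probabilities: above-median households are
# oversampled, but by a chance we chose and recorded in the design.
pi = np.where(population > np.median(population), 0.006, 0.002)
selected = rng.random(population.size) < pi

sample, weights = population[selected], 1.0 / pi[selected]

# Weighting by inverse inclusion probability undoes the unequal chances.
estimate = np.average(sample, weights=weights)
print(f"population mean {population.mean():.2f} vs weighted estimate {estimate:.2f}")
```

The raw sample mean would be biased upward here; the weighted estimate is not, and we can say so only because the chances were known by design rather than guessed after the fact.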

Now link this back to the résumé story from the previous post. Suppose again that one analyst scrapes thousands of online résumés and another runs a probability survey. Both compute “95% confidence intervals” for the average wage. Superficially, the intervals might look similar. But only the survey interval has a clean 95% interpretation, because only there can you tell a coherent story about the samples you could have drawn but did not. In the résumé scrape there is no defensible mechanism: some groups had no path into your file; others crowded in for reasons that also move wages. You can widen or narrow bands by fiddling with formulas, but you cannot make “95%” mean “covers 95% of the time” when chance never decided who entered the dataset. This is selection bias meeting confidence interval language.
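
You can watch the failure happen in the same simulation framework. Below, inclusion depends on the wage itself, a crude stand‑in for self‑selection (all numbers invented). The interval recipe is identical to the survey case, yet the “95%” evaporates.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented population; same interval recipe as in the survey simulation above.
population = rng.lognormal(np.log(30.0) - 0.125, 0.5, size=100_000)
true_mean = population.mean()

# Self-selection: the chance of posting a résumé rises with the wage itself.
p_in = 0.01 * population / population.max()

n_repeats, z, covered = 10_000, 1.96, 0
for _ in range(n_repeats):
    sample = population[rng.random(population.size) < p_in]
    se = sample.std(ddof=1) / np.sqrt(sample.size)
    lo, hi = sample.mean() - z * se, sample.mean() + z * se
    covered += (lo <= true_mean <= hi)

print(f"nominal 95% intervals cover the truth {covered / n_repeats:.1%} of the time")
```

The intervals are perfectly tidy and almost never cover the truth: the bias from who enters the file dwarfs the width of any honest‑looking band.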

A useful way to fix ideas is to imagine the “many‑worlds” picture. Picture 100 parallel worlds in which you re‑run the same survey design. In each world you get a slightly different random sample and you compute a 95% interval by the same recipe. Stack the 100 intervals on a chart and draw a vertical line at the true mean. Most intervals cross the line; a handful miss high or low. That is what “95%” means. Your interval today is one of those 100 lines. You do not know if it crosses the true value; you know the procedure that produced it would hit roughly 95 times out of 100 under the same mechanism. Try the same thought experiment with résumé scraping and the picture falls apart: re‑scraping tomorrow just gives you more self‑selected profiles, not genuine “alternative samples” in the sense the coverage statement requires.
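
If you would rather draw the picture than count, a few lines of matplotlib produce it (same invented survey setup as above; the colours and layout are obviously just choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
true_mean, n, z = 30.0, 200, 1.96

fig, ax = plt.subplots(figsize=(6, 8))
for world in range(100):
    # One parallel world: a fresh sample by the same design, same recipe.
    sample = rng.lognormal(np.log(true_mean) - 0.125, 0.5, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - z * se, sample.mean() + z * se
    miss = not (lo <= true_mean <= hi)
    ax.hlines(world, lo, hi, color="crimson" if miss else "grey")

ax.axvline(true_mean, color="black", linestyle="--", label="true mean")
ax.set(xlabel="wage", ylabel="world", title="100 intervals, one truth")
ax.legend()
plt.show()
```

Most grey bars cross the dashed line; the few crimson ones are the roughly 5 in 100 that miss.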

The temptation, of course, is to make the résumé interval look more respectable by throwing models at it. You adjust for observed characteristics, maybe post‑stratify to match census margins, and report a tidy “95% CI”. Those moves can help with observables, and sometimes they are the best you can do when random samples do not exist. But they do not turn convenience into chance. The interval you print is now a statement that depends on the model you are willing to defend, not on a repeatable design. That can be perfectly honest, but it is a different kind of 95% from the one a probability design or a randomisation protocol gives you.
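
Here is a minimal post‑stratification sketch to make the distinction tangible, with an invented census margin for a single observed trait (education) and an invented scrape that over‑represents the high‑education group.

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented convenience sample: a wage plus one observed trait (education).
# The scrape over-represents the high-education group.
n = 1_000
high_edu = rng.random(n) < 0.8      # 80% high-education in the scrape...
census_share = 0.4                  # ...but 40% in the (invented) census margin
wage = np.where(high_edu, rng.normal(40, 8, n), rng.normal(25, 6, n))

# Post-stratify: reweight each education group to its census share.
w = np.where(high_edu,
             census_share / high_edu.mean(),
             (1 - census_share) / (1 - high_edu.mean()))
print(f"raw mean {wage.mean():.1f} vs post-stratified {np.average(wage, weights=w):.1f}")
```

The reweighting fixes the education imbalance by construction. It is silent about selection within each group, say if the best‑paid people in every education bracket are the ones who post, and that is the part a model rather than a design has to carry.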

The same distinction carries over to experiments. In a pilot where access to training is assigned by lottery, a 95% confidence interval for the effect has its usual meaning because we can imagine all the other lottery draws we might have made among the same applicants. A small imbalance in covariates in the realised split does not void that meaning; odd assignments happen with known probability, and the procedure accounts for them. If, instead, staff “assign” treatment based on judgement or need, calling the resulting band “95%” doesn’t restore the missing counterfactual worlds. You have a comparison, not a causal contrast.
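
Those “other lottery draws” are not just a metaphor; you can generate them. Here is a sketch of randomisation inference for a difference in means, with invented outcomes: re‑run the lottery many times under the sharp null of no effect and ask how extreme the realised split looks among the assignments you could have drawn.

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented pilot: 60 applicants, 30 sent to training by lottery.
y = np.concatenate([rng.normal(32, 8, 30),   # treated outcomes
                    rng.normal(28, 8, 30)])  # control outcomes
observed = y[:30].mean() - y[30:].mean()

# Re-run the lottery under the sharp null of no effect: each re-assignment
# we "could have drawn" yields a difference that is pure chance.
draws = []
for _ in range(10_000):
    perm = rng.permutation(60)
    draws.append(y[perm][:30].mean() - y[perm][30:].mean())

p = np.mean(np.abs(np.array(draws)) >= abs(observed))
print(f"observed difference {observed:.2f}, randomisation p-value {p:.3f}")
```

Inverting this test over a grid of hypothesised effects gives an interval whose 95% refers exactly to re‑draws of the lottery, which is the coverage statement the paragraph above leans on.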

Two practical consequences follow. First, decide on the interval recipe before you look at the outcome. Post‑hoc tweaks, such as switching estimators after peeking, stopping early because results look “done”, or trimming awkward observations on the fly, change the game you are playing and erode the promised coverage. The 95% property belongs to a pre‑specified procedure under a stated mechanism. Second, make your mechanism explicit. In a survey: “Intervals report design‑based uncertainty; under this design they would cover the population mean about 95% of the time.” In a trial: “Intervals reflect randomisation uncertainty; across assignments generated by this protocol they would cover the average effect for applicants about 95% of the time.” In observational work that leans on an “as‑if random” story, be candid: “Coverage is nominal under the assumption that, conditional on pre‑treatment covariates, the remaining variation behaves as if random and there is overlap.”
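
The cost of peeking is easy to demonstrate. In the sketch below (numbers invented), the true effect is zero, but the analyst checks after every batch and stops the moment the interval excludes zero; the realised miss rate climbs far past the promised 5%.

```python
import numpy as np

rng = np.random.default_rng(6)

z, batch, max_batches, n_repeats = 1.96, 20, 25, 4_000

misses = 0
for _ in range(n_repeats):
    data = np.empty(0)
    for _ in range(max_batches):
        data = np.append(data, rng.normal(0.0, 1.0, batch))  # true effect is zero
        se = data.std(ddof=1) / np.sqrt(data.size)
        lo, hi = data.mean() - z * se, data.mean() + z * se
        if lo > 0 or hi < 0:   # peek; stop the moment zero is excluded
            misses += 1
            break

print(f"intervals missing the true value: {misses / n_repeats:.1%}")  # well above 5%
```

Nothing in any single stopped interval looks wrong; the damage is entirely to the procedure’s behaviour across repeats, which is the only place the 95% ever lived.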

Seen this way, “95%” is not a comfort blanket you drape over any estimate; it is a promise about behaviour across the paths you didn’t take. In the selection‑bias post I wrote: “If chance didn’t choose them, your statistics don’t mean what you think they mean.” The same sentence applies here. If chance didn’t choose the data, or if the interval‑building recipe does not match the mechanism, then “95%” doesn’t mean what you think it means either.