What We Could Have Observed But Didn’t: Understanding Randomness in Econometrics

Published

January 10, 2026


Students in econometrics courses often struggle with a fundamental paradox. We spend enormous effort on a single dataset: cleaning it, running regressions, calculating standard errors, interpreting coefficients. Yet the entire validity of our statistical inference rests not on this dataset alone, but on all the datasets we could have got but did not. In other words, inference depends on what could have happened under a credible mechanism, not on the particular sample we happened to observe.

Think of Sliding Doors (1998). Gwyneth Paltrow’s character rushes to catch a London Underground train. The story splits into two parallel narratives: in one, she catches the train; in the other, she misses it by a second. Her life unfolds differently depending on that one moment. In inference, we have a similar idea, but our “timelines” are possible samples. Before we look at our data, many samples could have been drawn by the mechanism. We only ever see one. Validity comes from the mechanism and from the set of samples we did not see.

When we call a sample “random,” we mean that the rule that produced it used chance in a known way: simple random sampling, stratified sampling with known probabilities, random assignment in experiments. Randomness is about the process, not about whether the data look messy or patternless.

  • A random sample can look odd. Streaks, outliers, clusters—these happen under fair mechanisms.

  • A non‑random sample can look “nice.” That doesn’t make it valid for inference about a population or a causal effect.

We compute numbers from the sample we actually got. We trust those numbers because of the mechanism that could have produced many other samples we did not get.

Before you draw a sample, there are many samples you could draw. Each would give a different estimate: a different mean, a different regression coefficient, a different test statistic, a different p-value. The distribution of all the estimates you might have got is called the sampling distribution. Your observed estimate is one point in that distribution.
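A short simulation makes the sampling distribution concrete. This is a minimal sketch with made-up numbers: the “population” of hourly wages, the sample size of 100, and the 2,000 repetitions are all illustrative assumptions, not anything from a real survey.

```python
import random
import statistics

random.seed(42)

# Hypothetical population: 100,000 hourly wages (illustrative numbers).
population = [random.lognormvariate(3.0, 0.5) for _ in range(100_000)]
true_mean = statistics.mean(population)

# Draw many of the samples the mechanism *could* have produced,
# and record the estimate each one would have given us.
sample_means = []
for _ in range(2_000):
    sample = random.sample(population, 100)  # simple random sampling
    sample_means.append(statistics.mean(sample))

# The collection of estimates approximates the sampling distribution.
# Each is one "timeline"; in practice we observe exactly one of them.
print(f"True population mean:     {true_mean:.2f}")
print(f"Mean of sample means:     {statistics.mean(sample_means):.2f}")
print(f"SD across possible samples: {statistics.stdev(sample_means):.2f}")
```

Your one observed sample mean would be a single entry in `sample_means`; the simulation shows the whole distribution it was drawn from.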

Confidence intervals and p‑values are guarantees about the procedure across the many samples the mechanism could have produced—not probabilities about this one estimate or this one interval.

If the mechanism is good (random sampling or random assignment), the procedure behaves in a predictable way across those unseen samples. If the mechanism is bad (convenience sampling, self‑selection), those guarantees vanish.
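The “guarantee about the procedure” can be checked by brute force. The sketch below, with an invented normal population and a sample size of 64, builds a 95% interval in each of 2,000 repeated samples and counts how often the interval contains the true mean. No single interval has a 95% probability of anything; the 95% describes the long-run behaviour of the rule.

```python
import random
import statistics

random.seed(0)

# Hypothetical population (illustrative numbers).
population = [random.gauss(50, 10) for _ in range(50_000)]
true_mean = statistics.mean(population)

n, z = 64, 1.96  # sample size and 95% normal critical value
trials = 2_000
covered = 0
for _ in range(trials):
    sample = random.sample(population, n)
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    if m - z * se <= true_mean <= m + z * se:
        covered += 1

# "95%" is a property of the procedure across repeated samples,
# not a probability about any single computed interval.
print(f"Empirical coverage: {covered / trials:.1%}")
```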

In Sliding Doors, catching the train changes the path. In sampling and assignment, the “door” is the selection/assignment rule:

  • If the door opens by chance (everyone has a known probability to be sampled; treatment is assigned by a coin flip), then, across possible samples, we know how our methods behave.

  • If the door opens only for certain people (phone survey at 2 p.m.; treatment “assigned” to those who shout loudest), the mechanism is biased. Bigger samples do not solve this.

Luck affects your estimate; the rule that generated the data determines what that estimate means.
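The claim that bigger samples do not fix a biased door is easy to demonstrate. In this sketch (all numbers invented), a survey can only reach people working more than eight hours a day, so every sample, however large, overstates average hours worked by roughly the same amount.

```python
import random
import statistics

random.seed(1)

# Hypothetical population: daily hours worked (illustrative numbers).
population = [random.gauss(8, 2) for _ in range(100_000)]
true_mean = statistics.mean(population)

# A biased "door": the survey only reaches people still working
# late, i.e. those with more than 8 hours. Self-selection, in effect.
reachable = [x for x in population if x > 8]

for n in (50, 500, 5_000):
    estimates = [statistics.mean(random.sample(reachable, n))
                 for _ in range(200)]
    bias = statistics.mean(estimates) - true_mean
    # The bias does not shrink as n grows; only the noise around it does.
    print(f"n={n:>5}: average bias = {bias:+.2f} hours")
```

Increasing `n` tightens each estimate around the wrong target; the gap between that target and the truth is fixed by the mechanism.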

Econometrics leans on two workhorses:

  1. Random sampling to generalise from a sample to a population.
    • Example: a labour‑force survey that draws households at random from a frame.
    • If the design is genuinely random (or properly accounted for when complex), confidence intervals for population features have their advertised long‑run coverage.
  2. Random assignment to identify causal effects.
    • Example: lottery assignment to a job‑training program.
    • Even if today’s treated and control groups look imbalanced by luck, across the many assignments we could have had, the estimator is unbiased and tests have the right false‑positive rate.
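The second workhorse can be simulated too. A minimal sketch, assuming a constant treatment effect of +2 on invented baseline earnings: any single coin-flip assignment may give an estimate that is off by luck, but averaged across the many assignments we could have had, the difference in means centres on the true effect.

```python
import random
import statistics

random.seed(2)

# Hypothetical experiment (illustrative numbers): each person's
# untreated outcome, plus a constant true effect of +2 if treated.
n = 200
baseline = [random.gauss(30, 5) for _ in range(n)]
true_effect = 2.0

diffs = []
for _ in range(2_000):
    # The "door": a fair coin assigns each person to treatment.
    treated = [random.random() < 0.5 for _ in range(n)]
    y = [b + true_effect if t else b for b, t in zip(baseline, treated)]
    t_mean = statistics.mean(yi for yi, t in zip(y, treated) if t)
    c_mean = statistics.mean(yi for yi, t in zip(y, treated) if not t)
    diffs.append(t_mean - c_mean)

# Individual estimates scatter around the truth; their average
# across possible assignments is the true effect (unbiasedness).
print(f"Average estimate across assignments: {statistics.mean(diffs):.2f}")
```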

In both cases, the unseen timelines (the samples and assignments we didn’t observe) carry the burden of validity.

So, to sum up:

  1. We compute numbers from the sample we observed.
  2. We trust those numbers because of the samples we didn’t observe but could have, under a random mechanism.
  3. If chance did not choose the data, classical guarantees do not apply, no matter how fancy the methodology looks.

Econometrics is full of doors. Some open by chance; others open for reasons that bias our view. We see one path, the sample, but our confidence must come from the many paths we could have walked, provided the door opens genuinely at random. That is the heart of statistical inference: the mechanism and the counterfactual samples we never saw.

The sample you have is one trip through the sliding doors. Statistical inference works because it accounts for all the trips you didn’t take.