Matching, Weighting, Regression: Three Roads to the Same Promise
The point of “as‑if random” is simple to state and hard to achieve: after we condition on genuinely pre‑treatment information, the remaining differences in who received the treatment should behave like luck for the comparison we care about. I have already argued that this is a mechanism claim, not a compliment about how tidy the data is. I then translated that claim into the linear model that most of us reach for first: the treatment coefficient is interpretable when the leftover part of the outcome is no longer tied to treatment, in a region where treated and untreated overlap. In this post, I want to show how three familiar families of methods (matching, weighting, and regression) are just different ways of trying to keep that promise. None of them manufactures chance. Each tries to remove the predictable part of treatment selection so that what remains looks “random” enough to support a causal reading.
Matching is the most literal expression of the idea. You take a treated unit, find one or more untreated units that look similar on a credible set of pre‑treatment variables, compute a local difference, and then average those local differences. You can do this by simple nearest neighbour rules, by stratifying the sample into bins that make the groups comparable, or by matching on a one‑number summary of the pre‑treatment information such as a propensity score. The details matter less than the discipline: keep outcomes out of sight while you build the comparison, and only look once the matched groups actually resemble each other on the drivers you believe sort people into treatment and also move outcomes.
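To make the mechanics concrete, here is a minimal sketch of one‑to‑one nearest‑neighbour matching with replacement, on synthetic data; the covariates and coefficients are illustrative assumptions, not a recipe from any particular study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data: two pre-treatment covariates, selection on X[:, 0].
n = 200
X = rng.normal(size=(n, 2))
t = (X[:, 0] + rng.normal(size=n) > 0).astype(int)
y = X @ np.array([1.0, 0.5]) + 2.0 * t + rng.normal(size=n)

treated = np.where(t == 1)[0]
control = np.where(t == 0)[0]

# For each treated unit, find its nearest untreated neighbour in covariate
# space (with replacement) and record the local difference in outcomes.
local_diffs = []
for i in treated:
    dists = np.linalg.norm(X[control] - X[i], axis=1)
    j = control[np.argmin(dists)]
    local_diffs.append(y[i] - y[j])

# The average of the local differences targets the effect for the treated.
print(f"matched estimate for the treated: {np.mean(local_diffs):.2f}")
```

Note that the outcome `y` is touched only in the last step: the match itself is built entirely from pre‑treatment information.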
This is why balance diagnostics are not decoration. You want to see, in plain plots and tables, that the matched treated and matched untreated units sit in the same territory on the key pre‑treatment variables: age, tenure, prior outcomes, sector, location, anything that is upstream of both the treatment decision and the outcome in your setting. If the matched sample achieves this, the remaining difference in treatment within those strata is closer to luck; if it does not, you have not bought what matching is selling.
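A standard way to put that check into numbers is the standardised mean difference per covariate, computed before and after matching; a sketch below, where the rule of thumb (values near zero, say below 0.1 in absolute value) is a convention, not a law.

```python
import numpy as np

def std_mean_diff(x, t):
    """Standardised mean difference of one covariate between treated and untreated."""
    x1, x0 = x[t == 1], x[t == 0]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return (x1.mean() - x0.mean()) / pooled_sd

# Illustrative use: a large value flags a covariate the groups still differ on.
rng = np.random.default_rng(1)
age = rng.normal(40, 10, size=300)
t = (age + rng.normal(0, 10, size=300) > 40).astype(int)
print(f"SMD for age before matching: {std_mean_diff(age, t):.2f}")
```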
Two cautions keep matching honest. First, overlap is not optional. If the treated live in regions of the covariate space where there are no untreated neighbours (or vice versa), forcing a match invents a counterfactual that was never observed. In that case, trimming back to a region of common support is more honest than stretching the rules. Second, matching targets a particular effect by construction. If you match untreated to treated and average the local treated‑minus‑untreated differences, you are estimating the average effect for the treated in the region you matched on. That is often the relevant quantity in programme evaluation, but it is not the population average effect.
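One simple way to enforce common support, sketched below, is to estimate a propensity score and keep only units whose score lies in the range where both groups are actually observed; the logistic model is an illustrative choice, not the only one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
t = (1.5 * X[:, 0] + rng.normal(size=500) > 0).astype(int)

# Estimated probability of treatment given pre-treatment covariates.
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]

# Common support: the score range where both treated and untreated exist.
lo = max(ps[t == 1].min(), ps[t == 0].min())
hi = min(ps[t == 1].max(), ps[t == 0].max())
keep = (ps >= lo) & (ps <= hi)
print(f"kept {keep.sum()} of {len(keep)} units on common support")
```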
Weighting pursues the same goal with a different tool: instead of thinning the data until the groups look alike, you re‑weight observations so that the distribution of pre‑treatment variables aligns across treated and untreated. The classic route is to estimate the probability of receiving treatment given pre‑treatment characteristics and then give each observation a weight that compensates for how unusual its treatment status is relative to its characteristics. When the weights work, a weighted comparison places the treated and untreated on the same footing for those drivers, making the remaining treatment variation behave more like luck.
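A minimal inverse‑probability‑weighting sketch, again on synthetic data with an illustrative logistic treatment model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=(n, 2))
t = (X[:, 0] + rng.normal(size=n) > 0).astype(int)
y = X @ np.array([1.0, 0.5]) + 2.0 * t + rng.normal(size=n)

# Weight each unit by the inverse of the probability of the treatment
# status it actually has, given its pre-treatment characteristics.
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
w = np.where(t == 1, 1 / ps, 1 / (1 - ps))

# The weighted contrast targets the population average effect.
ate_hat = (np.average(y[t == 1], weights=w[t == 1])
           - np.average(y[t == 0], weights=w[t == 0]))
print(f"IPW estimate of the average effect: {ate_hat:.2f}")
```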
Two realities make weighting succeed or fail. The first is the quality of the information you use. If the variables that drive selection and outcomes are poorly measured or missing, no re‑weighting scheme can undo what you don’t see. The second is positivity, the same overlap idea in different words. When some types of units almost never receive (or almost always receive) treatment, the weights needed to compensate become extreme and the evaluation pivots on a handful of observations. Stabilising or trimming extreme weights can help, but the straight answer is to admit where you lack a comparison and restrict claims to where you have one.
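Stabilising and trimming fit in a few lines; a sketch below, where the clip bounds (0.01 and 0.99) are an arbitrary illustrative choice that a real analysis should justify and report.

```python
import numpy as np

def stabilised_weights(t, ps, clip=(0.01, 0.99)):
    """Stabilised inverse-probability weights with trimmed propensity scores.

    Multiplying by the marginal treatment probability shrinks the weight
    scale; clipping blunts the extreme scores that come from poor overlap.
    """
    ps = np.clip(ps, *clip)
    p_treat = t.mean()
    return np.where(t == 1, p_treat / ps, (1 - p_treat) / (1 - ps))

# A very large maximum weight is a warning sign that the evaluation
# pivots on a handful of observations.
t = np.array([1, 1, 0, 0, 1])
ps = np.array([0.90, 0.995, 0.20, 0.01, 0.60])
print(stabilised_weights(t, ps))
```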
As with matching, weighting implicitly chooses a quantity to estimate. With the usual inverse‑probability weights you are reconstructing a world in which everyone “could have been” treated or untreated with similar frequency, which lines up with the population average effect. If your policy target is the effect for people who actually took up the programme, other weights are more natural. Again, say what effect your weights imply.
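For the effect on those who took up the programme, the usual alternative keeps treated units at weight one and reweights the untreated by the odds of treatment, so that they mimic the treated group’s covariate distribution; a sketch, with `ps` an estimated propensity score:

```python
import numpy as np

def att_weights(t, ps):
    """Weights targeting the average effect on the treated: treated units
    count once; untreated units are reweighted by the odds ps / (1 - ps)."""
    return np.where(t == 1, 1.0, ps / (1 - ps))
```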
Regression adjustment takes aim at the same leftover in a more familiar way. You model the outcome as a function of treatment and a carefully chosen set of pre‑treatment variables, and read the treatment coefficient as the difference that remains once those variables are accounted for. In this framing, the “as‑if” claim is exactly that the part of the outcome your model does not explain is no longer related to treatment. The model does the balancing internally.
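In code, the simplest version of this road is a single linear model; the synthetic data and the linear form are illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 1000
X = rng.normal(size=(n, 2))
t = (X[:, 0] + rng.normal(size=n) > 0).astype(int)
y = X @ np.array([1.0, 0.5]) + 2.0 * t + rng.normal(size=n)

# Outcome on treatment plus pre-treatment controls; the treatment
# coefficient is the adjusted difference.
design = sm.add_constant(np.column_stack([t, X]))
fit = sm.OLS(y, design).fit()
print(f"treatment coefficient: {fit.params[1]:.2f}")
```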
The two ways regression goes wrong are the same ones flagged in the previous post. If you treat downstream variables as controls, you dilute the very effect you want to measure. And if the functional form is too rigid for the setting (curved relationships forced into straight lines, or effects that obviously vary with baseline variables modelled without interactions), then the structure you leave out slides into the leftover, and if that structure is related to treatment you have recreated confounding. The fix is not to stack controls endlessly, but to be disciplined about what belongs (only genuine pre‑treatment drivers) and flexible about how it belongs (functional form and interactions guided by the setting).
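Flexibility about how a control belongs can be as simple as a squared term and a treatment‑by‑baseline interaction; a sketch below, where the variable names and the true data‑generating process are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 1000
baseline = rng.normal(size=n)
t = (baseline + rng.normal(size=n) > 0).astype(int)
# Illustrative truth: curved in the baseline, effect varies with the baseline.
y = baseline**2 + (1.0 + 0.5 * baseline) * t + rng.normal(size=n)
df = pd.DataFrame({"y": y, "t": t, "baseline": baseline})

# Allow curvature and an interaction instead of forcing a straight line
# and a single constant effect.
fit = smf.ols("y ~ t * baseline + I(baseline ** 2)", data=df).fit()
print(fit.params[["t", "t:baseline"]])
```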
Regression has one virtue that is easy to miss: it is often precise when you have strong predictors of the outcome. But that precision is only an asset if the identification story is credible. A very tight interval around a biased coefficient is still just a very tight mistake.
Because the three roads lead to the same point from different angles, there is value in combining them. A doubly robust estimator pairs a model of treatment assignment (for weights) with a model of the outcome (for regression adjustment) and builds an effect estimate that stays on target if either the weighting model or the outcome model is correctly specified. This is not magic. You still need information that captures the real drivers of selection and outcomes, and you still need overlap. But the hedge matters in practice, because we are rarely certain we have specified both sides perfectly.
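A compact sketch of the augmented inverse‑probability (AIPW) form of this idea, with illustrative linear and logistic working models on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(6)
n = 2000
X = rng.normal(size=(n, 2))
t = (X[:, 0] + rng.normal(size=n) > 0).astype(int)
y = X @ np.array([1.0, 0.5]) + 2.0 * t + rng.normal(size=n)

# Treatment model (for the weights) and outcome models (for the adjustment).
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
mu1 = LinearRegression().fit(X[t == 1], y[t == 1]).predict(X)
mu0 = LinearRegression().fit(X[t == 0], y[t == 0]).predict(X)

# AIPW: outcome-model contrast plus a weighted residual correction.
# Stays on target if either the treatment or the outcome model is right.
aipw = (mu1 - mu0
        + t * (y - mu1) / ps
        - (1 - t) * (y - mu0) / (1 - ps))
print(f"doubly robust estimate of the average effect: {aipw.mean():.2f}")
```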
There is a second, practical reason to blend ideas: it reveals where your conclusion rests. If a matched difference, a weighted contrast, and a flexible regression adjustment built on the same defensible pre‑treatment information all point to the same answer in the region of overlap, your reader can see that the story is not a one‑trick pony. If they diverge, you have learned that the conclusion is model‑dependent, and you should either show why your preferred model wins or narrow the question until the methods agree.
The method should follow the question, not the other way around. If your policy question is “what happened to the people who actually took up the programme?”, a matched analysis anchored on the treated group and restricted to the region where good comparisons exist is a natural first answer. If your question is “what would happen if we rolled this out broadly?”, a weighting approach that targets the population average effect is coherent. If your setting offers strong pre‑treatment predictors of the outcome and good overlap, a flexible regression can be very efficient. In many studies you will show more than one road and keep them aligned: that is not redundancy; it is evidence that the conclusion does not hang on a single modelling choice.
It is tempting to believe that careful matching, clever weights, or sophisticated regressions can defeat unobserved forces that sort people into treatment and also move outcomes. They cannot. The best any of the three can do is to make the “as‑if” claim believable conditional on what you can observe, in the region where you truly have treated and untreated units to compare. When the important drivers are invisible or the groups never overlap, it is time to look for a different source of as‑if randomness.
The common thread remains the same. Matching, weighting, and regression are three roads to the same promise. They try to clear away the part of treatment selection we can explain so that what remains is close enough to luck for our comparison to speak causally. They work best when the pre‑treatment information is rich, the design discipline is explicit, and the claim stays inside the region where like can be compared with like.