Imagine two economists asked to estimate the average wage in a city. The first downloads thousands of résumés posted on a popular job site. The second uses a list of residents, picks some at random, and asks them about their wages. That afternoon, both compute exactly the same statistic: the sample mean wage. Same formula, same software. And yet only one of those numbers can be read as “the average wage in this city” with any defensible measure of uncertainty. The difference is not the technique. The difference is the mechanism that brought people into the dataset.
This is what happens when convenience, rather than chance, decides who appears in your data. The point is simple. Your estimator may be the same in both settings, say \(\bar w = \frac{1}{n}\sum_{i=1}^n w_i\), but the meaning of that number, and the quality of the inference you can make from it, depend entirely on whether units entered your sample by a chance rule you can defend.
Let’s continue the story. Résumés posted online look rich and modern. They arrive in generous quantities. They include job titles, education, skills, sometimes even current salary or desired pay. It is extremely tempting to treat such a trove as a shortcut to economic facts. But a résumé platform is a self‑selection machine. People appear there precisely because they chose to post. They differ from non‑posters in ways that matter for wages: search intensity, job satisfaction, unemployment spells, industry churn, bargaining posture, even local platform penetration. Some groups are almost absent (older workers who stay put; high earners with bespoke networks). Others are over‑represented (recent graduates; people in sectors where online search is dominant).
A labour‑force survey, by contrast, selects individuals by a known random scheme. Everyone in the list has a known, non‑zero probability of being sampled. Non‑response is tracked and (crucially) treated as part of the sampling mechanism through callbacks, weighting, and documentation. You may end up with fewer rows in your file than in the résumé scrape, but each row represents not just itself; it stands for many similar people in the population under a rule you can write down.
Both analysts take an average. Only one can say, “my 95% interval covers the truth 95% of the time under the design.” The résumé mean is an average of the people who decided to show up. The survey mean is an average of the people selected by chance, with known probabilities. That is the difference between convenience and chance.
It is common to downplay this by saying the online data are “a bit biased but big,” and that sheer volume will make the average “close enough”. The problem is not small errors; the problem is systematic error that does not vanish with \(n\). If the platform over‑represents, say, younger, mobile, tech‑sector workers, your average will be pulled towards their wage structure no matter how many additional résumés you scrape. You will get very precise estimates of the wrong thing. The confidence interval will shrink, but it will hug a target that isn’t the population mean you promised to estimate.
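The claim that systematic error does not vanish with \(n\) is easy to see in a toy simulation. Everything below is invented for illustration: a synthetic lognormal wage population and a made‑up rule in which the probability of posting a résumé rises with the wage itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of one million workers. Posting a résumé is more
# likely for higher earners, so the platform is a self-selection machine.
N = 1_000_000
wages = rng.lognormal(mean=10.0, sigma=0.5, size=N)
post_prob = 1 / (1 + np.exp(-2 * (np.log(wages) - 10.0)))
posters = wages[rng.random(N) < post_prob]

true_mean = wages.mean()
bias = {}
for n in (1_000, 10_000, 100_000):
    conv = rng.choice(posters, size=n, replace=False)  # scrape more profiles
    srs = rng.choice(wages, size=n, replace=False)     # draw a bigger sample
    bias[n] = (conv.mean() - true_mean, srs.mean() - true_mean)
    print(f"n={n:>7}:  convenience bias {bias[n][0]:+9.0f}   "
          f"SRS bias {bias[n][1]:+9.0f}")
```

Growing \(n\) shrinks the random-sample error towards zero, while the convenience bias stays roughly where it was: more of the same self‑selected profiles.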
Even worse, without a defensible sampling rule you cannot say what your interval means. A 95% confidence interval from a probability sample is a property of the procedure across the samples you could have drawn but didn’t. It is a guarantee about long‑run coverage under the design. A 95% interval from a convenience sample is simply a statistical decoration unless you can reconstruct the inclusion mechanism. It has no design‑based interpretation and will often understate uncertainty because it ignores selection.
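The coverage claim can be checked directly: simulate many repeated samples and count how often the naive 95% interval actually covers the truth. The population and posting rule below are invented for illustration; the posting probability rises with the wage.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population; self-selection onto the platform depends on wage.
N = 200_000
wages = rng.lognormal(10.0, 0.5, size=N)
post_prob = 1 / (1 + np.exp(-2 * (np.log(wages) - 10.0)))
posters = wages[rng.random(N) < post_prob]
true_mean = wages.mean()

def covers(sample, target):
    """Naive 95% interval around the sample mean: does it cover `target`?"""
    m = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(len(sample))
    return m - 1.96 * se <= target <= m + 1.96 * se

n, reps = 500, 1_000
srs_hits = sum(covers(rng.choice(wages, n, replace=False), true_mean)
               for _ in range(reps))
conv_hits = sum(covers(rng.choice(posters, n, replace=False), true_mean)
                for _ in range(reps))

print(f"SRS coverage:         {srs_hits / reps:.1%}")   # near the nominal 95%
print(f"convenience coverage: {conv_hits / reps:.1%}")  # nowhere near 95%
```

The convenience intervals are perfectly well behaved around the platform mean; they simply sit around the wrong target, which is exactly why they carry no design‑based interpretation.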
A frequent response is to compare the résumé file to a census table and declare victory if the margins look similar: the same share of women, similar age distribution, a close match on education. This helps, but it is not enough. Selection bias is driven by unobserved differences too: job search intensity, reservation wages, informal networks, on‑the‑job performance, and local conditions that never make it into profiles. Matching on observables can make the dataset look tidy without touching the reason people ended up on the platform in the first place. If those unobservables correlate with wages, and they almost certainly do, your average remains biased.
Convenience also hides coverage error. Résumé platforms are not the labour force; they are a subset whose composition depends on adoption, marketing, language, and the sectors that find them useful. Coverage error means that some parts of the population had zero chance of appearing in your scrape. No amount of post‑stratification can revive groups that never had a path into your file.
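To see why matching observable margins is not enough, consider a toy sketch in which the analyst post‑stratifies a scrape to the true age composition, but selection within each age group runs on the wage itself, which never appears in a profile. All numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population: two observable age groups with different wages.
N = 400_000
young = rng.random(N) < 0.5
wages = np.where(young,
                 rng.lognormal(9.8, 0.5, N),
                 rng.lognormal(10.2, 0.5, N))

# Selection onto the platform depends on the wage itself (unobservable to
# the analyst), not just on age: higher earners post more within each group.
high = wages > np.median(wages)
post_prob = np.select(
    [young & high, young & ~high, ~young & high],
    [0.40, 0.20, 0.10],
    default=0.05)
posts = rng.random(N) < post_prob

true_mean = wages.mean()

# Post-stratify the scrape to the true age margins (50/50). The weighted
# sample now matches the census on age exactly...
p_young = young[posts].mean()
weights = np.where(young[posts], 0.5 / p_young, 0.5 / (1 - p_young))
ps_mean = np.average(wages[posts], weights=weights)

# ...but the within-group selection on the wage is untouched. And if a
# stratum had zero posters, its weight 0.5 / 0 would not even be defined.
print(f"true mean        {true_mean:9.0f}")
print(f"raw scrape mean  {wages[posts].mean():9.0f}")
print(f"post-stratified  {ps_mean:9.0f}")
```

The weighted file reproduces the census age margins to the decimal, yet its mean remains biased, because the reason people ended up on the platform correlates with the outcome within every cell.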
The difference shows up most starkly in how we talk about uncertainty. In a probability survey, you can say: “Had we redrawn the sample using the same design, our 95% interval would cover the true average wage about 95 times out of 100.” That sentence is meaningful because the design tells you what those hundred alternative samples look like.
With résumés, what is the thought experiment? “Had we re‑scraped the same platform on different days, we would have collected… more of the same self‑selected profiles.” You can make the standard error small by harvesting more profiles, but you have not changed the fact that the underlying mechanism chooses certain people—and excludes others—for reasons linked to the outcome. The relevant uncertainty is not the sampling jitter around your mean; it is the uncertainty about the gap between your convenience mean and the population mean. Standard errors do not capture that.
Sometimes the résumé file is all that exists. Then what? The first step is honesty about the target. You can say, “This is the average wage among people who post résumés on platform X,” and that statement can be precise and useful for platform design, vacancy targeting, or trend monitoring within that group. Trouble starts when the label slides to “the average wage in the city.”
If you still want to reach towards the population, you need structure and assumptions. Post‑stratify to external population margins if you can, but recognise that this only addresses observable differences. Use multiple platforms and triangulate; if they tell different stories, that is a signal that platform‑specific selection is strong. Explore bounds: how large would the selection gap have to be to overturn your conclusion? Anchor to credible benchmarks from probability surveys, even if they are smaller and less frequent. When the outcome allows it, exploit quasi‑random variation that affects platform presence but is plausibly unrelated to wages except through presence (for instance, a sudden platform outage or a policy that changes who is required to register), but be prepared to defend that logic carefully. This is no longer a convenience exercise; it is a design argument inside observational data.
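The bounds idea can be made concrete through the identity that the population mean is a posting‑share‑weighted average of posters and non‑posters: \(\mu = p\,\mu_{\text{post}} + (1-p)\,\mu_{\text{non}}\). The numbers below, a platform mean of 30,000 and a 20% posting share, are purely illustrative.

```python
def breakeven_gap(conv_mean, share_posters, claim):
    """How much lower would non-posters' mean wage have to be than posters'
    for the population mean to equal `claim`?

    Identity: pop_mean = p * posters_mean + (1 - p) * nonposters_mean,
    so nonposters_mean = (claim - p * posters_mean) / (1 - p).
    """
    nonposters_mean = (claim - share_posters * conv_mean) / (1 - share_posters)
    return conv_mean - nonposters_mean

# Illustrative figures (assumptions, not data): the scrape says 30,000,
# and roughly 20% of the labour force posts on the platform.
gap = breakeven_gap(conv_mean=30_000, share_posters=0.20, claim=25_000)
print(f"Non-posters would need to earn {gap:,.0f} less than posters "
      f"for the true mean to be 25,000.")
```

If a gap of that size is plausible given what you know about who posts, your headline conclusion is fragile; if it is implausibly large, the conclusion survives at least this simple bound.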
Above all, resist the urge to retrofit model‑based uncertainty on a dataset whose inclusion rule you cannot articulate. A sleek regression with many covariates and tiny p‑values is decoration if the underlying mechanism is convenience.
Selection bias is not only a design issue; it is also a modelling issue in disguise. If you insist on using the résumé file to say something about the labour force, you are implicitly adopting a model in which résumé posters and non‑posters are similar conditional on the variables you observe. That is a strong claim. It can be made more plausible by careful control sets, transparent functional forms, and sensitivity analysis; it can be made less plausible by controlling for consequences of posting (bad controls), forcing linearity where relationships are curved, or extrapolating into regions with no overlap. A model can document and partially adjust for selection, but it cannot manufacture chance where no chance operated.
A cleaner way to combine worlds is to let design lead and models follow. Use the labour‑force survey to state the population fact, with proper design‑based uncertainty. Then use the résumé data for what they are naturally good at: granular covariates, finer industry resolution, faster signals, richer text. The survey gives you the anchor; the résumé data give you the texture.
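One minimal way to let design lead is benchmark calibration: keep the scrape's movement but pin its level to the survey. The series below, and the assumption that the platform's selection pattern is stable within the year, are both hypothetical.

```python
import numpy as np

# Hypothetical inputs: a quarterly mean from the résumé scrape (fast, but
# biased in level) and one annual mean from the labour-force survey
# (slow, but design-based).
resume_quarterly = np.array([29_500, 30_000, 30_600, 31_100])  # platform means
survey_annual = 25_400                                         # survey anchor

# Calibrate: keep the scrape's *movement*, pin its *level* to the survey.
# This assumes the platform's selection is stable within the year -- a
# strong, explicit assumption, not a free lunch.
factor = survey_annual / resume_quarterly.mean()
calibrated = resume_quarterly * factor

print("calibration factor:", round(factor, 3))
print("calibrated series: ", calibrated.round(0))
```

The calibrated series averages to the survey's anchor by construction while preserving the scrape's quarter‑to‑quarter signal; whether that signal transfers to the population is exactly the stability assumption you must defend.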
A simple question disciplines the analysis: Who had a chance to be in your dataset, and how big was that chance? In the survey, you can answer with a table. In the résumé scrape, you cannot. That is the entire story. The first lets you say what the average wage is in the population, with a statement about long‑run coverage under the design. The second lets you say what the average wage is among people who choose to post résumés on a specific platform, with no justified claim about the broader labour force unless you add a convincing design or model.
To conclude: If chance did not choose them, your statistics do not mean what you think they mean.