The trouble with doing statistics without understanding it

The average economics student takes at least one module in econometrics as an undergraduate. Master's students normally take further econometrics modules, now mostly covering microeconometrics (i.e. linear regression in cross-sections and panel data, dummy dependent variable models like logit and probit, and censored and truncated regression).

Since individuals are heterogeneous, one would expect heteroskedasticity both in cross-section and in panel data. Most lecturers tell their students to use robust standard errors in applications using linear models. This is now very easy to do in practice because such procedures have been coded up in several computer packages. For example, in my modules I tend to use Stata, which offers the robust or the cluster options.

As lecturers, we tend to repeat this mantra so often that our students - and sometimes even researchers - report heteroskedasticity-robust standard errors even in cases where this does not make sense.

Heteroskedasticity-robust standard errors for an estimator make sense when two conditions are satisfied: (1) the estimator is consistent even in the presence of heteroskedasticity and (2) the standard errors routinely reported by the statistical package are incorrectly calculated under heteroskedasticity. Hence, one needs to understand the effect that heteroskedasticity has on the consistency of an estimator before deciding whether heteroskedasticity-robust standard errors should be reported.
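To make the two conditions concrete, here is a minimal simulation sketch for the one case where both hold, namely OLS in a linear model. It is an illustration under assumptions of my own choosing, not a recipe: the data-generating process (one standard normal regressor, error standard deviation equal to |x|), the sample sizes, the seed and the function names are all hypothetical. It uses only the Python standard library and the simplest White (HC0) formula; packaged robust options may apply finite-sample corrections that this sketch omits.

```python
import math
import random


def ols_with_standard_errors(x, y):
    """Fit y = a + b*x by OLS; return the slope, its classical SE and its HC0 robust SE."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical formula: assumes a single error variance common to all observations.
    sigma2 = sum(e * e for e in resid) / (n - 2)
    se_classical = math.sqrt(sigma2 / sxx)
    # White (HC0) sandwich formula for the slope: sum(dx_i^2 * e_i^2) / sxx^2.
    se_robust = math.sqrt(sum(d * d * e * e for d, e in zip(dx, resid))) / sxx
    return slope, se_classical, se_robust


def simulate(n_obs=500, n_reps=300, true_slope=2.0, seed=12345):
    """Repeat the regression on fresh heteroskedastic samples and summarise the results."""
    rng = random.Random(seed)
    slopes, classical, robust = [], [], []
    for _ in range(n_reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        # Assumed design: the error standard deviation is |x|, so errors are heteroskedastic.
        y = [1.0 + true_slope * xi + abs(xi) * rng.gauss(0.0, 1.0) for xi in x]
        b, se_c, se_r = ols_with_standard_errors(x, y)
        slopes.append(b)
        classical.append(se_c)
        robust.append(se_r)
    mean_slope = sum(slopes) / n_reps
    # The spread of the slope across replications is the benchmark a good SE should match.
    sd_of_slopes = math.sqrt(sum((b - mean_slope) ** 2 for b in slopes) / (n_reps - 1))
    return {
        "mean_slope": mean_slope,
        "sd_of_slopes": sd_of_slopes,
        "mean_classical_se": sum(classical) / n_reps,
        "mean_robust_se": sum(robust) / n_reps,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.4f}")
```

Under this design the average slope should sit close to the true value of 2 (condition 1), the classical standard error should come out well below the actual spread of the slope across replications (condition 2), and the robust standard error should be close to that spread. The same check applied to an estimator that fails condition (1) would show the estimate itself drifting away from the truth, which no standard error correction can repair.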

In some situations heteroskedasticity fundamentally affects an estimator by making it inconsistent. For example, the standard estimator for the Tobit model, a common model for censored regressions, is inconsistent under heteroskedasticity (see Arabmazar, A. and P. Schmidt (1981) ‘Further evidence on the robustness of the Tobit estimator to heteroskedasticity’, Journal of Econometrics 17, 253-258). Despite this, it is easy to report robust standard errors for the Tobit model, since the usual robust or clustered options can be ticked. Doing this does not solve the essential problem of inconsistency of the estimator, and it may also give researchers the impression that they have actually ‘dealt with’ heteroskedasticity when in fact they have not.

If well-trained students and researchers can easily fall into this trap, what happens when these tools are used by individuals without training? This is more than a speculative worry. There are lots of tools on the market that allow individuals to ‘bypass’ the experts - those who know how the methods actually work - and perform analytics or machine learning with no knowledge of statistics.