Its usefulness, limitations, and misunderstandings

The central limit theorem is one of the foundations of modern statistics, with wide applicability to statistical and machine learning methods. This post explains its meaning and usefulness (the good), its limitations (the bad), and the problems caused by its misuse or misinterpretation (the ugly).

**Central Limit Theorem**

The central limit theorem (CLT) says that, under certain conditions, the sampling distribution of a statistic can be approximated by a normal distribution, even if the population does not follow a normal distribution.

More formally, for the case of the population mean, let (*X*₁, …, *X*ₙ) be a sequence of i.i.d. (independently and identically distributed) random variables with mean μ and variance σ². Then the random variable *Z*, written as

Z = (X̄ − μ) / (σ / √n),

where X̄ is the sample mean of (*X*₁, …, *X*ₙ), converges in distribution to the standard normal distribution N(0,1) as n approaches infinity. This means that when the X's are generated purely randomly from an identical population with a non-normal distribution of unknown form, the sampling distribution of the standardized mean of the X's (Z) approaches N(0,1) as the sample size increases.

The figures above illustrate the CLT with Monte Carlo simulations in which the underlying population is the chi-squared distribution with three degrees of freedom. The red curves are the sampling distributions of Z, and the black curves represent N(0,1). You can see that the red curves get closer to the black one as the sample size increases. A full description of the Monte Carlo simulation and the R code are available from this post.
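A simulation of this kind takes only a few lines. Below is a minimal Python sketch (the post's own code is in R), drawing repeated samples from a chi-squared(3) population and standardizing their means; the sample sizes and replication count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
df, n_reps = 3, 20_000   # chi-squared degrees of freedom; Monte Carlo replications

def standardized_means(n):
    """Draw n_reps samples of size n from chi-squared(df) and standardize each mean."""
    samples = rng.chisquare(df, size=(n_reps, n))
    mu, sigma = df, np.sqrt(2 * df)              # mean and sd of chi-squared(df)
    return (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

def skewness(z):
    """Sample skewness: zero for an exactly normal distribution."""
    return float(((z - z.mean()) ** 3).mean() / z.std() ** 3)

for n in (5, 30, 200):
    z = standardized_means(n)
    print(f"n={n:4d}  mean={z.mean():+.3f}  sd={z.std():.3f}  skewness={skewness(z):+.3f}")
```

The skewness of the standardized means shrinks toward zero (the normal value) as n grows, which is exactly the convergence the red curves display.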

**The Good**

Many inferential methods in statistics are operational only when the sampling distribution of a statistic is fully known. For example, the one-sample t-test for the population mean requires knowledge of the distribution of the sample mean: the Z-statistic above, with σ replaced by the sample standard deviation, follows a t-distribution exactly when the population is normal.

However, in practice the underlying distribution of the population is often unknown. It may also depart substantially from normality, so that the assumption of normality cannot be justified. In that case, what distribution should we use for a hypothesis test or a confidence interval for the population parameter?

The CLT provides us with such a distribution, although it is an approximation. That is, we employ what is called an *asymptotic approximation*, where the limiting distribution of a statistic is used as an approximation to its sampling distribution in small samples.

For example, in the above figure, when n = 50, we do not know or observe the red curve in practice, so we use its limiting distribution N(0,1) as an approximation to it, by virtue of the CLT.

The same argument and similar asymptotic approximations apply to many other test statistics, such as those in linear regression and machine learning models. As a result, researchers are able to conduct a range of statistical tests without having to know the exact sampling distribution of a test statistic and without requiring normality of the population.
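As a concrete instance of the approximation in use, here is a sketch of a CLT-based two-sided test for a population mean. The plug-in of the sample standard deviation and the exponential population are illustrative assumptions, not from the post:

```python
import numpy as np
from math import erf, sqrt

def z_test(x, mu0):
    """Two-sided test of H0: mean = mu0 via the CLT's normal approximation.
    sigma is unknown, so the sample standard deviation is plugged in."""
    n = len(x)
    z = (np.mean(x) - mu0) / (np.std(x, ddof=1) / sqrt(n))
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # 2 * P(N(0,1) > |z|)
    return z, p

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100)   # skewed population with true mean 2.0
z, p = z_test(x, mu0=2.0)                  # H0 happens to be true here
print(f"z = {z:.3f}, p = {p:.3f}")
```

The normal quantile 1.96 (or the p-value above) replaces the unknown finite-sample distribution of Z, by virtue of the CLT.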

**The Bad**

A key question is whether this *asymptotic approximation* is accurate enough in practice. There is no general answer: it depends on the context and circumstances of each application.

Statistical textbooks suggest n ≥ 30 as a benchmark for the case of the population mean, but there is no theoretical basis for this as a general rule. In the above illustration, the approximation appears reasonably accurate when n ≥ 30, but that is a controlled experiment for one specific problem, and the result may not generalize to other practical applications.
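The adequacy of the n ≥ 30 benchmark can be probed directly by simulation. The sketch below checks the empirical coverage of the nominal 95% CLT-based interval under a skewed chi-squared population (a hypothetical setup chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def coverage(n, n_reps=10_000, df=3):
    """Empirical coverage of the nominal 95% CLT-based interval
    for the mean of a chi-squared(df) population (true mean = df)."""
    x = rng.chisquare(df, size=(n_reps, n))
    m = x.mean(axis=1)
    half = 1.96 * x.std(axis=1, ddof=1) / np.sqrt(n)
    return float(np.mean((m - half <= df) & (df <= m + half)))

for n in (10, 30, 100, 500):
    print(f"n={n:4d}  coverage={coverage(n):.3f}")
```

Coverage creeps toward 0.95 as n grows, but how fast it does so depends on the population, which is precisely why a universal n ≥ 30 rule has no theoretical footing.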

Hence, researchers should be cautious when this approximation is in use, since a poor approximation can produce misleading statistical results. They should combine all other available information in their analysis, such as the effect size, the practical significance of the results, and descriptive analyses, in order to reach a final decision on the problem at hand.

The bootstrap method, a non-parametric and data-dependent method based on resampling, may be used as an alternative to the asymptotic approximation, and it may provide a better approximation in small samples. For more details, please see this post.
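A minimal sketch of one common bootstrap variant, the percentile bootstrap for the mean (the 5,000 resamples and the small chi-squared sample are arbitrary illustrative choices):

```python
import numpy as np

def bootstrap_ci(x, n_boot=5_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the population mean: resample the data
    with replacement and take quantiles of the resampled means."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    boot_means = x[idx].mean(axis=1)
    return np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(7)
x = rng.chisquare(3, size=25)          # a small sample from a skewed population
lo, hi = bootstrap_ci(x)
print(f"95% bootstrap CI for the mean: [{lo:.2f}, {hi:.2f}]")
```

Because the interval is built from the data's own resampling distribution rather than from N(0,1), it can reflect skewness that the asymptotic interval ignores.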

Another misunderstanding of the CLT is the belief that a larger sample is always better. It is true that the approximation improves with increasing sample size, as in the above plots, but those plots come from a controlled experiment in which all of the assumptions of the CLT are fully satisfied. In practice, as the researcher increases the sample size, it becomes more and more likely that the key assumptions of the CLT are violated.

In principle, the sample size should be determined in consideration of the level of significance (the probability of a Type I error), the probability of a Type II error, and the statistical power, **before** the researcher collects the data. An arbitrarily large sample can cause an imbalance between the Type I and Type II error probabilities (or extremely high power). As a result, a null hypothesis can be rejected even when it is violated by only a negligible margin, yielding statistical significance with no practical meaning.
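This pre-data calculation can be illustrated with the standard normal-approximation formula for the sample size of a two-sided z-test; the effect sizes below are hypothetical:

```python
from math import ceil

def required_n(delta, sigma, z_alpha=1.96, z_beta=0.8416):
    """Sample size for a two-sided z-test to detect a mean shift of delta,
    at significance level 0.05 (z_alpha = 1.96) with power 0.80 (z_beta = 0.8416)."""
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# The smaller the effect worth detecting, the larger the required n; read in
# reverse, an enormous n buys power against effects of no practical interest.
for delta in (0.5, 0.1, 0.01):
    print(f"delta = {delta}: n = {required_n(delta, sigma=1.0)}")
```

Choosing n this way ties the sample size to an effect size that actually matters, rather than letting an arbitrarily large n manufacture significance.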

**The Ugly**

The most serious misconception about the CLT is that a larger sample is always better and can serve as a silver bullet for every problem in statistical decision-making. As mentioned above, there are two main issues:

1. The assumptions of the CLT can be violated as the sample size increases. This can introduce bias, which is magnified with increasing sample size.

2. With a large or massive sample, the null hypothesis can often be rejected even when it is violated by only a negligible margin, due to extreme power, yielding statistical significance with little practical meaning.

Here are two examples:

**Sampling bias**

In his article warning against the dangers of big data, Harford (2014) provides an interesting example of the failure of the belief that a larger sample is always better. For the pre-election poll of the 1936 U.S. Presidential election (Landon vs. Roosevelt), a magazine called *Literary Digest* sent out 10 million questionnaires to voters sampled from automobile registries and telephone books. Based on the 2.5 million responses, the magazine predicted a landslide victory for Landon. The prediction went horribly wrong: the outcome was a landslide victory for Roosevelt.

It turned out that those who owned a telephone or an automobile in 1936 were very rich, and the 2.5 million who responded may well have been a small, unrepresentative subset of the population: those intensely interested in the election.

A larger sample can often carry a serious systematic bias, in violation of the key assumption of random sampling, and that bias is magnified with increasing sample size. The promise of the central limit theorem does not hold here, because the way the data were collected is not, by any means, compatible with the assumptions of the theorem.
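This mechanism is easy to reproduce in simulation. The sketch below assumes a hypothetical response probability that rises with the measured value itself, loosely mimicking the wealthy telephone and automobile owners of 1936; all numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.normal(50, 10, size=1_000_000)   # true mean is (approximately) 50

# Hypothetical response mechanism: units with larger values respond more often.
respond_prob = np.clip((population - 30) / 40, 0.01, 0.99)
responders = population[rng.random(len(population)) < respond_prob]

def biased_sample_mean(n):
    """Mean of a size-n sample drawn only from the responders."""
    return float(rng.choice(responders, size=n).mean())

for n in (100, 10_000, 200_000):
    print(f"n={n:>7,}  estimate = {biased_sample_mean(n):.2f}  (true mean is about 50)")
```

The estimate settles near the responders' mean, not the population mean: more data from the same biased mechanism only sharpens the wrong answer.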

**Spurious statistical significance**

Consider a regression model Y = α + βX + u, where the t-statistic for H0: β = 0 can be written as

t = √n × (b / s),

where b is the regression estimate of β, s is an appropriate measure of variability, and n is the sample size. It follows from this expression that, even if the value of b is practically zero, a large enough sample size can push the t-statistic beyond 1.96 in absolute value.
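The √n effect can be verified by simulation. The sketch below fits OLS on synthetic data with a tiny true slope (a hypothetical β = 0.01, not from the post's data) and watches the t-statistic grow with n:

```python
import numpy as np

rng = np.random.default_rng(5)
beta_true = 0.01           # a hypothetical, practically negligible slope

def slope_t_stat(n):
    """OLS slope t-statistic for y = alpha + beta*x + u on synthetic data."""
    x = rng.normal(size=n)
    y = 0.5 + beta_true * x + rng.normal(size=n)
    xc = x - x.mean()
    b = (xc @ (y - y.mean())) / (xc @ xc)            # OLS slope estimate
    resid = y - y.mean() - b * xc
    se = np.sqrt((resid @ resid) / (n - 2) / (xc @ xc))
    return float(b / se)

for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>10,}  t = {slope_t_stat(n):.2f}")
```

The t-statistic grows roughly like √n, so any nonzero slope, however economically trivial, eventually clears the 1.96 threshold.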

As an example, consider regressing daily stock returns (Y) on daily sunspot numbers (X): a relationship with little economic or practical significance. Can this meaningless regression yield a statistically significant slope coefficient (b) simply by increasing the sample size? The answer is yes, almost always.

I considered the index returns from 24 stock markets from January 1988 to February 2016 (7,345 observations each), including Amsterdam, Athens, Bangkok, Brussels, Buenos Aires, Copenhagen, Dublin, Helsinki, Istanbul, Kuala Lumpur, London, Madrid, Manila, New York, Oslo, Paris, Rio de Janeiro, Santiago, Singapore, Stockholm, Sydney, Taipei, Vienna, and Zurich. The total number of pooled observations is 176,280. To highlight the effect of increasing sample size on statistical significance, I conducted pooled regressions, cumulatively pooling the data from the Amsterdam market through the Zurich market (increasing the sample size from 7,345 to 176,280).

The figure above presents the t-statistic for H0: β = 0 as the sample size increases, with the red line marking 1.96. The coefficient becomes statistically significant when the sample size reaches around 40,000: a statistically significant outcome, but one with little economic significance. For more details about this phenomenon, please see this post. The R code and data for the above plot are available from here.

A point related to the CLT is that, as the sample covers more and more data points, the i.i.d. condition may well be violated. That is, it becomes increasingly unlikely that the stock returns are generated from a population with the same mean and variance, because the stock markets are all different.

The above research design follows the study of Hirshleifer and Shumway (2003), who claimed a systematic relationship between stock returns and the weather in New York. Many research papers in finance have claimed evidence against market efficiency in this way, but it is quite likely that they represent spurious statistical significance obtained from a large sample, driven by the belief that a larger sample always does good.

This post has reviewed the central limit theorem, which is fundamental to many methods in statistics and machine learning. The usefulness of the theorem and its limitations have been discussed, along with the problems that arise when it is misused or misinterpreted. The key messages are:

- the theorem provides a basis for approximating the unknown sampling distribution of a statistic using its known limiting distribution when the sample size is small (*asymptotic approximation*); and
- the theorem does not mean that a larger sample size should always be preferred where possible. In principle, the sample size should be chosen in consideration of the Type I and Type II error probabilities before the data are collected.
