How Useful is F-test in Linear Regression?

Not very much, but we can improve it.

The F-test statistic for the joint significance of the slope coefficients is routinely reported in regression outputs, along with other key statistics such as R² and t-ratios. The question is whether it is useful or informative as a key statistic. Does it add any value to your regression results? Although it is routinely reported, one may observe that the F-test almost always rejects H0 in practical applications. What does it tell us about the goodness-of-fit of a regression? You will often find that the value of R² is very low, yet the F-test says the model has statistically significant explanatory power. Isn’t this a conflicting outcome? How can we reconcile this?

In this post, I explain the problems associated with the F-test and how it can be modified so that it can serve as a useful tool. I should like to thank Venkat Raman for his LinkedIn post that has motivated this article. The R code, data, and a supporting document are available from here.

The contents are as below:

What is the F-test in linear regression?
Critical values in response to sample size (T) and the number of explanatory variables (K)
F-statistics in response to T and K
Example
Why is this phenomenon happening?
How can the F-test be modified?

What is the F-test in linear regression?

Consider a linear regression model

Y = β0 + β1 X1 + … + βK XK + u,

where Y is the dependent variable, the X’s are the independent variables, and u is the error term that follows a normal distribution with zero mean and a fixed variance. The null hypothesis of the test is

H0: β1 = … = βK = 0

against H1 that at least one of these β’s ≠ 0. Let P² be the population value of the coefficient of determination while R² is its sample estimator.

· Under H0, the X variables have no explanatory power for Y and P² = 0.

· Under H1, at least one of the X’s has explanatory power for Y and P² > 0.

It is well known that R² is a non-decreasing function of K. That is, it never falls, and typically rises, as more explanatory variables are added to the model.
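A minimal sketch of this property, assuming NumPy is available. The data are simulated noise (not the data used in this post) and the helper `r_squared` is a hypothetical name introduced only for illustration. Even though the regressors are unrelated to Y, R² does not fall as columns are added.

```python
import numpy as np

def r_squared(y, X):
    """R² from an OLS fit of y on the columns of X plus an intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    ssr1 = resid @ resid                    # residual sum of squares of the full model
    ssr0 = np.sum((y - y.mean()) ** 2)      # residual sum of squares with intercept only
    return 1.0 - ssr1 / ssr0

rng = np.random.default_rng(0)
T, K_max = 200, 10
y = rng.standard_normal(T)                  # simulated Y
X = rng.standard_normal((T, K_max))         # simulated regressors, unrelated to Y

# R² for the nested models using the first k regressors, k = 1, ..., K_max
r2_path = [r_squared(y, X[:, :k]) for k in range(1, K_max + 1)]
for k, r2 in enumerate(r2_path, start=1):
    print(f"K = {k:2d}  R² = {r2:.4f}")
```

Because each model nests the previous one, the printed R² values form a non-decreasing sequence even though none of the regressors has any true explanatory power.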

The F-test statistic is written as

F = [(SSR0 - SSR1)/K] / [SSR1/(T-K-1)] = [R²/K] / [(1-R²)/(T-K-1)],

where SSR0 is the residual sum of squares under H0 and SSR1 is the same under H1, while T is the sample size. The second equality expresses the F-test statistic in terms of R².

Under H0, the statistic follows the (central) F-distribution with (K, T-K-1) degrees of freedom, denoted as F(K, T-K-1). The null hypothesis is rejected at the α level of significance if F > Fc(α), where Fc(α) is the α-level critical value from F(K, T-K-1).
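The decision rule can be sketched as follows, assuming SciPy is available. The function names and the input values (R² = 0.02, T = 1000, K = 5) are illustrative assumptions, not results from this post.

```python
from scipy.stats import f as f_dist

def f_statistic(r2, T, K):
    """F-statistic for joint significance, written in terms of R²."""
    return (r2 / K) / ((1.0 - r2) / (T - K - 1))

def central_critical_value(T, K, alpha=0.05):
    """Upper alpha-level critical value of the central F(K, T-K-1) distribution."""
    return f_dist.ppf(1.0 - alpha, K, T - K - 1)

# Hypothetical inputs chosen only to illustrate the rule
r2, T, K = 0.02, 1000, 5
F = f_statistic(r2, T, K)
Fc = central_critical_value(T, K)
reject = F > Fc
print(f"F = {F:.3f}, Fc(0.05) = {Fc:.3f}, reject H0: {reject}")
```

With these inputs a model explaining only 2% of the variation in Y is still declared jointly significant, which is the pattern discussed in the rest of the post.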

Critical values in response to K and T

Let us first see how the critical value Fc(α) changes in response to the values of sample size and the number of explanatory variables.

Figure 1 above shows that the 5% critical value declines as the value of K or as the value of T increases. This means that, with a larger sample size or a larger number of explanatory variables, the bar to reject H0 gets lower. Note that this property is also evident for other α-level critical values.
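The pattern can be checked numerically with a short sketch, assuming SciPy is available. The grid of T and K values below is an arbitrary choice for illustration and is not the grid used for Figure 1.

```python
from scipy.stats import f as f_dist

sample_sizes = [50, 100, 500, 1000, 5000]   # illustrative values of T
regressors = [1, 5, 10, 25]                 # illustrative values of K

# 5% critical values of the central F(K, T-K-1) distribution
critical = {
    (T, K): f_dist.ppf(0.95, K, T - K - 1)
    for T in sample_sizes
    for K in regressors
}

print("T \\ K " + "".join(f"{K:>8d}" for K in regressors))
for T in sample_sizes:
    row = "".join(f"{critical[(T, K)]:8.3f}" for K in regressors)
    print(f"{T:<6d}{row}")
```

Reading across a row or down a column of the printed table, the critical value falls on this grid as either K or T grows.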

F-test statistic in response to T and K

It is clear from the F-statistic formula above that the value of the F-statistic is determined by T, K, and R². More specifically,

the F-statistic is an increasing function of T, given a fixed value of K, as long as the value of R² does not decrease with T;
when R² value decreases with T, the F-statistic still increases with T, if the effect of increasing T outpaces that of decreasing R²/(1-R²);
the F-statistic is an increasing function of K, given a fixed value of T, because the value of R² always increases with the value of K as stated above.

The above observations indicate that, in practice, the F-statistic is highly likely to be an increasing function of T and K. However, the F critical values decline with increasing values of T and K, as reported in Figure 1. Hence, in modern applications where the values of T and K are large, it is frequently the case that F > Fc(α), so the null hypothesis is often rejected.
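A small sketch of this mechanism, assuming SciPy is available. R² is held fixed at a hypothetical value of 0.01 with K = 5 while T grows; these numbers are assumptions chosen for illustration only.

```python
from scipy.stats import f as f_dist

r2, K = 0.01, 5                              # hypothetical, practically negligible fit
sample_sizes = [100, 500, 1000, 5000, 10000]

rows = []
for T in sample_sizes:
    F = (r2 / K) / ((1.0 - r2) / (T - K - 1))    # F-statistic in terms of R²
    Fc = f_dist.ppf(0.95, K, T - K - 1)          # 5% central critical value
    rows.append((T, F, Fc, F > Fc))
    print(f"T = {T:6d}  F = {F:7.3f}  Fc = {Fc:.3f}  reject H0: {F > Fc}")
```

The same 1% fit that is not significant in a small sample becomes significant once T is large enough, because the statistic scales with T while the critical value slowly falls.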

An example

I consider a data set with sunspot numbers (Y) and stock returns from different stock markets (X1, …, XK), observed daily from January 1988 to February 2016 (7345 observations). This is intended to be a nonsense regression, for a relationship with little economic justification. If the F-test is useful and effective, it should almost always fail to reject H0, while the value of R² is expected to be close to 0.

The stock returns are from 24 stock markets (K = 24), including Amsterdam, Athens, Bangkok, Brussels, Buenos Aires, Copenhagen, Dublin, Helsinki, Istanbul, Kuala Lumpur, London, Madrid, Manila, New York, Oslo, Paris, Rio de Janeiro, Santiago, Singapore, Stockholm, Sydney, Taipei, Vienna, and Zurich.

I run the regression of Y on (X1, …, XK), progressively increasing the sample size and the number of stock markets, i.e., increasing the values of T and K. That is, the first regression starts with (T = 50, K = 1), and then (T = 50, K = 2), …, (T = 50, K = 24), followed by (T = 198, K = 1), …, (T = 198, K = 24), and so on, and the process continues until the last set of regressions with (T = 7345, K = 1), …, (T = 7345, K = 24).

As we can see from Figure 2 above, the value of the F-test statistic in general increases with sample size, for most of the values of K. The F-statistics are larger than the 5% critical values Fc (which are well below 2 in most cases), rejecting H0 in most cases. In contrast, the values of R² approach 0 as the sample size increases, for all K values.

This means that R² is effectively telling us that the regression model is meaningless, but the F-test says otherwise by rejecting H0 in most cases. The two key statistics give conflicting outcomes.

Why is this phenomenon happening?

This does not mean that the theory of the F-test developed by Ronald Fisher is wrong. The theory is correct, but it works only when H0 is true exactly and literally, that is, when P² = 0 or all slope coefficients are exactly 0, without any deviation. However, such a situation will not occur in the real world, where researchers use observational data: the values of R² can get close to 0, but they cannot be exactly zero. Hence, the theory works only in statistical textbooks or computationally under a controlled Monte Carlo experiment.
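The point can be illustrated with a rough calculation of how often the conventional test rejects, assuming SciPy is available. The sketch assumes that, when the population value is P², the F-statistic is approximately non-central F with non-centrality T·P²/(1-P²); this form, the function name, and the values P² = 0.005 and K = 5 are all assumptions made for illustration.

```python
from scipy.stats import f as f_dist, ncf

def rejection_probability(p2, T, K, alpha=0.05):
    """Approximate probability that the conventional F-test rejects H0: P² = 0
    when the population coefficient of determination equals p2."""
    dfn, dfd = K, T - K - 1
    fc = f_dist.ppf(1.0 - alpha, dfn, dfd)   # conventional (central) critical value
    if p2 == 0.0:
        return f_dist.sf(fc, dfn, dfd)       # H0 exactly true: equals alpha
    lam = T * p2 / (1.0 - p2)                # assumed non-centrality parameter
    return ncf.sf(fc, dfn, dfd, lam)

K = 5
sample_sizes = [100, 1000, 10000]
power_exact_null = [rejection_probability(0.0, T, K) for T in sample_sizes]
power_tiny_p2 = [rejection_probability(0.005, T, K) for T in sample_sizes]

for T, p_null, p_tiny in zip(sample_sizes, power_exact_null, power_tiny_p2):
    print(f"T = {T:6d}  P² = 0: {p_null:.3f}   P² = 0.005: {p_tiny:.3f}")
```

When H0 holds exactly, the rejection rate stays at 5% for every T. With a tiny but non-zero P², the rejection rate climbs towards 1 as T grows.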

We should also remember that the F-test was developed in the 1920s, when the values of T and K were as small as 20 and 3, respectively. The values of T and K we encounter today were unimaginable then.

How can the F-test be modified?

The main problems with the F-test are identified above:

the critical value of the test decreases while the test statistic increases, in response to increasing values of T and K.

As mentioned above, this occurs because the F-test is for H0: P² = 0, but its sample estimate R² will never get to 0 exactly and literally. As a result, the F-test statistic increases with sample size in general, even if R² decreases to a practically negligible value.

How do we fix this? In fact, the solution is quite simple. Instead of testing H0: P² = 0 as in the conventional F-test, we should conduct a one-tailed test of the following form:

H0: P² ≤ P0; H1: P² > P0.

This is based on the argument that, for a model to be statistically important, its R² value should be at least P0. Suppose P0 is set at 0.05. Under H0, any R² value less than 0.05 is practically negligible and the model is regarded as being substantively unimportant. The researcher can choose other values of P0, depending on the context of the research.

Under H0: P² ≤ P0, the F-statistic follows a non-central F-distribution F(K,T-K-1; λ), where λ is the non-centrality parameter given by
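The expression is not shown in this copy of the post. A form that agrees with the properties stated next (λ = 0 when P0 = 0, and λ increasing in T when P0 > 0) and that comes close to the critical values quoted for Figure 4 is the following. It should be read as a reconstruction under those assumptions, not as a quotation from the working paper:

```latex
\lambda = \frac{T \, P_0}{1 - P_0}
```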

Obviously, when P0 = 0 as in the conventional F-test, λ = 0 and the F-statistic follows the central F-distribution F(K,T-K-1). It is clear from the above expression that λ is an increasing function of the sample size T for P0 > 0. As a result, the critical value Fc(α) is also an increasing function of the sample size.

Figure 3 above illustrates the non-central distributions F(K,T-K-1; λ) when K = 5 and P0 = 0.05, under a range of increasing values of T from 100 to 2000. The increasing value of λ pushes the distributions away from 0, as well as their 5% critical values.

Figure 4 above demonstrates the property as a function of T and K when P0 = 0.05. For example, when T = 1000 and K = 25, Fc(α) = 4.27; and when T = 2000 and K = 25, Fc(α) = 6.74, where α = 0.05.
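These critical values can be approximated with a short sketch, assuming SciPy is available and assuming the non-centrality parameter takes the form λ = T·P0/(1-P0). That form and the function name are assumptions for illustration, not the author's R code.

```python
from scipy.stats import f as f_dist, ncf

def noncentral_critical_value(T, K, p0=0.05, alpha=0.05):
    """Upper alpha-level critical value for testing H0: P² <= p0."""
    dfn, dfd = K, T - K - 1
    if p0 == 0.0:
        # Conventional test: central F-distribution
        return f_dist.ppf(1.0 - alpha, dfn, dfd)
    lam = T * p0 / (1.0 - p0)                # assumed non-centrality parameter
    return ncf.ppf(1.0 - alpha, dfn, dfd, lam)

fc_1000 = noncentral_critical_value(T=1000, K=25)
fc_2000 = noncentral_critical_value(T=2000, K=25)
print(f"T = 1000, K = 25: Fc = {fc_1000:.2f}")
print(f"T = 2000, K = 25: Fc = {fc_2000:.2f}")
```

Under this assumption the two values should land close to the 4.27 and 6.74 quoted above, and, unlike the central critical values, they rise with T.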

Further details of this test can be found in the working paper (currently under review for publication) whose pdf copy is available from here.

Getting back to our example for the sunspot regression, a test for H0: P² ≤ 0.05; H1: P² > 0.05 can be conducted. The results of the selected cases are summarized as below, where α = 0.05:

Except when T = 50, the F-statistics are greater than the critical values from the central F-distributions, which means that H0: P² = 0 is rejected at the 5% level of significance, despite negligible R² values. However, the F-statistics are less than the critical values from the non-central F-distributions, which means that H0: P² ≤ 0.05 cannot be rejected at the 5% level of significance, consistent with negligible R² values.
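The comparison of the two tests can be sketched as below, assuming SciPy is available and the same assumed form λ = T·P0/(1-P0). The inputs (R² = 0.01, T = 5000, K = 10) are hypothetical and are not taken from the sunspot regression results, and `compare_tests` is a name introduced here for illustration.

```python
from scipy.stats import f as f_dist, ncf

def compare_tests(r2, T, K, p0=0.05, alpha=0.05):
    """Run the conventional test (H0: P² = 0) and the modified test (H0: P² <= p0)."""
    dfn, dfd = K, T - K - 1
    F = (r2 / K) / ((1.0 - r2) / dfd)            # F-statistic in terms of R²
    lam = T * p0 / (1.0 - p0)                    # assumed non-centrality parameter
    fc_central = f_dist.ppf(1.0 - alpha, dfn, dfd)
    fc_noncentral = ncf.ppf(1.0 - alpha, dfn, dfd, lam)
    return {
        "F": F,
        "Fc_central": fc_central,
        "Fc_noncentral": fc_noncentral,
        "reject_P2_equals_0": F > fc_central,
        "reject_P2_at_most_p0": F > fc_noncentral,
    }

# Hypothetical inputs, for illustration only
result = compare_tests(r2=0.01, T=5000, K=10)
for name, value in result.items():
    print(f"{name}: {value}")
```

With these inputs the conventional test rejects H0: P² = 0, while the modified test does not reject H0: P² ≤ 0.05, mirroring the pattern described above.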

To conclude, the F-test has serious issues as a means of testing for the goodness-of-fit of a regression model, especially when the sample size or the number of explanatory variables is large. It often conflicts with low R² values, which indicate that the model has a negligible effect. Hence, in its present form, the F-test is not useful as a test for goodness-of-fit. However, with the simple modification introduced in this post and illustrated with an example, the test can become useful.

Jae H. Kim