
IID: Meaning and Interpretation for Beginners

Independent and Identically Distributed


In statistics, data analysis, and machine learning, the concept of IID frequently appears as a fundamental assumption or condition. It stands for “independent and identically distributed”. An IID random variable or sequence is an important building block of statistical and machine learning models, and it also plays a key role in time series analysis.


In this post, I explain the concept of IID in an intuitive way in three different contexts: sampling, modelling, and predictability.


IID in Sampling

The notation X ~ IID(μ,σ²) represents purely random sampling of (X1, …, Xn) from a population with mean μ and variance σ². That is,

  • each successive realization of X is independent, showing no association with the previous one or with the one after; and

  • each successive realization of X is obtained from the same distribution with identical mean and variance.

Examples

Suppose a sample (X1, …, Xn) is collected from the distribution of annual incomes of individuals in a country.

  1. A researcher has selected the income of a male for X1, a female for X2, a male for X3, then a female for X4, and keeps this pattern up to Xn. This is not IID sampling, because a predictable or systematic pattern in sampling is non-random, violating the condition of independence.

  2. A researcher has selected (X1, …, X500) from the poorest group of individuals and then (X501, …, X1000) from the richest group. This is not IID sampling, because the two groups have different income distributions, with different means and variances, violating the condition of identical distribution.
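The second example can be checked numerically. The sketch below (with hypothetical income figures) draws one sample purely at random and another by stacking a "poorest group" on top of a "richest group"; only the first behaves like an IID sample.

```python
import numpy as np

rng = np.random.default_rng(42)

# IID sampling: every draw comes independently from the same distribution.
# (The income figures below are hypothetical.)
iid_sample = rng.normal(loc=50_000, scale=15_000, size=1_000)

# Violation of identical distribution, as in example 2: the first half is
# drawn from a "poorest group" distribution, the second from a "richest" one.
poor = rng.normal(loc=20_000, scale=5_000, size=500)
rich = rng.normal(loc=150_000, scale=40_000, size=500)
non_identical = np.concatenate([poor, rich])

# The two halves of the IID sample have similar means; the stacked sample's
# halves do not, revealing the non-identical distributions.
print(iid_sample[:500].mean() - iid_sample[500:].mean())        # small
print(non_identical[:500].mean() - non_identical[500:].mean())  # large, negative
```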

IID in Modelling

Suppose Y is the variable of interest you want to model or explain. Then, it can be decomposed into two parts: namely,


Y = Systematic Component + Unsystematic Component.


The systematic component is the part of Y driven by the fundamental relationship with other factors. It is the component that can be explained or expected from theories, common sense, or stylized facts. It is the fundamental part of Y that is associated with substantive and practical importance.


The unsystematic component is the part of Y that is not driven by the fundamentals and cannot be explained or predicted by theories, reasoning, or stylized facts. It captures variations of Y that cannot be explained by its systematic component. It should be purely random and idiosyncratic, without any systematic or predictable pattern. It is referred to as an error term in a statistical model, and is often represented as an IID random variable.

For example, consider a linear regression model of the following form:

Y = α + βX + u.    (1)

Here, α + βX in (1) is the systematic component and the error term u is the unsystematic component.


If the value of β is close to 0 or practically negligible, then the variable X has low explanatory power (measured by R²) for Y, indicating that it cannot satisfactorily explain the fundamental variation of Y.


The error term u is assumed to be an IID random variable with zero mean and fixed variance, denoted u ~ IID(0, σ²); it is purely random, representing the unsystematic or unexpected variation in Y.
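A minimal simulation of this decomposition, with assumed values α = 1 and β = 2, generates Y from a systematic part plus an IID error and then recovers the two components by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
u = rng.normal(0, 1.0, n)       # unsystematic component: u ~ IID(0, σ²)
y = 1.0 + 2.0 * x + u           # systematic component α + βX with α = 1, β = 2

# OLS recovers the systematic part; what is left over is the residual.
beta_hat, alpha_hat = np.polyfit(x, y, 1)
residuals = y - (alpha_hat + beta_hat * x)

print(round(alpha_hat, 2), round(beta_hat, 2))  # close to (1.0, 2.0)
print(abs(residuals.mean()))                    # ≈ 0: no systematic level left
```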


If u is not purely random and has a noticeable pattern, then the systematic component may not be correctly specified, because it is missing something substantive or fundamental.


Example: Autocorrelation

Suppose that, in a time-series version of model (1), the error term has the following pattern:

Yt = α + βXt + ut,    (2)

ut = ρut-1 + et,  where et ~ IID(0, σ²).

This is a linear dependence (or autocorrelation), which is a systematic pattern. This predictable pattern should be incorporated into the model part, which will in turn better explain the systematic component of Y. One way of achieving this is to include a lagged term of Y in (2). That is,


Yt = α + βXt + γYt-1 + et.    (3)

The lagged term Yt-1 included in (3) is able to capture the autocorrelation of the error term in (2), so that the error term e in (3) is IID.
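This effect can be sketched numerically. For clarity, the systematic part below is just a constant (no X), the autocorrelation parameter is an assumed ρ = 0.8, and the lag-1 autocorrelation of the residuals is compared before and after adding the lagged Y:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2_000
e = rng.normal(0, 1.0, T)          # e ~ IID(0, σ²)
u = np.zeros(T)
for t in range(1, T):              # autocorrelated error: u_t = 0.8 u_{t-1} + e_t
    u[t] = 0.8 * u[t - 1] + e[t]
y = 1.0 + u                        # systematic part is just the constant α = 1

def lag1_autocorr(r):
    """Lag-1 sample autocorrelation of a series."""
    return np.corrcoef(r[1:], r[:-1])[0, 1]

# Residuals of the static model (Y regressed on a constant) inherit u's pattern.
r_static = y - y.mean()

# Residuals after adding lagged Y as a regressor are close to the IID shocks e.
X = np.column_stack([np.ones(T - 1), y[:-1]])
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
r_dynamic = y[1:] - X @ coef

print(round(lag1_autocorr(r_static), 2))   # high, near ρ = 0.8
print(round(lag1_autocorr(r_dynamic), 2))  # near 0: pattern absorbed by lagged Y
```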


Example: Heteroskedasticity

Suppose that the error term shows the following systematic pattern:

Var(ut) = σ²Xt².    (4)

This pattern of the error term is called heteroskedasticity: the variability of the error term changes as a function of the variable X. For example, suppose Y is food expenditure and X is disposable income for individuals. Equation (4) means that high-income earners show a higher variability in food expenditure.


This is a predictable pattern, and an error term with the property (4) violates the IID assumption, because its variance is not constant. To incorporate this pattern into the systematic component, generalized (or weighted) least-squares estimation can be conducted in the following way:

Yt/Xt = α(1/Xt) + β + ut/Xt.    (5)

Equation (5) is a regression with transformed variables, which can be written as

Yt* = β + αZt + ut*,    (6)

where

Yt* = Yt/Xt,  Zt = 1/Xt,  ut* = ut/Xt.
The above transformations of Y and X provide the transformed error term ut* in (6), which is IID and no longer heteroskedastic. That is,

Var(ut*) = Var(ut)/Xt² = σ²Xt²/Xt² = σ²,  so that ut* ~ IID(0, σ²).
This means that a systematic pattern in the error term is now effectively incorporated into the systematic component by the above transformation.
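A small simulation illustrates the transformation, assuming Var(ut) = σ²Xt² with σ = 1 and hypothetical coefficients α = 1, β = 2. Dividing the equation through by X stabilizes the error variance:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x = rng.uniform(1, 10, n)        # keep X away from 0 so dividing by it is safe
u = rng.normal(0, 1.0, n) * x    # Var(u | X) = σ²X² with σ = 1: heteroskedastic
y = 1.0 + 2.0 * x + u            # hypothetical α = 1, β = 2

# The weighted least-squares step: divide the whole equation through by X.
y_star = y / x                   # Y* = Y / X
z = 1.0 / x                      # Z = 1 / X, the regressor attached to α
u_star = u / x                   # u* = u / X, now homoskedastic

# Split the sample at the mean of X: u's variance differs sharply across the
# two halves, while u*'s variance is roughly σ² = 1 in both.
lo, hi = x < x.mean(), x >= x.mean()
print(round(u[lo].var(), 1), round(u[hi].var(), 1))
print(round(u_star[lo].var(), 2), round(u_star[hi].var(), 2))  # both ≈ 1

# Fitting the transformed regression Y* = β + αZ + u* recovers α and β.
a_hat, b_hat = np.polyfit(z, y_star, 1)   # slope ≈ α, intercept ≈ β in (6)
print(round(a_hat, 2), round(b_hat, 2))
```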



Scatter plots of the data present the effect of the transformation in an intuitive way. Before the transformation, the variable Y shows an increasing variability as a function of X, a reflection of the heteroskedasticity. The transformation effectively incorporates the heteroskedastic pattern into the systematic component of Y, and the transformed error term is an IID random variable.



Many of the model diagnostic tests in regression and machine learning are designed to check whether the error term behaves like an IID random variable, using the residuals from the estimated model; this is called residual analysis. Through residual analysis and diagnostic checks, the specification of the systematic component of the model can be improved.


IID and Predictability

Being purely random, an IID sequence shows no predictable pattern at all. That is, its past history provides no information about the future course of the sequence.


Example: Autoregressive model


Consider an autoregressive model
Yt = ρYt-1 + ut,    (7)
where ut ~ IID(0,σ²) and -1 < ρ < 1 (ρ ≠ 0).


If ρ = 0, the time series Yt is itself IID and non-predictable, since it shows no dependence on its own past and is driven only by unpredictable shocks.


For simplicity, let us assume that Y0 = 0 and ρ ≠ 0, and conduct the following recursive substitution:


Y1 = u1;

Y2 = ρY1 + u2 = ρu1 + u2;

Y3 = ρY2 + u3 = ρ²u1 + ρ u2 + u3;

Y4 = ρY3 + u4 = ρ³u1 + ρ²u2 + ρu3 + u4;

with the general expression being

Yt = ut + ρut-1 + ρ²ut-2 + … + ρᵗ⁻¹u1.    (8)

This means that a time series model (such as an autoregression) can be expressed as a moving-average of the past and current IID errors (or shocks), with exponentially declining weights.


Note that distant shocks such as u1 and u2 in (8) have little impact on Yt, because their weights are negligible. For example, when ρ = 0.5 and t = 100, ρ⁹⁹ and ρ⁹⁸ are practically 0. Only the current or recent shocks such as u100, u99, and u98 matter in practice.
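The recursive substitution above can be verified numerically: building Yt by the recursion and by the moving-average form gives the same value, and the weights on the most distant shocks are practically zero (here with ρ = 0.5 and t = 100):

```python
import numpy as np

rng = np.random.default_rng(3)
rho, t = 0.5, 100
u = rng.normal(0, 1.0, t)        # u_1, ..., u_t as IID(0, σ²) shocks

# Build Y_t by the recursion Y_s = ρ Y_{s-1} + u_s with Y_0 = 0 ...
y = 0.0
for shock in u:
    y = rho * y + shock

# ... and by the moving-average form Y_t = u_t + ρ u_{t-1} + ... + ρ^{t-1} u_1.
weights = rho ** np.arange(t - 1, -1, -1)   # ρ^{t-1}, ..., ρ, 1
y_ma = weights @ u

print(np.isclose(y, y_ma))       # the two forms agree
print(weights[:2])               # ρ^99, ρ^98: practically zero
```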


Hence, if a researcher at time t has a good estimate of ρ (from data) and has observed the current and recent shocks such as ut, ut-1, ut-2, and ut-3, he or she may be able to predict the value of Yt+1 with reasonable accuracy by projecting the moving average in (8) into the future.


Example: Random walk

When ρ = 1, the time series in (7) becomes a random walk, where the current change of Y is an IID shock. That is,

Yt = Yt-1 + ut,  or equivalently  Yt − Yt-1 = ut.
In this case, from (8) with ρ = 1, we have

Yt = u1 + u2 + … + ut.
That is, a random walk is the sum of all past and current IID shocks, each with an equal weight of 1; distant shocks are as important as recent and current ones. For example, if t = 100, the shock u1 has the same impact on Y100 as u100.


As the sum of all past and current shocks, a random walk is purely unpredictable in its future changes. It also shows a high degree of uncertainty and persistence (dependence on the past), with the analytical results that

Var(Yt) = tσ²  and  Corr(Yt, Yt-k) = √((t−k)/t).
This means that the variability of a random walk increases with time, indicative of a high degree of uncertainty and a low degree of predictability over time.


In addition, the correlation between Yt and Yt-k is close to 1 for values of k that are small relative to t. For example, when t = 100, Y100 and Y99 are correlated with the correlation coefficient √(99/100) ≈ 0.995.
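Both analytical results can be checked by simulating many random-walk paths (with σ = 1):

```python
import numpy as np

rng = np.random.default_rng(4)
n_paths, t = 20_000, 100
shocks = rng.normal(0, 1.0, (n_paths, t))   # u ~ IID(0, 1) shocks
paths = shocks.cumsum(axis=1)               # Y_t = u_1 + ... + u_t: random walks

# Var(Y_t) = t σ² grows linearly with time.
print(round(paths[:, 9].var(), 1))          # ≈ 10 (t = 10)
print(round(paths[:, 99].var(), 1))         # ≈ 100 (t = 100)

# Corr(Y_100, Y_99) = √(99/100) ≈ 0.995: neighbouring values move together.
print(round(np.corrcoef(paths[:, 99], paths[:, 98])[0, 1], 3))
```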


Examples of random walks include speculative asset prices such as stock prices, exchange rates, and oil prices, which are extremely difficult to predict with reasonable accuracy.


Conclusion

The concept of IID is fundamental in statistical analysis and machine learning. This post has reviewed IID in three different contexts: sampling, modelling, and predictability in time series analysis.

