Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

6.3 Moving Average Models (Under Construction)

Moving Average Models

Recall that we might construct a three-point moving average as

vt=13(wt1+wt+wt+1),v_t = \frac{1}{3}(w_{t-1} + w_t + w_{t+1}),

which will result in vtv_t having the same expectation value as wtw_t, but will introduce autocorrelation at lags h=1,2h=1,2.

Moving average (MA) models of order q>1q>1 generalize this idea with the model

xt=wt+θ1wt1+θ2wt2++θqwtq,θq0.x_t = w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} +\ldots+\theta_q w_{t-q}, \qquad \theta_q\neq0.

By convention, the number of lags included in a given MA model is represented as qq, and the overall model is referred to as an MA(qq) model. Unlike a true moving average, MA models do not require the θ\theta’s to sum to unity, or even to be positive.

Parameter Estimation

At this point, you may well be wondering how we could ever expect to estimate the θ\theta’s in Eq. (2) in real life. AR models can be calculated by lining up past observations and running a least squares regression (though in practice related methods such as Yule-Walker are often favored). In contrast, how are we supposed to know the wtw_t’s necessary to estimate an MA model? We observed the values xtx_t, not the noise wtw_t!

To answer this, let us take a step back and imagine we knew for a fact that a given process was described by the MA(1) model

xt=wt+θwt1.x_t = w_t + \theta w_{t-1}.

In this case, our estimate of w^t\widehat{w}_t would be given by xtθw^t1x_t-\theta \widehat{w}_{t-1}. By the same token we would estimate w^t1=xt1θw^t2\widehat{w}_{t-1}=x_{t-1}-\theta \widehat{w}_{t-2}, w^t2=xt2θw^t3\widehat{w}_{t-2}= x_{t-2}-\theta \widehat{w}_{t-3}, etc. We can flip this around the use maximum likelihood estimation (MLE) to estimate the value θ^\hat{\theta} most consistent with our observations. This procedure can be extended to higher level MA(qq) processes using the same logic. Sources such as Shumway & Stoffer (2025) chapter 3.5 and Brockwell & Davis (1991) chapter 5.1-5.2 go into some detail regarding methods such as Newton-Raphson, Gauss-Newton, and the innovations algorithm. Understanding the exact algorithms used for implementing MLE for MA (and ARMA) models is not as important for a practicing data scientist as understanding that MLE is being used. The use of MLE has two important ramifications you should be aware of:

  1. MA models (and the MA component of ARMA models) are generally less numerically stable than pure AR models, resulting in them potentially being less reliable.

  2. MA models are less interpretable than AR models. An AR model is purely determined by previous observations and its meanings and ramifications can be easily explained to clients. In contrast, an MA model relies on inferring an unseen noise term which can be challenging to explain to a less technical audience.

For these reasons, AR models are often preferred over MA models when they give comparable results with a similar number of parameters[1].

Moving Average Operator

As with AR models, we define the moving average operator as:

θ(B)=1+θ1B+θ2B2++θqBq,\theta(B) \stackrel{\triangle}=1 + \theta_1 \mathbb{B}+ \theta_2 \mathbb{B}^2 + \ldots + \theta_q \mathbb{B}^q,

allowing us to express Eq. (2) as

xt=wt+θ1wt1+θ2wt2++θqwtq=θ(B)wt.\begin{split} x_t &= w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} +\ldots+\theta_q w_{t-q}\\ &= \theta(\mathbb{B})\,w_t. \end{split}

Note the change in sign convention from the autoregressive operator.

MA(1) Mean & Autocovariance

Provided that wtw_t has a finite second moment, all MA processes will be stationary. The mean of a pure MA process will always be zero. The autocovariance is calculated by matching wtw_t terms, for an MA(1) process we have:

γ(h)={(1+θ2)σw2h=0θσw2h=10h>1.\gamma(h) = \begin{cases} (1+\theta^2)\,\sigma_w^2 & h=0\\ \theta \sigma_w^2 & h=1\\ 0 & h>1.\\ \end{cases}

The autocorrelation for an MA(1) process is:

ρ(h)={θ1+θ2h=10h>1\rho(h) = \begin{cases} \frac{\theta}{1+\theta^2} & h=1\\ 0 & h>1\\ \end{cases}

Sign of θ\theta

Similar to what we observed with AR models in the previous section, an MA(1) model with θ<0\theta<0 will introduce serial negative correlation, resulting in a jagged time series. Unlike AR models, a negative θ\theta will not introduce oscillations in the autocorrelation for any lag greater than qq. The following tool helps demonstrate the effects of different θ\theta values for MA(1) and MA(2) processes.

If the above fails to render correctly in your browser you can also open the demo as a new browser window using the Open Demo in a New Tab ↗ button at the top of the frame. Note that you may need to enable popups for this to work.

Autocovariance of Higher Order MA Processes

An MA(qq) process can be written as xt=j=0qθjwtjx_t = \sum_{j=0}^{q}\theta_j w_{t-j}, where θ0=1\theta_0=1. The mean of an MA process is E[j=0qθjwtj]=j=0qθjE[wtj]=0\mathbb{E}\Big[\sum_{j=0}^{q}\theta_j w_{t-j}\Big]=\sum_{j=0}^{q}\theta_j \mathbb{E}[w_{t-j}]=0.The autocovariance γ(h)\gamma(h) will depend on the number of wtw_t terms in common between xtx_{t} and xt+hx_{t+h}:

γ(h)=Cov(xt+h,xt)=Cov(j=0qθjwt+hj,k=0qθkwtk)={σw2j=0qhθjθj+h,0hq0h>q.\begin{split} \gamma(h) &= \text{Cov}{(x_{t+h}, x_t)}\\ &= \text{Cov}{\Big(\sum_{j=0}^{q} \theta_j w_{t+h-j}, \sum_{k=0}^{q}\theta_k w_{t-k}\Big)}\\ &=\begin{cases} \sigma_w^2 \sum_{j=0}^{q-h}\theta_j\theta_{j+h}, & 0\leq h\leq q\\ 0 & h>q. \end{cases} \end{split}

Autocorrelation of Higher Order MA Processes

From Eq. (12), we see that

γ(0)=σw2(1+θ12++θq2),\gamma(0) = \sigma_w^2 (1 + \theta_1^2 + \ldots +\theta_q^2),

thus giving us the autocorrelation

ρ(h)=σw2j=0qhθjθj+hσw2(1+θ12++θq2)={j=0qhθjθj+h1+θ12++θq2,1hq0h>q.\begin{split} \rho(h) &= \frac{\sigma_w^2\sum_{j=0}^{q-h}\theta_j\theta_{j+h}}{\sigma_w^2 (1 + \theta_1^2 + \ldots +\theta_q^2)}\\ &= \begin{cases} \frac{\sum_{j=0}^{q-h}\theta_j\theta_{j+h}}{1 + \theta_1^2 + \ldots +\theta_q^2}, & 1\leq h\leq q\\ 0 & h>q. \end{cases} \end{split}

When analyzing time series in real-life, observing a γ(h)\gamma(h) that is statistically significant for qq terms and then drops to insignificance is a strong indicator of an MA(qq) process.

Non-uniqueness of MA Models

While we theoretically understand MA processes as being the sum of noise terms, we do not directly observe the noise. We use functions such as the autocovariance and autocorrelation of observed values to derive the form of an MA process. MA models are, in general, not unique. Let us examine an MA(1) model (higher order qq’s are derived analogously); for an MA(1) model θ\theta and 1θ\frac{1}{\theta} yield the same ρ\rho:

ρ(1)=θ1+θ2=θ1+θ2×1θ21θ2=1θ(1θ)2+1\begin{split} \rho(1) &= \frac{\theta}{1 + \theta^2}\\ &= \frac{\theta}{1 + \theta^2}\times\frac{\frac{1}{\theta^2}}{\frac{1}{\theta^2}}\\ &= \frac{\frac{1}{\theta}}{(\frac{1}{\theta})^2 + 1} \end{split}

Using the autocovariance γ(h)\gamma(h) won’t help, either. To demonstrate this for an MA(1) model, let us return to Eq. (6), where we found that for an MA(1) process

γ(h)={(1+θ2)σw2h=0θσw2h=10h>1.\gamma(h) = \begin{cases} (1+\theta^2)\,\sigma_w^2 & h=0\\ \theta \sigma_w^2 & h=1\\ 0 & h>1.\\ \end{cases}

To use an example from Shumway & Stoffer (2025), consider an MA(1) model σw2=1\sigma_w^2=1 and θ=5\theta=5. In this case, γ(0)=1+52=26\gamma(0)=1+5^2=26, and γ(1)=5\gamma(1)=5. Now consider a model with σw2=25\sigma_w^2=25 and θ=15\theta=\frac{1}{5}. Here too, γ(0)=(1+152)25=(2625)25=26\gamma(0)=(1+\frac{1}{5^2})25=\big(\frac{26}{25}\big)25=26, and γ(1)=(15)25=5\gamma(1)=\big(\frac{1}{5}\big)25=5.

If we could somehow directly observe the noise wtw_t, we would be able to differentiate the models by looking at the variance of the noise. Unfortunately, we can only infer the variance of wtw_t from looking at γ(h)\gamma(h). We thus see that two distinct MA(1) models can equally well describe the same underlying process with no way to distinguish the “true” model.

Invertibility of MA Process

As seen in Eq. (15), models with θ=5\theta=5 and θ=15\theta=\frac{1}{5} yield the same ρ\rho. Moreover, θ=5,σw2=1\theta=5, \sigma_w^2 =1 and θ=15,σw2=25\theta=\frac{1}{5}, \sigma_w^2=25 yield the same γ\gamma.We choose the model with θ<1|\theta|<1 as this model is invertible. In statsmodels, this is enforced by the argument enforce_invertibility with default=True. This allows the infinite AR representation

wt=j=0(θ)jxtjw_t = \sum_{j=0}^{\infty} (-\theta)^j x_{t-j}

where the negative sign arises from defining θ(B)=1+θB=1(θ)B\theta(\mathbb{B})=1+\theta \mathbb{B}=1-(-\theta) \mathbb{B}.

Where do MA Processes Arise?

Pure MA(qq) models with a finite qq are sometimes referred to as “short-memory” models to contrast them with the longer memory of AR models. Pure MA models are somewhat less prevalent but do arise in scenarios such as items with a shelf-life which inherently have a short “memory.” As an example, a “shock” in the prices of dairy will quickly die off as we approach the end of the products’ shelf-lives. A glut in production will become irrelevant once the extra products expire, whereas a scarcity of production will rapidly reset as future purchases return to their baseline (as there is no need to refill long term stockpiles)[1].

Arguably, however, MA models truly shine in the context of analyzing AR (or ARMA) models in their MA(\infty) representations. The MA(\infty) representation—referred to as the impulse response or impulse response function in disciplines such as signal processing—allows us to immediately determine how long a noise (or “impulse”) continues to generate observations outside of the system’s normal behavior. In the case of an AR(1) model who’s MA(\infty) is simply

j=0ϕjwtj,\sum_{j=0}^{\infty}\phi^j w_{t-j},

it is straightforward in both the AR(1) and MA(\infty) representations to determine that for, say, ϕ=±0.75\phi=\pm0.75, the influence of an anomalous wtw_t will decay to roughly 10%10\% of its initial value after 8 timesteps, and roughly 1%1\% after 16. For AR(pp) processes higher pp values, extracting this information directly from the AR representation becomes far more challenging. Representing the process in its MA(\infty) form allows to quickly determine how a shock will decay by examining the ψ\psi weights. Moreover, for p2p\geq2, there is no guarantee that the ψ\psi weights will decay monotonically. An AR(pp) process with complex roots will exhibit correlations (and hence ψ\psi weights) that decay both exponentially and sinusoidally. The sinusoidal nature will result in ψ\psi weights that appear to reawaken at regular intervals and change the direction of their influence between positive and negative. Analyzing this decay of the ψ\psi weights allows us to avoid being surprised when we thought a shock had completely died off.

Footnotes
  1. This is something of an oversimplification as in the United States the government maintains long-term cold cheese storage, in part to help smooth out supply chain shocks by absorbing excess production and providing relief during scenarios such as disaster relief. Nevertheless, our model is reasonable for local markets that may not participate in government cheese programs.

References
  1. Shumway, R. H., & Stoffer, D. S. (2025). Time Series Analysis and Its Applications. In Springer Texts in Statistics. Springer Nature Switzerland. 10.1007/978-3-031-70584-7
  2. Brockwell, P. J., & Davis, R. A. (1991). Time Series: Theory and Methods. In Springer Series in Statistics. Springer New York. 10.1007/978-1-4419-0320-4