6.3 Moving Average Models (Under Construction)

Moving Average Models¶

Recall that we might construct a three-point moving average as

v_t = \frac{1}{3}(w_{t-1} + w_t + w_{t+1}),

(1)

which will result in $v_t$ having the same expectation value as $w_t$ , but will introduce autocorrelation at lags $h=1,2$ .

Moving average (MA) models of order $q>1$ generalize this idea with the model

x_t = w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} +\ldots+\theta_q w_{t-q}, \qquad \theta_q\neq0.

(2)

By convention, the number of lags included in a given MA model is represented as $q$ , and the overall model is referred to as an MA( $q$ ) model. Unlike a true moving average, MA models do not require the $\theta$ ’s to sum to unity, or even to be positive.

Parameter Estimation¶

At this point, you may well be wondering how we could ever expect to estimate the $\theta$ ’s in Eq. (2) in real life. AR models can be calculated by lining up past observations and running a least squares regression (though in practice related methods such as Yule-Walker are often favored). In contrast, how are we supposed to know the $w_t$ ’s necessary to estimate an MA model? We observed the values $x_t$ , not the noise $w_t$ !

To answer this, let us take a step back and imagine we knew for a fact that a given process was described by the MA(1) model

x_t = w_t + \theta w_{t-1}.

(3)

In this case, our estimate of $\widehat{w}_t$ would be given by $x_t-\theta \widehat{w}_{t-1}$ . By the same token we would estimate $\widehat{w}_{t-1}=x_{t-1}-\theta \widehat{w}_{t-2}$ , $\widehat{w}_{t-2}= x_{t-2}-\theta \widehat{w}_{t-3}$ , etc. We can flip this around the use maximum likelihood estimation (MLE) to estimate the value $\hat{\theta}$ most consistent with our observations. This procedure can be extended to higher level MA( $q$ ) processes using the same logic. Sources such as Shumway & Stoffer (2025) chapter 3.5 and Brockwell & Davis (1991) chapter 5.1-5.2 go into some detail regarding methods such as Newton-Raphson, Gauss-Newton, and the innovations algorithm. Understanding the exact algorithms used for implementing MLE for MA (and ARMA) models is not as important for a practicing data scientist as understanding that MLE is being used. The use of MLE has two important ramifications you should be aware of:

MA models (and the MA component of ARMA models) are generally less numerically stable than pure AR models, resulting in them potentially being less reliable.
MA models are less interpretable than AR models. An AR model is purely determined by previous observations and its meanings and ramifications can be easily explained to clients. In contrast, an MA model relies on inferring an unseen noise term which can be challenging to explain to a less technical audience.

For these reasons, AR models are often preferred over MA models when they give comparable results with a similar number of parameters^[1].

Moving Average Operator¶

As with AR models, we define the moving average operator as:

\theta(B) \stackrel{\triangle}=1 + \theta_1 \mathbb{B}+ \theta_2 \mathbb{B}^2 + \ldots + \theta_q \mathbb{B}^q,

(4)

allowing us to express Eq. (2) as

\begin{split} x_t &= w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} +\ldots+\theta_q w_{t-q}\\ &= \theta(\mathbb{B})\,w_t. \end{split}

(5)

Note the change in sign convention from the autoregressive operator.

MA(1) Mean & Autocovariance¶

Provided that $w_t$ has a finite second moment, all MA processes will be stationary. The mean of a pure MA process will always be zero. The autocovariance is calculated by matching $w_t$ terms, for an MA(1) process we have:

\gamma(h) = \begin{cases} (1+\theta^2)\,\sigma_w^2 & h=0\\ \theta \sigma_w^2 & h=1\\ 0 & h>1.\\ \end{cases}

(6)

The autocorrelation for an MA(1) process is:

\rho(h) = \begin{cases} \frac{\theta}{1+\theta^2} & h=1\\ 0 & h>1\\ \end{cases}

(7)

Problem

Derive Eqs. (6) and (7).

Sign of $\theta$ ¶

Similar to what we observed with AR models in the previous section, an MA(1) model with $\theta<0$ will introduce serial negative correlation, resulting in a jagged time series. Unlike AR models, a negative $\theta$ will not introduce oscillations in the autocorrelation for any lag greater than $q$ . The following tool helps demonstrate the effects of different $\theta$ values for MA(1) and MA(2) processes.

If the above fails to render correctly in your browser you can also open the demo as a new browser window using the Open Demo in a New Tab ↗ button at the top of the frame. Note that you may need to enable popups for this to work.

Autocovariance of Higher Order MA Processes¶

An MA( $q$ ) process can be written as $x_t = \sum_{j=0}^{q}\theta_j w_{t-j}$ , where $\theta_0=1$ . The mean of an MA process is $\mathbb{E}\Big[\sum_{j=0}^{q}\theta_j w_{t-j}\Big]=\sum_{j=0}^{q}\theta_j \mathbb{E}[w_{t-j}]=0$ .The autocovariance $\gamma(h)$ will depend on the number of $w_t$ terms in common between $x_{t}$ and $x_{t+h}$ :

\begin{split} \gamma(h) &= \text{Cov}{(x_{t+h}, x_t)}\\ &= \text{Cov}{\Big(\sum_{j=0}^{q} \theta_j w_{t+h-j}, \sum_{k=0}^{q}\theta_k w_{t-k}\Big)}\\ &=\begin{cases} \sigma_w^2 \sum_{j=0}^{q-h}\theta_j\theta_{j+h}, & 0\leq h\leq q\\ 0 & h>q. \end{cases} \end{split}

(12)

Autocorrelation of Higher Order MA Processes¶

From Eq. (12), we see that

\gamma(0) = \sigma_w^2 (1 + \theta_1^2 + \ldots +\theta_q^2),

(13)

thus giving us the autocorrelation

\begin{split} \rho(h) &= \frac{\sigma_w^2\sum_{j=0}^{q-h}\theta_j\theta_{j+h}}{\sigma_w^2 (1 + \theta_1^2 + \ldots +\theta_q^2)}\\ &= \begin{cases} \frac{\sum_{j=0}^{q-h}\theta_j\theta_{j+h}}{1 + \theta_1^2 + \ldots +\theta_q^2}, & 1\leq h\leq q\\ 0 & h>q. \end{cases} \end{split}

(14)

When analyzing time series in real-life, observing a $\gamma(h)$ that is statistically significant for $q$ terms and then drops to insignificance is a strong indicator of an MA( $q$ ) process.

Non-uniqueness of MA Models¶

While we theoretically understand MA processes as being the sum of noise terms, we do not directly observe the noise. We use functions such as the autocovariance and autocorrelation of observed values to derive the form of an MA process. MA models are, in general, not unique. Let us examine an MA(1) model (higher order $q$ ’s are derived analogously); for an MA(1) model $\theta$ and $\frac{1}{\theta}$ yield the same $\rho$ :

\begin{split} \rho(1) &= \frac{\theta}{1 + \theta^2}\\ &= \frac{\theta}{1 + \theta^2}\times\frac{\frac{1}{\theta^2}}{\frac{1}{\theta^2}}\\ &= \frac{\frac{1}{\theta}}{(\frac{1}{\theta})^2 + 1} \end{split}

(15)

Using the autocovariance $\gamma(h)$ won’t help, either. To demonstrate this for an MA(1) model, let us return to Eq. (6), where we found that for an MA(1) process

\gamma(h) = \begin{cases} (1+\theta^2)\,\sigma_w^2 & h=0\\ \theta \sigma_w^2 & h=1\\ 0 & h>1.\\ \end{cases}

(16)

To use an example from Shumway & Stoffer (2025), consider an MA(1) model $\sigma_w^2=1$ and $\theta=5$ . In this case, $\gamma(0)=1+5^2=26$ , and $\gamma(1)=5$ . Now consider a model with $\sigma_w^2=25$ and $\theta=\frac{1}{5}$ . Here too, $\gamma(0)=(1+\frac{1}{5^2})25=\big(\frac{26}{25}\big)25=26$ , and $\gamma(1)=\big(\frac{1}{5}\big)25=5$ .

If we could somehow directly observe the noise $w_t$ , we would be able to differentiate the models by looking at the variance of the noise. Unfortunately, we can only infer the variance of $w_t$ from looking at $\gamma(h)$ . We thus see that two distinct MA(1) models can equally well describe the same underlying process with no way to distinguish the “true” model.

Invertibility of MA Process¶

As seen in Eq. (15), models with $\theta=5$ and $\theta=\frac{1}{5}$ yield the same $\rho$ . Moreover, $\theta=5, \sigma_w^2 =1$ and $\theta=\frac{1}{5}, \sigma_w^2=25$ yield the same $\gamma$ .We choose the model with $|\theta|<1$ as this model is invertible. In statsmodels, this is enforced by the argument enforce_invertibility with default=True. This allows the infinite AR representation

w_t = \sum_{j=0}^{\infty} (-\theta)^j x_{t-j}

(17)

where the negative sign arises from defining $\theta(\mathbb{B})=1+\theta \mathbb{B}=1-(-\theta) \mathbb{B}$ .

Where do MA Processes Arise?¶

Pure MA( $q$ ) models with a finite $q$ are sometimes referred to as “short-memory” models to contrast them with the longer memory of AR models. Pure MA models are somewhat less prevalent but do arise in scenarios such as items with a shelf-life which inherently have a short “memory.” As an example, a “shock” in the prices of dairy will quickly die off as we approach the end of the products’ shelf-lives. A glut in production will become irrelevant once the extra products expire, whereas a scarcity of production will rapidly reset as future purchases return to their baseline (as there is no need to refill long term stockpiles)^[1].

Arguably, however, MA models truly shine in the context of analyzing AR (or ARMA) models in their MA( $\infty$ ) representations. The MA( $\infty$ ) representation—referred to as the impulse response or impulse response function in disciplines such as signal processing—allows us to immediately determine how long a noise (or “impulse”) continues to generate observations outside of the system’s normal behavior. In the case of an AR(1) model who’s MA( $\infty$ ) is simply

\sum_{j=0}^{\infty}\phi^j w_{t-j},

(20)

it is straightforward in both the AR(1) and MA( $\infty$ ) representations to determine that for, say, $\phi=\pm0.75$ , the influence of an anomalous $w_t$ will decay to roughly $10\%$ of its initial value after 8 timesteps, and roughly $1\%$ after 16. For AR( $p$ ) processes higher $p$ values, extracting this information directly from the AR representation becomes far more challenging. Representing the process in its MA( $\infty$ ) form allows to quickly determine how a shock will decay by examining the $\psi$ weights. Moreover, for $p\geq2$ , there is no guarantee that the $\psi$ weights will decay monotonically. An AR( $p$ ) process with complex roots will exhibit correlations (and hence $\psi$ weights) that decay both exponentially and sinusoidally. The sinusoidal nature will result in $\psi$ weights that appear to reawaken at regular intervals and change the direction of their influence between positive and negative. Analyzing this decay of the $\psi$ weights allows us to avoid being surprised when we thought a shock had completely died off.

Footnotes¶

This is something of an oversimplification as in the United States the government maintains long-term cold cheese storage, in part to help smooth out supply chain shocks by absorbing excess production and providing relief during scenarios such as disaster relief. Nevertheless, our model is reasonable for local markets that may not participate in government cheese programs.
↩↩

References¶

Shumway, R. H., & Stoffer, D. S. (2025). Time Series Analysis and Its Applications. In Springer Texts in Statistics. Springer Nature Switzerland. 10.1007/978-3-031-70584-7
Brockwell, P. J., & Davis, R. A. (1991). Time Series: Theory and Methods. In Springer Series in Statistics. Springer New York. 10.1007/978-1-4419-0320-4