Moving Average Models¶
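Recall that we might construct a three-point moving average of white noise $w_t$ as

$$v_t = \frac{1}{3}\left(w_{t-1} + w_t + w_{t+1}\right),$$

which will result in $v_t$ having the same expectation value as $w_t$, but will introduce autocorrelation at lags $h = \pm 1, \pm 2$.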
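Moving average (MA) models of order $q$ generalize this idea with the model

$$x_t = w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \cdots + \theta_q w_{t-q}, \tag{2}$$

where $w_t$ is white noise with variance $\sigma_w^2$.
By convention, the number of lags included in a given MA model is represented as $q$, and the overall model is referred to as an MA($q$) model. Unlike a true moving average, MA models do not require the $\theta$'s to sum to unity, or even to be positive.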
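As a minimal sketch of the MA($q$) definition, the snippet below simulates a series directly from Gaussian white noise using only the Python standard library. The function name `simulate_ma`, its parameters, and the coefficient values are illustrative assumptions, not part of any library.

```python
import random

def simulate_ma(thetas, n, sigma=1.0, seed=0):
    """Simulate n observations of x_t = w_t + sum_j thetas[j-1] * w_{t-j}."""
    rng = random.Random(seed)
    q = len(thetas)
    # Draw q extra noise terms so the first observation has a full history.
    w = [rng.gauss(0.0, sigma) for _ in range(n + q)]
    coeffs = [1.0] + list(thetas)  # theta_0 = 1 by convention
    return [
        sum(c * w[t - j] for j, c in enumerate(coeffs))
        for t in range(q, n + q)
    ]

# Hypothetical MA(2) coefficients: they need not be positive or sum to one.
x = simulate_ma([0.8, -0.4], n=500)
```

Passing an empty coefficient list reduces the model to plain white noise, which is a quick sanity check on the indexing.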
Parameter Estimation¶
At this point, you may well be wondering how we could ever expect to estimate the $\theta$'s in Eq. (2) in real life. AR models can be calculated by lining up past observations and running a least squares regression (though in practice related methods such as Yule-Walker are often favored). In contrast, how are we supposed to know the $w_t$'s necessary to estimate an MA model? We observed the values $x_t$, not the noise $w_t$!
To answer this, let us take a step back and imagine we knew for a fact that a given process was described by the MA(1) model

$$x_t = w_t + \theta w_{t-1}.$$

In this case, taking $w_0 = 0$, our estimate of $w_1$ would be given by $\hat{w}_1 = x_1$. By the same token we would estimate $\hat{w}_2 = x_2 - \theta \hat{w}_1$, $\hat{w}_3 = x_3 - \theta \hat{w}_2$, etc. We can flip this around and use maximum likelihood estimation (MLE) to estimate the value of $\theta$ most consistent with our observations. This procedure can be extended to higher-order MA($q$) processes using the same logic. Sources such as Shumway & Stoffer (2025) chapter 3.5 and Brockwell & Davis (1991) chapter 5.1-5.2 go into some detail regarding methods such as Newton-Raphson, Gauss-Newton, and the innovations algorithm. Understanding the exact algorithms used for implementing MLE for MA (and ARMA) models is not as important for a practicing data scientist as understanding that MLE is being used. The use of MLE has two important ramifications you should be aware of:
- MA models (and the MA component of ARMA models) are generally less numerically stable than pure AR models, which can make their estimates less reliable.
- MA models are less interpretable than AR models. An AR model is purely determined by previous observations, and its meaning and ramifications can be easily explained to clients. In contrast, an MA model relies on inferring an unseen noise term, which can be challenging to explain to a less technical audience.
For these reasons, AR models are often preferred over MA models when they give comparable results with a similar number of parameters[1].
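The idea of recovering the noise recursively and then picking the most consistent $\theta$ can be sketched as follows. This is a deliberately simplified stand-in for full MLE: it assumes $w_0 = 0$, assumes Gaussian noise (so that minimizing the conditional sum of squared residuals approximates the conditional likelihood), and uses a coarse grid search rather than Newton-Raphson or Gauss-Newton. The helper names `ma1_residuals` and `fit_ma1_css` are hypothetical.

```python
import random

def ma1_residuals(x, theta):
    """Recover noise estimates w_t = x_t - theta * w_{t-1}, assuming w_0 = 0."""
    w_prev, residuals = 0.0, []
    for x_t in x:
        w_t = x_t - theta * w_prev
        residuals.append(w_t)
        w_prev = w_t
    return residuals

def fit_ma1_css(x, grid_size=199):
    """Grid-search theta in (-1, 1) minimizing the conditional sum of squares."""
    candidates = [-1 + 2 * (k + 1) / (grid_size + 1) for k in range(grid_size)]
    return min(
        candidates,
        key=lambda th: sum(w * w for w in ma1_residuals(x, th)),
    )

# Simulate an MA(1) series with a known (illustrative) theta, then re-estimate it.
rng = random.Random(42)
true_theta = 0.6
noise = [rng.gauss(0.0, 1.0) for _ in range(2001)]
x = [noise[t] + true_theta * noise[t - 1] for t in range(1, 2001)]
theta_hat = fit_ma1_css(x)
```

Restricting the grid to $(-1, 1)$ keeps the recursion from blowing up; in practice a library routine would handle the optimization and initial conditions more carefully.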
Moving Average Operator¶
As with AR models, we define the moving average operator as:

$$\theta(B) = 1 + \theta_1 B + \theta_2 B^2 + \cdots + \theta_q B^q,$$

allowing us to express Eq. (2) as

$$x_t = \theta(B) w_t.$$
Note the change in sign convention from the autoregressive operator.
MA(1) Mean & Autocovariance¶
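Provided that $w_t$ has a finite second moment, all MA processes will be stationary. The mean of a pure MA process will always be zero. The autocovariance is calculated by matching terms; for an MA(1) process we have:

$$\gamma(h) = \begin{cases} (1 + \theta^2)\,\sigma_w^2 & h = 0 \\ \theta\,\sigma_w^2 & |h| = 1 \\ 0 & |h| > 1 \end{cases} \tag{6}$$

The autocorrelation for an MA(1) process is:

$$\rho(h) = \begin{cases} \dfrac{\theta}{1 + \theta^2} & |h| = 1 \\ 0 & |h| > 1 \end{cases}$$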
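As a rough numerical check of the MA(1) autocorrelation, the sketch below compares the sample autocorrelation of a simulated series against $\theta / (1 + \theta^2)$. The value $\theta = -0.7$, the series length, and the helper `sample_acf` are assumptions chosen for illustration.

```python
import random

def sample_acf(x, max_lag):
    """Sample autocorrelation at lags 0..max_lag (biased estimator, divides by n)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev) / n
    return [
        sum(dev[t] * dev[t + h] for t in range(n - h)) / n / c0
        for h in range(max_lag + 1)
    ]

theta = -0.7  # illustrative value; a negative theta gives negative lag-1 correlation
rng = random.Random(1)
w = [rng.gauss(0.0, 1.0) for _ in range(5001)]
x = [w[t] + theta * w[t - 1] for t in range(1, 5001)]

acf = sample_acf(x, 3)
rho1_theory = theta / (1 + theta ** 2)
```

With a series this long, `acf[1]` should land close to `rho1_theory`, while `acf[2]` and `acf[3]` should hover near zero.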
Sign of $\theta$¶
Similar to what we observed with AR models in the previous section, an MA(1) model with $\theta < 0$ will introduce negative serial correlation, resulting in a jagged time series. Unlike AR models, a negative $\theta$ will not introduce oscillations in the autocorrelation, which is zero for any lag greater than $1$. The following tool helps demonstrate the effects of different $\theta$ values for MA(1) and MA(2) processes.
[Interactive demo: effects of different $\theta$ values on MA(1) and MA(2) processes.]
Autocovariance of Higher Order MA Processes¶
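An MA($q$) process can be written as $x_t = \sum_{j=0}^{q} \theta_j w_{t-j}$, where $\theta_0 = 1$. The mean of an MA process is $E[x_t] = \sum_{j=0}^{q} \theta_j E[w_{t-j}] = 0$. The autocovariance will depend on the number of terms in common between $x_t$ and $x_{t+h}$:

$$\gamma(h) = \begin{cases} \sigma_w^2 \displaystyle\sum_{j=0}^{q-|h|} \theta_j \theta_{j+|h|} & |h| \le q \\ 0 & |h| > q \end{cases} \tag{12}$$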
Autocorrelation of Higher Order MA Processes¶
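From Eq. (12), we see that

$$\gamma(0) = \sigma_w^2 \sum_{j=0}^{q} \theta_j^2 = \sigma_w^2 \left(1 + \theta_1^2 + \cdots + \theta_q^2\right),$$

thus giving us the autocorrelation

$$\rho(h) = \begin{cases} \dfrac{\sum_{j=0}^{q-|h|} \theta_j \theta_{j+|h|}}{1 + \theta_1^2 + \cdots + \theta_q^2} & 1 \le |h| \le q \\ 0 & |h| > q \end{cases}$$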
When analyzing time series in real life, observing an estimated $\rho(h)$ that is statistically significant for the first $q$ lags and then drops to insignificance is a strong indicator of an MA($q$) process.
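The autocovariance and autocorrelation formulas for an MA($q$) process translate almost directly into code. The sketch below is one possible transcription, assuming $\theta_0 = 1$ and a noise variance passed as `sigma2`; the function names and the MA(2) coefficients are illustrative.

```python
def ma_autocovariance(thetas, h, sigma2=1.0):
    """gamma(h) = sigma2 * sum_{j=0}^{q-|h|} theta_j * theta_{j+|h|}, zero for |h| > q."""
    coeffs = [1.0] + list(thetas)  # theta_0 = 1
    q = len(thetas)
    h = abs(h)
    if h > q:
        return 0.0
    return sigma2 * sum(coeffs[j] * coeffs[j + h] for j in range(q - h + 1))

def ma_autocorrelation(thetas, h):
    """rho(h) = gamma(h) / gamma(0); the noise variance cancels."""
    return ma_autocovariance(thetas, h) / ma_autocovariance(thetas, 0)

# Illustrative MA(2): nonzero correlation at lags 1 and 2, exactly zero afterwards.
rho = [ma_autocorrelation([0.5, 0.25], h) for h in range(5)]
```

The sharp cutoff after lag $q$ in `rho` is the theoretical counterpart of the drop to insignificance one looks for in a sample autocorrelation plot.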
Non-uniqueness of MA Models¶
While we theoretically understand MA processes as being the sum of noise terms, we do not directly observe the noise. We use functions such as the autocovariance and autocorrelation of observed values to derive the form of an MA process. MA models are, in general, not unique. Let us examine an MA(1) model (higher order $q$'s are derived analogously); for an MA(1) model, $\theta$ and $1/\theta$ yield the same $\rho(h)$:

$$\rho(1) = \frac{\theta}{1 + \theta^2} = \frac{1/\theta}{1 + 1/\theta^2} \tag{15}$$
Using the autocovariance won’t help, either. To demonstrate this for an MA(1) model, let us return to Eq. (6), where we found that for an MA(1) process

$$\gamma(0) = (1 + \theta^2)\,\sigma_w^2, \qquad \gamma(1) = \theta\,\sigma_w^2.$$
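To use an example from Shumway & Stoffer (2025), consider an MA(1) model with $\theta = 1/5$ and $\sigma_w^2 = 25$. In this case, $\gamma(0) = 26$, and $\gamma(1) = 5$. Now consider a model with $\theta = 5$ and $\sigma_w^2 = 1$. Here too, $\gamma(0) = 26$, and $\gamma(1) = 5$.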
If we could somehow directly observe the noise $w_t$, we would be able to differentiate the models by looking at the variance of the noise. Unfortunately, we can only infer the variance of $w_t$ from looking at $x_t$. We thus see that two distinct MA(1) models can equally well describe the same underlying process with no way to distinguish the “true” model.
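The non-uniqueness is easy to confirm numerically. The sketch below evaluates the MA(1) autocovariances $\gamma(0) = (1 + \theta^2)\sigma_w^2$ and $\gamma(1) = \theta\sigma_w^2$ for the two parameter pairs $(\theta, \sigma_w^2) = (1/5, 25)$ and $(5, 1)$, using exact rational arithmetic to avoid floating-point noise. The function name is a hypothetical helper.

```python
from fractions import Fraction

def ma1_autocovariances(theta, sigma2):
    """Return (gamma(0), gamma(1)) for the MA(1) model x_t = w_t + theta * w_{t-1}."""
    return (1 + theta ** 2) * sigma2, theta * sigma2

model_a = ma1_autocovariances(theta=Fraction(1, 5), sigma2=25)
model_b = ma1_autocovariances(theta=5, sigma2=1)

# Two different parameterizations, one autocovariance function.
same_second_order_structure = model_a == model_b
```

Since both models share every autocovariance, no statistic built from second moments of $x_t$ alone can tell them apart.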
Invertibility of MA Process¶
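As seen in Eq. (15), models with $\theta$ and $1/\theta$ yield the same $\rho(h)$. Moreover, $(\theta, \sigma_w^2)$ and $(1/\theta, \theta^2\sigma_w^2)$ yield the same $\gamma(h)$. We choose the model with $|\theta| < 1$ as this model is invertible. In statsmodels, this is enforced by the `enforce_invertibility` argument, which defaults to `True`. This allows the infinite AR representation

$$w_t = \sum_{j=0}^{\infty} (-\theta)^j x_{t-j},$$

where the negative sign arises from defining $\theta(B) = 1 + \theta B$.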
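To see why $|\theta| < 1$ matters, the sketch below truncates the infinite AR representation $w_t = \sum_{j \ge 0} (-\theta)^j x_{t-j}$ after a finite number of terms and compares the result with the noise that was actually used to simulate the series. The value $\theta = 0.5$, the truncation length, and the helper name are assumptions for illustration; the simulation treats the noise before the first observation as zero.

```python
import random

def ar_inf_noise_estimate(x, t, theta, n_terms):
    """Approximate w_t by truncating w_t = sum_j (-theta)^j * x_{t-j}."""
    return sum((-theta) ** j * x[t - j] for j in range(n_terms))

rng = random.Random(7)
theta = 0.5  # invertible: |theta| < 1, so the weights (-theta)^j shrink
w = [rng.gauss(0.0, 1.0) for _ in range(200)]
# x[0] uses w_{-1} = 0; afterwards x_t = w_t + theta * w_{t-1}.
x = [w[0]] + [w[t] + theta * w[t - 1] for t in range(1, 200)]

estimate = ar_inf_noise_estimate(x, t=199, theta=theta, n_terms=30)
error = abs(estimate - w[199])
```

The truncation error is proportional to $\theta^{30}$, which is negligible here. With $|\theta| > 1$ the same weights would grow without bound, which is why the non-invertible parameterization cannot be used to recover the noise from past observations.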
Where do MA Processes Arise?¶
Pure MA($q$) models with a finite $q$ are sometimes referred to as “short-memory” models to contrast them with the longer memory of AR models. Pure MA models are somewhat less prevalent but do arise in scenarios such as items with a shelf-life which inherently have a short “memory.” As an example, a “shock” in the prices of dairy will quickly die off as we approach the end of the products’ shelf-lives. A glut in production will become irrelevant once the extra products expire, whereas a scarcity of production will rapidly reset as future purchases return to their baseline (as there is no need to refill long term stockpiles)[1].
Arguably, however, MA models truly shine in the context of analyzing AR (or ARMA) models in their MA($\infty$) representations. The MA($\infty$) representation, referred to as the impulse response or impulse response function in disciplines such as signal processing, allows us to immediately determine how long a noise term (or “impulse”) continues to generate observations outside of the system’s normal behavior. Consider the case of an AR(1) model, whose MA($\infty$) representation is simply

$$x_t = \sum_{j=0}^{\infty} \phi^j w_{t-j};$$
it is straightforward in both the AR(1) and MA($\infty$) representations to determine that for, say, $\phi = 0.75$, the influence of an anomalous $w_t$ will decay to roughly $10\%$ of its initial value after 8 timesteps, and roughly $1\%$ after 16. For AR($p$) processes with higher $p$ values, extracting this information directly from the AR representation becomes far more challenging. Representing the process in its MA($\infty$) form allows us to quickly determine how a shock will decay by examining the $\psi$ weights (the coefficients $\psi_j$ in $x_t = \sum_{j=0}^{\infty} \psi_j w_{t-j}$). Moreover, for $p \ge 2$, there is no guarantee that the $\psi$ weights will decay monotonically. An AR($p$) process with complex roots will exhibit correlations (and hence $\psi$ weights) that decay both exponentially and sinusoidally. The sinusoidal nature will result in $\psi$ weights that appear to reawaken at regular intervals and change the direction of their influence between positive and negative. Analyzing this decay of the $\psi$ weights allows us to avoid being surprised by a shock we thought had completely died off.
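The $\psi$ weights of a causal AR($p$) model can be generated with the standard recursion $\psi_0 = 1$, $\psi_j = \sum_{k=1}^{\min(j,p)} \phi_k \psi_{j-k}$. The sketch below applies it to an AR(1) model and to an AR(2) model with complex roots; the coefficient values ($\phi = 0.75$, and $\phi_1 = 1.5$, $\phi_2 = -0.75$) and the function name are assumptions chosen purely for illustration.

```python
def ar_psi_weights(phis, n_weights):
    """Return psi_0..psi_{n_weights-1} for x_t = sum_k phis[k-1] * x_{t-k} + w_t."""
    psi = [1.0]
    p = len(phis)
    for j in range(1, n_weights):
        # Each new weight is built from up to p previous weights.
        psi.append(sum(phis[k - 1] * psi[j - k] for k in range(1, min(j, p) + 1)))
    return psi

# AR(1): weights are simply phi**j and shrink monotonically in magnitude.
psi_ar1 = ar_psi_weights([0.75], 17)

# AR(2) with complex roots: weights shrink overall but oscillate in sign.
psi_ar2 = ar_psi_weights([1.5, -0.75], 17)
sign_changes = sum(
    1 for a, b in zip(psi_ar2, psi_ar2[1:]) if a * b < 0
)
```

Inspecting `psi_ar2` shows the weights passing through zero and turning negative before the shock fades, which is the "reawakening" behavior that is hard to read off the AR coefficients directly.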
This is something of an oversimplification, as in the United States the government maintains long-term cold cheese storage, in part to help smooth out supply chain shocks by absorbing excess production and releasing it in scenarios such as disaster relief. Nevertheless, our model is reasonable for local markets that may not participate in government cheese programs.
- Shumway, R. H., & Stoffer, D. S. (2025). Time Series Analysis and Its Applications. In Springer Texts in Statistics. Springer Nature Switzerland. 10.1007/978-3-031-70584-7
- Brockwell, P. J., & Davis, R. A. (1991). Time Series: Theory and Methods. In Springer Series in Statistics. Springer New York. 10.1007/978-1-4419-0320-4