Chapter 4 - ARMA Models


The previous chapter introduced what a model is and how it can be used in time series analysis. It then described the simple linear regression model, which is particularly helpful when dealing with trends. It is efficient, simple and easily interpretable, making it an ideal first choice. However, the linear regression model comes with certain disadvantages: it assumes linearity and does not take autocorrelation into account. Autoregressive Moving Average (ARMA) models are more powerful and, in the context of time series analysis, they can be used to understand and predict data that varies over time (i.e. data that does not exhibit a trend per se).
 
At their core, ARMA models help to capture and model the dependencies in a time series by considering both past values of the series (i.e. autocorrelation) and past random errors (often called shocks) that might have influenced its behavior. This dual approach makes ARMA models especially effective for capturing patterns in time-dependent data where simple trend lines from linear regression models or averages fall short. This can be easily linked to real-world applications, where many types of time series such as stock prices, weather patterns, economic indicators and social media activity exhibit behavior that is not purely random, but is also not fully predictable. The patterns found in these applications often have both a persistent trend (e.g. temperatures over seasons or stock market trends over years) and random fluctuations (such as day-to-day weather changes or daily trading noise).
 
ARMA models make it possible to break a time series down into two components: a predictable component and a seemingly random component. This opens the possibility to create models that balance historical dependencies (autocorrelation) and unpredictable shifts (noise), which makes these models both flexible and simple. The sole focus on autocorrelation and noise allows them to adapt to a wide variety of time series patterns without excessive complexity.
 

4.1 White Noise & Random Walks

In chapter one, a stochastic process was described as a sequence of stochastic variables $Y_1, Y_2, \dots$. In turn, a time series was then described as observing the sequence from some $Y_1$ up to $Y_T$. Similarly, a white noise process can be described. In what follows, such a process will be denoted by $\varepsilon_t$.
Def 4.1 - White Noise Process
A white noise process $\varepsilon_t$ is a sequence of i.i.d. observations with zero mean and variance $\sigma^2$.
In this way, a white noise process can be seen as a simple type of time series that consists of only random values that adhere to specific properties (i.e. zero mean and fixed variance) and where each value is independent of the others. One important remark is that, due to these properties, a value from a white noise series cannot be predicted from previous values at any point in time, which makes it a purely random process.
 
The white noise process, or just white noise in general, is essential for time series analysis since it acts as a building block for more complicated models. Many time series models, including ARMA models, rely on white noise to capture the random fluctuations that cannot be explained by underlying patterns or trends. For example, a random walk (without drift) can be defined using the first difference operator $\Delta$:
$$\Delta Y_t = Y_t - Y_{t-1} = \varepsilon_t$$
Understanding white noise creates a baseline for randomness. More sophisticated models can be introduced that exhibit structured dependencies and patterns, overlaying on top of the random noise to capture meaningful trends, cycles or seasonal effects in the data. In general, it can be said that white noise serves as a fundamental concept, creating contrast between randomness and predictability in time series analysis.
Example 4.1 - White Noise & Autocorrelation
The following image illustrates a white noise process and its autocorrelation function. The ACF of a white noise process is relatively straightforward.
  • At a lag of $k = 0$, the autocorrelation is 1, since each data point perfectly correlates with itself.
  • For all other lags, the ACF is approximately zero, since each value within the white noise process is independent of the others. This means there is no correlation between values at different time points.
[Figure: a white noise process and its autocorrelation function]
Example 4.2 - Random Walk & Autocorrelation
The following image illustrates a random walk defined by the following model:
$$Y_t = Y_{t-1} + \varepsilon_t$$
where $Y_{t-1}$ is the previous value in the series and $\varepsilon_t$ is the white noise value. Unlike plain white noise, which has no memory of past values, each step in a random walk depends directly on the prior value. This makes the walk non-stationary (i.e. its mean and variance change over time). The ACF for a random walk is distinctive because it shows both strong and persistent correlation at all positive lags.
  • At low lags, there is a high autocorrelation. Since each value in a random walk is closely related to the one just before it, the ACF at low lags is high and often close to one.
  • There is a gradual decay in the ACF. This pattern reflects the fact that a random walk has a “memory” of all past values, as each step represents the cumulative sum of all previous noise terms.
[Figure: a random walk and its autocorrelation function]
Furthermore, it is easy to create a forecast for a random walk. The following expression holds for a forecast horizon $h$:
$$Y_{T+h} = Y_T + \sum_{i=1}^{h} \varepsilon_{T+i}$$
Since the expected value of white noise is zero, $E[\varepsilon_t] = 0$, the expected value of each $\varepsilon_{T+i}$ is zero. The expression then becomes
$$\hat{Y}_{T+h} = E[Y_{T+h} \mid Y_1, \dots, Y_T] = Y_T$$
This means that for any forecast horizon $h$, all predicted values will be equal to the last value in the random walk. This is called a naïve prediction and is often used as a benchmark for more complex models. Interestingly, this naïve prediction is often hard to beat for small forecast horizons.
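The contrast between white noise and a random walk can be sketched numerically. The snippet below is a minimal illustration, assuming standard normal noise, a walk that starts at zero, and a hypothetical helper `sample_acf` (none of which are prescribed by the text). It computes the sample ACF of both series and the naïve forecast of the walk.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations for lags 0..max_lag of a 1-D series."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    denom = np.dot(d, d)
    return np.array([np.dot(d[k:], d[:len(d) - k]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
eps = rng.normal(0.0, 1.0, size=2000)  # white noise (assumed Gaussian here)
walk = np.cumsum(eps)                  # random walk: Y_t = Y_{t-1} + eps_t

acf_noise = sample_acf(eps, 5)   # close to zero for every lag k >= 1
acf_walk = sample_acf(walk, 5)   # stays close to one and decays slowly

h = 3                                     # forecast horizon
naive_forecast = np.full(h, walk[-1])     # every forecast equals the last value

print(np.round(acf_noise, 3))
print(np.round(acf_walk, 3))
print(naive_forecast)
```

The same helper can be reused on any series to check how quickly its correlogram decays.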

4.2 Moving Average (MA) Models

Stationarity Check
Before moving average (MA) models can be described, it is important to recall the concepts of both stationarity and a stochastic process. A stochastic process was described in the last section. Stationarity requires the following conditions to hold for a stochastic process $Y_t$:
  • The mean $E[Y_t]$ must be the same for all $t$
  • The variance $\mathrm{Var}(Y_t)$ must be the same for all $t$
  • The covariance $\mathrm{Cov}(Y_t, Y_{t-k})$ must be the same for all $t$ and may depend only on the lag $k$
To check for stationarity, several methods can be employed. One includes simple graphical analysis of the time series itself and the (partial) autocorrelation functions (statistical tests and transformations will be considered in a future chapter). Graphical analysis is quite intuitive. If the plot shows clear trends or seasonal fluctuations, the series is likely non-stationary. If this is not the case and the series revolves around a constant mean and seems to exhibit constant variance, the series might be stationary.
 
For a stationary time series, the autocorrelations should decay quickly and become insignificant (close to zero) as the lag increases. Autocorrelations that remain large over many lags, as in the random walk of Example 4.2, suggest that the mean or the variance of the series drifts over time, which is non-stationary behavior.
 
Moving Average of Order 1
Def 4.2 - Moving Average of Order 1
A stationary stochastic process $Y_t$ is a moving average of order 1, denoted as $MA(1)$, if it satisfies:
$$Y_t = \mu + \varepsilon_t + \theta \varepsilon_{t-1}$$
where $\mu$ and $\theta$ are unknown parameters.
This is a time series model where the current value in the series is a weighted combination of the current and the previous value of a random shock or noise. In an $MA(1)$ process, each value in the series depends on the noise at the current time step and at the previous time step. The noise is assumed to be white noise (i.e. i.i.d. with mean 0 and constant variance $\sigma^2$). The (unknown) parameter $\theta$ controls how much influence the previous shock has on the current value. Note that, for $\mu = 0$ and $\theta = 0$, the result is plain white noise as described in the previous section.
 
Autocorrelations of $MA(1)$
The autocorrelations of an $MA(1)$ process are given by
$$\rho_1 = \frac{\theta}{1 + \theta^2}, \qquad \rho_k = 0 \text{ for } k \geq 2$$
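Proof of Autocorrelations for $MA(1)$
  1. Given is the model equation of an $MA(1)$ process
     $$Y_t = \mu + \varepsilon_t + \theta \varepsilon_{t-1}$$
  2. The variance is described as $\gamma_0 = \mathrm{Var}(Y_t)$ where $Y_t = \mu + \varepsilon_t + \theta \varepsilon_{t-1}$. The constant $\mu$ does not contribute to the variance, so it can be ignored.
  3. The rules for the variance of a sum of random variables can be applied. Since $\varepsilon_t$ and $\varepsilon_{t-1}$ are independent, their covariance is 0 and thus
     $$\mathrm{Var}(Y_t) = \mathrm{Var}(\varepsilon_t) + \theta^2 \mathrm{Var}(\varepsilon_{t-1})$$
  4. Next, the following variance property can be used:
      • $\mathrm{Var}(\varepsilon_t) = \mathrm{Var}(\varepsilon_{t-1}) = \sigma^2$, since this is a property of white noise
      which leads to
      $$\gamma_0 = (1 + \theta^2) \sigma^2$$
  5. The covariance of order one is given by
     $$\gamma_1 = \mathrm{Cov}(Y_t, Y_{t-1}) = \mathrm{Cov}(\varepsilon_t + \theta \varepsilon_{t-1}, \varepsilon_{t-1} + \theta \varepsilon_{t-2})$$
     which can be expanded to
     $$\gamma_1 = \mathrm{Cov}(\varepsilon_t, \varepsilon_{t-1}) + \theta \mathrm{Cov}(\varepsilon_t, \varepsilon_{t-2}) + \theta \mathrm{Cov}(\varepsilon_{t-1}, \varepsilon_{t-1}) + \theta^2 \mathrm{Cov}(\varepsilon_{t-1}, \varepsilon_{t-2})$$
  6. The only dependence among these terms is between $\varepsilon_{t-1}$ and itself, meaning all other terms are 0. The covariance in this case is equal to the variance of the white noise, $\sigma^2$. This leads to
     $$\gamma_1 = \theta \sigma^2$$
  7. From here, it follows that
     $$\rho_1 = \frac{\gamma_1}{\gamma_0} = \frac{\theta \sigma^2}{(1 + \theta^2) \sigma^2} = \frac{\theta}{1 + \theta^2}$$
  8. Following the same reasoning, it can be shown that $\rho_2 = 0$ and $\rho_3 = 0$. To give an example, the covariance at order 2 should be considered:
     $$\gamma_2 = \mathrm{Cov}(Y_t, Y_{t-2}) = \mathrm{Cov}(\varepsilon_t + \theta \varepsilon_{t-1}, \varepsilon_{t-2} + \theta \varepsilon_{t-3})$$
     which can be expanded to
     $$\gamma_2 = \mathrm{Cov}(\varepsilon_t, \varepsilon_{t-2}) + \theta \mathrm{Cov}(\varepsilon_t, \varepsilon_{t-3}) + \theta \mathrm{Cov}(\varepsilon_{t-1}, \varepsilon_{t-2}) + \theta^2 \mathrm{Cov}(\varepsilon_{t-1}, \varepsilon_{t-3})$$
      Since all $\varepsilon_t$ are i.i.d. white noise, the covariance between any two different $\varepsilon_t$ and $\varepsilon_s$ for $t \neq s$ is zero:
      $$\gamma_2 = 0$$
      This is because in an $MA(1)$ process, the noise terms are all independent.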
In theory, this results in a correlogram where all correlations after lag 1 are equal to zero. In practice, this will not be the case due to randomness (i.e. very small but non-significant autocorrelations might be present).
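The result of the proof can be checked by simulation. The sketch below assumes Gaussian white noise and illustrative parameter values ($\mu = 2$, $\theta = 0.6$, $\sigma = 1$, chosen here and not taken from the text). It compares the sample variance and sample autocorrelations of a simulated $MA(1)$ series with $(1 + \theta^2)\sigma^2$ and $\theta / (1 + \theta^2)$.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations for lags 0..max_lag of a 1-D series."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    denom = np.dot(d, d)
    return np.array([np.dot(d[k:], d[:len(d) - k]) / denom
                     for k in range(max_lag + 1)])

mu, theta, sigma = 2.0, 0.6, 1.0       # illustrative values (assumption)
n = 20000
rng = np.random.default_rng(1)
eps = rng.normal(0.0, sigma, size=n + 1)
y = mu + eps[1:] + theta * eps[:-1]    # Y_t = mu + eps_t + theta * eps_{t-1}

var_theory = (1 + theta**2) * sigma**2   # gamma_0
rho1_theory = theta / (1 + theta**2)     # rho_1

acf = sample_acf(y, 3)
print("variance:", round(y.var(), 3), "theory:", round(var_theory, 3))
print("rho_1   :", round(acf[1], 3), "theory:", round(rho1_theory, 3))
print("rho_2, rho_3 (should be near 0):", np.round(acf[2:], 3))
```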
 
Moving Average of Order $q$
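Def 4.3 - Moving Average of Order $q$
A stationary stochastic process $Y_t$ is a moving average of order $q$, denoted as $MA(q)$, if it satisfies:
$$Y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q}$$
where $\mu$ and $\theta_1, \dots, \theta_q$ are unknown parameters.
Autocorrelation of $MA(q)$
The autocorrelations of an $MA(q)$ process are equal to zero for lags larger than $q$. If the correlogram shows a strong decline and becomes non-significant after lag $q$, there is evidence the series was generated by an $MA(q)$ process.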
Example 4.3 - Moving Average MA(4) & Autocorrelation
The following image illustrates a moving average process of order 4 and its ACF. As mentioned, the ACF becomes negligible for lags larger than 4.
[Figure: a moving average process of order 4 and its autocorrelation function]
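The cutoff after lag $q$ can also be seen from the theoretical ACF. The sketch below relies on the standard result $\rho_k = \sum_{i=0}^{q-k} \theta_i \theta_{i+k} / \sum_{i=0}^{q} \theta_i^2$ with $\theta_0 = 1$, which is not derived in this chapter. The function name and the coefficient values of the $MA(4)$ example are hypothetical.

```python
import numpy as np

def ma_theoretical_acf(thetas, max_lag):
    """Theoretical ACF of an MA(q) process with coefficients theta_1..theta_q."""
    psi = np.concatenate(([1.0], np.asarray(thetas, dtype=float)))  # theta_0 = 1
    q = len(psi) - 1
    gamma0 = np.dot(psi, psi)          # variance divided by sigma^2
    acf = np.zeros(max_lag + 1)
    for k in range(min(q, max_lag) + 1):
        acf[k] = np.dot(psi[k:], psi[:len(psi) - k]) / gamma0
    return acf                         # entries for lags k > q stay exactly zero

# Hypothetical MA(4) coefficients, chosen only for illustration
acf_ma4 = ma_theoretical_acf([0.5, 0.4, 0.3, 0.2], max_lag=8)
print(np.round(acf_ma4, 3))
```

For an $MA(1)$ process the function reduces to $\rho_1 = \theta / (1 + \theta^2)$.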
Parameter Estimation & Validation
When a time series is expected to be generated by an $MA(q)$ model, the parameters $\mu$ and all $\theta_i$ should be estimated. This can be done using any of the techniques described in chapter 2. Often a simple estimator such as the maximum likelihood estimator will suffice. If $\varepsilon_t \sim N(0, \sigma^2)$, then the maximum likelihood estimator is given by
$$(\hat{\mu}, \hat{\theta}_1, \dots, \hat{\theta}_q, \hat{\sigma}^2) = \arg\max_{\mu, \theta_1, \dots, \theta_q, \sigma^2} \log L(\mu, \theta_1, \dots, \theta_q, \sigma^2 \mid Y_1, \dots, Y_T)$$
The estimated parameters will lead to residuals. The residuals should behave like white noise. In order to validate an $MA(q)$ model, it is often a good idea to make a correlogram of the residuals. The ACF should ideally contain a spike at a lag of order 0 (as is usual), but not contain any significant values at other lags.
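As a rough illustration of how such an estimate can be obtained, the sketch below fits an $MA(1)$ model by minimizing the conditional sum of squares (CSS) over a grid of $\theta$ values. This is a swapped-in simplification and not the exact maximum likelihood estimator: it assumes Gaussian noise, fixes $\varepsilon_0 = 0$, and uses the sample mean for $\mu$. The function names `ma1_css` and `fit_ma1` are hypothetical.

```python
import numpy as np

def ma1_css(y, mu, theta):
    """Conditional sum of squared residuals of an MA(1) model, with eps_0 = 0."""
    eps_prev = 0.0
    total = 0.0
    for value in y:
        eps_t = value - mu - theta * eps_prev   # invert Y_t = mu + eps_t + theta*eps_{t-1}
        total += eps_t**2
        eps_prev = eps_t
    return total

def fit_ma1(y, grid=None):
    """Grid search for theta; mu is approximated by the sample mean."""
    y = np.asarray(y, dtype=float)
    if grid is None:
        grid = np.linspace(-0.95, 0.95, 191)    # invertible region, step 0.01
    mu_hat = y.mean()
    css = [ma1_css(y, mu_hat, th) for th in grid]
    return mu_hat, float(grid[int(np.argmin(css))])

# Simulated data with illustrative true values mu = 2, theta = 0.6
rng = np.random.default_rng(3)
eps = rng.normal(size=3001)
y = 2.0 + eps[1:] + 0.6 * eps[:-1]

mu_hat, theta_hat = fit_ma1(y)
print("mu_hat:", round(mu_hat, 3), "theta_hat:", round(theta_hat, 3))
```

In practice a numerical optimizer and the exact likelihood would normally replace the grid search; the grid is used here only to keep the sketch transparent.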
Example 4.4 - Simulation
In this example four graphs are shown. They have the following meaning:
  • The first graph visualizes the result of a simulated process. This can be mathematically expressed as
    • This requires to estimate the parameters
  • The second graph shows the predicted values against the actual values, together with the upper and lower confidence bounds. These represent a range around each predicted value that reflects the uncertainty of the prediction. They give a margin within which the true future values are expected to fall, based on the variability of the data and the uncertainty in parameter estimation (see Chapter 2 - Estimators).
  • The third graph shows the value of the residuals between the predicted and observed data.
  • The final graph shows the correlogram of the residuals. As expected, there is an autocorrelation of 1 for a lag of order 0, but at other lags the autocorrelation is negligible, indicating a good fit: the obtained residuals behave like white noise.
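[Figure: simulated series, predicted versus actual values with confidence bounds, residuals, and correlogram of the residuals]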
Q-Statistic / Ljung-Box Test
Besides checking that the ACF of the residuals behaves like the ACF of a white noise process, a test statistic can be used to test whether a time series is the result of a white noise stochastic process (i.e. a sequence of i.i.d. stochastic variables). In this case, the Q-Statistic, sometimes called the Ljung-Box statistic, can be used.
Def 4.4 - Q-Statistic / Ljung-Box Statistic
The Q-Statistic or Ljung-Box statistic tests the (null) hypothesis that
$$H_0: \rho_1 = \rho_2 = \dots = \rho_K = 0$$
for a specified value of $K$.
Let's break down what is happening here and how it can be used. The null hypothesis says that all autocorrelations of the residuals are zero, for all lags from 1 to $K$. In other words, there is no autocorrelation present in the residuals, which implies that the residuals behave like white noise. The alternative hypothesis can then be said to be
$$H_1: \rho_k \neq 0 \text{ for at least one } k \in \{1, \dots, K\}$$
This alternative hypothesis suggests that there is in fact a significant autocorrelation in the residuals at one or more of the specified lags. The test statistic $Q$, which will be defined later, can be used to reject or fail to reject the null hypothesis. Under the null hypothesis $H_0$, the $Q$-statistic approximately follows a $\chi^2$-distribution with $K$ degrees of freedom. The computed value of the $Q$-statistic should be compared to the critical value $\chi^2_{K, 1-\alpha}$ for a chosen significance level $\alpha$ (e.g. often $\alpha = 0.05$). If $Q > \chi^2_{K, 1-\alpha}$, the null hypothesis should be rejected, concluding that there is significant autocorrelation in the residuals up to lag $K$.
 
Remember that $T$ is the number of samples available in the time series. The predecessor of the Ljung-Box test, called the Box-Pierce statistic, can then be written as
$$Q_{BP} = T \sum_{k=1}^{K} \hat{\rho}_k^2$$
However, this test is considered to be outdated compared to the newer Ljung-Box test for a couple of reasons. One of those is the lack of adjustment for small samples. The Box-Pierce test assumes that the sample size is large enough for the $\chi^2$-distribution approximation to hold. This is not the case for smaller sample sizes. The Ljung-Box test improves the Box-Pierce test by adding a correction factor that accounts for the diminishing number of terms in the autocorrelation calculation as the lag increases. This improves accuracy for small to moderate sample sizes $T$. The Ljung-Box test can be expressed as
$$Q_{LB} = T (T + 2) \sum_{k=1}^{K} \frac{\hat{\rho}_k^2}{T - k}$$
The main assumptions that hold here are relatively forgiving, having been introduced before as well. The residuals are assumed to be derived from a well-specified time series model, the observations have to be independent and the test is generally applied to stationary series (after differencing if necessary).
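Both statistics are straightforward to compute from the sample autocorrelations. The sketch below is a minimal NumPy version, assuming $K = 10$ and a 5% significance level; the critical value 18.307 is the 0.95 quantile of the $\chi^2$-distribution with 10 degrees of freedom. The function names and the two test series (Gaussian white noise and an autocorrelated series built with coefficient 0.8) are illustrative choices.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations for lags 0..max_lag of a 1-D series."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    denom = np.dot(d, d)
    return np.array([np.dot(d[k:], d[:len(d) - k]) / denom
                     for k in range(max_lag + 1)])

def box_pierce(x, K):
    """Q_BP = T * sum_{k=1..K} rho_k^2."""
    T = len(x)
    r = sample_acf(x, K)[1:]
    return T * np.sum(r**2)

def ljung_box(x, K):
    """Q_LB = T (T + 2) * sum_{k=1..K} rho_k^2 / (T - k)."""
    T = len(x)
    r = sample_acf(x, K)[1:]
    lags = np.arange(1, K + 1)
    return T * (T + 2) * np.sum(r**2 / (T - lags))

CHI2_95_DF10 = 18.307   # 0.95 quantile of chi-square with 10 degrees of freedom

rng = np.random.default_rng(2)
noise = rng.normal(size=500)

# An autocorrelated series for comparison: each value keeps 80% of the previous one
persistent = np.zeros(500)
for t in range(1, 500):
    persistent[t] = 0.8 * persistent[t - 1] + noise[t]

for name, series in [("white noise", noise), ("autocorrelated", persistent)]:
    q = ljung_box(series, 10)
    print(name, "Q_LB =", round(q, 2), "reject H0:", q > CHI2_95_DF10)
```

When the test is applied to the residuals of a fitted model, the degrees of freedom are commonly reduced by the number of estimated ARMA coefficients; the sketch ignores this adjustment.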
 
The null hypothesis that is being tested by the Q-Statistic can be said to be a joint null hypothesis. It combines several null hypotheses to eliminate the multiple testing problem. This problem arises when multiple null hypotheses each have to survive a separate test, which becomes increasingly difficult as their number grows. The null hypotheses are of the form
$$H_0^{(k)}: \rho_k = 0, \qquad k = 1, \dots, K$$
Since it is possible that one of these autocorrelations is statistically significant purely by chance, testing each lag separately could wrongly suggest that the model is not valid. The joint null hypothesis avoids this multiple testing problem. The only difficulty remains in choosing $K$, but often the following rule is used:

4.3 Autoregressive (AR) Models

Autoregressive of Order 1
Def 4.5 - Autoregressive of Order 1
A stationary stochastic process $Y_t$ is an autoregressive process of order 1, denoted as $AR(1)$, if it satisfies:
$$Y_t = c + \phi Y_{t-1} + \varepsilon_t$$
where $c$ and $\phi$ are unknown parameters.
This is a time series model where the current value in the series depends linearly on the previous value $Y_{t-1}$, scaled by the parameter $\phi$, and on a random shock described by white noise $\varepsilon_t$. The noise is, just as with an $MA(1)$ model, assumed to be white noise (i.e. i.i.d. with mean 0 and constant variance $\sigma^2$). The parameter $\phi$ controls how much influence the previous value has on the current value. Note that for $c = 0$ and $\phi = 0$, the result is plain white noise as described in the first section.
 
Autocorrelations of $AR(1)$
The autocorrelations of an $AR(1)$ process are given by
$$\rho_1 = \phi, \qquad \rho_2 = \phi^2, \qquad \rho_3 = \phi^3, \qquad \dots$$
In general, the ACF for an $AR(1)$ process can be expressed as
$$\rho_k = \phi^k$$
Intuition of Autocorrelations for $AR(1)$
The reason for this is that the time series, at a given time $t$, directly depends on its previous value $Y_{t-1}$. This dependence creates a cascading effect where past values continue to influence the current and future values, albeit with a diminishing strength over time. This can easily be shown by expanding $Y_t$ recursively (and setting $c = 0$):
$$Y_t = \phi Y_{t-1} + \varepsilon_t = \phi (\phi Y_{t-2} + \varepsilon_{t-1}) + \varepsilon_t = \phi^2 Y_{t-2} + \phi \varepsilon_{t-1} + \varepsilon_t$$
Continuing this expansion process, the current value of the time series can be expressed as an infinite sum as
$$Y_t = \sum_{j=0}^{\infty} \phi^j \varepsilon_{t-j}$$
This means that the current value is influenced by all past shocks $\varepsilon_{t-j}$, where the influence decays exponentially by $\phi^j$ as the lag $j$ increases. The following propositions are then true:
  • The autocorrelation at lag $k$ then measures the linear dependence between $Y_t$ and $Y_{t-k}$.
  • Since $Y_t$ depends directly on $Y_{t-1}$ through the parameter $\phi$, and $Y_{t-1}$ on $Y_{t-2}$ using this same parameter, the correlation between $Y_t$ and $Y_{t-k}$ persists over multiple lags but diminishes as the lag increases.
  • The strength of this correlation diminishes geometrically because each successive lag multiplies the influence by $\phi$.
At a lag $k$, the relationship between $Y_t$ and $Y_{t-k}$ includes $k$ steps of dependence, each scaled by $\phi$. Therefore, the correlation is proportional to $\phi^k$.
This exponential decay occurs because $|\phi| < 1$ (in a stationary process) and repeated multiplication by $\phi$ reduces the magnitude of the dependency with increasing $k$. Furthermore, the random shocks $\varepsilon_t$ are uncorrelated over time. However, their influence is transmitted through the recursive structure of the process, which introduces correlation into the series. The further back in time a shock occurs, the weaker its influence, as it is scaled by successive powers of $\phi$.
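The geometric decay can be illustrated with a short simulation. The sketch below assumes Gaussian noise, no constant term, and an illustrative coefficient of 0.7; it compares the sample ACF of the simulated series with the corresponding powers of that coefficient.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations for lags 0..max_lag of a 1-D series."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    denom = np.dot(d, d)
    return np.array([np.dot(d[k:], d[:len(d) - k]) / denom
                     for k in range(max_lag + 1)])

phi = 0.7                  # illustrative value (assumption), must satisfy |phi| < 1
n = 50000
rng = np.random.default_rng(4)
eps = rng.normal(size=n)

y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + eps[t]    # first-order recursion without a constant

acf = sample_acf(y, 5)
theory = phi ** np.arange(6)          # phi^0, phi^1, ..., phi^5

print("sample:", np.round(acf, 3))
print("theory:", np.round(theory, 3))
```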
 
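Properties of an $AR(1)$ model with $|\phi| < 1$
  • The expected value of an autoregressive time series of order 1 can be computed using the fact that $E[\varepsilon_t] = 0$ and $E[Y_t] = E[Y_{t-1}] = \mu$ (due to stationarity). Hence
    $$\mu = c + \phi \mu$$
    • This can be rearranged to solve for $\mu$ as follows
      $$\mu = \frac{c}{1 - \phi}$$
      which holds as long as $\phi \neq 1$.
  • The variance of an autoregressive time series of order 1 is a bit more tricky to get right. The full derivation will be given here. For stationarity, the variance should be constant over time and thus should resolve to a constant.
    • First, the deviation from the mean can be expressed as $Y_t - \mu$. Since it is known that $c = \mu (1 - \phi)$, this can be substituted into the model equation.
      $$Y_t - \mu = \phi (Y_{t-1} - \mu) + \varepsilon_t$$
      • Here, $Y_t - \mu$ represents the deviation from the mean.
    • Next, the variance of $Y_t$, denoted as $\gamma_0$, should satisfy
      $$\gamma_0 = \mathrm{Var}(Y_t - \mu) = \mathrm{Var}(\phi (Y_{t-1} - \mu) + \varepsilon_t)$$
      • since $\mathrm{Var}(Y_t) = \mathrm{Var}(Y_t - \mu)$. Using the properties of variance for the constant $\phi$ and the independence of $\varepsilon_t$ from $Y_{t-1}$, the variance can be expressed as
        $$\gamma_0 = \phi^2 \gamma_0 + \sigma^2$$
        where $\sigma^2$ is the variance of the white noise process.
    • This can be rearranged as follows
      $$\gamma_0 = \frac{\sigma^2}{1 - \phi^2}$$
      • which holds as long as $|\phi| < 1$.
  • The autocovariance at lag $k$, denoted by $\gamma_k$ or $\mathrm{Cov}(Y_t, Y_{t-k})$, measures the linear dependence between $Y_t$ and $Y_{t-k}$. In general, the autocovariance can be found using the following equation:
    $$\gamma_k = \mathrm{Cov}(Y_t, Y_{t-k}) = E[(Y_t - \mu)(Y_{t-k} - \mu)]$$
    • The autocovariance can be found using induction on the lag order $k$.
    • Base Case
      • In case of lag order $k = 0$, the autocovariance is the variance of the process:
        $$\gamma_0 = \mathrm{Var}(Y_t) = \frac{\sigma^2}{1 - \phi^2}$$
    • Recursive Relation
      • For any lag order $k \geq 1$, the equation governing an $AR(1)$ process can be used.
        $$Y_t = c + \phi Y_{t-1} + \varepsilon_t$$
        Since it is known that $\gamma_k = \mathrm{Cov}(Y_t, Y_{t-k})$ and $Y_t = c + \phi Y_{t-1} + \varepsilon_t$, the covariance can be expanded as follows:
        $$\gamma_k = \mathrm{Cov}(c + \phi Y_{t-1} + \varepsilon_t, Y_{t-k})$$
        Since $c$ is a constant, it does not contribute to the covariance. Furthermore, $\varepsilon_t$ is independent of all past values $Y_{t-k}$. Using these facts, the expression for the covariance can be simplified to
        $$\gamma_k = \phi \, \mathrm{Cov}(Y_{t-1}, Y_{t-k})$$
        Thus, the covariance at lag $k$ is related to the covariance at lag $k - 1$ by
        $$\gamma_k = \phi \gamma_{k-1}$$
    • Generalization
      • Using the recursive relation from the previous step, a general formula for the covariance of any order $k$ can be derived. Since the recursive relationship shows that the autocovariance at lag $k$ is just $\phi$ times the autocovariance at lag $k - 1$, this formula becomes
        $$\gamma_k = \phi^k \gamma_0$$
        Finally, the value for $\gamma_0$ can be substituted, since this is simply the variance of the process.
        $$\gamma_k = \phi^k \frac{\sigma^2}{1 - \phi^2}$$
  • The autocorrelation at lag $k$ denotes the normalized autocovariance. It measures the strength and direction of the linear relationship between $Y_t$ and $Y_{t-k}$. It is defined by the ACF, seen in previous chapters. In this case, it simplifies to:
    $$\rho_k = \frac{\gamma_k}{\gamma_0}$$
    • Since the expressions for both $\gamma_k$ and $\gamma_0$ are known, they can be substituted in the formula. This results in
      $$\rho_k = \frac{\phi^k \sigma^2 / (1 - \phi^2)}{\sigma^2 / (1 - \phi^2)} = \phi^k$$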
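Expected Value & Variance of $AR(1)$
  • The expected value of an $AR(1)$ process is given by $E[Y_t] = \mu = \frac{c}{1 - \phi}$
  • The variance of an $AR(1)$ process is given by $\gamma_0 = \frac{\sigma^2}{1 - \phi^2}$
  • The autocovariance of an $AR(1)$ process is given by $\gamma_k = \phi^k \frac{\sigma^2}{1 - \phi^2}$
  • The autocorrelation of an $AR(1)$ process is given by $\rho_k = \phi^k$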
Before moving on, some intuition behind these values is given to help better understand their meaning and uses.
  • The autocovariance quantifies the absolute strength of the linear relationship between $Y_t$ and $Y_{t-k}$, while factoring in the variance of the process. It does so by measuring how much the values of the time series at different time points move together. The larger the autocovariance, the stronger the linear dependence.
  • The autocorrelation on the other hand provides a dimensionless measure of dependence between $Y_t$ and $Y_{t-k}$, scaled relative to the total variability of the process.
  • For $|\phi| < 1$, both the autocovariance and autocorrelation decay exponentially as the lag order $k$ increases. This reflects the “memory” of the process. The influence of past values diminishes over time, but larger values of $|\phi|$ retain information longer than smaller values of $|\phi|$.
    • In case of $|\phi|$ approaching 1, the series has a strong persistence (i.e. the degree to which past values influence current values is high).
    • If $\phi$ is closer to 0 on the other hand, the decay is rather rapid, indicating weak dependence and weak persistence.
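The closed-form moments of a stationary first-order autoregressive process can be checked numerically. The sketch below computes the mean as the constant divided by one minus the coefficient, the variance as the noise variance divided by one minus the squared coefficient, and the autocovariances by multiplying by the coefficient once per lag. It then compares mean and variance with a simulated series. The function name `ar1_moments` and the parameter values (constant 1.0, coefficient 0.7, unit noise variance) are assumptions made for illustration, as is the Gaussian noise.

```python
import numpy as np

def ar1_moments(c, phi, sigma2, max_lag):
    """Mean, autocovariances and autocorrelations of a stationary AR(1) process."""
    if abs(phi) >= 1:
        raise ValueError("stationarity requires |phi| < 1")
    mean = c / (1 - phi)
    gamma = np.empty(max_lag + 1)
    gamma[0] = sigma2 / (1 - phi**2)        # variance of the process
    for k in range(1, max_lag + 1):
        gamma[k] = phi * gamma[k - 1]       # one factor of phi per additional lag
    rho = gamma / gamma[0]                  # normalize by the variance
    return mean, gamma, rho

c, phi, sigma2 = 1.0, 0.7, 1.0              # illustrative values (assumption)
mean, gamma, rho = ar1_moments(c, phi, sigma2, max_lag=5)

# Simulate the process, starting at the theoretical mean to avoid a long burn-in
n = 50000
rng = np.random.default_rng(5)
eps = rng.normal(0.0, np.sqrt(sigma2), size=n)
y = np.empty(n)
y[0] = mean
for t in range(1, n):
    y[t] = c + phi * y[t - 1] + eps[t]

print("mean    :", round(y.mean(), 3), "theory:", round(mean, 3))
print("variance:", round(y.var(), 3), "theory:", round(gamma[0], 3))
print("rho_k   :", np.round(rho, 3))
```

Rerunning the sketch with a coefficient close to 1 shows the slower decay that corresponds to strong persistence.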
 
Partial Correlations
Def 4.6 - Partial Correlation
A partial correlation is defined as
$$\alpha_k = \mathrm{Corr}(Y_t, Y_{t-k} \mid Y_{t-1}, \dots, Y_{t-k+1})$$
for $k \geq 2$, with $\alpha_1 = \rho_1$.
Partial correlations measure the relationship between two variables, while controlling for the effect of one or more additional variables. In other words, a partial correlation quantifies the direct correlation between two variables after removing the influence of other variables that might affect both. This results in the partial autocorrelation function (PACF) $\alpha_k$, which measures the correlation between $Y_t$ and $Y_{t-k}$ after removing the influence of all intermediate variables (i.e. lag orders 1 through $k - 1$). The partial autocorrelations of an $AR(1)$ process are given by
$$\alpha_1 = \phi, \qquad \alpha_k = 0 \text{ for } k \geq 2$$
The partial correlations starting at lag order 2 remove the influence of the previous lags. In an $AR(1)$ process, since the relationship between $Y_t$ and $Y_{t-2}$ is already explained by the first lag, the partial correlation will be zero. This reasoning holds for higher lag orders as well. This also means that an $AR(1)$ process only has a first-order dependency: the current value is directly influenced only by the previous value (and not by further lags). In principle, this also means that a time series may have been generated by an $AR(1)$ process if the PACF shows a strict cutoff to values near zero after lag 1.
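One common way to estimate partial autocorrelations is through successive regressions: the partial autocorrelation at a given lag is taken as the last coefficient when the series is regressed on that many of its own lags. The sketch below uses this regression approach with ordinary least squares (a swapped-in technique, not necessarily the estimator intended in the text) on a simulated first-order autoregressive series with an illustrative coefficient of 0.7. The function name `pacf_ols` is hypothetical.

```python
import numpy as np

def pacf_ols(x, max_lag):
    """Partial autocorrelations for lags 1..max_lag via successive OLS regressions."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    out = []
    for k in range(1, max_lag + 1):
        # Columns hold the series lagged by 1, 2, ..., k, aligned with x[k:]
        X = np.column_stack([x[k - j:n - j] for j in range(1, k + 1)])
        target = x[k:]
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)
        out.append(coef[-1])          # coefficient on the k-th lag
    return np.array(out)

phi = 0.7                             # illustrative value (assumption)
n = 20000
rng = np.random.default_rng(6)
eps = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + eps[t]

pacf = pacf_ols(y, 5)
print(np.round(pacf, 3))   # first entry near 0.7, remaining entries near 0
```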
 
Autoregressive of Order $p$
Def 4.7 - Autoregressive of Order $p$
A stationary stochastic process $Y_t$ is autoregressive of order $p$, denoted as $AR(p)$, if it satisfies:
$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \varepsilon_t$$
where $c$ and $\phi_1, \dots, \phi_p$ are unknown parameters.
In general, for $AR(p)$ models, the partial correlations of a lag larger than $p$ are equal to zero. The autocorrelations tend more slowly towards zero and sometimes even exhibit a sinusoidal form.
 
Parameter Estimation & Validation
When a time series is expected to be generated by an $AR(p)$ model, the parameters $c$ and all $\phi_i$ should be estimated. This can be done using any of the techniques described in chapter 2. Often a simple estimator such as the maximum likelihood estimator will suffice. If $\varepsilon_t \sim N(0, \sigma^2)$, then the maximum likelihood estimator is given by
$$(\hat{c}, \hat{\phi}_1, \dots, \hat{\phi}_p, \hat{\sigma}^2) = \arg\max_{c, \phi_1, \dots, \phi_p, \sigma^2} \log L(c, \phi_1, \dots, \phi_p, \sigma^2 \mid Y_1, \dots, Y_T)$$
The estimated parameters will lead to residuals. The residuals should behave like white noise. In order to validate an $AR(p)$ model, it is often a good idea to make a correlogram of the residuals. The ACF should ideally contain a spike at a lag of order 0 (as is usual), but not contain any significant values at other lags.
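For autoregressive models, a least squares regression of the series on its own lags gives a simple estimate; under Gaussian noise it coincides with the maximum likelihood estimate conditional on the first observations. The sketch below is a minimal version of that idea for a second-order example. The function name `fit_ar_ols` and the true parameter values (constant 0.5, lag coefficients 0.5 and -0.3) are assumptions for illustration. The residual autocorrelations are inspected as a validation step.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations for lags 0..max_lag of a 1-D series."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    denom = np.dot(d, d)
    return np.array([np.dot(d[k:], d[:len(d) - k]) / denom
                     for k in range(max_lag + 1)])

def fit_ar_ols(y, p):
    """Least squares fit of an autoregression of order p with intercept."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    lagged = [y[p - j:n - j] for j in range(1, p + 1)]   # series lagged by 1..p
    X = np.column_stack([np.ones(n - p)] + lagged)
    target = y[p:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return beta[0], beta[1:], resid      # intercept, lag coefficients, residuals

# Simulated second-order series with illustrative parameters
c_true, phi_true = 0.5, np.array([0.5, -0.3])
n = 20000
rng = np.random.default_rng(7)
eps = rng.normal(size=n)
y = np.zeros(n)
for t in range(2, n):
    y[t] = c_true + phi_true[0] * y[t - 1] + phi_true[1] * y[t - 2] + eps[t]

c_hat, phi_hat, resid = fit_ar_ols(y, p=2)
resid_acf = sample_acf(resid, 5)

print("c_hat  :", round(c_hat, 3))
print("phi_hat:", np.round(phi_hat, 3))
print("residual ACF, lags 1-5:", np.round(resid_acf[1:], 3))
```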
Example 4.5 - Simulation
In this example four graphs are shown. They have the following meaning:
  • The first graph visualizes the result of a simulated process. This can be mathematically expressed as
    • This requires to estimate the parameters
  • The second graph shows the predicted values against the actual values, together with the upper and lower confidence bounds. These represent a range around each predicted value that reflects the uncertainty of the prediction. They give a margin within which the true future values are expected to fall, based on the variability of the data and the uncertainty in parameter estimation (see Chapter 2 - Estimators).
  • The third graph shows the value of the residuals between the predicted and observed data.
  • The final graph shows the correlogram of the residuals. As expected, there is an autocorrelation of 1 for a lag of order 0, but at other lags the autocorrelation is negligible, indicating a good fit: the obtained residuals behave like white noise.
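[Figure: simulated series, predicted versus actual values with confidence bounds, residuals, and correlogram of the residuals]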

4.4 Autoregressive Moving Average (ARMA) Models

In some cases, when inspecting a time series, it is necessary to connect the foundational concepts of both autoregressive and moving average models. These models are building blocks that will be used in this section to construct autoregressive moving average (ARMA) models. Given the (partial) correlogram of a time series, an ARMA model specification may be appropriate if the correlogram does not “implode” to near-zero or non-significant values as seen in the separate AR or MA models.
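Def 4.8 - Autoregressive Moving Average of Order $(p, q)$
A stationary stochastic process $Y_t$ is autoregressive moving average of order $(p, q)$, denoted as $ARMA(p, q)$, if it satisfies:
$$Y_t = c + \phi_1 Y_{t-1} + \dots + \phi_p Y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}$$
where $c$, $\phi_1, \dots, \phi_p$ and $\theta_1, \dots, \theta_q$ are unknown parameters.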
Parameter Estimation & Validation
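When a time series is expected to be generated by an $ARMA(p, q)$ model, the parameters $c$, all $\phi_i$ and all $\theta_j$ should be estimated. This can be done using any of the techniques described in chapter 2. Often a simple estimator such as the maximum likelihood estimator will suffice. If $\varepsilon_t \sim N(0, \sigma^2)$, then the maximum likelihood estimator is given by
$$(\hat{c}, \hat{\phi}_1, \dots, \hat{\phi}_p, \hat{\theta}_1, \dots, \hat{\theta}_q, \hat{\sigma}^2) = \arg\max_{c, \phi, \theta, \sigma^2} \log L(c, \phi_1, \dots, \phi_p, \theta_1, \dots, \theta_q, \sigma^2 \mid Y_1, \dots, Y_T)$$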
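To get a feeling for how a mixed process behaves, the sketch below simulates a series with one autoregressive and one moving average term and compares its sample ACF with the theoretical values. It relies on the standard first-order result that the lag-1 autocorrelation equals (1 + ab)(a + b) / (1 + 2ab + b²), where a is the autoregressive and b the moving average coefficient, and that each further lag multiplies the autocorrelation by a; this result is not derived in this chapter. The coefficient values 0.6 and 0.4 and the Gaussian noise are illustrative assumptions.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations for lags 0..max_lag of a 1-D series."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    denom = np.dot(d, d)
    return np.array([np.dot(d[k:], d[:len(d) - k]) / denom
                     for k in range(max_lag + 1)])

phi, theta = 0.6, 0.4      # illustrative AR and MA coefficients (assumption)
n = 50000
rng = np.random.default_rng(8)
eps = rng.normal(size=n)

# One AR term and one MA term, no constant
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + eps[t] + theta * eps[t - 1]

# Theoretical ACF: closed form at lag 1, then geometric decay with factor phi
rho1 = (1 + phi * theta) * (phi + theta) / (1 + 2 * phi * theta + theta**2)
theory = np.array([1.0] + [rho1 * phi ** (k - 1) for k in range(1, 6)])

acf = sample_acf(y, 5)
print("sample:", np.round(acf, 3))
print("theory:", np.round(theory, 3))
```

Unlike a pure moving average, the ACF here does not cut off after a fixed lag but decays gradually.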