We will make heavy use of both variance and covariance (in particular the autocovariance) throughout the book. This chapter presents a refresher on these topics, laying the foundation for the forms used in time series analysis discussed in the next chapter.
There are several operators we will encounter in this book. Values such as mean, variance, and covariance can all be cast in the operator formalism. In later chapters we will introduce the backshift and Fourier operators. So what is an operator?
An operator $O$ is defined as a rule that maps a member of a set to another member. Thus, an operator could just be a function such as $f$, defined as multiplying by 5 (i.e., $f(x) = 5x$)[1]. However, the most common operators we will use map one function to another function. Two such operators are differentiation $\frac{d}{dx}$ and integration $\int dx$.
Linear operators are of particular interest. An operator is a linear operator if it fulfills the following two conditions:
$$O\,aF(x) = a\,OF(x) \quad \text{for any constant } a$$
$$O\big(F(x) + G(y)\big) = OF(x) + OG(y)$$
We may write the two conditions more succinctly as
The most important operator in statistics and data science is the expectation operator E[F(x)], usually first encountered in the context of the arithmetic mean.
In general, we will not explicitly reference both the discrete and continuous cases in this book. Instead, we will use the notation E or one of the two methods in Eq. (3) with an understanding that a reference to either one implicitly refers to both unless specified otherwise.
An important property of expectation is that it is a linear operator. We can demonstrate this fact by proving that $E[aF(x) + bG(y)] = aE[F(x)] + bE[G(y)]$ as follows:
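Linearity is easy to check numerically. The sketch below uses numpy with arbitrarily chosen $F(x) = x^2$, $G(y) = y$, and constants $a$, $b$; these choices (and the seed) are illustrative assumptions, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)
a, b = 3.0, -2.0

# Sample means approximate expectations, and the mean is itself a linear
# operator, so the identity holds for sample data as well.
lhs = np.mean(a * x**2 + b * y)           # E[aF(x) + bG(y)] with F(x)=x^2, G(y)=y
rhs = a * np.mean(x**2) + b * np.mean(y)  # aE[F(x)] + bE[G(y)]
print(np.isclose(lhs, rhs))  # True
```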
An important theorem states that if the $k$th moment $E[x^k]$ is finite, then all moments $E[x^j]$ with $j < k$ must also be finite. As a corollary, if $E[x^k]$ is infinite, all moments $E[x^m]$ with $m > k$ must also be infinite.
Proof: Let $E[x^k]$ be finite and $j < k$ ($\forall$ is read as "for all"):
$$
\begin{aligned}
E[x^j] &= \int_{-\infty}^{\infty} x^j P(x)\,dx \\
&= \int_{-\infty}^{-1} x^j P(x)\,dx + \int_{-1}^{1} x^j P(x)\,dx + \int_{1}^{\infty} x^j P(x)\,dx \\
&\quad \text{note that } |x^j| \le |x^k| \;\forall\; |x| \ge 1 \text{ and } |x^j| \le 1 \;\forall\; |x| \le 1 \\
&\le \left|\int_{-\infty}^{-1} x^k P(x)\,dx\right| + \int_{-1}^{1} 1 \cdot P(x)\,dx + \int_{1}^{\infty} x^k P(x)\,dx \\
&\le \left|\int_{-\infty}^{-1} x^k P(x)\,dx\right| + 1 + \int_{1}^{\infty} x^k P(x)\,dx \\
&< \infty
\end{aligned}
$$
where we have used the fact that $\int_{-1}^{1} 1 \cdot P(x)\,dx \le 1$, with equality only occurring if the entire probability mass is contained in the interval $[-1, 1]$.
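The pointwise bound at the heart of the proof can be checked on any sample: $|x|^j \le 1$ wherever $|x| \le 1$ and $|x|^j \le |x|^k$ wherever $|x| \ge 1$, so $|x|^j \le 1 + |x|^k$ everywhere, and hence the same bound holds for sample means. A minimal sketch, with an arbitrarily chosen heavy-ish-tailed sample and moments $j=2$, $k=4$:

```python
import numpy as np

# Pointwise, |x|^j <= 1 + |x|^k for all x (split into the |x| <= 1 and
# |x| >= 1 cases), so the sample-mean analogue holds for any data set.
rng = np.random.default_rng(1)
x = rng.standard_t(df=5, size=100_000)  # arbitrary heavy-ish-tailed sample

j, k = 2, 4
lhs = np.mean(np.abs(x)**j)
rhs = 1 + np.mean(np.abs(x)**k)
print(lhs <= rhs)  # True
```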
and is often denoted as $\sigma_x^2$. The variance gives us a measure of how widely the distribution is spread about the mean. In practice, we more commonly make use of the standard deviation $\sigma_x$, which is simply the square root of the variance.
As written, the variance is slightly different from our definition of the second moment.
By exploiting the linearity of expectation, we can express Eq. (17) using the first and second moments exclusively:
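The definitional form and the moments form are algebraically identical, so they agree on any sample. A quick numerical sketch (numpy, with an arbitrary normal sample assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=200_000)

# Direct definition vs. the first-and-second-moments form; the two are
# algebraically identical, so they agree to floating-point precision.
v_direct = np.mean((x - np.mean(x))**2)      # E[(X - mu)^2]
v_moments = np.mean(x**2) - np.mean(x)**2    # E[X^2] - E[X]^2
print(np.isclose(v_direct, v_moments))  # True
```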
Covariance is also written as $\sigma_{x,y}$. Note that $\mathrm{Cov}(X, X) = V(X)$.
Unlike variance, which is never negative, covariance can be negative, zero, or positive. Following the same logic used in Eq. (18), we can also express the covariance as
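Both the definitional and the moments form of covariance, and the possibility of a negative value, can be illustrated numerically. The construction of $Y$ below (a negative multiple of $X$ plus noise) is an arbitrary choice for the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = -0.5 * x + rng.normal(size=100_000)  # constructed to covary negatively

cov_direct = np.mean((x - x.mean()) * (y - y.mean()))  # E[(X - mu_x)(Y - mu_y)]
cov_moments = np.mean(x * y) - x.mean() * y.mean()     # E[XY] - E[X]E[Y]
print(np.isclose(cov_direct, cov_moments))  # True
print(cov_direct < 0)  # True: negative by construction
```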
One of the most fundamental aspects of time series analysis is understanding the variance of the sum of random variables. Let us begin with the variance of a sum of two random variables, X and Y.
The above suggests (though does not prove) that the variance of a sum of variables is the sum of all covariance combinations. For random variables $X_0, X_1, X_2, \ldots, X_{n-1}$:
where the last term sums $i$ to $n-2$ and $j$ to $n-1$. Eq. (24) can be proven in the same manner as Eq. (21), though the algebra gets rather intricate. We present a more direct proof in the following problem.
From Eq. (24) we can see that if all variables have zero covariance (most commonly due to independence), the variance of a sum of variables is the sum of the variances of each variable
While it is very tempting to simply assume that Eq. (25) holds, in real life we must justify its use through theoretical analysis, empirical evidence, or both.
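Both cases are easy to see on simulated data. In the sketch below (numpy; the scales, seed, and the construction of the correlated variable are illustrative assumptions), the variance of a sum of independent variables is close to the sum of the variances, while for correlated variables the $2\,\mathrm{Cov}$ term cannot be dropped:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
x = rng.normal(scale=1.0, size=n)
y = rng.normal(scale=2.0, size=n)       # independent of x
z = x + rng.normal(scale=0.1, size=n)   # strongly correlated with x

# Independent case: the cross term (sample covariance) is near zero,
# so Var(X + Y) is approximately Var(X) + Var(Y).
print(np.isclose(np.var(x + y), np.var(x) + np.var(y), rtol=0.05))  # True

# Correlated case: the 2*Cov(X, Z) term is essential. With a consistent
# (ddof=0) covariance, the identity holds exactly for any sample.
cov_xz = np.mean((x - x.mean()) * (z - z.mean()))
print(np.isclose(np.var(x + z), np.var(x) + np.var(z) + 2 * cov_xz))  # True
```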
It should also be recalled that knowing that random variables have zero covariance does not inherently prove independence. As a simple counterexample, consider a random variable $X$ with zero mean and zero third moment, and let $Y = X^2$. An example might be $X \sim N(0, 1)$ and $Y = X^2 \sim \chi_1^2$. Clearly, $X$ and $Y$ are highly dependent; for example, knowing that $Y > 4$ tells us $|X| > 2$. Nevertheless, they still have zero covariance:
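This counterexample can be confirmed by simulation (numpy; the sample size and seed are arbitrary):

```python
import numpy as np

# The text's counterexample: X ~ N(0, 1) and Y = X^2 are strongly
# dependent, yet Cov(X, Y) = E[X^3] - E[X] E[X^2] = 0.
rng = np.random.default_rng(5)
x = rng.normal(size=1_000_000)
y = x**2

cov = np.mean(x * y) - x.mean() * y.mean()
print(abs(cov) < 0.05)  # True: the sample covariance is near zero
```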
Warning: The math in this section can get rather heavy. Feel free to skip this section if you’re having difficulty. While the material below does add to the overall understanding of future material, it is not absolutely necessary.
As variances are by definition non-negative, by combining the above two equations we arrive at the inequality $2\,|\mathrm{Cov}(X, Y)| \le V(X) + V(Y)$. We can express this in terms of the arithmetic mean of the variances:
In order to better understand the relation between variance and covariance, we must first introduce the Cauchy-Schwarz inequality, a valuable inequality from linear algebra. In words, it states that the square of the inner product of two vectors must always be less than or equal to the product of the two vectors’ squared norms.
Eqs. (29) and (30) will be equalities if and only if $\mathbf{u}$ and $\mathbf{v}$ can be expressed as scalar multiples of one another (i.e., lie on the same line).
If $\mathbf{u}$ and/or $\mathbf{v}$ is the zero vector, the inequality is trivially true; let us prove the inequality when neither is:
Let $\mathbf{w} \triangleq \dfrac{\mathbf{u}}{\|\mathbf{u}\|} \pm \dfrac{\mathbf{v}}{\|\mathbf{v}\|}$
$$
\begin{aligned}
\mathbf{w} \cdot \mathbf{w} &= \left(\frac{\mathbf{u}}{\|\mathbf{u}\|} \pm \frac{\mathbf{v}}{\|\mathbf{v}\|}\right) \cdot \left(\frac{\mathbf{u}}{\|\mathbf{u}\|} \pm \frac{\mathbf{v}}{\|\mathbf{v}\|}\right) \\
&= \frac{\mathbf{u} \cdot \mathbf{u}}{\|\mathbf{u}\|^2} \pm 2\,\frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|} + \frac{\mathbf{v} \cdot \mathbf{v}}{\|\mathbf{v}\|^2} \\
&= 1 \pm 2\,\frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|} + 1 \\
&= 2 \pm 2\,\frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}
\end{aligned}
$$

Since, as with any inner product, $\mathbf{w} \cdot \mathbf{w} \ge 0$:

$$
0 \le 2 \pm 2\,\frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}
\quad \Longrightarrow \quad
\|\mathbf{u}\|\,\|\mathbf{v}\| \ge |\mathbf{u} \cdot \mathbf{v}|
$$
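Both the inequality and its equality condition can be checked numerically (numpy; the vectors and the scalar multiple 3 are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(6)
u = rng.normal(size=10)
v = rng.normal(size=10)

# The inequality itself: |u . v| <= ||u|| ||v||
print(abs(u @ v) <= np.linalg.norm(u) * np.linalg.norm(v))  # True

# Equality holds when the vectors are scalar multiples of one another.
w = 3.0 * u
print(np.isclose(abs(u @ w), np.linalg.norm(u) * np.linalg.norm(w)))  # True
```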
The Cauchy-Schwarz inequality may be extended to integrals by viewing functions as infinite dimensional vectors living in Hilbert space. Let us imagine we have two continuous functions, $f(x)$ and $g(x)$, that are square-integrable on the interval $[a, b]$. Let us create $n$-dimensional vectors by sampling the value of the function at $n$ evenly spaced points along the interval, producing the vectors
We can use the Cauchy-Schwarz inequality to derive a tighter upper bound on the covariance of two variables than that found in Eq. (28). We will only explicitly prove the bound for the discrete case of sample covariance, but by Eq. (34) the bound will also hold for the continuous case. Let
We thus arrive at the tighter bound that the absolute value of the covariance must always be less than or equal to the geometric mean of the variances.
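This bound holds for any data set, not just in expectation, provided the sample variance and covariance are computed consistently (here with `ddof=0`). A numerical sketch with an arbitrarily constructed correlated pair:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=50_000)
y = 0.8 * x + rng.normal(size=50_000)  # arbitrary correlated pair

# Cauchy-Schwarz on the centered samples guarantees
# |Cov(X, Y)| <= sqrt(V(X) * V(Y)) for any data set.
cov = np.mean((x - x.mean()) * (y - y.mean()))
bound = np.sqrt(np.var(x) * np.var(y))
print(abs(cov) <= bound)  # True
```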
There is actually an even stricter requirement for covariance, namely that the covariance matrix be positive semidefinite. We will defer discussion of positive semidefiniteness until we encounter it in the specific time series application of covariance.
This quantity is referred to as the correlation. Note that the Cauchy-Schwarz inequality as applied in Eq. (42) guarantees that Eq. (43) will always lie in $[-1, 1]$. Of course, even when using correlation, determining whether a value such as 0.7 should be considered a high correlation will depend on the context and situation.
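A short numerical sketch of the bound (numpy; the linear-plus-noise construction of $Y$ and the seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=50_000)
y = 0.7 * x + rng.normal(size=50_000)  # arbitrary linear-plus-noise pair

# Correlation is the covariance rescaled by both standard deviations,
# so Cauchy-Schwarz confines it to [-1, 1].
corr = np.corrcoef(x, y)[0, 1]
print(-1.0 <= corr <= 1.0)  # True
```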
In some disciplines, such as quantum mechanics, operators are often denoted by a "hat," as in $\hat{O}$; this should not be confused with the hat used to denote a statistical estimator.