
Section 4.3 The Spectral Theorem

The set of eigenvalues of a linear transformation \(T : V \to V\) is called its spectrum. This vocabulary comes from physics, where the operator in question is a quantized Hamiltonian \(T = H\) and the eigenvalues are energies. For photons, the fundamental particles of light (and radio waves), these energies are determined by frequency, hence the term ‘spectrum’. In quantum mechanics, however, the operator \(H\) must satisfy a special property, one that ensures it is diagonalizable. To even state this property, we need our vector space \(V\) to be an inner product space.
In fact, while we introduced inner product spaces as a way to work geometrically with vector spaces, the existence of an inner product gives us a huge amount of extra structure of which we have only scratched the surface. The following definition is an example of such structure.

Definition 4.3.1.

Let \(V\) and \(W\) be inner product spaces over \(K\) with pairings \(\left\langle, \right\rangle_V\) and \(\left\langle , \right\rangle_W\) respectively. Given a linear transformation
\begin{equation*} T : V \to W \end{equation*}
we say a linear transformation
\begin{equation*} T^*: W \to V \end{equation*}
is its adjoint if, for any two vectors \(\mb{v} \in V\) and \(\mb{w} \in W\) we have
\begin{equation*} \left\langle T (\mb{v} ) , \mb{w} \right\rangle_W = \left\langle \mb{v} , T^* (\mb{w}) \right\rangle_V. \end{equation*}
Let’s consider a few standard examples.

Example 4.3.2. Conjugate transpose as adjoint.

Suppose \(V = \mathbb{C}^n\) and \(W = \mathbb{C}^m\) are inner product spaces with the product given in Example 3.1.12. Then any linear transformation from \(V\) to \(W\) can be seen as multiplying by an \(m \times n\) matrix \(A\text{.}\) In this case, we have
\begin{equation*} A^* = \overline{A^T} \end{equation*}
which is the conjugate of the transpose of \(A\text{.}\)
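To make this concrete, here is a small numerical check, a sketch of our own using NumPy; the helper `inner` implements the Hermitian inner product of Example 3.1.12, and the random matrices are just illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random complex 3x2 matrix, viewed as a map C^2 -> C^3.
A = rng.standard_normal((3, 2)) + 1j * rng.standard_normal((3, 2))
A_star = A.conj().T  # conjugate transpose: the claimed adjoint

def inner(v, w):
    # Hermitian inner product <v, w> = sum_i v_i * conj(w_i)
    return np.sum(v * np.conj(w))

v = rng.standard_normal(2) + 1j * rng.standard_normal(2)
w = rng.standard_normal(3) + 1j * rng.standard_normal(3)

# The defining property of the adjoint: <A v, w>_W = <v, A* w>_V
assert np.isclose(inner(A @ v, w), inner(v, A_star @ w))
```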

Example 4.3.3. Transpose as adjoint.

Suppose \(V = \mathbb{R}^n\) and \(W = \mathbb{R}^m\) are inner product spaces with the dot product. Then it is not hard to show that
\begin{equation*} A^* = A^T. \end{equation*}
In fact, in the finite dimensional setting there is always an adjoint as the following proposition shows.

Proof.

Choose orthonormal bases \(\mathcal{B} = \{\mb{v}_1, \ldots, \mb{v}_n \}\) and \(\mathcal{C} = \{\mb{w}_1, \ldots, \mb{w}_m \}\) for \(V\) and \(W\) respectively. Suppose \(A = \cob{T}{\mathcal{B}}{\mathcal{C}}\) represents \(T\) with respect to these bases. Exercise 4.3.3.1 then shows that the conjugate transpose \(A^*\) represents \(T^*\text{.}\)
To state and prove the Spectral Theorem in finite dimensions, we need one more notion.

Definition 4.3.5.

Let \(V\) be an inner product space, \(T : V \to V\) a linear transformation and \(T^* : V \to V\) its adjoint. We say that \(T\) is self-adjoint if \(T = T^*\text{.}\)
Let us make a few notes about this definition. First, it applies to both finite and infinite dimensional vector spaces. Second, in the finite dimensional setting, we call \(T\) Hermitian if \(V\) is defined over \(\mathbb{C}\) and symmetric if \(V\) is defined over \(\mathbb{R}\text{.}\) This connects well to matrices, where \(A\) is said to be symmetric if \(A^T = A\) (which means that it is self-adjoint as a transformation).
As it turns out, there are many situations in which the linear operator we are considering is self-adjoint. So it is perhaps a pleasant surprise that, in these cases, we have everything we could possibly desire!

Proof.

To see that the claim on eigenvalues holds, we may reduce to the case where \(V\) is defined over \(\mathbb{C}\text{:}\) if instead \(V\) is defined over \(\mathbb{R}\text{,}\) we represent \(T\) as an \(n \times n\) matrix \(A\) using an orthonormal basis as in the proof of Proposition 4.3.4 and observe that \(A\) is symmetric with real entries and thus Hermitian. Now if \(\mb{v}\) is a \(\lambda\)-eigenvector we have that
\begin{align*} \lambda \left\langle \mb{v} , \mb{v} \right\rangle \amp = \left\langle \lambda \mb{v} , \mb{v} \right\rangle , \\ \amp = \left\langle T (\mb{v} ) , \mb{v} \right\rangle , \\ \amp = \left\langle \mb{v} , T ( \mb{v} ) \right\rangle, \\ \amp = \left\langle \mb{v} , \lambda \mb{v} \right\rangle , \\ \amp = \bar{\lambda} \left\langle \mb{v}, \mb{v} \right\rangle . \end{align*}
Dividing by \(\|\mb{v} \|^2\) we see that \(\lambda = \bar{\lambda}\) so that \(\lambda\) is real.
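This step of the argument is easy to test numerically. The following sketch (ours, assuming NumPy) builds a Hermitian matrix and checks that its eigenvalues have no imaginary part:

```python
import numpy as np

rng = np.random.default_rng(1)

B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
H = B + B.conj().T                   # H = H* by construction, i.e. Hermitian

eigvals = np.linalg.eigvals(H)       # generic eigenvalue routine (no symmetry assumed)
assert np.allclose(eigvals.imag, 0)  # all eigenvalues are real
```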
To see that \(T\) has an eigenbasis, we need only show that any generalized eigenvector is actually an eigenvector. So suppose \(\lambda\) is an eigenvalue and \(\mb{v}\) is a generalized \(\lambda\)-eigenvector for which \((\lambda I - T)^r (\mb{v} ) = \mb{0}\) but \((\lambda I - T)^{r - 1} (\mb{v} ) \ne \mb{0}\text{.}\) If this only happens for \(r = 1\) then we are done. So suppose \(r > 1\) and let \(\mb{w} = (\lambda I - T)^{r - 2} (\mb{v} )\text{.}\) Then \((\lambda I - T )(\mb{w}) \ne \mb{0}\) so that \(\| (\lambda I - T )(\mb{w}) \|^2 > 0\) but
\begin{align*} 0 \amp \ne \left\langle (\lambda I - T )(\mb{w}), (\lambda I - T )(\mb{w}) \right\rangle, \\ \amp = \left\langle (\lambda I - T )^2(\mb{w}) , \mb{w} \right\rangle, \\ \amp = \left\langle \mb{0} , \mb{w} \right\rangle, \\ \amp = 0. \end{align*}
This is a contradiction, thus every generalized eigenvector is an eigenvector and there is an eigenbasis (or equivalently, \(T\) is diagonalizable).
Now, suppose \(\lambda_1\) and \(\lambda_2\) are distinct eigenvalues with eigenvectors \(\mb{v}_1\) and \(\mb{v}_2\text{.}\) Then
\begin{align*} \lambda_1 \left\langle \mb{v}_1 , \mb{v}_2 \right\rangle \amp = \left\langle \lambda_1 \mb{v}_1 , \mb{v}_2 \right\rangle , \\ \amp = \left\langle T (\mb{v}_1 ) , \mb{v}_2 \right\rangle , \\ \amp = \left\langle \mb{v}_1 , T ( \mb{v}_2 ) \right\rangle, \\ \amp = \left\langle \mb{v}_1 , \lambda_2 \mb{v}_2 \right\rangle , \\ \amp = \bar{\lambda}_2 \left\langle \mb{v}_1, \mb{v}_2 \right\rangle , \\ \amp = {\lambda}_2 \left\langle \mb{v}_1, \mb{v}_2 \right\rangle . \end{align*}
Here the last line uses the fact that \(\lambda_2\) is real. Since \(\lambda_1 \ne \lambda_2\text{,}\) we must have \(\left\langle \mb{v}_1 , \mb{v}_2 \right\rangle = 0\text{;}\) that is, distinct eigenspaces are in fact orthogonal. This means that our generalized eigenspace decomposition
\begin{equation*} V = V_{\lambda_1} \oplus V_{\lambda_2} \oplus \cdots \oplus V_{\lambda_m} \end{equation*}
is actually an eigenspace decomposition and any two such spaces are orthogonal. Choosing an orthonormal basis for each eigenspace and taking their union then produces an orthonormal eigenbasis of \(V\) for \(T\text{.}\)
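Numerically, this conclusion is exactly what `numpy.linalg.eigh` delivers for a real symmetric matrix. The sketch below (our illustration, not part of the text's development) checks both the orthonormality of the eigenbasis and the resulting diagonalization:

```python
import numpy as np

rng = np.random.default_rng(2)

C = rng.standard_normal((5, 5))
A = (C + C.T) / 2                  # a real symmetric (self-adjoint) matrix

lam, P = np.linalg.eigh(A)         # columns of P form an orthonormal eigenbasis

assert np.allclose(P.T @ P, np.eye(5))          # orthonormality: P^T P = I
assert np.allclose(P.T @ A @ P, np.diag(lam))   # P^T A P is diagonal
```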
Let us now consider two important applications of this theorem.

Subsection 4.3.1 Quadratic Forms

We start with a definition.

Definition 4.3.7.

Let \(V\) be an \(n\)-dimensional vector space over \(K\text{.}\) A quadratic form on \(V\) is a function
\begin{equation*} Q : V \to K \end{equation*}
for which there exists numbers \(a_{ij} \in K\) and a basis \(\mathcal{B}\) of \(V\) for which
\begin{equation*} Q \left( \left[ \begin{matrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{matrix} \right]_{\mathcal{B}} \right) = \sum_{i = 1}^n \sum_{j = 1}^n a_{ij} x_i x_j . \end{equation*}
There are alternative definitions that do not depend on choosing a basis, but in any definition, one will be able to write out a quadratic form as above. Note that if \(A\) is the \(n \times n\) matrix with entries \((a_{ij})\) and \(\mb{x} = \coord{\mb{v}}{\mathcal{B}}\) is the coordinate vector of the input \(\mb{v}\text{,}\) we obtain the formula
\begin{equation} Q (\mb{x} ) = \mb{x}^T A \mb{x}. \tag{4.3.1} \end{equation}
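For instance (a small sketch of our own), the formula \(Q(\mb{x}) = \mb{x}^T A \mb{x}\) is one line of NumPy; here we borrow the form \(5x^2 - 6xy + 5y^2\) that appears later in this section:

```python
import numpy as np

# Matrix of the quadratic form Q(x, y) = 5x^2 - 6xy + 5y^2.
A = np.array([[5.0, -3.0],
              [-3.0, 5.0]])

def Q(x):
    return x @ A @ x   # x^T A x

x = np.array([1.0, 2.0])
# Check against the polynomial: 5(1)^2 - 6(1)(2) + 5(2)^2 = 5 - 12 + 20 = 13
assert np.isclose(Q(x), 13.0)
```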
Like the determinant, a quadratic form is not a linear function, but rather strongly related to multilinear functions (as you will see in the proof of Sylvester’s Law of Inertia). However, it is one you have certainly seen before in another guise called a conic section.

Example 4.3.8. Quadratic forms in two dimensions.

There are two examples of quadratic forms for \(\mathbb{R}^2\) that you see early in your mathematical education. We will write these as functions of two variables. The first is
\begin{equation*} Q_{++} (x,y) = Q_{++} \left( \twovec{x}{y} \right) = \frac{x^2}{a^2} + \frac{y^2}{b^2}. \end{equation*}
The level set of this quadratic form, which is defined as the set of points in the plane satisfying \(Q (x, y) = c\text{,}\) is an ellipse. Another quadratic form you see is the similar looking
\begin{equation*} Q_{+-} (x,y) = Q_{+-} \left( \twovec{x}{y} \right) = \frac{x^2}{a^2} - \frac{y^2}{b^2}. \end{equation*}
The level set here is a hyperbola. Of course, there are others like
\begin{equation*} Q(x, y) = 5x^2 - 6xy + 5y^2 \end{equation*}
but, as we will see using the Spectral Theorem, this can be made to be like one of our two prototypes above by changing coordinates.
We note here that instead of considering the level curves of these quadratic forms, we could consider their graphs
\begin{equation*} \{ (x, y, z) : z = Q (x,y) \} \subset \mathbb{R}^3 . \end{equation*}
In this case, we would obtain what are known as the paraboloid and hyperbolic paraboloid. We will revisit these graphs later on when we consider multivariable scalar functions.
Representing a quadratic form by a matrix as in equation (4.3.1) depends on the basis \(\mathcal{B}\text{.}\) However, we can restrict our attention to certain types of matrices, which then change only in certain ways.

Proof.

For the first statement, we need only show there is a symmetric matrix \(A_{\mathcal{B}}\) as equation (4.3.1) gives us the existence of some matrix. But \(\mb{x}^T A \mb{x}\) is in fact a symmetric matrix (because it is \(1 \times 1\)), so
\begin{equation*} \mb{x}^T A \mb{x} = \left( \mb{x}^T A \mb{x} \right)^T = \mb{x}^T A^T \mb{x} . \end{equation*}
Another symmetric matrix that we can obtain from \(A\) is
\begin{equation*} A_\mathcal{B} = \frac{1}{2} ( A + A^T ). \end{equation*}
Indeed, we see that
\begin{align*} Q ( \mb{x} ) \amp = \mb{x}^T A \mb{x}, \\ \amp = \frac{1}{2} \left( \mb{x}^T A \mb{x} + \mb{x}^T A \mb{x} \right), \\ \amp = \frac{1}{2} \left( \mb{x}^T A \mb{x} + \mb{x}^T A^T \mb{x} \right), \\ \amp = \mb{x}^T \left[ \frac{1}{2} (A + A^T) \right] \mb{x}, \\ \amp = \mb{x}^T A_\mathcal{B} \mb{x}. \end{align*}
To see that \(A_\mathcal{B}\) is the only symmetric matrix that will produce \(Q\) relative to the coordinates given by \(\mathcal{B}\text{,}\) one observes that the diagonal entries are obtained as \(Q ( \mb{e}_i)\) and the off-diagonal ones as \(\frac{1}{2} [Q ( \mb{e}_i + \mb{e}_j) - Q ( \mb{e}_i) - Q (\mb{e}_j)]\text{.}\)
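Both steps of this argument, the symmetrization \(A_\mathcal{B} = \frac{1}{2}(A + A^T)\) and the recovery of its entries from values of \(Q\) alone, can be verified numerically (a NumPy sketch of ours; the particular matrix is made up):

```python
import numpy as np

A = np.array([[1.0, 4.0],
              [0.0, 3.0]])          # a non-symmetric matrix
A_sym = (A + A.T) / 2               # the unique symmetric representative

def Q(x):
    return x @ A @ x

# The symmetrization represents the same quadratic form:
rng = np.random.default_rng(3)
x = rng.standard_normal(2)
assert np.isclose(Q(x), x @ A_sym @ x)

# Recover A_sym's entries from values of Q alone:
e = np.eye(2)
diag = [Q(e[i]) for i in range(2)]                   # diagonal entries: Q(e_i)
off = 0.5 * (Q(e[0] + e[1]) - Q(e[0]) - Q(e[1]))     # off-diagonal entry
assert np.allclose(np.diag(A_sym), diag)
assert np.isclose(A_sym[0, 1], off)
```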
To see the last statement relating \(A_\mathcal{B}\) to \(A_\mathcal{C}\text{,}\) just note that if \(\mb{v}\) is a vector in \(V\) with coordinates \(\coord{\mb{v}}{\mathcal{B}}\) and \(\coord{\mb{v}}{\mathcal{C}}\) with respect to \(\mathcal{B}\) and \(\mathcal{C}\) respectively, then, letting \(P = \cob{1_V}{\mathcal{C}}{\mathcal{B}}\) be the change of coordinate matrix, we have
\begin{equation*} \coord{\mb{v}}{\mathcal{B}} = P \coord{\mb{v}}{\mathcal{C}} \end{equation*}
which gives
\begin{align*} Q (\mb{v} ) \amp = \left(\coord{\mb{v}}{\mathcal{B}}\right)^T A_\mathcal{B} \coord{\mb{v}} {\mathcal{B}}, \\ \amp = \left( P \coord{\mb{v}}{\mathcal{C}} \right)^T A_{\mathcal{B}} \left( P \coord{\mb{v}}{\mathcal{C}} \right), \\ \amp = \left( \coord{\mb{v}}{\mathcal{C}}\right)^T \left( P^T A_\mathcal{B} P \right) \coord{\mb{v}}{\mathcal{C}}. \end{align*}
Now, by Exercise 4.3.3.2 the matrix \(P^T A_\mathcal{B} P\) is symmetric and since \(A_\mathcal{C}\) is the unique symmetric matrix for which
\begin{equation*} Q (\mb{v} ) = \left(\coord{\mb{v}}{\mathcal{C}}\right)^T A_\mathcal{C} \coord{\mb{v}}{\mathcal{C}} \end{equation*}
we must have
\begin{equation*} P^T A_\mathcal{B} P = A_\mathcal{C} . \end{equation*}
Two matrices like \(A_{\mathcal{B}}\) and \(A_{\mathcal{C}}\) satisfying equation (4.3.2) are called congruent. This lemma has a shocking corollary, which we will state as a theorem.

Proof.

By Lemma 4.3.9, there is a unique symmetric matrix \(A\) which represents \(Q\) relative to the standard basis. But the Spectral Theorem then says that
  1. \(A\) is diagonalizable,
  2. the eigenvalues are all real,
  3. there is an eigenbasis which is an orthonormal basis \(\mathcal{B} = \{\mb{v}_1, \ldots, \mb{v}_n\}\text{.}\)
By Proposition 4.1.17 the fact that \(A\) is diagonalizable means
\begin{equation*} P^{-1} A P = \textnormal{Diag} (\lambda_1 , \ldots, \lambda_n ) \end{equation*}
where the columns of the matrix \(P\) are the eigenvectors \(\mb{v}_1, \ldots, \mb{v}_n\text{.}\) But since these are orthonormal, we have that the matrix \(P^{-1} = P^T\) (check this by writing the rows of \(P^T\) as \(\mb{v}_1^T, \ldots, \mb{v}_n^T\) and observing that the \((i,j)\)-entry of \(P^T P\) is the dot product of the \(i\)-th row of \(P^T\) with the \(j\)-th column of \(P\)). Thus
\begin{equation*} P^T A P = \textnormal{Diag} (\lambda_1 , \ldots, \lambda_n ) \end{equation*}
and Lemma 4.3.9 implies that \(Q\text{,}\) relative to the basis \(\mathcal{B}\text{,}\) is represented by this diagonal matrix. Thus in this coordinate system,
\begin{equation*} Q(x_1, \ldots, x_n) = \mb{x}^T \textnormal{Diag} (\lambda_1 , \ldots, \lambda_n ) \mb{x} = \lambda_1 x_1^2 + \cdots + \lambda_n x_n^2 . \end{equation*}
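This part of the proof compresses to a few lines numerically. In the sketch below (ours, assuming NumPy), `y = P.T @ x` plays the role of the coordinates relative to the orthonormal eigenbasis:

```python
import numpy as np

rng = np.random.default_rng(4)

M = rng.standard_normal((3, 3))
A = (M + M.T) / 2                  # symmetric matrix representing Q

lam, P = np.linalg.eigh(A)         # orthonormal eigenbasis in the columns of P

x = rng.standard_normal(3)
y = P.T @ x                        # coordinates of x in the eigenbasis

# Q(x) = x^T A x equals lambda_1 y_1^2 + ... + lambda_n y_n^2
assert np.isclose(x @ A @ x, np.sum(lam * y**2))
```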
The last statement is not directly implied by what we have above, since it is a statement about changing to any basis, not just orthonormal bases (and is the main point of Sylvester’s Law of Inertia). To prove this last statement, consider the pairing
\begin{equation*} B( \mb{x}, \mb{y} ) = \frac{1}{2} \left[ Q ( \mb{x} + \mb{y} ) - Q ( \mb{x} ) - Q ( \mb{y} ) \right]. \end{equation*}
It is not hard to show that \(B\) satisfies conjugate symmetry (although the conjugate part is not important here because we are working with a real vector space) and linearity. It does not, however, always satisfy the positive definite property.
Suppose exactly \(n_+\) of the eigenvalues \(\lambda_1, \ldots, \lambda_n\) are positive and order them so that these occur as the first ones \(\lambda_1, \ldots, \lambda_{n_+} > 0\) and define the subspace \(V = \textnormal{span} \, \{ \mb{v}_1, \ldots, \mb{v}_{n_+} \}\text{.}\) Observe that \(B\) is in fact a positive definite pairing on \(V\) because if \(\mb{v} = a_1 \mb{v}_1 + \cdots + a_{n_+} \mb{v}_{n_+}\) is non-zero then
\begin{equation*} B ( \mb{v} , \mb{v} ) = \lambda_1 a_1^2 + \cdots + \lambda_{n_+} a_{n_+}^2 > 0 . \end{equation*}
Note also that \(V^\perp = \textnormal{span} \, \{ \mb{v}_{n_+ + 1} , \ldots, \mb{v}_n \}\) (because \(\mathcal{B}\) is an orthonormal basis).
We claim that any vector subspace \(W\) of \(\mathbb{R}^n\) on which \(B\) is positive definite has dimension less than or equal to \(n_+\text{.}\) Indeed, restricting the projection \(\text{proj}_V\) to \(W\) gives a linear transformation
\begin{equation*} \text{proj}_V : W \to V . \end{equation*}
Since \(\dim V = n_+\text{,}\) the Rank-Nullity Theorem says that if \(\dim W \gt n_+\) then there is a non-zero element \(\mb{w}\) in the kernel of \(\textnormal{proj}_V\text{.}\) But the kernel of the projection map is \(V^\perp\) so that \(\mb{w}\) is in the span of \(\{ \mb{v}_{n_+ + 1} , \ldots, \mb{v}_n \}\text{.}\) Thus
\begin{equation*} \mb{w} = b_{n_+ + 1} \mb{v}_{n_+ + 1} + \cdots + b_n \mb{v}_n \end{equation*}
and
\begin{equation*} B ( \mb{w} , \mb{w} ) = b_{n_+ + 1}^2 \lambda_{n_+ + 1} + \cdots + b_n^2 \lambda_n \leq 0 . \end{equation*}
This would contradict that \(B\) is positive definite on \(W\text{.}\) Thus \(n_+\) is indeed the maximal dimension of a subspace on which \(B\) is positive definite. Likewise, we can show that the number \(n_-\) of negative \(\lambda\)’s is the largest dimension of a subspace on which \(B\) is ‘negative definite’ (i.e. \(B( \mb{v} , \mb{v} ) \lt 0\) on non-zero vectors). Finally, \(n_0\) can be calculated as \(n - n_+ - n_-\text{.}\)
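Sylvester's Law of Inertia says the triple \((n_+, n_-, n_0)\) is invariant under congruence. The following sketch (our own; the helper `signature` and its tolerance are ours) tests this on a random congruence \(P^T A P\text{:}\)

```python
import numpy as np

def signature(A, tol=1e-9):
    """Count positive, negative, and (numerically) zero eigenvalues."""
    lam = np.linalg.eigvalsh(A)
    return (int(np.sum(lam > tol)),
            int(np.sum(lam < -tol)),
            int(np.sum(np.abs(lam) <= tol)))

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                  # symmetric
P = rng.standard_normal((4, 4))    # invertible with probability 1
B = P.T @ A @ P                    # congruent to A (and symmetric, per Exercise 4.3.3.2)

assert signature(A) == signature(B)
```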
This surprising theorem has diverse and far-reaching consequences, as we will see later with second derivative tests. However, we can also use it to observe that often complicated-looking formulas are in fact just simple formulas made complicated by a change of coordinates.

Example 4.3.11. Nice coordinates for a quadratic form.

In Example 4.3.8, we wrote the quadratic form
\begin{equation*} Q(x, y) = 5x^2 - 6xy + 5y^2 \end{equation*}
on \(\mathbb{R}^2\text{.}\) The matrix for this quadratic form with respect to the standard basis is
\begin{equation*} A = \left[ \begin{matrix} 5 \amp -3 \\ -3 \amp 5 \end{matrix} \right]. \end{equation*}
Calculating the characteristic polynomial, we obtain
\begin{align*} p_A (t) \amp = \det \left( \left[ \begin{matrix} t - 5 \amp 3 \\ 3 \amp t - 5 \end{matrix} \right] \right) , \\ \amp = (t - 5) (t - 5) - 9, \\ \amp = t^2 - 10t + 16, \\ \amp = \left( t - 2 \right) \left( t - 8 \right). \end{align*}
Finding eigenvectors, we solve \((tI - A) \mb{x} = \mb{0}\) for each eigenvalue and obtain
\begin{equation*} \mathcal{B} = \left\{ \twovec{1}{1} , \twovec{1}{-1} \right\} \end{equation*}
which is orthogonal. Normalizing gives
\begin{equation*} \mathcal{B} = \left\{ \twovec{\sqrt{2}/2}{\sqrt{2}/2} , \twovec{\sqrt{2}/2}{-\sqrt{2}/2} \right\} \end{equation*}
which can be used to obtain the change of basis matrix
\begin{equation*} P = \left[ \begin{matrix} {\sqrt{2}/2} \amp {\sqrt{2}/2} \\ {\sqrt{2}/2} \amp -{\sqrt{2}/2} \end{matrix} \right]. \end{equation*}
In this case, we can observe that \(P = P^{-1} = P^T\text{.}\) Now, in the orthonormal basis \(\mathcal{B}\text{,}\) we have coordinates
\begin{equation*} \twovec{u}{v}_{\mathcal{B}} = u \twovec{\sqrt{2}/2}{\sqrt{2}/2} + v \twovec{\sqrt{2}/2}{- \sqrt{2}/2} = \twovec{\sqrt{2}/2 \left( u + v \right) }{\sqrt{2}/2 \left( u - v \right)}. \end{equation*}
Putting these into \(Q\) gives
\begin{align*} Q\left(\twovec{u}{v}_{\mathcal{B}} \right) \amp = 5 \left(\sqrt{2}/2 ( u + v ) \right)^2 - 6\left(\sqrt{2}/2 ( u + v ) \right)\left(\sqrt{2}/2 ( u - v ) \right) + 5\left(\sqrt{2}/2 ( u - v ) \right)^2, \\ \amp = \frac{5}{2} \left( u^2 + 2uv + v^2 \right) - 3 \left( u^2 - v^2 \right) + \frac{5}{2} \left( u^2 - 2uv + v^2 \right), \\ \amp = 2 u^2 + 8 v^2. \end{align*}
What have we done? Well, the \((u,v)\) coordinate system is just the ordinary Cartesian coordinates rotated by \(45\) degrees. In this system, we have a standard ellipse with major axis along the \(u\)-axis (the rotated \(x\)-axis, where the smaller coefficient \(2\) allows the larger extent) and minor axis along the \(v\)-axis (the rotated \(y\)-axis).
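Everything in this example can be double-checked in a few lines (a NumPy sketch of ours; note that `eigh` returns eigenvalues in ascending order):

```python
import numpy as np

A = np.array([[5.0, -3.0],
              [-3.0, 5.0]])

lam, P = np.linalg.eigh(A)
assert np.allclose(lam, [2.0, 8.0])   # the eigenvalues found above

# In the (u, v) coordinates given by the orthonormal eigenbasis,
# Q becomes 2u^2 + 8v^2:
rng = np.random.default_rng(6)
x = rng.standard_normal(2)
u, v = P.T @ x
assert np.isclose(x @ A @ x, 2 * u**2 + 8 * v**2)
```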
This last example shows a practical way of obtaining good coordinates when faced with a quadratic form, namely ones where the form is the sum of multiples of squares.

Subsection 4.3.2 Covariance Matrices

When modelling a data set, one often will be considering a sample space with several variables. For example, if maple trees were our data set, each maple tree may have a height, age, trunk diameter, etc. We pick an order of such quantities once and for all so that we can numerically record any such sample as a vector in our vector space \(\mathbb{R}^n\text{.}\) Doing so with a large data set gives us a large number of vectors \(S\) in \(\mathbb{R}^n\text{,}\) and this is where we leave trees behind and math takes over.
Now, there are reasons to believe (in certain situations) that our data set \(S\) should be distributed like an ellipse (or an ellipsoid in higher dimensions) around a central point. Such a distribution would be called a multivariate normal distribution. What is meant by ‘distributed’ is that there is some (non-linear) function \(\rho\) on \(\mathbb{R}^n\) which tells us the probability that a sample point will be in a specific region of \(\mathbb{R}^n\text{.}\) Then for any probability \(p\) between \(0\) and \(1\text{,}\) there will be an ellipsoidal ball around the mean in \(\mathbb{R}^n\) for which a sample occurring in that ball will have probability \(p\text{.}\)
Let us first focus on the central point \(\mb{\mu}\text{.}\) This point, if it is the center, should be the average of all of our samples
\begin{equation*} \mb{\mu} := \frac{1}{|S|} \sum_{\mb{X} \in S} \mb{X}. \end{equation*}
If you have some familiarity with basic statistics, you will know that finding the average (or mean) of a data set is only the first step in understanding your distribution of points. The second step is to find the variance, which when \(n = 1\text{,}\) is the average of the squares of the deviations from the mean. In other words, it can be calculated as
\begin{equation} \frac{\sum (X - \mu)^2}{|S|}.\tag{4.3.3} \end{equation}
This is actually the population variance, which will do for our purposes in one dimension. What it measures is how far away (or more precisely, the square of the distance away) a generic sample point is from the mean. However, we have by now realized our world is much richer with many dimensions, and we should also note that the words ‘far away’ indicate an inner product computation in many dimensions. So our notion of variance should certainly be much richer in many dimensions, and indeed it is.
To compute it, we simply take our sample space of shifted vectors, or vectors with respect to the mean, and make it into a very large matrix
\begin{equation*} \mathcal{S} = \left[ \begin{matrix} | \amp | \amp \cdots \amp | \\ \, \mb{X}_1 - \mb{\mu} \amp \, \,\mb{X}_2 - \mb{\mu} \amp \cdots \amp \, \,\mb{X}_N - \mb{\mu} \\ | \amp | \amp \cdots \amp | \end{matrix} \right] \end{equation*}
and write
\begin{equation} \mathbf{\Sigma} = \frac{1}{|S|} \mathcal{S} \mathcal{S}^T.\tag{4.3.4} \end{equation}
Let us observe a few things about this matrix. First, it is often written as \(K_{XX}\) rather than \(\mathbf{\Sigma}\text{.}\) We also note that the one dimensional case reproduces formula (4.3.3). Finally, one can check that this is an \(n \times n\) real symmetric matrix so that
\begin{equation*} \mathbf{\Sigma}^T = \mathbf{\Sigma} . \end{equation*}
If this weren’t good enough, it also has a positive semi-definite property. Namely, if \(\mb{v}\) is any vector in \(\mathbb{R}^n\) then
\begin{equation} \left( \mathbf{\Sigma} \mb{v} \right) \cdot \mb{v} \geq 0 . \tag{4.3.5} \end{equation}
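Equations (4.3.4) and (4.3.5) translate directly into NumPy. In this sketch (ours; the data is random), we also compare against `np.cov` with `bias=True`, which divides by the number of samples as we do:

```python
import numpy as np

rng = np.random.default_rng(7)

X = rng.standard_normal((3, 50))           # 50 samples in R^3, one per column
mu = X.mean(axis=1, keepdims=True)         # the mean vector
S = X - mu                                 # columns are the shifted vectors X_i - mu
Sigma = (S @ S.T) / X.shape[1]             # covariance matrix, equation (4.3.4)

assert np.allclose(Sigma, Sigma.T)         # real symmetric
assert np.allclose(Sigma, np.cov(X, bias=True))  # matches NumPy's population covariance

# Positive semi-definiteness, equation (4.3.5):
v = rng.standard_normal(3)
assert (Sigma @ v) @ v >= 0
```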
So indeed, this covariance matrix is self-adjoint and, as you will show in an exercise, has only non-negative eigenvalues. This in particular means one can take unambiguous square roots of these eigenvalues (which are viewed as multivariable standard deviations). Thus there is an orthonormal eigenbasis \(\mathcal{B} = \{\mb{v}_1, \ldots, \mb{v}_n\}\) for \(\mathbf{\Sigma}\text{,}\) whose eigenvectors can be ordered so that \(\mb{v}_1\) has the largest eigenvalue and the eigenvalues decrease from there. Then in the \(\coord{}{\mathcal{B}}\) coordinate system, our original data set can be seen as normally distributed about the mean with an ellipsoid with equation
\begin{equation*} \left( \frac{x_1}{\sqrt{\lambda_1}} \right)^2 + \cdots + \left( \frac{x_n}{\sqrt{\lambda_n}} \right)^2 = 1 . \end{equation*}
In this coordinate system, we can reduce dimensions to those eigenvectors with the largest eigenvalues (by projecting). Understanding and interpreting the data set in the eigenbasis coordinate system is an area in statistical applications known as ‘principal component analysis’.
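As a rough sketch of principal component analysis (our own illustration; the data and thresholds are made up), we generate correlated two-dimensional data, recover the dominant eigenvector of \(\mathbf{\Sigma}\text{,}\) and project onto it:

```python
import numpy as np

rng = np.random.default_rng(8)

# Correlated 2D data: both coordinates follow a common signal t, plus small noise.
t = rng.standard_normal(200)
X = np.vstack([t + 0.1 * rng.standard_normal(200),
               t + 0.1 * rng.standard_normal(200)])

mu = X.mean(axis=1, keepdims=True)
S = X - mu
Sigma = (S @ S.T) / X.shape[1]

lam, V = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
v1 = V[:, -1]                      # principal direction: largest eigenvalue

# The dominant direction should be (1, 1)/sqrt(2), up to sign:
assert abs(abs(v1 @ np.array([1.0, 1.0])) / np.sqrt(2) - 1.0) < 1e-2

# Dimension reduction: project each centered sample onto the principal direction.
scores = v1 @ S
assert scores.shape == (200,)
```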

Exercises 4.3.3 Exercises

1.

Verify the claim in the proof of Proposition 4.3.4. In particular, show that if \(A^*\) is the conjugate transpose of the matrix \(A = \cob{T}{\mathcal{B}}{\mathcal{C}}\) representing \(T\) with respect to the orthonormal bases \(\mathcal{B}\) of \(V\) and \(\mathcal{C}\) of \(W\text{,}\) then
\begin{equation*} \coord{}{\mathcal{B}}^{-1} \circ A^* \circ \coord{}{\mathcal{C}}: W \to V \end{equation*}
is adjoint to \(T\text{.}\)
Hint.
Verify the adjoint equation for basis vectors \(\mb{v}_i\) and \(\mb{w}_j\text{.}\) Then appeal to linearity.

2.

Two real \(n \times n\)-matrices \(A\) and \(B\) are called congruent if there is an invertible matrix \(P\) with \(P^T A P = B\text{.}\) Show that if \(A\) is symmetric then so is every matrix congruent to \(A\text{.}\)

3.

Suppose one has the (extremely small) data set with vectors
\begin{equation*} S = \left\{ \twovec{1}{2}, \twovec{1}{-3}, \twovec{0}{1}, \twovec{-1}{4}, \twovec{4}{1} \right\}. \end{equation*}
(a)
Compute the mean of this data set.
(b)
Compute the covariance matrix of this data set.
(c)
Find eigenvalues and eigenvectors for the covariance matrix.
(d)
Sketch an ellipse describing the bivariate distribution of this data set.

4.

Show that \(\mathbf{\Sigma}\) is positive semi-definite as in equation (4.3.5). Use this to show that it has only non-negative eigenvalues.

5.

Let
\begin{equation*} Q (x,y) = 3x^2 + 8xy - 3y^2. \end{equation*}
Find an orthonormal basis \(\mathcal{B}\) for \(\mathbb{R}^2\) so that in these coordinates
\begin{equation*} Q (u, v) = \lambda_1 u^2 + \lambda_2 v^2 . \end{equation*}
Describe the conic section \(Q (x,y) = 1\text{.}\)