Notes: The Singular Value Decomposition

I put together some notes on the singular value decomposition, which is really the fundamental theorem of linear algebra. I write these things mainly to teach myself, and as a quick refresher for the next time I confuse myself on a topic, but maybe someone else out there will find them useful. For a more engaging and pedagogical approach, I recommend Jeremy Kun’s post over at Math \(\cap\) Programming.

For inspiration, start by listening to Cornelius Lanczos’s bird’s-eye view of linear algebra, including a slight variation of the singular value decomposition, which he independently rediscovered in 1958.



To check your understanding after working through this post, see if you can figure out how to arrive at his decomposition1 from the standard singular value decomposition.

Definition 1. The right singular vectors of an \( m\times n \) complex matrix2 \( \mathbf{M} \) form a series of \(n\) unit vectors in \( \mathbb{C}^n \), each of which has maximum overlap with \( \mathbf{M} \)’s row vectors in the space orthogonal to its predecessors. \[ \mathbf{v}_k = \underset{\mathbf{v}}{\arg\max} \left\{ |\mathbf{M}\mathbf{v}| : \mathbf{v}_1^\dagger \mathbf{v} = \cdots = \mathbf{v}_{k-1}^\dagger \mathbf{v} = 0 , |\mathbf{v}| = 1 \right\} \] The actual value of the overlap for a given singular vector is its singular value, \( \rho_k \equiv |\mathbf{M}\mathbf{v}_k| \).
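As a quick numerical sanity check on this definition, here is a minimal sketch assuming NumPy and a randomly generated complex matrix. Note that `np.linalg.svd` returns \( \mathbf{V}^\dagger \), whose rows are the \( \mathbf{v}_k^\dagger \) in descending order of singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 3)) + 1j * rng.normal(size=(5, 3))

# Rows of Vh are the conjugate transposes v_k^dagger of the right
# singular vectors, ordered by descending singular value.
_, _, Vh = np.linalg.svd(M)
v1 = Vh[0].conj()

# No random unit vector overlaps the rows of M more than v_1 does.
for _ in range(1000):
    v = rng.normal(size=3) + 1j * rng.normal(size=3)
    v /= np.linalg.norm(v)
    assert np.linalg.norm(M @ v) <= np.linalg.norm(M @ v1) + 1e-12
```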

This definition implies that singular values come in descending order. \[ \rho_1 \geq \rho_2 \geq \cdots \geq \rho_n \geq 0 \] It also implies that the number of nonzero singular values is equal to the rank of \( \mathbf{M} \).3
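The rank statement is easy to check numerically. A small sketch, assuming NumPy and a deliberately rank-deficient matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
# A 6 x 4 matrix of rank 2, built as a product of two rank-2 factors.
M = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4))

s = np.linalg.svd(M, compute_uv=False)
print(s)                          # two nonzero values, two at roundoff level
print(np.linalg.matrix_rank(M))   # 2
```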

Definition 2. \( \mathbf{M} \)’s left singular vectors \( \mathbf{u}_1 , \ldots , \mathbf{u}_m \) maximize overlap with its columns. The corresponding singular values are \( \lambda_k \equiv |\mathbf{M}^\dagger\mathbf{u}_k| \).

The left singular vectors of \( \mathbf{M} \) are the right singular vectors of its adjoint, \( \mathbf{M}^\dagger \). Since a matrix and its adjoint have the same rank, the number of nonzero left and right singular values is always the same. Corollary 2 below will show that the left and right singular values are in fact equal.
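Here is a quick check of this correspondence (a sketch, again assuming NumPy and a random complex matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(5, 3)) + 1j * rng.normal(size=(5, 3))

U, s, _ = np.linalg.svd(M)
_, s_adj, Vh_adj = np.linalg.svd(M.conj().T)

assert np.allclose(s, s_adj)  # the nonzero singular values coincide
for k in range(3):
    # Each right singular vector w_k of M^dagger matches u_k up to
    # a complex phase, so |w_k^dagger u_k| = 1.
    assert np.isclose(abs(Vh_adj[k] @ U[:, k]), 1.0)
```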

Claim 1. \(\mathbf{M}^\dagger \mathbf{M} \mathbf{v}_k = \rho_k^2 \mathbf{v}_k\)

Proof: The \( k^\mathrm{th} \) right singular vector can be determined by constrained optimization of the following Lagrangian \[ \mathcal{L}(\mathbf{v}_k, \boldsymbol{\epsilon}) = \mathbf{v}_k^\dagger \mathbf{M}^\dagger \mathbf{M} \mathbf{v}_k - \sum_{i=1}^k \epsilon_i ( \mathbf{v}_i^\dagger \mathbf{v}_k - \delta_{ik} ) \] where the first term equals \( |\mathbf{M}\mathbf{v}_k|^2 \) and the epsilons are Lagrange multipliers for the constraints. Optimizing with respect to \( \mathbf{v}_k^\dagger \) shows that \( \mathbf{v}_k \) is an eigenvector of \( \mathbf{M}^\dagger\mathbf{M} \) with eigenvalue \( \epsilon_k \). Left-multiplying both sides of the eigenvalue equation by \( \mathbf{v}_k^\dagger \) shows that \( \epsilon_k \) equals \( \rho_k^2 \).
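Numerically, the eigenvalues of \( \mathbf{M}^\dagger\mathbf{M} \) are indeed the squared singular values, and its eigenvectors match the right singular vectors up to phase. A sketch using NumPy’s Hermitian eigensolver, which returns eigenvalues in ascending order:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(5, 3)) + 1j * rng.normal(size=(5, 3))

_, s, Vh = np.linalg.svd(M)
evals, evecs = np.linalg.eigh(M.conj().T @ M)

# eigh returns ascending eigenvalues; reverse to match the SVD ordering.
assert np.allclose(evals[::-1], s**2)
for k in range(3):
    v = evecs[:, ::-1][:, k]
    # Each eigenvector agrees with the corresponding v_k up to a phase.
    assert np.isclose(abs(Vh[k] @ v), 1.0)
```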

This eigenvalue equation is also known as a principal component analysis of the data stored in the row vectors of \( \mathbf{M} \). In this context, the singular vectors are called principal components. If \( \mathbf{V}^{(k)} \) contains the first \( k \) principal components, then \( \mathbf{M} \mathbf{V}^{(k)} \) yields the best4 possible \( k \)-dimensional compression of the data, which can be approximately recovered as \( \mathbf{M} \approx \mathbf{M} \mathbf{V}^{(k)} \mathbf{V}^{(k)\dagger} \). If the remaining singular values \( \rho_{k+1}, \ldots, \rho_n \) are all zero, the row vectors actually live in a \( k \)-dimensional subspace and the compression is lossless.
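A small illustration of this compression, as a sketch with a synthetic data matrix whose rows lie close to a 2-dimensional subspace:

```python
import numpy as np

rng = np.random.default_rng(4)
# 100 data points in R^5 lying near a 2-dimensional subspace.
M = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5))
M += 0.01 * rng.normal(size=(100, 5))   # a little noise

k = 2
_, _, Vh = np.linalg.svd(M)
Vk = Vh[:k].conj().T                  # first k principal components as columns
compressed = M @ Vk                   # 100 x 2: the compressed data
recovered = compressed @ Vk.conj().T  # 100 x 5: approximate reconstruction

print(np.linalg.norm(M - recovered))  # small -- roughly the noise level
```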

Corollary 1. \(\mathbf{M} \mathbf{M}^\dagger \mathbf{u}_k = \lambda_k^2 \mathbf{u}_k\)

Since the left singular vectors of \( \mathbf{M} \) are right singular vectors of \( \mathbf{M}^\dagger \), this follows from claim 1.

Corollary 2. \(\mathbf{M}\mathbf{v}_k=\sigma_k\mathbf{u}_k\) where \(\sigma_k=\rho_k=\lambda_k\) for \(k\leq n\)

Proof: Left-multiplying claim 1 by \( \mathbf{M} \) gives \[ \mathbf{M} \mathbf{M}^\dagger (\mathbf{M}\mathbf{v}_k) = \rho_k^2 (\mathbf{M}\mathbf{v}_k) \] which, by corollary 1, implies that \( \mathbf{M}\mathbf{v}_k \propto \mathbf{u} \) where \( \mathbf{u} \) is a left singular vector with singular value \( \lambda = \rho_k \). Since this holds for all \( k \leq n \) and the number of nonzero left and right singular values is the same, we can identify \( \lambda \) as \( \lambda_k \). Expressing the proportionality relationship as \( \mathbf{M} \mathbf{v}_k = \sigma_k \mathbf{u}_k \) and recalling that \( \mathbf{u}_k \) is a unit vector, we can identify the ratio \( \sigma_k \) as the length of \( \mathbf{M}\mathbf{v}_k \), which equals \( \rho_k \). Finally, note that when \( \sigma_k = 0 \) any vector in the null space of \( \mathbf{M}^\dagger \) can stand in for \( \mathbf{u}_k \).
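This relationship is also easy to verify directly (a sketch; `np.linalg.svd` returns exactly the \( \mathbf{U} \), \( \sigma_k \), and \( \mathbf{V}^\dagger \) of corollary 4 below):

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.normal(size=(5, 3)) + 1j * rng.normal(size=(5, 3))

U, s, Vh = np.linalg.svd(M)
for k in range(3):
    v_k = Vh[k].conj()
    # M v_k = sigma_k u_k, with a shared singular value sigma_k.
    assert np.allclose(M @ v_k, s[k] * U[:, k])
```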

Corollary 3. \(\mathbf{M}^\dagger\mathbf{u}_k=\sigma_k\mathbf{v}_k\) for \(k\leq n\)

The proof is analogous to that of corollary 2: left-multiply corollary 1 by \( \mathbf{M}^\dagger \) and apply claim 1.

For the next result, define \( \mathbf{U} \equiv (\mathbf{u}_1 \cdots\, \mathbf{u}_m) \) and \( \mathbf{V} \equiv (\mathbf{v}_1 \cdots\, \mathbf{v}_n) \), and let \( \boldsymbol{\Sigma} \) be a diagonal \( m\times n \) matrix of singular values. Note that \( \mathbf{U} \) and \( \mathbf{V} \) are unitary, since their columns are orthonormal.

Corollary 4. \(\mathbf{M}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\dagger\)

Corollary 2 implies that \( \mathbf{M} \mathbf{V} = \mathbf{U} \boldsymbol{\Sigma} \). Right-multiplying with \( \mathbf{V}^\dagger = \mathbf{V}^{-1} \) completes the proof.
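The full decomposition can be reassembled directly. A final sketch; note that for a tall matrix, \( \boldsymbol{\Sigma} \) must be padded with zero rows to be \( m\times n \):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 5, 3
M = rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))

U, s, Vh = np.linalg.svd(M)   # U is m x m, Vh is n x n
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s)    # the diagonal m x n matrix of singular values

assert np.allclose(M, U @ Sigma @ Vh)
```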

  1. He describes it in detail starting 18 minutes in. 

  2. Let’s assume a “tall” matrix with \(m\geq n\). Our results will generalize to wide matrices via \(\mathbf{M}^\dagger\). 

  3. Since the vectors \(\mathbf{M}\mathbf{v}_k\) are orthogonal and span the column space of \(\mathbf{M}\), the nonzero ones count its dimension. 

  4. “Best” in the least-squares sense, because it minimizes the norm of the error. 
