# 2. Linear Algebra


This series of posts summarizes the contents of the Mathematics for ML book and notes the things I found memorable or insightful. The book is『MATHEMATICS for MACHINE LEARNING, Marc Peter Deisenroth, A. Aldo Faisal and Cheng Soon Ong, Cambridge University Press. https://mml-book.com』.

This post covers Chapter 2 of the book, "Linear Algebra".

2.2.2. Inverse and Transpose

Unfortunately, not every matrix $\bm{A}$ possesses an inverse $\bm{A}^{-1}$. If this inverse does exist, $\bm{A}$ is called regular / invertible / nonsingular; otherwise, it is called singular / noninvertible.

Remark (Existence of the Inverse of a $2 \times 2$ – matrix ). Consider a matrix

$\bm{A} := \begin{bmatrix}a_{11} & a_{12} \\a_{21} & a_{22} \end{bmatrix}$.

Then $\bm{A}$ is invertible if and only if $a_{11}a_{22} - a_{12}a_{21} \neq 0$, and in that case

$\bm{A}^{-1} = \dfrac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{bmatrix}a_{22} & -a_{12} \\-a_{21} & a_{11} \end{bmatrix}$
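As a quick sanity check, this closed-form inverse can be verified numerically; a minimal sketch (the matrix values are arbitrary):

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]  # a11*a22 - a12*a21
assert det != 0, "A is singular; no inverse exists"

# Closed-form inverse: swap the diagonal, negate the off-diagonal, divide by det.
A_inv = np.array([[ A[1, 1], -A[0, 1]],
                  [-A[1, 0],  A[0, 0]]]) / det

assert np.allclose(A @ A_inv, np.eye(2))     # A A^{-1} = I
assert np.allclose(A_inv, np.linalg.inv(A))  # agrees with numpy
```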

2.3.1. Particular and General Solution

Consider the system of equations

$\begin{bmatrix}1 & 0 & 8 & -4 \\0 & 1 & 2 & 12 \end{bmatrix} \begin{bmatrix} x_1 \\x_2\\x_3\\x_4 \end{bmatrix} = \begin{bmatrix}42 \\ 8 \end{bmatrix}$

The system has two equations and four unknowns. Therefore, in general we would expect infinitely many solutions.

A particular solution is $[42, 8, 0, 0]^T$, since the first two columns of the matrix reproduce $\bm{b}$:

$\bm{b} = \begin{bmatrix} 42\\8\end{bmatrix} = 42 \begin{bmatrix} 1\\0 \end{bmatrix} + 8\begin{bmatrix} 0\\1 \end{bmatrix}$.

To capture all other solutions, we generate $\bold{0}$ in a non-trivial way using the columns of the matrix: adding $\bold{0}$ to our special solution does not change it. To do so, we express the third column using the first two columns

$\begin{bmatrix} 8\\2 \end{bmatrix} = 8\begin{bmatrix} 1\\0 \end{bmatrix} + 2\begin{bmatrix} 0\\1 \end{bmatrix}$

so that

$\bold{0}=8\bm{c}_1 + 2\bm{c}_2 -1\bm{c}_3 + 0\bm{c}_4$ and $(x_1, x_2, x_3, x_4)=(8, 2, -1, 0)$. In fact, any scaling of this solution by $\lambda_1 \in \R$ produces the $\bm{0}$ vector, i.e.,

$\begin{bmatrix} 1&0&8&-4\\0&1&2&12\end{bmatrix} \Biggr( \lambda_1 \begin{bmatrix} 8\\2\\-1\\0\end{bmatrix} \Biggr) = \lambda_1 (8\bm{c}_1 + 2\bm{c}_2 - \bm{c}_3) = \bold{0}$

Following the same line of reasoning, we express the fourth column of the matrix using the first two columns and generate another set of non-trivial versions of $\bold{0}$ as

$\begin{bmatrix} 1&0&8&-4\\0&1&2&12\end{bmatrix} \Biggr( \lambda_2 \begin{bmatrix} -4\\12\\0\\-1\end{bmatrix} \Biggr) = \lambda_2 (-4\bm{c}_1 + 12\bm{c}_2 - \bm{c}_4) = \bold{0}$.

Putting everything together, we obtain all solutions of the equation system, which is called the general solution, as the set

$\Biggr\lbrace \bm{x} \in \R^4 : \bm{x} = \begin{bmatrix} 42\\8\\0\\0\end{bmatrix} + \lambda_1 \begin{bmatrix} 8\\2\\-1\\0\end{bmatrix} + \lambda_2 \begin{bmatrix} -4\\12\\0\\-1\end{bmatrix}, \lambda_1, \lambda_2 \in \R \Biggr \rbrace$
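This general solution can be checked numerically; the sketch below verifies that every member of the set (for a few arbitrarily chosen $\lambda_1, \lambda_2$) solves the system:

```python
import numpy as np

A = np.array([[1.0, 0.0, 8.0, -4.0],
              [0.0, 1.0, 2.0, 12.0]])
b = np.array([42.0, 8.0])

x_p = np.array([42.0,  8.0,  0.0,  0.0])   # particular solution
n1  = np.array([ 8.0,  2.0, -1.0,  0.0])   # homogeneous solution 1
n2  = np.array([-4.0, 12.0,  0.0, -1.0])   # homogeneous solution 2

# Any choice of lambda1, lambda2 yields a valid solution.
for lam1, lam2 in [(0, 0), (1, -2), (3.5, 0.25)]:
    x = x_p + lam1 * n1 + lam2 * n2
    assert np.allclose(A @ x, b)
```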

[Remark] The general approach we followed consisted of the following three steps:
1. Find a particular solution to $\bm{A}\bm{x} = \bm{b}$.
2. Find all solutions to $\bm{A}\bm{x} = \bold{0}$.
3. Combine the solutions from steps 1. and 2. into the general solution.

2.3.2. Elementary Transformations


Definition 2.6 (Row-Echelon Form). A matrix is in row-echelon form if
– All rows that contain only zeros are at the bottom of the matrix; correspondingly, all rows that contain at least one nonzero element are on top of rows that contain only zeros.
– Looking at nonzero rows only, the first nonzero number from the left ( also called the pivot or the leading coefficient ) is always strictly to the right of the pivot of the row above it.

[Remark] (Reduced Row Echelon Form). An equation system is in reduced row-echelon form if:
– It is in row-echelon form.
– Every pivot is 1.
– The pivot is the only nonzero entry in its column.

[Remark] (Gaussian Elimination). Gaussian elimination is an algorithm that performs elementary transformations to bring a system of linear equations into reduced row-echelon form.
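As an illustration, a minimal (non-production) implementation of Gaussian elimination to reduced row-echelon form might look as follows; the function name `rref` and the tolerance are my own choices, not from the book:

```python
import numpy as np

def rref(M, tol=1e-12):
    """Bring M into reduced row-echelon form via Gaussian elimination.
    A minimal sketch with partial pivoting; not a production routine."""
    M = M.astype(float).copy()
    rows, cols = M.shape
    r = 0
    for c in range(cols):
        if r == rows:
            break
        # Pick the largest-magnitude entry in column c as pivot (for stability).
        p = r + np.argmax(np.abs(M[r:, c]))
        if abs(M[p, c]) < tol:
            continue                      # no pivot in this column
        M[[r, p]] = M[[p, r]]             # row swap
        M[r] /= M[r, c]                   # scale pivot row so the pivot is 1
        for i in range(rows):             # eliminate column c in all other rows
            if i != r:
                M[i] -= M[i, c] * M[r]
        r += 1
    return M

# The augmented system [A | b] from Section 2.3.1 is already in RREF,
# so rref leaves it unchanged.
A = np.array([[1.0, 0.0, 8.0, -4.0, 42.0],
              [0.0, 1.0, 2.0, 12.0,  8.0]])
assert np.allclose(rref(A), A)
```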

2.3.3. The Minus-1 Trick

The Minus-1 trick is a practical trick for reading out the solutions $\bm{x}$ of a homogeneous system of linear equations $\bm{A}\bm{x} = \bold{0}$, where $\bm{A} \in \R^{k \times n}$ and $\bm{x} \in \R^n$.

To start, we assume that $\bm{A}$ is in reduced row-echelon form without any rows that just contain zeros, i.e.,

$\bm{A}=\begin{bmatrix}0& \cdots &0 &\bold{1}& *& \cdots & *& 0& *& \cdots &*& 0& *& \cdots & * \\\vdots & \quad &\vdots &0& 0& \cdots & 0& \bold{1} & *& \cdots &*& \vdots & \vdots& \quad & \vdots \\\vdots & \quad &\vdots &\vdots& \vdots& \quad & \vdots& 0 & \vdots& \quad &\vdots& \vdots & \vdots& \quad & \vdots \\\vdots & \quad &\vdots &\vdots& \vdots& \quad & \vdots& \vdots & \vdots& \quad &\vdots& 0 & \vdots& \quad & \vdots \\0 & \cdots &0 &0& 0& \cdots & 0& 0 & 0& \cdots &0& \bold{1} & *& \cdots & * \end{bmatrix}$

where $*$ can be an arbitrary real number, with the constraints that the first nonzero entry per row must be 1 and all other entries in the corresponding column must be 0. The columns $j_1, ..., j_k$ with the pivots (marked in bold) are the standard unit vectors $\bm{e}_1, ..., \bm{e}_k \in \R^k$. We extend this matrix to an $n \times n$-matrix $\tilde{\bm{A}}$ by adding $n - k$ rows of the form

$\begin{bmatrix} 0&\cdots&0&-1&0&\cdots&0\end{bmatrix}$

so that the diagonal of the augmented matrix $\tilde{\bm{A}}$ contains either 1 or -1. Then, the columns of $\tilde{\bm{A}}$ that contain -1 as a pivot are solutions of the homogeneous system $\bm{A}\bm{x}=\bold{0}$, which we will later call the kernel or null space.

<Example 2.8> (Minus-1 Trick)

Let us revisit the matrix in (2.49), which is already in reduced row-echelon form:

$\bm{A} = \begin{bmatrix} 1&3&0&0&3\\0&0&1&0&9\\0&0&0&1&-4\end{bmatrix}$

We now augment this matrix to a $5 \times 5$ matrix by adding rows of the form $\begin{bmatrix} 0&\cdots&0&-1&0&\cdots&0\end{bmatrix}$ at the places where the pivots on the diagonal are missing and obtain

$\tilde{\bm{A}}= \begin{bmatrix} 1&3&0&0&3\\0&-1&0&0&0\\0&0&1&0&9\\0&0&0&1&-4\\0&0&0&0&-1 \end{bmatrix}$

From this form, we can immediately read out the solutions of $\bm{A}\bm{x}=\bold{0}$ by taking the columns of $\tilde{\bm{A}}$, which contain -1 on the diagonal:

$\Biggr\lbrace \bm{x} \in \R^5 : \bm{x} = \lambda_1 \begin{bmatrix} 3\\-1\\0\\0\\0\end{bmatrix} + \lambda_2 \begin{bmatrix} 3\\0\\9\\-4\\-1\end{bmatrix}, \lambda_1, \lambda_2 \in \R \Biggr \rbrace$.
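The trick is easy to automate; below is a sketch (variable names are my own) that builds $\tilde{\bm{A}}$ for Example 2.8 and checks that the columns containing -1 on the diagonal solve the homogeneous system:

```python
import numpy as np

# RREF matrix from Example 2.8.
A = np.array([[1.0, 3.0, 0.0, 0.0,  3.0],
              [0.0, 0.0, 1.0, 0.0,  9.0],
              [0.0, 0.0, 0.0, 1.0, -4.0]])

n = A.shape[1]
pivot_cols = [int(np.flatnonzero(row)[0]) for row in A]   # columns 0, 2, 3

# Place each pivot row on "its" diagonal row; the gaps keep a -1 on the diagonal.
A_tilde = -np.eye(n)
for row, col in zip(A, pivot_cols):
    A_tilde[col] = row

# Columns whose diagonal entry is -1 span the null space of A.
free_cols = [j for j in range(n) if j not in pivot_cols]  # columns 1, 4
for j in free_cols:
    assert np.allclose(A @ A_tilde[:, j], 0.0)
```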

2.3.4. Algorithms for Solving a System of Linear Equations

Under mild assumptions we can use the transformation

$\bm{A}\bm{x} = \bm{b} \iff \bm{A}^T\bm{A}\bm{x} =\bm{A}^T\bm{b} \iff \bm{x} = (\bm{A}^T\bm{A})^{-1}\bm{A}^T\bm{b}$

and use the Moore-Penrose pseudo-inverse $(\bm{A}^T\bm{A})^{-1}\bm{A}^T$ to determine the solution that solves $\bm{A}\bm{x} =\bm{b}$, which also corresponds to the minimum norm least-squares solution. A disadvantage of this approach is that it requires many computations for the matrix-matrix product $\bm{A}^T\bm{A}$ and for computing its inverse.
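For a matrix with full column rank (an assumption in this sketch, so that $\bm{A}^T\bm{A}$ is invertible), the normal-equations solution agrees with NumPy's least-squares routine:

```python
import numpy as np

# Overdetermined system: 3 equations, 2 unknowns (values chosen arbitrarily).
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([1.0, 0.0, 2.0])

# Pseudo-inverse solution (A^T A)^{-1} A^T b ...
x_pinv = np.linalg.inv(A.T @ A) @ A.T @ b
# ... matches the dedicated least-squares solver.
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

assert np.allclose(x_pinv, x_lstsq)
```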

Gaussian elimination is an intuitive and constructive way to solve a system of linear equations with thousands of variables. However, for systems with millions of variables, it is impractical as the required number of arithmetic operations scales cubically in the number of simultaneous equations.

In practice, systems of many linear equations are solved indirectly, by either stationary iterative methods, such as the Richardson method, the Jacobi method, the Gauss-Seidel method, and the successive over-relaxation method, or Krylov subspace methods, such as conjugate gradients, generalized minimal residual, or biconjugate gradients.

Let $\bm{x}_*$ be a solution of $\bm{A}\bm{x} = \bm{b}$. The key idea of these iterative methods is to set up an iteration of the form

$\bm{x}^{(k+1)} = \bm{C}\bm{x}^{(k)} + \bm{d}$

for suitable $\bm{C}$ and $\bm{d}$ that reduces the residual error $\|\bm{x}^{(k+1)} - \bm{x}_* \|$ in every iteration and converges to $\bm{x}_*$. We will introduce norms $\| \cdot \|$, which allow us to compute similarities between vectors, in Section 3.1.
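As a concrete instance, the Jacobi method fits this template with $\bm{C} = -\bm{D}^{-1}(\bm{A} - \bm{D})$ and $\bm{d} = \bm{D}^{-1}\bm{b}$, where $\bm{D}$ is the diagonal of $\bm{A}$. A minimal sketch (the example matrix is chosen to be strictly diagonally dominant so that the iteration converges):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 5.0]])
b = np.array([1.0, 2.0])

D = np.diag(np.diag(A))          # diagonal part of A
D_inv = np.diag(1.0 / np.diag(A))
C = -D_inv @ (A - D)             # iteration matrix
d = D_inv @ b

# Iterate x^{(k+1)} = C x^{(k)} + d from an arbitrary starting point.
x = np.zeros(2)
for _ in range(50):
    x = C @ x + d

assert np.allclose(A @ x, b)     # converged to a solution of Ax = b
```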

2.4.1 Groups

Definition 2.7 (Group). Consider a set $\mathcal{G}$ and an operation $\otimes: \mathcal{G} \times \mathcal{G} \rightarrow \mathcal{G}$ defined on $\mathcal{G}$. Then $G := (\mathcal{G}, \otimes)$ is called a group if the following hold:

1. Closure of $\mathcal{G}$ under $\otimes: \forall x, y \in \mathcal{G}: x \otimes y \in \mathcal{G}$
2. Associativity: $\forall x, y, z \in \mathcal{G}: (x \otimes y) \otimes z = x \otimes (y \otimes z)$
3. Neutral element: $\exists e \in \mathcal{G}\quad \forall x \in \mathcal{G} : x \otimes e = x \quad \text{and}\quad e \otimes x = x$
4. Inverse element: $\forall x \in \mathcal{G}\quad \exists y \in \mathcal{G}: x \otimes y = e\quad \text{and}\quad y \otimes x = e$. We often write $x^{-1}$ to denote the inverse element of $x$.

If additionally $\forall x, y \in \mathcal{G} : x \otimes y = y \otimes x$, then $G = (\mathcal{G}, \otimes)$ is an Abelian group.
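For a small finite example, the group axioms can be verified by brute force; the sketch below checks that the integers modulo 5 under addition form an Abelian group:

```python
from itertools import product

G = range(5)                     # the set {0, 1, 2, 3, 4}
op = lambda x, y: (x + y) % 5    # addition modulo 5

# Closure
assert all(op(x, y) in G for x, y in product(G, G))
# Associativity
assert all(op(op(x, y), z) == op(x, op(y, z)) for x, y, z in product(G, G, G))
# Neutral element e = 0
e = 0
assert all(op(x, e) == x and op(e, x) == x for x in G)
# Every element has an inverse
assert all(any(op(x, y) == e and op(y, x) == e for y in G) for x in G)
# Commutativity, so the group is Abelian
assert all(op(x, y) == op(y, x) for x, y in product(G, G))
```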

2.4.2. Vector Spaces

Definition 2.9 (Vector Space). A real-valued vector space $V = (\mathcal{V}, +, \cdot)$ is a set $\mathcal{V}$ with two operations

$\begin{aligned} + &: \mathcal{V} \times \mathcal{V} \rightarrow \mathcal{V} \\\cdot &: \R \times \mathcal{V} \rightarrow \mathcal{V} \end{aligned}$

where
1. $(\mathcal{V}, +)$ is an Abelian group
2. Distributivity:
$\quad$1. $\forall \lambda \in \R, \bm{x}, \bm{y} \in \mathcal{V}: \lambda \cdot (\bm{x} + \bm{y}) = \lambda \cdot \bm{x} + \lambda \cdot \bm{y}$
$\quad$2. $\forall \lambda, \psi \in \R, \bm{x} \in \mathcal{V}: (\lambda + \psi) \cdot \bm{x} = \lambda \cdot \bm{x} + \psi \cdot \bm{x}$
3. Associativity (outer operation): $\forall \lambda, \psi \in \R, \bm{x} \in \mathcal{V}: \lambda(\psi \cdot \bm{x}) = (\lambda \psi) \cdot \bm{x}$
4. Neutral element with respect to the outer operation: $\forall \bm{x} \in \mathcal{V}: 1 \cdot \bm{x} = \bm{x}$

The elements $\bm{x} \in V$ are called vectors.

2.4.3. Vector Subspaces

Definition 2.10 (Vector Subspace). Let $V = (\mathcal{V}, +, \cdot)$ be a vector space and $\mathcal{U} \subseteq \mathcal{V}, \mathcal{U} \not = \emptyset$. Then $U = (\mathcal{U}, +, \cdot)$ is called vector subspace of $V$ if $U$ is a vector space with the vector space operations $+$ and $\cdot$ restricted to $\mathcal{U} \times \mathcal{U}$ and $\R \times \mathcal{U}$. We write $U \subseteq V$ to denote a subspace $U$ of $V$.

To determine whether $(\mathcal{U}, +, \cdot)$ is a subspace of $V$, we still need to show:
1. $\mathcal{U} \not = \emptyset$, in particular: $\bold{0} \in \mathcal{U}$
2. Closure of $U$:
$\quad$ a. With respect to the outer operation: $\forall \lambda \in \R \quad \forall \bm{x} \in \mathcal{U}: \lambda \bm{x} \in \mathcal{U}$.
$\quad$ b. With respect to the inner operation: $\forall \bm{x}, \bm{y} \in \mathcal{U}: \bm{x} + \bm{y} \in \mathcal{U}$.

<Example 2.12>

• For every vector space $V$, the trivial subspaces are $V$ itself and $\{ \bold{0} \}$

2.5 Linear Independence

Definition 2.11 (Linear Combination). Consider a vector space $V$ and a finite number of vectors $\bm{x}_1, ..., \bm{x}_k \in V$. Then, every $\bm{v} \in V$ of the form

$\bm{v} = \lambda_1\bm{x}_1 + \cdots + \lambda_k\bm{x}_k = \sum^k_{i=1} \lambda_i \bm{x}_i \in V$

with $\lambda_1, ..., \lambda_k \in \R$ is a linear combination of the vectors $\bm{x}_1, ..., \bm{x}_k$.

Definition 2.12 (Linear (In)dependence). Let us consider a vector space $V$ with $k \in \N$ and $\bm{x}_1, ..., \bm{x}_k \in V$. If there is a non-trivial linear combination such that $\bold{0} = \sum^k_{i=1} \lambda_i\bm{x}_i$ with at least one $\lambda_i \not = 0$, the vectors $\bm{x}_1, ..., \bm{x}_k$ are linearly dependent. If only the trivial solution exists, i.e., $\lambda_1 = ... = \lambda_k = 0$, the vectors $\bm{x}_1, ..., \bm{x}_k$ are linearly independent.

In practice, one writes the vectors as the columns of a matrix and performs Gaussian elimination to row-echelon form: all column vectors are linearly independent if and only if all columns are pivot columns.
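Numerically, this is equivalent to checking that the rank of the matrix formed by the vectors equals the number of columns; a minimal sketch:

```python
import numpy as np

# Columns of X are linearly independent (upper triangular, nonzero diagonal).
X = np.array([[1.0, 0.0, 8.0],
              [0.0, 1.0, 2.0],
              [0.0, 0.0, 1.0]])

# Columns of Y are linearly dependent: the second is 2 times the first.
Y = np.array([[1.0, 2.0],
              [2.0, 4.0]])

assert np.linalg.matrix_rank(X) == X.shape[1]   # every column is a pivot column
assert np.linalg.matrix_rank(Y) <  Y.shape[1]   # not all columns are pivot columns
```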

(The summary from Section 2.6 onwards will continue in the following post…)