I am currently working through the Mathematics for Machine Learning textbook by Deisenroth et al. and use this space to explain things in my own words, which I find to be the best test of understanding. If you find any mistakes, please let me know!
A system of linear equations can be compactly represented by \(A \boldsymbol{x} = \boldsymbol{b}\) and will either have no solutions, exactly one solution, or infinitely many solutions. For example, the system $$ \begin{aligned} x_1 + x_2 + x_3 &= 3 \\ x_1 - x_2 + 2x_3 &= 2 \\ 2x_1 + 3x_3 &= 5 \end{aligned} $$ can be written as $$ \begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & 2 \\ 2 & 0 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \\ 5 \end{bmatrix} $$ Gaussian elimination is an algorithm that enables us to solve these systems by putting the matrix \(A\) into row echelon form or (better yet) reduced row echelon form. In this example, we would construct the augmented matrix $$ \left[\begin{array}{rrr|r} 1 & 1 & 1 & 3 \\ 1 & -1 & 2 & 2 \\ 2 & 0 & 3 & 5 \end{array}\right] $$ and perform elementary row operations until we obtain $$ \left[\begin{array}{rrr|r} 1 & 0 & \frac{3}{2} & \frac{5}{2} \\ 0 & 1 & -\frac{1}{2} & \frac{1}{2} \\ 0 & 0 & 0 & 0 \end{array}\right], $$ which tells us that \(x_1+\frac{3}{2}x_3 = \frac{5}{2}\) and \(x_2-\frac{1}{2}x_3 = \frac{1}{2}\). In this case, we have only two equations with three unknowns, so we introduce a free variable \(t=x_3\). Solving for \(x_1\) and \(x_2\), we obtain $$ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} \frac{5-3t}{2} \\ \frac{1+t}{2} \\ t \end{bmatrix} = \begin{bmatrix} \frac{5}{2} \\ \frac{1}{2} \\ 0 \end{bmatrix} + t \begin{bmatrix} \frac{-3}{2} \\ \frac{1}{2} \\ 1 \end{bmatrix} $$ where \(t \in \mathbb{R}\). Therefore, this system has infinitely many solutions. Recognizing that this solution describes a line in \(\mathbb{R}^3\), we can imagine that there is an infinite line of solutions that satisfy our equation.
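If you'd like to verify this by machine, here is a minimal sketch in Python using sympy (my own choice of tool here, not something from the book) that row-reduces the augmented matrix and checks the parametric solution:

```python
import sympy as sp

# Augmented matrix [A | b] from the example above.
M = sp.Matrix([[1,  1, 1, 3],
               [1, -1, 2, 2],
               [2,  0, 3, 5]])

rref, pivots = M.rref()
print(rref)    # Matrix([[1, 0, 3/2, 5/2], [0, 1, -1/2, 1/2], [0, 0, 0, 0]])
print(pivots)  # (0, 1) -> the third column has no pivot, so x3 is free

# Parametric solution: x = [5/2, 1/2, 0]^T + t * [-3/2, 1/2, 1]^T
t = sp.symbols('t')
x = sp.Matrix([sp.Rational(5, 2) - sp.Rational(3, 2) * t,
               sp.Rational(1, 2) + sp.Rational(1, 2) * t,
               t])
A, b = M[:, :3], M[:, 3]
print(sp.simplify(A * x - b))  # zero vector for every value of t
```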
For a more intuitive understanding of what that means, it is helpful to remember that \(A \in \mathbb{R}^{m \times n} \) represents a linear transformation \( T(\boldsymbol{x}): \mathbb{R}^n \rightarrow \mathbb{R}^m \). In other words, the equation \(A\boldsymbol{x} = \boldsymbol{b}\) is really asking, "what vector \(\boldsymbol{x}\) will land on \(\boldsymbol{b}\) after the transformation \(A\) is applied?" Our solution says that an infinite line of vectors will land on the vector \(\boldsymbol{b}\).
Let's dig a little deeper by thinking for a moment about just an arbitrary vector \([x_1,x_2,x_3]^\top \in \mathbb{R}^3\). Another way to express this vector is as a linear combination of three more fundamental vectors: $$ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = x_1 \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + x_2 \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} + x_3 \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} $$
These vectors \(\boldsymbol{e}_1=[1,0,0]^\top, \boldsymbol{e}_2=[0,1,0]^\top, \boldsymbol{e}_3=[0,0,1]^\top \) are called standard basis vectors. Going back to our example above, the matrix \(A \in \mathbb{R}^{3 \times 3}\) represents a linear transformation \(T(\boldsymbol{x}):\mathbb{R}^3 \rightarrow \mathbb{R}^3\). In other words, \(A\) maps every vector in \(\mathbb{R}^3\) to a new vector in \(\mathbb{R}^3\), including the standard basis vectors. Because the transformation is linear, we can write that $$ T\left(\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \right) = x_1 T\left( \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \right) + x_2 T\left( \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \right) + x_3 T\left( \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \right) $$
This tells us that the transformed version of any vector \([x_1,x_2,x_3]^\top\) can be found by scaling the transformed standard basis vectors by \(x_1, x_2, x_3\) and adding them up. Therefore, all the information about \(T(\boldsymbol{x})\) is encoded by the transformed versions of the standard basis vectors. Given that \(T(\boldsymbol{x})\) is also described by the matrix multiplication \(A\boldsymbol{x}\), it follows from the above that the columns of \(A\) are precisely the transformed standard basis vectors.
In our example above, therefore, the transformation matrix $$ A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & 2 \\ 2 & 0 & 3 \end{bmatrix} $$ maps the standard basis vectors to \([1,1,2]^\top, [1,-1,0]^\top, [1,2,3]^\top \) and, as we found in the example above, it maps an infinite line of vectors onto \(\boldsymbol{b}=[3,2,5]^\top\).
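Here is a quick numerical sanity check of both claims, using numpy (the particular values of \(t\) are just arbitrary illustrations):

```python
import numpy as np

A = np.array([[1,  1, 1],
              [1, -1, 2],
              [2,  0, 3]], dtype=float)

# The image of each standard basis vector is the corresponding column of A.
e1, e2, e3 = np.eye(3)
print(A @ e1)  # [1. 1. 2.]   -> first column
print(A @ e2)  # [ 1. -1. 0.] -> second column
print(A @ e3)  # [1. 2. 3.]   -> third column

# Every point on the solution line found above really does land on b.
b = np.array([3.0, 2.0, 5.0])
for t in (-2.0, 0.0, 1.0, 7.5):
    x = np.array([2.5, 0.5, 0.0]) + t * np.array([-1.5, 0.5, 1.0])
    assert np.allclose(A @ x, b)
```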
You might be wondering, why did we get infinitely many solutions? When would we have no solutions, or exactly one solution? To answer these questions, it will be helpful to define a number of concepts, including vector spaces, linear (in)dependence, span, and bases.
Let's start with vector spaces. A real-valued vector space \(V\) is a set of vectors equipped with two operations, addition between two vectors and multiplication between a vector and a scalar, such that the set is closed under both operations (and a handful of axioms, like associativity and distributivity, hold).
Consider a set of vectors \(\boldsymbol{v}_1,\boldsymbol{v}_2,\dots \boldsymbol{v}_k \in V\). These vectors are said to be linearly independent if $$ c_1 \boldsymbol{v}_1 + c_2 \boldsymbol{v}_2 + \dots + c_k \boldsymbol{v}_k = \boldsymbol{0} $$ only holds for \(c_1=c_2=\dots=c_k=0\). If, however, any of the vectors can be expressed as a linear combination of the others, then the vectors are said to be linearly dependent.
Returning to our example, let's check whether the columns of \(A\) are linearly independent. To do this, we have to solve yet another system of equations. In particular, we want to know whether $$ \begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & 2 \\ 2 & 0 & 3 \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, $$ is solved only by \(c_1=c_2=c_3=0\). By definition, this would imply that the columns of \(A\) are linearly independent. Applying the same row operations as before (the all-zero right-hand side stays zero throughout), we obtain $$ \left[\begin{array}{rrr|r} 1 & 0 & \frac{3}{2} & 0 \\ 0 & 1 & -\frac{1}{2} & 0 \\ 0 & 0 & 0 & 0 \end{array}\right] \implies \begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix} = t \begin{bmatrix} \frac{-3}{2} \\ \frac{1}{2} \\ 1 \end{bmatrix} $$ where \(t \in \mathbb{R}\). Therefore, we do not require \(c_1=c_2=c_3=0\) to solve this system, and so the columns of \(A\) are linearly dependent.
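The same check in code, under the same assumptions as the earlier snippets (sympy for the exact null space, numpy for the rank):

```python
import numpy as np
import sympy as sp

A = sp.Matrix([[1,  1, 1],
               [1, -1, 2],
               [2,  0, 3]])

# A non-trivial null space means some c != 0 solves A c = 0,
# i.e. the columns are linearly dependent.
print(A.nullspace())  # [Matrix([[-3/2], [1/2], [1]])]
print(A.rank())       # 2 (fewer than 3 columns are independent)

# The same conclusion, numerically:
A_np = np.array([[1, 1, 1], [1, -1, 2], [2, 0, 3]], dtype=float)
print(np.linalg.matrix_rank(A_np))  # 2
```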
This brings us to our next key concept: span. The span of a set of vectors \(\boldsymbol{v}_1,\boldsymbol{v}_2,\dots \boldsymbol{v}_k \in V\), denoted \(\text{span}[\boldsymbol{v}_1,\boldsymbol{v}_2,\dots \boldsymbol{v}_k]\), is simply the set of all linear combinations of those vectors. Imagine scaling and adding these vectors every which way and think about the geometries that these linear combinations would "sweep out."
Looking at the columns of the reduced row echelon version of \(A\), we can see that any column of \(A\) can be obtained by a linear combination of the other two columns. Therefore, we can imagine that the three column vectors exist on a single two-dimensional plane in \(\mathbb{R}^3\). In other words, the vectors span a two-dimensional plane.
Another way to think about this is that there is a kind of "redundancy" among these three vectors - if all three of them exist in the same 2D plane, then we really only need two of them to span that plane. In particular, we can say that $$ \text{span}\big[ \begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix} \big] = \text{span}\big[ \begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \big], $$ and that the two vectors on the left-hand side of this equality form a basis for the 2D plane that they span. In general, a basis of a vector space \(V\) is a set of linearly independent vectors that span \(V\). A basis is also called a minimal generating set because it contains the minimum number of vectors required to reach any element of \(V\) through linear combinations.
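We can verify this span equality numerically: adding the third vector to the first two does not increase the rank, and the third vector is an explicit combination of the other two (a small sketch with numpy):

```python
import numpy as np

a1 = np.array([1.0, 1.0, 2.0])
a2 = np.array([1.0, -1.0, 0.0])
a3 = np.array([1.0, 2.0, 3.0])

# Appending a3 does not enlarge the span: the rank stays at 2.
print(np.linalg.matrix_rank(np.column_stack([a1, a2])))      # 2
print(np.linalg.matrix_rank(np.column_stack([a1, a2, a3])))  # 2

# Explicitly, a3 = (3/2) a1 - (1/2) a2, so it is "redundant".
print(np.allclose(a3, 1.5 * a1 - 0.5 * a2))  # True
```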
As we said earlier, \(\boldsymbol{e}_1=[1,0,0]^\top, \boldsymbol{e}_2=[0,1,0]^\top, \boldsymbol{e}_3=[0,0,1]^\top \) are the standard basis vectors for \(\mathbb{R}^3\). Indeed, they are linearly independent and span the space. However, they are by no means the only \(\mathbb{R}^3\) basis possible.
In fact, if you randomly pick three vectors in \(\mathbb{R}^3\) (i.e., you randomly select their coordinates), the three vectors will almost surely (yes, that is the technical term) be linearly independent and be a basis for \(\mathbb{R}^3\). Conversely, they will almost never just happen to lie in the same plane, or on the same line.
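A quick empirical sanity check of this (not a proof, just a simulation with numpy's random generator):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw three random vectors in R^3 many times; they are linearly
# independent exactly when the 3x3 matrix they form has rank 3.
trials = 10_000
full_rank = sum(
    np.linalg.matrix_rank(rng.standard_normal((3, 3))) == 3
    for _ in range(trials)
)
print(full_rank / trials)  # 1.0 -- a dependent triple has probability zero
```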
The standard basis vectors are important because they capture our implicit assumptions about the coordinate system we are working in. Consider the vector \([3,2]^\top \in \mathbb{R}^2\). Implicitly, we know that this vector can be drawn by connecting the origin to a point three units to the right and two units up, i.e., $$ \begin{bmatrix} 3 \\ 2 \end{bmatrix} = 3 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + 2 \begin{bmatrix} 0 \\ 1 \end{bmatrix}. $$ In \(\mathbb{R}^2\), the standard basis vectors \(\boldsymbol{e}_1=[1,0]^\top, \boldsymbol{e}_2=[0,1]^\top\) define these units and directions and, by extension, the entire coordinate system.
But this coordinate system is arbitrary. Space has no intrinsic "grid," and another observer might choose two different basis vectors that are equally valid (so long as they are linearly independent). In other words, a basis simply provides a frame of reference with which to define coordinate vectors.
Let's say, for example, that your friend Chris decides to use two different vectors, \(\boldsymbol{b}_1\) and \(\boldsymbol{b}_2\), as his standard basis vectors. And let's say that in our coordinate system, these vectors have coordinates \(\boldsymbol{b}_1=[2,1]^\top, \boldsymbol{b}_2=[-1,1]^\top\). Then the vector that Chris writes as \([3,2]^\top\) will be a different vector from the one we write as \([3,2]^\top\).
To figure out what vector Chris thinks of in our coordinate system (using our basis vectors, rather than Chris's basis vectors), we simply need to compute $$ 3\boldsymbol{b}_1 + 2\boldsymbol{b}_2 = \begin{bmatrix} 2 & -1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 3 \\ 2 \end{bmatrix} = \begin{bmatrix} 4 \\ 5 \end{bmatrix}, $$ where \(\boldsymbol{b}_1\) and \(\boldsymbol{b}_2\) are Chris's basis vectors, represented in our coordinate system. As a side note, it is important to remember that from Chris's perspective, \(\boldsymbol{b}_1, \boldsymbol{b}_2\) have the coordinates \([1,0]^\top, [0,1]^\top\), because that is how standard basis vectors are defined.
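Double-checking the arithmetic in code (numpy again, using only the numbers already in the example):

```python
import numpy as np

# Chris's basis vectors, written in OUR coordinates.
b1 = np.array([2.0, 1.0])
b2 = np.array([-1.0, 1.0])
B = np.column_stack([b1, b2])  # basis vectors as the columns of a matrix

# The vector Chris calls [3, 2]^T, expressed in our coordinates:
print(3 * b1 + 2 * b2)           # [4. 5.]
print(B @ np.array([3.0, 2.0]))  # [4. 5.] -- same thing as a matrix-vector product
```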
The way to think about this is that the vectors are fixed in space, and we are simply drawing different gridlines that change the coordinates used to represent those vectors. Just like we can use different languages to label the same object, we can use different coordinate systems to represent the same vector.
Let's return to the question that motivated all these definitions: why did we get infinitely many solutions to our example equation \(A\boldsymbol{x}= \boldsymbol{b}\)? Remember that the columns of \(A\) tell us where the standard basis vectors land after applying the linear transformation described by \(A\). Therefore, the standard basis vectors all land on a two-dimensional plane, along with every other vector in \(\mathbb{R}^3\). You can imagine that \(A\) "squashes" every vector in \(\mathbb{R}^3\) down into a plane. Because \(\boldsymbol{b}=[3,2,5]^\top\) happens to lie on that plane, solutions exist, and because an entire line of input vectors gets squashed onto each point of the plane, there are infinitely many of them. If \(\boldsymbol{b}\) had been off the plane, there would have been no solution at all.
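Here is one last sketch that makes this concrete (numpy; the vector \([3,2,6]^\top\) is just an arbitrary point I picked off the plane):

```python
import numpy as np

A = np.array([[1,  1, 1],
              [1, -1, 2],
              [2,  0, 3]], dtype=float)

# b = [3, 2, 5]^T lies on the plane that A squashes R^3 onto, so it is reachable.
b = np.array([3.0, 2.0, 5.0])
x, _, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(rank)                   # 2 -- the image of A is only a plane
print(np.allclose(A @ x, b))  # True -- an exact solution exists

# A right-hand side off that plane cannot be reached at all.
b_off = np.array([3.0, 2.0, 6.0])
x_off, *_ = np.linalg.lstsq(A, b_off, rcond=None)
print(np.allclose(A @ x_off, b_off))  # False -- no exact solution
```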
Up next: other important definitions including image, kernel, column space, null space (and connection to previous concepts).