These are some rough notes I am writing for personal use/later revision ONLY.
None of this information is in any way verified, or correct… probably.
Singular Value Decomposition
This is a good way of representing a matrix that gives us insight into the
range, null space, rank, and 2-norm condition number. Geometrically, the SVD
posits that any matrix can be broken down into a rotation, a stretch, and a
rotation, in that order. Furthermore, for any matrix \(A = U\Sigma V^\top\), writing out
\(A^\top A = V\Sigma^\top\Sigma V^\top\) and \(AA^\top = U\Sigma\Sigma^\top U^\top\)
shows that these are exactly the eigendecompositions of \(A^\top A\) and
\(AA^\top\). Hence, the eigenvectors of \(A^\top A\) form the columns of \(V\),
the eigenvectors of \(AA^\top\) form the columns of \(U\), and the eigenvalues of
both are the squared singular values.
Also, taking the first column of \(U\), the first (largest) singular value in
\(\Sigma\), and the first row of \(V^\top\) gives us a rank 1 matrix (column rank
and row rank are both 1, since they must be equal) that is the best rank 1
approximation of \(A\); in the PCA setting this direction captures the greatest
variance.
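A minimal numpy sketch checking both claims above (the test matrix is an
arbitrary random example and the variable names are mine):

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(5, 3))  # arbitrary test matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of A^T A are the squared singular values;
# its eigenvectors are the columns of V (the rows of Vt).
evals, evecs = np.linalg.eigh(A.T @ A)
print(np.allclose(np.sort(evals), np.sort(s ** 2)))  # True

# Leading rank-1 piece: first column of U, largest singular value, first row of Vt.
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.linalg.norm(A - A1, 2), s[1])  # 2-norm error equals the next singular value
```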
Principal Components Analysis
Supposing we have a set of data vectors (stacked as the rows of a matrix), the
PCA algorithm is as follows; a short code sketch appears after the list.
- Find the mean vector \(\bar{S}\).
- Create a mean-adjusted matrix \(S_{mean}\) by subtracting the mean vector from
every data vector.
- Compute the covariance matrix \(Cov = \frac{1}{n-1} S_{mean}^\top S_{mean}\),
whose entry \(Cov_{i,j}\) is the covariance between features \(i\) and \(j\)
across the \(n\) data vectors.
- Compute the SVD of the mean-adjusted matrix, \(S_{mean} = U\Sigma V^\top\)
(equivalently, the eigendecomposition of the covariance matrix). The rows of
\(V^\top\) are the principal axes and the columns of \(U\Sigma = S_{mean}V\) are
the principal scores.
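A minimal numpy sketch of the steps above, assuming the data vectors are the
rows of a made-up random matrix (names like X and scores are illustrative, not
from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))            # made-up data: 100 vectors of dimension 3

mean = X.mean(axis=0)                    # mean vector
S_mean = X - mean                        # mean-adjusted matrix
Cov = S_mean.T @ S_mean / (len(X) - 1)   # covariance matrix

# SVD of the mean-adjusted matrix; the rows of Vt are the principal axes.
U, s, Vt = np.linalg.svd(S_mean, full_matrices=False)
scores = S_mean @ Vt.T                   # principal scores, equal to U * Sigma
print(np.allclose(scores, U * s))        # True

# Sanity check: squared singular values / (n-1) are the covariance eigenvalues.
print(np.allclose(np.sort(s ** 2 / (len(X) - 1)), np.sort(np.linalg.eigvalsh(Cov))))
```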
Support Vector Machine (Linear)
Support vectors are points in the data that, if removed, would change the
hyperplane. We use these support vectors in a constrained optimization problem
to define a separating hyperplane. In fact, we first define two hyperplanes
(the margins) such that:
\(w \cdot x_i + b \geqslant 1\) when \(y_i = 1,\)
\(w \cdot x_i + b \leqslant -1\) when \(y_i = -1.\)
A key feature of these margins, which we incorporate more explicitly below, is
that no data points may lie between them. The separating hyperplane is then the
hyperplane midway between the two margins. Since the distance between the
margins is \(2/\|w\|\), maximizing the separation between the two groups
amounts to minimizing \(\|w\|\). This gives the constrained optimization
problem:
\(\min_{w,b} \frac{1}{2} \|w\|^2\) s.t. \(y_i(w \cdot x_i + b) \geqslant 1\) for all \(i.\)
We solve this using the Lagrangian Dual Problem.
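A small sketch of a linear SVM on toy data, using scikit-learn's SVC rather
than solving the dual by hand (the data and the choice C=1.0 are arbitrary); it
exposes the support vectors and the fitted hyperplane:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data: two clusters of 2-D points.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=-2, size=(20, 2)), rng.normal(loc=+2, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# Linear SVM; the solver works on the Lagrangian dual internally.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)      # the support vectors (points sitting on the margins)
w, b = clf.coef_[0], clf.intercept_[0]
print(w, b)                      # separating hyperplane: w . x + b = 0
```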
Support Vector Machine (Non-Linear)
Realize that the dual problem involves computing inner products and that
kernels are, by definition, inner products in a reproducing kernel Hilbert
space. Hence, we never need to compute the feature map explicitly: we replace
each inner product \(x_i \cdot x_j\) in the dual problem with a kernel
evaluation \(K(x_i, x_j)\) and proceed exactly as in the linear case.
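A tiny numpy check of the kernel-as-inner-product claim: the quadratic kernel
\(K(x, z) = (x \cdot z)^2\) and its explicit feature map
\(\varphi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)\) are standard textbook
choices, used here only for illustration:

```python
import numpy as np

def phi(x):
    """Explicit feature map whose inner products give the quadratic kernel."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    """Quadratic kernel, computed without ever forming phi."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(K(x, z), np.dot(phi(x), phi(z)))  # both print the same value (1.0)
```

In the dual problem we would only ever call K; phi exists here just to verify
the equivalence.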