Acknowledgement

The Elements of Statistical Learning, chapters 4 and 12

http://uc-r.github.io/svm

Linear discriminant analysis

We first revisit linear methods for classification. Let \(G(x)\) be our predictor that takes values in a discrete set of \(K\) classes \(\{1, 2, \dots, K\}\). Here we consider a member of a class of methods that model discriminant functions \(\delta_k(x)\) for each class, and classify \(x\) to the class with the largest value for its discriminant function, \(\hat G(x) = \mathop{\mathrm{arg\,max}}_k \delta_k(x)\).

Suppose \(f_k(x) = \Pr(X=x | G = k)\) is the class-conditional density of \(X\) in class \(G=k\), and let \(\pi_k = \Pr(G = k)\) be the prior probability of class \(k\), with \(\sum_{k=1}^{K} \pi_k = 1\). Bayes' theorem gives us

\[ \Pr(G=k|X=x) = \frac{f_k(x)\pi_k}{\sum_{l=1}^K f_l(x)\pi_l} \]

Suppose we model each class density as a multivariate Gaussian \[ f_k(x) = \frac{1}{(2\pi)^{p/2} |\mathbf{\Sigma}_k|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T\mathbf{\Sigma}_k^{-1}(x-\mu_k)} \]

Linear discriminant analysis (LDA) arises in the special case when we assume all classes have a common covariance matrix \(\mathbf{\Sigma}_k=\mathbf{\Sigma}\). Comparing two classes \(k\) and \(l\), we see that \[ \begin{aligned} \log \frac{\Pr(G=k|X=x)}{\Pr(G=l|X=x)} &= \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l} \\ &= \log \frac{\pi_k}{\pi_l} - \frac{1}{2}(\mu_k + \mu_l)^T\mathbf{\Sigma}^{-1}(\mu_k - \mu_l) + x^T \mathbf{\Sigma}^{-1}(\mu_k - \mu_l), \end{aligned} \] an equation linear in \(x\). This linear log-odds function implies that the decision boundary between classes \(k\) and \(l\) is linear in \(x\). This means if we divide \(\mathbb{R}^p\) into regions that are classified as class 1, class 2, etc., these regions will be separated by hyperplanes.

The linear discriminant functions are \[ \delta_k(x) = x^T\mathbf{\Sigma}^{-1}\mu_k - \frac{1}{2}\mu_k^T\mathbf{\Sigma}^{-1}\mu_k + \log \pi_k \]

In practice, we do not know the parameters of the Gaussian distributions and estimate them from training data: \(\hat\pi_k = N_k/N\), where \(N_k\) is the number of class-\(k\) observations; \(\hat\mu_k = \sum_{g_i = k} x_i / N_k\); and \(\hat{\mathbf{\Sigma}} = \sum_{k=1}^{K}\sum_{g_i = k}(x_i - \hat\mu_k)(x_i - \hat\mu_k)^T/(N-K)\).
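
As a concrete illustration, here is a minimal sketch of LDA computed by hand in R with the plug-in estimates above; the use of the built-in iris data and of two particular predictors is an assumption made only for this example.

``` r
# Hand-rolled LDA on the iris data (two predictors chosen for illustration)
X <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])
g <- iris$Species
N <- nrow(X); K <- nlevels(g)

# Plug-in estimates: priors, class means, pooled covariance
pi_hat <- as.vector(table(g)) / N
names(pi_hat) <- levels(g)
mu_hat <- t(sapply(levels(g), function(k) colMeans(X[g == k, ])))
Sigma_hat <- Reduce(`+`, lapply(levels(g), function(k) {
  Xk <- scale(X[g == k, ], center = TRUE, scale = FALSE)
  t(Xk) %*% Xk
})) / (N - K)
Sigma_inv <- solve(Sigma_hat)

# Evaluate the linear discriminant functions delta_k(x) at every training x
delta <- sapply(levels(g), function(k) {
  mk <- mu_hat[k, ]
  drop(X %*% Sigma_inv %*% mk) - 0.5 * drop(t(mk) %*% Sigma_inv %*% mk) + log(pi_hat[[k]])
})
pred <- levels(g)[max.col(delta)]
mean(pred == g)  # training accuracy of the hand-rolled rule
```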

For the more general case where the \(\mathbf{\Sigma}_k\) are not assumed to be equal, we get quadratic discriminant analysis (QDA), with quadratic discriminant functions \[ \delta_k(x) = -\frac{1}{2}\log|\mathbf{\Sigma}_k| -\frac{1}{2}(x-\mu_k)^T\mathbf{\Sigma}_k^{-1}(x-\mu_k) + \log \pi_k \]
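
In practice these estimates and decision rules need not be coded by hand; a short sketch using the MASS package (the choice of the iris data is again just for illustration):

``` r
# LDA and QDA via the MASS package; predict() returns predicted classes
# and posterior probabilities
library(MASS)

lda_fit <- lda(Species ~ ., data = iris)  # common covariance matrix
qda_fit <- qda(Species ~ ., data = iris)  # class-specific covariance matrices

# Training accuracy of the two decision rules
mean(predict(lda_fit, iris)$class == iris$Species)
mean(predict(qda_fit, iris)$class == iris$Species)
```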

Optimal separating hyperplane

Consider a training dataset of \(N\) pairs \((x_1, y_1)\), \((x_2, y_2)\), …, \((x_N, y_N)\), with \(x_i \in \mathbb{R}^p\) and \(y_i \in \{-1, 1\}\). Define a hyperplane by \[ \{x:f(x) = x^T\beta + \beta_0 = 0\}, \] where \(\beta\) is a unit vector, \(||\beta|| = 1\). The classification rule induced by \(f(x)\) is \[ G(x) = \mathop{\mathrm{sign}}[x^T\beta + \beta_0] \]

For separable classes, we can find a function \(f(x) = x^T\beta + \beta_0\) with \(y_if(x_i) > 0\) for all \(i\). Among all such hyperplanes, the one that creates the biggest margin between the training points of class \(1\) and class \(-1\) solves the optimization problem \[ \max\limits_{\beta, \beta_0, ||\beta|| = 1}M \\ \mbox{subject to } y_i(x_i^T\beta + \beta_0) \ge M, i = 1, \dots, N. \]

This problem can be more conveniently rephrased as

\[ \min\limits_{\beta, \beta_0}||\beta|| \\ \mbox{subject to } y_i(x_i^T\beta + \beta_0) \ge 1, i = 1, \dots, N, \] where \(M = 1/||\beta||\). This expression is the usual way of writing the support vector criterion for separated data.
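
As an illustration, this quadratic program can be solved (to a good approximation) with the e1071 package by fitting a linear-kernel SVM with a very large cost, which effectively forbids margin violations; the simulated, essentially separable classes below are an assumption made for this sketch.

``` r
# Maximal-margin hyperplane on simulated (essentially separable) data,
# approximated by a linear-kernel SVM with a very large cost
library(e1071)

set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
y <- factor(rep(c(-1, 1), each = 20))

fit <- svm(x, y, kernel = "linear", cost = 1e5, scale = FALSE)

# Recover the hyperplane f(x) = x' beta + beta_0 from the fitted object
beta  <- drop(t(fit$coefs) %*% fit$SV)
beta0 <- -fit$rho
1 / sqrt(sum(beta^2))  # the margin M = 1 / ||beta||
```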

Support vector classifier

Suppose now that the classes overlap in feature space. One way to deal with the overlap is to still maximize \(M\), but allow some points to be on the wrong side of the margin. Define slack variables \(\xi_1, \dots, \xi_N\) and modify the constraints to either \[ y_i(x_i^T\beta + \beta_0) \ge M - \xi_i \\ \mbox{or} \\ y_i(x_i^T\beta + \beta_0) \ge M(1 - \xi_i), \\ \] \(\forall i\), with \(\xi_i \ge 0\) and \(\sum_{i = 1}^N \xi_i \le \mbox{constant}\). The second choice, which measures overlap in relative distance, leads to a convex optimization problem and defines the standard support vector classifier. Dropping the norm constraint on \(\beta\) as before, it is usually re-expressed as \[ \min\limits_{\beta, \beta_0} \frac{1}{2}||\beta||^2 + C\sum_{i=1}^N \xi_i \\ \mbox{subject to } \xi_i \ge 0,\ y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i,\ i = 1, \dots, N, \] where the cost parameter \(C\) replaces the constant bound on \(\sum \xi_i\).
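
A rough sketch of the effect of \(C\) (e1071 and the simulated overlapping classes are assumptions of this example): a small cost tolerates many margin violations, while a large cost penalizes them heavily.

``` r
# Effect of the cost parameter C on the support vector classifier
library(e1071)

set.seed(2)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 1.5), ncol = 2))
y <- factor(rep(c(-1, 1), each = 30))

fit_small <- svm(x, y, kernel = "linear", cost = 0.01, scale = FALSE)
fit_large <- svm(x, y, kernel = "linear", cost = 100,  scale = FALSE)

# Smaller C -> wider margin -> more support vectors (points inside the margin)
c(small_C = length(fit_small$index), large_C = length(fit_large$index))
```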

Support vector machines and kernels

The support vector classifier finds linear boundaries in the input feature space. The support vector machine extends this by first transforming the inputs with basis functions \[ h(x_i)=\left(h_1(x_i), h_2(x_i), \dots, h_M(x_i)\right), \quad i = 1, \dots, N, \] fitting the support vector classifier in this enlarged space, and producing the (nonlinear) function \[ \hat f(x) = h(x)^T \hat \beta + \hat\beta_0 \]

As in the linear case, the Lagrange dual objective function involves the transformed features only through inner products, \[ L_D = \sum \limits_{i = 1}^{N}\alpha_i - \frac{1}{2}\sum \limits_{i = 1}^N \sum \limits_{i' = 1}^N \alpha_i \alpha_{i'}y_i y_{i'} \left\langle {h(x_i), h(x_{i'})} \right\rangle, \]

and so does the solution function: \[ \begin{aligned} f(x) &= h(x)^T \beta + \beta_0\\ &= \sum \limits_{i = 1}^{N} \alpha_i y_i \left\langle h(x), h(x_{i}) \right\rangle + \beta_0. \end{aligned} \] The \(\alpha_i\) are obtained by maximizing \(L_D\) subject to \(0 \le \alpha_i \le C\) and \(\sum_{i=1}^N \alpha_i y_i = 0\), and \(\beta_0\) can then be determined by solving \(y_i f(x_i) = 1\) for any \(x_i\) with \(0 < \alpha_i < C\).
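
This representation can be checked numerically. A self-contained sketch, assuming simulated data, the identity transformation \(h(x) = x\) (linear kernel), and the e1071/libsvm convention that `fit$coefs` stores \(\alpha_i y_i\) for the support vectors while `-fit$rho` plays the role of \(\beta_0\):

``` r
# Verify that the fitted function has the form
#   f(x) = sum_i alpha_i y_i <x, x_i> + beta_0   (linear kernel, h(x) = x),
# where the sum runs over the support vectors
library(e1071)

set.seed(4)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 2), ncol = 2))
y <- factor(rep(c(-1, 1), each = 30))

fit <- svm(x, y, kernel = "linear", cost = 1, scale = FALSE)

f_manual <- x %*% t(fit$SV) %*% fit$coefs - fit$rho
f_svm <- attr(predict(fit, x, decision.values = TRUE), "decision.values")
all.equal(as.numeric(f_manual), as.numeric(f_svm))  # should be TRUE
```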

Both the dual objective and the solution thus involve \(h(x)\) only through inner products, so we need not specify the transformation \(h\) explicitly; it is enough to supply a kernel function \[ K(x, x') = \left\langle h(x), h(x') \right\rangle \] that computes the inner products in the transformed space. Popular choices in the SVM literature are the \(d\)th-degree polynomial \(K(x, x') = (1 + \left\langle x, x' \right\rangle)^d\), the radial basis \(K(x, x') = \exp(-\gamma ||x - x'||^2)\), and the neural network kernel \(K(x, x') = \tanh(\kappa_1 \left\langle x, x' \right\rangle + \kappa_2)\).
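
As an illustration of why kernels matter, the sketch below (with e1071 and simulated data whose true class boundary is a circle, both assumptions of this example) compares a linear kernel with a radial one.

``` r
# Linear vs. radial-basis kernel on data with a circular class boundary
library(e1071)

set.seed(3)
x <- matrix(rnorm(400), ncol = 2)
y <- factor(ifelse(rowSums(x^2) > 1.5, 1, -1))

fit_lin <- svm(x, y, kernel = "linear", cost = 1)
fit_rbf <- svm(x, y, kernel = "radial", gamma = 1, cost = 1)

# The radial kernel can represent the circular boundary; the linear one cannot
c(linear = mean(predict(fit_lin, x) == y),
  radial = mean(predict(fit_rbf, x) == y))
```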

Support vector machine example

Rmd, html.