Linear SVMs are used with linearly separable data, meaning the data can be separated into classes without any transformation. The decision boundary and support vectors form the appearance of a street, and Professor Patrick Winston from MIT uses the analogy of “fitting the widest possible street”2 to describe this quadratic optimization problem. Mathematically, the separating hyperplane can be represented as:
w · x + b = 0
where w is the weight vector, x is the input vector, and b is the bias term.
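As a minimal sketch of this idea (assuming scikit-learn, which the text does not name, and a small made-up dataset), a linear SVM can be fit to separable points and the weight vector w and bias b read off directly; the sign of w · x + b then gives the predicted class.

import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (hypothetical toy data).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # a large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]        # weight vector w
b = clf.intercept_[0]   # bias term b

# The decision boundary is the set of points x with w . x + b = 0;
# the sign of w . x + b determines which side of the street a point falls on.
print("w =", w, "b =", b)
print("decision values:", X @ w + b)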
There are two approaches to calculating the margin, the width of the street separating the classes: hard-margin classification and soft-margin classification. With a hard-margin SVM, every data point must lie outside the margin on the correct side, or “off the street” to continue with Professor Winston’s analogy. This constraint is represented by the formula,
yj (w · xj + b) ≥ a,
and the margin is then maximized, which is represented as: max γ = a / ||w||, where a is the margin projected onto w.
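A short sketch of this maximization, under the common convention that the constraint is scaled so a = 1 (an assumption; scikit-learn uses this scaling), is below: with a very large C the fit behaves like a hard-margin SVM, the geometric margin is 1 / ||w||, and the support vectors sit exactly on the gutters of the street.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C tolerates essentially no violations of yj (w . xj + b) >= a,
# approximating hard-margin classification.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# With a = 1, the maximized margin is gamma = 1 / ||w||; the street is 2 * gamma wide.
margin = 1.0 / np.linalg.norm(w)
print("margin gamma:", margin, "street width:", 2 * margin)

# Support vectors lie on the gutters: yj (w . xj + b) is (approximately) 1 for each.
print(y[clf.support_] * (clf.support_vectors_ @ w + b))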
Soft-margin classification is more flexible, allowing some misclassification through the use of slack variables (ξ). The hyperparameter C adjusts the margin: a larger C narrows the margin to minimize misclassification, while a smaller C widens it, allowing more misclassified data3.
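The sketch below illustrates this trade-off on hypothetical overlapping data (again assuming scikit-learn): for each value of C it reports the street width 2 / ||w||, how many points have nonzero slack ξj = max(0, 1 − yj (w · xj + b)), and how many are misclassified outright.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    margins = y * (X @ w + b)
    slack = np.maximum(0.0, 1.0 - margins)   # xi_j, one slack variable per point
    print(f"C={C:>6}: street width={2 / np.linalg.norm(w):.2f}, "
          f"points with slack > 0: {(slack > 0).sum()}, "
          f"misclassified: {(margins < 0).sum()}")

As a rule of thumb, the small-C fit reports a wider street with more slack, and the large-C fit a narrower street with fewer violations, matching the description above.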