Support vector machines are machine learning algorithms whereby a model ‘learns’ to categorize data around a linear classifier. The linear classifier is, quite simply, a line that classifies. It’s a line that that distinguishes between 2 ‘types’ of data, like positive sentiment and negative language. This gives you control over data, allowing you to easily categorize and manage different data points in a way that’s useful too.
This tutorial is an extract from Statistics for Machine Learning.
But support vector machines do more than linear classification – they are multidimensional algorithms, which is why they’re so powerful. Using something called a kernel trick, which we’ll look at in more detail later, support vector machines are able to create non-linear boundaries. Essentially they work at constructing a more complex linear classifier, called a hyperplane.
Support vector machines work on a range of different types of data, but they are most effective on data sets with very high dimensions relative to the observations, for example:
- Text classification, in which language has the very dimensions of word vectors
- For the quality control of DNA sequencing by labeling chromatograms correctly
Different types of support vector machines
Support vector machines are generally classified into three different groups:
- Maximum margin classifiers
- Support vector classifiers
- Support vector machines
Let’s take a look at them now.
Maximum margin classifiers
People often use the term maximum margin classifier interchangeably with support vector machines. They’re the most common type of support vector machine, but as you’ll see, there are some important differences.
The maximum margin classifier tackles the problem of what happens when your data isn’t quite clear or clean enough to draw a simple line between two sets – it helps you find the best line, or hyperplane out of a range of options. The objective of the algorithm is to find furthest distance between the two nearest points in two different categories of data – this is the ‘maximum margin’, and the hyperplane sits comfortably within it.
The hyperplane is defined by this equation:
So, this means that any data points that sit directly on the hyperplane have to follow this equation. There are also data points that will, of course, fall either side of this hyperplane. These should follow these equations:
You can represent the maximum margin classifier like this:
Constraint 2 ensures that observations will be on the correct side of the hyperplane by taking the product of coefficients with x variables and finally, with a class variable indicator.
In the diagram below, you can see that we could draw a number of separate hyperplanes to separate the two classes (blue and red). However, the maximum margin classifier attempts to fit the widest slab (maximize the margin between positive and negative hyperplanes) between two classes and the observations touching both the positive and negative hyperplanes. These are the support vectors.
It’s important to note that in non-separable cases, the maximum margin classifier will not have a separating hyperplane – there’s no feasible solution. This issue will be solved with support vector classifiers.
Support vector classifiers
Support vector classifiers are an extended version of maximum margin classifiers. Here, some violations are ‘tolerated’ for non-separable cases. This means a best fit can be created. In fact, in real-life scenarios, we hardly find any data with purely separable classes; most classes have a few or more observations in overlapping classes.
The mathematical representation of the support vector classifier is as follows, a slight correction to the constraints to accommodate error terms:
In constraint 4, the C value is a non-negative tuning parameter to either accommodate more or fewer overall errors in the model. Having a high value of C will lead to a more robust model, whereas a lower value creates the flexible model due to less violation of error terms. In practice, the C value would be a tuning parameter as is usual with all machine learning models.
The impact of changing the C value on margins is shown in the two diagrams below. With the high value of C, the model would be more tolerating and also have space for violations (errors) in the left diagram, whereas with the lower value of C, no scope for accepting violations leads to a reduction in margin width. C is a tuning parameter in Support Vector Classifiers:
Support vector machines
Support vector machines are used when the decision boundary is non-linear. It’s useful when it becomes impossible to separate with support vector classifiers. The diagram below explains the non-linearly separable cases for both 1-dimension and 2-dimensions:
Clearly, you can’t classify using support vector classifiers whatever the cost value is. This is why you would want to then introduce something called the kernel trick.
In the diagram below, a polynomial kernel with degree 2 has been applied in transforming the data from 1-dimensional to 2-dimensional data. By doing so, the data becomes linearly separable in higher dimensions. In the left diagram, different classes (red and blue) are plotted on X1 only, whereas after applying degree 2, we now have 2-dimensions, X1 and X21 (the original and a new dimension). The degree of the polynomial kernel is a tuning parameter. You need to tune them with various values to check where higher accuracy might be possible with the model:
However, in the 2-dimensional case, the kernel trick is applied as below with the polynomial kernel with degree 2. Observations have been classified successfully using a linear plane after projecting the data into higher dimensions:
Different types of kernel functions
Kernel functions are the functions that, given the original feature vectors, return the same value as the dot product of its corresponding mapped feature vectors. Kernel functions do not explicitly map the feature vectors to a higher-dimensional space, or calculate the dot product of the mapped vectors. Kernels produce the same value through a different series of operations that can often be computed more efficiently.
The main reason for using kernel functions is to eliminate the computational requirement to derive the higher-dimensional vector space from the given basic vector space, so that observations be separated linearly in higher dimensions. Why someone needs to like this is, derived vector space will grow exponentially with the increase in dimensions and it will become almost too difficult to continue computation, even when you have a variable size of 30 or so. The following example shows how the size of the variables grows.
Here’s an example: When we have two variables such as x and y, with a polynomial degree kernel, it needs to compute x2, y2, and xy dimensions in addition. Whereas, if we have three variables x, y, and z, then we need to calculate the x2, y2, z2, xy, yz, xz, and xyz vector spaces. You will have realized by this time that the increase of one more dimension creates so many combinations. Hence, care needs to be taken to reduce its computational complexity; this is where kernels do wonders. Kernels are defined more formally in the following equation:
Polynomial kernels are often used, especially with degree 2. In fact, the inventor of support vector machines, Vladimir N Vapnik, developed using a degree 2 kernel for classifying handwritten digits. Polynomial kernels are given by the following equation:
Radial Basis Function kernels (sometimes called Gaussian kernels) are a good first choice for problems requiring nonlinear models. A decision boundary that is a hyperplane in the mapped feature space is similar to a decision boundary that is a hypersphere in the original space. The feature space produced by the Gaussian kernel can have an infinite number of dimensions, a feat that would be impossible otherwise. RBF kernels are represented by the following equation:
This is sometimes simplified as the following equation:
It is advisable to scale the features when using support vector machines, but it is very important when using the RBF kernel. When the value of the gamma value is small, it gives you a pointed bump in the higher dimensions. A larger value gives you a softer, broader bump. A small gamma will give you low bias and high variance solutions; on the other hand, a high gamma will give you high bias and low variance solutions and that is how you control the fit of the model using RBF kernels: