MachineX: SVM as Non-Linear Classifiers

In our previous blogs, we have already looked at and gotten a higher level of understanding of SVM and why we should choose SVM over other classifiers.

In this post, we will look at a detailed explanation of how to use SVM for complex decision boundaries and build Non-Linear Classifiers using SVM. The primary method for doing this is by using Kernels.

In linear SVM, we find margin maximizing hyperplane with features Xi’s. Similarly, in Logistic regression, we also try to find the hyperplane, which minimizes logistic loss with features Xi’s. Most often, when we use both these techniques, the results are the same. But linear SVM fails for the same reason a logistic regression would fail; there is a need to have complex or non-linear decision boundaries. These types of boundaries are then achieved by SVM using Kernels. So let’s understand how SVM creates non-linear boundaries using Kernels.

First, try to pick a few points on the feature plane and call them landmarks. Then, we try to compute new features for an example(X) depending on the closeness of these features to the landmarks. What these features do is they measure how similar example X is to one of the landmarks. In this particular post, we will use the Gaussian Kernel, also known as the RBF Kernel, to compute the similarity measure.

Now let’s suppose the X1  L1. According to the formula, the similarity score (f1) ≈ 1 whereas if X1 is far from the landmark L1, then f1 ≈ 0. In simple terms, this means that feature f1 is almost similar to landmark L1. So we predict the value of feature f1 to be similar to L1. In the same way, we add more landmarks to our plane and try to come up with a feature vector by computing the similarity score of our example with these landmarks.

In the RBF Kernel formula, we have a feature known as gamma. Gamma controls how far the influence of the single training example reaches, which affects how tightly the decision boundary ends up surrounding the input space. Small gamma means a larger similarity radius, which means the values father apart are grouped together, which results in more points grouped together and smoother decision boundary. Whereas smaller gamma means that the points need to be closely packed, which results in more complex, tightly-packed decision boundaries.

How to Choose the Landmarks

A simple rule of the thumb is that we take all the examples in the training set and make then landmarks at the same location as that of the training example.

Given (x^1, y^1), (x^2, y^2), (x^3, y^3)........................ (x^n, y^n)
l^1 = x^1, l^2 = x^2,l^3 = x^3............................................ l^n = x^n

Therefore, for every new training example (x^i, y^i), we will find similarity scores according to each landmark.

f^1 = sim(x^i, l^1)
f^2 = sim(x^i, l^2)
f^3 = sim(x^i, l^3)
f^n = sim(x^i, l^n)

This would give us a new feature vector F.

So now, if we want to predict an estimate for a new example, given we already have model coefficients, we would then use a hypothesis where it says:

Predict y=1 if θ ^T * F >= 0

To estimate the model coefficients, we would then minimize the cost function:

We see that we have a parameter C that controls the tradeoff between satisfying the maximum margin criterion to find the simple decision boundary and avoid misclassification errors on the training set. The C parameter is significant for kernelized SVMs, and it communicates with the gamma parameter. If gamma is large, then C will have little to no effect. If gamma is small, the model is much more constrained, and C will be similar to how it would affect a linear classifier.

Hope you enjoyed the post. Let me know your thoughts in the comments.

This article was originally published here