Machine learning for dummies – Support Vector Machines

Support Vector Machines, it sounds scary. Might be a name of some horror movie. Revenge of Support Vector Machines. But do not worry, for us dummies it’s not so complicated. I promise I will not use any intimidating words such as Lagrangian or Mercer’s Theorem. To make it even simpler, I will just talk about classification between two classes, as if we were trying to decide if a letter is an ‘A’ or not (please read previous post if you do not know what I am talking about).

SVMs are really similar to Logistic Regression, they try to find linear (hyper)plane that would separate our classes, one class on one side of the plane, the other on the other side. There are two important differences from Logistic Regression. The first one is that SVMs try to find plane that is as far as possible from both classes. But that’s not so interesting. When trying to classify letters, our problem is that most likely you can not find any such plane in our 784 dimensional space (28×28 pixels) that would separate A’s from other letters.

Imagine that we got to situation on the image below. For simplicity only in 2 dimension, as if our letters had only two pixels. (images taken from Everything You Wanted to Know about the Kernel Trick)

Kernel Trick

Imagine that our A’s are in the middle surrounded by their enemies, other letters. I just can not draw a straight line to separate them.

But I can escape to another dimension by creating a third feature like $latex x_1^2 + x_2^2$ (image on the right). Now I can separate the two classes. Look ma, I can draw a circle using a linear function!

Kernel Trick

While I can manually find a new feature that would separate the classes, I am to lazy to do that. It’s called Machine Learning, I want the machine to do it for me. SVMs can hep me with that. I can ask them to use a “kernel” which will make my model non-linear. Here is how it works, hopefully sufficiently dummified.

Imagine I have 20,000 training samples. The first thing the model does is that it remembers them all. It just shamelessly stores all 20k samples. Now imagine I have a new sample coming to the model. The model calculates distances from my 20k training samples to my new sample which results in 20k new features. First feature is the distance to the first sample in my training set, second is the distance to the second sample etc. Suddenly I have twenty-thousand-dimensional space and it’s much more likely that I will find a linear plane that would separate both classes. This method intuitively makes sense. Points representing A’s are likely to be near other point’s representing A’s so if we somehow incorporate distance into the model, it should get better result. In more clever sounding words, if I project the plane form 20k dimensional space back to my original 784 dimensional space, I can get highly nonlinear boundary.

Now a bit less dummified version (dummified is a word, I swear). Not all 20k training samples are used, only the most interesting ones (they are called Support Vectors hence the name). So the space we are using has less than 20k dimensions. Moreover the distance is not necessary a distance, it’s just some function that takes two points (vectors) as an input and returns a number (new feature). This function is called kernel and we pick it when configuring the classifier – Gaussian (rbf) is the most common. And while the method sounds complicated and slow, it uses some mathematical shortcuts so it’s relatively fast. Unfortunately it does not scale well with number of training samples.

Let’s see how the method looks in the code.

clf = svm.SVC(C=10, gamma=0.01, cache_size=2000) 

The rest of the code is the same as the last time, we’ve just changed the classifier. We see our favorite (hyper)parameter C. There is also a new parameter ‘gamma’ which configures the kernel. Smaller gamma makes our Support Vectors reach further and generalize more.

So what are the results? We got to 99.7% accuracy on the training set. It’s not surprising, the model actually remembers some of the training samples. If we call ‘clf.support_.shape’ it will tell us that it remembers 8,670 samples out of 20,000. It means that our model has to learn 8,670 parameters for each classified class. Let’s count how many numbers the model needs to remember. It stores 8,670 samples each of them having 28×28=784 values. We are trying to classify 10 letters so the model learns 10×8,670 parameters. (Actually 9×8,670, but let’s not go there.) All in all, the model consists of almost 7,000,000 parameters (numbers) most of it in form of Support Vectors picked from training samples. The training takes less than 3 minutes on my laptop.

What are the results on the validation set, on the samples that the model has not seen? It is 87%. The nonlinearity of the model gave us five more percent when compared to Logistic Regression. It’s great. The author of notMNIST set says that the set has ~6,5% label error rate so we need six more percent to get perfect. To confirm that I could not resist and run the classification on test set, which is much cleaner. I got 93% accuracy. Like 93% on this crazy images? Incredible. We still have further 6-7% to go and I am afraid we will have to resort to neural networks to get better results.

Sources:
Coursera Machine Learning Course
Everything You Wanted to Know about the Kernel Trick
scikit learn documentation on RBF SVM parameters
scikit learn documentation on SVC

Machine Learning for dummies – overfitting

Before we can move to more advanced topics, we should discuss over and underfitting. It reminded me my high school biology lessons. We were expected to recognize different kinds of flowers. In the lessons we were given images of flowers and we were supposed to learn that this flower is yellow and has this type of leafs and this one is blue with different kind of leafs. But there was an alternative strategy how to pass the exam. The images had numbers in the corner and it’s been much more easier to memorize the numbers with corresponding names. You did not learn anything about the flowers but if your only objective was to pass the exam, it was perfect.

And exactly the same problem we have in machine learning. If you do not have many training examples and you have complicated model, the learning algorithm may just memorize the training samples without trying to generalize. It basically does the same thing what my classmates did in the biology lessons. It is great if you only want to classify samples from the training set. If you however want to classify unknown samples, you want your algorithm to generalize.

Let’s look at some examples. I will use the same algorithm as the last time on notMNIST data, I will just make the training set smaller, let’s say 1k samples. In noMNIST set there are 10 letters A-J. We have shuffled our training set so in those 1k samples there will be around 100 samples of each letter. I am using Logistic Regression in 784 dimensional space which maps to those 10 classes. We also have bias elements for each class so the model has 7850 parameters. Parameters are just some numbers which the algorithm learns based on the samples. We dummies do not need to know more. So the algorithm learns ~8000 numbers from 1000 samples. Or for each letter ~800 parameters from ~100 samples. There is quite substantial danger that the model will just memorize the training set.

from sklearn import linear_model

clf_l = linear_model.LogisticRegression()
# we do not want to use the whole set
size = 1000

# images are 28x28, we have to make a vector from them
tr = train_dataset[:size,:,:].reshape(size, 784)

# This is the most important line. I just feed the model training samples and it learns
clf_l.fit(tr, train_labels[:size])

# This is prediction on training set
prd_tr = clf_l.predict(tr)
print(float(sum(prd_tr == train_labels[:size]))/prd_tr.shape[0])

# let's check if the model is not cheating, it has never seen this letters before
prd_l = clf_l.predict(valid_dataset.reshape(valid_dataset.shape[0], 784))
float(sum(prd_l == valid_labels))/prd_l.shape[0]

And indeed, the model has been able to correctly classify 99% of the samples from the training set. That’s why we have the validation set. We will use it to check our model. It’s as if our biology teacher had one set of images for the lessons and another set for the exams. Actually, it’s more complicated, we have also test set, but we will not be using that one so often so let’s forget about it for now.

Our model is able to recognize 74% of samples from our validation set. Not to bad. Luckily for us, Logistic Regression is just trying to find linear (flat) plane to split the space so it is not able to cheat much. It can not simply pick the training examples, it has to take half of the space with them. But let’s try to force it to cheat anyway. Logistic Regression has parameter C (as for cheating) which says how much we allow the algorithm to over-fit. Higher C means more cheating, lower C means less cheating (for those who are used to more usual notation C = 1/lambda)

# Cheating level Carnage
clf_l = linear_model.LogisticRegression(C=100)

If we tune cheating parameter to 100, we get 99.9% of success on training set and 71% on validation set. If we use a value from other side of the spectrum C=0.001 we get 75% on training set and 76% on validation set. It’s up to us to decide what is better. Here is a table for some values of C for 1k samples.

C tr. set valid. set
0.001 75% 76%
0.01 81% 79%
0.1 92% 78%
1 99% 75%
10 99% 72%
100 99.9% 71%

Another other way for preventing overfitting is to provide more training examples. Imagine our biology lesson with much more samples with different numbers. From certain number of samples, it is much easier to try to learn how to recognize those damn flowers than to try to memorize all the numbers in the corner. And the same applies for machine learning algorithms too. If I use C=100 with 10k training samples, I get 93% accuracy on training set and 74% on validation set. You can notice that it’s actually worse than most of the results with 1k of examples and different C values. But we can combine both approaches. Let’s say that I pick C=0.01 and use 20k training samples. I get 83% accuracy on training set and 82% on validation set. As you can see, both numbers are converging. There is only 1% difference. It means that we are approaching limits of our model. Using more samples is not likely to help much. I am afraid we have to move to better model to get better results. That’s it for today. You can go now.

Update: Fixed number of parameters

Machine learning for dummies – Logistic Regression

Hi, welcome to my on-line course “machine learning for dummies”. Usually courses like this are written by an expert. This is not my case, though. I do not know anything about machine learning. I have spent more than 10 years working with server side Java, moving data here and there. I have even worked as REST API architect or Scrum master. Yes, the Scrum that eats programmers soul. But no machine learning whatsoever apart from image recognition course at the university 12 years ago and Machine Learning course on Coursera few months ago. But now I have decided that I will learn ML, so I quit my job, declined an interesting offer and started to learn ML.

Since Deep learning is the current hype, I have started with Deep Learning course at Udacity. Deep learning course from Google right in my living room? Cool.

And the course went quite well until the very first exercise which was about recognizing notMnist dataset

notMNIST

You have this “letters” and you want to recognize them. Sounds like a job for machine learning, right?

Installing Python libraries

To finish the exercise, you need to know something about machine learning and Python first. I do not know neither. Python is quite simple, if you know programming you will learn it in no time. The most complex task is the installation. I have spent like six hours trying to install all the required libraries. If you do not know Python, you just have no idea if you want to use OS specific packages (apt, brew), pip, easy_install or whatnot. The thing is, in Java you have projects that describe their dependencies using Maven or Gradle. So if you want to work on Java project, you just execute ‘mvn test’, it downloads half of the Internet and you are done. But in Python you have no project so you have to download all libraries yourself. What’s more, those libraries depend on each other and you need to have correct versions. Well I downloaded and recompiled half of the Internet multiple times, every time getting different conflict (most often between matplotlib and numpy). Luckily, people using Python know that their package systems suck so they provide Docker images with all the libraries installed. I have used the image to figure out which version of libraries work together. Today, I was trying to install it again on Mac OS and I was able to do it under an hour using Anaconda so maybe this Python distribution system works better than the others.

But once you have Python with all the libraries installed, you get your superpowers. Thanks to numpy, you can easily work with matrices, thanks to matplotlib you can plot charts and thanks to scikit-learn you can do machine learning without knowing much about it. That’s what I need.

Logistic Regression

So back to our notMnist example. After I have reviewed most of the Coursera course I have learned that in machine learning you need features – the data you will use in your algorithm. But we have images, not features. The most naive approach is to take the pixels and use them as features. Images in notMNIST are 28×28 pixel, so for each image we have 784 numbers (shades of gray). We can create 784 dimensional space and represent each image as one point in this space. Please think about it for a while. I will create really high dimensional space and represent each sample (image) by one point in the space. The idea is that if I have the right set of features, the points in this space will became separable. In other words, I will be able to split the space into two parts, one that contains points representing letters A and another part with all other letters. This is called Logistic Regression and there are implementation that do exactly that. Basically you just need to feed the algorithm samples with correct labels and it tries to figure out how to separate those points in the high-dimensional space. Once the algorithm finds the (hyper)plane that separates ‘A’s from other letters, you have your classifier. For each new letter you just check in which part of the space the corresponding point is and you know if it’s an A or not. Piece of cake.

But if you think about it, it just can not work. You just feed it raw pixels. Look at the images above, you are trying to recognize letter ‘A’ just based on values of the pixels. But each of the ‘A’s above is completely different. The learning algorithm just can not learn to recognize the letter just based on individual pixels. Or can it?

What’s more, I am trying to find a flat plane that would separate all As in my 784 dimensional space from all other letters. What if As in my crazy space form a little island which is surrounded by other letters. I just can not separate them by a flat plane. Or can I?

Well surprisingly I can and it is quite easy to implement

from sklearn import linear_model

clf_l = linear_model.LogisticRegression()
# we do not want to use the whole set, my old CPU would just melt
size = 20000 

# images are 28x28, we have to make a vector from them
tr = train_dataset[:size,:,:].reshape(size, 784)

# This is the most important line. I just feed the model 20k examples and it learns
clf_l.fit(tr, train_labels[:size])

# This is prediction on training set
prd_tr = clf_l.predict(tr)
print(float(sum(prd_tr == train_labels[:size]))/prd_tr.shape[0])
# prints 0.86 we are able to recognize 86% of the letters in the training set

# let's check if the model is not cheating, it has never seen
# the letters from the validation set before
prd_l = clf_l.predict(valid_dataset.reshape(valid_dataset.shape[0], 784))
float(sum(prd_l == valid_labels))/prd_l.shape[0]
# Cool 80% matches

Even this method that just can’t work is able to classify 80% of letters it has never seen before.

You can even easily visualize the letters that did not match

import numpy
import matplotlib.cm as cm

def num2char(num):
    return chr(num + ord('a'))

# indexes of letters that do not match 
# change the last number to get another letter
idx = numpy.where(prd_l != valid_labels)[0][17]
print("Model predicts: " + num2char(clf_l.predict(valid_dataset[idx].reshape(1, 784))))
print("Should be: " +num2char(valid_labels[idx]))
%matplotlib inline
plt.imshow(valid_dataset[idx,:,:] + 128,cmap = cm.Greys_r)

G

This should have been G but the model has classified it as C. Understandable mistake. Another letter is even better.

f

This obvious ‘f’ has been incorrectly classified as A. Well, to me it looks more like an A than an f. The thing is that even this simple naive model mostly does “understandable” mistakes only. Either the misclassified letter is not a letter at all or the classifier makes mistake because the letters are really similar. Like I and J, I and F and so on.

So what we have learned today? First of all that even stupid model on stupid features can work surprisingly well. Secondly that Python libraries are not easy to install but easy to use. Here I have to admit that similar code in Java would have been just crazy.

Next time we will try SVM and maybe try to figure out some better features to make our model even better.