Machine Learning for dummies – word2vec

Last time we have skimmed through neural networks. Today I’d like to describe one cool algorithm that is based on them.

Up until now we have worked on character recognition in images. It’s quite straightforward to convert an image to numbers, we just take pixel intensities and we have numbers. But what to do if we want to work with text? We can not do matrix multiplication with words, we need to somehow convert them to numbers. One way to do it is just take all words and number them. ‘aardvark’ would be 1, ‘aardwolf’ 2 etc. The problem with this approach is that similar words would have completely different numbers. Let’s say you are working on image recognition. You want to have a model that says “This is most likely a cat, but maybe it’s a kitten or a tiger. It definitely is something cat-like”. For this it’s better to have numeric representations of cat, kitten and tiger that are similar. Since we are dummies, I will not try to find mathematical reasons for this. Intuition tells me that pictures of kitten and cat are quite similar so it makes sense that output of the learning algorithm should be similar as well. It’s much harder to teach it if cat has number 10 and kitten 123,564.

But how to get such representations? We can use word2vec which allows us to map words to a n-dimensional vector space in a way that puts similar words together. The trick is quite simple. Similar words are used in similar contexts. I can say “I like pizza with cheese”. Or “I like hamburger with cheese”. Here the contexts are similar, just the food is different. Now I just need some algorithm that would read lot of text and somehow find words that are used in similar contexts and put them together.

Word2vec takes a neural network and teaches it to guess contexts based on words. As an input I can take a word “pizza” and try to teach the network to come up with “I like … with cheese”. This approach is called skip-gram. Or I can try the other direction and teach the network to answer “pizza” if I feed it “I like … with cheese.” This direction is called CBOW. There are some differences in the results, but thay are not important enough for now. In the next text I will describe skip-gram with 4 words in the context.

Let’s take a look at the details. We will take simple neural network with one hidden layer. On the input, I will use one-hot encoding. ‘aardvark’ would be [1,0,0,…], ‘aardwolf’ [0,1,0,…], each word would be represented by a vector with zeros and one ‘1’. If my dictionary has 50k words I would end up with 50k-dimensional input vector. We have the input, let’s move to the hidden layer. The size of the hidden layer is up to me, at the end it will be the size of the vector representing the word. Let’s pick 128. Then there will be an output layer that would map the data back to one-hot encoded vector of context words. The network will take large sparse vector, squeeze it into much smaller and denser one and then unsqueeze it to a large vector again.

I take “pizza”, convert it to a 50k-dimensional vector with only one nonzero value. Then I multiply this vector with 128x50k-dimensional matrix and I get 128-dimensional vector. I take this vector, multiply it with another 50k x128-dimensional matrix and get 50k dimensional vector. After the right normalization, this vector will contain probability that given word occurs in context of word “pizza”. Value on the first position will be quite low, aardvark and pizza are not usually used in the same sentence. Actually, I just did that, so the next time this text gets to a learning algorithm, the first value in the vector will be slightly larger.

Of course I have to somehow get those matrices, but it’s just neural network training with some dirty tricks to make it computable. The trouble is, that the dictionary can by larger than 50k words so the learning phase is not trivial. But I am a dummy, I do not care. I just need to know that I will feed it the whole internet (minus porn) and it will somehow learn to predict the context.

Ok, we have a neural network that can predict the context, where do I get the vectors representing the words? They are in the first matrix from the model. That’s the beauty of it. Imagine that I have “pizza” on the input. It’s a vector with all zeros and one “1”. The matrix multiplication is not a multiplication at all, it just picks one column of the first matrix in the model. One column is just 128-dimensional vector and when we multiply it with the second matrix we want to get probabilities that given word is in the context. I guess that “like”, “eat”, “dinner”, “cheese” or “bacon” will be quite high. Now let’s feed in “hamburger”. It will pick another column from the first matrix, but we expect similar words in the result. Not the same, but similar. And to get similar results, we need similar vector in the hidden layer so when we multiply it with the second matrix we get similar context words. But the vector in hidden layer is just a column in the first matrix. It means, that in order to get similar results for words with similar contexts, the columns in the first matrix have to be similar (well they do not have to, but it usually ends up this way)

It’s quite beautiful and powerful algorithm. What’s more, the positions of the vectors in the vector space have some structure too. You have most likely seen the example where they take vector for “king” subtract vector for “man” and add vector for “women” and they get dangerously close to vector for “queen”. I have no idea why it works this way, but it does.

That’s all for today, next time I will try to show you some results. Here is some visualization you can admire in the meantime.

Sources:
Tensorflow tutorial
Nice (long) lecture about word2vec details
Visualization of High Dimensional Data using t-SNE with R
Just Google the rest

Machine Learning for dummies – Neural Networks

Leave a reply

Even though Neural Networks are the most cool thing in machine learning, they conceptually are just multiple linear regressions chained one after the other. So I will not draw any fancy pictures of neurons, I will just write boring matrix multiplications. Both representations are equivalent.

Last time, we have learned that logistic regression classification is just matter taking sample $x$ , multiplying it by matrix $W$ and adding some constant bias term $b$ . In our letter recognition example $x$ is 784-dimensional vector, $W$ is 10×784 dimensional matrix and $b$ is 10-dimensional vector. Once our model learns weights $W$ and $b$ , we just calculate $W x + b$ and pick the row with the maximal value. Index of the row is our class prediction.

The downside of logistic regression is that it can only classify classes that are separable by linear plane. We can fix that by adding multiple linear regressions one after each other.

Let’s say that we take the input, multiply it by a matrix, add bias and then take the output. $x_2 := W_1 x + b_1$ We can again multiply it by another matrix and add another bias $y:= W_2 x_2 + b_2$ We then can take the result of this second operation and use it as our prediction.

We will call this Neural Network with one hidden layer. The hidden layer is the vector $x_2$ . It is hidden because nobody sees it, it’s just something internal to the model. It’s up to us to decide what the size of $x_2$ will be. The size is usually called number of neurons in the hidden layer. We can even add more hidden layers, each with different size. Unfortunately no one will tell you how many layers and how big they should be, you have to play with the parameters and see how the network performs.

To make things even more complicated, you can also use activation function. Activation function can take the result of each layer and make it more non-linear. So we will get $x_2:= g(W_1 x + b_1)$ where $g$ is the activation function. Popular choices are relu, tanh and for the last layer softmax.

Luckily for us, it’s quite simple to play with neural networks in Python. There are several libraries, I have picked Keras, since it’s quite powerful and easy to use.

from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import SGD

model = Sequential()
# Hidden layer with 1024 neurons
model.add(Dense(output_dim=1024, input_dim=image_size * image_size, init="glorot_uniform"))
# ReLU activation
model.add(Activation("relu"))
# We have 10 letters
model.add(Dense(output_dim=10, init="glorot_uniform"))
# Softmax makes sense for the activation of the output layer
model.add(Activation("softmax"))

# Let the magic happen
model.compile(loss='categorical_crossentropy', optimizer=SGD())

model.fit(train_dataset, train_labels, batch_size=128, nb_epoch=50)

In this example we will create a neural network with one hidden layer of 1024 neurons with activation function ReLU. In other words (ignoring biases), we will take the pixels of the letter, multiply them by a matrix which will result in 1024 numbers. We will then set all the negative numbers to 0 (ReLU) and then multiply the result by another matrix to get 10 numbers. We then find the highest of those numbers. Let’s say that highest result is in second row then the resulting letter is B.

We of course need to train the network. It’s more complex, luckily we do not need to know much about it. So let’s just use Stochastic Gradient Descent with batch size 128. nb_epoch says that we should walk through all training examples 50-times. Please remember that training is just finding of the minimum of some loss function.

I am using Keras with Tensorflow so when I call model.compile it actually generates some optimized distributed parallel C++ code that can even run on GPU. The code will train the network for us. This is important, large networks with lots of hidden layers and lots of parameters can take ages and lot of calculations to learn. Keras and Tensorflow hides the complexity so I do not have to worry about it.

And what about the results? We get 96% accuracy on test set! Another 3% increase from SVM. Since the best result on notMNIS is allegedly 97.1% I think it’s not bad at all. Can you calculate how many parameters we do have? In the first layer we have matrix of 1024×784 parameters + 1024 bias terms. In the output layer we have 10×1024 + 10 parameters. In total it’s around 800,000 parameters. Training takes around 20 minutes on my laptop.

Machine Learning for dummies – Logistic Regression internals

Leave a reply

Before we can start with neural networks, we should talk a bit more about Logistic Regression – without understanding Logistic Regression, we can not understand neural networks. Moreover, I have learned how to use math in WordPress so I need to try it.

As we already know, Logistic Regression tries to find linear hyperplane that separates two classes. Hyperplane is a thing that has one dimension less than the thing it is in. If we have 1-dimensional space (line), hyperplane is a one-dimensional point. If we have 3D space a hyperplane is a 2D plane. And if we have 784-dimensional space a hyperplane is a 783-dimensional space.

Let’s start with 1D example. We have a line and we want to separate it to two segments. We need to find or define one point that separates those two segments. We can define it like this $x_1 - 5$ . If $x_1 > 5$ , we get positive value from the equation and the point is in one segment (class). If $x_1<5$ , the result is negative and the sample is in the other class. In 3D it’s similar, we just need more parameters: $2x_1 + 3x_2 + 4x_3 + 5$ . When we plug-in our sample, we can again decide if the equation returns positive or negative value. Easy

Generic definition of a hyperplane in n-dimensional space looks like this $w_1 x_1 + w_2 x_2 + ... + w_n x_n + b$ . If we use vectors $w$ and $x$ , we can define hyperplane using vector multiplication $w^T x + b$ . But how to get those parameters (a.k.a weights) $w$ ? We need to learn them form training samples.

First of all, we will define labels $y$ as follows. $y=1$ if it’s a positive example and $y=0$ if it’s a negative one. We will only deal with two classes and cover multiple classes later. In our letter classification we will set $y=1$ if the training example is an ‘A’. Now we need to be able to somehow compare output from $w^T x + b$ and $y$ . The trouble is that $w^T x + b$ is an unbounded real number and we need to somehow make it comparable to $y$ which is zero or one. That’s where we will use logistic function (hence the name Logistic Regression). The details are described here, for us dummies it suffices to say that Logistic Function maps real numbers to interval 0 to 1 in a way that is good. Like really good. It’s so good I will give it a name $g$ . So now we have $g(w^T x + b)$ which returns numbers between zero and one.

Thanks to logistic function we can now compare results of our model with the training labels. Imagine that we have already obtained parameters $w$ , we can measure how well the parameters classify our training set. We just have to somehow calculate the distance from prediction. In ML the distance is called cost function. With linear regression people use cross-entropy and again, I will not go into details here. We just need to know that it returns 0 if our model classifies the sample correctly and some really really high number if it classifies it incorrectly.

If we put it together, we can calculate $Cost(g(w^T x + b), y)$ for each pair $x$ and $y$ . If model makes correct prediction, the cost is zero, if it mis-predicts, the cost is high. We can even calculate a sum over all $m$ training examples: $J(w) = \sum\limits_{i=1}^m(Cost(g(w^T x^(i) + b^(i)), y^(i)))$ . If the sum is large, our model is bad and if the sum is 0 our model is the best. Function $J(w)$ is function of parameters $w$ , we have made our training samples and labels part of the function definition. Now the only thing we need is to find such $w$ so the function $J(w)$ is as small as possible.

And that’s how the model learns. We needed all those strange functions just to make this final function nice, having only one global minimum and other nice properties. While it’s possible to find the minimum analytically, for practical reasons it’s usually found by using something called gradient descent. I will not go to the details, it basically just generates random parameters $w$ , calculates $J(w)$ and then it does small change to parameters so the next $J(w)$ is a bit smaller. It then iterates until it reaches the minimum. That’s all we need to know for now.
Please note that we need all those strange functions mainly for learning. When classifying, I can still calculate $w^T x + b$ and if it’s positive, it’s a letter ‘A’ and if it’s negative it’s some other letter. The advantage of applying the logistic function is that I can interpret the result as a probability.

What about multiple classes? We can use one-vs-all classifier. We can train a classifier that is able to separate A’s from the rest of the letters. Then B’s from the rest and so on. For our 10 letters we can apply the algorithm described above ten-times. Or we can use few tricks and learn all the letters in one pass.

The first trick is to change our labeling method. Before, we set y=1 if the letter in training set was an ‘A’. With multiple letters, we will use $y=[1,0,0,...]^T$ for ‘A’, $y=[0,1,0,...]^T$ for ‘B’, $y=[0,0,1,...]^T$ for ‘C’ etc. We also need to train different set of parameters (weights) for each letter. We can add all the weights into one matrix $W$ and calculate the result using matrix multiplication $W x + b$ ( $b$ is 10-dimensional vector now as well). If we put weights for each letter to $W$ as a row, our matrix will have 10 rows (one for each letter) and 784 columns (one for each pixel). Our sample $x$ has 784 rows. If we multiply the two, we will get 10 numbers. We then element-wise add 10-numbers from $b$ and pick the largest one. That’s our prediction. For learning, we will use softmax function instead of logistic function, but the rest remains the same. At the end of the learning we will get 10×784 + 10 numbers that define 10 hyperplanes separating letters in 784-dimensional space.

If case you are wondering why you need to know all of this, the reason is simple. Neural networks are built on top of ideas from logistic regression. In other words, logistic regression is one-layer neural network so in order to make real NN, we just need to add more layers. This is how our logistic regression looks like if depicted as a single neuron