Nejvíc mě dostala prezentace o projektu Loom. Na současném přístupu ke škálování mi cosi nesedělo, ale nedokázal jsem pojmenovat co. Pár firem potřebuje opravdu škálovat, a proto všichni musí začít přepisovat knihovny tak, aby podporovaly RxJavu, Reactor a podobné. Musíme přepsat databázové ovladače, HTTP knihovny i přístup k souborovému systému tak, aby byly reaktivní. Za to se nám dostane té odměny, že už nebudeme muset psát v Javě, ale v jakémsi DSL, kterým budeme krásně skládat asynchronní potrubí.

Když se nad tím zamyslíte, tak je to neuvěřitelně prosakující abstrakce. Abych ji mohl použít, musím svou asynchronní knihovnu zadrátovat do API. Takže metody už nevrací `T`, ale `Mono<T>`. To není prosakující abstrakce, ale cedník.

V projektu Loom si chlapci v Oraclu řekli, že to jde možná dělat lépe. Co nám vadí na vláknech? Jsou drahá, žerou paměť a jsou zbytečně blokována při čekání na síť, souborový systém nebo zámek. Co kdybychom použili jinou abstrakci, říkejme jí třeba Fiber (nitka?). Nitka se veze na vlákně, ale když narazí na blokující operaci, tak vlákno pustí a to může jít vozit jiné nitky. Když blokující operace skončí, blokovaná nitka naskočí na jiné vlákno a provádí další můj kód. Jelikož se všechny blokující operace obvykle provádí skrz standardní Java knihovnu, kód, který řeší opuštění vlákna a opětovné naskočení zpět, je schovaný v JVM a jeho knihovnách. Já nemusím nic řešit. (Kdo můj poetický popis nepochopil, nechť se stydí a pustí si video.)

To znamená, že jako programátor dostanu škálování v podstatě zadarmo. Můžu dál psát svůj staromódní imperativní kód a JVM zajistí, že poběží efektivně. Samozřejmě nedostanu všechny výhody reaktivního přístupu jako backpressure (protitlak), ale možná mi to za čistší API, čitelné stack trace a jiný komfort, na který jsem zvyklý, stojí.

Na Devoxxu ukazovali zajímavé demo, kde vzali Jetty, místo ThreadPoolu mu podvrhli něco, co používá Fibers, a když to zrovna nespadlo, tak klasická aplikace v JAX-RS škálovala jako divá. Bohužel bude ještě pár let trvat, než dořeší všechny detaily jako ThreadLocals a podobně, ale vypadá to nadějně.

Trochu s tím rezonovala přednáška Briana Goetze o objektovém a funkcionálním programování. Síla objektového programování je v místech, která překračují hranice (boundaries). Tam potřebuji mít silný kontrakt, v čemž OOP exceluje. Uvnitř hranic dává občas smysl zahodit ceremonii s objekty spojenou a využít sílu FP.

- Java in 2018: Change is the Only Constant – co nás čeká v Javě, na konci demo projektu Loom
- Java developer’s journey in Kubernetes land – tipy na to, jak používat Kubernetes, když jste Javista. GitHub projekt s příklady. Na Devoxxu se hodně skloňovalo Istio jako doplněk Kubernetes pro A/B testy a podobně. Na to se musím podívat.
- REST beyond the obvious - API design for ever evolving systems – pěkná přednáška o architektuře. Zajímavý postřeh je, že čím víc low-level API, tím víc coupluje komponenty dohromady. Nicméně výhody HATEOAS jsem stále nepochopil. Chtěl jsem se zeptat, jak se zajistí, že se klient magicky sám od sebe přeprogramuje, ale nestihl jsem to.
- Event Sourcing - You are doing it wrong – super povídání, které by měl vidět každý, kdo se pokouší o Event Sourcing. Přednášející vypadal, že ví, o čem mluví. S Oliverem Gierkem, který mluvil o RESTu, se shodli, že verzování API přináší víc škody než užitku.

Co říci závěrem? Devoxx je super konference, člověk si rozšíří obzory, uvědomí si, jak moc ho programování baví a jak živý je Java ekosystém. Až na pár ojedinělých výjimek byly všechny přednášky skvělé, takže jsem několikrát litoval, že se nemůžu rozkrájet a jít do více sálů najednou. Nevadí, 10 Mistakes Hackers Want You to Make si pustím z YouTube.

Nejjednodušší je začít daný kmen pozorovat. Máme štěstí, o programování žvaní téměř bez přestání. Vypadá to, že nejdůležitější je, jestli se používají mezery, nebo něco, čemu říkají tabelátory. Tak ne, nejdůležitější je psát funkcionálně, stav by měl být prohlášen za tabu. Aha, to asi taky ne. Že by řešením bylo zbavit se typů? Nebo je naopak zavést? Co domorodec, to názor.

Také jim můžeme položit následující dvě otázky:

- Kdybyste si mohli volně vybrat technologie pro příští projekt, o kolik byste byli rychlejší?
- Pokud byste psali váš současný projekt se současnou technologií znovu od začátku, o kolik byste ho udělali rychleji?

Nevím, jak váš kmen, ale u toho mého by výběr správné technologie moc velké zrychlení nepřinesl. To se samozřejmě může lišit. Pokud děláte v korporaci, která vás nutí používat nepoužitelné nástroje, může přechod na něco příčetnějšího přinést docela dost. Něco jako když vás nutí hrát basketbal ve svěrací kazajce a pak vám uvolní jednu ruku. Ale u nás ostatních technologie a programovací finty už tak zásadní přínos obvykle nemají. Něco jako když jsme doteď hráli bosí a dostaneme super boty. Pomůže to, ale pokud nedokážeme trefit koš, tak zas ne o tolik.

Co odpověď na druhou otázku? To je jiná. Kdybychom současný projekt dělali znovu, tak bychom se vykašlali na tuhle funkcionalitu, protože jsme ji nakonec stejně vymazali, a zeptali bychom se finančního ředitele na jeho názor mnohem dřív, takže bychom nemuseli půlku aplikace úplně předělávat. Možná bychom se do toho projektu vůbec nepustili, protože bychom věděli, jak neuvěřitelně nákladné to nakonec bude.

Proč bychom ten samý projekt psali podruhé mnohem rychleji? V čem je ten rozdíl? V tom, že se při implementaci projektu hlavně učíme. Ano, navenek to vypadá, že hlavně píšeme kód, ale největší fuška je v učení se. Učíme se, co zákazník chce. Učíme se, jak to sakra zaintegrovat do té hromady čehosi, co už ve firmě máme. Učíme se, jak komunikovat se svými kolegy. Učíme se, jak ten systém provozovat. Učíme se, jak to efektivně nasazovat. Pořád se prostě jen něco učíme.

Ano, psaní kódu je důležité, je to náš hlavní užitečný výstup, ale není to to, co nás brzdí. Sebelepší programovací prostředí, které mi bude číst myšlenky a generovat podle nich kód, mi nepomůže, pokud nevím, co mám dělat. Možná vás to překvapí, ale nepomůžou mi dokonce ani microservices.

Proč to píšu? Přijde mi, že si to lidé neuvědomují. Jeden můj výše postavený kolega se nám snaží pomoci tím, že chce některou naši méně důležitou programovací práci hodit na někoho mimo tým. Prostě někoho najmeme a on nám to naprogramuje. Jednoduché jako facka. Není. Psaní kódu je bohužel ta nejméně problematická část. Nejtěžší je vymyslet, co to má dělat, jak se to má napojit, jak to provozovat a udržovat. To se outsourcovat nedá. Psaní kódu je navíc to, co nás na tom baví, takže by nám zůstaly ty věci kolem. Programování by si slízl někdo jiný. Odporná představa.

Důležitost učení si ale neuvědomují ani programátoři. Ti stále vedou své žabomyší války o tom, jestli je lepší Emacs, nebo Vi, jestli mají ukládat data do SQL, nebo NoSQL, jestli mají používat Docker, nebo nevím co. Něco z toho jsou užitečné debaty, ale dokud se ve stejné míře nevěnují i tomu, jak se lépe učit i ty nezáživné neprogramovací věci, tak jsou to debaty bezpředmětné. Lepší technologie mi může pomoci vyndat ze svěrací kazajky i tu druhou ruku, ale když nebudu mít zájem učit se pravidla hry a nepochopím, že tu skákavou kulatou věc mám dostat do té obroučky, co visí nesmyslně vysoko nad hřištěm, tak jen budu pokračovat ve zmateném pobíhání po hřišti. Nejspíš mě to bude víc bavit, ale výsledek nebude o moc lepší.

```
assertThatJson("{\"test\":1}").node("test").isEqualTo(1);
assertThatJson("[1 ,2]").when(IGNORING_ARRAY_ORDER).isEqualTo("[2, 1]");
```

I really like fluent assertions. First of all, they are really easy to use; you get help from your IDE, unlike when using Hamcrest static methods. Moreover, they are quite easy to implement. This is how the original implementation looked:

```
public class JsonFluentAssert {
    protected JsonFluentAssert(...) {
    }

    public static JsonFluentAssert assertThatJson(Object json) {
        return new JsonFluentAssert(...);
    }

    public JsonFluentAssert isEqualTo(Object expected) {
        ...
    }

    public JsonFluentAssert node(String path) {
        ...
    }

    public JsonFluentAssert when(Option firstOption, Option... otherOptions) {
        ...
    }

    ...
}
```

We have a static factory method `assertThatJson()` that creates a `JsonFluentAssert`. All other methods return the same class, so you can chain them together. Nice and simple. Unfortunately, there are three mistakes in this API. Do you see them? Congratulations if you do. If not, do not be sad, it took me several years to see them.

The biggest mistake is that the API supports chaining even after the assertion method `isEqualTo()` is called. It seems I designed it this way on purpose, since there is a test like this:

```
assertThatJson("{\"test1\":2, \"test2\":1}")
    .node("test1").isEqualTo(2)
    .node("test2").isEqualTo(1);
```

The problem is that to support such a rare use case, the API is now more error prone. I got several error reports complaining that this does not work:

```
assertThatJson("[1 ,2]").isEqualTo("[2, 1]").when(IGNORING_ARRAY_ORDER);
```

It looks reasonable, but it cannot work. The comparison has to be done in `isEqualTo()`, and if it fails, it throws an assertion error, so the `when()` method is not called at all. In the current design you cannot postpone the comparison, since we do not know which method is the last one in the invocation chain. OK, how to fix it? Ideally, `isEqualTo()` should have returned void or maybe some simple type. But I cannot change the contract, it would break the API.

If I cannot change the API, I can do the second best thing – mark it as deprecated. This way the user at least gets a compile-time warning. It seems simple: I just need to return a new type from `isEqualTo()` with the `when()` method marked as deprecated.

Here comes the second mistake – `isEqualTo()` and the other methods return a class, not an interface. An interface would have given me more space to maneuver. And again, I cannot introduce an interface without potentially breaking backwards compatibility. Some clients might have stored the expression result in a variable like this:

```
JsonFluentAssert node1Assert =
        assertThatJson("{\"test1\":2, \"test2\":1}").node("test1");
```

If I want to mark the method `when()` as deprecated when called after an assertion, `isEqualTo()` has to return a subclass of `JsonFluentAssert` like this:

```
public class JsonFluentAssert {
    private JsonFluentAssert(...) {
    }

    public static JsonFluentAssert assertThatJson(Object json) {
        return new JsonFluentAssert(...);
    }

    public JsonFluentAssertAfterAssertion isEqualTo(Object expected) {
        ...
    }

    public JsonFluentAssert node(String path) {
        ...
    }

    public JsonFluentAssert when(Option firstOption, Option... otherOptions) {
        ...
    }

    ...

    public static class JsonFluentAssertAfterAssertion extends JsonFluentAssert {
        @Override
        @Deprecated
        public JsonFluentAssert when(Option firstOption, Option... otherOptions) {
            return super.when(firstOption, otherOptions);
        }
    }
}
```

The third mistake was to make the class extensible by making the constructor protected instead of private. I doubt anyone actually extends the class, but I cannot know for sure. And I cannot change the signature of `isEqualTo()` without breaking subclasses.

It's a pretty tricky situation. I can either keep the broken API or break backward compatibility. Tough choice. In the end I decided to bet on the fact that no one is extending `JsonFluentAssert`, and I made the change depicted above. I may be wrong, but there is nothing else I could do; I do not want to live with a bad API design forever. If you have a better solution, please let me know. Full source code is available here.

They are fascinating. Please remember that word2vec is an unsupervised algorithm: you just feed it a lot of text and it learns by itself. You do not have to tell it anything about the language, grammar or rules, it just learns by reading.

What's more, people from Google have published a model that is already trained on Google News, so you can just download it, load it into your Python interpreter and play. The model has about 3.4 GB and contains 3M words, each of them represented as a 300-dimensional vector. Here is the source I have used for my experiments.

```
from gensim.models import Word2Vec
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# take father, subtract man and add woman
model.most_similar(positive=['father', 'woman'], negative=['man'])
[('mother', 0.8462507128715515),
 ('daughter', 0.7899606227874756),
 ('husband', 0.7560455799102783),
 ('son', 0.7279756665229797),
 ('eldest_daughter', 0.7120418548583984),
 ('niece', 0.7096832990646362),
 ('aunt', 0.6960804462432861),
 ('grandmother', 0.6897341012954712),
 ('sister', 0.6895190477371216),
 ('daughters', 0.6731119751930237)]
```

You see, you can take the vector for “father”, subtract “man” and add “woman”, and you will get “mother”. Cool. How does it work? As we discussed last time, word2vec groups similar words together and luckily it also somehow discovers relations between the words. While it's hard to visualize the relations in 300-dimensional space, we can project the vectors to 2D.
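The same arithmetic can be tried with tiny made-up vectors; everything below (the 3-dimensional toy vectors and the `cosine` helper) is invented for illustration, the real model uses 300 dimensions and its own learned values:

```python
import numpy as np

def cosine(a, b):
    # cosine similarity: 1.0 means the vectors point in the same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy vectors: 2nd coordinate ~ "femaleness", 3rd ~ "parenthood"
vectors = {
    "man":    np.array([1.0, 0.0, 0.5]),
    "woman":  np.array([1.0, 1.0, 0.5]),
    "father": np.array([1.0, 0.0, 2.0]),
    "mother": np.array([1.0, 1.0, 2.0]),
}

# take father, subtract man and add woman...
query = vectors["father"] - vectors["man"] + vectors["woman"]
# ...and the nearest word is mother
best = max(vectors, key=lambda w: cosine(query, vectors[w]))
print(best)  # mother
```

Real word2vec discovers such "gender" and "parenthood" directions on its own; here they are baked in by hand just to show the mechanics.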

```
plot("man woman father mother daughter son grandmother grandfather".split())
```
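The `plot` helper above is my own function, not part of gensim. A minimal sketch of one possible implementation, projecting the vectors to 2D with plain PCA (published visualizations often use t-SNE instead; here the model is passed in explicitly as a word-to-vector mapping):

```python
import numpy as np

def project_2d(vectors):
    # PCA via SVD: center the vectors and keep the two main directions
    X = np.array(list(vectors.values()))
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T          # shape: (number of words, 2)

def plot(words, model):
    import matplotlib.pyplot as plt  # imported lazily, only needed for drawing
    coords = project_2d({w: model[w] for w in words})
    plt.scatter(coords[:, 0], coords[:, 1])
    for word, (x, y) in zip(words, coords):
        plt.annotate(word, (x, y))
    plt.show()
```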

Now you see how it works: if you want to move from “father” to “mother”, you just move down by the distance between “man” and “woman”. You can see that the model is not perfect. One would expect the distance between “mother” and “daughter” to be the same as between “father” and “son”. Here it is much shorter. Actually, maybe it corresponds to reality.

Let's try something more complicated

```
plot("pizza hamburger car motorcycle lamb lamp cat cow sushi horse pork pig".split())
```

In this example we lose a lot of information by projecting to 2D, but we can see the structure anyway. We have food on the left, meat in the middle, animals on the right and inanimate objects at the top. Lamb and lamp sound similar, but that did not confuse the model. It just is not sure whether a lamb is meat or an animal.

And now for something completely different – names.

```
plot("Joseph David Michael Jane Olivia Emma Ava Bill Alex Chris Max Theo".split())
```

Guys on the left, gals on the right and unisex in the middle. I wonder, though, what the vertical axis means.

I have played with the model for hours and it does not cease to surprise me. It just reads the text and learns so much, incredible. If you do not want to play with my code you can also try it online.

Up until now we have worked on character recognition in images. It's quite straightforward to convert an image to numbers: we just take pixel intensities and we have numbers. But what to do if we want to work with text? We cannot do matrix multiplication with words, we need to somehow convert them to numbers. One way to do it is to just take all words and number them. 'aardvark' would be 1, 'aardwolf' 2, etc. The problem with this approach is that similar words would have completely different numbers. Let's say you are working on image recognition. You want a model that says “This is most likely a cat, but maybe it's a kitten or a tiger. It definitely is something cat-like”. For this it's better to have numeric representations of cat, kitten and tiger that are similar. Since we are dummies, I will not try to find mathematical reasons for this. Intuition tells me that pictures of a kitten and a cat are quite similar, so it makes sense that the output of the learning algorithm should be similar as well. It's much harder to teach it if cat is number 10 and kitten 123,564.

But how to get such representations? We can use word2vec, which allows us to map words to an n-dimensional vector space in a way that puts similar words together. The trick is quite simple: similar words are used in similar contexts. I can say “I like pizza with cheese”. Or “I like hamburger with cheese”. Here the contexts are similar, just the food is different. Now I just need some algorithm that would read a lot of text, somehow find words that are used in similar contexts and put them together.

Word2vec takes a neural network and teaches it to guess contexts based on words. As an input I can take the word “pizza” and try to teach the network to come up with “I like ... with cheese”. This approach is called skip-gram. Or I can go the other direction and teach the network to answer “pizza” if I feed it “I like ... with cheese.” This direction is called CBOW. There are some differences in the results, but they are not important for now. In the following text I will describe skip-gram with 4 words in the context.
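To make the skip-gram setup concrete, here is a sketch (the function name and details are mine) of how (word, context) training pairs can be generated with a window of 2 words on each side, i.e. 4 context words:

```python
def skipgram_pairs(tokens, window=2):
    # for every word, emit (center word, context word) pairs
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("i like pizza with cheese".split())
# "pizza" is trained to predict: i, like, with, cheese
print([c for w, c in pairs if w == "pizza"])
```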

Let's take a look at the details. We will take a simple neural network with one hidden layer. On the input, I will use one-hot encoding: 'aardvark' would be [1,0,0,...], 'aardwolf' [0,1,0,...]; each word is represented by a vector of zeros with one '1'. If my dictionary has 50k words, I end up with a 50k-dimensional input vector. We have the input, let's move to the hidden layer. The size of the hidden layer is up to me; in the end it will be the size of the vector representing a word. Let's pick 128. Then there will be an output layer that maps the data back to a one-hot encoded vector of context words. The network takes a large sparse vector, squeezes it into a much smaller and denser one and then unsqueezes it back to a large vector again.

I take “pizza” and convert it to a 50k-dimensional vector with only one nonzero value. Then I multiply this vector by a 128x50k-dimensional matrix and get a 128-dimensional vector. I take this vector, multiply it by another 50kx128-dimensional matrix and get a 50k-dimensional vector. After the right normalization, this vector will contain the probability that a given word occurs in the context of the word “pizza”. The value at the first position will be quite low; aardvark and pizza are not usually used in the same sentence. Actually, I just did that, so the next time this text gets to a learning algorithm, the first value in the vector will be slightly larger.

Of course I have to somehow get those matrices, but it's just neural network training with some dirty tricks to make it computable. The trouble is that the dictionary can be larger than 50k words, so the learning phase is not trivial. But I am a dummy, I do not care. I just need to know that I will feed it the whole internet (minus porn) and it will somehow learn to predict the context.

OK, we have a neural network that can predict the context; where do I get the vectors representing the words? They are in the first matrix of the model. That's the beauty of it. Imagine that I have “pizza” on the input. It's a vector with all zeros and one “1”. The matrix multiplication is not a multiplication at all, it just picks one column of the first matrix in the model. One column is just a 128-dimensional vector, and when we multiply it by the second matrix we want to get probabilities that a given word is in the context. I guess that “like”, “eat”, “dinner”, “cheese” or “bacon” will be quite high. Now let's feed in “hamburger”. It will pick another column from the first matrix, but we expect similar words in the result. Not the same, but similar. And to get similar results, we need a similar vector in the hidden layer, so that when we multiply it by the second matrix we get similar context words. But the vector in the hidden layer is just a column of the first matrix. It means that in order to get similar results for words with similar contexts, the columns of the first matrix have to be similar (well, they do not have to, but it usually ends up this way).
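The column-picking trick is easy to check in numpy with a toy-sized network (6 words and 3 hidden neurons standing in for 50k and 128):

```python
import numpy as np

vocab_size, hidden_size = 6, 3
rng = np.random.default_rng(0)
W1 = rng.random((hidden_size, vocab_size))  # input -> hidden matrix

x = np.zeros(vocab_size)  # one-hot vector for the word with index 4
x[4] = 1.0

# multiplying by a one-hot vector just selects column 4 of W1
hidden = W1 @ x
print(np.allclose(hidden, W1[:, 4]))  # True
```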

It's quite a beautiful and powerful algorithm. What's more, the positions of the vectors in the vector space have some structure too. You have most likely seen the example where they take the vector for “king”, subtract the vector for “man” and add the vector for “woman”, and they get dangerously close to the vector for “queen”. I have no idea why it works this way, but it does.

That's all for today, next time I will try to show you some results. Here is some visualization you can admire in the meantime.

Sources:

- Tensorflow tutorial
- Nice (long) lecture about word2vec details
- Visualization of High Dimensional Data using t-SNE with R
- Just Google the rest

Last time, we have learned that logistic regression classification is just a matter of taking a sample $latex x$, multiplying it by a matrix $latex W$ and adding some constant bias term $latex b$. In our letter recognition example $latex x$ is a 784-dimensional vector, $latex W$ is a 10x784-dimensional matrix and $latex b$ is a 10-dimensional vector. Once our model learns the weights $latex W$ and $latex b$, we just calculate $latex Wx + b$ and pick the row with the maximal value. The index of the row is our class prediction.
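With random, untrained stand-in weights, the prediction step can be sketched in numpy just to show the shapes involved:

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.random((10, 784))  # one row of weights per class
b = rng.random(10)         # one bias term per class
x = rng.random(784)        # a flattened 28x28 letter

scores = W @ x + b                   # 10 numbers, one per class
prediction = int(np.argmax(scores))  # index of the maximal row
print(prediction)
```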

The downside of logistic regression is that it can only classify classes that are separable by a linear plane. We can fix that by adding multiple linear regressions one after another.

Let's say that we take the input $latex x$, multiply it by a matrix $latex W_1$, add a bias $latex b_1$ and take the output $latex h = W_1 x + b_1$. We can again multiply it by another matrix and add another bias: $latex y = W_2 h + b_2$. We then take the result of this second operation and use it as our prediction.

We will call this a Neural Network with one hidden layer. The hidden layer is the vector $latex h = W_1 x + b_1$. It is hidden because nobody sees it, it's just something internal to the model. It's up to us to decide what the size of $latex h$ will be. The size is usually called the number of neurons in the hidden layer. We can even add more hidden layers, each with a different size. Unfortunately no one will tell you how many layers you need and how big they should be; you have to play with the parameters and see how the network performs.

To make things even more complicated, you can also use an activation function. An activation function takes the result of each layer and makes it non-linear. So we will get $latex y = W_2 f(W_1 x + b_1) + b_2$, where $latex f$ is the activation function. Popular choices are relu, tanh and, for the last layer, softmax.
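Both activation functions we will meet can be sketched in a few lines of numpy; these are the standard textbook definitions, not code from the post:

```python
import numpy as np

def relu(z):
    # zero out the negative values
    return np.maximum(z, 0)

def softmax(z):
    # exponentiate and normalize, so the outputs are positive and sum to 1
    e = np.exp(z - z.max())  # subtracting the max avoids overflow
    return e / e.sum()

print(relu(np.array([-1.0, 2.0])))    # [0. 2.]
print(softmax(np.array([1.0, 1.0])))  # [0.5 0.5]
```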

Luckily for us, it's quite simple to play with neural networks in Python. There are several libraries, I have picked Keras, since it's quite powerful and easy to use.

```
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import SGD
model = Sequential()
# Hidden layer with 1024 neurons
model.add(Dense(output_dim=1024, input_dim=image_size * image_size, init="glorot_uniform"))
# ReLU activation
model.add(Activation("relu"))
# We have 10 letters
model.add(Dense(output_dim=10, init="glorot_uniform"))
# Softmax makes sense for the activation of the output layer
model.add(Activation("softmax"))
# Let the magic happen
model.compile(loss='categorical_crossentropy', optimizer=SGD())
model.fit(train_dataset, train_labels, batch_size=128, nb_epoch=50)
```

In this example we create a neural network with one hidden layer of 1024 neurons with the ReLU activation function. In other words (ignoring biases), we take the pixels of the letter and multiply them by a matrix, which results in 1024 numbers. We then set all the negative numbers to 0 (ReLU) and multiply the result by another matrix to get 10 numbers. We then find the highest of those numbers. Let's say the highest result is in the second row; then the resulting letter is 'B'.

We of course need to train the network. That is more complex, but luckily we do not need to know much about it. So let's just use Stochastic Gradient Descent with batch size 128. nb_epoch says that we should walk through all training examples 50 times. Please remember that training is just finding the minimum of some loss function.

I am using Keras with Tensorflow, so when I call model.compile it actually generates some optimized distributed parallel C++ code that can even run on a GPU. The code will train the network for us. This is important: large networks with lots of hidden layers and lots of parameters can take ages and a lot of calculations to learn. Keras and Tensorflow hide the complexity so I do not have to worry about it.

And what about the results? We get 96% accuracy on the test set! Another 3% increase over SVM. Since the best result on notMNIST is allegedly 97.1%, I think it's not bad at all. Can you calculate how many parameters we have? In the first layer we have a matrix of 1024x784 parameters + 1024 bias terms. In the output layer we have 10x1024 + 10 parameters. In total it's around 800,000 parameters. Training takes around 20 minutes on my laptop.
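The parameter count is a quick back-of-the-envelope calculation:

```python
hidden_params = 1024 * 784 + 1024  # weight matrix + bias terms
output_params = 10 * 1024 + 10
total = hidden_params + output_params
print(total)  # 814090, i.e. around 800,000
```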

As we already know, Logistic Regression tries to find a linear hyperplane that separates two classes. A hyperplane is a thing that has one dimension less than the thing it is in. If we have a 1-dimensional space (a line), a hyperplane is a zero-dimensional point. If we have a 3D space, a hyperplane is a 2D plane. And if we have a 784-dimensional space, a hyperplane is a 783-dimensional space.

Let's start with a 1D example. We have a line and we want to separate it into two segments. We need to find or define one point that separates those two segments. We can define it like this: $latex w_1 x_1 + b = 0$. If $latex w_1 x_1 + b > 0$, we get a positive value from the equation and the point is in one segment (class). If $latex w_1 x_1 + b < 0$, the result is negative and the sample is in the other class. In 3D it's similar, we just need more parameters: $latex w_1 x_1 + w_2 x_2 + w_3 x_3 + b = 0$. When we plug in our sample, we can again decide if the equation returns a positive or a negative value. Easy.

The generic definition of a hyperplane in n-dimensional space looks like this: $latex w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = 0$. If we use vectors $latex w$ and $latex x$, we can define the hyperplane using vector multiplication: $latex w^T x + b = 0$. But how to get those parameters (a.k.a. weights) $latex w$ and $latex b$? We need to learn them from training samples.
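As a sketch with made-up weights: classification is just checking the sign of the weighted sum plus bias.

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5])  # made-up weights
b = -1.0                        # made-up bias term

def classify(x):
    # positive side of the hyperplane -> class 1, negative side -> class 0
    return 1 if np.dot(w, x) + b > 0 else 0

print(classify(np.array([3.0, 0.0, 0.0])))  # 3 - 1 = 2 > 0, so 1
print(classify(np.array([0.0, 1.0, 0.0])))  # -2 - 1 = -3 < 0, so 0
```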

First of all, we will define the labels as follows: $latex y = 1$ if it's a positive example and $latex y = 0$ if it's a negative one. We will only deal with two classes and cover multiple classes later. In our letter classification we will set $latex y = 1$ if the training example is an 'A'. Now we need to be able to somehow compare the output of $latex w^T x + b$ with $latex y$. The trouble is that $latex w^T x + b$ is an unbounded real number and we need to somehow make it comparable to $latex y$, which is zero or one. That's where we will use the logistic function (hence the name Logistic Regression). The details are described here; for us dummies it suffices to say that the logistic function maps real numbers to the interval 0 to 1 in a way that is good. Like really good. It's so good I will give it a name: $latex g$. So now we have $latex g(w^T x + b)$, which returns numbers between zero and one.
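The logistic (sigmoid) function itself is a one-liner; the standard definition in numpy:

```python
import numpy as np

def g(z):
    # maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(g(0.0))     # 0.5
print(g(100.0))   # very close to 1
print(g(-100.0))  # very close to 0
```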

Thanks to the logistic function we can now compare the results of our model with the training labels. Imagine that we have already obtained parameters $latex w$ and $latex b$; we can measure how well the parameters classify our training set. We just have to somehow calculate the distance from the prediction. In ML the distance is called a cost function. With logistic regression people use cross-entropy and again, I will not go into details here. We just need to know that it returns 0 if our model classifies the sample correctly and some really, really high number if it classifies it incorrectly.
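The cross-entropy cost for a single sample can be sketched like this (standard definition; `prediction` is the output of the logistic function):

```python
import numpy as np

def cost(prediction, y):
    # almost 0 for a confident correct prediction,
    # a really high number for a confident wrong one
    return -(y * np.log(prediction) + (1 - y) * np.log(1 - prediction))

print(cost(0.999, 1))  # tiny
print(cost(0.001, 1))  # huge
```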

If we put it together, we can calculate $latex cost(g(w^T x + b), y)$ for each pair $latex x$ and $latex y$. If the model makes a correct prediction, the cost is zero; if it mis-predicts, the cost is high. We can even calculate a sum over all training examples: $latex J(w, b) = \sum_i cost(g(w^T x^{(i)} + b), y^{(i)})$. If the sum is large, our model is bad, and if the sum is 0, our model is the best. The function $latex J$ is a function of the parameters $latex w$ and $latex b$; we have made our training samples and labels part of the function definition. Now the only thing we need is to find such $latex w$ and $latex b$ that the function is as small as possible.

And that's how the model learns. We needed all those strange functions just to make this final function nice, having only one global minimum and other nice properties. While it's possible to find the minimum analytically, for practical reasons it's usually found by using something called gradient descent. I will not go into the details; it basically just generates random parameters $latex w$ and $latex b$, calculates $latex J(w, b)$ and then makes a small change to the parameters so the next $latex J(w, b)$ is a bit smaller. It then iterates until it reaches the minimum. That's all we need to know for now.
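Gradient descent can be demonstrated on a toy one-dimensional function instead of the real cost; here it minimizes (w - 3)^2, whose minimum is obviously at w = 3:

```python
w = 0.0              # start from an arbitrary point
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (w - 3)            # derivative of (w - 3)^2
    w = w - learning_rate * gradient  # small step downhill

print(w)  # very close to 3
```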

Please note that we need all those strange functions mainly for learning. When classifying, I can still calculate $latex w^T x + b$, and if it's positive, it's the letter 'A' and if it's negative, it's some other letter. The advantage of applying the logistic function is that I can interpret the result as a probability.

What about multiple classes? We can use a one-vs-all classifier. We can train a classifier that is able to separate A's from the rest of the letters. Then B's from the rest, and so on. For our 10 letters we can apply the algorithm described above ten times. Or we can use a few tricks and learn all the letters in one pass.

The first trick is to change our labeling method. Before, we set $latex y = 1$ if the letter in the training set was an 'A'. With multiple letters, we will use $latex y = [1,0,0,\dots]$ for 'A', $latex y = [0,1,0,\dots]$ for 'B', $latex y = [0,0,1,\dots]$ for 'C' etc. We also need to train a different set of parameters (weights) for each letter. We can put all the weights into one matrix $latex W$ and calculate the result using matrix multiplication $latex Wx + b$ ($latex b$ is a 10-dimensional vector now as well). If we put the weights for each letter into $latex W$ as a row, our matrix will have 10 rows (one for each letter) and 784 columns (one for each pixel). Our sample $latex x$ has 784 rows. If we multiply the two, we get 10 numbers. We then element-wise add the 10 numbers from $latex b$ and pick the largest one. That's our prediction. For learning, we will use the softmax function instead of the logistic function, but the rest remains the same. At the end of the learning we will get 10x784 + 10 numbers that define 10 hyperplanes separating letters in 784-dimensional space.

In case you are wondering why you need to know all of this, the reason is simple. Neural networks are built on top of ideas from logistic regression. In other words, logistic regression is a one-layer neural network, so in order to make a real NN, we just need to add more layers. This is how our logistic regression looks if depicted as a single neuron.

SVMs are really similar to Logistic Regression: they try to find a linear (hyper)plane that separates our classes, one class on one side of the plane, the other on the other side. There are two important differences from Logistic Regression. The first one is that SVMs try to find a plane that is as far as possible from both classes. But that's not so interesting. When trying to classify letters, our problem is that most likely you cannot find any plane in our 784-dimensional space (28x28 pixels) that would separate A's from other letters.

Imagine that we got to the situation in the image below. For simplicity only in 2 dimensions, as if our letters had only two pixels. (Images taken from Everything You Wanted to Know about the Kernel Trick.)

Imagine that our A's are in the middle surrounded by their enemies, other letters. I just can not draw a straight line to separate them.

But I can escape to another dimension by creating a third feature like $latex x_1^2 + x_2^2$ (image on the right). Now I can separate the two classes. Look ma, I can draw a circle using a linear function!
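A quick numpy check of the same idea, with made-up data: points inside the unit circle cannot be cut off by a straight line in 2D, but a single threshold on the new squared-distance feature separates them perfectly (by construction here):

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.uniform(-2, 2, size=(100, 2))
labels = (points[:, 0]**2 + points[:, 1]**2 < 1).astype(int)  # 1 = inside circle

# the manually added third feature: x1^2 + x2^2
r2 = points[:, 0]**2 + points[:, 1]**2

# in this new dimension a simple threshold (a linear boundary) works
separable = all((r2 < 1) == (labels == 1))
print(separable)  # True
```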

While I could manually find a new feature that separates the classes, I am too lazy to do that. It's called Machine Learning; I want the machine to do it for me. SVMs can help me with that. I can ask them to use a “kernel” which will make my model non-linear. Here is how it works, hopefully sufficiently dummified.

Imagine I have 20,000 training samples. The first thing the model does is remember them all. It just shamelessly stores all 20k samples. Now imagine a new sample coming to the model. The model calculates the distances from my 20k training samples to the new sample, which results in 20k new features. The first feature is the distance to the first sample in my training set, the second is the distance to the second sample, etc. Suddenly I have a twenty-thousand-dimensional space and it's much more likely that I will find a linear plane that separates both classes. This method intuitively makes sense: points representing A's are likely to be near other points representing A's, so if we somehow incorporate distance into the model, it should get better results. In more clever sounding words, if I project the plane from the 20k-dimensional space back to my original 784-dimensional space, I can get a highly nonlinear boundary.

Now a bit less dummified version (dummified is a word, I swear). Not all 20k training samples are used, only the most interesting ones (they are called Support Vectors, hence the name). So the space we are using has fewer than 20k dimensions. Moreover, the distance is not necessarily a distance, it's just some function that takes two points (vectors) as an input and returns a number (new feature). This function is called a kernel and we pick it when configuring the classifier - Gaussian (rbf) is the most common. And while the method sounds complicated and slow, it uses some mathematical shortcuts, so it's relatively fast. Unfortunately it does not scale well with the number of training samples.

Let's see how the method looks in the code.

```
from sklearn import svm

# RBF (Gaussian) kernel is the default; C and gamma are the knobs we tune
clf = svm.SVC(C=10, gamma=0.01, cache_size=2000)
```

The rest of the code is the same as last time, we've just changed the classifier. We see our favorite (hyper)parameter C. There is also a new parameter 'gamma' which configures the kernel. Smaller gamma makes our Support Vectors reach further and generalize more.
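A quick sketch of why smaller gamma means a longer reach: the Gaussian kernel value decays more slowly with distance, so each Support Vector influences a wider neighbourhood. The distances and gamma values below are my own toy numbers:

```python
import numpy as np

def rbf(distance, gamma):
    """Gaussian (RBF) kernel value for two points at the given distance."""
    return np.exp(-gamma * distance ** 2)

# At the same distance, a smaller gamma keeps the similarity high,
# so the Support Vector still "votes" on far-away points
print(rbf(3.0, gamma=0.01))  # ~0.91
print(rbf(3.0, gamma=1.0))   # ~0.0001
```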

So what are the results? We got to 99.7% accuracy on the training set. It's not surprising, the model actually remembers some of the training samples. If we call 'clf.support_.shape', it tells us that it remembers 8,670 samples out of 20,000. It means that our model has to learn 8,670 parameters for each classified class. Let's count how many numbers the model needs to remember. It stores 8,670 samples, each of them having 28x28=784 values. We are trying to classify 10 letters, so the model learns 10x8,670 parameters. (Actually 9x8,670, but let's not go there.) All in all, the model consists of almost 7,000,000 parameters (numbers), most of them in the form of Support Vectors picked from the training samples. The training takes less than 3 minutes on my laptop.
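The counting above can be double-checked with a few lines of arithmetic. All the numbers are taken from the text; the split into stored pixel values and per-class coefficients is my own bookkeeping:

```python
n_sv = 8670       # Support Vectors reported by clf.support_.shape
pixels = 28 * 28  # 784 stored values per Support Vector
n_classes = 10

stored = n_sv * pixels    # the Support Vectors themselves
coefs = n_classes * n_sv  # learned coefficients, one set per class

print(stored)          # 6797280
print(stored + coefs)  # 6883980 -- "almost 7,000,000"
```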

What are the results on the validation set, on the samples that the model has not seen? It is 87%. The nonlinearity of the model gave us five more percent compared to Logistic Regression. It's great. The author of the notMNIST set says that the set has a ~6.5% label error rate, so we need six more percent to get perfect. To confirm that, I could not resist and ran the classification on the test set, which is much cleaner. I got 93% accuracy. Like, 93% on these crazy images? Incredible. We still have a further 6-7% to go and I am afraid we will have to resort to neural networks to get better results.

Sources:

Coursera Machine Learning Course

Everything You Wanted to Know about the Kernel Trick

scikit learn documentation on RBF SVM parameters

scikit learn documentation on SVC

And we have exactly the same problem in machine learning. If you do not have many training examples and you have a complicated model, the learning algorithm may just memorize the training samples without trying to generalize. It basically does the same thing my classmates did in the biology lessons. That is great if you only want to classify samples from the training set. If you, however, want to classify unknown samples, you want your algorithm to generalize.

Let's look at some examples. I will use the same algorithm as last time on notMNIST data, I will just make the training set smaller, let's say 1k samples. In the notMNIST set there are 10 letters, A-J. We have shuffled our training set, so in those 1k samples there will be around 100 samples of each letter. I am using Logistic Regression in a 784-dimensional space which maps to those 10 classes. We also have bias elements for each class, so the model has 7,850 parameters. Parameters are just some numbers which the algorithm learns based on the samples. We dummies do not need to know more. So the algorithm learns ~8,000 numbers from 1,000 samples. Or, for each letter, ~800 parameters from ~100 samples. There is quite a substantial danger that the model will just memorize the training set.

```
from sklearn import linear_model
clf_l = linear_model.LogisticRegression()
# we do not want to use the whole set
size = 1000
# images are 28x28, we have to make a vector from them
tr = train_dataset[:size,:,:].reshape(size, 784)
# This is the most important line. I just feed the model training samples and it learns
clf_l.fit(tr, train_labels[:size])
# This is prediction on the training set
prd_tr = clf_l.predict(tr)
print(float(sum(prd_tr == train_labels[:size]))/prd_tr.shape[0])
# let's check if the model is not cheating, it has never seen these letters before
prd_l = clf_l.predict(valid_dataset.reshape(valid_dataset.shape[0], 784))
print(float(sum(prd_l == valid_labels))/prd_l.shape[0])
```

And indeed, the model has been able to correctly classify 99% of the samples from the training set. That's why we have the validation set. We will use it to check our model. It's as if our biology teacher had one set of images for the lessons and another set for the exams. Actually, it's more complicated - we also have a test set, but we will not be using that one so often, so let's forget about it for now.
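scikit-learn has a helper for exactly this lesson/exam split. A toy sketch (the data here is made up; the course notebook ships its own pre-split arrays):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 toy "images"
y = np.arange(50) % 10             # 10 toy classes

# the model trains on the "lesson" part and is graded on the "exam" part
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_valid.shape)  # (40, 2) (10, 2)
```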

Our model is able to recognize 74% of the samples from our validation set. Not too bad. Luckily for us, Logistic Regression is just trying to find a linear (flat) plane to split the space, so it is not able to cheat much. It can not simply pick the training examples, it has to take half of the space with them. But let's try to force it to cheat anyway. Logistic Regression has a parameter C (as in cheating) which says how much we allow the algorithm to over-fit. Higher C means more cheating, lower C means less cheating (for those who are used to the more usual notation, C = 1/lambda).

```
# Cheating level Carnage
clf_l = linear_model.LogisticRegression(C=100)
```

If we tune the cheating parameter to 100, we get 99.9% success on the training set and 71% on the validation set. If we use a value from the other side of the spectrum, C=0.001, we get 75% on the training set and 76% on the validation set. It's up to us to decide what is better. Here is a table for some values of C for 1k samples.

C | tr. set | valid. set |
---|---|---|
0.001 | 75% | 76% |
0.01 | 81% | 79% |
0.1 | 92% | 78% |
1 | 99% | 75% |
10 | 99% | 72% |
100 | 99.9% | 71% |
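A sweep like the table above is just a loop over C. Here is a sketch on synthetic data (make_classification stands in for the notMNIST arrays, so the numbers will differ from the table):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notMNIST arrays used in the post
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           n_classes=4, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_tr, y_tr)
    print(C, round(clf.score(X_tr, y_tr), 3), round(clf.score(X_val, y_val), 3))
```

The same pattern emerges: training accuracy climbs with C while validation accuracy peaks somewhere in the middle.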

Another way of preventing overfitting is to provide more training examples. Imagine our biology lesson with many more samples with different numbers. From a certain number of samples, it is much easier to try to learn how to recognize those damn flowers than to try to memorize all the numbers in the corner. And the same applies to machine learning algorithms too. If I use C=100 with 10k training samples, I get 93% accuracy on the training set and 74% on the validation set. You can notice that it's actually worse than most of the results with 1k examples and different C values. But we can combine both approaches. Let's say that I pick C=0.01 and use 20k training samples. I get 83% accuracy on the training set and 82% on the validation set. As you can see, both numbers are converging. There is only a 1% difference. It means that we are approaching the limits of our model. Using more samples is not likely to help much. I am afraid we have to move to a better model to get better results. That's it for today. You can go now.
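The "more data narrows the gap" effect can be sketched the same way, again on synthetic stand-in data rather than notMNIST:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=50, n_informative=10,
                           n_classes=4, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for size in [200, 1000, len(X_tr)]:
    clf = LogisticRegression(C=100, max_iter=1000).fit(X_tr[:size], y_tr[:size])
    gap = clf.score(X_tr[:size], y_tr[:size]) - clf.score(X_val, y_val)
    # the train/validation gap typically shrinks as the training set grows
    print(size, round(gap, 3))
```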

Update: Fixed number of parameters

]]>Since Deep Learning is the current hype, I have started with the Deep Learning course at Udacity. A deep learning course from Google right in my living room? Cool.

And the course went quite well until the very first exercise, which was about recognizing the notMNIST dataset.

You have these “letters” and you want to recognize them. Sounds like a job for machine learning, right?

To finish the exercise, you need to know something about machine learning and Python first. I knew neither. Python is quite simple; if you know programming, you will learn it in no time. The most complex task is the installation. I spent about six hours trying to install all the required libraries. If you do not know Python, you just have no idea whether you want to use OS-specific packages (apt, brew), pip, easy_install or whatnot. The thing is, in Java you have projects that describe their dependencies using Maven or Gradle. So if you want to work on a Java project, you just execute 'mvn test', it downloads half of the Internet and you are done. But in Python you have no project, so you have to download all the libraries yourself. What's more, those libraries depend on each other and you need to have the correct versions. Well, I downloaded and recompiled half of the Internet multiple times, every time getting a different conflict (most often between matplotlib and numpy). Luckily, people using Python know that their package systems suck, so they provide Docker images with all the libraries installed. I have used the image to figure out which versions of the libraries work together. Today, I tried to install it again on Mac OS and was able to do it in under an hour using Anaconda, so maybe this Python distribution system works better than the others.

But once you have Python with all the libraries installed, you get your superpowers. Thanks to numpy you can easily work with matrices, thanks to matplotlib you can plot charts, and thanks to scikit-learn you can do machine learning without knowing much about it. That's what I need.

So back to our notMNIST example. After reviewing most of the Coursera course, I have learned that in machine learning you need features – the data you will use in your algorithm. But we have images, not features. The most naive approach is to take the pixels and use them as features. Images in notMNIST are 28x28 pixels, so for each image we have 784 numbers (shades of gray). We can create a 784-dimensional space and represent each image as one point in this space. Please think about it for a while. I create a really high-dimensional space and represent each sample (image) by one point in the space. The idea is that if I have the right set of features, the points in this space will become separable. In other words, I will be able to split the space into two parts, one that contains points representing the letter A and another part with all the other letters. This is called Logistic Regression and there are implementations that do exactly that. Basically, you just need to feed the algorithm samples with correct labels and it tries to figure out how to separate those points in the high-dimensional space. Once the algorithm finds the (hyper)plane that separates 'A's from other letters, you have your classifier. For each new letter you just check in which part of the space the corresponding point is and you know whether it's an A or not. Piece of cake.
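The "pixels as features" step is literally one reshape. A toy sketch, with a random array standing in for a notMNIST letter:

```python
import numpy as np

image = np.random.rand(28, 28)  # toy stand-in for one notMNIST letter
features = image.reshape(784)   # 784 "shades of gray" features, one per pixel
print(features.shape)           # (784,)

# A whole dataset of n images becomes an (n, 784) matrix the same way
batch = np.random.rand(100, 28, 28).reshape(100, 784)
print(batch.shape)              # (100, 784)
```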

But if you think about it, it just can not work. You just feed it raw pixels. Look at the images above - you are trying to recognize the letter 'A' just based on the values of the pixels. But each of the 'A's above is completely different. The learning algorithm just can not learn to recognize the letter based on individual pixels. Or can it?

What's more, I am trying to find a flat plane that would separate all A's in my 784-dimensional space from all the other letters. What if the A's in my crazy space form a little island which is surrounded by other letters? I just can not separate them by a flat plane. Or can I?

Well, surprisingly, I can, and it is quite easy to implement:

```
from sklearn import linear_model
clf_l = linear_model.LogisticRegression()
# we do not want to use the whole set, my old CPU would just melt
size = 20000
# images are 28x28, we have to make a vector from them
tr = train_dataset[:size,:,:].reshape(size, 784)
# This is the most important line. I just feed the model 20k examples and it learns
clf_l.fit(tr, train_labels[:size])
# This is prediction on the training set
prd_tr = clf_l.predict(tr)
print(float(sum(prd_tr == train_labels[:size]))/prd_tr.shape[0])
# prints 0.86, we are able to recognize 86% of the letters in the training set
# let's check if the model is not cheating, it has never seen
# the letters from the validation set before
prd_l = clf_l.predict(valid_dataset.reshape(valid_dataset.shape[0], 784))
print(float(sum(prd_l == valid_labels))/prd_l.shape[0])
# Cool, 80% matches
```

Even this method that just can't work is able to classify 80% of the letters it has never seen before.

You can even easily visualize the letters that did not match:

```
import numpy
import matplotlib.cm as cm
import matplotlib.pyplot as plt

def num2char(num):
    return chr(num + ord('a'))

# indexes of letters that do not match
# change the last number to get another letter
idx = numpy.where(prd_l != valid_labels)[0][17]
print("Model predicts: " + num2char(clf_l.predict(valid_dataset[idx].reshape(1, 784))[0]))
print("Should be: " + num2char(valid_labels[idx]))
%matplotlib inline
plt.imshow(valid_dataset[idx,:,:] + 128, cmap=cm.Greys_r)
```

This should have been a G, but the model classified it as a C. An understandable mistake. Another letter is even better.

This obvious 'f' has been incorrectly classified as an A. Well, to me it looks more like an A than an f. The thing is that even this simple, naive model mostly makes “understandable” mistakes. Either the misclassified letter is not a letter at all, or the classifier makes a mistake because the letters are really similar. Like I and J, I and F, and so on.
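Mix-ups like I vs J are usually inspected with a confusion matrix. A sketch on made-up labels (0, 8, 9 standing for A, I, J; the predictions are invented for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 0='A', 8='I', 9='J'
true_labels = np.array([0, 0, 8, 8, 8, 9, 9, 9])
predictions = np.array([0, 0, 8, 9, 8, 9, 8, 9])  # I and J get mixed up

# Row = true letter, column = predicted letter; off-diagonal cells
# between the I and J rows show exactly the confusions described above
cm = confusion_matrix(true_labels, predictions, labels=[0, 8, 9])
print(cm)
```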

So what have we learned today? First of all, that even a stupid model on stupid features can work surprisingly well. Secondly, that Python libraries are not easy to install but are easy to use. Here I have to admit that similar code in Java would have been just crazy.

Next time we will try SVMs and maybe figure out some better features to make our model even better.

]]>