
Mistakes I made designing JsonUnit fluent API

Several years ago, I decided that it would be nice to add a fluent API to the JsonUnit library. If you are not familiar with fluent assertion APIs, this is what it looks like:

assertThatJson("{\"test\":1}").node("test").isEqualTo(1);

assertThatJson("[1 ,2]").when(IGNORING_ARRAY_ORDER).isEqualTo("[2, 1]");

I really like fluent assertions. First of all, they are really easy to use: you get help from your IDE, unlike when using Hamcrest static methods. Moreover, they are quite easy to implement. This is how the original implementation looked:

public class JsonFluentAssert {

    protected JsonFluentAssert(...) {
    }

    public static JsonFluentAssert assertThatJson(Object json) {
        return new JsonFluentAssert(...);
    }

    public JsonFluentAssert isEqualTo(Object expected) {
        ... 
    }


    public JsonFluentAssert node(String path) {
        ...
    }


    public JsonFluentAssert when(Option firstOption, Option... otherOptions) {
        ...
    }

    ...
}

We have a static factory method assertThatJson() that creates a JsonFluentAssert, and all other methods return the same class so you can chain them together. Nice and simple. Unfortunately, there are three mistakes in this API. Do you see them? Congratulations if you do. If not, do not be sad; it took me several years to see them.

Unnecessarily complicated

The biggest mistake is that the API supports chaining even after the assertion method isEqualTo() is called. It seems I designed it this way on purpose, since there is a test like this:

assertThatJson("{\"test1\":2, \"test2\":1}")
    .node("test1").isEqualTo(2)
    .node("test2").isEqualTo(1);

The problem is that to support such a rare use case, the API is now more error-prone. I got several bug reports complaining that this does not work:

assertThatJson("[1 ,2]").isEqualTo("[2, 1]").when(IGNORING_ARRAY_ORDER);

It looks reasonable, but it cannot work. The comparison has to be done in isEqualTo(), and if it fails, an assertion error is thrown, so the when() method is never called at all. In the current design the comparison cannot be postponed, since we do not know which method is the last one in the invocation chain. OK, how do we fix it? Ideally, isEqualTo() should have returned void or maybe some simple type. But I cannot change the contract; it would break the API.

If I cannot change the API, I can do the second best thing: mark it as deprecated. This way the user at least gets a compile-time warning. It seems simple; I just need to return a new type from isEqualTo() with the when() method marked as deprecated.

Methods return a class, not an interface

Here comes the second mistake: isEqualTo() and the other methods return a class, not an interface. An interface would have given me more room to maneuver. And again, I cannot introduce an interface without potentially breaking backwards compatibility. Some clients might have stored the expression result in a variable like this:

JsonFluentAssert node1Assert =
    assertThatJson("{\"test1\":2, \"test2\":1}").node("test1");

If I want to mark the when() method as deprecated when it is called after an assertion, isEqualTo() has to return a subclass of JsonFluentAssert, like this:

public class JsonFluentAssert {

    private JsonFluentAssert(...) {
    }

    public static JsonFluentAssert assertThatJson(Object json) {
        return new JsonFluentAssert(...);
    }

    public JsonFluentAssertAfterAssertion isEqualTo(Object expected) {
        ... 
    }


    public JsonFluentAssert node(String path) {
        ...
    }


    public JsonFluentAssert when(Option firstOption, Option... otherOptions) {
        ...
    }
    ...
    public static class JsonFluentAssertAfterAssertion extends JsonFluentAssert {

        @Override
        @Deprecated
        public JsonFluentAssert when(Option firstOption, Option... otherOptions) {
            return super.when(firstOption, otherOptions);
        }
   }
}

Extensibility

The third mistake was making the class extensible by making the constructor protected instead of private. I doubt anyone actually extends the class, but I cannot know for sure. And I cannot change the signature of isEqualTo() without breaking subclasses.

Solution

It's a pretty tricky situation. I can either keep the broken API or break backward compatibility; a tough choice. In the end I decided to bet on the fact that no one is extending JsonFluentAssert, and I made the change depicted above. I may be wrong, but there was nothing else I could do; I do not want to live with a bad API design forever. If you have a better solution, please let me know. The full source code is available here.

Machine Learning – word2vec results

Last time we discussed the word2vec algorithm; today I'd like to show you some of the results.

They are fascinating. Please remember that word2vec is an unsupervised algorithm: you just feed it a lot of text and it learns by itself. You do not have to tell it anything about the language, grammar, or rules; it just learns by reading.

What's more, people from Google have published a model that is already trained on Google News, so you can just download the model, load it into your Python interpreter, and play. The model is about 3.4 GB and contains 3 million words, each of them represented as a 300-dimensional vector. Here is the code I used for my experiments.

from gensim.models import Word2Vec
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# take father, subtract man and add woman
model.most_similar(positive=['father', 'woman'], negative=['man'])

[('mother', 0.8462507128715515),
 ('daughter', 0.7899606227874756),
 ('husband', 0.7560455799102783),
 ('son', 0.7279756665229797),
 ('eldest_daughter', 0.7120418548583984),
 ('niece', 0.7096832990646362),
 ('aunt', 0.6960804462432861),
 ('grandmother', 0.6897341012954712),
 ('sister', 0.6895190477371216),
 ('daughters', 0.6731119751930237)]

You see, you can take the vector for "father", subtract "man", add "woman", and you get "mother". Cool. How does it work? As we discussed last time, word2vec groups similar words together, and luckily it also somehow discovers relations between the words. While it's hard to visualize the relations in 300-dimensional space, we can project the vectors to 2D.

plot("man woman father mother daughter son grandmother grandfather".split())

[Figure: Family relationships]

Now you see how it works: if you want to move from "father" to "mother", you just move down by the same distance as between "man" and "woman". You can see that the model is not perfect. One would expect the distance between "mother" and "daughter" to be the same as between "father" and "son"; here it is much shorter. Actually, maybe it corresponds to reality.
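
The plot() helper used here is not part of gensim and is not shown in the post; here is a minimal sketch of how it might look, assuming a PCA projection from scikit-learn and matplotlib for the drawing (the original could just as well use t-SNE). It reuses the model variable loaded above.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot(words):
    # look up the 300-dimensional vector for each word in the pretrained model
    vectors = [model[word] for word in words]
    # project the vectors to 2D so we can draw them
    points = PCA(n_components=2).fit_transform(vectors)
    plt.scatter(points[:, 0], points[:, 1])
    for word, (x, y) in zip(words, points):
        plt.annotate(word, (x, y))
    plt.show()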

Let’s try something more complicated

plot("pizza hamburger car motorcycle lamb lamp cat cow sushi horse pork pig".split())

[Figure: Food]

In this example, we lose a lot of information by projecting to 2D, but we can see the structure anyway. We have food on the left, meat in the middle, animals on the right, and inanimate objects at the top. "Lamb" and "lamp" sound similar, but that did not confuse the model. It just is not sure whether a lamb is meat or an animal.

And now for something completely different: names.

plot("Joseph David Michael Jane Olivia Emma Ava Bill Alex Chris Max Theo".split())

[Figure: Names]

Guys on the left, gals on the right, and unisex names in the middle. I wonder, though, what the vertical axis means.

I have played with the model for hours and it does not cease to surprise me. It just reads text and learns so much; incredible. If you do not want to play with my code, you can also try it online.

Machine Learning for dummies – word2vec

Last time we skimmed through neural networks. Today I'd like to describe one cool algorithm based on them.

Up until now we have worked on character recognition in images. It's quite straightforward to convert an image to numbers: we just take the pixel intensities and we have numbers. But what do we do if we want to work with text? We cannot do matrix multiplication with words; we need to somehow convert them to numbers. One way to do it is to take all the words and number them: 'aardvark' would be 1, 'aardwolf' 2, etc. The problem with this approach is that similar words get completely unrelated numbers. Let's say you are working on image recognition. You want a model that says "This is most likely a cat, but maybe it's a kitten or a tiger. It definitely is something cat-like." For this it's better to have numeric representations of cat, kitten, and tiger that are similar. Since we are dummies, I will not try to find mathematical reasons for this. Intuition tells me that pictures of a kitten and a cat are quite similar, so it makes sense that the outputs of the learning algorithm should be similar as well. It's much harder to teach it that if the cat has number 10 and the kitten 123,564.
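
To make "similar representations" concrete: one common way to measure similarity between such vectors is the cosine of the angle between them. Here is a toy sketch with made-up 3-dimensional vectors (the real ones have hundreds of dimensions):

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors; close to 1.0 means "very similar"
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# made-up vectors, purely for illustration
cat    = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.8, 0.9, 0.2])
pizza  = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(cat, kitten))  # high, the concepts are related
print(cosine_similarity(cat, pizza))   # much lower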

But how do we get such representations? We can use word2vec, which allows us to map words to an n-dimensional vector space in a way that puts similar words close together. The trick is quite simple: similar words are used in similar contexts. I can say "I like pizza with cheese" or "I like hamburger with cheese". Here the contexts are similar; just the food is different. Now I just need some algorithm that reads a lot of text, somehow finds words that are used in similar contexts, and puts them together.

Word2vec takes a neural network and teaches it to guess contexts based on words. As an input I can take the word "pizza" and try to teach the network to come up with "I like … with cheese". This approach is called skip-gram. Or I can try the other direction and teach the network to answer "pizza" if I feed it "I like … with cheese." This direction is called CBOW. There are some differences in the results, but they are not important for now. In the rest of this text I will describe skip-gram with 4 words in the context.
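
To make the skip-gram setup concrete, here is a small sketch of my own (not word2vec code) that generates the (input word, context word) training pairs with a window of 2 words on each side, i.e. up to 4 context words:

def skipgram_pairs(tokens, window=2):
    # for every word, pair it with each word up to `window` positions away
    pairs = []
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                pairs.append((word, tokens[j]))
    return pairs

print(skipgram_pairs("i like pizza with cheese".split()))
# [('i', 'like'), ('i', 'pizza'), ('like', 'i'), ('like', 'pizza'), ('like', 'with'), ...]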

Let's take a look at the details. We will take a simple neural network with one hidden layer. On the input, I will use one-hot encoding: 'aardvark' would be [1,0,0,…], 'aardwolf' [0,1,0,…]; each word is represented by a vector of zeros with a single '1'. If my dictionary has 50k words, I end up with a 50k-dimensional input vector. We have the input; let's move to the hidden layer. The size of the hidden layer is up to me, and in the end it will be the size of the vector representing the word. Let's pick 128. Then there will be an output layer that maps the data back to a one-hot encoded vector of context words. The network takes a large sparse vector, squeezes it into a much smaller and denser one, and then unsqueezes it back into a large vector.

I take "pizza" and convert it to a 50k-dimensional vector with only one nonzero value. Then I multiply this vector with a 128×50k-dimensional matrix and get a 128-dimensional vector. I take this vector, multiply it with another 50k×128-dimensional matrix, and get a 50k-dimensional vector. After the right normalization (softmax), this vector contains the probability that a given word occurs in the context of the word "pizza". The value at the first position will be quite low; aardvark and pizza are not usually used in the same sentence. Actually, I just did that, so the next time this text gets fed to a learning algorithm, the first value in the vector will be slightly larger.
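
Here is a minimal numpy sketch of that forward pass, just to show the shapes. The weights are random stand-ins for the trained matrices, the index of "pizza" in the dictionary is made up, and I store the first matrix as 50k×128 rather than 128×50k, which only changes on which side the multiplication happens.

import numpy as np

vocab_size, hidden_size = 50_000, 128

W1 = np.random.rand(vocab_size, hidden_size)  # input -> hidden (the "embedding" matrix)
W2 = np.random.rand(hidden_size, vocab_size)  # hidden -> output

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

pizza_index = 42                    # hypothetical position of "pizza" in the dictionary
one_hot = np.zeros(vocab_size)
one_hot[pizza_index] = 1.0

hidden = one_hot @ W1               # 128-dimensional vector
scores = hidden @ W2                # 50k-dimensional vector of raw scores
probabilities = softmax(scores)     # probability of each word appearing in the context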

Of course I have to somehow get those matrices, but that's just neural network training with some dirty tricks to make it computable. The trouble is that the dictionary can be much larger than 50k words, so the learning phase is not trivial. But I am a dummy, I do not care. I just need to know that I feed it the whole internet (minus porn) and it somehow learns to predict the context.

OK, we have a neural network that can predict the context; where do I get the vectors representing the words? They are in the first matrix of the model. That's the beauty of it. Imagine that I have "pizza" on the input. It's a vector of all zeros and one "1". The matrix multiplication is not really a multiplication at all; it just picks one column of the first matrix in the model. One column is just a 128-dimensional vector, and when we multiply it with the second matrix we want to get the probabilities that given words are in the context. I guess that "like", "eat", "dinner", "cheese", or "bacon" will be quite high. Now let's feed in "hamburger". It will pick another column from the first matrix, but we expect similar words in the result. Not the same, but similar. And to get similar results, we need a similar vector in the hidden layer, so that when we multiply it with the second matrix we get similar context words. But the vector in the hidden layer is just a column in the first matrix. It means that in order to get similar results for words with similar contexts, the columns in the first matrix have to be similar (well, they do not have to, but it usually ends up this way).
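
A tiny sketch of that last point: multiplying a one-hot vector with the first matrix does not really compute anything, it just selects one row (or column, depending on how you orient the matrix), and that row is the word's vector.

import numpy as np

vocab_size, hidden_size = 5, 3       # tiny sizes so the effect is easy to see
W1 = np.arange(vocab_size * hidden_size).reshape(vocab_size, hidden_size)

pizza_index = 2                      # hypothetical position of "pizza"
one_hot = np.zeros(vocab_size)
one_hot[pizza_index] = 1.0

print(one_hot @ W1)                  # [6. 7. 8.] -- the "multiplication"...
print(W1[pizza_index])               # [6 7 8]    -- ...just picks this row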

It's quite a beautiful and powerful algorithm. What's more, the positions of the vectors in the vector space have some structure too. You have most likely seen the example where they take the vector for "king", subtract the vector for "man", add the vector for "woman", and get dangerously close to the vector for "queen". I have no idea why it works this way, but it does.
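
With the pretrained GoogleNews model from the results post above, this is a one-line check (the exact similarity score depends on the model you load):

from gensim.models import Word2Vec
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# expected to return something like [('queen', ...)]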

That's all for today; next time I will try to show you some results. Here is a visualization you can admire in the meantime.

Sources:
Tensorflow tutorial
Nice (long) lecture about word2vec details
Visualization of High Dimensional Data using t-SNE with R
Just Google the rest