The article here describes a clever way to generate triplets for a convolutional neural network that produces face embeddings.
For a mini-batch with n images, only the semi-hard triplets are used for learning: triplets whose negative is farther from the anchor than the positive is, but still within the margin.
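The semi-hard condition can be written down directly. A minimal sketch with NumPy, where the distances and the margin `alpha` are made-up illustrative values, not numbers from the article:

```python
import numpy as np

alpha = 0.2                           # margin (illustrative value)
d_ap = 0.5                            # distance(anchor, positive)
d_an = np.array([0.3, 0.6, 0.9])      # distances(anchor, each negative)

# Semi-hard negatives lie farther from the anchor than the positive,
# but still inside the margin: d_ap < d_an < d_ap + alpha.
semi_hard = (d_an > d_ap) & (d_an < d_ap + alpha)
# Here only the middle negative (0.6) qualifies.
```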
- How is the training set created? What does a batch contain?
In our experiments we sample the training data such that around 40 faces are selected per identity per mini-batch. Additionally, randomly sampled negative faces are added to each mini-batch.
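As I understand that quote, a mini-batch groups many images per identity plus some random other faces. A sketch of such a sampler under my own assumptions (the function name, parameters, and defaults are mine, not from the paper):

```python
import random

def sample_minibatch(images_by_identity, ids_per_batch=2, faces_per_id=40,
                     extra_negatives=5, rng=random.Random(0)):
    """Take ~faces_per_id images for each of a few identities, then pad
    the batch with randomly sampled faces of other people."""
    identities = rng.sample(sorted(images_by_identity), ids_per_batch)
    batch = []
    for ident in identities:
        imgs = images_by_identity[ident]
        take = min(faces_per_id, len(imgs))  # some people have few images
        batch += [(ident, img) for img in rng.sample(imgs, take)]
    # Random negatives: images of identities not already in the batch.
    others = [(i, img) for i, imgs in images_by_identity.items()
              if i not in identities for img in imgs]
    batch += rng.sample(others, min(extra_negatives, len(others)))
    return batch
```

Every image keeps its identity label, so anchor/positive/negative roles can be assigned later, inside the batch.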
What I did
I used the Labeled Faces in the Wild dataset for training (13,233 images, 5,749 people, 1,680 people with two or more images). For each batch I chose one anchor and some positives (which means I could only build 1,680 batches, because I need more than one image of ONE person), plus negatives: randomly selected images of other people.
Something is wrong with my training set. Should a mini-batch contain more anchors?
Instead of picking the hardest positive, we use all anchor-positive pairs in a mini-batch while still selecting the hard negatives
- Online triplet generation? How is it done? (technical details are welcome)
Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.
To select the semi-hard negatives I need to compute the embeddings of my triplets. So I need to make a forward pass through the triplet network, compare the embeddings, and then compute the loss using only the semi-hard triplets. That's what I think I have to do.
I used three convolutional neural networks with shared parameters (four convolutional layers with max pooling and one fully connected layer). I haven't used online triplet generation yet, because I can't understand how it's done. The result is no more than 70% accuracy.
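One thing worth noting: "three networks with shared parameters" is just one embedding function applied three times. A toy sketch (a single linear layer with L2 normalisation standing in for the conv + FC stack; everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))   # stand-in for the shared conv/FC weights

def embed(x):
    """Forward pass of the one shared network, L2-normalised output."""
    e = x @ W
    return e / np.linalg.norm(e)

# The "three networks" are the same function on three inputs:
anchor, positive, negative = (rng.standard_normal(8) for _ in range(3))
e_a, e_p, e_n = embed(anchor), embed(positive), embed(negative)
```

So in code there is a single set of weights and one forward function; the triplet structure only appears in the loss.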