Finetuning ResNet50 on Totally Looks Like Dataset#

In this tutorial, we will use Finetuner to finetune embeddings on Totally Looks Like dataset (TLL) dataset. The dataset consists of 6016 pairs of images (12032 in total).


Totally-Looks-Like is a dataset and benchmark challenging machine-learned representations to reproduce human perception of image similarity. As shown below, each image patch in the left has a corresponded similar image patch in the right.


The rationale for choosing TLL dataset is

  • it is a relatively small dataset and doesn’t require training a large network from scratch. So we train only part of the ResNet-50 by freezing layers.

  • it consists of pairs of images that can be formed as positive pairs (same classes). A random image can be considered as a negative pair. We construct triplet and use the TripletLoss.

After finetuning, the distance between positive pairs is expected to be pulled closer, while the distance between positive and negative pairs is expected to be pushed away.


Data preparation#

We will download and, as stated before, each of them consists of 6016 images which can be formed into pairs based on the same file name.

pip install gdown
pip install finetuner
pip install torchvision



Afterwards, we load all images from unzipped left and right folders and turn them into sorted order as DocArray DocumentArray. 80% of the dataset will be used to train the machine learning model. While 20% of the dataset will be used to evaluate the quality of embeddings on the search task.

from docarray import DocumentArray

left_da = DocumentArray.from_files('left/*.jpg')
right_da = DocumentArray.from_files('right/*.jpg')
# we use 80% for training machine learning model.
left_da = DocumentArray(sorted(left_da, key=lambda x: x.uri))
right_da = DocumentArray(sorted(right_da, key=lambda x: x.uri))

ratio = 0.8
train_size = int(ratio * len(left_da))

train_da = left_da[:train_size] + right_da[:train_size]

Preparing training data#

After loading data into DocArray DocumentArray, we can prepare documents for training. Finetuner will do the most challenging work for you, all you need to do is:

  1. Assign a label into each Document named finetuner_label as its class name.

  2. Perform pre-processing for the document. In this case, we load the image from URI, normalize the image and reshape it from H, W, C to C, H W will C is the color channel of the image.

def assign_label_and_preprocess(doc):
    doc.tags['finetuner_label'] = doc.uri.split('/')[1]
    return doc.load_uri_to_image_tensor().set_image_tensor_normalization().set_image_tensor_channel_axis(-1, 0)


Choosing the base model#

We create a pre-trained ResNet-50 model from torchvision, and since we want to learn a better embedding, the first thing we need is to see which layer is suitable to use as an embedding layer. You can call finetuner.display(model, input_size) to plot the model architecture.

import finetuner as ft
import torchvision

resnet = torchvision.models.resnet50(pretrained=True)
ft.display(resnet, (3, 224, 224))

You can get more information in Tailor docs. Since the model is pre-trained on ImageNet on a classification task, the output fc layer should not be considered a embedding layer. We can use the pooling layer adaptiveavgpool2d_173 as the output of the embedding model. This layer generates a 2048 dimensional dense embedding as output.


Selecting an appropriate embedding layer is non-trivial. In general, you have to remove the task-specific top from the pre-trained model.


Model training is straight forward in Finetuner. You’ll need to configure several hyperparameters, plug in your model and training set, and that’s it.

The script below demonstrates how to combine Tailor + Tuner interface for model finetuning. The parameter above to_embedding_model=True are Tuner parameters, the rest are Tailor parameters.

We save the returned embedding model as tuned_model, and give an input image at inference time. The model then generates a representation of the image (2048 dimensional vectors).

import finetuner as ft
from finetuner.tuner.pytorch.losses import TripletLoss
from finetuner.tuner.pytorch.miner import TripletEasyHardMiner

tuned_model =
    loss=TripletLoss(miner=TripletEasyHardMiner(neg_strategy='hard'), margin=0.3), 
    input_size=(3, 224, 224),
    freeze=['conv2d_1', 'batchnorm2d_2', 'conv2d_5', 'batchnorm2d_6', 'conv2d_8', 'batchnorm2d_9', 'conv2d_11', 'batchnorm2d_12'],

But how does it work?

  1. Finetuner will look into your labels defined in the tag of the DocArray Document, find the positive sample and find a hard-negative (as we use neg_strategy='hard'') sample as triplets.

  2. Finetuner try to optimize the TripletLoss, aiming at pulling documents with the same classes closer, while pushing documents with different class away.


Evaluating the embedding quality#

We’ll use [email protected] to measure the quality of the representation on the search task. [email protected] means for all the test data, how likely the positive match ranked within the top 10 matches with respect to the query Document.

Remember that we have the train_da ready, now we need to perform the same preprocessing on test DocumentArray:

def preprocess(doc):
    return doc.load_uri_to_image_tensor().set_image_tensor_normalization().set_image_tensor_channel_axis(-1, 0)

test_left_da = left_da[train_size:]
test_right_da = right_da[train_size:]


And we create embeddings on our test set using the finetuned model:

# use finetuned model to create embeddings, only test data
test_left_da.embed(tuned_model, device='cuda')
test_right_da.embed(tuned_model, device='cuda')

Last but not least, we match test_left_da against test_right_da.

You can consider test_left_da as user queries, and test_right_da as our indexed document collection. For each test_left_da, match function will find top-10 nearest embeddings in test_right_da. And we evaluate result with [email protected]

test_left_da.match(test_right_da, limit=10)

def hit_rate(da, topk=1):
    hit = 0
    for d in da:
        for m in d.matches[:topk]:
            if d.uri.split('/')[-1] == m.uri.split('/')[-1]:
                hit += 1
    return hit/len(da)

for k in range(1, 11):
    print(f'[email protected]{k}:  finetuned: {hit_rate(test_left_da, k):.3f}')

Compare the finetuned model and pre-trained model, how much performance gain we get?

We did an experiment using pretrained ResNet50 on ImageNet, by chopping off the last classification layer, the rest of the model serves as a feature extractor.

The result are shown in the table below:

[email protected] pre-trained fine-tuned
[email protected] 0.068 0.122
[email protected] 0.142 0.230
[email protected] 0.183 0.301

Now let’s look at some results where finetuned model performed well and pretrained ResNet50 performed bad.


Here are some results in reverse: i.e. pretrained ResNet50 performed well but finetuned performed bad