Text-to-Image Search via CLIP

This guide showcases fine-tuning a CLIP model for text-to-image retrieval.


We’ll be fine-tuning CLIP on the fashion captioning dataset, which contains information about fashion products.

For each product, the dataset contains a title and images of multiple variants of the product. We construct a parent Document for each picture, containing two chunks: an image document and a text document that holds the description of the product.


Our journey starts locally: we prepare the data and push it to the cloud, after which Finetuner can retrieve the dataset by its name. For this example, we already prepared the data, so we’ll pass the names of the training and evaluation datasets (clip-fashion-train-data and clip-fashion-eval-data) directly to Finetuner.

Backbone model

Currently, we only support openai/clip-vit-base-patch32 for text-to-image retrieval tasks. However, you can see all available models either in the choose backbone section or by calling finetuner.describe_models().


From now on, all the action happens in the cloud!

First, you need to log in to the Jina ecosystem:

import finetuner

finetuner.login()

Now that everything’s ready, let’s create a fine-tuning run!

import finetuner
from finetuner.callback import BestModelCheckpoint

run = finetuner.fit(
    model='openai/clip-vit-base-patch32',
    run_name='clip-fashion',
    train_data='clip-fashion-train-data',
    eval_data='clip-fashion-eval-data',
    epochs=5,
    learning_rate=1e-5,
    loss='CLIPLoss',
    callbacks=[BestModelCheckpoint()],
)

Let’s understand what this piece of code does:

finetuner.fit parameters

The only required arguments are model and train_data. We provide default values for others. Here is the full list of the parameters.

  • We start by providing the model, the run_name, and the names of the training and evaluation data.

  • We also set some hyper-parameters, such as the number of epochs and the learning_rate.

  • Additionally, we use BestModelCheckpoint to save the best model after each epoch.


We created a run! Now let’s see its status.

print(run.status())

{'status': 'CREATED', 'details': 'Run submitted and awaits execution'}

Since some runs might take several hours or even days, it’s important to know how to reconnect to Finetuner and retrieve your run later.

import finetuner

finetuner.login()
run = finetuner.get_run('clip-fashion')

You can continue monitoring the run by checking its status with run.status() or its logs with run.logs().


Currently, we don’t have a user-friendly way to track evaluation metrics during a run. What you can do for now is call logs() at the end of the run and inspect the output:

           INFO     Done ✨                                                                              __main__.py:219
           INFO     Saving fine-tuned models ...                                                         __main__.py:222
           INFO     Saving model 'model' in /usr/src/app/tuned-models/model ...                          __main__.py:233
           INFO     Pushing saved model to Hubble ...                                                    __main__.py:240
[10:38:14] INFO     Pushed model artifact ID: '62a1af491597c219f6a330fe'                                 __main__.py:246
           INFO     Finished 🚀                                                                          __main__.py:248

Evaluation of CLIP

In this example, we did not plug in an EvaluationCallback, since the callback can only evaluate one model at a time. For CLIP we want to evaluate two models: i.e. use the CLIPTextEncoder to encode textual Documents as query_data and the CLIPImageEncoder to encode image Documents as index_data, and then use the textual Documents to search the image Documents.

We have done this evaluation for you; the results are shown in the table below.

                     Before Finetuning    After Finetuning
average_precision    0.253423             0.415924
dcg_at_k             0.902417             2.14489
f1_score_at_k        0.0831918            0.241773
hit_at_k             0.611976             0.856287
ndcg_at_k            0.350172             0.539948
precision_at_k       0.0994012            0.256587
r_precision          0.231756             0.35847
recall_at_k          0.108982             0.346108
reciprocal_rank      0.288791             0.487505
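To make these numbers concrete, here is a small, self-contained sketch of how metrics like precision_at_k, recall_at_k, and reciprocal_rank are computed for a single query. The document IDs and relevance labels are made up for illustration; the real evaluation averages such per-query scores over the whole evaluation set.

```python
# Toy ranking for one text query: 4 retrieved images, best-first.
# `relevant` marks which retrieved images actually match the query product.
ranked = ['img_a', 'img_b', 'img_c', 'img_d']  # hypothetical IDs
relevant = {'img_b', 'img_d'}

k = 4
hits = [doc in relevant for doc in ranked[:k]]  # [False, True, False, True]

precision_at_k = sum(hits) / k            # fraction of top-k that is relevant
recall_at_k = sum(hits) / len(relevant)   # fraction of relevant docs found in top-k
reciprocal_rank = next(
    (1 / (i + 1) for i, hit in enumerate(hits) if hit), 0.0
)                                         # 1 / rank of the first relevant result

print(precision_at_k, recall_at_k, reciprocal_rank)  # -> 0.5 1.0 0.5
```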


After the run has finished successfully, you can download the tuned model to your local machine:

import finetuner

finetuner.login()
run = finetuner.get_run('clip-fashion')
run.save_artifact('clip-model')

That’s it! Now you have a fine-tuned model which is ready to be integrated with the Jina ecosystem.