Text-to-Image Search via CLIP

This guide shows how to fine-tune a CLIP model for text-to-image retrieval.


We’ll be fine-tuning CLIP on the fashion captioning dataset which contains information about fashion products.

For each product the dataset contains a title and images of multiple variants of the product. We constructed a parent Document for each picture, which contains two chunks: an image document and a text document holding the description of the product.
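The parent/chunk layout described above can be pictured with a small sketch. It deliberately avoids the real docarray classes: `Doc` and `build_pair` are hypothetical stand-ins introduced only for illustration, and the file names and description are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a docarray Document (not the real class)."""
    text: Optional[str] = None
    uri: Optional[str] = None
    chunks: List["Doc"] = field(default_factory=list)


def build_pair(image_uri: str, description: str) -> Doc:
    # One parent per picture, holding an image chunk and a text chunk.
    return Doc(chunks=[Doc(uri=image_uri), Doc(text=description)])


# Two variants of one product: separate images, same description.
variants = ["shirt-red.png", "shirt-blue.png"]  # made-up file names
pairs = [build_pair(uri, "cotton shirt with short sleeves") for uri in variants]
```

Each element of `pairs` is one training example: the image chunk and the text chunk of the same parent form a matching image-text pair.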


Our journey starts locally. We have to prepare the data and push it to the cloud, after which Finetuner can retrieve the dataset by its name. For this example, the data is already prepared, so we pass the names of the training and evaluation data (clip-fashion-train-data and clip-fashion-eval-data) directly to Finetuner.

Backbone model

Currently, we only support openai/clip-vit-base-patch32 for text-to-image retrieval tasks. You can see all available models either in the choose backbone section or by calling describe_models().


From now on, all the action happens in the cloud!

First you need to log in to the Jina ecosystem:

import finetuner
finetuner.login()

Now that everything’s ready, let’s create a fine-tuning run!

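import finetuner

run = finetuner.fit(
    model='openai/clip-vit-base-patch32',
    run_name='clip-fashion',
    train_data='clip-fashion-train-data',
    eval_data='clip-fashion-eval-data',
    learning_rate=1e-5,
)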

Let’s understand what this piece of code does:

finetuner.fit parameters

The only required arguments are model and train_data. We provide default values for others. Here is the full list of the parameters.

  • We start by providing the model, the run_name, and the names of the training and evaluation data.

  • We also provide some hyper-parameters, such as the number of epochs and the learning_rate.

  • Additionally, we use BestModelCheckpoint to save the best model after each epoch and EvaluationCallback for evaluation.


We created a run! Now let’s see its status by calling run.status():

{'status': 'CREATED', 'details': 'Run submitted and awaits execution'}

Since some runs can take several hours or even days, it’s important to know how to reconnect to Finetuner and retrieve your run.

import finetuner
run = finetuner.get_run('clip-fashion')

You can continue monitoring the run by checking its status with status() or its logs with logs().
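For long runs, a small polling helper can save repeated manual checks. The sketch below is not part of Finetuner: `wait_for_run` is a hypothetical helper that takes any callable returning a status dictionary shaped like the one shown earlier. Only the 'CREATED' state appears in this guide; treating 'STARTED' as the other pending state is an assumption.

```python
import time
from typing import Callable, Dict


def wait_for_run(get_status: Callable[[], Dict[str, str]],
                 poll_seconds: float = 30.0,
                 max_polls: int = 1000) -> str:
    """Poll until the run leaves the pending states and return the final state.

    'STARTED' is an assumed state name; only 'CREATED' is shown in this guide.
    """
    pending = {"CREATED", "STARTED"}
    for _ in range(max_polls):
        status = get_status()["status"]
        if status not in pending:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("run did not finish within the polling budget")
```

With a run object in hand, the call would look like `final_state = wait_for_run(run.status)`, after which `run.logs()` can be inspected.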


Currently, we don’t have a user-friendly way to get evaluation metrics from the EvaluationCallback we initialized previously. For now, you can call logs() at the end of the run to see the evaluation results:

           INFO     Done ✨                                                                              __main__.py:219
           INFO     Saving fine-tuned models ...                                                         __main__.py:222
           INFO     Saving model 'model' in /usr/src/app/tuned-models/model ...                          __main__.py:233
           INFO     Pushing saved model to Hubble ...                                                    __main__.py:240
[10:38:14] INFO     Pushed model artifact ID: '62a1af491597c219f6a330fe'                                 __main__.py:246
           INFO     Finished 🚀                                                                          __main__.py:248

Evaluation of CLIP

In this example, we did not plug in an EvaluationCallback, since the callback can evaluate only one model at a time. In most cases, we want to evaluate two models together: use CLIPTextEncoder to encode textual Documents as query_data and CLIPImageEncoder to encode image Documents as index_data, then use the textual Documents to search the image Documents.
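The search step of this evaluation can be sketched in a few lines. This is an illustration only: cosine similarity is an assumed scoring function, `rank_images` is a hypothetical helper, and toy 3-dimensional vectors stand in for the real encoder outputs.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine similarity between two non-zero vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank_images(text_emb: Sequence[float],
                image_embs: Sequence[Sequence[float]]) -> List[int]:
    """Return image indices sorted from most to least similar to the text query."""
    scores = [cosine(text_emb, img) for img in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)


# Toy vectors: the text query points mostly along the first axis.
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
text_emb = [0.9, 0.1, 0.0]
ranking = rank_images(text_emb, image_embs)  # [0, 2, 1]
```

In a real evaluation, the text embedding would come from the text encoder and the image embeddings from the image encoder, and the ranking would be compared against the known matching images.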

We have done the evaluation for you; the results are in the table below.

Metric              Before Finetuning   After Finetuning
average_precision   0.253423            0.415924
dcg_at_k            0.902417            2.14489
f1_score_at_k       0.0831918           0.241773
hit_at_k            0.611976            0.856287
ndcg_at_k           0.350172            0.539948
precision_at_k      0.0994012           0.256587
r_precision         0.231756            0.35847
recall_at_k         0.108982            0.346108
reciprocal_rank     0.288791            0.487505
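To make a few of these metric names concrete, here is a minimal sketch of their textbook per-query definitions. It is not the evaluator that produced the table, whose value of k and averaging may differ; `ranking_metrics` and the document ids are hypothetical.

```python
from typing import Dict, Sequence, Set


def ranking_metrics(ranked_ids: Sequence[str],
                    relevant: Set[str],
                    k: int) -> Dict[str, float]:
    """Compute a few ranking metrics for a single query."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Reciprocal of the 1-based rank of the first relevant result; 0.0 if none.
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            reciprocal_rank = 1.0 / rank
            break
    return {
        "hit_at_k": 1.0 if hits > 0 else 0.0,
        "precision_at_k": hits / k,
        "recall_at_k": hits / len(relevant) if relevant else 0.0,
        "reciprocal_rank": reciprocal_rank,
    }


# One relevant image ("img2") appears in the top 3, at rank 2.
m = ranking_metrics(["img7", "img2", "img9", "img4"], {"img2", "img4", "img5"}, k=3)
```

For this query, hit_at_k is 1.0, precision_at_k and recall_at_k are both 1/3, and reciprocal_rank is 0.5. Table-style numbers would then be obtained by averaging such per-query values over all queries.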


After the run has finished successfully, you can download the tuned model to your local machine:

artifact = run.save_artifact('clip-model')


Now that you have saved the artifact to your host machine, let’s use the fine-tuned model to encode new Documents:

import finetuner
from docarray import Document, DocumentArray

# Prepare some documents to encode
text_da = DocumentArray([Document(text='some text to encode')])
image_da = DocumentArray([Document(uri='my-image.png')])
# Load both encoders from the same artifact
clip_text_encoder = finetuner.get_model(artifact=artifact, select_model='clip-text')
clip_image_encoder = finetuner.get_model(artifact=artifact, select_model='clip-vision')
# Encoding will happen in-place in your `DocumentArray`
finetuner.encode(model=clip_text_encoder, data=text_da)
finetuner.encode(model=clip_image_encoder, data=image_da)
# Inspect the shape of the resulting embeddings
print(text_da.embeddings.shape)
print(image_da.embeddings.shape)

(1, 512)
(1, 512)

What is select_model?

When fine-tuning CLIP, we are fine-tuning the CLIPVisionEncoder and CLIPTextEncoder in parallel. The artifact contains two models: clip-vision and clip-text. The parameter select_model tells Finetuner which model to use for inference. In the example above, we use clip-text to encode a Document with text content and clip-vision to encode a Document with image content.

Inference with ONNX

If you set to_onnx=True when calling finetuner.fit, load the model with model = finetuner.get_model('/path/to/YOUR-MODEL.zip', is_onnx=True).

Check out clip-as-service to learn how to plug a fine-tuned CLIP model into our CLIP-specific service.

That’s it! If you want to integrate the fine-tuned model into your Jina Flow, check out "Integrated with the Jina ecosystem".