Backbone Model#

Finetuner provides several widely used backbone models, including ResNet, EfficientNet, CLIP and BERT. For most of them, Finetuner offers multiple variants, e.g., the common resnet50 and the larger, more complex resnet152.

Finetuner converts these backbone models into embedding models by removing the head or applying pooling, then fine-tunes them to produce the final embedding model. An embedding model can be fine-tuned for text-to-text, image-to-image or text-to-image search tasks.
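
Producing such an embedding model usually comes down to a single fit call. The snippet below is a minimal sketch, assuming you are logged in to Finetuner's cloud backend and have already pushed a training dataset; the backbone name is taken from the tables below, while the dataset name, epochs and learning rate are placeholder values, not recommendations:

import finetuner

finetuner.login()  # fine-tuning runs are executed in the cloud, so a login is required

# Convert the ResNet50 backbone ('resnet-base') into an embedding model and
# fine-tune it on your own data. 'my-train-data' is a placeholder for a
# training dataset you have pushed beforehand.
run = finetuner.fit(
    model='resnet-base',
    train_data='my-train-data',
    epochs=5,
    learning_rate=1e-5,
)

print(run.name)    # identifier of the fine-tuning run
print(run.logs())  # training logs, available once the run finishes
run.save_artifact('resnet-embedding-model')  # download the fine-tuned embedding model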

You can call:

import finetuner

finetuner.describe_models(task='text-to-text')
finetuner.describe_models(task='image-to-image')
finetuner.describe_models(task='text-to-image')
finetuner.describe_models(task='mesh-to-mesh')

to get a list of supported models:

Finetuner backbones: text-to-text

name                    task          output_dim  architecture  description
----------------------  ------------  ----------  ------------  ---------------------------------------------------------------------
jina-embedding-t-en-v1  text-to-text         312  transformer   Text embedding model trained using Linnaeus-Clean dataset by Jina AI
jina-embedding-s-en-v1  text-to-text         512  transformer   Text embedding model trained using Linnaeus-Clean dataset by Jina AI
jina-embedding-b-en-v1  text-to-text         768  transformer   Text embedding model trained using Linnaeus-Clean dataset by Jina AI
jina-embedding-l-en-v1  text-to-text        1024  transformer   Text embedding model trained using Linnaeus-Clean dataset by Jina AI
bert-base-en            text-to-text         768  transformer   BERT model pre-trained on BookCorpus and English Wikipedia
bert-base-multi         text-to-text         768  transformer   BERT model pre-trained on multilingual Wikipedia
distiluse-base-multi    text-to-text         512  transformer   Knowledge distilled version of the multilingual Sentence Encoder
sbert-base-en           text-to-text         768  transformer   Pretrained BERT, fine-tuned on MS Marco

Finetuner backbones: image-to-image

name                task            output_dim  architecture  description
------------------  --------------  ----------  ------------  ---------------------------------------
efficientnet-base   image-to-image        1792  cnn           EfficientNet B4 pre-trained on ImageNet
efficientnet-large  image-to-image        2560  cnn           EfficientNet B7 pre-trained on ImageNet
resnet-large        image-to-image        2048  cnn           ResNet152 pre-trained on ImageNet
resnet-base         image-to-image        2048  cnn           ResNet50 pre-trained on ImageNet

Finetuner backbones: text-to-image

name             task           output_dim  architecture  description
---------------  -------------  ----------  ------------  ----------------------------------------------------------------
clip-base-en     text-to-image         512  transformer   CLIP base model
clip-large-en    text-to-image        1024  transformer   CLIP large model with patch size 14
clip-base-multi  text-to-image         512  transformer   Open MCLIP "xlm-roberta-base-ViT-B-32::laion5b_s13b_b90k" model

Finetuner backbones: mesh-to-mesh

name           task          output_dim  architecture  description
-------------  ------------  ----------  ------------  ---------------------------------------------------
pointnet-base  mesh-to-mesh         512  pointnet      PointNet++ embedding model for 3D mesh point clouds

  • ResNets are suitable for image-to-image search tasks with high performance requirements; resnet-large (ResNet152) is larger and requires more computational resources than resnet-base (ResNet50).

  • EfficientNets are suitable for image-to-image search tasks with low training and inference times. The models are more lightweight than ResNets. Here, efficientnet-large (EfficientNet B7) is the bigger and more complex model.

  • CLIP is the model of choice for text-to-image search, where the images do not need to have any text descriptors.

  • BERT is generally suitable for text-to-text search tasks (see the encoding sketch after this list).

  • msmarco-distilbert-base-v3 (listed above as sbert-base-en) is designed for matching web search queries to short text passages and is a suitable backbone for similar text-to-text search tasks.

  • pointnet-base is an embedding model that we derived from the popular PointNet++ model. The original model is designed for classifying 3D meshes; our derived model can be used to encode meshes into vectors for search.
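
To connect these recommendations to the API, the snippet below is a minimal sketch, assuming a recent Finetuner release in which build_model and encode are available, of how one of the listed backbones can be used to embed data without any fine-tuning:

import finetuner

# Build a pre-trained embedding model from one of the listed backbones;
# 'bert-base-en' is used here, any other name from the tables should work analogously.
model = finetuner.build_model('bert-base-en')

# Encode a few sentences into dense vectors for text-to-text search.
embeddings = finetuner.encode(
    model=model,
    data=['how do I fine-tune an embedding model?', 'what is a backbone model?'],
)
print(embeddings.shape)  # e.g. (2, 768) for a BERT-based backbone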

It should be noted that:

  • ResNet/EfficientNet models are loaded from the torchvision library (see the sketch after this list).

  • Transformer-based models are loaded from the Hugging Face transformers library.

  • msmarco-distilbert-base-v3 has already been fine-tuned once by sentence-transformers on the MS MARCO dataset on top of BERT.
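
Purely as an illustration of where such backbone weights come from, the snippet below loads comparable models directly from those libraries. The exact checkpoints Finetuner pins internally are not documented here, so the names below are assumptions and this is not part of the Finetuner API:

import torchvision.models as tv_models
from transformers import AutoModel

# CNN backbones such as resnet-base come from torchvision (assumed checkpoint):
resnet50 = tv_models.resnet50(weights=tv_models.ResNet50_Weights.IMAGENET1K_V2)

# Transformer backbones are loaded through Hugging Face transformers (assumed checkpoint):
bert = AutoModel.from_pretrained('bert-base-cased')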