# Tuner¶

Tuner is one of the three key components of Finetuner. Given an embedding model and labeled data, Tuner trains the model to fit the data.

Labeled data can be constructed by following the Data Format guide.

## fit method¶

Tuner can be called via finetuner.fit(). Its minimum form is as follows:

```python
import finetuner

finetuner.fit(
    embed_model,
    train_data,
    **kwargs
)
```


Here, embed_model must be an embedding model, and train_data must be labeled data. Other parameters, such as epochs and optimizer, can be found in the Developer Reference.

### loss argument¶

Loss function of the Tuner can be specified via the loss argument of finetuner.fit().

By default, Tuner uses SiameseLoss (with cosine distance) for training. You can also use other built-in losses by specifying finetuner.fit(..., loss='...').
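For example, to train with the built-in triplet loss instead of the default, a minimal sketch (other keyword arguments omitted):

```python
import finetuner

finetuner.fit(
    embed_model,
    train_data,
    loss='TripletLoss',  # switch from the default SiameseLoss
)
```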

Let $$\mathbf{x}_i$$ denote the predicted embedding for Document $$i$$. The built-in losses are summarized as follows:

SiameseLoss

$\ell_{i,j} = \mathrm{sim}(i,j)d(\mathbf{x}_i, \mathbf{x}_j) + (1 - \mathrm{sim}(i,j))\max(0, m - d(\mathbf{x}_i, \mathbf{x}_j))$
, where $$\mathrm{sim}(i,j)$$ equals 1 if Document $$i$$ and $$j$$ are positively related and 0 otherwise, $$d(\mathbf{x}_i, \mathbf{x}_j)$$ is the distance between $$\mathbf{x}_i$$ and $$\mathbf{x}_j$$, and $$m$$ is the “margin”, the desired wedge between dissimilar items.
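To make the per-pair term concrete, here is a minimal sketch in PyTorch; the function name, the Euclidean distance, and the margin value are illustrative assumptions, not Tuner's internal implementation (which defaults to cosine distance):

```python
import torch

def siamese_pair_loss(x_i, x_j, sim, margin=1.0):
    # Distance between the two embeddings (Euclidean here for brevity).
    d = torch.norm(x_i - x_j)
    # Positive pairs (sim=1) are pulled together; negative pairs (sim=0)
    # are pushed apart until they are at least `margin` away.
    return sim * d + (1 - sim) * torch.clamp(margin - d, min=0)
```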

TripletLoss

$\ell_{i, p, n}=\max(0, d(\mathbf{x}_i, \mathbf{x}_p)-d(\mathbf{x}_i, \mathbf{x}_n)+m)$
, where Document $$p$$ and $$i$$ are positively related, whereas $$n$$ and $$i$$ are negatively related or unrelated, $$d(\cdot, \cdot)$$ represents a distance function, and $$m$$ is the desired wedge between the positive and negative pairs.
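A matching sketch for the per-triplet term, under the same illustrative assumptions (Euclidean distance, made-up function name):

```python
import torch

def triplet_loss(x_i, x_p, x_n, margin=1.0):
    # The anchor-positive distance should be smaller than the
    # anchor-negative distance by at least `margin`.
    d_pos = torch.norm(x_i - x_p)
    d_neg = torch.norm(x_i - x_n)
    return torch.clamp(d_pos - d_neg + margin, min=0)
```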

Tip

Although siamese and triplet loss work on pair and triplet inputs respectively, there is no need to worry about the data input format. You only need to make sure your data is labeled according to Data Format; you can then switch between all losses freely.

## save method¶

After a model is tuned, you can save it by calling finetuner.save(model, save_path).
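For example, a minimal sketch (the save path './tuned-model' is purely illustrative):

```python
import finetuner

finetuner.fit(embed_model, train_data)
finetuner.save(embed_model, './tuned-model')
```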

## Examples¶

### Tune a simple MLP on Fashion-MNIST¶

1. Write an embedding model. An embedding model can be written in Keras/PyTorch/Paddle. It can be either a new model or an existing model with pretrained weights. Below we construct a 784x128x32 MLP that transforms Fashion-MNIST images into 32-dim vectors.

```python
# PyTorch
import torch

embed_model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(in_features=28 * 28, out_features=128),
    torch.nn.ReLU(),
    torch.nn.Linear(in_features=128, out_features=32),
)
```

```python
# Keras
import tensorflow as tf

embed_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(32),
])
```

```python
# Paddle (assumed mirror of the PyTorch model above)
import paddle

embed_model = paddle.nn.Sequential(
    paddle.nn.Flatten(),
    paddle.nn.Linear(in_features=28 * 28, out_features=128),
    paddle.nn.ReLU(),
    paddle.nn.Linear(in_features=128, out_features=32),
)
```
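To verify that the model maps images to 32-dimensional vectors, a quick sanity check with the PyTorch version (the random dummy batch is purely for illustration):

```python
import torch

dummy_batch = torch.rand(4, 28, 28)    # four fake 28x28 images
print(embed_model(dummy_batch).shape)  # torch.Size([4, 32])
```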

2. Build labeled match data according to the steps in Data Format. You can refer to finetuner.toydata.generate_fashion for an implementation. In this example, we generate 10 positive matches and 10 negative matches for each Document.

3. Feed the labeled data and embedding model into Finetuner:

```python
import finetuner
from finetuner.toydata import generate_fashion

finetuner.fit(
    embed_model,
    train_data=generate_fashion(),
    eval_data=generate_fashion(is_testset=True)
)
```


### Tune a transformer model on Covid QA¶

1. Write an embedding model:

```python
import torch
from transformers import AutoModel

TRANSFORMER_MODEL = 'sentence-transformers/paraphrase-MiniLM-L6-v2'


class TransformerEmbedder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = AutoModel.from_pretrained(TRANSFORMER_MODEL)

    def forward(self, inputs):
        out_model = self.model(**inputs)
        # Use the embedding of the [CLS] token as the representation
        # of the whole input sequence.
        cls_token = out_model.last_hidden_state[:, 0, :]
        return cls_token
```
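A quick sanity check of the embedder; the tokenizer call and the example query below are only illustrative, and the output size assumes this MiniLM checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL)
model = TransformerEmbedder()

tokens = tokenizer(['how is covid transmitted?'], return_tensors='pt')
print(model(tokens).shape)  # e.g. torch.Size([1, 384]) for this checkpoint
```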

2. Build labeled match data according to the steps in Data Format. You can refer to finetuner.toydata.generate_qa for an implementation.

3. Feed the labeled data and the embedding model into Finetuner:

from typing import List

import finetuner
from finetuner.toydata import generate_qa
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL)

def collate_fn(inputs: List[str]):
    batch_tokens = tokenizer(
        inputs,
        truncation=True,
        max_length=50,