Session Dataset#
A SessionDataset is a DocumentArray in which each Document has matches. The labels are stored under each matched Document’s .tags['finetuner_label']. The label can be either 1 (denoting similarity of the match to its parent Document) or -1 (denoting dissimilarity of the match from its parent Document).
The word “session” comes from the scenario where a user’s click-through gives an implicit yes/no signal on relevance, hence an interactive session.
Batch construction#
A SessionDataset works with SessionSampler.
Here the batches are constructed by putting together enough sessions (a session is a root document together with its matches) to fill the batch according to the batch_size parameter. An example of a batch of size batch_size=8 made of two sessions is shown in the image below. 0 denotes the root document.
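The batch construction described here can be illustrated with a minimal, library-free sketch. It is not the SessionSampler implementation: the function name pack_sessions and the greedy packing strategy are assumptions made for illustration, and the real sampler may also shuffle sessions or subsample matches. Each item is a (session_id, label) pair, where label 0 marks the root document.

```python
from typing import List, Tuple

# Each item is (session_id, label); label 0 marks the root document,
# 1 a positive match and -1 a negative match.
Item = Tuple[int, int]


def pack_sessions(sessions: List[List[Item]], batch_size: int) -> List[List[Item]]:
    """Greedily pack sessions into batches of at most batch_size items.

    Hypothetical helper, not part of Finetuner.
    """
    batches: List[List[Item]] = []
    current: List[Item] = []
    for session in sessions:
        # Assumption: a session larger than the batch is cut down,
        # keeping the root (first item) and the leading matches.
        session = session[:batch_size]
        # Start a new batch when this session no longer fits.
        if current and len(current) + len(session) > batch_size:
            batches.append(current)
            current = []
        current.extend(session)
    if current:
        batches.append(current)
    return batches


# Two sessions of four items each: one root plus three matches.
sessions = [
    [(0, 0), (0, 1), (0, -1), (0, -1)],
    [(1, 0), (1, 1), (1, 1), (1, -1)],
]
batches = pack_sessions(sessions, batch_size=8)
print(len(batches), len(batches[0]))  # 1 8
```

With batch_size=8 both sessions land in a single batch; with batch_size=4 the same input yields two batches of one session each.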

Examples#
Toy example#
Here is an example of a session dataset:

from docarray import Document, DocumentArray
from finetuner.tuner.dataset import SessionDataset, SessionSampler

ds = DocumentArray(
    [Document(content='shirt'), Document(content='shoe')]
)
ds[0].matches = [
    Document(content='red shirt', tags={'finetuner_label': 1}),
    Document(content='red shoe', tags={'finetuner_label': -1}),
]
ds[1].matches = [
    Document(content='black shoe', tags={'finetuner_label': 1}),
    Document(content='black pants', tags={'finetuner_label': -1}),
]

sds = SessionDataset(ds)
for b in SessionSampler(sds.labels, batch_size=2):
    print([sds[bb] for bb in b])

[('shirt', (0, 0)), ('red shoe', (0, -1))]
[('shoe', (1, 0)), ('black pants', (1, -1))]
We got 2 batches here.
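Judging from the printed output, each item appears to be a pair of the Document content and a (session index, label) tuple, with label 0 reserved for the root document. The following library-free sketch shows one way such a flat view could be derived from nested sessions. The function flatten_sessions and the dict layout are hypothetical stand-ins, not the SessionDataset internals.

```python
from typing import Dict, List, Tuple

FlatItem = Tuple[str, Tuple[int, int]]


def flatten_sessions(sessions: List[Dict]) -> List[FlatItem]:
    """Flatten root documents and their matches into (content, (session, label)).

    Hypothetical helper; plain dicts stand in for Document objects.
    """
    flat: List[FlatItem] = []
    for session_id, root in enumerate(sessions):
        # The root document gets the reserved label 0.
        flat.append((root["content"], (session_id, 0)))
        for match in root["matches"]:
            flat.append((match["content"], (session_id, match["label"])))
    return flat


# Same contents and labels as the toy example.
toy = [
    {"content": "shirt", "matches": [
        {"content": "red shirt", "label": 1},
        {"content": "red shoe", "label": -1},
    ]},
    {"content": "shoe", "matches": [
        {"content": "black shoe", "label": 1},
        {"content": "black pants", "label": -1},
    ]},
]
flat = flatten_sessions(toy)
print(flat[0])  # ('shirt', (0, 0))
print(flat[2])  # ('red shoe', (0, -1))
```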
Covid QA data#
Covid QA data is a CSV that has 481 rows with the columns question, answer & wrong_answer.

To convert this dataset into a SessionDataset, we build each Document to contain the following relevant information:
.text: the question column
.matches: the generated positive & negative matches Document
    .text: the answer/wrong_answer column
    .tags['finetuner_label']: the match label: 1 or -1.
Matches are built with the logic below:
only 1 positive match is allowed per Document, and it is taken from the answer column;
the wrong_answer column is always included as a negative match, and other Documents’ answers are then sampled as additional negative matches.
One can use generate_qa() to generate it.
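The matching logic above can be sketched in plain Python. This is only an illustration under stated assumptions, not the generate_qa() implementation: the function build_qa_sessions, its num_neg and seed parameters, and the dict layout are all hypothetical, and the rows are placeholder values rather than real Covid QA data.

```python
import random
from typing import Dict, List, Optional


def build_qa_sessions(
    rows: List[Dict[str, str]], num_neg: int = 2, seed: Optional[int] = None
) -> List[Dict]:
    """Build one session per row (hypothetical helper).

    num_neg is the total number of negatives per question, counting wrong_answer.
    """
    rng = random.Random(seed)
    sessions = []
    for i, row in enumerate(rows):
        matches = [
            # Exactly one positive match, taken from the answer column.
            {"text": row["answer"], "label": 1},
            # The wrong_answer column is always a negative match.
            {"text": row["wrong_answer"], "label": -1},
        ]
        # Remaining negatives are sampled from other rows' answers.
        others = [r["answer"] for j, r in enumerate(rows) if j != i]
        n_extra = min(max(num_neg - 1, 0), len(others))
        for text in rng.sample(others, n_extra):
            matches.append({"text": text, "label": -1})
        sessions.append({"text": row["question"], "matches": matches})
    return sessions


# Placeholder rows with the same three columns as the CSV.
rows = [
    {"question": "q1", "answer": "a1", "wrong_answer": "w1"},
    {"question": "q2", "answer": "a2", "wrong_answer": "w2"},
    {"question": "q3", "answer": "a3", "wrong_answer": "w3"},
]
qa_sessions = build_qa_sessions(rows, num_neg=2, seed=0)
for s in qa_sessions:
    print(s["text"], [(m["text"], m["label"]) for m in s["matches"]])
```

In a real pipeline each dict would become a Document whose matches carry the label in .tags['finetuner_label'].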