finetuner.tuner.dataset.samplers module

class finetuner.tuner.dataset.samplers.ClassSampler(labels, batch_size, num_items_per_class=None)[source]

Bases: finetuner.tuner.dataset.base.BaseSampler

A batch sampler that fills the batch with an equal number of items from each class.

It will try to make sure that every item gets used once in a single epoch, and that exactly num_items_per_class items in a batch come from any one class. When a class does not have enough unused items left to fill its share of the batch, the remainder is randomly sampled from items of that class that were already used.

However, some cutoff might occur if there are not enough different classes with items left to fill the batch. In this case some items do not get used in that epoch.

Construct the batch sampler.

Parameters
  • labels (Sequence[int]) – A sequence of item labels; each label should be an integer denoting the class of the item

  • batch_size (int) – How many items to include in a batch

  • num_items_per_class (Optional[int]) – How many items per class (unique label) to include in a batch. For example, if batch_size is 20 and num_items_per_class is 4, the batch will consist of 4 items from each of 5 classes.
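
The batching behaviour described above can be illustrated with a minimal, standalone sketch. This is not the Finetuner implementation: the function class_balanced_batches and its seed argument are hypothetical, and the sketch assumes that num_items_per_class is given explicitly and divides batch_size evenly.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Optional, Sequence


def class_balanced_batches(
    labels: Sequence[int],
    batch_size: int,
    num_items_per_class: int,
    seed: Optional[int] = None,
) -> Iterator[List[int]]:
    """Yield batches of item indices (hypothetical helper, not the library API)."""
    if batch_size % num_items_per_class != 0:
        raise ValueError('batch_size must be a multiple of num_items_per_class')
    rng = random.Random(seed)
    classes_per_batch = batch_size // num_items_per_class

    # Group item indices by class label.
    all_items: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        all_items[label].append(idx)

    # Shuffled copy of the items of each class that were not used yet.
    unused = {c: rng.sample(idxs, len(idxs)) for c, idxs in all_items.items()}

    # Stop once too few classes have unused items left: this is the "cutoff".
    while len(unused) >= classes_per_batch:
        batch: List[int] = []
        for c in rng.sample(sorted(unused), classes_per_batch):
            taken = unused[c][:num_items_per_class]
            unused[c] = unused[c][num_items_per_class:]
            shortfall = num_items_per_class - len(taken)
            if shortfall > 0:
                # Top up from already used items of the same class.
                pool = [i for i in all_items[c] if i not in taken] or taken
                taken = taken + [rng.choice(pool) for _ in range(shortfall)]
            if not unused[c]:
                del unused[c]
            batch.extend(taken)
        yield batch


# 5 classes with 8 items each: every batch holds 4 items from each class.
labels = [c for c in range(5) for _ in range(8)]
for batch in class_balanced_batches(labels, batch_size=20, num_items_per_class=4):
    print(sorted(labels[i] for i in batch))
```

With these inputs every item is used exactly once per epoch. With uneven class sizes the top-up step reuses items, and the loop ends early once fewer than batch_size / num_items_per_class classes have unused items left.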

class finetuner.tuner.dataset.samplers.SessionSampler(labels, batch_size, shuffle=True)[source]

Bases: finetuner.tuner.dataset.base.BaseSampler

A batch sampler that fills the batch with items from as many sessions as possible.

When constructing each batch, the sampler will start by adding all items of one session, and continue adding sessions in this way until the total number of items in the batch equals batch_size. From the last session in the batch only the first few items are taken, so as not to exceed the desired batch size.

The last session in the batch, if it was not included completely and was not the only session in the batch, will appear again in the next batch as the first added session (with all items, including those that appeared in the previous batch).

It is assumed that the anchor document is the first document in its session, something that is always true if labels come from SessionDataset.

Construct the batch sampler.

Parameters
  • labels (Sequence[Tuple[int, int]]) – A sequence of item labels; each label should be a tuple of two integers: the first is the id of the session, and the second denotes the match type of the item (0 for root document, 1 for positive match and -1 for negative match)

  • batch_size (int) – How many items to include in a batch

  • shuffle (bool) – Shuffle the order of sessions. If False, the sessions are used in the order in which they appear in labels.
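
The session-filling rule described above can be sketched in a few lines. This is a standalone illustration rather than the Finetuner implementation: the function session_batches and its seed argument are hypothetical, and the sketch assumes that the final batch may be smaller than batch_size.

```python
import random
from typing import Dict, Iterator, List, Optional, Sequence, Tuple


def session_batches(
    labels: Sequence[Tuple[int, int]],
    batch_size: int,
    shuffle: bool = True,
    seed: Optional[int] = None,
) -> Iterator[List[int]]:
    """Yield batches of item indices (hypothetical helper, not the library API)."""
    # Group item indices by session id, keeping the order given by labels.
    sessions: Dict[int, List[int]] = {}
    for idx, (session_id, _match_type) in enumerate(labels):
        sessions.setdefault(session_id, []).append(idx)

    order = list(sessions)
    if shuffle:
        random.Random(seed).shuffle(order)

    pos = 0
    while pos < len(order):
        batch: List[int] = []
        sessions_in_batch = 0
        while pos < len(order) and len(batch) < batch_size:
            items = sessions[order[pos]]
            room = batch_size - len(batch)
            # Take the first items only, so the anchor (first item) is kept.
            batch.extend(items[:room])
            sessions_in_batch += 1
            if len(items) > room and sessions_in_batch > 1:
                # Truncated and not alone in the batch: do not advance, so
                # this session opens the next batch with all of its items.
                break
            pos += 1
        yield batch


labels = [(0, 0), (0, 1), (0, -1), (1, 0), (1, 1), (1, -1), (2, 0), (2, 1)]
print(list(session_batches(labels, batch_size=4, shuffle=False)))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7]]
```

In the output, item 3 (the root of session 1) closes the first batch and then reappears at the start of the second batch together with the rest of its session, which is the overlap described above.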