finetuner.tuner.dataset.datasets module

class finetuner.tuner.dataset.datasets.ClassDataset(docs, preprocess_fn=None)[source]

Bases: finetuner.tuner.dataset.base.BaseDataset[int]

Dataset for encapsulating data where each item has a class label.

Create the dataset instance.

Parameters
  • docs (DocumentSequence) –

    The documents for the dataset. Each document is expected to have:

    - content (only blob or text are accepted currently)
    - a class label, saved under tags['finetuner__label']. This class label should be an integer or a string.

  • preprocess_fn (Optional[ForwardRef]) – A pre-processing function applied to documents on the fly. It should take a document from the dataset as input and return whatever content the framework-specific dataloader (and model) accepts.
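
The input shape described above can be sketched with plain dictionaries standing in for documents. This is only an illustration: the real `docs` argument is a DocumentSequence, which is not constructed here, and the names `make_doc` and `lowercase_text` are hypothetical helpers invented for this example.

```python
from typing import Any, Dict, Union

LABEL_KEY = "finetuner__label"  # tag key named in the parameter description


def make_doc(text: str, label: Union[int, str]) -> Dict[str, Any]:
    """Hypothetical stand-in for a document with text content and a class label."""
    return {"text": text, "tags": {LABEL_KEY: label}}


def lowercase_text(doc: Dict[str, Any]) -> str:
    """Hypothetical preprocess_fn: takes one document, returns model-ready content."""
    return doc["text"].strip().lower()


# Two stand-in documents, one per class; labels may be strings or integers.
docs = [make_doc("  Red Sneakers ", "shoes"), make_doc("Wool Scarf", "accessories")]

# Pre-processing is applied per document, as it would be on the fly.
processed = [lowercase_text(d) for d in docs]
```

The pre-processing function receives one document at a time, so it can be kept free of any batching logic.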

property labels: List[int]

Get the list of integer labels for all items in the dataset.

Return type

List[int]
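
Because class labels may be given as integers or strings while `labels` returns integers, some mapping from raw labels to integers has to take place. The sketch below shows one plausible scheme, numbering labels in order of first appearance. It is an assumption made for illustration, not a statement of how the library assigns integers, and `to_integer_labels` is a hypothetical helper.

```python
from typing import Dict, Hashable, List


def to_integer_labels(raw_labels: List[Hashable]) -> List[int]:
    """Assign each distinct raw label an integer in order of first appearance.

    Assumed scheme for illustration only; the library may number labels differently.
    """
    mapping: Dict[Hashable, int] = {}
    result: List[int] = []
    for raw in raw_labels:
        if raw not in mapping:
            mapping[raw] = len(mapping)  # next unused integer
        result.append(mapping[raw])
    return result


integer_labels = to_integer_labels(["cat", "dog", "cat", 7])
# integer_labels == [0, 1, 0, 2]
```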

class finetuner.tuner.dataset.datasets.SessionDataset(docs, preprocess_fn=None)[source]

Bases: finetuner.tuner.dataset.base.BaseDataset[Tuple[int, int]]

Dataset for encapsulating data that comes in batches of “sessions”.

A session here means an anchor document together with a set of matches, each of which may be either a positive or a negative input.

Create the dataset instance.

Parameters
  • docs (DocumentSequence) –

    The documents for the dataset. Each document is expected to have:

    - content (only blob or text are accepted currently)
    - matches, which should also have content, as well as a label, stored under tags['finetuner__label'], which must be either 1 or -1, denoting whether the match is a positive or negative input in relation to the anchor document.

  • preprocess_fn (Optional[ForwardRef]) – A pre-processing function applied to documents on the fly. It should take a document from the dataset as input and return whatever content the framework-specific dataloader (and model) accepts.
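
A session as described above can be sketched with plain dictionaries standing in for the anchor document and its matches. The real input is a DocumentSequence, and `make_session` and `check_session` are hypothetical helpers used only to make the expected structure concrete.

```python
from typing import Any, Dict, List

LABEL_KEY = "finetuner__label"  # tag key named in the parameter description


def make_session(anchor: str, positives: List[str], negatives: List[str]) -> Dict[str, Any]:
    """Hypothetical stand-in for an anchor document with labeled matches."""
    matches = [{"text": t, "tags": {LABEL_KEY: 1}} for t in positives]
    matches += [{"text": t, "tags": {LABEL_KEY: -1}} for t in negatives]
    return {"text": anchor, "matches": matches}


def check_session(session: Dict[str, Any]) -> None:
    """Raise ValueError if any match lacks a label of 1 or -1."""
    for match in session["matches"]:
        label = match.get("tags", {}).get(LABEL_KEY)
        if label not in (1, -1):
            raise ValueError(f"Match label must be 1 or -1, got {label!r}")


session = make_session("running shoes", ["trail sneakers"], ["wool scarf", "mug"])
check_session(session)  # passes: every match carries 1 or -1
```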

property labels: List[Tuple[int, int]]

Get the list of labels for all items in the dataset.

A label consists of two integers: the session ID (the index of the root document in the original document array) and the type, which is 0 if the document is the anchor (root) document, 1 if it is a positive input (match), and -1 if it is a negative input (match).

Return type

List[Tuple[int, int]]
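
The label layout described for `labels` can be illustrated by flattening a few stand-in sessions into (session ID, type) pairs. This is a minimal sketch using plain dictionaries rather than real documents. The helper `session_labels` is hypothetical, and emitting the anchor before its matches is an assumption about ordering.

```python
from typing import Any, Dict, List, Tuple

LABEL_KEY = "finetuner__label"  # tag key named in the parameter description


def session_labels(sessions: List[Dict[str, Any]]) -> List[Tuple[int, int]]:
    """Flatten sessions into (session ID, type) pairs.

    Type is 0 for the anchor, 1 for a positive match, -1 for a negative match.
    """
    labels: List[Tuple[int, int]] = []
    for session_id, session in enumerate(sessions):
        labels.append((session_id, 0))  # the anchor (root) document
        for match in session["matches"]:
            labels.append((session_id, int(match["tags"][LABEL_KEY])))
    return labels


sessions = [
    {"text": "anchor A", "matches": [
        {"text": "pos", "tags": {LABEL_KEY: 1}},
        {"text": "neg", "tags": {LABEL_KEY: -1}},
    ]},
    {"text": "anchor B", "matches": [
        {"text": "pos", "tags": {LABEL_KEY: 1}},
    ]},
]

flat = session_labels(sessions)
# flat == [(0, 0), (0, 1), (0, -1), (1, 0), (1, 1)]
```

Every item sharing the same first integer belongs to the same session, which lets anchors, positives, and negatives be regrouped after batching.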