finetuner.data module#

class finetuner.data.CSVOptions(size=None, sampling_rate=None, dialect='auto', encoding='utf-8', is_labeled=False, convert_to_blob=True, create_point_clouds=True, point_cloud_size=2048)[source]#

Bases: object

Class containing options for reading CSV files.

Parameters:
  • size (Optional[int]) – The number of rows that will be sampled.

  • sampling_rate (Optional[float]) – The sampling rate, in the range [0, 1], indicating the fraction of CSV lines that are kept. A sampling rate of 1 means that none are skipped, 0.5 means that half are skipped, and 0 means that all lines are skipped.

  • dialect (Union[str, Dialect]) – A description of the expected format of the CSV. Can be either an object of the csv.Dialect class or one of the strings returned by the csv.list_dialects() function.

  • encoding (str) – The encoding of the CSV file.

  • is_labeled (bool) – Whether the second column of the CSV represents a label that should be assigned to the item in the first column (True), or if it is another item that should be semantically close to the first (False).

  • convert_to_blob (bool) – Whether URIs to local files should be converted to blobs.

  • create_point_clouds (bool) – Whether point clouds should be sampled from URIs to local 3D mesh files.

  • point_cloud_size (int) – Determines the number of points sampled from a mesh to create a point cloud.
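
The interaction between size and sampling_rate can be illustrated with a small standard-library sketch. It is not finetuner's implementation: the helper name sample_rows, the seed argument, and the choice to apply the size cap after sampling are all assumptions made for illustration.

```python
import csv
import io
import random
from typing import Iterable, List, Optional


def sample_rows(
    lines: Iterable[str],
    size: Optional[int] = None,
    sampling_rate: Optional[float] = None,
    seed: Optional[int] = None,
) -> List[List[str]]:
    """Hypothetical helper: keep each row with probability `sampling_rate`
    and stop once `size` rows have been kept."""
    rng = random.Random(seed)
    kept: List[List[str]] = []
    for row in csv.reader(lines):
        if size is not None and len(kept) >= size:
            break
        # rng.random() lies in [0, 1), so a rate of 1 keeps every row
        # and a rate of 0 skips every row.
        if sampling_rate is not None and rng.random() >= sampling_rate:
            continue
        kept.append(row)
    return kept


# Keep everything, but stop after the first two rows.
rows = sample_rows(io.StringIO("a,b\nc,d\ne,f\n"), size=2, sampling_rate=1.0)
print(rows)
```

With a sampling rate strictly between 0 and 1 the kept rows depend on the random draws, so the sketch accepts a seed to make runs repeatable.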

size: int | None = None#
sampling_rate: float | None = None#
dialect: str | Dialect = 'auto'#
encoding: str = 'utf-8'#
is_labeled: bool = False#
convert_to_blob: bool = True#
create_point_clouds: bool = True#
point_cloud_size: int = 2048#
class finetuner.data.LabeledCSVParser(file, task, options=None)[source]#

Bases: _CSVParser

Parses a CSV with two columns, where the first column is the data and the second column is the label. To use this parser, make sure the CSV contains two columns and that is_labeled=True.
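
The expected file layout can be sketched with the standard library alone. The function read_labeled_rows below is a hypothetical stand-in that only shows the two-column (data, label) shape; it does not build Documents and is not the parser's actual code.

```python
import csv
import io
from typing import List, Tuple


def read_labeled_rows(text: str) -> List[Tuple[str, str]]:
    """Hypothetical helper: read (data, label) pairs from two-column CSV text."""
    pairs: List[Tuple[str, str]] = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # csv.reader yields [] for blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row!r}")
        item, label = row
        pairs.append((item.strip(), label.strip()))
    return pairs


labeled = read_labeled_rows("red apple,fruit\ngreen kale,vegetable\n")
print(labeled)
```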

parse()[source]#
class finetuner.data.QueryDocumentRelationsParser(file, task, options=None)[source]#

Bases: _CSVParser

Parses a CSV for the case where the user does not have explicitly annotated labels, but rather a set of pairs: either query-document pairs, where each pair expresses that a document is relevant to a query, or text-image pairs.

parse()[source]#
class finetuner.data.PairwiseScoreParser(file, task, options=None)[source]#

Bases: _CSVParser

Parses a CSV with three columns: column1, column2, and a float value indicating the similarity between column1 and column2.
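
A minimal sketch of this three-column layout, using only the standard library, might look as follows. The name read_scored_pairs is hypothetical, and the real parser produces Documents rather than tuples.

```python
import csv
import io
from typing import List, Tuple


def read_scored_pairs(text: str) -> List[Tuple[str, str, float]]:
    """Hypothetical helper: read (column1, column2, score) triples."""
    triples: List[Tuple[str, str, float]] = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row!r}")
        left, right, score = row
        # float() raises ValueError if the third column is not numeric.
        triples.append((left.strip(), right.strip(), float(score)))
    return triples


scored = read_scored_pairs("a cat,a kitten,0.9\na cat,a truck,0.1\n")
print(scored)
```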

parse()[source]#
class finetuner.data.DataSynthesisParser(file, task, options=None)[source]#

Bases: _CSVParser

Parses a CSV with either one column or one row. Each item in the CSV represents a single document, so the structure of the CSV file is not important.
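
Because the structure is not important, a single-row file and a single-column file yield the same items. The sketch below shows that equivalence with a hypothetical flatten_items helper. Dropping empty cells is an assumption of this example, not documented behaviour.

```python
import csv
import io
from typing import List


def flatten_items(text: str) -> List[str]:
    """Hypothetical helper: treat every non-empty cell as one item."""
    return [
        cell.strip()
        for row in csv.reader(io.StringIO(text))
        for cell in row
        if cell.strip()
    ]


one_row = flatten_items("first query,second query,third query\n")
one_column = flatten_items("first query\nsecond query\nthird query\n")
print(one_row == one_column)
```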

parse()[source]#
class finetuner.data.CSVContext(model=None, options=None)[source]#

Bases: object

A CSV context switch class with conditions to parse CSVs into DocumentArray.

Parameters:
  • model (Optional[str]) – The model being used, to get model stub and associated task.

  • options (Optional[CSVOptions]) – An instance of CSVOptions.

build_dataset(data)[source]#
finetuner.data.get_csv_file_context(file, encoding)[source]#

Get the CSV file context, such as file_ctx, the CSV dialect, and the number of columns.

finetuner.data.get_csv_file_dialect_columns(file, encoding)[source]#

Get the CSV dialect and the number of columns of the CSV.
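
One standard-library way to obtain a dialect and a column count is csv.Sniffer. The sketch below shows that technique on in-memory text. Whether finetuner detects the dialect this way is not stated here, and the function name, the sample_size parameter, and the candidate delimiter list are assumptions.

```python
import csv
import io
from typing import Tuple, Type


def sniff_dialect_and_columns(
    text: str, sample_size: int = 1024
) -> Tuple[Type[csv.Dialect], int]:
    """Hypothetical helper: guess the dialect from a sample of the text,
    then count the columns in the first row."""
    sample = text[:sample_size]
    # Restrict the candidates so ordinary letters are never chosen.
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    first_row = next(csv.reader(io.StringIO(text), dialect))
    return dialect, len(first_row)


dialect, n_columns = sniff_dialect_and_columns("a;b;c\n1;2;3\n4;5;6\n")
print(repr(dialect.delimiter), n_columns)
```

csv.Sniffer raises csv.Error when it cannot determine a delimiter, so callers of a helper like this would need to handle that case or fall back to a default dialect.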

finetuner.data.build_encoding_dataset(model, data)[source]#

If data has been provided as a list, a DocumentArray is created from the elements of the list.

Return type:

DocumentArray

finetuner.data.check_columns(task, col1, col2)[source]#

Determines the expected modalities of each column using the task argument, then checks the given row of the CSV to confirm that it contains valid data.

Parameters:
  • task (str) – The task of the model being used.

  • col1 (str) – A single value from the first column of the CSV.

  • col2 (str) – A single value from the second column of the CSV.

Return type:

Tuple[str, str]

Returns:

The expected modality of each column.

finetuner.data.create_document(modality, column, convert_to_blob, create_point_clouds, point_cloud_size=2048)[source]#

Checks the expected modality of the value in the given column and creates a Document with that value.

Parameters:
  • modality (str) – The expected modality of the value in the given column.

  • column (str) – A single value of a column.

  • convert_to_blob (bool) – Whether uris to local image files should be converted to blobs.

  • create_point_clouds (bool) – Whether point clouds should be sampled from URIs to local 3D mesh files.

  • point_cloud_size (int) – Determines the number of points sampled from a mesh to create a point cloud.

Return type:

Document