SentenceTransformerModel.train

opensearch_py_ml.ml_models.SentenceTransformerModel.train(self, read_path: str, overwrite: bool = False, output_model_name: str | None = None, zip_file_name: str | None = None, compute_environment: str | None = None, num_machines: int = 1, num_gpu: int = 0, learning_rate: float = 2e-05, num_epochs: int = 10, batch_size: int = 32, verbose: bool = False, percentile: float = 95) None

Read the synthetic queries and use them to fine-tune/train (and save) a sentence transformer model.

Parameters

param read_path:

Required. Path to the zipped file that contains the generated queries; if None, an exception is raised. The zipped file should contain a pickled file holding a list of dictionaries with keys named 'query', 'probability', and 'passages'. For example: [{'query': q1, 'probability': p1, 'passages': pa1}, ...]. 'probability' is not required for training.

type read_path:

string
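The expected input format can be produced with a short sketch like the following (the file names and example data here are hypothetical, not values the library requires):

```python
import pickle
import zipfile

# Hypothetical synthetic-query data in the expected format:
# a list of dictionaries with 'query', 'probability', and 'passages' keys.
# 'probability' may be omitted when the data is only used for training.
synthetic_queries = [
    {"query": "what is opensearch",
     "probability": 0.91,
     "passages": "OpenSearch is a community-driven search engine..."},
    {"query": "how to index documents",
     "probability": 0.87,
     "passages": "Documents are added to an index via the bulk API..."},
]

# Pickle the list, then zip the pickle file for use as read_path.
pickle_path = "synthetic_queries.pkl"
with open(pickle_path, "wb") as f:
    pickle.dump(synthetic_queries, f)

zip_path = "synthetic_queries.zip"
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.write(pickle_path)
```

The resulting `synthetic_queries.zip` is what would be passed as `read_path`.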

param overwrite:

Optional. The synthetic_queries/ folder in the current directory is used to store the unzipped query files. Defaults to False; if the folder is not empty, an exception is raised recommending that users either clean up the folder or set overwrite to True.

type overwrite:

bool

param output_model_name:

Optional. The name of the trained custom model. If None, defaults to model_id + '.pt'.

type output_model_name:

string

param zip_file_name:

Optional. File name for the zip file. If None, defaults to model_id + '.zip'.

type zip_file_name:

string

param compute_environment:

Optional. Compute environment type used to run the model. If None, defaults to LOCAL_MACHINE.

type compute_environment:

string

param num_machines:

Optional. Number of machines used to run the model. If None, defaults to 1.

type num_machines:

int

param num_gpu:

Optional. Number of GPUs used to run the model. If None, defaults to 0. If the number of GPUs is greater than 1, HuggingFace Accelerate is used to launch distributed training.

type num_gpu:

int

param learning_rate:

Optional. Learning rate used to train the model; default is 2e-5.

type learning_rate:

float

param num_epochs:

Optional. Number of epochs to train the model; default is 10.

type num_epochs:

int

param batch_size:

Optional. Batch size for training; default is 32.

type batch_size:

int

param verbose:

Optional. If True, plot the training progress. Defaults to False.

type verbose:

bool

param percentile:

The maximum input length is set to the length that covers {percentile}% of the documents; default is 95. Since this length is measured in words rather than tokens, it is multiplied by 1.4 to account for the fact that one word in the English vocabulary roughly translates to 1.3 to 1.5 tokens.

type percentile:

float
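The length heuristic above can be sketched as follows. The document word counts and the nearest-rank percentile method here are illustrative assumptions, not the library's exact implementation:

```python
import math

# Hypothetical word counts of the training passages.
doc_word_counts = [120, 150, 180, 200, 350, 90, 160, 210, 175, 140]

percentile = 95

# Nearest-rank percentile: the smallest word count that covers at
# least `percentile`% of the documents.
ranked = sorted(doc_word_counts)
idx = max(0, math.ceil(percentile / 100 * len(ranked)) - 1)
max_words = ranked[idx]

# Approximate token budget: roughly 1.4 tokens per English word.
max_tokens = round(max_words * 1.4)
```

With these example counts, `max_words` is 350 and the token budget becomes 490, which would bound the model's input length during training.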

Returns

return:

No return value is expected; the trained model and its zip file are saved to disk.

rtype:

None