seqio.evaluation package
- class seqio.evaluation.Evaluator(mixture_or_task_name, feature_converter, eval_split='validation', use_cached=False, seed=42, sequence_length=None, num_examples=None, shuffle=False, logger_cls=(), log_dir=None, use_memory_cache=True, async_compute_metrics=True, target_field_name='targets')
A class to encapsulate all eval-related information.
Users should define predict_fn and then pass it to the evaluate method. predict_fn should operate on an enumerated tf.data.Dataset. See the evaluate method for more detail.
Evaluation data is cached once and reused for an arbitrary number of evaluation runs.
If none of the evaluation tasks has metric functions defined, the evaluation is skipped. In that case, Evaluator.evaluate returns ({}, {}), assuming that compute_metrics is True.
Note that we cache two versions of the datasets. The first version (self.cached_task_datasets) has the task features (e.g., “inputs” and “targets”), which are returned from seqio.Task.get_dataset. The second version (self.cached_model_datasets) has model features (e.g., “decoder_target_tokens”). This is returned from the feature converter. The former is used for postprocessing associated with the Task that requires the original task datasets. The latter is passed to predict_fn for evaluation.
- eval_tasks
a mapping from a mixture or a task name to seqio.Task object(s).
- cached_model_datasets
cached evaluation datasets with model features.
- cached_task_datasets
cached evaluation datasets with task features.
- model_feature_shapes
mapping from model feature to its shape in the cached_model_datasets.
- loggers
a sequence of subclasses of Logger.
- evaluate(*, compute_metrics, step=None, predict_fn=None, score_fn=None, predict_with_aux_fn=None, model_fns=None)
Predict and score self.eval_tasks.
Evaluation must preserve the example ordering. This requirement is satisfied by using an enumerated dataset. Each of the cached eval task datasets is an enumerated tf.data.Dataset where each element has the format (index, example). Therefore, each index serves as a unique integer id for the example.
predict_fn takes as input the cached eval dataset. The output may be of the form Sequence[(index, token_ids)] where token_ids is the sequence of token ids output by the model with the input example whose index matches index. Therefore, even if predict_fn mixes the order of the examples during prediction, the order can be corrected as long as the correct index for each example is maintained. predict_with_aux_fn is almost exactly the same as predict_fn, except that it also returns a dictionary of auxiliary values along with each sequence of token_ids.
Similarly, score_fn takes the cached eval dataset as input and returns Sequence[(index, score)] where score is the sequence of log likelihood scores for the targets in the eval dataset.
A common case in which the ordering gets mixed is the multi-host setup, where the evaluation dataset is split across multiple hosts that make predictions independently and then combine their results.
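The index-based reordering described above can be illustrated with a minimal pure-Python sketch. It does not use seqio or TensorFlow; the helper name restore_order and the two per-host lists are hypothetical and only stand in for the (index, token_ids) pairs that a predict_fn might return.

```python
from typing import List, Sequence, Tuple

IndexedTokens = Tuple[int, Sequence[int]]


def restore_order(indexed_outputs: Sequence[IndexedTokens]) -> List[Sequence[int]]:
    """Sorts (index, token_ids) pairs by index and drops the index."""
    return [tokens for _, tokens in sorted(indexed_outputs, key=lambda pair: pair[0])]


# Hypothetical outputs gathered from two hosts, arriving out of order.
host_a = [(2, [7, 8]), (0, [1, 2, 3])]
host_b = [(3, [9]), (1, [4, 5])]

# Concatenating per-host results mixes the order; sorting by index repairs it.
ordered = restore_order(host_a + host_b)
print(ordered)
```

Because every example carries its own index, the combined results can be sorted without any knowledge of how the dataset was split across hosts.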
There are 4 steps involved in the evaluation using predicted tokens:
1. Model returns indices and output_tokens: Sequence[Tuple[int, Sequence[int]]], potentially with some auxiliary values.
2. Output tokens are decoded by vocab.decode.
3. Postprocessors are applied to the decoded output. These are denoted as predictions.
4. Each metric function is applied to the predictions and the cached targets.
Evaluation using auxiliary values follows the same steps as evaluation using predicted tokens, except that a Mapping[str, Sequence[Any]] is also returned, where the length of each Sequence[Any] should correspond to the number of elements in the dataset.
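The four steps for predicted tokens can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: ID_TO_WORD, decode, postprocess, accuracy and evaluate_predictions are all hypothetical stand-ins for a task's vocabulary, postprocessor and metric functions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy vocabulary; a real task would use its own vocabulary object.
ID_TO_WORD = {1: "yes", 2: "no", 3: "maybe"}


def decode(token_ids: Sequence[int]) -> str:
    """Stand-in for vocab.decode: maps token ids to a space-joined string."""
    return " ".join(ID_TO_WORD[i] for i in token_ids)


def postprocess(text: str) -> str:
    """Stand-in postprocessor: strips whitespace and lowercases."""
    return text.strip().lower()


def accuracy(targets: Sequence[str], predictions: Sequence[str]) -> Dict[str, float]:
    """Stand-in metric function comparing cached targets with predictions."""
    correct = sum(t == p for t, p in zip(targets, predictions))
    return {"accuracy": correct / len(targets)}


def evaluate_predictions(
    model_outputs: Sequence[Tuple[int, Sequence[int]]],
    targets: Sequence[str],
    metric_fns: Sequence[Callable[..., Dict[str, float]]],
) -> Dict[str, float]:
    # Step 1: the model returns (index, output_tokens); restore example order.
    ordered = [tokens for _, tokens in sorted(model_outputs, key=lambda p: p[0])]
    # Step 2: decode the output tokens.
    decoded: List[str] = [decode(tokens) for tokens in ordered]
    # Step 3: apply the postprocessor to obtain predictions.
    predictions = [postprocess(text) for text in decoded]
    # Step 4: apply each metric function to the predictions and cached targets.
    metrics: Dict[str, float] = {}
    for metric_fn in metric_fns:
        metrics.update(metric_fn(targets, predictions))
    return metrics


cached_targets = ["yes", "no", "maybe", "no"]
model_outputs = [(1, [2]), (0, [1]), (3, [1]), (2, [3])]  # deliberately shuffled
metrics = evaluate_predictions(model_outputs, cached_targets, [accuracy])
print(metrics)
```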
There are 2 steps involved in the evaluation using scores:
1. Model returns indices and scores: Sequence[Tuple[int, Sequence[float]]]
2. Each metric function is applied to the scores and the cached targets.
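The score-based path can be sketched in the same spirit. For simplicity, this sketch assumes each example's score has been reduced to a single float; mean_log_likelihood and evaluate_scores are hypothetical names, not part of seqio.

```python
from typing import Dict, Sequence, Tuple


def mean_log_likelihood(targets: Sequence[str], scores: Sequence[float]) -> Dict[str, float]:
    """Stand-in score metric: average log likelihood over all targets."""
    return {
        "mean_log_likelihood": sum(scores) / len(scores),
        "num_examples": float(len(targets)),
    }


def evaluate_scores(
    indexed_scores: Sequence[Tuple[int, float]],
    targets: Sequence[str],
) -> Dict[str, float]:
    # Step 1: the model returns (index, score); restore example order.
    ordered_scores = [score for _, score in sorted(indexed_scores, key=lambda p: p[0])]
    # Step 2: apply the metric function to the scores and the cached targets.
    return mean_log_likelihood(targets, ordered_scores)


cached_targets = ["yes", "no", "maybe"]
indexed_scores = [(2, -1.0), (0, -0.5), (1, -1.5)]  # deliberately shuffled
score_metrics = evaluate_scores(indexed_scores, cached_targets)
print(score_metrics)
```

No decoding or postprocessing is involved here, which is why this path has two steps rather than four.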
- Parameters:
compute_metrics – whether to compute metrics.
step – an optional step number of the current evaluation. If unspecified, a dummy value of -1 will be used.
predict_fn – a user-defined function, which takes in a tf.data.Dataset and outputs the sequence of predicted tokens. Only called if predict metrics exist for the tasks.
score_fn – a user-defined function, which takes in a tf.data.Dataset and outputs the log likelihood score of the targets. Only called if score metrics exist for the task.
predict_with_aux_fn – a user-defined function that has exactly the same behaviour as predict_fn, except that it also returns a dictionary of auxiliary values. Only called if predict_with_aux metrics exist for the tasks.
model_fns – a dict mapping model output type to the model function (user-defined) that can produce outputs of that model output type.
- Returns:
metrics – a Future containing a mapping from task name to computed metrics, or None if compute_metrics is False.
all_output – a mapping from task name to all the output that metric evaluation needs.
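A caller might consume the two return values roughly as sketched below. This assumes the returned Future behaves like a concurrent.futures.Future, which the text above does not state explicitly. The fake_evaluate function and the contents of its outputs are hypothetical and only mimic the documented return shape.

```python
import concurrent.futures
from typing import Any, Dict, Optional, Tuple


def fake_evaluate(
    compute_metrics: bool,
) -> Tuple[Optional[concurrent.futures.Future], Dict[str, Any]]:
    """Hypothetical stand-in that mimics the documented return shape."""
    all_output = {"my_task": {"prediction": ["yes", "no"]}}  # made-up contents
    if not compute_metrics:
        # Metrics are None when compute_metrics is False.
        return None, all_output
    future: concurrent.futures.Future = concurrent.futures.Future()
    future.set_result({"my_task": {"accuracy": 0.5}})
    return future, all_output


metrics_future, all_output = fake_evaluate(compute_metrics=True)
# result() blocks until the metrics are available.
metrics = metrics_future.result() if metrics_future is not None else None
print(metrics)

no_metrics, _ = fake_evaluate(compute_metrics=False)
print(no_metrics)
```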