seqio.test_utils package#

SeqIO test utilities.

class seqio.test_utils.DataInjector(task_name, per_split_data)[source]#

Inject per_split_data into task while within the scope of this object.

This context manager takes per_split_data, wraps it in a FunctionDataSource, and replaces the task's data source with it. While the context is active, the task's get_dataset(split) function returns per_split_data[split].

task_name#

A SeqIO task name.

per_split_data#

A string-keyed dict of string-keyed dicts. The top-level dict should be keyed by dataset splits, and the second-level dict should hold the dataset data.
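The swap-and-restore behaviour described above can be sketched with a plain-Python stand-in. ToyTask and ToyDataInjector are hypothetical names used only for illustration; they are not seqio classes, and the real DataInjector wraps the data in a FunctionDataSource rather than a lambda.

```python
class ToyTask:
    """Hypothetical stand-in for a task; not a seqio class."""

    def __init__(self, source):
        self.source = source  # callable: split -> data

    def get_dataset(self, split):
        return self.source(split)


class ToyDataInjector:
    """Illustrative context manager that swaps a task's data source."""

    def __init__(self, task, per_split_data):
        self._task = task
        self._per_split_data = per_split_data
        self._saved_source = task.source

    def __enter__(self):
        # Serve the injected data for whichever split is requested.
        self._task.source = lambda split: self._per_split_data[split]
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Restore the original source when the scope ends.
        self._task.source = self._saved_source
        return False


task = ToyTask(lambda split: {"inputs": ["original"]})
injected = {"train": {"inputs": ["hello"], "targets": ["world"]}}

with ToyDataInjector(task, injected):
    inside = task.get_dataset("train")   # injected data
outside = task.get_dataset("train")      # original data again
```

The point of the pattern is that the injected data is visible only inside the with block, so a test cannot leak fake data into later tests.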

class seqio.test_utils.FakeLazyTfds(name, tfds_splits, resolved_tfds_name, data_dir, load, load_shard, info, files, size)#
data_dir#

Alias for field number 3

files#

Alias for field number 7

info#

Alias for field number 6

load#

Alias for field number 4

load_shard#

Alias for field number 5

name#

Alias for field number 0

resolved_tfds_name#

Alias for field number 2

size#

Alias for field number 8

tfds_splits#

Alias for field number 1
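"Alias for field number N" is the docstring Python generates for namedtuple fields, so each attribute can also be read by position. The sketch below uses a locally defined namedtuple with the same field order as the signature above; FakeLazyTfdsSketch and the placeholder values are assumptions for illustration, not part of seqio.

```python
import collections

# Local stand-in with the same field order as the documented signature.
FakeLazyTfdsSketch = collections.namedtuple(
    "FakeLazyTfdsSketch",
    ["name", "tfds_splits", "resolved_tfds_name", "data_dir", "load",
     "load_shard", "info", "files", "size"],
)

# Fill every field with None, then override two placeholder values.
fake = FakeLazyTfdsSketch(*([None] * 9))._replace(
    name="fake:0.0.0", data_dir="/tmp/fake_data"
)

name_by_index = fake[0]      # same value as fake.name
data_dir_by_index = fake[3]  # same value as fake.data_dir
```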

class seqio.test_utils.FakeMixtureTest(*args, **kwargs)[source]#

TestCase that sets up fake cached and uncached tasks.

setUp()[source]#

Hook method for setting up the test fixture before exercising it.

class seqio.test_utils.FakeTaskTest(*args, **kwargs)[source]#

TestCase that sets up fake cached and uncached tasks.

setUp()[source]#

Hook method for setting up the test fixture before exercising it.

tearDown()[source]#

Hook method for deconstructing the test fixture after testing it.

verify_task_matches_fake_datasets(task_name='', use_cached=False, token_preprocessed=False, ndfeatures=False, ragged_features=False, splits=('train', 'validation'), sequence_length={'inputs': 13, 'targets': 13}, num_shards=None, task=None)[source]#

Assert all splits for both tokenized datasets are correct.

class seqio.test_utils.FakeTfdsInfo(splits, features, description, version, homepage, file_format, config_name)#
config_name#

Alias for field number 6

description#

Alias for field number 2

features#

Alias for field number 1

file_format#

Alias for field number 5

homepage#

Alias for field number 4

splits#

Alias for field number 0

version#

Alias for field number 3

class seqio.test_utils.MockVocabulary(encode_dict, vocab_size=None)[source]#

Mocks a vocabulary object for testing.

encode(s)[source]#

Tokenizes string to an int sequence, without adding EOS.

encode_tf(s)[source]#

Tokenizes string Scalar to an int32 Tensor, without adding EOS.

property vocab_size#

Vocabulary size, including extra ids.
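A minimal sketch of the idea behind a dictionary-backed mock vocabulary, assuming encode is a plain lookup in encode_dict. DictVocabularySketch is a hypothetical stand-in, not seqio's MockVocabulary, and it omits encode_tf because that method needs TensorFlow.

```python
class DictVocabularySketch:
    """Hypothetical dict-backed vocabulary; not seqio's MockVocabulary."""

    def __init__(self, encode_dict, vocab_size=None):
        self._encode_dict = encode_dict
        self._vocab_size = vocab_size

    def encode(self, s):
        # Return the canned token ids for `s`; no EOS id is appended.
        return self._encode_dict[s]

    @property
    def vocab_size(self):
        return self._vocab_size


vocab = DictVocabularySketch({"hello": [3, 4], "world": [5]}, vocab_size=10)
ids = vocab.encode("hello")
```

Canned encodings keep tests deterministic and independent of any real tokenizer model file.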

seqio.test_utils.assert_dataset(dataset, expected, expected_dtypes=None, rtol=1e-07, atol=0)[source]#

Tests whether the entire dataset == expected or [expected].

Parameters:
  • dataset – a tf.data dataset

  • expected – either a single example, or a list of examples. Each example is a dictionary.

  • expected_dtypes – an optional mapping from feature key to expected dtype.

  • rtol – the relative tolerance.

  • atol – the absolute tolerance.
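The sketch below illustrates how the two tolerances and the "expected or [expected]" convention could work. It assumes the tolerances follow the NumPy assert_allclose rule |actual - expected| <= atol + rtol * |expected|; the source does not state the formula, and within_tolerance and as_example_list are hypothetical helpers.

```python
def within_tolerance(actual, expected, rtol=1e-07, atol=0):
    """Assumed closeness rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def as_example_list(expected):
    """Wrap a single example dict in a list so both forms compare alike."""
    return expected if isinstance(expected, list) else [expected]


close_default = within_tolerance(1.0, 1.0 + 1e-9)      # inside default rtol
far_default = within_tolerance(1.0, 1.001)             # outside default rtol
close_loose = within_tolerance(1.0, 1.001, rtol=1e-2)  # inside looser rtol
examples = as_example_list({"inputs": [1, 2], "targets": [3]})
```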

seqio.test_utils.assert_datasets_eq(dataset1, dataset2)[source]#

Assert that two tfds datasets are equal.

seqio.test_utils.assert_datasets_neq(dataset1, dataset2)[source]#

Assert that two tfds datasets are unequal.

seqio.test_utils.assert_dict_contains(expected, actual)[source]#

Assert that ‘expected’ is a subset of the data in ‘actual’.
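The subset relation can be sketched as a boolean check. This is an illustration only: dict_contains is a hypothetical helper that returns True or False, whereas the real function raises an assertion error and may compare array values.

```python
def dict_contains(expected, actual):
    """Return True if every key/value pair in `expected` appears in `actual`."""
    return all(
        key in actual and actual[key] == value
        for key, value in expected.items()
    )


actual = {"inputs": "a b", "targets": "c", "idx": 0}
subset_ok = dict_contains({"inputs": "a b"}, actual)   # matching pair
subset_bad = dict_contains({"inputs": "x"}, actual)    # value differs
missing_key = dict_contains({"weights": 1}, actual)    # key absent
```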

seqio.test_utils.assert_dict_values_equal(a, b)[source]#

Assert that a and b contain equivalent numpy arrays.

seqio.test_utils.create_default_dataset(x, feature_names=('inputs', 'targets'), output_types=None, output_shapes=None)[source]#

Creates a dataset from the given sequence.

seqio.test_utils.get_fake_dataset(split, shuffle_files=False, seed=None, shard_info=None, ndfeatures=False, ragged_features=False)[source]#

Returns a tf.data.Dataset with fake data.

seqio.test_utils.random_token_preprocessor(ex, seed, sequence_length)[source]#

Selects a random shift to roll the tokens by for each feature.
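A plain-Python sketch of rolling each feature by a seeded random shift. It assumes each feature is a flat list of token ids and uses random.Random for the shift; roll and roll_features are hypothetical helpers, and the real preprocessor operates on tensors.

```python
import random


def roll(tokens, shift):
    """Rotate `tokens` right by `shift` positions."""
    if not tokens:
        return []
    shift %= len(tokens)
    return tokens[-shift:] + tokens[:-shift]


def roll_features(example, seed):
    """Roll every feature of `example` by its own seeded random shift."""
    rng = random.Random(seed)
    rolled = {}
    for name, tokens in example.items():
        shift = rng.randrange(len(tokens)) if tokens else 0
        rolled[name] = roll(tokens, shift)
    return rolled


example = {"inputs": [1, 2, 3, 4], "targets": [5, 6, 7]}
first = roll_features(example, seed=42)
second = roll_features(example, seed=42)  # same seed, same result
```

Because the shift depends only on the seed, the same seed reproduces the same output, which is what makes such a preprocessor useful for testing seeded pipelines.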

seqio.test_utils.split_tsv_preprocessor(dataset, field_names=('prefix', 'suffix'))[source]#

Splits TSV into dictionary.
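The per-line transformation can be sketched on an ordinary string. split_tsv_line is a hypothetical helper; the real preprocessor maps over a dataset of lines, and the error raised on a field-count mismatch is an assumption of this sketch.

```python
def split_tsv_line(line, field_names=("prefix", "suffix")):
    """Split one tab-separated line into a dict keyed by `field_names`."""
    values = line.split("\t")
    if len(values) != len(field_names):
        raise ValueError(
            f"expected {len(field_names)} fields, got {len(values)}"
        )
    return dict(zip(field_names, values))


row = split_tsv_line("first part\tsecond part")
```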

seqio.test_utils.test_postprocessing(task_name, raw_data, target_feature_name='targets', predict_output=None, score_output=None, feature_encoder=<seqio.feature_converters.EncDecFeatureConverter object>, sequence_length=None)[source]#

Test the postprocessing and metrics for a given task.

This function injects raw_data into the task, then creates an Evaluator based on that task. It then calls Evaluator.evaluate() using predict_fn and score_fn args that return predict_output and score_output, and returns the output of the evaluate() call. (Note that, because evaluate() uses the task data, this test also exercises the task preprocessing code.)

Usually, this function is invoked as metrics, _, _ = test_postprocessing(), since the second and third return values should be the same as the passed predict_output and score_output.

Parameters:
  • task_name – A SeqIO task name.

  • raw_data – A string-keyed dict of string-keyed dicts. The top-level dict should be keyed by dataset splits, and the second-level dict should hold the dataset data.

  • target_feature_name – Feature whose vocabulary will be used to encode predict_output. Defaults to 'targets'.

  • predict_output – A list of strings representing model predictions for the raw_data. Optional, only used when the task specifies metric_fns.

  • score_output – A list of floats representing the score of the raw_data. Optional, only used when the task specifies score_metric_fns.

  • feature_encoder – An optional feature encoder object. Defaults to an EncDecFeatureConverter instance.

  • sequence_length – An optional length specification.

Returns:

metrics – a mapping from metric name to values.

seqio.test_utils.test_preprocessing(task_name, raw_data, seed=None, sequence_length=None)[source]#

Test task preprocessing, returning iterator of the generated dataset.

This function injects raw_data into task and runs the preprocessing routines from task, returning the output of task.get_dataset().as_numpy_iterator().

Parameters:
  • task_name – A SeqIO task name.

  • raw_data – A string-keyed dict of string-keyed dicts. The top-level dict should be keyed by dataset splits, and the second-level dict should hold the dataset data.

  • seed – optional seed used for deterministic Task preprocessing. Specifically, this seed is passed to the Task to be used in map_seed_manager() wrappers around preprocessor functions.

  • sequence_length – optional mapping of feature names to their token lengths used in the model.

Returns:

Iterator with the result of running the task’s preprocessing code on raw_data.

seqio.test_utils.test_preprocessing_single(task_name, raw_data, seed=None, sequence_length=None)[source]#

Test task preprocessing, where a single item is expected to be generated.

This is similar to test_preprocessing, but returns a single generated item. This also asserts that no more than a single item is generated during preprocessing.

This function injects raw_data into task and runs the preprocessing routines from task, returning the output of next(task.get_dataset().as_numpy_iterator()).

Parameters:
  • task_name – A SeqIO task name.

  • raw_data – A string-keyed dict of string-keyed dicts. The top-level dict should be keyed by dataset splits, and the second-level dict should hold the dataset data.

  • seed – optional seed used for deterministic Task preprocessing. Specifically, this seed is passed to the Task to be used in map_seed_manager() wrappers around preprocessor functions.

  • sequence_length – optional mapping of feature names to their token lengths used in the model.

Returns:

The result of running the task’s preprocessing code on raw_data.
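The "exactly one item" check described for test_preprocessing_single can be sketched on any Python iterator. single_item is a hypothetical helper, not the library's implementation.

```python
def single_item(iterator):
    """Return the only item of `iterator`, asserting that nothing follows it."""
    item = next(iterator)
    sentinel = object()
    # A second successful next() means more than one item was generated.
    assert next(iterator, sentinel) is sentinel, "expected exactly one item"
    return item


only = single_item(iter([{"inputs": [1, 2], "targets": [3]}]))
```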

seqio.test_utils.test_task(task_name, raw_data, output_feature_name='targets', feature_encoder=<seqio.feature_converters.EncDecFeatureConverter object>, seed=None)[source]#

Test the preprocessing and metrics functionality for a given task.

This function injects raw_data into the task, runs the task preprocessing on that raw data, and extracts the expected value based on output_feature_name. It then creates an Evaluator object based on the task_name and runs evaluate using the expected value, returning both the result of the preprocessing and the metrics from the evaluate call.

The expected format for raw_data is a nested dict of the form {'split_name': {'data_key': data}}.

Note that testing metrics that use score_outputs from this API is currently unsupported.

Parameters:
  • task_name – A SeqIO task name.

  • raw_data – A string-keyed dict of string-keyed dicts. The top-level dict should be keyed by dataset splits, and the second-level dict should hold the dataset data.

  • output_feature_name – A string key for the output feature. Used to extract the expected target from the preprocessing output.

  • feature_encoder – An optional feature encoder object. Defaults to an EncDecFeatureConverter instance.

  • seed – optional seed used for deterministic Task preprocessing. Specifically, this seed is passed to the Task to be used in map_seed_manager() wrappers around preprocessor functions.

Returns:

A tuple (preprocessing_output, metrics), where preprocessing_output is the result of running the task’s preprocessing code on raw_data and metrics is a mapping from task name to computed metrics.
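The nested raw_data format that test_task and the related helpers expect can be checked with a small validator. check_raw_data is a hypothetical helper, and the split name, data keys, and text values below are placeholders; the keys a real task needs depend on its preprocessors.

```python
def check_raw_data(raw_data):
    """Check the {split: {data_key: data}} shape with string keys at both levels."""
    if not isinstance(raw_data, dict):
        return False
    for split, split_data in raw_data.items():
        if not isinstance(split, str) or not isinstance(split_data, dict):
            return False
        if not all(isinstance(key, str) for key in split_data):
            return False
    return True


raw_data = {
    "validation": {"inputs": "some input text", "targets": "some target text"},
}
well_formed = check_raw_data(raw_data)
flat_is_rejected = check_raw_data({"inputs": "some input text"})
```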

seqio.test_utils.test_text_preprocessor(dataset)[source]#

Performs preprocessing on the text dataset.

seqio.test_utils.test_token_preprocessor(dataset, output_features, sequence_length)[source]#

Change all occurrences of non-zero even numbered tokens in inputs to 50.
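The replacement rule can be sketched on a plain list of token ids. replace_even_tokens is a hypothetical helper; the real preprocessor applies the rule to the inputs feature of a dataset.

```python
def replace_even_tokens(tokens, replacement=50):
    """Replace every non-zero even token with `replacement`."""
    return [
        replacement if token != 0 and token % 2 == 0 else token
        for token in tokens
    ]


result = replace_even_tokens([0, 1, 2, 3, 4, 7])
```

Zero is left alone, which is consistent with treating 0 as padding; that reading is an assumption, since the source only says "non-zero".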