seqio.experimental package

Experimental utilities for SeqIO.

class seqio.experimental.FewshotDataSource(original_source, num_shots, train_preprocessors=(), eval_preprocessors=(), train_split='train', train_feature_keys=('inputs', 'targets'), shuffle_buffer_size=1000, eval_on_fixed_exemplars=False)

Combines two splits of another DataSource to provide fewshot examples.

Output examples are a dictionary containing a single eval example and a batch of train examples. For example, with num_shots=2:

    {
      'train': {
        'inputs': [
          'How many Beatles are there?',
          'How many Beatles are alive in 2020?'
        ],
        'targets': ['4', '2']
      },
      'eval': {
        'inputs': 'What city were the Beatles from?',
        'targets': 'Liverpool'
      }
    }

Note that if num_shots is 0, the ‘train’ entry will not be included in the resulting examples.
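The structure described above can be illustrated with a minimal pure-Python sketch. It does not use SeqIO or tf.data; the helper name make_fewshot_example and the sample data are hypothetical and only mirror the documented output shape, including the omission of the 'train' entry when num_shots is 0.

```python
from typing import Dict, Sequence

Example = Dict[str, str]

# Hypothetical sample data, taken from the example in this section.
TRAIN = [
    {"inputs": "How many Beatles are there?", "targets": "4"},
    {"inputs": "How many Beatles are alive in 2020?", "targets": "2"},
    {"inputs": "Who played drums for the Beatles?", "targets": "Ringo"},
]
EVAL = {"inputs": "What city were the Beatles from?", "targets": "Liverpool"}


def make_fewshot_example(
    train_examples: Sequence[Example],
    eval_example: Example,
    num_shots: int,
    feature_keys: Sequence[str] = ("inputs", "targets"),
) -> dict:
    """Hypothetical helper: pairs one eval example with a batch of train shots."""
    result = {"eval": dict(eval_example)}
    if num_shots > 0:
        shots = train_examples[:num_shots]
        # Batch each feature across the selected shots.
        result["train"] = {key: [ex[key] for ex in shots] for key in feature_keys}
    return result


two_shot = make_fewshot_example(TRAIN, EVAL, num_shots=2)
zero_shot = make_fewshot_example(TRAIN, EVAL, num_shots=0)
```

Here two_shot has the same layout as the example above, while zero_shot contains only the 'eval' entry.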

get_dataset(split=Split('train'), shuffle=True, seed=None, shard_info=None, *, sequence_length=None, use_cached=False, num_epochs=1)

Overrides base class to add shard identifier and remove use_cached.

Parameters:
  • split – string, the split to return.

  • shuffle – bool, whether to shuffle the input source.

  • seed – tf.int64 scalar tf.Tensor (or None) for shuffling input source.

  • shard_info – optional specification for loading a shard of the split.

  • sequence_length – Unused

  • use_cached – Unused

  • num_epochs – Unused

list_shards(split)

Returns string identifiers of input shards.

property supports_arbitrary_sharding

Whether the source supports sharding beyond the shards available in list_shards.

seqio.experimental.add_fully_cached_mixture(mixture_name, sequence_length, disallow_shuffling=False)

Adds a fully-cached version of the mixture for the given sequence lengths.

seqio.experimental.add_fully_cached_task(task_name, sequence_length, disallow_shuffling=False)

Adds a fully-cached version of the task for the given sequence lengths.

seqio.experimental.add_task_with_sentinels(task_name, num_sentinels=1)

Adds sentinels to the inputs/outputs of a task.

Adds num_sentinels sentinels to the end of ‘inputs’ and at the beginning of ‘targets’. This is known to help fine-tuning span corruption models, especially on smaller datasets.

The new task is named by adding a “_{num_sentinels}_sentinel” suffix to the task name. The suffix is inserted before any of the following suffixes, if present: ‘_train’, ‘_dev’, ‘_test’, ‘.’.

Example before:

    'inputs': What is the capital of Illinois?
    'targets': Springfield.

Example after:

    'inputs': What is the capital of Illinois? <extra_id_0>
    'targets': <extra_id_0> Springfield.

Parameters:
  • task_name – a str, which is the name of the task you want to have sentinels added to. Note this will not override the current task, but will create a new one.

  • num_sentinels – integer, the number of sentinels to add to the end of inputs and the beginning of targets.
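
The renaming rule and the before/after example can be sketched in plain Python. This is an illustration only, not the library's implementation: both helper names are hypothetical, the sketch works on strings rather than on a task's tokenized features, and the use of consecutive `<extra_id_i>` tokens for num_sentinels > 1 is an assumption. The task names in the usage lines are made up.

```python
from typing import Dict

# Suffixes that the sentinel suffix must come before, as listed above.
_PROTECTED_SUFFIXES = ("_train", "_dev", "_test", ".")


def sentinel_task_name(task_name: str, num_sentinels: int = 1) -> str:
    """Hypothetical helper: builds the name of the new sentinel task."""
    suffix = f"_{num_sentinels}_sentinel"
    for marker in _PROTECTED_SUFFIXES:
        idx = task_name.find(marker)
        if idx >= 0:
            # Insert the sentinel suffix right before the protected suffix.
            return task_name[:idx] + suffix + task_name[idx:]
    return task_name + suffix


def add_sentinels(example: Dict[str, str], num_sentinels: int = 1) -> Dict[str, str]:
    """Hypothetical helper: string-level view of the sentinel insertion."""
    # Assumption: sentinels are numbered consecutively starting from 0.
    sentinels = " ".join(f"<extra_id_{i}>" for i in range(num_sentinels))
    return {
        "inputs": f"{example['inputs']} {sentinels}",
        "targets": f"{sentinels} {example['targets']}",
    }


before = {"inputs": "What is the capital of Illinois?", "targets": "Springfield."}
after = add_sentinels(before)
name_plain = sentinel_task_name("my_task")
name_dev = sentinel_task_name("trivia_qa_dev")
```

With these inputs, after matches the "Example after" shown above, and the sentinel suffix lands before "_dev" in name_dev.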

seqio.experimental.disable_registry()

Disables the seqio TaskRegistry and MixtureRegistry.

seqio.experimental.fewshot_preprocessor(ds, inputs_prefix='', targets_prefix='', example_separator='\n\n', prompt='', reverse=False)

Create ‘inputs’ and ‘targets’ strings for (zero/few)-shot evaluation.

Inputs and targets will be formatted using the given prefixes along with a separator between each pair. The few-shot examples from the train set will include both inputs and targets, whereas the eval example (at the end) will contain only the input followed by the targets prefix.

NOTE: The final target prefix will be right-stripped so that the input does not end with whitespace.

For example, a 2-shot output might look like:

    output: {
      'inputs':
        '0 How many states in the US? X 1 50 X 0 How many cents in a dollar? X '
        '1 100 X 0 Who was in the Beatles? X 1',
      'targets': 'John',
      'answers': ['John', 'Paul', 'George', 'Ringo']
    }
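
The formatting rule can be sketched as a small pure-Python function. This is not the library code, which operates on tf.data.Datasets: the name format_fewshot is hypothetical, the prompt argument is left out, and the prefix and separator values in the usage lines are assumptions chosen because they reproduce the 'inputs' string in the example above.

```python
from typing import List, Sequence, Tuple


def format_fewshot(
    train_shots: Sequence[Tuple[str, str]],
    eval_input: str,
    inputs_prefix: str = "",
    targets_prefix: str = "",
    example_separator: str = "\n\n",
) -> str:
    """Hypothetical helper: builds the few-shot 'inputs' string."""
    # Each train shot contributes both its input and its target.
    parts: List[str] = [
        inputs_prefix + shot_input + targets_prefix + shot_target
        for shot_input, shot_target in train_shots
    ]
    # The eval example ends with the right-stripped targets prefix only.
    parts.append(inputs_prefix + eval_input + targets_prefix.rstrip())
    return example_separator.join(parts)


shots = [
    ("How many states in the US?", "50"),
    ("How many cents in a dollar?", "100"),
]
two_shot_inputs = format_fewshot(
    shots,
    "Who was in the Beatles?",
    inputs_prefix="0 ",          # assumed value
    targets_prefix=" X 1 ",      # assumed value
    example_separator=" X ",     # assumed value
)
zero_shot_inputs = format_fewshot(
    [], "Who was in the Beatles?", inputs_prefix="Q: ", targets_prefix=" A: "
)
```

In the zero-shot call there are no train shots, so the result is just the prefixed eval input followed by the stripped targets prefix.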

Parameters:
  • ds – A dictionary of zipped eval and train tf.data.Datasets, each preprocessed with at least the fields ‘inputs’ and ‘targets’. Note that the train dataset will not exist in the 0-shot case.

  • inputs_prefix – Prefix string for inputs.

  • targets_prefix – Prefix string for targets.

  • example_separator – The string separator to delimit different examples.

  • prompt – Optional prefix for the entire few-shot input. Typically consists of a natural language description of the task or task instructions.

  • reverse – If True, the list of few shot examples is reversed. If used with eval_on_fixed_exemplars = True and a fixed train_seed, the last N shots will be the same when num_shots is N or N+M. In other words, additional shots are prepended instead of appended.

Returns:

A tf.data.Dataset containing ‘inputs’, ‘targets’, and any other features from the evaluation dataset.
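
The effect of reverse described in the parameter list can be checked with a short sketch. It assumes that, with fixed exemplars and a fixed seed, the shots chosen for N are a prefix of those chosen for N+M, which is what the statement about reverse implies. The helper select_shots and the exemplar names are hypothetical.

```python
from typing import List, Sequence


def select_shots(
    fixed_exemplars: Sequence[str], num_shots: int, reverse: bool = False
) -> List[str]:
    """Hypothetical helper: picks the first num_shots exemplars, optionally reversed."""
    shots = list(fixed_exemplars[:num_shots])
    return shots[::-1] if reverse else shots


exemplars = ["e1", "e2", "e3", "e4", "e5"]

# Without reverse, extra shots are appended after the shared ones.
forward_3 = select_shots(exemplars, 3)
forward_5 = select_shots(exemplars, 5)

# With reverse, extra shots are prepended, so the last N shots are shared.
reversed_3 = select_shots(exemplars, 3, reverse=True)
reversed_5 = select_shots(exemplars, 5, reverse=True)
```

With reverse enabled, the shots closest to the eval example stay the same as num_shots grows, which can make comparisons across shot counts easier to interpret.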