seqio.experimental package
Experimental utilities for SeqIO.
- class seqio.experimental.FewshotDataSource(original_source, num_shots, train_preprocessors=(), eval_preprocessors=(), train_split='train', train_feature_keys=('inputs', 'targets'), shuffle_buffer_size=1000, eval_on_fixed_exemplars=False)
Combines two splits of another DataSource to provide fewshot examples.
Output examples are a dictionary containing a single eval example and a batch of train examples. For example, with num_shots=2:
    {
        'train': {
            'inputs': [
                'How many Beatles are there?',
                'How many Beatles are alive in 2020?'
            ],
            'targets': ['4', '2']
        },
        'eval': {
            'inputs': 'What city were the Beatles from?',
            'targets': 'Liverpool'
        }
    }
Note that if num_shots is 0, the ‘train’ entry will not be included in the resulting examples.
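The output structure described above can be sketched in plain Python. This is only an illustration of the example layout, not the library's implementation: the helper name fewshot_examples is hypothetical, and cycling through the train examples in order (rather than shuffling a tf.data.Dataset) is an assumption made to keep the sketch deterministic.

```python
import itertools


def fewshot_examples(train, evals, num_shots):
    """Pairs each eval example with a batch of `num_shots` train examples.

    Hypothetical helper: `train` and `evals` are lists of dicts with
    'inputs' and 'targets' keys, standing in for the two splits.
    """
    if num_shots == 0:
        # Zero-shot case: the 'train' entry is left out entirely.
        for ex in evals:
            yield {"eval": ex}
        return
    # Assumption: train examples are reused in a fixed cycle.
    train_cycle = itertools.cycle(train)
    for ex in evals:
        shots = list(itertools.islice(train_cycle, num_shots))
        batch = {key: [s[key] for s in shots] for key in ("inputs", "targets")}
        yield {"train": batch, "eval": ex}


train = [
    {"inputs": "How many Beatles are there?", "targets": "4"},
    {"inputs": "How many Beatles are alive in 2020?", "targets": "2"},
]
evals = [{"inputs": "What city were the Beatles from?", "targets": "Liverpool"}]

two_shot = list(fewshot_examples(train, evals, num_shots=2))
zero_shot = list(fewshot_examples(train, evals, num_shots=0))
```

Here two_shot[0] has the same shape as the example above, while each element of zero_shot contains only the 'eval' key.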
- get_dataset(split=Split('train'), shuffle=True, seed=None, shard_info=None, *, sequence_length=None, use_cached=False, num_epochs=1)
Overrides base class to add shard identifier and remove use_cached.
- Parameters:
split – string, the split to return.
shuffle – bool, whether to shuffle the input source.
seed – tf.int64 scalar tf.Tensor (or None) for shuffling input source.
shard_info – optional specification for loading a shard of the split.
sequence_length – Unused
use_cached – Unused
num_epochs – Unused
- property supports_arbitrary_sharding
Whether the source supports sharding beyond the shards available in list_shards.
- seqio.experimental.add_fully_cached_mixture(mixture_name, sequence_length, disallow_shuffling=False)
Adds a fully cached version of the mixture for the given sequence lengths.
- seqio.experimental.add_fully_cached_task(task_name, sequence_length, disallow_shuffling=False)
Adds a fully cached version of the task for the given sequence lengths.
- seqio.experimental.add_task_with_sentinels(task_name, num_sentinels=1)
Adds sentinels to the inputs/outputs of a task.
Adds num_sentinels sentinels to the end of ‘inputs’ and at the beginning of ‘targets’. This is known to help fine-tuning span corruption models, especially on smaller datasets.
This will also rename the task by adding a “_{num_sentinels}_sentinel” suffix to the task name, but making sure it comes before the following suffixes: ‘_train’, ‘_dev’, ‘_test’, ‘.’.
Example before: 'inputs': What is the capital of Illinois? 'targets': Springfield.
Example after: 'inputs': What is the capital of Illinois? <extra_id_0> 'targets': <extra_id_0> Springfield.
- Parameters:
task_name – a str, which is the name of the task you want to have sentinels added to. Note this will not override the current task, but will create a new one.
num_sentinels – integer, the number of sentinels to add to the end of inputs and the beginning of targets.
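The sentinel placement and renaming rule described above can be sketched at the string level. This is a hedged illustration rather than the library's code: the helpers add_sentinels and sentinel_task_name are hypothetical, the numbering <extra_id_0>, <extra_id_1>, ... for multiple sentinels is an assumption, and inserting the new suffix before the first matching protected suffix is one plausible reading of the renaming rule.

```python
def add_sentinels(example, num_sentinels=1):
    """Appends sentinels to 'inputs' and prepends them to 'targets'.

    Assumption: sentinels are numbered <extra_id_0>, <extra_id_1>, ...
    """
    sentinels = " ".join(f"<extra_id_{i}>" for i in range(num_sentinels))
    return {
        "inputs": f"{example['inputs']} {sentinels}",
        "targets": f"{sentinels} {example['targets']}",
    }


def sentinel_task_name(task_name, num_sentinels=1):
    """Builds the new task name, keeping protected suffixes at the end."""
    tag = f"_{num_sentinels}_sentinel"
    for suffix in ("_train", "_dev", "_test", "."):
        idx = task_name.find(suffix)
        if idx >= 0:
            # Insert the tag just before the protected suffix.
            return task_name[:idx] + tag + task_name[idx:]
    return task_name + tag


before = {"inputs": "What is the capital of Illinois?", "targets": "Springfield."}
after = add_sentinels(before)
new_name = sentinel_task_name("trivia_dev")
```

With these definitions, after matches the "Example after" text above, and the hypothetical task name trivia_dev becomes trivia_1_sentinel_dev.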
- seqio.experimental.fewshot_preprocessor(ds, inputs_prefix='', targets_prefix='', example_separator='\n\n', prompt='', reverse=False)
Create ‘inputs’ and ‘targets’ strings for (zero/few)-shot evaluation.
Inputs and targets will be formatted using the given prefixes along with a separator between each pair. The few-shot examples from the train set will include both inputs and targets, whereas the eval example (at the end) will contain only the input followed by the targets prefix.
NOTE: The final targets prefix will be right-stripped so that the input does not end with whitespace.
For example, a 2-shot output might look like:
    output: {
        'inputs':
            '0 How many states in the US? X 1 50 X 0 How many cents in a dollar? X '
            '1 100 X 0 Who was in the Beatles? X 1',
        'targets': 'John',
        'answers': ['John', 'Paul', 'George', 'Ringo']
    }
- Parameters:
ds – A dictionary of zipped eval and train tf.data.Datasets, each preprocessed with at least the fields ‘inputs’ and ‘targets’. Note that the train dataset will not exist in the 0-shot case.
inputs_prefix – Prefix string for inputs.
targets_prefix – Prefix string for targets.
example_separator – The string separator to delimit different examples.
prompt – Optional prefix for the entire few-shot input. Typically consists of a natural language description of the task or task instructions.
reverse – If True, the list of few shot examples is reversed. If used with eval_on_fixed_exemplars = True and a fixed train_seed, the last N shots will be the same when num_shots is N or N+M. In other words, additional shots are prepended instead of appended.
- Returns:
A tf.data.Dataset containing ‘inputs’, ‘targets’, and any other features from the evaluation dataset.
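The formatting rules above can be sketched on plain Python strings. This is an illustration under stated assumptions, not the library's implementation: format_fewshot is a hypothetical helper that works on lists rather than a tf.data.Dataset, and it leaves out the prompt argument because the text does not say how the prompt is joined to the rest of the input. The prefix and separator values are chosen so that the result reproduces the 2-shot 'inputs' string shown above.

```python
def format_fewshot(train_inputs, train_targets, eval_input,
                   inputs_prefix="", targets_prefix="",
                   example_separator="\n\n", reverse=False):
    """Builds a few-shot 'inputs' string from train shots and one eval input."""
    # Each train shot carries both its input and its target.
    shots = [
        inputs_prefix + inp + targets_prefix + tgt
        for inp, tgt in zip(train_inputs, train_targets)
    ]
    if reverse:
        shots.reverse()
    # The eval example ends with the right-stripped targets prefix.
    final = inputs_prefix + eval_input + targets_prefix.rstrip()
    return example_separator.join(shots + [final])


two_shot_inputs = format_fewshot(
    ["How many states in the US?", "How many cents in a dollar?"],
    ["50", "100"],
    "Who was in the Beatles?",
    inputs_prefix="0 ",
    targets_prefix=" X 1 ",
    example_separator=" X ",
)

# With reverse=True and a fixed exemplar order, extra shots are prepended.
pool_inputs, pool_targets = ["a", "b", "c"], ["1", "2", "3"]
two = format_fewshot(pool_inputs[:2], pool_targets[:2], "q",
                     example_separator="|", reverse=True)
three = format_fewshot(pool_inputs, pool_targets, "q",
                       example_separator="|", reverse=True)
```

In the second part, two is a suffix of three, which mirrors the statement that the last N shots stay the same when more shots are requested with reverse enabled.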