seqio.preprocessors package
Preprocessors for SeqIO Tasks.
- seqio.preprocessors.append_eos(dataset, output_features)
Appends EOS to output feature token sequences with add_eos set to True.
Respects the add_eos field of the seqio.Features in output_features.
- Parameters:
dataset – a tf.data.Dataset of tokenized examples to preprocess.
output_features – a mapping of output feature names to Feature objects.
- Returns:
a tf.data.Dataset of tokenized examples with EOS added to specified output features.
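The effect can be illustrated without TensorFlow. The sketch below is a plain-Python stand-in, not seqio's implementation: `ToyFeature`, `toy_append_eos`, and the EOS id of 1 are assumptions made for illustration, and token sequences are ordinary lists rather than tensors.

```python
from dataclasses import dataclass


@dataclass
class ToyFeature:
    """Hypothetical stand-in for seqio.Feature; only the fields used here."""
    add_eos: bool = True
    eos_id: int = 1  # assumed EOS id, for illustration only


def toy_append_eos(example, output_features):
    """Append EOS to each feature listed in output_features with add_eos=True."""
    result = {}
    for name, tokens in example.items():
        feature = output_features.get(name)
        if feature is not None and feature.add_eos:
            result[name] = list(tokens) + [feature.eos_id]
        else:
            # Features not listed, or with add_eos=False, pass through unchanged.
            result[name] = tokens
    return result


toy_example = {"inputs": [5, 6, 7], "targets": [8, 9]}
toy_output_features = {
    "inputs": ToyFeature(add_eos=False),
    "targets": ToyFeature(add_eos=True),
}
toy_with_eos = toy_append_eos(toy_example, toy_output_features)
```

Only "targets" gains an EOS here, because "inputs" was declared with add_eos set to False.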
- seqio.preprocessors.append_eos_after_trim(dataset, output_features, sequence_length=None)
Trims output feature token sequences and then appends EOS.
Respects the add_eos field of the seqio.Features in output_features. Truncates features before adding the EOS to ensure they fit in the max length specified by sequence_length once the EOS is added. If sequence_length is None, no trimming is performed.
Note that sequences are automatically trimmed at the end of the Task pipeline, so unless you want the features to always end in EOS, use append_eos instead.
- Parameters:
dataset – a tf.data.Dataset of tokenized examples to preprocess.
output_features – a mapping of output feature names to Feature objects.
sequence_length – a mapping from output feature names to max lengths. If provided, output feature sequences will be trimmed to ensure they are not longer than this length once EOS is added.
- Returns:
a tf.data.Dataset of tokenized examples with EOS added to specified output features.
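The trim-then-append order described above can be sketched on a single token list. This is a plain-Python illustration under stated assumptions, not the library code: `toy_append_eos_after_trim` is a hypothetical helper, the EOS id of 1 is assumed, and `max_length` plays the role of one entry of sequence_length.

```python
def toy_append_eos_after_trim(tokens, max_length=None, eos_id=1):
    """Trim tokens to leave room for EOS, then append it (eos_id is assumed)."""
    tokens = list(tokens)
    if max_length is not None:
        # Keep at most max_length - 1 tokens so the EOS still fits.
        tokens = tokens[: max_length - 1]
    return tokens + [eos_id]


toy_trimmed = toy_append_eos_after_trim([5, 6, 7, 8, 9], max_length=4)
toy_untrimmed = toy_append_eos_after_trim([5, 6, 7, 8, 9])
```

With max_length=4 the result has exactly four items and ends in EOS; with no max_length the sequence is left whole and EOS is simply appended.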
- seqio.preprocessors.append_eos_after_trim_impl(features, output_features, sequence_length=None)
Trims output feature token sequences and then appends EOS.
Respects the add_eos field of the seqio.Features in output_features. Truncates features before adding the EOS to ensure they fit in the max length specified by sequence_length once the EOS is added. If sequence_length is None, no trimming is performed.
Note that sequences are automatically trimmed at the end of the Task pipeline, so unless you want the features to always end in EOS, use append_eos instead.
- Parameters:
features – a dict of tokenized examples to preprocess.
output_features – a mapping of output feature names to Feature objects.
sequence_length – a mapping from output feature names to max lengths. If provided, output feature sequences will be trimmed to ensure they are not longer than this length once EOS is added.
- Returns:
a dict of tokenized features with EOS added to specified output features.
- seqio.preprocessors.apply_feature_converter(dataset, sequence_length, feature_converter)
Applies feature converter on the dataset.
Example
Apply EncDecFeatureConverter with pack set to True to convert sequence-to-sequence examples to ‘packed examples’.
    preprocessors = [
        functools.partial(
            apply_feature_converter,
            feature_converter=feature_converters.EncDecFeatureConverter(pack=True)),
    ]
- Parameters:
dataset – a tf.data.Dataset of tokenized examples to pack.
sequence_length – a mapping from output feature names to max lengths.
feature_converter – an instance of feature_converters.FeatureConverter.
- Returns:
a tf.data.Dataset of packed examples.
- seqio.preprocessors.preprocess_tensorflow_examples(example, inputs_format, targets_format)
Parse a dict of tf.Tensors into inputs and targets.
This function takes a tf.data.Dataset of strings. The function returns a tf.data.Dataset of feature dictionaries of the form {“inputs”: string, “targets”: string}.
inputs_format contains a template string used to produce the “inputs” string. targets_format contains a template string used to produce the “targets” string.
- Parameters:
example (Dict[str, tf.Tensor]) – The example to preprocess, represented as a dictionary where the keys are the feature names and the values are TensorFlow tensors.
inputs_format (str) – A format string specifying how to preprocess the inputs. The format string can include placeholder fields surrounded by curly braces, which will be replaced by the corresponding values from the example dictionary. For example, if the inputs_format is “Summarize this {text}”, the placeholder “{text}” will be replaced by the value of the “text” feature in the example.
targets_format (str) – A format string specifying how to preprocess the targets. It follows the same rules as the inputs_format.
- Returns:
The preprocessed example, where the inputs and targets have been formatted according to the provided format strings.
- Return type:
Dict[str, tf.Tensor]
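The template substitution can be sketched with ordinary Python strings. This is only an analogy to show how the format strings map features to “inputs” and “targets”: `toy_format_example` is a hypothetical helper built on `str.format`, whereas the real preprocessor operates on tensors, and the feature names "text" and "summary" are assumed for illustration.

```python
def toy_format_example(example, inputs_format, targets_format):
    """Fill {placeholder} fields in both templates from the example dict."""
    return {
        "inputs": inputs_format.format(**example),
        "targets": targets_format.format(**example),
    }


toy_raw = {"text": "The quick brown fox.", "summary": "A fox."}
toy_formatted = toy_format_example(
    toy_raw,
    inputs_format="Summarize this {text}",
    targets_format="{summary}",
)
```

Each placeholder name must match a key in the example; here "{text}" and "{summary}" pick out the two assumed features.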
- seqio.preprocessors.print_dataset(features, summarize=3)
Print dataset fields for debugging purposes.
- seqio.preprocessors.rekey(x, key_map=None)
Replace the feature keys according to the mapping in key_map.
For example, if the dataset returns examples of the format {‘foo’: ‘something’, ‘bar’: ‘something else’} and key_map = {‘boo’: ‘foo’, ‘spar’: ‘bar’}, then this function will return examples with the format {‘boo’: ‘something’, ‘spar’: ‘something else’}.
If a new key maps to an empty key name or None, its value is set to an empty string.
- Parameters:
x – an example to process.
key_map – a dictionary mapping new keys to original keys.
- Returns:
A preprocessed example with the format listed above.
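The mapping rule can be written out as a short plain-Python sketch on a single example dict. `toy_rekey` is a hypothetical helper, not the library function, and returning the example unchanged when key_map is None is an assumption, since the text above does not say what happens in that case.

```python
def toy_rekey(x, key_map=None):
    """Build a new example whose keys are the new names in key_map."""
    if not key_map:
        # Assumption: with no mapping, the example is returned unchanged.
        return x
    return {
        # An empty or None original key yields an empty-string value.
        new_key: x[old_key] if old_key else ""
        for new_key, old_key in key_map.items()
    }


toy_x = {"foo": "something", "bar": "something else"}
toy_rekeyed = toy_rekey(toy_x, {"boo": "foo", "spar": "bar"})
toy_with_blank = toy_rekey(toy_x, {"boo": "foo", "empty": None})
```

The first call reproduces the example from the description; the second shows the empty-string rule for a new key mapped to None.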
- seqio.preprocessors.tokenize(dataset, output_features, copy_pretokenized=True, with_eos=False)
Encode output features with specified vocabularies.
Passes through other features unchanged. Optionally passes through a copy of the original features with the “_pretokenized” suffix added to the key.
When with_eos is True and input features have rank > 1, then an EOS is appended only to the last item of each 1-D sequence.
- Parameters:
dataset – a tf.data.Dataset of examples to tokenize.
output_features – a dict of Feature objects; their vocabulary attribute will be used to tokenize the specified features.
copy_pretokenized – bool, whether to pass through copies of original features with “_pretokenized” suffix added to the key.
with_eos – bool, whether to append EOS to the end of the sequence.
- Returns:
a tf.data.Dataset
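For the common rank-1 case, the pass-through and “_pretokenized” behavior can be sketched in plain Python. Everything named here is an assumption for illustration: `TOY_VOCAB` is a made-up whitespace vocabulary, `TOY_EOS_ID` is an assumed EOS id, and `toy_tokenize` is a hypothetical helper that ignores the per-feature add_eos setting and works on Python strings rather than tensors.

```python
TOY_VOCAB = {"hello": 3, "world": 4}  # hypothetical word-to-id vocabulary
TOY_EOS_ID = 1  # assumed EOS id, for illustration only


def toy_tokenize(example, feature_names, copy_pretokenized=True, with_eos=False):
    """Encode the named string features; pass everything else through."""
    result = {}
    for key, value in example.items():
        if key in feature_names and isinstance(value, str):
            if copy_pretokenized:
                # Keep the original text under a suffixed key.
                result[f"{key}_pretokenized"] = value
            ids = [TOY_VOCAB[word] for word in value.split()]
            if with_eos:
                ids.append(TOY_EOS_ID)
            result[key] = ids
        else:
            result[key] = value
    return result


toy_tokenized = toy_tokenize(
    {"inputs": "hello world", "idx": 7}, ["inputs"], with_eos=True
)
```

The "idx" feature is untouched, "inputs" becomes a list of ids ending in EOS, and the original text survives as "inputs_pretokenized".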
- seqio.preprocessors.tokenize_and_append_eos(dataset, output_features, copy_pretokenized=True)
Encode output features with specified vocabularies and append EOS.
Passes through non-string features unchanged. Optionally passes through a copy of the original features with the “_pretokenized” suffix added to the key.
- Parameters:
dataset – a tf.data.Dataset of examples to tokenize.
output_features – a dict of Feature objects; their vocabulary attribute will be used to tokenize the specified features.
copy_pretokenized – bool, whether to pass through copies of original features with “_pretokenized” suffix added to the key.
- Returns:
a tf.data.Dataset
- seqio.preprocessors.tokenize_impl(features, output_features, copy_pretokenized=True, with_eos=False)
Encode output features with specified vocabularies.
Passes through other features unchanged. Optionally passes through a copy of the original features with the “_pretokenized” suffix added to the key.
When with_eos is True and input features have rank > 1, then an EOS is appended only to the last item of each 1-D sequence.
- Parameters:
features – a string-keyed dict of tensors to tokenize.
output_features – a dict of Feature objects; their vocabulary attribute will be used to tokenize the specified features.
copy_pretokenized – bool, whether to pass through copies of original features with “_pretokenized” suffix added to the key.
with_eos – bool, whether to append EOS to the end of the sequence.
- Returns:
a string-keyed dict of Tensors
- seqio.preprocessors.truncate_inputs_left(example, sequence_length)
Pre-processor for truncating input sequences from the left.
Default seqio truncation always removes the overflow on the right, which may not be optimal for decoder-only models. Applying this pre-processor truncates the ‘inputs’ from the left according to sequence_length[‘inputs’]. This pre-processor should be applied after [seqio.preprocessors.tokenize, seqio.preprocessors.append_eos].
- Parameters:
example – an example to process.
sequence_length – dictionary with token sequence length for the inputs and targets.
- Returns:
Example with truncated ‘inputs’ sequence.
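The difference between left and right truncation can be shown on a plain list. This is a sketch, not the library code: `toy_truncate_inputs_left` is a hypothetical helper, token sequences are Python lists, and sequence_length['inputs'] is assumed to be a positive integer.

```python
def toy_truncate_inputs_left(example, sequence_length):
    """Keep only the last sequence_length['inputs'] tokens of 'inputs'."""
    result = dict(example)
    max_len = sequence_length["inputs"]  # assumed to be a positive integer
    result["inputs"] = example["inputs"][-max_len:]
    return result


toy_tokens = {"inputs": [1, 2, 3, 4, 5], "targets": [9, 9]}
toy_lengths = {"inputs": 3, "targets": 2}

toy_left = toy_truncate_inputs_left(toy_tokens, toy_lengths)
# Right truncation, for comparison, keeps the first tokens instead.
toy_right = toy_tokens["inputs"][: toy_lengths["inputs"]]
```

Left truncation keeps the tokens closest to the end of the input, which is the part nearest to where a decoder-only model begins generating; 'targets' is left as-is.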