seqio.utils package
Utilities for data loading and processing.
- class seqio.utils.Feature(vocabulary, add_eos=True, required=True, dtype=tf.int32, rank=1)
A container for attributes of output features of data providers.
- dtype = tf.int32
- class seqio.utils.LazyTfdsLoader(name=None, data_dir=None, split_map=None, decoders=None, builder_kwargs=None, read_only=False)
Wrapper for TFDS datasets with memoization and additional functionality.
Lazily loads info from TFDS and provides memoization to avoid expensive hidden file operations. Also provides additional utility methods.
- builder_cls(split=None)
Returns the DatasetBuilder class for this TFDS dataset.
If no builder class can be found (e.g. the class has not been imported), then None will be returned.
- Parameters:
split – optional split for which to return the builder class. This can be used in case there’s a custom split map with different datasets per split.
- Returns:
the builder class of the dataset in case it can be found and None if the builder class cannot be found.
- property data_dir
Returns the data directory for this TFDS dataset.
- get_split_params(split=None)
Returns a tuple of (dataset, data_dir) for the given canonical split.
- load(split, shuffle_files, seed=None, shard_info=None)
Returns a tf.data.Dataset for the given split.
- load_shard(file_instruction, shuffle_files=False, seed=None)
Returns a dataset for a single shard of the TFDS TFRecord files.
- resolved_tfds_name(split=None)
Returns the resolved TFDS dataset name.
When the specified TFDS name doesn’t specify everything, e.g. the version has a wildcard or the config is not specified, then this function returns the complete TFDS name if the dataset has already been loaded.
- Parameters:
split – optional split name.
- Returns:
complete TFDS name.
- class seqio.utils.TfdsSplit(dataset, split, data_dir=None)
Points to a specific TFDS split.
- dataset
dataset name.
- Type:
str
- split
TFDS split (e.g. ‘train’), or slice (e.g. ‘train[“:1%”]’).
- Type:
str | None
- data_dir
directory to read/write TFDS data.
- Type:
str | None
- seqio.utils.add_kwargs_to_transform(transform, **kwargs)
Returns the partial function or dataclass with the kwargs.
We use this function to add common arguments (sequence_length, output_features) to all transformations that require them.
- Parameters:
transform – A dataclass or function.
**kwargs – Arguments to be passed to the transform.
- Returns:
If transform is a dataclass, attributes matching the kwargs keys will be set to the kwargs values. Otherwise, if transform is a function and takes any of the provided kwargs, these will be passed (by creating a partial function).
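The behavior described above can be sketched in plain Python. This is an illustration under stated assumptions, not the seqio implementation: `add_kwargs_sketch`, `truncate`, and `Truncate` are hypothetical names, and the sketch assumes that only matching fields or parameters are bound while the rest are silently dropped.

```python
import dataclasses
import functools
import inspect
from typing import Optional


def add_kwargs_sketch(transform, **kwargs):
    """Binds only the kwargs that `transform` can accept (illustrative)."""
    if dataclasses.is_dataclass(transform) and not isinstance(transform, type):
        # Dataclass instance: set only the fields that actually exist.
        field_names = {f.name for f in dataclasses.fields(transform)}
        matching = {k: v for k, v in kwargs.items() if k in field_names}
        return dataclasses.replace(transform, **matching)
    # Plain function: bind only the parameters named in its signature.
    params = inspect.signature(transform).parameters
    matching = {k: v for k, v in kwargs.items() if k in params}
    return functools.partial(transform, **matching)


def truncate(tokens, sequence_length):
    # Hypothetical transform that needs sequence_length but not output_features.
    return tokens[: sequence_length["inputs"]]


@dataclasses.dataclass
class Truncate:
    # Hypothetical dataclass-style transform with one configurable field.
    sequence_length: Optional[dict] = None


common = {"sequence_length": {"inputs": 3}, "output_features": None}
bound = add_kwargs_sketch(truncate, **common)
result = bound([5, 6, 7, 8, 9])
configured = add_kwargs_sketch(Truncate(), **common)
```

Here `output_features` is ignored in both cases because neither `truncate` nor `Truncate` declares it, which is what lets one set of common arguments be offered to every transformation.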
- seqio.utils.dict_to_tfexample(dct, store_shapes=False)
Convert dictionary of tensors to a tf.train.Example proto.
NOTE: Unfortunately, tf.train.Example is a very simple proto that can only store keyed lists of int64, float and bytes, and doesn’t have a marker to store metadata needed for multi-dimensional or ragged tensors. This function stores that metadata in additional features; using tfexample_to_dict will allow you to recover the original dictionary of tensors, modulo the following caveat.
tf.train.Example only stores int64s and float32s: this function casts bools and ints to int64 and floats to float32, so the original type information is lost.
- Parameters:
dct – A dictionary mapping feature keys to 1-D tensors.
store_shapes – If true, add an additional feature to store the shapes of any 2+d or sparse tensors. This feature is only required when using tfexample_to_dict, as tf.io.parse_single_example requires the shape be provided as an argument.
- Returns:
A tf.train.Example for dct
- seqio.utils.flatten_dict(nested_dct, delimiter='/')
Create a “flattened” dictionary from one with nested keys.
This method converts a nested TFDict like:
>>> {
>>>   "key1": {
>>>     "subkey1": ...,
>>>     "subkey2": ...,
>>>   },
>>>   "key2": {
>>>     "subkey1": ...
>>>   },
>>> }
into a flattened dictionary:
>>> {
>>>   "key1/subkey1": ...,
>>>   "key1/subkey2": ...,
>>>   "key2/subkey1": ...,
>>> }
- Parameters:
nested_dct – A nested dictionary of tensors.
delimiter – A delimiter used to separate keys from subkeys.
- Returns:
A flattened version of nested_dct.
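As a rough illustration of the key-joining rule, the following plain-Python sketch flattens ordinary dictionaries. `flatten_sketch` is a hypothetical stand-in, not the library function, and it uses integers where seqio would hold tensors.

```python
def flatten_sketch(nested, delimiter="/"):
    """Joins nested keys with `delimiter` (illustrative only)."""
    flat = {}
    for key, value in nested.items():
        if isinstance(value, dict):
            # Recurse, then prefix every sub-key with the parent key.
            for subkey, subvalue in flatten_sketch(value, delimiter).items():
                flat[f"{key}{delimiter}{subkey}"] = subvalue
        else:
            flat[key] = value
    return flat


nested = {"key1": {"subkey1": 1, "subkey2": 2}, "key2": {"subkey1": 3}}
flat = flatten_sketch(nested)
dotted = flatten_sketch(nested, delimiter=".")
```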
- seqio.utils.fully_qualified_class_name(instance)
Returns the fully qualified class name of the given instance.
- seqio.utils.function_name(function)
Returns the name of a (possibly partially applied) function.
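One plausible way to obtain such names with the standard library is sketched below. Both helpers are hypothetical illustrations and may differ from what seqio does internally, for example in how objects without a `__name__` attribute are handled.

```python
import collections
import functools


def function_name_sketch(function):
    """Returns the name of the callable underneath any functools.partial."""
    while isinstance(function, functools.partial):
        function = function.func
    # Fall back to the class name for callables that lack __name__.
    return getattr(function, "__name__", type(function).__name__)


def fully_qualified_class_name_sketch(instance):
    """Returns '<module>.<qualified class name>' for the given instance."""
    cls = type(instance)
    return f"{cls.__module__}.{cls.__qualname__}"


def tokenize(text, lowercase=False):
    # Hypothetical preprocessor used only to demonstrate the helpers.
    return text.lower().split() if lowercase else text.split()


partial_name = function_name_sketch(functools.partial(tokenize, lowercase=True))
class_name = fully_qualified_class_name_sketch(collections.OrderedDict())
```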
- seqio.utils.list_files(file_patterns)
Returns a sorted list of files for the given file pattern.
Note that shard patterns like foo@2 are not expanded. Only glob patterns are expanded, e.g., foo*.
- Parameters:
file_patterns – A string or an iterable of strings, each of which is a file pattern to expand.
- seqio.utils.make_autoregressive_inputs(targets, sequence_id=None, output_dtype=None, bos_id=0)
Generate inputs for an autoregressive model, by shifting the targets.
Modified from mesh_tensorflow.transformer.transformer.autoregressive_inputs.
For the first element of each sequence, the returned input id is bos_id (0 by default).
For a “packed” dataset, also pass the sequence_id tensor, which aligns with the targets tensor and contains different values for different concatenated examples.
Example for a packed dataset:
```
targets     = [3, 8, 1, 9, 1, 5, 4, 1, 0, 0]
sequence_id = [1, 1, 1, 2, 2, 3, 3, 3, 0, 0]
inputs      = [0, 3, 8, 0, 9, 0, 5, 4, 0, 0]
                        |     |
            These positions are set to 0 if sequence_id is not None.
```
- Parameters:
targets – a tf.int32 tensor with shape [length].
sequence_id – an optional tensor with the same shape as targets.
output_dtype – an optional output data type.
bos_id – the beginning-of-sequence (BOS) token id used for the first element of each sequence.
- Returns:
a tensor with dtype tf.int32 and the same shape as targets.
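The shifting rule can be checked with a small list-based sketch. It is a plain-Python stand-in for the tensor operations, written only to reproduce the packed example above; `autoregressive_inputs_sketch` is a hypothetical name and dtype handling is omitted.

```python
def autoregressive_inputs_sketch(targets, sequence_id=None, bos_id=0):
    """Shifts targets right by one, restarting at each packed segment."""
    if not targets:
        return []
    # Position i sees the target from position i - 1; position 0 sees BOS.
    inputs = [bos_id] + list(targets[:-1])
    if sequence_id is not None:
        for i in range(1, len(inputs)):
            # A change in sequence_id marks the first token of a new segment,
            # which must not see the last token of the previous segment.
            if sequence_id[i] != sequence_id[i - 1]:
                inputs[i] = bos_id
    return inputs


targets = [3, 8, 1, 9, 1, 5, 4, 1, 0, 0]
sequence_id = [1, 1, 1, 2, 2, 3, 3, 3, 0, 0]
packed_inputs = autoregressive_inputs_sketch(targets, sequence_id)
unpacked_inputs = autoregressive_inputs_sketch(targets)
```

Without `sequence_id`, the end-of-sequence token (1) of each concatenated example leaks into the next one, which is why packed datasets should pass it.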
- seqio.utils.map_over_dataset(fn=None, *, num_seeds=None, num_parallel_calls=-1)
Decorator to map decorated function over dataset.
Many preprocessors map a function over a dataset. This decorator helps reduce boilerplate for this common pattern.
If num_seeds is set to 1, a unique random seed (pair of int32) will be passed to the mapping function with keyword ‘seed’. If num_seeds is greater than 1, unique random seeds (pairs of int32) will be passed to the mapping function with keyword ‘seeds’. These seeds can be generated deterministically by using the map_seed_manager to set the seed for the process that generates the individual seeds for each mapping function. These seeds will be set sequentially from the initial seed for each call to map_over_dataset where num_seeds > 0.
- Parameters:
fn – map function
num_seeds – optional number of random seeds (pairs of int32) to pass to the mapping function.
num_parallel_calls – num_parallel_calls value to pass to Dataset.map
- Returns:
Callable transform which takes dataset as first argument.
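The boilerplate this decorator removes can be illustrated with a simplified analogue that maps over a Python list instead of a tf.data.Dataset. `map_over_examples` and `append_eos` are hypothetical names, and the sketch deliberately leaves out random seeds and parallel calls.

```python
import functools


def map_over_examples(fn=None):
    """Turns a per-example function into a whole-dataset transform."""

    def wrap(map_fn):
        @functools.wraps(map_fn)
        def transform(dataset, *args, **kwargs):
            # Extra arguments are forwarded to every per-example call.
            return [map_fn(example, *args, **kwargs) for example in dataset]

        return transform

    # Support both `@map_over_examples` and `@map_over_examples()`.
    return wrap if fn is None else wrap(fn)


@map_over_examples
def append_eos(example, eos_id):
    # Written for one example; the decorator applies it to all of them.
    return {**example, "targets": example["targets"] + [eos_id]}


dataset = [{"targets": [3, 8]}, {"targets": [9]}]
mapped = append_eos(dataset, eos_id=1)
```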
- seqio.utils.map_seed_manager(initial_seed=None)
Contextmanager to control the initial seed used by map_over_dataset.
- seqio.utils.mixing_rate_num_characters(task, temperature=1.0, char_count_name='text_chars')
Mixing rate based on the number of characters for the task’s ‘train’ split.
- Parameters:
task – the seqio.Task to compute a rate for.
temperature – a temperature (T) to scale rate (r) by as r^(1/T).
char_count_name – feature name of the character counts in the cached stats file.
- Returns:
The mixing rate for this task.
- seqio.utils.mixing_rate_num_examples(task, maximum=None, scale=1.0, temperature=1.0, fallback_to_num_input_examples=True, split='train')
Mixing rate based on the number of examples for the task’s split.
Note that SeqIO only injects the task; all other parameters must be provided by the user when initializing the Mixture.
- Parameters:
task – the seqio.Task to compute a rate for.
maximum – an optional maximum value to clip at after constant scaling but before temperature scaling.
scale – a multiplicative scaling factor applied before temperature.
temperature – a temperature (T) to scale rate (r) by as r^(1/T).
fallback_to_num_input_examples – whether to fall back to using the number of input examples when the Task is not cached. Otherwise, an error will be raised.
split – the split to look at for cached stats.
- Returns:
The mixing rate for this task.
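The order of operations described by the parameters (scale, then clip, then temperature) can be written out as arithmetic. The sketch below takes the example count directly rather than reading it from a Task's cached stats, so `mixing_rate_sketch` is a hypothetical illustration of the formula only.

```python
def mixing_rate_sketch(num_examples, maximum=None, scale=1.0, temperature=1.0):
    """Computes min(num_examples * scale, maximum) ** (1 / temperature)."""
    rate = num_examples * scale
    if maximum is not None:
        # Clipping happens after scaling but before temperature.
        rate = min(rate, maximum)
    if temperature != 1.0:
        rate = rate ** (1.0 / temperature)
    return rate


small = mixing_rate_sketch(1_000, temperature=2.0)
large = mixing_rate_sketch(1_000_000, temperature=2.0)
clipped = mixing_rate_sketch(1_000_000, maximum=10_000)
```

With T = 2 the ratio between the two tasks shrinks from 1000 to about 31.6, showing how a temperature above 1 flattens the mixture toward smaller tasks.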
- seqio.utils.set_preprocessor_seed(preprocessor_fn, seed=None)
Sets the internal map seed for the provided preprocessor.
- seqio.utils.tfexample_ragged_length_key(key, dim)
Demarcates feature used to store dim-th ragged lengths for key.
This function can be used when parsing data generated by dict_to_tfexample, by specifying the following ragged feature:
>>> KEY = "ragged_tensor"
>>> tf.io.parse_single_example(example, {
>>>   KEY: tf.io.RaggedFeature(
>>>     dtype,
>>>     value_key=KEY,
>>>     partitions=(
>>>       tf.io.RaggedFeature.RowLengths(tfexample_ragged_length_key(KEY, 0)),
>>>       tf.io.RaggedFeature.RowLengths(tfexample_ragged_length_key(KEY, 1)),
>>>       ... (repeat for however many ragged dimensions the data has)
>>>     ),)
>>> })
- Parameters:
key – The key storing the values of the ragged tensor.
dim – The ragged dimension to generate a prefix for.
- Returns:
The key of the feature storing the dim-th ragged lengths.
- seqio.utils.tfexample_sparse_indices_key(key, dim)
Demarcates feature used to store dim-th sparse feature indices for key.
This function can be used when parsing data generated by dict_to_tfexample, by specifying the following sparse feature:
>>> KEY = "sparse_tensor"
>>> tf.io.parse_single_example(example, {
>>>   KEY: tf.io.SparseFeature(
>>>     value_key=KEY,
>>>     index_key=[
>>>       tfexample_sparse_indices_key(KEY, 0),
>>>       tfexample_sparse_indices_key(KEY, 1),
>>>       ... (repeat for however many sparse dimensions the data has)
>>>     ],
>>>     size=[shape0, shape1, ...],
>>>     dtype=dtype,
>>>   ),
>>> })
- Parameters:
key – The key storing the values of the sparse tensor.
dim – The sparse dimension to generate a prefix for.
- Returns:
The key of the feature storing the dim-th sparse indices.
- seqio.utils.tfexample_to_dict(example)
Helper function to create a dictionary of tensors from a TFExample.
NOTE: this function is less efficient than tf.io.parse_single_example, but allows parsing of TFExample without knowing its features ahead of time. See the documentation of tfexample_ragged_length_key for an example of how to parse ragged tensors efficiently using tf.io.parse_single_example.
- Parameters:
example – An instance of tf.train.Example we will convert into a dict.
- Returns:
A dict with the keys of example and tensors as values.
- seqio.utils.trim_and_pack_dataset(dataset, feature_lengths, use_custom_ops=False)
Creates a ‘packed’ version of a dataset on-the-fly.
Modified from the tensor2tensor library.
This is meant to replace the irritation of having to create a separate “packed” version of a dataset to train efficiently on TPU.
Each example in the output dataset represents several examples in the input dataset.
For each key in the input dataset that also exists in feature_lengths, two additional keys are created:
- <key>_segment_ids: an int32 tensor identifying the parts representing the original example.
- <key>_positions: an int32 tensor identifying the position within the original example.
Features that are not in feature_lengths will be removed.
Example
Two input examples get combined to form an output example. The input examples are:
{"inputs": [8, 7, 1, 0], "targets": [4, 1, 0], "idx": 0}
{"inputs": [2, 3, 4, 1], "targets": [5, 6, 1], "idx": 1}
The output example is:
{
  "inputs": [8, 7, 1, 2, 3, 4, 1, 0, 0, 0]
  "inputs_segment_ids": [1, 1, 1, 2, 2, 2, 2, 0, 0, 0]
  "inputs_positions": [0, 1, 2, 0, 1, 2, 3, 0, 0, 0]
  "targets": [4, 1, 5, 6, 1, 0, 0, 0, 0, 0]
  "targets_segment_ids": [1, 1, 2, 2, 2, 0, 0, 0, 0, 0]
  "targets_positions": [0, 1, 0, 1, 2, 0, 0, 0, 0, 0]
}
0 represents padding in both the inputs and the outputs.
Sequences in the incoming examples are truncated to the lengths given in feature_lengths, and the sequences in the output examples all have this fixed (padded) length. Features not in feature_lengths (e.g., “idx”) are removed.
- Parameters:
dataset – a tf.data.Dataset
feature_lengths – map from feature key to final length. Other features will be discarded.
use_custom_ops – a boolean - custom ops are faster but require a custom-built binary, which is not currently possible on cloud-tpu.
- Returns:
a tf.data.Dataset
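To make the packing rule concrete, here is a greedy list-based sketch that reproduces the documented example. It is not the tf.data implementation: `trim_and_pack_sketch` and its helpers are hypothetical, and the sketch assumes examples are packed in order, that zero padding in the inputs is dropped before packing, and that a new output example starts as soon as any feature would overflow.

```python
def _new_partial(feature_lengths):
    # One growing list per output key.
    partial = {}
    for key in feature_lengths:
        partial[key] = []
        partial[f"{key}_segment_ids"] = []
        partial[f"{key}_positions"] = []
    return partial


def _finalize(partial, feature_lengths):
    # Pad every output key with zeros up to its fixed length.
    out = {}
    for key, length in feature_lengths.items():
        for name in (key, f"{key}_segment_ids", f"{key}_positions"):
            out[name] = partial[name] + [0] * (length - len(partial[name]))
    return out


def trim_and_pack_sketch(examples, feature_lengths):
    packed, partial, segment = [], _new_partial(feature_lengths), 0
    for example in examples:
        trimmed = {}
        for key, length in feature_lengths.items():
            seq = list(example[key])[:length]
            # Drop padding by keeping as many tokens as there are non-zero ids.
            trimmed[key] = seq[: sum(1 for token in seq if token != 0)]
        fits = all(
            len(partial[k]) + len(trimmed[k]) <= feature_lengths[k]
            for k in feature_lengths
        )
        if not fits:
            # Emit the current pack and start a fresh one.
            packed.append(_finalize(partial, feature_lengths))
            partial, segment = _new_partial(feature_lengths), 0
        segment += 1
        for key, seq in trimmed.items():
            partial[key] += seq
            partial[f"{key}_segment_ids"] += [segment] * len(seq)
            partial[f"{key}_positions"] += list(range(len(seq)))
    if segment > 0:
        packed.append(_finalize(partial, feature_lengths))
    return packed


examples = [
    {"inputs": [8, 7, 1, 0], "targets": [4, 1, 0], "idx": 0},
    {"inputs": [2, 3, 4, 1], "targets": [5, 6, 1], "idx": 1},
]
packed = trim_and_pack_sketch(examples, {"inputs": 10, "targets": 10})
```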
- seqio.utils.trim_and_pad_dataset(dataset, feature_lengths)
Trim and pad first dimension of features to feature_lengths.
- Parameters:
dataset – tf.data.Dataset, the dataset to trim/pad examples in.
feature_lengths – map from feature key to final length. Other features will be returned unchanged.
- Returns:
Trimmed/padded tf.data.Dataset.
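For a single example, trimming and padding can be sketched with lists as follows. `trim_and_pad_sketch` is a hypothetical per-example stand-in that handles only one-dimensional features and assumes 0 as the padding value.

```python
def trim_and_pad_sketch(example, feature_lengths):
    """Trims or zero-pads each listed feature to its target length."""
    out = dict(example)  # Features not in feature_lengths pass through.
    for key, length in feature_lengths.items():
        if key in example:
            values = list(example[key])[:length]
            out[key] = values + [0] * (length - len(values))
    return out


example = {"inputs": [8, 7, 1], "targets": [4, 1, 5, 6, 1], "idx": 0}
fixed = trim_and_pad_sketch(example, {"inputs": 5, "targets": 3})
```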
- seqio.utils.trim_dataset(dataset, sequence_length, output_features)
Trim output features to sequence length.
- seqio.utils.unflatten_dict(dct, delimiter='/')
Create a nested dictionary from one with nested keys.
This method converts a “flat” TFDict with nested keys like:
>>> {
>>>   "key1/subkey1": ...,
>>>   "key1/subkey2": ...,
>>>   "key2/subkey1": ...,
>>> }
into a nested dictionary:
>>> {
>>>   "key1": {
>>>     "subkey1": ...,
>>>     "subkey2": ...,
>>>   },
>>>   "key2": {
>>>     "subkey1": ...
>>>   },
>>> }
- Parameters:
dct – A dictionary of tensors.
delimiter – A delimiter used to separate keys from subkeys.
- Returns:
A nested version of dct.
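The inverse mapping can be sketched in plain Python by splitting each key on the delimiter. `unflatten_sketch` is a hypothetical illustration that uses integers in place of tensors and is not the library implementation.

```python
def unflatten_sketch(flat, delimiter="/"):
    """Rebuilds nested dictionaries from delimiter-joined keys."""
    nested = {}
    for key, value in flat.items():
        *parents, leaf = key.split(delimiter)
        node = nested
        for part in parents:
            # Create intermediate dictionaries on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


flat = {"key1/subkey1": 1, "key1/subkey2": 2, "key2/subkey1": 3, "key3": 4}
nested = unflatten_sketch(flat)
```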