seqio.helpers package

A collection of helper methods.

class seqio.helpers.TruncatedDatasetProvider(child, split_sizes, shuffle_buffer_size=None)

Wraps a dataset provider, truncating its data using ds.take(N).
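The wrapping pattern described here can be sketched in plain Python. This is an illustration only, not the seqio implementation: ListProvider and TruncatedProviderSketch are hypothetical stand-ins, itertools.islice plays the role of ds.take(N), and passing splits without a configured size through untouched is an assumption.

```python
import itertools
from typing import Dict, Iterator, List


class ListProvider:
    """Hypothetical stand-in for a dataset provider; not a seqio class."""

    def __init__(self, data: Dict[str, List[int]]):
        self._data = data

    @property
    def splits(self) -> List[str]:
        return list(self._data)

    def get_dataset(self, split: str) -> Iterator[int]:
        return iter(self._data[split])


class TruncatedProviderSketch:
    """Delegates to `child` and keeps at most split_sizes[split] examples."""

    def __init__(self, child: ListProvider, split_sizes: Dict[str, int]):
        self._child = child
        self._split_sizes = split_sizes

    @property
    def splits(self) -> List[str]:
        # Metadata is forwarded to the wrapped provider unchanged.
        return self._child.splits

    def get_dataset(self, split: str) -> Iterator[int]:
        ds = self._child.get_dataset(split)
        # Assumption: splits without a configured size are not truncated.
        if split in self._split_sizes:
            # islice stands in for ds.take(N) on a plain Python iterator.
            ds = itertools.islice(ds, self._split_sizes[split])
        return ds


child = ListProvider({"train": list(range(10)), "validation": list(range(3))})
truncated = TruncatedProviderSketch(child, {"train": 4})
train_examples = list(truncated.get_dataset("train"))
validation_examples = list(truncated.get_dataset("validation"))
```

Only get_dataset changes behaviour in this sketch; everything else is delegated to the child, which matches the "See base class for documentation" members listed for the real class.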

property caching_permitted

See base class for documentation.

get_dataset(split=Split('train'), shuffle=True, seed=None, shard_info=None, *, sequence_length=None, use_cached=False, num_epochs=1)

See base class for documentation.

list_shards(split)

See base class for documentation.

num_input_examples(split)

See base class for documentation.

property output_features

See base class for documentation.

property splits

See base class for documentation.

property supports_arbitrary_sharding

See base class for documentation.

seqio.helpers.mixture_or_task_with_new_vocab(mixture_or_task, new_mixture_or_task_name, *, new_vocab=None, new_output_features=None, add_to_seqio_registry=True, add_cache_placeholder=False, validate_features=True)

Creates a new Task/Mixture from a given Task/Mixture with a new vocabulary.

Parameters:
  • mixture_or_task – The original Task or Mixture, or the name of a registered Task or Mixture.

  • new_mixture_or_task_name – The name of the new Task or Mixture. For Mixtures, this is also used as a prefix for subtasks, e.g. “subtask_1” is registered with the new vocabulary as “new_mixture_or_task_name.subtask_1”.

  • new_vocab – The new vocabulary to be used. This is used for all features. To configure different vocabularies for different features, pass the new_output_features arg instead. Note that exactly one of new_vocab or new_output_features must be provided.

  • new_output_features – A dict of feature name to seqio.Feature to be used for the new Mixture and its subtasks. This dict must (1) have the same keys as the original mix_or_task.output_features and (2) for each key, differ from the original seqio.Feature only in the vocabulary and add_eos fields. It can be created from the original mix_or_task.output_features as follows:

    ```
    import dataclasses

    # mix_or_task is the original Task or Mixture object.
    new_output_features = {}
    new_output_features["f1"] = dataclasses.replace(
        mix_or_task.output_features["f1"], vocabulary=f1_vocab, add_eos=True)
    new_output_features["f2"] = dataclasses.replace(
        mix_or_task.output_features["f2"], vocabulary=f2_vocab)
    ```

  • add_to_seqio_registry – If True, adds the new Task/Mixture and all sub-Tasks/Mixtures to the SeqIO Registry.

  • add_cache_placeholder – If True, adds CacheDatasetPlaceholder in new tasks if their old tasks do not have it.

  • validate_features – Whether to validate that the new feature set is compatible with the source task’s output features.

Returns:

The new Task or Mixture object.
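The two constraints on new_output_features (same keys, and only vocabulary and add_eos may differ) can be pictured as a small compatibility check. The sketch below is a hedged illustration and not how validate_features is implemented: FeatureSketch is a hypothetical stand-in for seqio.Feature with a reduced field set, strings stand in for vocabulary objects, and is_compatible is an invented helper name.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class FeatureSketch:
    """Hypothetical stand-in for seqio.Feature with a reduced field set."""
    vocabulary: str  # A string stands in for a real vocabulary object.
    add_eos: bool = True
    required: bool = True
    dtype: str = "int32"


# Fields that the parameter description allows to differ.
MUTABLE_FIELDS = {"vocabulary", "add_eos"}


def is_compatible(old_features: dict, new_features: dict) -> bool:
    """Returns True if new_features satisfies both stated constraints."""
    # (1) Both dicts must have exactly the same feature names.
    if old_features.keys() != new_features.keys():
        return False
    # (2) Every field other than vocabulary and add_eos must be unchanged.
    for name, old in old_features.items():
        new = new_features[name]
        for field in dataclasses.fields(old):
            if field.name in MUTABLE_FIELDS:
                continue
            if getattr(old, field.name) != getattr(new, field.name):
                return False
    return True


old = {"inputs": FeatureSketch("vocab_a"), "targets": FeatureSketch("vocab_a")}
ok = {
    name: dataclasses.replace(feature, vocabulary="vocab_b", add_eos=False)
    for name, feature in old.items()
}
bad_dtype = {
    name: dataclasses.replace(feature, dtype="float32")
    for name, feature in old.items()
}
missing_key = {"inputs": old["inputs"]}
```

In this sketch, `ok` passes the check, while `bad_dtype` (a changed field outside the allowed pair) and `missing_key` (a different key set) do not.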

seqio.helpers.mixture_or_task_with_truncated_data(mixture_or_task, new_mixture_or_task_name, *, split_sizes, add_to_seqio_registry=True)

Creates a new Task/Mixture from a given Task/Mixture with less data.

This can be used to create smaller subsets of datasets, e.g. for quick evaluation or few-shot fine-tuning.

Parameters:
  • mixture_or_task – The original Task or Mixture, or the name of a registered Task or Mixture.

  • new_mixture_or_task_name – The name of the new Task or Mixture. For Mixtures, this is also used as a prefix for subtasks, e.g. “subtask_1” is registered with truncated data as “new_mixture_or_task_name.subtask_1”.

  • split_sizes – Dict-like mapping of split name to the maximum number of examples to keep in that split, e.g. split_sizes={"train": 1000, "validation": 500, "test": 500}. For Mixtures, this is the maximum number of examples for each task.

  • add_to_seqio_registry – If True, adds the new Task/Mixture to the SeqIO Registry. For Mixtures, sub-Tasks/Mixtures are always registered so that the new Mixture can refer to these.

Returns:

The new Task or Mixture object.
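Because split_sizes applies to each task of a Mixture separately, the total size of a truncated Mixture split depends on how many tasks it contains. The arithmetic can be sketched as follows; kept_examples is a hypothetical helper, the task sizes are made-up illustration values, and leaving splits that are absent from split_sizes untouched is an assumption.

```python
def kept_examples(task_split_sizes: dict, split_sizes: dict) -> dict:
    """Caps every task's split at the configured maximum."""
    kept = {}
    for task, sizes in task_split_sizes.items():
        kept[task] = {
            # Assumption: a split absent from split_sizes is left as-is.
            split: min(n, split_sizes.get(split, n))
            for split, n in sizes.items()
        }
    return kept


# Made-up per-task example counts, for illustration only.
original = {
    "task_a": {"train": 5000, "validation": 800},
    "task_b": {"train": 300, "validation": 100},
}
kept = kept_examples(original, {"train": 1000, "validation": 500})

# The cap is per task, so the mixture-level total is the sum over tasks.
train_total = sum(sizes["train"] for sizes in kept.values())
```

Here task_a is cut down to the cap while task_b, which is already smaller than the cap, keeps all of its examples.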

seqio.helpers.mixture_with_missing_task_splits_removed(mixture_name, split, new_mixture_name, *, add_to_seqio_registry=True)

Creates a new mixture removing all subtasks missing the given split.

In Mixture.get_dataset(…), a subtask that is missing the requested split is ignored. This is helpful for defining super-Mixtures that span multiple splits, but it means the actual mixing rates can differ from the intended ones. This helper provides a way to split such super-Mixtures into per-split Mixtures: it takes a Mixture and a split, and creates a new Mixture without the subtasks that are missing that split. Mixing rates for the remaining subtasks are unchanged.

Parameters:
  • mixture_name – The name of the original Mixture.

  • split – The split for which to check valid sub-tasks.

  • new_mixture_name – The name of the new Mixture.

  • add_to_seqio_registry – If True, adds the new Mixture to the SeqIO Registry.

Returns:

The new Mixture object.
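The filtering described above, and the reason the effective proportions shift when a subtask is dropped, can be sketched with plain dicts. This is an illustration under stated assumptions, not the seqio implementation: remove_tasks_missing_split and effective_proportions are hypothetical helpers, and the task names, rates, and splits are made up.

```python
def remove_tasks_missing_split(task_rates: dict, task_splits: dict, split: str) -> dict:
    """Keeps only tasks that define `split`; their rates are left unchanged."""
    return {
        task: rate
        for task, rate in task_rates.items()
        if split in task_splits[task]
    }


def effective_proportions(task_rates: dict) -> dict:
    """Normalizes rates so that they sum to 1."""
    total = sum(task_rates.values())
    return {task: rate / total for task, rate in task_rates.items()}


# Made-up mixture: task "b" has no validation split.
rates = {"a": 2.0, "b": 1.0, "c": 1.0}
splits = {
    "a": {"train", "validation"},
    "b": {"train"},
    "c": {"train", "validation"},
}

validation_rates = remove_tasks_missing_split(rates, splits, "validation")
intended = effective_proportions(rates)
actual = effective_proportions(validation_rates)
```

With these values, task "a" was intended to make up half of the mixture but accounts for two thirds of the validation data once "b" is removed. The per-split Mixture makes that shift explicit rather than leaving it implicit in get_dataset.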