seqio.feature_converters package
Base Class
- class seqio.feature_converters.FeatureConverter(pack=True, use_custom_packing_ops=False, apply_length_check=True, bos_id=0, passthrough_features=None)
Abstract base class for feature converters.
Subclasses of FeatureConverter convert the tf.data.Dataset instance produced by the Task API into features that are passed to the model implementation. Note that the Task API has an attribute “output_features”, which is referred to as “task features” in the context of FeatureConverter.
Typically the task features contain keys: “inputs” and “targets”. The model features are constructed based on what is consumed by the model architecture. For custom model architectures that require additional model features, one needs to subclass FeatureConverter.
This conversion is fully specified by
defining the mapping of the features in _convert_features method and
defining the relationship between sequence lengths of input and output features in get_model_feature_lengths which is a function of task_feature_lengths.
Therefore, a subclass of FeatureConverter should override _convert_features and get_model_feature_lengths methods.
The actual feature conversion is done in the __call__ method, which wraps around the _convert_features method in order to provide useful checks and ensure compatibilities. See _validate_dataset and __call__ methods for more details.
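To make the length relationship concrete, it can be written as a plain function of task_feature_lengths. The sketch below is not seqio code: both helper names are hypothetical, and the mappings are inferred from the examples on this page (encoder-decoder features keep the “inputs” and “targets” lengths, while prefix-LM features use their sum).

```python
from typing import Dict, Mapping


def enc_dec_model_feature_lengths(
    task_feature_lengths: Mapping[str, int], pack: bool = True
) -> Dict[str, int]:
    """Hypothetical helper: model feature lengths for an encoder-decoder."""
    enc = task_feature_lengths["inputs"]
    dec = task_feature_lengths["targets"]
    lengths = {
        "encoder_input_tokens": enc,
        "decoder_target_tokens": dec,
        "decoder_input_tokens": dec,
        "decoder_loss_weights": dec,
    }
    if pack:
        # Packing adds segment ids and positions on both sides.
        lengths.update({
            "encoder_segment_ids": enc,
            "encoder_positions": enc,
            "decoder_segment_ids": dec,
            "decoder_positions": dec,
        })
    return lengths


def prefix_lm_model_feature_lengths(
    task_feature_lengths: Mapping[str, int]
) -> Dict[str, int]:
    """Hypothetical helper: inputs and targets are concatenated, so lengths add."""
    total = task_feature_lengths["inputs"] + task_feature_lengths["targets"]
    return {
        "decoder_target_tokens": total,
        "decoder_input_tokens": total,
        "decoder_loss_weights": total,
        "decoder_causal_attention": total,
    }
```

A real subclass would return such a mapping from get_model_feature_lengths and build the matching tensors in _convert_features.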
Other notes:
If pack = True, each feature in the task features should be packable, i.e., 1-dimensional.
Subclasses must override TASK_FEATURES and MODEL_FEATURES. If packing is used, they must override PACKING_FEATURE_DTYPES as well. These are the packing-specific features such as “*_segment_ids”.
Pass-through features are incompatible with packing and should not be used in that case. FeatureConverter only implements the scaffolding, but the real pass-through should be implemented in each sub-class inside _convert_features and get_model_feature_lengths.
- pack
whether to pack the dataset.
- use_custom_packing_ops
whether to use custom ops for packing.
- apply_length_check
if True, it checks whether output feature lengths are less than the lengths given by sequence_length.
- bos_id
bos id for decoder inputs.
- passthrough_features
a mapping that extends the TASK_FEATURES and MODEL_FEATURES including features that will pass through without any processing.
Implementations
Feature converters for common architectures.
In short, feature converters carry out additional data processing on the tf.data.Dataset produced by the Task API. They convert the features of the input dataset into more descriptive features (e.g., “decoder_target_tokens” instead of “targets”) and also pad and/or pack them. The features of the input dataset are referred to as “task_features” because they are the output of the Task API. Those of the output dataset are referred to as “model_features” because they are fed directly to the model implementation.
We provide feature converters for the following three architectures:
encoder-decoder
decoder-only
encoder-only
Each of these feature converters inherits from the base class FeatureConverter and overrides two methods, _convert_features and get_model_feature_lengths, to define how task features are mapped to model features, including the length relationships. Other model architectures can be supported by subclassing FeatureConverter in a similar manner.
Definition: standard_features
Throughout this module, we refer to the following 10 fields as standard_features. Depending on the model architecture, a subset of them will be returned by the feature converter.
encoder_input_tokens
encoder_target_tokens
encoder_loss_weights
encoder_positions
encoder_segment_ids
decoder_input_tokens
decoder_target_tokens
decoder_loss_weights
decoder_positions
decoder_segment_ids
*_segment_ids and *_positions fields are only relevant for packed datasets.
*_segment_ids is a tf.Tensor of integers aligned with *_input_tokens. Positive integers represent the sequence membership in the packed examples; 0 represents padding. For example, encoder_segment_ids = [1, 1, 2, 2, 2, 0] means that the first two positions belong to the first sequence, the next three to the second sequence, and the last position is padding.
*_positions is a tf.Tensor of integers representing the position index in the original sequence before packing. For example, consider encoder_positions = [0, 1, 0, 1, 2, 0]. The first two tokens were the 0th and 1st tokens of the first sequence, and the next three tokens were the 0th, 1st and 2nd tokens of the second sequence before packing.
*_loss_weights is used to indicate which positions should be used for the loss calculation.
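The segment-id and position conventions above can be reproduced with a few lines of plain Python. This is a minimal sketch and not the packing implementation used by seqio: the function name pack_one_row is hypothetical, it packs everything into a single row, and it assumes padding id 0.

```python
from typing import Dict, List, Sequence


def pack_one_row(sequences: Sequence[Sequence[int]], length: int) -> Dict[str, List[int]]:
    """Concatenate sequences into one row and pad with 0 up to `length`."""
    tokens: List[int] = []
    segment_ids: List[int] = []
    positions: List[int] = []
    for segment, seq in enumerate(sequences, start=1):
        tokens.extend(seq)
        segment_ids.extend([segment] * len(seq))  # membership, starting at 1
        positions.extend(range(len(seq)))         # index within the original sequence
    if len(tokens) > length:
        raise ValueError("sequences do not fit into the requested length")
    pad = length - len(tokens)
    return {
        "tokens": tokens + [0] * pad,
        "segment_ids": segment_ids + [0] * pad,   # 0 marks padding
        "positions": positions + [0] * pad,
    }


packed = pack_one_row([[7, 8], [5, 6, 4]], length=6)
print(packed["segment_ids"])  # [1, 1, 2, 2, 2, 0]
print(packed["positions"])    # [0, 1, 0, 1, 2, 0]
```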
Underlying assumptions
The feature converters implemented in this module assume the following about the input dataset.
If EOS tokens are required, they are already appended in the input dataset.
The input dataset is not batched.
- class seqio.feature_converters.DecoderFeatureConverter(loss_on_targets_only=True, pack=True, use_custom_packing_ops=False, apply_length_check=True, bos_id=0, passthrough_features=None)
Wrapper of FeatureConverter that handles both LM and PrefixLM tasks.
The converter to choose depends on the keys of task_feature_lengths.
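As a rough illustration of this key-based choice, the sketch below returns the name of the converter that would apply. It is an assumption based on the requirements stated on this page (LMFeatureConverter expects “targets” only, PrefixLMFeatureConverter expects both “inputs” and “targets”); the function name is hypothetical and the actual dispatch logic in seqio may differ.

```python
from typing import Mapping


def choose_decoder_converter(task_feature_lengths: Mapping[str, int]) -> str:
    """Hypothetical helper: pick a converter name from the task feature keys."""
    if "inputs" in task_feature_lengths:
        # Both "inputs" and "targets" are present: prefix LM.
        return "PrefixLMFeatureConverter"
    # "targets" only: plain language model.
    return "LMFeatureConverter"


print(choose_decoder_converter({"targets": 6}))               # LMFeatureConverter
print(choose_decoder_converter({"inputs": 7, "targets": 8}))  # PrefixLMFeatureConverter
```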
- class seqio.feature_converters.EncDecFeatureConverter(pack=True, use_custom_packing_ops=False, apply_length_check=True, bos_id=0, passthrough_features=None)
Feature converter for an encoder-decoder architecture.
The input dataset has “inputs” and “targets” fields. These will be converted to a subset of standard features.
To use packing, pass pack=True to the FeatureConverter’s constructor. When packing is done, two additional fields are added for each of the “inputs” and “targets” fields.
Example for a packed dataset:
The input dataset has two examples each with “inputs” and “targets”.
ds = [{"inputs": [7, 8, 5, 1], "targets": [3, 9, 1]},
      {"inputs": [8, 4, 9, 3, 1], "targets": [4, 1]}]
task_feature_lengths = {"inputs": 10, "targets": 7}
First, the inputs are packed together, padded to length 10 and assigned to “encoder_input_tokens” field. The targets are processed similarly.
The “*_segment_ids” fields are generated by the packing operation. For an explanation of these fields, see the module docstring.
The “decoder_loss_weights” is a binary mask indicating where non-padding positions are, i.e., value of 1 indicates non-padding and 0 for padding. This class assumes that the loss is taken only on the decoder side.
converted_ds = [{
    "encoder_input_tokens": [7, 8, 5, 1, 8, 4, 9, 3, 1, 0],
    "encoder_segment_ids": [1, 1, 1, 1, 2, 2, 2, 2, 2, 0],
    "encoder_positions": [0, 1, 2, 3, 0, 1, 2, 3, 4, 0],
    "decoder_target_tokens": [3, 9, 1, 4, 1, 0, 0],
    "decoder_input_tokens": [0, 3, 9, 0, 4, 0, 0],
    "decoder_loss_weights": [1, 1, 1, 1, 1, 0, 0],
    "decoder_segment_ids": [1, 1, 1, 2, 2, 0, 0],
    "decoder_positions": [0, 1, 2, 0, 1, 0, 0],
}]
Note that two examples are packed together into one example.
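The decoder-side fields in the example above can be derived from the packed targets and their segment ids. The following is a plain-Python sketch under stated assumptions, not the seqio implementation: both function names are hypothetical, padding is assumed to be id 0, and the shift is restarted at every segment boundary so that tokens never leak across packed examples.

```python
from typing import List, Sequence


def packed_decoder_inputs(
    targets: Sequence[int], segment_ids: Sequence[int], bos_id: int = 0
) -> List[int]:
    """Shift targets right by one within each packed segment."""
    inputs: List[int] = []
    for i, seg in enumerate(segment_ids):
        if seg == 0:
            inputs.append(0)               # padding stays padding
        elif i == 0 or seg != segment_ids[i - 1]:
            inputs.append(bos_id)          # first position of a segment
        else:
            inputs.append(targets[i - 1])  # previous target token
    return inputs


def non_padding_weights(targets: Sequence[int]) -> List[int]:
    """Binary mask: 1 for non-padding target positions, 0 for padding."""
    return [int(tok != 0) for tok in targets]


targets = [3, 9, 1, 4, 1, 0, 0]
segment_ids = [1, 1, 1, 2, 2, 0, 0]
print(packed_decoder_inputs(targets, segment_ids))  # [0, 3, 9, 0, 4, 0, 0]
print(non_padding_weights(targets))                 # [1, 1, 1, 1, 1, 0, 0]
```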
- class seqio.feature_converters.EncoderFeatureConverter(mask_id, **kwargs)
Feature converter for an encoder-only architecture such as BERT.
The inputs and targets to the encoder are expected to be aligned.
Just like BERT (Devlin et al. 2019, https://arxiv.org/abs/1810.04805), a sentinel token (e.g., [CLS]) is expected to be prepended to the inputs and targets sequences. This ensures that the model can be used for a classification task. For a packed dataset, each sequence has its own sentinel token. In terms of segment_id, the classification sentinel is considered part of the sequence to which it is prepended.
Example for a packed dataset:
The input dataset has two examples, each with aligned “inputs” and “targets”.
Here assume that mask_id = 9 and cls_id = 8
ds = [{"inputs": [8, 9, 9, 3, 4, 1], "targets": [8, 7, 4, 3, 4, 1]},
      {"inputs": [8, 3, 9, 1], "targets": [8, 3, 6, 1]}]
task_feature_lengths = {"inputs": 11, "targets": 11}
converted_ds = [{
    "encoder_input_tokens": [8, 9, 9, 3, 4, 1, 8, 3, 9, 1, 0],
    "encoder_target_tokens": [8, 7, 4, 3, 4, 1, 8, 3, 6, 1, 0],
    "encoder_segment_ids": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 0],
    "encoder_positions": [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 0],
    "encoder_loss_weights": [0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
}]
Note that two examples are packed together into one example.
- mask_id
an integer indicating the mask sentinel token. This id is used to find the positions where the loss is taken.
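The encoder_loss_weights in the example above are consistent with marking every position whose input token equals mask_id. A minimal sketch of that rule in plain Python (the function name is hypothetical, and this is an illustration rather than the library code):

```python
from typing import List, Sequence


def mask_loss_weights(encoder_input_tokens: Sequence[int], mask_id: int) -> List[int]:
    """Return 1 where the input token is the mask sentinel, else 0."""
    return [int(tok == mask_id) for tok in encoder_input_tokens]


encoder_input_tokens = [8, 9, 9, 3, 4, 1, 8, 3, 9, 1, 0]
print(mask_loss_weights(encoder_input_tokens, mask_id=9))
# [0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
```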
- class seqio.feature_converters.LMFeatureConverter(pack=True, use_custom_packing_ops=False, apply_length_check=True, bos_id=0, passthrough_features=None)
Feature converter for a language model (decoder-only) architecture.
The input dataset must have “targets” field only.
One common use case is to pre-train a decoder-only model with the standard language modeling objective (i.e., predict the next token given the previous ones) on an unlabeled text corpus which only has “targets”. The pre-trained model can then be fine-tuned on a supervised task, e.g., machine translation, by concatenating “inputs” and “targets”. For this use case, pre-train with LMFeatureConverter and fine-tune with PrefixLMFeatureConverter.
Example: a packed dataset.
ds = [{"targets": [3, 9, 1]}, {"targets": [4, 1]}]
input_lengths = {"targets": 6}
converted_ds = {
    "decoder_target_tokens": [3, 9, 1, 4, 1, 0],
    "decoder_input_tokens": [0, 3, 9, 0, 4, 0],
    "decoder_loss_weights": [1, 1, 1, 1, 1, 0],
    "decoder_positions": [0, 1, 2, 0, 1, 0],
    "decoder_segment_ids": [1, 1, 1, 2, 2, 0]
}
Note that two examples are packed together into one example.
- class seqio.feature_converters.PassThroughFeatureConverter(**unused_kwargs)
This feature converter passes the dataset through without any processing.
- class seqio.feature_converters.PrePackedEncDecFeatureConverter(**kwargs)
Feature converter for encoder-decoder with pre-packed examples.
The input dataset has “inputs”, “targets”, “inputs_segment_ids”, “inputs_positions”, “targets_segment_ids”, and “targets_positions”. These will be converted to a subset of standard features.
Since this feature converter assumes the data is already pre-packed, setting ‘pack=True’ is not allowed.
- class seqio.feature_converters.PrePackedLMFeatureConverter(**unused_kwargs)
This feature converter fixes length and filters batch features.
- class seqio.feature_converters.PrePackedPrefixLMFeatureConverter(**unused_kwargs)
Prefix LM variant of PrePackedLMFeatureConverter.
The pass through feature converter fixes feature lengths and filters batch features. For Prefix LM, the inputs and targets are combined into each feature.
- class seqio.feature_converters.PrefixLMFeatureConverter(loss_on_targets_only=True, **kwargs)
Feature converter for a prefix language model architecture.
The input dataset must have both “inputs” and “targets” fields. For a language modeling objective on a dataset that has only “targets”, use LMFeatureConverter.
A decoder is a network which autoregressively produces an output sequence. It can be used for an input dataset which has a notion of “inputs” as well as “targets”, (e.g., machine translation) by concatenating them to form the new targets. See Raffel et al. (2020), https://arxiv.org/abs/1910.10683, Section 3.2.1 for a more detailed take on this topic.
In the Prefix LM architecture discussed in Raffel et al. (2020), the tokens from the “inputs” portion use fully visible self-attention, whereas those from “targets” use causal self-attention. This makes the contextual representation of the tokens from “inputs” bidirectional.
In order to provide this information, this class provides an additional feature “decoder_causal_attention” on top of the model features returned by LMFeatureConverter. “decoder_causal_attention” is a binary mask where a value of 1 represents that the corresponding input token to the decoder belongs to the “inputs” before concatenation. Note that this attention mask is optional. For a model that does not require this feature, e.g., a fully causal masking on the concatenated sequence, the attention mask can be simply ignored.
Note that “decoder_causal_attention” includes one additional position to the right. This is the position where the final token of the “inputs” (often an EOS) is read and the first “targets” token is predicted. This follows mesh_tensorflow/transformer/transformer.py.
Since “inputs” and “targets” are concatenated to form the new targets for the decoder, we might want to compute the loss only on the tokens that belonged to “targets” before concatenation. This behavior is controlled by the “loss_on_targets_only” attribute, which is passed to the constructor. By default, it is set to True. The resulting “decoder_loss_weights” therefore zeroes out the “inputs” portion as well as the padding tokens, while having 1s on the “targets” tokens.
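For a single unpacked example, the concatenation, the causal-attention mask with its extra position, and the loss weights described above can be sketched in plain Python. This is an illustrative sketch, not the seqio implementation: the function name prefix_lm_features and the explicit length argument are hypothetical, and padding id 0 is assumed.

```python
from typing import Dict, List, Sequence


def prefix_lm_features(
    inputs: Sequence[int],
    targets: Sequence[int],
    length: int,
    loss_on_targets_only: bool = True,
    bos_id: int = 0,
) -> Dict[str, List[int]]:
    """Build prefix-LM model features for one unpacked example."""
    concat = list(inputs) + list(targets)
    if len(concat) > length:
        raise ValueError("example is longer than the requested length")
    pad = length - len(concat)
    decoder_target_tokens = concat + [0] * pad
    # Shift right by one: prepend BOS and drop the last position.
    decoder_input_tokens = ([bos_id] + decoder_target_tokens)[:length]
    # 1s cover the "inputs" plus one additional position to the right.
    n_prefix = min(len(inputs) + 1, length)
    decoder_causal_attention = [1] * n_prefix + [0] * (length - n_prefix)
    if loss_on_targets_only:
        decoder_loss_weights = [0] * len(inputs) + [1] * len(targets) + [0] * pad
    else:
        decoder_loss_weights = [1] * len(concat) + [0] * pad
    return {
        "decoder_target_tokens": decoder_target_tokens,
        "decoder_input_tokens": decoder_input_tokens,
        "decoder_loss_weights": decoder_loss_weights,
        "decoder_causal_attention": decoder_causal_attention,
    }


features = prefix_lm_features([9, 4, 6, 1], [3, 9, 1], length=14)
print(features["decoder_causal_attention"])
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

With loss_on_targets_only=False, the same call would place 1s on all seven non-padding positions of decoder_loss_weights.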
Example 1: a packed dataset
```
ds = [{"inputs": [7, 8, 5, 1], "targets": [3, 9, 1]},
      {"inputs": [8, 4, 9, 3, 1], "targets": [4, 1]}]
task_feature_lengths = {"inputs": 7, "targets": 8}

converted_ds = {
    "decoder_target_tokens": [7, 8, 5, 1, 3, 9, 1, 8, 4, 9, 3, 1, 4, 1, 0],
    "decoder_input_tokens": [0, 7, 8, 5, 1, 3, 9, 0, 8, 4, 9, 3, 1, 4, 0],
    "decoder_loss_weights": [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0],
    "decoder_positions": [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0],
    "decoder_segment_ids": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0],
    "decoder_causal_attention": [1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
}
```
Example 2: unpacked dataset with extra long “inputs” task_feature_lengths
```
ds = [{"inputs": [9, 4, 6, 1], "targets": [3, 9, 1]}]
task_feature_lengths = {"inputs": 10, "targets": 4}

converted_ds = {
    "decoder_target_tokens": [9, 4, 6, 1, 3, 9, 1, 0, 0, 0, 0, 0, 0, 0],
    "decoder_input_tokens": [0, 9, 4, 6, 1, 3, 9, 1, 0, 0, 0, 0, 0, 0],
    "decoder_loss_weights": [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "decoder_causal_attention": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
```
Note that if the inputs length specified in task_feature_lengths is longer than the actual example length, the padding tokens are added after concatenation.
- loss_on_targets_only
whether to compute loss on tokens which belonged to “targets” before concatenation.
- class seqio.feature_converters.PrefixSuffixLMFeatureConverter(loss_on_targets_only=True, **kwargs)
Feature converter for an input + target + suffix language model.
When “suffixes” is an empty list, it is identical to PrefixLMFeatureConverter. When “suffixes” is not empty, it merges “targets” and “suffixes” but computes the loss only over tokens from “suffixes”.
Example: a packed dataset
```
ds = [{"inputs": [9, 4, 6], "targets": [3, 9], "suffixes": [2, 1]},
      {"inputs": [3, 2], "targets": [4], "suffixes": []}]
task_feature_lengths = {"inputs": 7, "targets": 8}

converted_ds = {
    "decoder_target_tokens": [9, 4, 6, 3, 9, 2, 1, 3, 2, 4, 0, 0, 0, 0, 0],
    "decoder_input_tokens": [0, 9, 4, 6, 3, 9, 2, 0, 3, 2, 0, 0, 0, 0, 0],
    "decoder_loss_weights": [0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    "target_suffix_weights": [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    "decoder_positions": [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 0, 0, 0, 0, 0],
    "decoder_segment_ids": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 0, 0],
    "decoder_causal_attention": [1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
}
```