seqio.beam_utils package
SeqIO Beam utilities.
- class seqio.beam_utils.GetInfo(num_shards, exclude_provenance=True)
Computes info for dataset examples.
Expects a single PCollection of examples. Returns a dictionary with the information needed to read the data (number of shards, feature shapes and types).
- class seqio.beam_utils.GetStats(output_features, task_ids=None, enable_char_counts=False)
Computes statistics for dataset examples.
The expand method expects a PCollection of examples, where each example is a dictionary that maps string identifiers (e.g. “inputs” and “targets”) to NumPy arrays.
Returns a dictionary with statistics (number of examples, number of tokens) prefixed by the identifiers.
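To make the shape of such a statistics dictionary concrete, here is a minimal sketch in plain Python, without Beam. It is not the library's implementation: the helper name `count_stats` and the key names (`examples`, `inputs_tokens`, `targets_tokens`) are assumptions chosen for illustration, and token sequences are modelled as plain lists rather than NumPy arrays.

```python
from collections import Counter


def count_stats(examples, feature_names):
    """Accumulate example and token counts, with token counts keyed by feature name.

    Hypothetical helper: the key names are illustrative and may not match the
    keys that GetStats actually emits.
    """
    stats = Counter()
    for example in examples:
        stats["examples"] += 1
        for name in feature_names:
            # One token per element of the feature's sequence.
            stats[f"{name}_tokens"] += len(example[name])
    return dict(stats)


# Two toy examples, each mapping a string identifier to a token sequence.
examples = [
    {"inputs": [5, 9, 2, 1], "targets": [7, 1]},
    {"inputs": [3, 1], "targets": [8, 4, 1]},
]
stats = count_stats(examples, ["inputs", "targets"])
print(stats)  # {'examples': 2, 'inputs_tokens': 6, 'targets_tokens': 5}
```

In a Beam pipeline the same counts would be produced by combining over the PCollection instead of looping over a list.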
- class seqio.beam_utils.PreprocessTask(task, split, *, preprocessors_seed=None, setup_fn=<function PreprocessTask.<lambda>>, modules_to_import=(), add_provenance=False, tfds_data_dir=None)
Preprocesses a Task.
Returns a PCollection of example dicts containing Tensors.
- class seqio.beam_utils.WriteExampleArrayRecord(output_path, num_shards=None, preserve_random_access=False)
Writes examples (dicts) to an ArrayRecord of tf.Example protos.
- class seqio.beam_utils.WriteExampleTfRecord(output_path, num_shards=None)
Writes examples (dicts) to a TFRecord of tf.Example protos.
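The transforms above might be combined along the following lines. This is a sketch under stated assumptions, not a definitive recipe: the function name `build_cache_pipeline` and its parameters are hypothetical, `task` is assumed to be a `seqio.Task` that exposes `output_features`, and the shard count is assumed to be supplied by the caller. Only the constructor signatures listed on this page are used.

```python
def build_cache_pipeline(pipeline, task, split, output_prefix, num_shards):
    """Wire the beam_utils transforms into an existing Beam pipeline.

    Hypothetical helper. Assumes `pipeline` is an apache_beam.Pipeline,
    `task` is a seqio.Task, and `output_prefix` is a writable path prefix.
    """
    # Imported lazily so this module can be loaded without seqio installed.
    from seqio import beam_utils

    # Preprocess one split of the task into a PCollection of example dicts.
    examples = pipeline | "preprocess" >> beam_utils.PreprocessTask(task, split)

    # Write the examples as tf.Example protos in TFRecord shards.
    written = examples | "write" >> beam_utils.WriteExampleTfRecord(
        output_prefix, num_shards=num_shards
    )

    # Derive the metadata needed to read the data back, plus summary statistics.
    info = examples | "info" >> beam_utils.GetInfo(num_shards)
    stats = examples | "stats" >> beam_utils.GetStats(task.output_features)
    return written, info, stats
```

The function would typically be called inside a `with apache_beam.Pipeline() as pipeline:` block so that the pipeline runs when the block exits. What to do with the `info` and `stats` outputs (for example, writing them next to the records) is left to the caller.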