seqio.beam_utils package

SeqIO Beam utilities.

class seqio.beam_utils.GetInfo(num_shards, exclude_provenance=True)

Computes info for dataset examples.

Expects a single PCollection of examples. Returns a dictionary with the information needed to read the data: the number of shards and the feature shapes and types.
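To make the shape of such an info dictionary concrete, here is a minimal plain-Python sketch that derives a shard count plus per-feature shapes and types from a list of example dicts. It does not use Beam and does not call `GetInfo`. The helper names (`shape_of`, `leaf_type`, `summarize_features`), the output keys, and the use of nested lists in place of arrays are all assumptions made for illustration. The real output layout may differ.

```python
def shape_of(value):
    """Return the shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> [2, 2]."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return shape


def leaf_type(value):
    """Return the type name of the first scalar found in a nested list."""
    while isinstance(value, list):
        if not value:
            return "unknown"
        value = value[0]
    return type(value).__name__


def summarize_features(examples, num_shards):
    """Hypothetical helper: build an info dict from example dicts.

    Assumes every example stores a given feature with the same rank.
    A dimension whose size differs between examples is recorded as None.
    """
    features = {}
    for example in examples:
        for name, value in example.items():
            shape = shape_of(value)
            if name not in features:
                features[name] = {"shape": shape, "dtype": leaf_type(value)}
                continue
            known = features[name]["shape"]
            # Mark dimensions that vary across examples as unknown (None).
            features[name]["shape"] = [
                a if a == b else None for a, b in zip(known, shape)
            ]
    return {"num_shards": num_shards, "features": features}


examples = [
    {"inputs": [3, 7, 9], "targets": [4, 1]},
    {"inputs": [5, 2], "targets": [8, 1]},
]
info = summarize_features(examples, num_shards=2)
```

In this sketch the `inputs` length varies between examples, so its single dimension is recorded as `None`, while `targets` keeps a fixed length of 2.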

class seqio.beam_utils.GetStats(output_features, task_ids=None, enable_char_counts=False)

Computes statistics for dataset examples.

The expand method expects a PCollection of examples, where each example is a dictionary mapping string identifiers (e.g. “inputs” and “targets”) to NumPy arrays.

Returns a dictionary with statistics (number of examples, number of tokens) prefixed by the identifiers.
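The aggregation described above can be sketched without Beam as a plain-Python reduction over example dicts. This is only an illustration of the idea, not the `GetStats` implementation. The helper name `compute_stats`, the exact key names (`examples`, `<identifier>_tokens`), and the choice to count every token in a sequence are assumptions.

```python
from collections import Counter


def compute_stats(examples, feature_names):
    """Hypothetical helper: count examples and tokens per feature.

    Each example maps a string identifier to a sequence of token IDs.
    """
    stats = Counter()
    for example in examples:
        stats["examples"] += 1
        for name in feature_names:
            # Prefix the statistic with the feature identifier.
            stats[f"{name}_tokens"] += len(example[name])
    return dict(stats)


examples = [
    {"inputs": [3, 7, 9], "targets": [4, 1]},
    {"inputs": [5, 2], "targets": [8, 6, 1]},
]
stats = compute_stats(examples, ["inputs", "targets"])
```

In a Beam pipeline the same counts would be computed per element and then combined across workers. The sequential loop here only shows what the final dictionary could contain.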

class seqio.beam_utils.PreprocessTask(task, split, *, preprocessors_seed=None, setup_fn=<function PreprocessTask.<lambda>>, modules_to_import=(), add_provenance=False, tfds_data_dir=None)

Preprocesses a Task.

Returns a PCollection of example dicts containing Tensors.

class seqio.beam_utils.WriteExampleArrayRecord(output_path, num_shards=None, preserve_random_access=False)

Writes examples (dicts) to an ArrayRecord of tf.Example protos.

class seqio.beam_utils.WriteExampleTfRecord(output_path, num_shards=None)

Writes examples (dicts) to a TFRecord of tf.Example protos.

class seqio.beam_utils.WriteJson(output_path, prettify=True)

Writes data structures to a file as JSON(L).
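As a rough sketch of what a `prettify` option might control, the following standard-library example serializes each element either as indented JSON or as a compact single line, the latter being the form JSON Lines files use. The helper name `to_json_strings` and the mapping of `prettify` to `indent=2` with sorted keys are assumptions for illustration. The class's actual serialization settings are not stated here.

```python
import json


def to_json_strings(elements, prettify=True):
    """Hypothetical helper: serialize each element as one JSON string."""
    if prettify:
        # Human-readable form: one key per line, indented.
        return [json.dumps(el, sort_keys=True, indent=2) for el in elements]
    # Compact form: one element per line, suitable for JSON Lines.
    return [json.dumps(el, sort_keys=True) for el in elements]


elements = [{"examples": 2, "inputs_tokens": 5}]
pretty = to_json_strings(elements, prettify=True)
compact = to_json_strings(elements, prettify=False)
```

Both forms parse back to the same dictionary with `json.loads`. Only the whitespace differs.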

class seqio.beam_utils.WriteToArrayRecord(file_path_prefix, file_name_suffix='', num_shards=0, shard_name_template=None, coder=ToBytesCoder, compression_type='auto', preserve_random_access=False)

PTransform for a disk-based write to ArrayRecord.