nucleus.io.genomics_reader -- Classes that provide the interface for reading genomics data.
Source code: nucleus/io/genomics_reader.py
Documentation index: doc_index.md
GenomicsReader
defines the core API supported by readers, and is subclassed
directly or indirectly (via DispatchingGenomicsReader
) for all concrete
implementations.
TFRecordReader
is an implementation of the GenomicsReader
API for reading
TFRecord
files. This is usable for all data types when encoding data in
protocol buffers.
DispatchingGenomicsReader
is an abstract class defined for convenience on top
of GenomicsReader
that supports reading from either the native file format or
from TFRecord
files of the corresponding protocol buffer used to encode data
of that file type. The input format assumed is dependent upon the filename of
the input data.
Concrete implementations for individual file types (e.g. BED, SAM, VCF, etc.)
reside in type-specific modules in this package. The instantiation of readers
may have reader-specific requirements documented there. General examples of the
iterate()
and query()
functionality are shown below.
# Equivalent ways to iterate through all elements in a reader.
# 1. Using the reader itself as an iterable object.
kwargs = ... # Reader-specific keyword arguments.
with GenomicsReaderSubClass(output_path, **kwargs) as reader:
for proto in reader:
do_something(reader.header, proto)
# 2. Calling the iterate() method of the reader explicitly.
with GenomicsReaderSubClass(output_path, **kwargs) as reader:
for proto in reader.iterate():
do_something(reader.header, proto)
# Querying for all elements within a specific region of the genome.
from nucleus.protos import range_pb2
region = range_pb2.Range(reference_name='chr1', start=10, end=20)
with GenomicsReaderSubClass(output_path, **kwargs) as reader:
for proto in reader.query(region):
do_something(reader.header, proto)
Classes overview
Name | Description |
---|---|
DispatchingGenomicsReader |
A GenomicsReader that dispatches based on the file extension. |
GenomicsReader |
Abstract base class for reading genomics data. |
TFRecordReader |
A GenomicsReader that reads protocol buffers from a TFRecord file. |
Classes
DispatchingGenomicsReader
A GenomicsReader that dispatches based on the file extension.
If '.tfrecord' is present in the filename, a TFRecordReader is used.
Otherwise, a native reader is.
Subclasses of DispatchingGenomicsReader must define the following methods:
* _native_reader()
* _record_proto()
Methods:
__init__(self, input_path, **kwargs)
iterate(self)
query(self, region)
GenomicsReader
Abstract base class for reading genomics data.
In addition to the abstractmethods defined below, subclasses should
also set a `header` member variable in their objects.
Methods:
__init__(self)
Initializer.
iterate(self)
Returns an iterator for going through all the file's records.
query(self, region)
Returns an iterator for going through the records in the region.
Args:
region: A nucleus.genomics.v1.Range.
Returns:
An iterator containing all and only records within the specified region.
TFRecordReader
A GenomicsReader that reads protocol buffers from a TFRecord file.
Example usage:
reader = TFRecordReader('/tmp/my_file.tfrecords.gz',
proto=tensorflow.Example)
for example in reader:
process(example)
Note that TFRecord files do not have headers, and do not need
to be wrapped in a "with" block.
Methods:
__init__(self, input_path, proto, compression_type=None)
Initializes the TFRecordReader.
Args:
input_path: The filename of the file to read.
proto: The protocol buffer type the TFRecord file is expected to
contain. For example, variants_pb2.Variant or reads_pb2.Read.
compression_type: Either 'ZLIB', 'GZIP', '' (uncompressed), or
None. If None, __init__ will guess the compression type based on
the input_path's suffix.
Raises:
IOError: if there was any problem opening input_path for reading.
c_reader(self)
Returns the underlying C++ reader.
iterate(self)
Returns an iterator for going through all the file's records.
query(self, region)
Returns an iterator for going through the records in the region.
NOTE: This function is not currently implemented by TFRecordReader as the
TFRecord format does not provide a general mechanism for fast random access
to elements in genome order.