nucleus.io.fasta -- Classes for reading FASTA files.
Source code: nucleus/io/fasta.py
Documentation index: doc_index.md
The FASTA format is described at https://en.wikipedia.org/wiki/FASTA_format
API for reading:

    from nucleus.io import fasta
    from nucleus.protos import range_pb2

    with fasta.IndexedFastaReader(input_path) as reader:
        region = range_pb2.Range(reference_name='chrM', start=1, end=6)
        basepair_string = reader.query(region)
        print(basepair_string)

If `input_path` ends with '.gz', it is assumed to be compressed. All FASTA
files are assumed to be indexed, with the index file located at
`input_path + '.fai'`.
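
The two path conventions above can be written out as a small helper. This is a minimal sketch of the stated rules only; `describe_fasta_path` is a hypothetical name used for illustration and is not part of Nucleus.

```python
def describe_fasta_path(input_path):
    """Returns (is_compressed, index_path) for a FASTA path.

    Hypothetical helper for illustration; not part of the nucleus API.
    """
    # A '.gz' suffix is taken to mean the file is compressed.
    is_compressed = input_path.endswith('.gz')
    # The index is expected next to the FASTA file, with '.fai' appended.
    index_path = input_path + '.fai'
    return is_compressed, index_path


print(describe_fasta_path('ref.fasta.gz'))
```

Note that the '.fai' suffix is appended to the full path, so a compressed file keeps its '.gz' in the index name.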
Classes overview
Name | Description |
---|---|
FastaReader | Class for reading (name, bases) tuples from FASTA files. |
InMemoryFastaReader | An IndexedFastaReader getting its bases from an in-memory data structure. |
IndexedFastaReader | Class for reading from FASTA files containing a reference genome. |
UnindexedFastaReader | Class for reading from unindexed FASTA files. |
Classes
FastaReader
Class for reading (name, bases) tuples from FASTA files.
InMemoryFastaReader
An `IndexedFastaReader` getting its bases from an in-memory data structure.
An `InMemoryFastaReader` provides the same API as `IndexedFastaReader`, but
instead of reading from an on-disk FASTA file it fetches bases from an
in-memory cache of (chromosome, start, bases) tuples.
In particular, `query(Range(chrom, start, end))` looks up the tuple whose
chromosome equals `chrom` and returns bases from that tuple's bases string,
whose first base sits at position start on the chromosome. If start > 0, the
bases string is assumed to begin at that position rather than at the beginning
of the chromosome. For example, the record ('1', 10, 'ACGT') implies that
`query(ranges.make_range('1', 11, 12))` returns the base 'C', because the 'A'
base is at position 10. This makes it straightforward to cache a small region
of a chromosome without storing the entire chromosome sequence in memory,
which can be very large.
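
The offset arithmetic described above can be sketched in plain Python. This models only the lookup semantics and is not the Nucleus implementation; `query_cached` is a hypothetical name, and raising `ValueError` for an uncached region is an assumption of this sketch.

```python
def query_cached(records, chrom, start, end):
    """Returns the bases in [start, end) on chrom.

    records is a list of (chromosome, start, bases) tuples, as described
    above. Hypothetical helper; not part of the nucleus API.
    """
    for record_chrom, record_start, bases in records:
        record_end = record_start + len(bases)
        if record_chrom == chrom and record_start <= start <= end <= record_end:
            # Translate chromosome coordinates into offsets in the cached string.
            return bases[start - record_start:end - record_start]
    # Assumption: an uncached region is reported as a ValueError.
    raise ValueError(f'Region {chrom}:{start}-{end} is not cached')


records = [('1', 10, 'ACGT')]
print(query_cached(records, '1', 11, 12))  # prints C, since 'A' is at position 10
```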
Methods:
__init__(self, chromosomes)
Initializes an InMemoryFastaReader using data from chromosomes.
Args:
chromosomes: list[tuple]. The chromosomes we are caching in memory as a
list of tuples. Each tuple must be exactly three elements in length,
containing (chromosome name [str], start [int], bases [str]).
Raises:
ValueError: If any of the chromosomes tuples are invalid.
c_reader(self)
Returns the underlying C++ reader.
contig(self, contig_name)
Returns a ContigInfo proto for contig_name.
is_valid(self, region)
Returns whether the region is contained in this FASTA file.
iterate(self)
Returns an iterable of (name, bases) tuples contained in this file.
query(self, region)
Returns the base pairs (as a string) in the given region.
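
The shape required of the `chromosomes` argument can be illustrated with a small check. This is a sketch of the documented contract (three-element tuples of str, int, str, with `ValueError` on invalid input), not the validation Nucleus actually performs; `validate_chromosomes` is a hypothetical name.

```python
def validate_chromosomes(chromosomes):
    """Raises ValueError unless every entry is (name [str], start [int], bases [str]).

    Hypothetical helper for illustration; not part of the nucleus API.
    """
    for entry in chromosomes:
        # Each tuple must be exactly three elements in length.
        if not isinstance(entry, tuple) or len(entry) != 3:
            raise ValueError(f'Expected a 3-tuple, got: {entry!r}')
        name, start, bases = entry
        if not (isinstance(name, str) and isinstance(start, int)
                and isinstance(bases, str)):
            raise ValueError(f'Expected (str, int, str), got: {entry!r}')


validate_chromosomes([('1', 10, 'ACGT'), ('2', 0, 'GGCC')])  # passes silently
```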
IndexedFastaReader
Class for reading from FASTA files containing a reference genome.
Methods:
__init__(self, input_path, keep_true_case=False, cache_size=None)
Initializes an IndexedFastaReader.
Args:
input_path: string. A path to a resource containing FASTA records.
keep_true_case: bool. If False, casts all bases to uppercase before
returning them.
cache_size: integer. Number of bases to cache from previous queries.
Defaults to 64K. The cache can be disabled using cache_size=0.
c_reader(self)
Returns the underlying C++ reader.
contig(self, contig_name)
Returns a ContigInfo proto for contig_name.
is_valid(self, region)
Returns whether the region is contained in this FASTA file.
iterate(self)
Returns an iterable of (name, bases) tuples contained in this file.
query(self, region)
Returns the base pairs (as a string) in the given region.
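
The effect of `keep_true_case` on returned bases can be modeled as below. This is a sketch of the documented behavior only; `apply_case` is a hypothetical name, and how the real reader applies the conversion internally is not shown here.

```python
def apply_case(bases, keep_true_case=False):
    """Returns bases unchanged if keep_true_case, else converted to uppercase.

    Hypothetical helper for illustration; not part of the nucleus API.
    """
    return bases if keep_true_case else bases.upper()


print(apply_case('acgTN'))                       # prints ACGTN
print(apply_case('acgTN', keep_true_case=True))  # prints acgTN
```

Setting `keep_true_case=True` is useful when the case of the bases in the file carries information that should be preserved.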
UnindexedFastaReader
Class for reading from unindexed FASTA files.
Methods:
__init__(self, input_path)
Initializes an UnindexedFastaReader.
Args:
input_path: string. A path to a resource containing FASTA records.
c_reader(self)
Returns the underlying C++ reader.
contig(self, contig_name)
Returns a ContigInfo proto for contig_name.
is_valid(self, region)
Returns whether the region is contained in this FASTA file.
iterate(self)
Returns an iterable of (name, bases) tuples contained in this file.
query(self, region)
Returns the base pairs (as a string) in the given region.
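
The (name, bases) tuples produced by `iterate` can be pictured with a plain-Python FASTA parser. This is an illustrative sketch, not the C++ reader that Nucleus uses; `iter_fasta_records` is a hypothetical name, and taking the first whitespace-delimited token of the header line as the record name is an assumption.

```python
import io


def iter_fasta_records(handle):
    """Yields (name, bases) tuples from a text handle containing FASTA records.

    Hypothetical helper for illustration; not part of the nucleus API.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(chunks)
            # Assumption: the name is the first token after '>'.
            header = line[1:].split()
            name, chunks = (header[0] if header else ''), []
        elif name is not None:
            # Sequence lines of one record are concatenated.
            chunks.append(line)
    if name is not None:
        yield name, ''.join(chunks)


fasta_text = '>chr1 example contig\nACGT\nAC\n>chr2\nGG\n'
for name, bases in iter_fasta_records(io.StringIO(fasta_text)):
    print(name, bases)
```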