class FASTAIndex extends java.lang.Object
A very simplistic FASTA index that allows lookup of sequence names by their sequence content, using a fixed-length prefix seed. It only supports looking up sequences by their prefixes, with an optional offset from the query start of up to a specified number of bases (default=5). Both the original sequence and its reverse complement are indexed, so a single query can identify a sequence even when you are not sure of the strand / orientation of the query sequence.
NOTE: this class is intended for indexing large numbers of SHORT sequences. It is not suitable for indexing, for example, a reference sequence for an organism! (You would only be able to look up each chromosome by a short prefix, which is not terribly useful.)
NOTE2: this class is not for working at a low level with pre-indexed FASTA files (e.g. .fai format). For working with those, use the gngs.FASTA class. This class takes a gngs.FASTA object and adds to it the ability to look up sequences by their content.
Example:
index = new FASTAIndex(new FASTA("tests/test.fasta"), 0..20)
assert index.querySequenceName("AGTCCCTATTACAAA") == "amplicon_1"
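To make the lookup scheme described above more concrete, here is a minimal plain-Java sketch of a prefix-seed index. It is not the gngs implementation: the class `SeedIndexSketch`, its methods, and the choice to keep the first name seen for a duplicate seed are all assumptions made for illustration. It only shows the general idea of storing fixed-length seeds at several offsets for both strands and then querying by the first seed of a read.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical toy seed index; illustrates the idea only, not the gngs code. */
public class SeedIndexSketch {
    private final int seedSize;   // length of each indexed seed (bp)
    private final int maxOffset;  // largest offset from sequence start to index
    private final Map<String, String> seeds = new HashMap<>(); // seed => sequence name

    public SeedIndexSketch(int seedSize, int maxOffset) {
        this.seedSize = seedSize;
        this.maxOffset = maxOffset;
    }

    /** Reverse complement of a DNA string; non-ACGT bases become N. */
    public static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            switch (seq.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                default:  sb.append('N');
            }
        }
        return sb.toString();
    }

    /** Index a named sequence on both strands. */
    public void add(String name, String sequence) {
        addStrand(name, sequence);
        addStrand(name, reverseComplement(sequence));
    }

    // Store one seed per offset in 0..maxOffset, as far as the sequence allows.
    private void addStrand(String name, String seq) {
        for (int offset = 0; offset <= maxOffset && offset + seedSize <= seq.length(); offset++) {
            // Assumption: on a seed collision the first name indexed wins.
            seeds.putIfAbsent(seq.substring(offset, offset + seedSize), name);
        }
    }

    /** Return the name whose indexed seed matches the start of the query, or null. */
    public String query(String read) {
        if (read.length() < seedSize) {
            return null;
        }
        return seeds.get(read.substring(0, seedSize));
    }
}
```

In this sketch each indexed sequence contributes up to `(maxOffset + 1) * 2` map entries, which gives a feel for why widening the offset range increases memory use.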
Type | Name and description |
---|---|
int | maxSize: Maximum number of sequences to index (0 means unlimited) |
groovy.lang.IntRange | offsetRange: Range of offsets from beginning of sequences to index (memory expensive to increase this a lot) |
int | seedSize: Size of seed to use (bp) |
java.util.Map<java.lang.String, java.lang.String> | sequenceNames: Index of amplicon names, maps name => full sequence |
java.util.Map<java.lang.String, java.lang.String> | sequences: Index of amplicon sequences, maps subsequence to amplicon name(s) |
Constructor and description |
---|
FASTAIndex(gngs.FASTA fasta, Regions regions): Create an index from the given fasta, where each fasta sequence corresponds to a single amplicon |
FASTAIndex(): For unit tests only |
FASTAIndex(gngs.FASTA fasta, groovy.lang.IntRange offsetRange, int maxSize, int seedSize, BED bed): Create an index from the given fasta, where each fasta sequence corresponds to a single amplicon |
Type Params | Return Type | Name and description |
---|---|---|
 | void | index(gngs.FASTA fasta, Regions regions): Index the given fasta, masked using the given BED file |
 | java.lang.String | querySequenceName(java.lang.String sequence) |
Methods inherited from class | Name |
---|---|
class java.lang.Object | java.lang.Object#wait(long), java.lang.Object#wait(long, int), java.lang.Object#wait(), java.lang.Object#equals(java.lang.Object), java.lang.Object#toString(), java.lang.Object#hashCode(), java.lang.Object#getClass(), java.lang.Object#notify(), java.lang.Object#notifyAll() |