@groovy.transform.CompileStatic class VCFIndex extends java.lang.Object
Support for querying indexed VCF files via random access.
In general, if you only need to access a small region of a VCF (for example, to look up variants in a specific region), this class is your best bet. If, on the other hand, you have a smaller VCF file (say, fewer than 100,000 lines) and you expect to query a significant number of the variants in it, use the gngs.VCF class instead and load everything into memory.
The VCFIndex class is highly optimised for random access to VCF contents. It uses memory-mapped files to buffer the VCF in memory, which minimizes the actual hits to the file system.
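As a rough illustration of the memory-mapping idea (not the actual VCFIndex implementation), the sketch below maps a file into a list of read-only chunks and reads a byte at an arbitrary offset. The helpers `mapFile` and `readByteAt`, the temporary file, and the chunk size of 4 bytes are all hypothetical and chosen only to keep the example small.

```groovy
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

// Hypothetical helper: map a file into read-only chunks of at most chunkSize bytes.
List<MappedByteBuffer> mapFile(File file, long chunkSize) {
    List<MappedByteBuffer> buffers = []
    RandomAccessFile raf = new RandomAccessFile(file, "r")
    try {
        FileChannel channel = raf.channel
        long size = channel.size()
        for (long offset = 0; offset < size; offset += chunkSize) {
            long length = Math.min(chunkSize, size - offset)
            buffers << channel.map(FileChannel.MapMode.READ_ONLY, offset, length)
        }
    } finally {
        raf.close()   // a mapping remains valid after its channel is closed
    }
    return buffers
}

// Hypothetical helper: read one byte at an absolute file offset.
byte readByteAt(List<MappedByteBuffer> buffers, long chunkSize, long offset) {
    MappedByteBuffer buffer = buffers[(int) Math.floorDiv(offset, chunkSize)]
    return buffer.get((int) Math.floorMod(offset, chunkSize))
}

// Usage: a ten-byte file mapped in chunks of four bytes
File tmp = File.createTempFile("mapped-sketch", ".txt")
tmp.deleteOnExit()
tmp.text = "0123456789"
List<MappedByteBuffer> chunks = mapFile(tmp, 4L)
println "chunks: ${chunks.size()}, byte at offset 5: ${(char) readByteAt(chunks, 4L, 5L)}"
```

Splitting the mapping into chunks is one way to handle files larger than a single mapping allows; whether VCFIndex sizes its buffers this way is not stated here.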
Using the VCFIndex class to look up variants in a region is very simple:
    VCFIndex index = new VCFIndex("test.vcf")
    index.query("chrX",1000,2000) { Variant v ->
        println "Variant $v is in the range chrX:1000-2000"
    }

You can also easily test if a Variant exists within the VCF file:

    Variant v = new Variant(chr:"chrX", alt:"A", ref:"T")
    VCFIndex index = new VCFIndex("test.vcf")
    if(v in index)
        println "Variant $v is found in test.vcf"
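The `v in index` test works because Groovy evaluates the membership operator `in` by calling `isCase` on the right-hand operand, which is presumably why VCFIndex declares `isCase(gngs.Variant v)`. A minimal sketch of that mechanism, using a hypothetical `PositionSet` class that is not part of gngs:

```groovy
// Hypothetical stand-in for an indexed container; not part of gngs.
class PositionSet {
    Set<Integer> positions = [] as Set

    // Groovy evaluates `x in container` as container.isCase(x)
    boolean isCase(Integer pos) {
        return positions.contains(pos)
    }
}

def known = new PositionSet(positions: [1000, 1500, 2000] as Set)

println(1500 in known)   // membership test dispatches to isCase
println(1234 in known)
```

Defining `isCase` also makes an object usable as a `case` label in a Groovy `switch` statement.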
| Modifiers | Name | Description |
|---|---|---|
| class | VCFIndex.1 | |

| Modifiers | Name | Description |
|---|---|---|
| static int | BUFFER_SIZE | A 256 kb memory buffer that will try to minimize operations on the indexed file |
| static int | ONE_GIG | |

| Type | Name and description |
|---|---|
| java.lang.String | fileName File name of VCF file |
| gngs.VCF | headerVCF A dummy VCF used to parse / hold the header information |
| htsjdk.tribble.index.Index | index Indexes used for random access to large VCFs. |
| java.io.RandomAccessFile | indexedFile Raw file corresponding to the VCF |
| java.util.List<MappedByteBuffer> | vcfBuffers |

| Type Params | Return Type | Name and description |
|---|---|---|
| | gngs.Variant | contains(gngs.Variant v) Returns true if this VCF contains the specified variant |
| | gngs.Variant | find(java.lang.String chr, int start, int end, groovy.lang.Closure c) Find the first variant in an interval that returns true from the given closure |
| | java.util.Map | findAnnovarVariant(java.lang.String chr, java.lang.Object start, java.lang.Object end, java.lang.String obs, int windowSize) Attempts to locate the given Annovar variant in this VCF file. |
| | boolean | isCase(gngs.Variant v) |
| | java.util.Iterator<gngs.Variant> | iterator(java.lang.String chr, int start, int end) |
| | void | query(IRegion r, groovy.lang.Closure c) |
| | void | query(java.lang.String chr, int start, int end, groovy.lang.Closure c) |
| | void | queryIdx(java.lang.String chr, int start, int end, groovy.lang.Closure c) Query the current VCF file for all variants in the specified region |
| | void | queryTabix(java.lang.String chr, int start, int end, groovy.lang.Closure c) |

| Methods inherited from class | Name |
|---|---|
| class java.lang.Object | java.lang.Object#wait(long), java.lang.Object#wait(long, int), java.lang.Object#wait(), java.lang.Object#equals(java.lang.Object), java.lang.Object#toString(), java.lang.Object#hashCode(), java.lang.Object#getClass(), java.lang.Object#notify(), java.lang.Object#notifyAll() |

BUFFER_SIZE
- A 256 kb memory buffer that will try to minimize operations on the indexed file
fileName
- File name of VCF file
headerVCF
- A dummy VCF used to parse / hold the header information
index
- Indexes used for random access to large VCFs. Used with *.query() functions
indexedFile
- Raw file corresponding to the VCF
Load an index for the given VCF file
contains(gngs.Variant v)
Returns true if this VCF contains the specified variant
IMPORTANT: the variant passed in must contain only a single allele for this function to return correct results!
v
- Variant to test for

find(java.lang.String chr, int start, int end, groovy.lang.Closure c)
Find the first variant in an interval that returns true from the given closure
Example:
    VCFIndex index = new VCFIndex("test.vcf")
    Variant v = index.find("chrX", 1000,2000) { it.type == "DEL" && it.size() > 4 }
    println "First deletion greater than 4 bases in size is $v"
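chr
- Chromosome to scan
start
- Starting position in chromosome
end
- End position in chromosome
c
- Closure to call, passing each Variant from the VCF as the first argument.

findAnnovarVariant(java.lang.String chr, java.lang.Object start, java.lang.Object end, java.lang.String obs, int windowSize)
Attempts to locate the given Annovar variant in this VCF file.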
Annovar is a popular annotation tool which outputs results in its own proprietary CSV format. Unfortunately, this format does not use the same reference coordinates as VCF, so an Annovar variant is not directly translatable to a VCF variant. It is therefore necessary to scan a range of positions in the original VCF file to locate the variant that corresponds to a given Annovar variant. This function implements the required logic.
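To make the coordinate mismatch concrete, the sketch below converts a simple VCF indel into Annovar-style fields. It assumes the commonly used Annovar input convention in which the shared leading base of an indel is trimmed and an empty allele is written as `-`. The helper `toAnnovar` is hypothetical and is not how findAnnovarVariant is implemented.

```groovy
// Hypothetical helper, assuming the usual Annovar convention for simple indels:
// the shared leading (anchor) base is trimmed and an empty allele is written "-".
Map toAnnovar(int pos, String ref, String alt) {
    if (ref.length() > alt.length() && ref.startsWith(alt)) {
        // Deletion: Annovar reports only the deleted bases
        String deleted = ref.substring(alt.length())
        int start = pos + alt.length()
        return [start: start, end: start + deleted.length() - 1, ref: deleted, obs: "-"]
    }
    if (alt.length() > ref.length() && alt.startsWith(ref)) {
        // Insertion: start and end both point at the last base before the inserted sequence
        int anchor = pos + ref.length() - 1
        return [start: anchor, end: anchor, ref: "-", obs: alt.substring(ref.length())]
    }
    // SNVs and other substitutions keep the VCF coordinates
    return [start: pos, end: pos + ref.length() - 1, ref: ref, obs: alt]
}

// A VCF deletion recorded at position 100 (ATG -> A) and an insertion (A -> ATG)
println toAnnovar(100, "ATG", "A")
println toAnnovar(100, "A", "ATG")
```

Under this assumption the VCF record for a deletion sits before the Annovar start position, which is why a lookup has to begin scanning upstream of the Annovar coordinates.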
chr
- chromosome / reference sequence
start
- Annovar variant start position
end
- Annovar variant end position
windowSize
- the size of window to scan. VCF files can include overlapping variants in
arbitrarily long windows. Thus it is necessary to start scanning the VCF
significantly before the location of the actual start of the DNA change
to be sure of finding the Annovar variant. In general, the window size
should be as large as the largest indels you expect to have in your
VCF file.

queryIdx(java.lang.String chr, int start, int end, groovy.lang.Closure c)
Query the current VCF file for all variants in the specified region
chr
- Chromosome / Sequence name
start
- Start position
end
- End position
c
- Closure to call back