class VCF extends java.lang.Object
The VCF class supports simple parsing, filtering, querying and updating of VCF files.
The primary method is parse(Closure), which can optionally be passed a file name or stream; if neither is given, it parses from standard input. When a Closure is supplied, it can be used to filter and update the VCF as it is read: each record is passed to the closure, and the record is discarded if the closure returns false:
// Let's find all the indels on chr10!
vcf = VCF.parse('test.vcf') { v ->
    v.chr == 'chr10' && (v.type == 'INS' || v.type == 'DEL')
}
Various forms of information can be queried: the INFO section is parsed to a
map, and genotype information is available as the "dosage" of each sample
with respect to the variant (an integer number of copies: 0, 1 or 2).
Support for SnpEff and VEP annotations is also provided via
dedicated properties that return a list of SnpEffInfo
objects describing the available annotations. See the Variant
class for more information about individual variants.
Note that most information is lazily computed, so the cost of parsing complex fields is deferred until they are requested.
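To make the INFO map and the "dosage" concept concrete, here is a minimal, self-contained sketch. It does not use this library: parseInfo and dosage are hypothetical helpers written only for illustration, and the real parsing inside the Variant class may differ.

```groovy
// Illustrative only: hypothetical helpers, not part of the VCF or Variant classes.

// Split a raw INFO string into a map; flags without '=' map to true.
Map<String, Object> parseInfo(String info) {
    info.split(';').collectEntries { String field ->
        int eq = field.indexOf('=')
        return eq < 0 ? [field, true] : [field.substring(0, eq), field.substring(eq + 1)]
    }
}

// Count the non-reference alleles in a genotype such as 0/1 or 1|1.
int dosage(String gt) {
    gt.split('[/|]').count { it != '0' && it != '.' } as int
}

def parsed = parseInfo('DP=35;AF=0.5;DB')
println parsed.DP
println dosage('0/1')
```

Values are kept as strings in this sketch; converting them to numbers according to the header metadata is left out.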
Updating variants requires a special procedure because related fields need to be synchronised after modification, and it is a strong convention to add a header line describing updates. An update method provides a closure-based mechanism to make this straightforward:
v.update { v.info.FOO='bar' }
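The idea behind a closure-based update can be sketched in a few lines. The SketchVariant class below is a hypothetical stand-in, not the library's Variant: it assumes a parsed info map and a raw text field that must agree, and it omits the header bookkeeping mentioned above.

```groovy
// Hypothetical stand-in for a variant; names are illustrative only.
class SketchVariant {
    Map<String, Object> info = [:]   // parsed INFO fields
    String infoText = ''             // raw INFO column as it would be written out

    // Run the caller's edits, then rebuild the raw text so both views agree.
    void update(Closure edits) {
        edits.call()
        infoText = info.collect { key, value ->
            (value instanceof Boolean) ? key : "${key}=${value}"
        }.join(';')
    }
}

def v = new SketchVariant(info: [DP: '35'], infoText: 'DP=35')
v.update { v.info.FOO = 'bar' }
println v.infoText
```

Wrapping the edit in a closure gives the object a single place to resynchronise its derived fields, so callers cannot forget that step.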
Limited support for "Pedigrees" is available. At the moment only grouping of samples into families is supported and intra-family relationships are not modeled.
The Iterable interface and "in" operator are supported, allowing all the normal Groovy collection-manipulation methods. For example, you can do common iteration and membership tests:
for(Variant v in vcf) {
    if(v in vcf2) { ... }
}
or grouping, counting, etc:
vcf.countBy { it.chr } // Count of variants by chromosome
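For readers unfamiliar with how Groovy wires these features up, the sketch below shows the two hooks involved: implementing Iterable, and defining isCase for the "in" operator. SketchVcf is a hypothetical container holding plain maps; it is not this class and only mimics the behaviour described above.

```groovy
// Hypothetical container; it mimics only the two hooks discussed above.
class SketchVcf implements Iterable<Map> {
    List<Map> records = []

    // Enables for-in loops and Groovy collection methods such as countBy.
    Iterator<Map> iterator() { records.iterator() }

    // Groovy evaluates "x in container" as container.isCase(x).
    boolean isCase(Object obj) {
        records.any { it.chr == obj.chr && it.pos == obj.pos && it.alt == obj.alt }
    }
}

def vcf = new SketchVcf(records: [
    [chr: 'chr1', pos: 100, alt: 'T'],
    [chr: 'chr1', pos: 250, alt: 'G'],
    [chr: 'chr10', pos: 42, alt: 'A']
])

println vcf.countBy { it.chr }                       // [chr1:2, chr10:1]
println([chr: 'chr10', pos: 42, alt: 'A'] in vcf)    // true
```

The membership test in this sketch compares only position and allele, which mirrors the documented behaviour that dosage is not compared.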
To look up a specific region in a very large VCF file, index the file and use the VCFIndex class to query that region directly.
To stream a VCF without storing it in memory, use the filter(Closure) methods, which print the results, including the header, directly to the output.
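The streaming pattern can be illustrated without the library. In the sketch below, streamFilter is a hypothetical helper (not the filter methods of this class) that passes header lines through unchanged and tests each record as it is read, so no list of variants is ever built.

```groovy
// Hypothetical helper, not the library's filter(): streams records one at a time.
void streamFilter(Reader input, Appendable sink, Closure<Boolean> keep) {
    input.eachLine { String line ->
        if (line.startsWith('#')) {
            sink.append(line).append('\n')      // header lines pass through unchanged
        } else {
            List<String> cols = line.split('\t') as List
            // Expose a few fixed VCF columns to the predicate.
            if (keep([chr: cols[0], pos: cols[1].toInteger(), ref: cols[3], alt: cols[4]])) {
                sink.append(line).append('\n')
            }
        }
    }
}

def source = new StringReader(
    '##fileformat=VCFv4.2\n' +
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' +
    'chr10\t100\t.\tA\tT\t50\tPASS\tDP=35\n' +
    'chr1\t200\t.\tG\tC\t50\tPASS\tDP=12\n')
def sink = new StringBuilder()
streamFilter(source, sink) { it.chr == 'chr10' }
print sink
```

Memory use stays constant with respect to file size because each line is written or dropped before the next one is read.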
| Type | Name and description |
|---|---|
| static int | FORMAT_COLUMN_INDEX |
| static int | SAMPLE_COLUMN_INDEX |
| java.util.Map<java.lang.String, java.util.List<Variant>> | chrPosIndex |
| java.lang.String | fileName |
| java.util.Map<java.lang.String, java.util.List<Variant>> | genes: Index of variants by affected gene. Note that a variant can affect more than one gene, so it will appear multiple times in the value side of the map. |
| java.util.List<java.lang.String> | headerLines |
| java.lang.String[] | lastHeaderLine |
| boolean | lazyLoad |
| java.util.List<Pedigree> | pedigrees |
| java.util.List<Pedigree> | samplePedigrees |
| java.util.List<java.lang.String> | samples |
| java.util.List<Variant> | variants |
| Constructor and description |
|---|
| VCF() |
| VCF(File file): Creates an empty VCF file based on the header of the given VCF file |
| VCF(java.lang.String fileName): Creates an empty VCF file based on the header of the given VCF file |
| VCF(java.lang.Iterable<Variant> variants): Creates a VCF containing all the given variants, initialised with the header from the first variant |
| Type Params | Return Type | Name and description |
|---|---|---|
| | VCF | add(Variant v): Add the given variant to this VCF |
| | void | addInfoHeader(java.lang.String id, java.lang.String desc, java.lang.Object value): Add a header for describing an info value to be added to a VCF. |
| | void | addInfoHeaders(java.util.List<java.lang.String> linesToAdd): Add new INFO header fields to this VCF |
| | java.lang.Object | asType(java.lang.Class clazz) |
| | java.lang.String | buildGtFieldValue(java.lang.String gt, java.util.List<java.lang.String> myGtFields, java.util.List<java.lang.String> includedFields) |
| | double | denovoRate(int proband, int parent1, int parent2): Calculate the fraction of de novo variants if this VCF contains a trio with the indexes of the samples as given |
| | double | denovoRate(VCF m, VCF d) |
| | void | each(Closure c) |
| | static void | filter(File f, Closure c = null) |
| | static void | filter(File f, java.util.List<Pedigree> peds, Closure c = null) |
| | static void | filter(java.lang.String fileName, Closure c = null) |
| | static void | filter(java.lang.String fileName, java.util.List<Pedigree> peds, Closure c = null) |
| | static void | filter(Closure c = null) |
| | static void | filter(Map options = [:], InputStream f, Closure c) |
| | static void | filter(Map options = [:], InputStream f, java.util.List<Pedigree> peds = null, Closure c = null) |
| | static VCF | filter(Map options, java.lang.String fileName, Closure c) |
| | static VCF | filter(Map options = [:], Reader r, Closure c) |
| | void | filterSamples(java.io.PrintStream p, java.util.List<java.lang.String> includeSamples): Print out a version of this VCF with only the given samples included |
| | Variant | find(Variant v): Attempt to locate a variant having the same change as the given variant inside this VCF. |
| | Pedigree | findPedigreeBySampleIndex(int i) |
| | java.util.Map<java.lang.String, java.lang.Integer> | getConsequenceCounts(): Calculate the number of variants for each different VEP consequence, returning a map of consequence => count. |
| | java.util.List<java.lang.String> | getContigs() |
| | java.util.Map<java.lang.String, FormatMetaData> | getFormatMetaData() |
| | java.util.Map<java.lang.String, java.util.List<Variant>> | getGenes() |
| | java.util.List<Variant> | getHighQualityHets(VCF p2): Finds a set of high quality SNPs that distinguish this VCF from the other VCF provided. |
| | VCFIndex | getIndex() |
| | java.util.Map<java.lang.String, java.lang.Object> | getInfoMetaData(java.lang.String id) |
| | int | getSize() |
| | java.lang.String[] | getVepColumns() |
| | java.lang.String[] | getVepColumns(java.lang.String vepType): Dedicated support for dynamically returning the VEP columns present in this VCF file |
| | Sex | guessSex(java.lang.String sampleId, int sampleSize = 500): Return the sex of a sample, estimated from the heterozygosity of its variants. |
| | Sex | guessSex(int sampleIndex = 0, int sampleSize = 500): Attempt to guess the sex of a human VCF file by sampling high quality variants on the X chromosome. |
| | boolean | hasInfo(java.lang.String id): Return true if this VCF file contains the specified INFO tag |
| | boolean | isCase(java.lang.Object obj): Support for the convenience syntax to check if a VCF contains a specific variant, of the form if(variant in vcf) { .... }. Note however that dosage of the variant is NOT compared, so a variant is "in" a VCF even if its allele count / zygosity is different to the variant of interest. |
| | java.util.Iterator<Variant> | iterator() |
| | VCF | load(Closure c = null) |
| | VCF | merge(VCF other): Perform a simplistic (emphasis on simplistic) merge between VCFs. |
| | static VCF | parse(java.lang.String fileName, Pedigrees peds, Closure c = null) |
| | static VCF | parse(Map options = [:], java.lang.String fileName, java.util.List<Pedigree> peds = null, Closure c = null): Convenience method to accept string for parsing file |
| | static VCF | parse(Map options = [:], java.lang.String fileName, Closure c) |
| | static VCF | parse(File f, java.util.List<Pedigree> peds = null, Closure c = null) |
| | static VCF | parse(Map options, File f, java.util.List<Pedigree> peds = null, Closure c = null) |
| | static VCF | parse(Closure c = null) |
| | static VCF | parse(java.util.List<Pedigree> peds, Closure c = null) |
| | static VCF | parse(InputStream f, java.util.List<Pedigree> peds = null, Closure c = null) |
| | static VCF | parse(Map options, InputStream f, java.util.List<Pedigree> peds = null, Closure c = null) |
| | static VCF | parse(Map options = [:], Reader r, boolean filterMode, Closure c) |
| | FormatMetaData | parseFormatMetaDataLine(java.lang.String line) |
| | java.util.Map<java.lang.String, java.lang.Object> | parseInfoMetaData(java.lang.String info): Parse a VCF INFO meta data line and return the values as a Map |
| | void | parseLastHeaderLine(): Extract sample names and other info from the final header line in the VCF file |
| | java.util.Map<java.lang.String, java.lang.String> | parseVepColumns() |
| | void | print() |
| | void | print(PrintWriter p) |
| | void | print(java.lang.Appendable p) |
| | void | printHeader(java.lang.Appendable w) |
| | void | printHeader() |
| | static void | processParsedVariant(VCFParseContext ctx, Variant v) |
| | void | renameSample(java.lang.String fromId, java.lang.String toId): Change the sample ids for this VCF |
| | void | replaceSamples(java.util.List<java.lang.String> sampleIds): Replace the sample ids in the VCF with the given ones |
| | int | sampleIndex(java.lang.String sampleName) |
| | java.lang.String | sniffGenomeBuild(): Attempts to identify which version of human genome build this VCF was created from. |
| | Regions | toBED() |
| | java.util.List<java.util.Map<java.lang.String, java.lang.Object>> | toListMap() |
| | Regions | toRegions() |
| | java.lang.String | toString() |
| | double | transmissionRate(VCF p1, VCF p2): Calculate the rate at which variants are transmitted to this sample from p1, given p2 is the other parent. |
| | Map | trioDenovoRate(): Infer proband and calculate rate of de novo variants for a VCF containing a trio. |
| | java.util.List | variantsAt(java.lang.String chr, int pos): Return a list of variants starting at the given location |
| Methods inherited from class | Name |
|---|---|
| class java.lang.Object | java.lang.Object#wait(long, int), java.lang.Object#wait(long), java.lang.Object#wait(), java.lang.Object#equals(java.lang.Object), java.lang.Object#toString(), java.lang.Object#hashCode(), java.lang.Object#getClass(), java.lang.Object#notify(), java.lang.Object#notifyAll() |
genes: Index of variants by affected gene. Note that a variant can affect more than one gene, so it will appear multiple times in the value side of the map.
VCF(File file): Creates an empty VCF file based on the header of the given VCF file.
The VCF is set to lazy mode, which auto-loads the variants when some methods are called.
VCF(java.lang.String fileName): Creates an empty VCF file based on the header of the given VCF file.
The VCF is set to lazy mode, which auto-loads the variants when some methods are called.
VCF(java.lang.Iterable<Variant> variants): Creates a VCF containing all the given variants, initialised with the header from the first variant.
addInfoHeader(java.lang.String id, java.lang.String desc, java.lang.Object value): Add a header for describing an info value to be added to a VCF.
value - A prototype value to infer the type of value for the INFO field from.
addInfoHeaders(java.util.List<java.lang.String> linesToAdd): Add new INFO header fields to this VCF.
Searches for the last position of the other INFO lines and inserts the new lines at that position. If there are no other INFO lines, inserts the new lines at the end of the headers.
denovoRate(int proband, int parent1, int parent2): Calculate the fraction of de novo variants if this VCF contains a trio with the indexes of the samples as given.
proband - index in the samples array of the proband
parent1 - index in the samples array of first parent
parent2 - index in the samples array of second parent
filterSamples(java.io.PrintStream p, java.util.List<java.lang.String> includeSamples): Print out a version of this VCF with only the given samples included.
find(Variant v): Attempt to locate a variant having the same change as the given variant inside this VCF.
This function is not guaranteed to find the change, even if it exists. Currently, it will only locate the change if it is represented by an entry in the VCF starting at the same reference position. A variant will be returned if any allele in the identified variant matches any allele in the given variant. These operations become much more reliable (but still not 100% reliable) when both VCFs from which the variants are drawn are decomposed into primitives.
getConsequenceCounts(): Calculate the number of variants for each different VEP consequence, returning a map of consequence => count.
getHighQualityHets(VCF p2): Finds a set of high quality SNPs that distinguish this VCF from the other VCF provided.
getVepColumns(java.lang.String vepType): Dedicated support for dynamically returning the VEP columns present in this VCF file.
guessSex(java.lang.String sampleId, int sampleSize = 500): Return the sex of a sample, estimated from the heterozygosity of its variants.
See guessSex(int) for details of the implementation.
guessSex(int sampleIndex = 0, int sampleSize = 500): Attempt to guess the sex of a human VCF file by sampling high quality variants on the X chromosome.
With various checks and constraints, the main test is:
hasInfo(java.lang.String id): Return true if this VCF file contains the specified INFO tag.
Note: it only checks whether the tag is described in the header, not whether any variant in the VCF actually has the INFO tag. You still need to account for the possibility that any given record is missing the tag.
id - id of INFO tag to check for
isCase(java.lang.Object obj): Support for the convenience syntax to check if a VCF contains a specific variant with the form:
if(variant in vcf) {
    ....
}
Note however that dosage of the variant is NOT compared, so a variant is "in" a VCF
even if its allele count / zygosity is different to the variant of interest.
merge(VCF other): Perform a simplistic (emphasis on simplistic) merge between VCFs.
parse(Map options = [:], java.lang.String fileName, java.util.List<Pedigree> peds = null, Closure c = null): Convenience method that accepts the file to parse as a string.
parseInfoMetaData(java.lang.String info): Parse a VCF INFO meta data line and return the values as a Map.
parseLastHeaderLine(): Extract sample names and other info from the final header line in the VCF file. If samples are specified, the samples parsed will be limited to the given samples. The VCF header line will be modified so that any output will only include the given samples.
renameSample(java.lang.String fromId, java.lang.String toId): Change the sample ids for this VCF.
replaceSamples(java.util.List<java.lang.String> sampleIds): Replace the sample ids in the VCF with the given ones.
sniffGenomeBuild(): Attempts to identify which version of human genome build this VCF was created from. Not applicable to non-human genomes.
transmissionRate(VCF p1, VCF p2): Calculate the rate at which variants are transmitted to this sample from p1, given p2 is the other parent.
trioDenovoRate(): Infer proband and calculate rate of de novo variants for a VCF containing a trio.
variantsAt(java.lang.String chr, int pos): Return a list of variants starting at the given location.