Ximmer Analyses
Introduction
Ximmer configuration has two basic parts:
- how CNVs are simulated (“simulation”)
- how CNVs callers are run (“analyses”)
You can choose to do either or both of simulation and analysis. If you just want some CNV results from one or more CNV callers, just do analysis. If you just want to simulate some CNVs and then take the data to analyse separately, just do the simulation part. If you want everything, then do both.
For the analysis part, Ximmer is designed to allow you to easily run many different callers with many different settings, or the same caller with many different settings, and compare all the results together. Each separate set of settings applied across a selection of CNV callers is called an “analysis”.
Whether you’re doing simulation or analysis or both, the starting point is to create a configuration file that controls the analysis. This file can have a lot of settings, but in its most minimal form, all it needs is:
- bam_files setting describing where the BAM files to analyse or simulate from are
- target_regions setting describing the capture region
- callers section describing which CNV callers to run (see below)
Setting the BAM Files
The most important input to any CNV detection method is the BAM files to analyse. With Ximmer these are specified directly in the configuration file for each analysis. The specification can be either string value or a list of strings. Each string itself can be a wildcard pattern to match multiple BAM files.
Example of single wildcard expression:
bam_files="/home/simon/data/bam_files/*.bam"
Example of multiple wildcard expressions:
bam_files=[
"/home/simon/data/bam_files1/*.bam",
"/home/simon/data/bam_files2/*.bam"
]
Setting Sample Sex
The sex of each sample is important for two reasons: firstly, one of Ximmer’s simulation methods utilises the sex of each sample directly in the simulation method. However sex is important even when just doing analysis because of the differing ploidy of the X-chromosome between males and females.
Ximmer can automatically detect the sex of samples, so specifying it is optional. However the sex-detection algorithm takes some time to run, and can occasionally be inaccurate if data has unexpected characteristics. Therefore it’s better to specify the sample sexes if you know them. There are two ways to do this. The first way, is to supply a PED file that specifies the sexes. This method is convenient when you already have such a file:
ped_file="/home/simon/data/samples.ped"
The second way is to explicitly list the males and females:
samples {
males = [
"SAMPLE_X123",
"SAMPLE_X542"
...
]
females = [
"SAMPLE_X921",
"SAMPLE_X291",
...
]
}
Note that in all cases, the samples specified must match the sample ids specified in the BAM files supplied.
Default Analysis
Some CNV callers have a lot of adjustable parameters. Therefore it is inconvenient to have to set every parameter every time you run them. For this reason, Ximmer has a section that allows you to define the default parameters. Then any configuration you make is simply overriding the defaults for only the parameters you are interested in modifying. Each separate set of modified parameters is called an “analysis”. The set of default parameters is the “default analysis”. The default analysis is configured in the “callers” section of the configuration file. An example is shown below:
callers {
xhmm {
exome_wide_cnv_rate=1e-04
mean_number_of_targets_in_cnv=3
}
exomedepth {
transition_probability=0.0001
}
cnmops {
prior_impact=5
min_width=1
}
conifer {
conifer_svd_num=1
}
}
If you don’t configure anything else, only the default analysis is what will be run to find CNVs in the data.
Customized Analyses
A very common task in using a CNV detection tool is to try different settings to find out what works best on your particular data. To do that you need to compare between different settings for the same tool. Each group of settings that you wish to run with is called an “analysis”. These are configured in the analysis section of the configuration file. An example is below:
analyses {
'xhmmtune' {
xhmm_1 {
exome_wide_cnv_rate=1e-02
}
xhmm_2 {
exome_wide_cnv_rate=1e-02;
xhmm_pve_mean_factor=0.2;
}
xhmm_3 {
exome_wide_cnv_rate=1e-04;
}
}
}
This example configuration only runs XHMM. The name for the analysis is xhmmtune
. The use of XHMM
is inferred from the prefix xhmm_
for the label of each individual block
within the analyses. The configuration parameters themselves are specified within each
block and are specific to each caller (see table).
Caller | Parameter | Description | Example / Default |
---|---|---|---|
ExomeDepth | transition_probability | 10e-4 | |
expected_cnv_length | 50000 | ||
XHMM | exome_wide_cnv_rate | prior probability of CNV in genome | 10e-8 |
xhmm_pve_mean_factor | fraction of variation to remove by normalisation | 0.7 | |
max_sd_target_rd | maximum cov standard deviation for target | 30 | |
Conifer | conifer_svd_num | Number of SVG / PCA components to remove in normalisation | 2 |
conifer_call_threshold | Z-score threshold at which CNVs are called | ||
cn.MOPs | prior_impact | Weighting of prior probability of CNV | 10 |
min_width | Min target regions to call a CNV | 5 | |
lower_threshold | Affects threshold on coverage for CNV calling | -0.8 | |
panel_type | Sets a range of parameters for panel vs exome | exome (or blank) | |
CODEX | k_offset | Adjusts CODEX’s preferred k by given amount | 0 |
max_k | Sets the maximum value of k to be tried. For small panels or sample nubmers, adjust down |
Filtering by quality
If a CNV caller produces many false positives, you may wish to filter out results that
have a low quality score assigned by the caller. You can specify a quality score threshold
using the quality_filter
setting for each analysis. Note that how this is interpreted
is specific to each CNV caller. Ximmer uses a fixed quality metric for each CNV caller
(see publication).
Example:
cnmops {
prior_impact=5
quality_filter=1.5
}
Specifying Variants
It can be informative to know when variants such as SNVs and indels overlap CNV calls. this is important for two reasons:
- Heterozygosity and allele balance helps inform about whether a CNV call is accurate
- Overlapping loss of function variants can form compound heterzygous configurations that result in a complete loss of a gene.
Ximmer can incorporate variant calls for samples into the analysis. You can provide
these by specifying a list of variant calls under the variants
attribute in the
configuration file:
variants="/home/simon/bams/my_project_variants.vcf"
The variants attribute can also be specified as a list of VCF files:
variants=[
"/home/simon/bams/sample1_variants.vcf",
"/home/simon/bams/sample2_variants.vcf",
...
]
Note that each entry in either of these forms can be a Unix style “glob” to match multiple VCF files.
Identity Masking
Sometimes you do not wish to display the full id that is attached to samples in the BAM and VCF files in your CNV report. This might be because the sample ids are very long and unwieldy, or it could also be because there is sensitive or private information in the ids. Ximmer supports a function to mask out portions of the actual sample ids from being displayed in the report. The function allows the user to specify a regular expression which must have a single group (ie. section enclosed in parentheses). Ximmer will display the section in parentheses only when showing sample ids in the report.
Important: the full sample id is still accessible internally within the report. This feature does not provide security against a malicious user wishing to unmask identities. The underlying sample ids are readily accessible by use of Javascript and potentially may leak into some parts of the user interface as well.
Example: Trim a trailing portion from sample ids in the form _SNN:
sample_id_mask="(.*)_S[0-9]*"
Example: Retain only a trailing SNN portion of the sample id:
sample_id_mask=".*(_S[0-9]*)"
Excluding Specific Regions from Analysis
For a variety of reasons it is sometimes desirable to exclude some regions of the genome from analysis, even when they are included in the exome target regions. Some common reasons can include:
- to exclude regions where CNV calling is difficult and thus causes large numbers of false positives
- to mask regions where CNV calls might result in incidental findings (for example, that violate ethics constraints for research).
Note: regions excluded from analysis are not automatically excluded
from simulation or known CNVs provided as true positives. Thus excluding regions
may result in loss of sensitivity in the output. To exclude regions completely
from use by Ximmer, adjust the target_regions
parameter.
Excluding Results Overlapping Specific Genes
Although Ximmer can exclude some regions from analysis, often the reason to do this
is to avoid including specific genes from the results. Ximmer supports this option
via the exclude_genes
setting. Set this option to a text file containing one HGNC
gene symbol per line to exclude CNVs overlapping the specified genes from your
results. Note that these CNVs will be excluded even if they overlap genes specified
by the gene_filter
option (see below).
Example:
exclude_genes="/home/ximmer/genes/excluded_genes.txt"
Filtering Results to Specific Genes
Although Ximmer supports interactively filtering to specific genes in the curation interface, it may be desirable to hard filter the result set to a gene list. This can ensure only specific genes are looked at, such as when ethics or a clinical indication for testing limits the scope of the investigation.
To set a gene list, add the gene_filter
configuration attribute, set to a file containing
a list of HGNC gene symbols (one per line). CNVs will be removed from the results unless
they overlap at least one gene from the provided set.
Example:
gene_filter="/home/ximmer/genes/gene_list.txt
Gene Lists
More advanced filtering and ranking of genes can be set up using gene lists. A gene list is a list of gene symbols, with each symbol accompanied by a number indicating its priority, which is typically in the range 1 - 4.
You can assign multiple gene lists in a genelists
section. Each gene list is identified
by symbol which should be a short sequence of upper case letters, which should be assigned
to a path to a file that defines the gene symbols and priorities (also called “categories”),
separated by tab characters.
Example configuration:
genelists {
CARDIAC='/home/simon/genelists/cardiac_genes.txt'
}
An example gene list would look like:
DVL1 3
SCN5A 5
NOTE: the gene list is applied to all genes overlapped by a CNV, and the whole CNV is considered to have the rank of its highest ranked gene.
Filtering Results Based on Genelists
By default, setting a gene list only causes genes to be highlighted and searchable
by category in the user interface. However you can also completely exclude genes that
are not of interest from appearing in your results. To do this, set a minimum category
by adding a filter
section to your genelists:
genelists {
CARDIAC='/home/simon/genelists/cardiac_genes.txt'
filter {
miminum_category=1
}
}
Using Different Genelists for Different Samples
If you want to analyse a bunch of samples together but they need different gene lists
applied, you can do this by specifying a sample_map
entry in the gene lists configuration.
The sample map file is a two column, tab separated format that has a sample id in the first
column and the corresponding gene list to apply in the second column.
Example of configuration:
genelists {
CARDIAC='/home/simon/genelists/cardiac_genes.txt'
CMT='/home/simon/genelists/cmt_genes.txt'
sample_map='/home/simon/samples/sample_genelists.txt'
}
The sample map could look like:
SAMPLE1 CARDIAC
SAMPLE2 CMT
SAMPLE3 CARDIAC
...
Minimal Complete Configuration Example
Below is a very simple, minimal but working configuration which analyses a set of BAM files to find CNVs using XHMM and ExomeDepth using their default settings:
bam_files="/home/simon/data/*.bam"
target_regions="/home/simon/data/EXOME.bed"
concurrency=20
callers {
xhmm {}
exomedepth {}
}