Avocado User Guide¶
Introduction¶
Avocado is a variant caller built on top of Apache Spark to allow rapid variant calling on cluster/cloud computing environments. Avocado is built on ADAM’s APIs, and achieves variant calling accuracy that is similar to state-of-the-art tools while being able to drop variant calling latency to approximately 15 minutes when running on a 1,024 core cluster.
Workflows Supported¶
Avocado is run through the avocado-submit command line:
./bin/avocado-submit
Using SPARK_SUBMIT=/usr/local/bin/spark-2.2.1-bin-hadoop2.7/bin/spark-submit
Usage: avocado-submit [<spark-args> --] <avocado-args> [-version]
Choose one of the following commands:
biallelicGenotyper : Call variants under a biallelic model
discover : Discover variants in reads
jointer : Joint call and annotate variants.
mergeDiscovered : Merge variants discovered from reads of multiple samples
reassemble : Reassemble reads to canonicalize variants
trioGenotyper : Call variants in a trio under a biallelic model
The avocado-submit script follows the same conventions as the adam-submit command line, whose documentation can be found here. As a result, just like ADAM, Avocado can be deployed on a local machine, on AWS, an in-house cluster running YARN or SLURM, or using Toil.
Avocado supports several workflows:
- Single sample germline variant calling: Avocado’s BiallelicGenotyper runs on a single sample at a time, and can generate both variants-only (VCF) and all-sites (gVCF) output.
- Joint variant calling: Avocado supports jointly calling variants from a collection of gVCF-styled inputs.
Avocado also contains code to reassemble variants, and a pedigree variant caller. However, this code is experimental and is thus unsupported.
For genotyping, Avocado uses a probabilistic model that assumes that sites are biallelic. This model is derived from the biallelic model used by the Mpileup variant caller, but modified to better call multiallelic sites. When used with the INDEL realigner from ADAM, Avocado has >99% accuracy when genotyping SNPs, and >96% accuracy when genotyping INDELs.
Building Avocado from Source¶
You will need to have Apache Maven version 3.1.1 or later installed in order to build Avocado.
Note: The default configuration is for Hadoop 2.7.3. If building against a different version of Hadoop, please pass-Dhadoop.version=<HADOOP_VERSION>
to the Maven command.
git clone https://github.com/bigdatagenomics/avocado.git
cd avocado
export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"
mvn clean package -DskipTests
Outputs
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.647s
[INFO] Finished at: Thu May 23 15:50:42 PDT 2013
[INFO] Final Memory: 19M/81M
[INFO] ------------------------------------------------------------------------
You might want to take a peek at the scripts/jenkins-test
script and
give it a run. We use this script to test that Avocado is
working correctly.
Running Avocado¶
Avocado is packaged as an überjar and includes all necessary dependencies, except for Apache Hadoop and Apache Spark.
You might want to add the following to your .bashrc
to make running
Avocado easier:
alias avocado-submit="${AVOCADO_HOME}/bin/avocado-submit"
$AVOCADO_HOME
should be the path to where you have checked AVOCADO out on
your local filesystem. The alias calls a script that wraps
the spark-submit
command to set up Avocado. You
will need to have the Spark binaries on your system; prebuilt binaries
can be downloaded from the Spark
website. Our continuous
integration setup
builds ADAM against Spark 2.0.0, Scala versions 2.10
and 2.11, and Hadoop versions 2.3.0 and 2.6.0.
Once this alias is in place, you can run Avocado by simply typing
avocado-submit
at the command line.
avocado-submit
Single Sample Variant Calling¶
To call variants using Avocado, use the BiallelicGenotyper command. This command discovers possibly variant sites from a collection of reads. The discovered sites are then genotyped using a biallelic probabilistic model. The genotyping model is based off of the biallelic model used by the original Samtools mpileup variant caller, but adds additional components for modeling a site with two minor alleles, as well as reads that do not match any known allele. The full genotyping model is described in Chapter 7 of this thesis.
To run the BiallelicGenotyper, you must provide two parameters:
- The path to the input file (in any file format for reads that ADAM can load)
- The path where the output should be saved as Parquet-encoded Genotypes.
You can configure how Avocado discovers variants to genotype, the genotyping phase, and the hard filters that Avocado applies to the called genotypes.
Reads to Evaluate¶
By default, Avocado only calls variants in autosomal regions. Avocado does this
by inspecting the names of the contigs that the reads are mapped to. Avocado
assumes that the contig names start with chr
prefixes. If your reference build
does not have these prefixes (i.e., chromosome 1 is 1
, instead of chr1
), you
need to pass the -is_not_grc
option. To enable calling non-autosomal regions,
you can pass:
-keep_mitochondrial_chromosome
: to call variants on the mitochondrial chromosome.-autosomal_only
: to call variants on the sex chromosomes.
Additionally, we apply several read quality filters. To disable these filters, you can pass:
-keep_duplicate_reads
: To disable discarding reads that have been marked as a PCR duplicate.-keep_non_primary
: To disable discarding reads that have non-primary alignments.-min_mapping_quality
: To change the default mapping quality below which reads are filtered out (default is phred 10).
Variant Discovery¶
Avocado treats a site as possibly variant if enough alternate alleles are seen
with high enough quality. If you already know which variants you want to call,
you can skip variant discovery by passing the -variants_to_call
flag, with a
path that ADAM can load as variants. During variant discovery, you can tune the
variants that are discovered with the following flags:
-min_phred_to_discover_variant
: The minimum Phred quality for a read containing a variant allele to be considered a confident observation of that allele.-min_observations_to_discover_variant
: The minimum number of confident observations of a variant allele for us to choose to score a variant.
Genotyping¶
By default, Avocado only scores the sites we have identified as possible
variants. However, Avocado can also emit genotype likelihoods for sites where
no alternate allele was seen. This output is not banded by quality and thus is
equivalent to a BP RESOLUTION gVCF.
To run in this mode, pass the -score_all_sites
option on the command line.
The only parameters that the genotyping engine consumes are around copy number.
Unless otherwise specified, Avocado assumes that it is running on a diploid
sample. To set the default copy number to another value, pass the ploidy with
the -ploidy
option. Additionally, copy number variants can be passed with the
-cnv
flag. The copy number variants should be described in the GFF format that
is compatible with the DECA and
XHMM copy number variant
callers.
Variant Filtration¶
Avocado applies hard filters separately to SNPs and INDELs. The following hard filters can be applied:
min
andmax
:
depth
: The read depth covering the site.rms_mapping_quality
: The RMS quality of reads mapped at the site.
min
andmax
, separately forhet
andhom
genotypes:
allelic_fraction
: The fraction of reads that support the major vs. minor allele. E.g., if 25 reads support the major (reference) allele and 75 reads support the minor (alternate) allele, the site has an allelic fraction of 0.75. Forhom
genotypes, onlymin
is supported.
min
only, separately forhet
andhom
genotypes:
quality_by_depth
: The genotype quality divided by the read depth. Sites with a low quality-by-depth may be highly covered (leading to high quality), but with low quality reads.
The default values are subject to change, but are described on the
BiallelicGenotyper
’s command line when -help
is printed.
Translating to VCF¶
The BiallelicGenotyper
saves its output as Apache Parquet,
formatted using ADAM’s Genotype schema.
To translate this representation back to VCF, you can use ADAM’s
transformGenotypes command,
or Avocado’s Jointer command. We recommend using the Jointer
command, which will calculate variant QUAL
for all genotyped sites.
Joint Variant Calling¶
Avocado’s Jointer
command supports joint variant calling from gVCF-styled
data.
The Jointer
command can also be used to export Apache
Parquet Genotype data to VCF, and to joint
genotype a collection of samples who all scored the same set of variants.
Our joint variant calling approach is is described in Chapter 7 of this
thesis.
To run the Jointer
command, you must provide two parameters:
- The path to all input files to joint genotyping (to load multiple files, use Hadoop’s glob syntax.
- The path to save the output to, as a VCF file.
To save the VCF file as a single file (instead of sharded output), pass the
-single
flag.
If run on a single sample, Jointer
will calculate variant statistics (VCF
INFO column attributes) and qualities only. If run on multiple samples, the
Jointer
command will update the called genotypes using a binomial prior
that is informed by the observed allele frequency of the variant across all
samples with confident calls. If the input data for the multiple samples is in
gVCF format, pass the -from_gvcf
flag.