July 19, 2023

Navigating Variant Calling for Disease-Causing Mutations: The State-of-the-Art Process

Variant calling is the process of identifying and categorizing genetic variants in sequencing data. It is a critical step in the analysis of whole-genome sequencing (WGS) and whole-exome sequencing (WES) data, as it allows researchers to identify potential disease-causing mutations.

Choice of aligners

The first step in variant calling is to align the sequencing reads to a reference genome. This is done using a program such as BWA-MEM or minimap2. Once the reads are aligned, the variant calling algorithm can then identify any differences between the sequencing data and the reference genome.

You will need to choose different aligners depending on the read length of your sequencing experiment. 

  1. For short reads (70-100 bp), BWA-MEM is particularly effective.
  2. For long reads (> 1000 bp, e.g., PacBio or Oxford Nanopore genomic reads), minimap2 has been shown to be powerful and accurate. It has also been reported to have advantages for short reads longer than 100 bp, with some reports citing about three times the speed of BWA-MEM and Bowtie2 at similar accuracy on simulated data.
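As a rough illustration of the read-length rule above, here is a minimal Python sketch that picks an aligner and builds its command line. The function name, the file names, and the choice to send everything at or below 1000 bp to BWA-MEM are assumptions made for this example; `bwa mem -t` and the minimap2 presets `map-ont` and `map-pb` are standard options of those tools, but check the versions you have installed before relying on them.

```python
def build_alignment_command(mean_read_length, reference, reads,
                            platform="illumina", threads=4):
    """Return an aligner command (as an argument list) chosen by read length.

    Hypothetical helper: the 1000 bp cutoff follows the rule of thumb in the
    text, and reads at or below it are treated as short reads (an assumption).
    """
    if mean_read_length > 1000:
        # Long reads: minimap2 with a platform-specific preset.
        presets = {"ont": "map-ont", "pacbio": "map-pb"}
        if platform not in presets:
            raise ValueError(f"no long-read preset known for platform {platform!r}")
        return ["minimap2", "-ax", presets[platform], "-t", str(threads),
                reference, *reads]
    # Short reads: BWA-MEM, which writes SAM to standard output.
    return ["bwa", "mem", "-t", str(threads), reference, *reads]


if __name__ == "__main__":
    # File names below are placeholders, not real data.
    print(" ".join(build_alignment_command(100, "ref.fa", ["r1.fq", "r2.fq"])))
    print(" ".join(build_alignment_command(15000, "ref.fa", ["reads.fq"],
                                           platform="ont")))
```

The returned list can be passed to `subprocess.run` with standard output redirected to a SAM file, or piped into `samtools sort` in a real pipeline.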

Choice of variant callers

Variant calling algorithms typically use a number of different methods to identify variants, including:

Variant callers also operate differently based on the experimental design:

There are a number of different variant calling algorithms available, each with its own strengths and weaknesses. Some of the most popular include GATK, samtools, bcftools, and VarScan. The more recently developed DRAGEN, DeepVariant, and Sentieon have been reported to perform better and are rapidly growing in popularity.

Benchmarking variant calling is a challenge because ground-truth samples are not readily available. [Genome in a Bottle](https://www.nist.gov/programs-projects/genome-bottle) provides a frequently used standard set of reference files for benchmarking. In this benchmark, Sentieon, which internally uses an optimized GATK for variant calling, is a top performer. However, real cancer samples are heterogeneous and complex. Low-frequency variants are hard to detect when they first arise, but they have been shown repeatedly to undergo clonal expansion in later stages of cancer or under significant environmental pressure.
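To make the idea of benchmarking against a truth set concrete, the sketch below scores a call set by exact matching on (chromosome, position, ref, alt) tuples and reports precision, recall, and F1. The function name and the toy variants are made up for illustration. Exact tuple matching is a simplification: dedicated comparison tools such as hap.py also reconcile different representations of the same variant and restrict the comparison to confident regions.

```python
def benchmark_calls(truth, calls):
    """Compare called variants with a truth set by exact tuple matching.

    Both arguments are iterables of (chrom, pos, ref, alt) tuples.
    Simplified sketch: no variant normalization, no confident-region filtering.
    """
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but missed by the caller
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy data, invented for this example.
    truth = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")]
    calls = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 75, "T", "C")]
    print(benchmark_calls(truth, calls))
```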

Choice of variant annotators

Once variants have been called, they need to be filtered and annotated. Filtering is the process of removing variants that are likely to be false positives. Annotation is the process of assigning meaning to variants, such as whether they are known to be pathogenic, and it is needed to understand their biological significance.
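A simple hard filter of the kind described here might look like the following sketch, which keeps a VCF record only if its QUAL value and its INFO `DP` (read depth) value meet minimum thresholds. The function name and the thresholds of 30 and 10 are illustrative assumptions, not recommendations; suitable cutoffs depend on the caller, the coverage, and the study design.

```python
def passes_filters(vcf_line, min_qual=30.0, min_depth=10):
    """Return True if a tab-separated VCF data line passes basic hard filters.

    Assumes the standard column order:
    CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
    Records with a missing QUAL (".") or no DP entry are rejected.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]
    # INFO is a semicolon-separated list; flag entries without "=" are skipped.
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    if qual == "." or "DP" not in info:
        return False
    return float(qual) >= min_qual and int(info["DP"]) >= min_depth


if __name__ == "__main__":
    # Toy records, invented for this example.
    records = [
        "chr1\t100\t.\tA\tG\t55.2\t.\tDP=42;AF=0.5",
        "chr1\t200\t.\tC\tT\t12.0\t.\tDP=35",
        "chr2\t50\t.\tG\tA\t80.0\t.\tDP=4",
    ]
    kept = [r for r in records if passes_filters(r)]
    print(f"kept {len(kept)} of {len(records)} records")
```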

VEP, ANNOVAR, and SnpEff are commonly used variant annotation tools. OpenCRAVAT (https://opencravat.org/) was developed in Rachel Karchin’s group at Johns Hopkins. Compared with similar tools, its distinguishing feature is its ability to access and integrate a very wide range of data resources and computational prediction methods, encompassing germline, somatic, common, rare, coding, and non-coding variants.

“It was designed to have better annotations for somatic variants in cancer than standard variant annotators. It runs on the command line, but also produces interactive variant dashboards/reports that are shareable.” – Author of OpenCRAVAT, Collin Tokheim. OpenCRAVAT includes databases such as GTEx, ClinVar, COSMIC, gnomAD, CIViC, 1000 Genomes, and many others.

Variant calling is a complex and challenging task, but it is essential for the analysis of WGS and WES data. By identifying genetic variants, researchers can gain valuable insights into the causes of disease and develop new treatments. Watershed has implemented many variant calling tools and pipelines, including GATK best practices for germline and somatic variant calling that are ready to plug in for your data analysis. Their platform can help you rapidly obtain confident variant calls.

At Watershed, customer use cases include:

The Watershed Bioinformatics team has extensive experience in WGS/WES data analysis. They can assist you with the downstream interpretation. Watershed has been a trusted partner with many biotech companies, including SalioGen, CargoTx, and Benchling.

Read Dr. Tang's original article here.

Dr. Tommy Tang
Director of Computational Biology, Immunitas