GATK Somatic Short Variant Discovery (SNVs and Indels)

Yuwei BaoOctober 2, 2022

Here are some notes about calling somatic mutations using GATK mutect2.

Reference - GATK tutorials:

GATK Somatic short variant discovery (SNVs + Indels)open in new window

Version Iopen in new window

Version IIopen in new window

Relatedopen in new window

1. Call somatic short variants and generate a bamout with Mutect2

gatk --java-options "-Xmx2g" Mutect2 \
    -R hg38/Homo_sapiens_assembly38.fasta \
    -I tumor.bam \
    -I normal.bam \
    -tumor HCC1143_tumor \
    -normal HCC1143_normal \
    -pon resources/chr17_pon.vcf.gz \
    --germline-resource resources/chr17_af-only-gnomad_grch38.vcf.gz \
    --af-of-alleles-not-in-resource 0.0000025 \
    --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
    -L chr17plus.interval_list \
    -O 1_somatic_m2.vcf.gz \
    -bamout 2_tumor_normal_m2.bam

--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter

We disable a specific read filter --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter since this filter removes from analysis paired reads whose mate maps to a different contig. Otherwise, we may miss out on detecting SNVs and indels associated with alternate haplotypes.

This filter removes around 8-9% sample reads from the full data [1]

2. Create a sites-only PoN with CreateSomaticPanelOfNormals

  1. First, run Mutect2 in tumor-only mode on each normal sample.
gatk Mutect2 \
    -R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta \
    -I HG00190.bam \
    -tumor HG00190 \
    --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
    -L chr17plus.interval_list \
    -O 3_HG00190.vcf.gz

Comments on select parameters

  • One option that is not used here is to include a germline resource with --germline-resource.
  • An optional parameter --max-population-af (default 0.01) defines the cutoff for allele frequencies.
  1. Second, collate all the normal VCFs into a single callset with CreateSomaticPanelOfNormals.
gatk CreateSomaticPanelOfNormals \
    -vcfs 3_HG00190.vcf.gz \
    -vcfs 4_NA19771.vcf.gz \
    -vcfs 5_HG02759.vcf.gz \
    -O 6_threesamplepon.vcf.gz

3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination.

  1. First, run GetPileupSummaries on the tumor BAM to summarize read support for a set number of known variant sites.
gatk GetPileupSummaries \
    -I tumor.bam \
    -V resources/chr17_small_exac_common_3_grch38.vcf.gz \
    -O 7_tumor_getpileupsummaries.table
  1. Second, estimate contamination with CalculateContamination
gatk CalculateContamination \
    -I 7_tumor_getpileupsummaries.table \
    -O 8_tumor_calculatecontamination.table

Note: CalculateContamination can operate in two modes. The alternate mode incorporates the metrics for the matched normal, to enable a potentially more accurate estimate.

What if I find high levels of contamination?

Picard’s CrosscheckFingerprintsopen in new window

This tool allows us to

  1. Check at the sample level that our tumor and normal are related, as it is imperative they should come from the same individual
  2. Check at the read group level that each of the read group data come from the same individual.

4. Filter for confident somatic calls using FilterMutectCalls

Filter the Mutect2 callset with FilterMutectCalls.

gatk FilterMutectCalls \
    -V somatic_m2.vcf.gz \
    --contamination-table tumor_calculatecontamination.table \
    -O 9_somatic_oncefiltered.vcf.gz

5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias

  1. First, collect metrics on sequence context artifacts with CollectSequencingArtifactMetrics.
gatk CollectSequencingArtifactMetrics \
    -I tumor.bam \
    -O 10_tumor_artifact \
    –-FILE_EXTENSION ".txt" \
    -R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

Results provide a global view across the genome that empowers decision making in ways that site-specific analyses cannot. The metrics can help decide whether to consider downstream filtering.

  1. Second, perform orientation bias filtering with FilterByOrientationBias
gatk FilterByOrientationBias \
    -A G/T \
    -A C/T \
    -V 9_somatic_oncefiltered.vcf.gz \
    -P tumor_artifact.pre_adapter_detail_metrics.txt \
    -O 11_somatic_twicefiltered.vcf.gz

6. Set up in IGV to review somatic calls

  1. First, load Human (hg38) as the reference in IGV. Then load these six files in order:
resources/chr17_pon.vcf.gz
resources/chr17af-only-gnomadgrch38.vcf.gz
11somatictwicefiltered.vcf.gz
2tumornormal_m2.bam
normal.bam
tumor.bam
  1. Second, navigate IGV to certain locus.
  2. Third, tweak IGV settings that aid in visualizing reassembled alignments.

Another tutorial

  1. Mutect2
gatk Mutect2 -R ref.fasta \
        -L intervals.interval_list \
        -I tumor.bam \
        -germline-resource af-only-gnomad.vcf \
        -pon panel_of_normals.vcf   \
        --f1r2-tar-gz f1r2.tar.gz \
        -O unfiltered.vcf
  1. LearnReadOrientationModel
gatk LearnReadOrientationModel -I f1r2.tar.gz \
        -O read-orientation-model.tar.gz
  1. Run GetPileupSummaries to summarize read support for a set number of known variant sites.
gatk GetPileupSummaries \
    -I tumor.bam \
    -V chr17_small_exac_common_3_grch38.vcf.gz \
    -L chr17_small_exac_common_3_grch38.vcf.gz \
    -O getpileupsummaries.table

where -L variants.vcf when specifying a VCF file containing variant records; their genomic coordinates will be used as intervals. [2] 4. Estimate contamination with CalculateContamination.

gatk CalculateContamination \
        -I getpileupsummaries.table \
        -tumor-segmentation segments.table \
        -O calculatecontamination.table

  1. https://gatk.broadinstitute.org/hc/en-us/articles/360035889791?id=11136open in new window ↩︎

  2. https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-listsopen in new window ↩︎