GATK Somatic Short Variant Discovery (SNVs and Indels)
Here are some notes about calling somatic mutations using GATK mutect2
.
Reference - GATK tutorials:
GATK Somatic short variant discovery (SNVs + Indels)
1. Call somatic short variants and generate a bamout with Mutect2
gatk --java-options "-Xmx2g" Mutect2 \
-R hg38/Homo_sapiens_assembly38.fasta \
-I tumor.bam \
-I normal.bam \
-tumor HCC1143_tumor \
-normal HCC1143_normal \
-pon resources/chr17_pon.vcf.gz \
--germline-resource resources/chr17_af-only-gnomad_grch38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 1_somatic_m2.vcf.gz \
-bamout 2_tumor_normal_m2.bam
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter
We disable a specific read filter --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter
since this filter removes from analysis paired reads whose mate maps to a different contig. Otherwise, we may miss out on detecting SNVs and indels associated with alternate haplotypes.
This filter removes around 8-9% sample reads from the full data [1]
2. Create a sites-only PoN with CreateSomaticPanelOfNormals
- First, run Mutect2 in tumor-only mode on each normal sample.
gatk Mutect2 \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta \
-I HG00190.bam \
-tumor HG00190 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 3_HG00190.vcf.gz
Comments on select parameters
- One option that is not used here is to include a germline resource with
--germline-resource
. - An optional parameter
--max-population-af
(default 0.01) defines the cutoff for allele frequencies.
- Second, collate all the normal VCFs into a single callset with
CreateSomaticPanelOfNormals
.
gatk CreateSomaticPanelOfNormals \
-vcfs 3_HG00190.vcf.gz \
-vcfs 4_NA19771.vcf.gz \
-vcfs 5_HG02759.vcf.gz \
-O 6_threesamplepon.vcf.gz
3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination.
- First, run
GetPileupSummaries
on the tumor BAM to summarize read support for a set number of known variant sites.
gatk GetPileupSummaries \
-I tumor.bam \
-V resources/chr17_small_exac_common_3_grch38.vcf.gz \
-O 7_tumor_getpileupsummaries.table
- Second, estimate contamination with
CalculateContamination
gatk CalculateContamination \
-I 7_tumor_getpileupsummaries.table \
-O 8_tumor_calculatecontamination.table
Note: CalculateContamination can operate in two modes. The alternate mode incorporates the metrics for the matched normal, to enable a potentially more accurate estimate.
What if I find high levels of contamination?
Picard’s CrosscheckFingerprints
This tool allows us to
- Check at the sample level that our tumor and normal are related, as it is imperative they should come from the same individual
- Check at the read group level that each of the read group data come from the same individual.
FilterMutectCalls
4. Filter for confident somatic calls using Filter the Mutect2 callset with FilterMutectCalls.
gatk FilterMutectCalls \
-V somatic_m2.vcf.gz \
--contamination-table tumor_calculatecontamination.table \
-O 9_somatic_oncefiltered.vcf.gz
5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias
- First, collect metrics on sequence context artifacts with
CollectSequencingArtifactMetrics
.
gatk CollectSequencingArtifactMetrics \
-I tumor.bam \
-O 10_tumor_artifact \
–-FILE_EXTENSION ".txt" \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta
Results provide a global view across the genome that empowers decision making in ways that site-specific analyses cannot. The metrics can help decide whether to consider downstream filtering.
- Second, perform orientation bias filtering with
FilterByOrientationBias
gatk FilterByOrientationBias \
-A G/T \
-A C/T \
-V 9_somatic_oncefiltered.vcf.gz \
-P tumor_artifact.pre_adapter_detail_metrics.txt \
-O 11_somatic_twicefiltered.vcf.gz
6. Set up in IGV to review somatic calls
- First, load Human (hg38) as the reference in IGV. Then load these six files in order:
resources/chr17_pon.vcf.gz resources/chr17af-only-gnomadgrch38.vcf.gz 11somatictwicefiltered.vcf.gz 2tumornormal_m2.bam normal.bam tumor.bam
- Second, navigate IGV to certain locus.
- Third, tweak IGV settings that aid in visualizing reassembled alignments.
Another tutorial
Mutect2
gatk Mutect2 -R ref.fasta \
-L intervals.interval_list \
-I tumor.bam \
-germline-resource af-only-gnomad.vcf \
-pon panel_of_normals.vcf \
--f1r2-tar-gz f1r2.tar.gz \
-O unfiltered.vcf
LearnReadOrientationModel
gatk LearnReadOrientationModel -I f1r2.tar.gz \
-O read-orientation-model.tar.gz
- Run
GetPileupSummaries
to summarize read support for a set number of known variant sites.
gatk GetPileupSummaries \
-I tumor.bam \
-V chr17_small_exac_common_3_grch38.vcf.gz \
-L chr17_small_exac_common_3_grch38.vcf.gz \
-O getpileupsummaries.table
where -L variants.vcf
when specifying a VCF file containing variant records; their genomic coordinates will be used as intervals. [2] 4. Estimate contamination with CalculateContamination.
gatk CalculateContamination \
-I getpileupsummaries.table \
-tumor-segmentation segments.table \
-O calculatecontamination.table