Use bwa, samtools for aligning reads to reference genome

Yuwei BaoAugust 25, 2023

The very first step of variant calling analysis is to align sequences.fastq to its reference genome reference_genome.fasta.

This can be done using the bwa [1] then followed by the tool samtools [2]

bwa

For example, to align illumina paired-ends reads to its reference genome using bwa mem algorithm:

bwa mem reference_genome.fasta read1.fastq read2.fastq > output.sam

However, we want to convert sam to bam to save disc space, add additional information, mark duplicates, and index the bam file prior to use variant calling tools. Thus, we use samtools.

samtools

samtools view # Convert SAM to BAM
samtools sort # Sort the BAM files according to their placement in the reference genome
samtools index # Index the BAM files
samtools merge # Merge multiple sorted BAM into a single BAM

Use bwa and samtools together without saving the intermediate SAM files

Reference: Dr. Eric C. Anderson's Bioinformatics Handbook [3]

bwa mem genome.fna R1.fq R2.fq |
  samtools view -u -  |   # convert the SAM output from bwa mem into BAM format
  samtools sort -l 9  - -o output_file.bam  # take stdin as the input, sort it, and write (with the best
                                        # compression possible: -l 9) the output to output_file.bam

Practice with one job in one step

Get the example SAM data:

mkdir -p results/sam
wget https://eriqande.github.io/eca-bioinf-handbook/downloads/s001---1.sam
mv s001---1.sam results/sam/

Convert SAM to BAM:

samtools view -b results/sam/s001---1.sam > results/sam/s001---1.bam

Compare the size of SAM and BAM:

du -h results/sam/*

Read the alignments in a BAM file (without header)

samtools view results/sam/s001---1.bam

Read the alignments in a BAM file (with header)

samtools view -h results/sam/s001---1.bam

Read only the header in a BAM file

samtools view -H results/sam/s001---1.bam

Sort the BAM file in order of their placement in the reference genome

samtools sort -l 9 -o results/sam/s001---1.srt.bam results/sam/s001---1.bam

Check how samtools sort makes changes:

samtools view -H results/sam/s001---1.srt.bam | head    # @HD	VN:1.6	SO:coordinate
samtools view -H results/sam/s001---1.srt.bam | tail -n 10  # @PG line stored the commands

Index a bam:

samtools index results/sam/s001---1.srt.bam     # Create s001---1.srt.bam.bai

Merge multiple sorted BAM files into a single BAM file:

samtools merge [options] output-bam-name.bam  sorted-input-bam-1.bam sorted-input-bam-2.bam ..

Check the mapping statistics on a BAM:

samtools flagstats results/sam/s001---1.srt.bam

Practice with multiple jobs in one step

# remove the .bam .srt.bam files:
rm results/sam/s001---1.bam results/sam/s001---1.srt.bam

# now, directly make a sorted BAM file at results/sam/s001---1.srt.bam
# by piping samtools view output into samtools sort.  Note the use
# of -u for uncompressed BAM output, and the - at the end of the
# line, instead of a file name, to mean take
# input from stdin instead of a file
samtools view -u results/sam/s001---1.sam | \
  samtools sort -l 9 -o results/sam/s001---1.srt.bam -

  1. https://github.com/lh3/bwaopen in new window ↩︎

  2. https://github.com/samtools/samtoolsopen in new window ↩︎

  3. https://eriqande.github.io/eca-bioinf-handbook/alignment-of-sequence-data-to-a-reference-genome-and-associated-steps.html#aligning-reads-with-bwaopen in new window ↩︎