BWA
Introduction
BWA (Burrows-Wheeler Aligner) is a widely used aligner for mapping short reads to a reference genome. It works for both eukaryotic and prokaryotic genomes and remains one of the standard tools in genomics workflows.
Requirements
BWA is available as a module on NCSA servers.
Load the module:
module load bwa
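To confirm that the module actually put bwa on the PATH before submitting a job, a small check along these lines may help. This is only a sketch: the function name require_tool is hypothetical and is not part of BWA or the module system.

```shell
# Hypothetical helper: report whether a command is available on the PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found; try 'module load $1'" >&2
        return 1
    fi
}

# Example call (after 'module load bwa'):
# require_tool bwa
```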
Notes on Resource Allocation
Resource needs depend on the organism being studied:
- Eukaryotes:
Eukaryotic genomes are usually larger, so indexing and alignment may take longer and need more memory.
BWA is not splicing-aware, so it is not suited to aligning RNA-seq reads that span introns.
Indexing (with bwa index) might take a while for large genomes.
- Prokaryotes:
Prokaryotic genomes are relatively small and usually do not require much memory.
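Since the notes above tie resource needs to genome size, it can be useful to measure the reference before choosing a memory and time request. The sketch below counts the bases in a FASTA file. The function name genome_size_bp is hypothetical, and the sketch assumes an uncompressed FASTA file.

```shell
# Hypothetical helper: count the bases in an uncompressed FASTA file.
genome_size_bp() {
    # Drop header lines, strip all whitespace, then count what is left.
    # The final tr removes the padding some wc implementations add.
    grep -v '^>' "$1" | tr -d '[:space:]' | wc -c | tr -d ' '
}

# Example call:
# genome_size_bp ref/ref.fasta
```

A bacterial reference will typically report a few million bases, while a mammalian reference reports billions, which gives a rough sense of which side of the notes above applies.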
Usage
Below is a basic snippet that could be used in a SLURM batch script to run a BWA alignment (here, bwa mem) for paired-end reads. This is a conventional use case for a project that aims to discover genome variants.
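A batch script submitted with sbatch usually begins with a header of #SBATCH directives. The sketch below shows one possible header. Every value in it (job name, memory, time limit) is a placeholder and not a recommendation, and site-specific options such as the partition or account are left out.

```shell
#!/bin/bash
# Placeholder job name
#SBATCH --job-name=bwa_mem
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Number of CPU cores; the -t value given to bwa mem should match this
#SBATCH --cpus-per-task=8
# Placeholder memory and time limits; adjust to the genome size
#SBATCH --mem=16G
#SBATCH --time=04:00:00

# module load bwa   (see Requirements)

# SLURM sets SLURM_CPUS_PER_TASK when --cpus-per-task is given;
# fall back to 1 thread when the script runs outside SLURM.
THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "bwa mem would run with ${THREADS} thread(s)"
```

Passing "$THREADS" to the -t flag, in place of a hard-coded number, keeps the thread count and the CPU request in sync.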
# Path to Working Directory
myWorkDir="/path/to/my/working/directory"
cd "$myWorkDir" || exit 1
# Declare the reference genome for convenience
REF="$myWorkDir/ref/ref.fasta"
# Index the reference genome
## Notes:
## Reference indexing is usually performed just once per project
## Storing the reference path in a variable (i.e., $REF) is just a convenience; the path could also be written directly in the commands.
bwa index "$REF"
# Declare the FASTQ files
## Notes:
## To process several samples in one job, these files could be supplied in a loop
## However, the requested resources should match the number of files being aligned
R1="$myWorkDir/samples/sample1_R1.fastq.gz"
R2="$myWorkDir/samples/sample1_R2.fastq.gz"
## Name of the alignment output for the sample (sam file)
OUT="$myWorkDir/sample1.sam"
## -- Run BWA MEM to align reads
## -- Parallelize the job using the flag -t, matching the --cpus-per-task value.
bwa mem -t 8 "$REF" "$R1" "$R2" > "$OUT"
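The notes in the snippet mention indexing once per project and supplying FASTQ files in a loop. A minimal sketch of both ideas follows. The function names index_if_needed and align_samples are hypothetical. The sketch assumes the _R1.fastq.gz / _R2.fastq.gz naming used above, and that bwa index writes files with the .amb, .ann, .bwt, .pac and .sa suffixes next to the reference.

```shell
# Hypothetical helper: run bwa index only when an index file is missing.
index_if_needed() {
    local ref="$1" ext
    for ext in amb ann bwt pac sa; do
        if [ ! -f "${ref}.${ext}" ]; then
            bwa index "$ref"
            return
        fi
    done
    echo "Index found for ${ref}; skipping bwa index"
}

# Hypothetical helper: align every R1/R2 pair found in a directory,
# one sample after another, writing <sample>.sam into out_dir.
align_samples() {
    local ref="$1" sample_dir="$2" out_dir="$3" threads="$4"
    local r1 r2 sample
    for r1 in "$sample_dir"/*_R1.fastq.gz; do
        # Skip the unexpanded pattern when no files match
        [ -e "$r1" ] || continue
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        sample="$(basename "$r1" _R1.fastq.gz)"
        if [ ! -f "$r2" ]; then
            echo "Skipping ${sample}: no matching R2 file" >&2
            continue
        fi
        bwa mem -t "$threads" "$ref" "$r1" "$r2" > "${out_dir}/${sample}.sam"
    done
}

# Example calls, using the variables defined in the snippet above:
# index_if_needed "$REF"
# align_samples "$REF" "$myWorkDir/samples" "$myWorkDir" 8
```

Because the loop aligns samples sequentially, the CPU and memory request stays the same as for a single sample, and only the time limit needs to grow with the number of samples.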
References
Li H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997.