
Giang Nguyen
02 May 2026
We show that a minimal tool set of fastp, bwa-mem2, and DeepVariant matches the results of nf-core/sarek on the HG002 GIAB dataset, with full profile support.
nf-core/sarek is widely regarded as a leading pipeline for short-read variant calling across both germline and somatic workflows, backed by a large and active community. However, it does come with several limitations:
Therefore, we developed the nf-germline-short-read-variant-calling pipeline specifically to support large-scale population cohort construction for biobank projects, optimized for germline variant calling from ~30× coverage short-read sequencing data.
Key Features
For benchmarking results: https://github.com/gianglabs/nf-germline-short-read-variant-calling/tree/main/benchmark
Both pipelines achieve excellent SNP and INDEL accuracy with DeepVariant. The nf-germline-short-read pipeline performs comparably to nf-core/sarek while maintaining a simpler, streamlined workflow optimized specifically for germline variant calling.
This pipeline is specifically optimized for SNP and small INDEL detection using DeepVariant with preprocessing skipped. Benchmark results against HG002 (Genome in a Bottle) demonstrate competitive performance with nf-core/sarek:
SNP:

| Pipeline | Recall | Precision | F1 Score | TP | FN |
|---|---|---|---|---|---|
| nf-germline-short-read-variant-calling | 99.39% | 99.82% | 99.60% | 3,344,672 | 20,455 |
| nf-core/sarek | 99.39% | 99.84% | 99.61% | 3,344,549 | 20,578 |

INDEL:

| Pipeline | Recall | Precision | F1 Score | TP | FN |
|---|---|---|---|---|---|
| nf-germline-short-read-variant-calling | 98.78% | 99.38% | 99.08% | 519,079 | 6,390 |
| nf-core/sarek | 98.97% | 99.46% | 99.21% | 520,048 | 5,421 |
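The Recall and F1 columns follow from the standard definitions, so they can be recomputed from the TP and FN counts. The sketch below does this for the first row of the first table; the function names are illustrative and not part of the pipeline. Precision is taken as the rounded value shown in the table (FP counts are not listed), so the recomputed F1 may differ from the reported one in the last digit.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of truth variants that were called: TP / (TP + FN)."""
    return tp / (tp + fn)


def f1_score(precision: float, recall_value: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall_value / (precision + recall_value)


# TP, FN, and reported precision copied from the first row of the first table.
tp, fn, reported_precision = 3_344_672, 20_455, 0.9982

r = recall(tp, fn)
print(f"recall: {r:.2%}")
print(f"F1:     {f1_score(reported_precision, r):.2%}")
```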
HG002 Benchmarking Results:
Default configuration uses:
pixi run nextflow run main.nf -profile docker -resume
Create a CSV samplesheet with your input. The pipeline supports three input modes:
FASTQ Input (Full Pipeline)
sample,lane,fastq_1,fastq_2
HG002,L001,/path/to/HG002_R1.fastq.gz,/path/to/HG002_R2.fastq.gz
HG003,L001,/path/to/HG003_R1.fastq.gz,/path/to/HG003_R2.fastq.gz
BAM Input (Skip Alignment)
sample,lane,bam,bai
HG002,L001,/path/to/HG002.bam,/path/to/HG002.bam.bai
HG003,L001,/path/to/HG003.bam,/path/to/HG003.bam.bai
CRAM Input (Skip Alignment + Auto-Convert)
sample,lane,cram,crai
HG002,L001,/path/to/HG002.cram,/path/to/HG002.cram.crai
HG003,L001,/path/to/HG003.cram,/path/to/HG003.cram.crai
CRAM Benefits:
Samplesheet Columns:
- sample: Sample identifier
- lane: Sequencing lane (if multiple lanes, create separate rows per lane)
- fastq_1, fastq_2 (gzipped FASTQ files)
- bam, bai (aligned BAM + index)
- cram, crai (compressed alignment + index)

It will automatically detect the input format (FASTQ, BAM, or CRAM) to run the appropriate steps:
nextflow run main.nf \
--input samplesheet.csv \
-profile docker \
-resume
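One way to picture the format detection is as a check on the samplesheet header. The sketch below is an assumption about how such a check could work, not the pipeline's actual implementation; `MODE_COLUMNS` and `detect_input_mode` are hypothetical names, and the column sets are taken from the three samplesheet examples above.

```python
import csv
import io

# Hypothetical mapping from input mode to the columns that identify it.
# The pipeline's real detection logic may differ.
MODE_COLUMNS = {
    "fastq": ("fastq_1", "fastq_2"),
    "bam": ("bam", "bai"),
    "cram": ("cram", "crai"),
}


def detect_input_mode(samplesheet_text: str) -> str:
    """Return 'fastq', 'bam', or 'cram' based on the CSV header row."""
    reader = csv.DictReader(io.StringIO(samplesheet_text))
    header = set(reader.fieldnames or [])
    if not {"sample", "lane"} <= header:
        raise ValueError("samplesheet needs 'sample' and 'lane' columns")
    # Keep every mode whose identifying columns are all present.
    matches = [mode for mode, cols in MODE_COLUMNS.items() if set(cols) <= header]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one input mode, found {matches}")
    return matches[0]


sheet = "sample,lane,cram,crai\nHG002,L001,/path/to/HG002.cram,/path/to/HG002.cram.crai\n"
print(detect_input_mode(sheet))
```

Rejecting a header that matches zero or several modes keeps mixed samplesheets from being silently routed to the wrong steps.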
Advanced Options
# With multiple SV callers
nextflow run main.nf \
--input samplesheet_cram.csv \
--structural_variant_caller "manta,delly,lumpy" \
-profile docker \
-resume
# Skip annotation for faster processing
nextflow run main.nf \
--input samplesheet.csv \
--skip_annotation \
-profile docker \
-resume
# Use alternative variant caller (GATK or FreeBayes)
nextflow run main.nf \
--input samplesheet.csv \
--small_variant_caller gatk \
-profile docker \
-resume
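When launching many runs from a script, it can help to assemble the command programmatically. The sketch below is a hypothetical helper (`build_command` is not part of the pipeline) that uses only the parameters shown in the commands above. It keeps Nextflow's own options (`-profile`, `-resume`) on a single dash and pipeline parameters on a double dash.

```python
import shlex


def build_command(input_csv, profile="docker", resume=True, **params):
    """Build a nextflow argv list; keyword arguments become --name value pairs."""
    cmd = ["nextflow", "run", "main.nf", "-profile", profile]
    if resume:
        cmd.append("-resume")
    cmd += ["--input", input_csv]
    for name, value in params.items():
        if value is True:
            # Boolean switches such as --skip_annotation take no value.
            cmd.append(f"--{name}")
        elif value is not False and value is not None:
            cmd += [f"--{name}", str(value)]
    return cmd


print(shlex.join(build_command("samplesheet.csv", small_variant_caller="gatk")))
```

Returning a list rather than a string lets the result be passed directly to `subprocess.run` without shell quoting issues.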
For test mode with sample data:
nextflow run main.nf -profile docker,test -resume
Output files will be generated in the results/ directory. File structure depends on input mode:
FASTQ Input Results:
- results/alignment/*.bam - Aligned BAM files
- results/alignment_qc/ - Alignment quality reports
- results/variant_calling/*.vcf.gz - Raw variant calls
- results/variant_annotation/*.vcf - Annotated variants

CRAM Input Results:

- results/variant_calling/*.vcf.gz - Raw variant calls (from converted BAM)
- results/variant_annotation/*.vcf - Annotated variants

Common output files:

- results/multiqc_report.html - Interactive quality control report
- results/pipeline_info/ - Execution timeline and trace logs

For more advanced usage and configuration options, see the Pipeline Architecture documentation.
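After a run, the output layout listed above can be checked with a short script. This is a minimal sketch: `EXPECTED`, `COMMON`, and `missing_outputs` are hypothetical names, and the patterns are copied from the lists above. BAM-input outputs are not listed in this document, so they are left out, and runs started with `--skip_annotation` would presumably lack the annotation files.

```python
from pathlib import Path

# Output patterns per input mode, relative to the results directory.
EXPECTED = {
    "fastq": [
        "alignment/*.bam",
        "alignment_qc",
        "variant_calling/*.vcf.gz",
        "variant_annotation/*.vcf",
    ],
    "cram": [
        "variant_calling/*.vcf.gz",
        "variant_annotation/*.vcf",
    ],
}
COMMON = ["multiqc_report.html", "pipeline_info"]


def missing_outputs(results_dir, mode):
    """Return the expected patterns that match nothing under results_dir."""
    root = Path(results_dir)
    patterns = EXPECTED[mode] + COMMON
    return [p for p in patterns if not any(root.glob(p))]


print(missing_outputs("results", "fastq"))
```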