
Giang Nguyen
03 May 2026
We use DeepSomatic as one of the somatic variant callers in this pipeline. DeepSomatic is a deep-learning method for detecting somatic small nucleotide variations and insertions and deletions from both short-read and long-read data. Its authors report that it consistently outperforms existing callers across samples and sequencing technologies.
Here, we adopt DeepSomatic as the primary engine for somatic variant calling and reuse proven modules from earlier pipelines for the remaining steps.
Reference: DeepSomatic: Accurate somatic variant calling with deep learning
nf-core/sarek is widely regarded as a leading pipeline for short-read variant calling across both germline and somatic workflows, backed by a large and active community. However, it does come with several limitations:
Therefore, we developed the nf-somatic-short-read-variant-calling pipeline specifically to support large-scale somatic variant calling from tumor-normal paired short-read sequencing data.
Key Features
DeepSomatic is a deep-learning method for detecting somatic small nucleotide variations and insertions and deletions from both short-read and long-read data. The method has modes for whole-genome and whole-exome sequencing and can run on tumor–normal, tumor-only and formalin-fixed paraffin-embedded samples.
Key advantages:
GATK Mutect2 is a widely used somatic variant caller that detects somatic variants through local assembly. It is commonly regarded as a gold standard for short-read somatic variant calling.
Strelka is a fast somatic variant caller optimized for small variant detection in tumor-normal samples.
Default configuration uses:
pixi run nextflow run main.nf -profile docker -resume
Create a CSV samplesheet with your input. The pipeline requires tumor-normal paired samples and supports three input modes:
- FASTQ Input (Full Pipeline)
patient,sex,status,sample,lane,fastq_1,fastq_2
P001,M,0,P001_N,L001,/path/to/P001_N_R1.fastq.gz,/path/to/P001_N_R2.fastq.gz
P001,M,1,P001_T,L001,/path/to/P001_T_R1.fastq.gz,/path/to/P001_T_R2.fastq.gz
- BAM Input (Skip Alignment)
patient,sex,status,sample,lane,bam,bai
P001,M,0,P001_N,L001,/path/to/P001_N.bam,/path/to/P001_N.bam.bai
P001,M,1,P001_T,L001,/path/to/P001_T.bam,/path/to/P001_T.bam.bai
- CRAM Input (Skip Alignment + Auto-Convert)
patient,sex,status,sample,lane,cram,crai
P001,M,0,P001_N,L001,/path/to/P001_N.cram,/path/to/P001_N.cram.crai
P001,M,1,P001_T,L001,/path/to/P001_T.cram,/path/to/P001_T.cram.crai
CRAM Benefits:
Samplesheet Columns:
- patient: Patient identifier (same for normal/tumor pairs)
- sex: Sex of the patient (M/F)
- status: Sample type (0 = normal, 1 = tumor)
- sample: Sample identifier
- lane: Sequencing lane (optional, defaults to L001)
- fastq_1, fastq_2 (gzipped FASTQ files)
- bam, bai (aligned BAM + index)
- cram, crai (compressed alignment + index)

It will automatically detect the input format (FASTQ, BAM, or CRAM) to run the appropriate steps:
nextflow run main.nf \
--input samplesheet.csv \
-profile docker \
-resume
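A samplesheet can be sanity-checked before launching a run. The following is a minimal sketch, assuming the column layout documented above; the function names `detect_input_mode` and `check_pairs` are hypothetical helpers and are not part of the pipeline, whose own detection logic may differ.

```bash
#!/usr/bin/env bash
# Sketch only: header-based input mode detection and a tumor-normal pairing
# check. Both helpers are hypothetical and not taken from the pipeline code.

# Print FASTQ, BAM, or CRAM depending on the columns named in the header row.
detect_input_mode() {
    local header
    header=$(head -n 1 "$1" | tr -d '\r')
    case ",$header," in
        *,fastq_1,*) echo "FASTQ" ;;
        *,bam,*)     echo "BAM" ;;
        *,cram,*)    echo "CRAM" ;;
        *)           echo "UNKNOWN"; return 1 ;;
    esac
}

# List every patient that lacks a normal (status 0) or a tumor (status 1) row.
# Returns non-zero if any patient is unpaired.
check_pairs() {
    awk -F',' '
        { gsub(/\r/, "") }                      # tolerate Windows line endings
        $0 == "" { next }                       # skip blank lines
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i   # map column name to index
            next
        }
        {
            patient = $(col["patient"])
            status  = $(col["status"])
            seen[patient] = 1
            if (status == "0") normal[patient] = 1
            if (status == "1") tumor[patient] = 1
        }
        END {
            bad = 0
            for (p in seen) {
                if (!(p in normal) || !(p in tumor)) {
                    print "unpaired patient: " p
                    bad = 1
                }
            }
            exit bad
        }
    ' "$1"
}

# Example (run manually against your own file):
#   detect_input_mode samplesheet.csv && check_pairs samplesheet.csv
```

Looking up columns by name rather than by position keeps the check valid for all three header variants, since only the last two columns change between input modes.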
Advanced Options
# Use DeepSomatic with WGS model
nextflow run main.nf \
--input samplesheet.csv \
--small_variant_caller deepsomatic \
--deepsomatic_model_type WGS \
-profile docker \
-resume
# Use Strelka instead of Mutect2
nextflow run main.nf \
--input samplesheet.csv \
--small_variant_caller strelka \
-profile docker \
-resume
# With Manta for structural variants
nextflow run main.nf \
--input samplesheet.csv \
--structural_variant_caller manta \
-profile docker \
-resume
# Skip annotation for faster processing
nextflow run main.nf \
--input samplesheet.csv \
--skip_annotation \
-profile docker \
-resume
# Enable alignment preprocessing (e.g., BQSR)
nextflow run main.nf \
--input samplesheet.csv \
--preprocessor gatk \
-profile docker \
-resume
For test mode with sample data:
nextflow run main.nf -profile docker,test -resume
Output files will be generated in the results/ directory. File structure depends on input mode:
- FASTQ Input Results:
  - results/alignment/*.bam - Aligned BAM files
  - results/variant_calling/*.vcf.gz - Raw variant calls
  - results/variant_annotation/*.vcf - Annotated variants
  - results/qc/ - Quality metrics (variant stats, coverage bedgraph)
- CRAM Input Results:
  - results/variant_calling/*.vcf.gz - Raw variant calls (from converted BAM)
  - results/variant_annotation/*.vcf - Annotated variants
  - results/qc/ - Quality metrics (variant stats, coverage bedgraph)
- Common output files:
  - results/multiqc_report.html - Interactive quality control report
  - results/pipeline_info/ - Execution timeline and trace logs

For more advanced usage and configuration options, see the Pipeline Architecture documentation.
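After a run finishes, the documented output layout can be checked with a short script. This is a sketch under the assumption that the results directory follows the paths listed above; `check_outputs` is a hypothetical helper, and a run started with `--skip_annotation` would presumably report the annotation directory as missing.

```bash
#!/usr/bin/env bash
# Sketch only: report which documented outputs exist under a results directory.
# check_outputs is a hypothetical helper, not part of the pipeline.
check_outputs() {
    local dir="${1:-results}"
    local missing=0
    local path
    # Entries listed for both FASTQ and CRAM input, plus the common outputs.
    for path in multiqc_report.html pipeline_info variant_calling variant_annotation qc; do
        if [ -e "$dir/$path" ]; then
            echo "ok       $path"
        else
            echo "missing  $path"
            missing=1
        fi
    done
    # Variant calling should leave at least one compressed VCF.
    if ! ls "$dir"/variant_calling/*.vcf.gz >/dev/null 2>&1; then
        echo "missing  variant_calling/*.vcf.gz"
        missing=1
    fi
    return "$missing"
}

# Example (run manually after the pipeline completes):
#   check_outputs results
```

The function returns a non-zero status when anything is missing, so it can gate a downstream step in a wrapper script.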