A comprehensive Snakemake pipeline for calling somatic mutations from single-cell RNA sequencing data using SComatic.
This pipeline processes single-cell RNA-seq BAM files to identify somatic mutations by:
- Splitting BAM files by cell type and chromosome
- Counting bases at each genomic position
- Calling variants using statistical methods
- Filtering out germline variants and technical artifacts
- Genotyping individual cells for identified variants
- Conda/Mamba for environment management
- Snakemake (≥7.0) with SLURM executor plugin
- Python 3.7+
- R 4.2+ with Bioconductor packages
- Access to reference genome files (hg38)
- SComatic repository (see below)
git clone https://github.com/MorganResearchLab/SomaticMutationPipeline.git
cd SomaticMutationPipeline
# Create the main environment
conda env create -f environment.yaml
# Activate the environment
conda activate somatic-mutation-pipeline
NB: Select the executor plugin appropriate to your HPC. The example below is for SLURM. A list of all snakemake executor plugins can be found online: https://github.com/snakemake/snakemake-interface-executor-plugins
# Install the SLURM executor plugin for Snakemake
pip install snakemake-executor-plugin-slurmIf you don't have access to a shared SComatic installation, clone it:
git clone https://github.com/cortes-ciriano-lab/SComatic.gitThis is the primary configuration file you need to modify for your analysis:
# Update these paths for your system:
conda_prefix: "/path/to/your/miniforge3/envs" # Path to your conda environments
src_dir: "/path/to/SomaticMutationPipeline/scripts" # Path to this pipeline's scripts
scomatic_dir: "/path/to/SComatic" # Path to SComatic repository
ref_dir: "/path/to/reference/chromosomes" # Directory containing genome (e.g. hg38) chromosome files (chr1.fa.gz, chr2.fa.gz, etc.)
bam_dir: "/path/to/your/bam/files" # Directory containing your input BAM files# Cell type annotation file (TSV format)
cell_type_file: "/path/to/cell_metadata.tsv"
# Required columns: Index (cell barcode), Cell_type (cell type annotation)
# Sample metadata file (CSV format)
sra_metadata: "/path/to/sample_metadata.csv"
# Should contain sample IDs and batch information# Batch handling
batch_position: "None" # Options: "prefix", "suffix", "None"
batch_column: "batch" # Column name in sra_metadata containing batch info
# Cell types to analyze (comma-separated, no spaces)
celltypes: "B,CD4,CD8,Monocytes,NK" # Customize for your data
# Parallel processing
window: 5000000 # Genomic chunk size for parallel BAM splitting (5Mb recommended)
# SComatic filtering parameters (usually don't need to change)
min_cells: 5
min_ac_cells: 2
min_ac_reads: 3
max_cell_types: 6
min_cell_types: 1
fisher_cutoff: 0.001These paths point to SComatic's reference files and typically don't need modification:
panel_of_normals: "/path/to/SComatic/PoNs/PoN.scRNAseq.hg38.tsv"
rna_editing_sites: "/path/to/SComatic/RNAediting/AllEditingSites.hg38.txt"Configure for your HPC cluster:
executor: slurm
jobs: 100 # Maximum concurrent jobs
latency-wait: 250 # Wait time for file system
default-resources:
slurm_partition: "your-partition" # Your SLURM partition
slurm_account: "your-account" # Your SLURM account
runtime: "12d" # Default job runtime
mem_mb: 4000 # Default memory
cpus_per_task: 1 # Default CPUs
nodes: 1 # Default nodesThis file contains resource specifications for each pipeline step. You may need to adjust based on your cluster's limits:
{
"split_by_celltype": {
"memory": 12000, # Memory in MB
"time": "95:00:00", # Time limit
"nCPUs": 8 # CPU cores
},
"base_counter": {
"memory": 12000,
"time": "47:00:00"
},
// ... other rules
}- Location: Place in the directory specified by
bam_dir - Format: Standard BAM files with cell barcode (CB) tags
- Naming: Files should be named consistently (e.g.,
Sample1.bam,Sample2.bam)
- Format: Tab-separated values (.tsv)
- Required columns:
Index: Cell barcodes (must match CB tags in BAM files)Cell_type: Cell type annotations- Example for 10X Genomics data:
Index Cell_type AAACCTGAGAAACCAT-1 CD4 AAACCTGAGAAACGAG-1 B AAACCTGAGAAATCCC-1 Monocytes
- Format: Comma-separated values (.csv)
- Purpose: Maps sample names to batch IDs
- Example:
Sample,batch
Sample1,Batch_A
Sample2,Batch_B
- Required: genome build chromosome FASTA files (compressed)
- Format:
chr1.fa.gz,chr2.fa.gz, ...,chr22.fa.gz - Location: Directory specified by
ref_dir
# Test that everything is configured correctly
snakemake -s snakemake.dir/snakefile \
--configfile snakemake.dir/config.yaml \
-n --rerun-triggers mtime
```
### 2. Create Conda Environments (Optional)
Pre-create all required environments:
```bash
snakemake --sdm conda --conda-create-envs-only \
-s snakemake.dir/snakefile \
--configfile snakemake.dir/config.yaml
```
### 3. Execute the Pipeline
```bash
nohup nice -19 snakemake \
-s snakemake.dir/snakefile \
--configfile snakemake.dir/config.yaml \
--sdm conda \
--profile snakeprofile \
--latency-wait 650 \
--slurm-init-seconds-before-status-checks 500 \
-j 25 \
--rerun-triggers mtime \
--rerun-incomplete &
```
### 4. Monitor Progress
```bash
# Watch the pipeline progress
tail -f -n20 nohup.out
# Check SLURM job status
squeue -u $USERThe pipeline creates the following output directories:
results/
├── split_bams/ # Cell-type and chromosome-specific BAM files
├── base_counts/ # Base counts for each position
├── merged_counts/ # Merged counts across cell types
├── variant_calls/ # Initial variant calls
├── filtered_calls/ # Filtered variant calls
├── qc_filtered/ # Quality-filtered variants
├── genotyped/ # Single-cell genotypes
└── germline/ # Final somatic variants (germline filtered)
results/germline/{sample}_{celltype}_not_germline.tsv: Final somatic mutations for each sample and cell typeresults/genotyped/{sample}_{celltype}_{chrom}.single_cell_genotype.tsv: Single-cell genotype data- Log files: Located in
logs/directory for troubleshooting
-
Environment Creation Fails
- Ensure you have conda/mamba installed
- Try using mamba instead of conda for faster environment resolution
- Ensure you have conda/mamba installed
-
BAM Files Not Found
- Verify
bam_dirpath is correct- Check file permissions
- Ensure BAM files have been index, i.e. there is a *bam.bai for each BAM file
- Check file permissions
- Verify
-
Reference Files Missing
- Download genome build chromosome files
- Ensure files are compressed (.gz) and properly indexed
- Download genome build chromosome files
-
Memory/Time Limits Exceeded
- Adjust resources in
snakemake.dir/hpc.json- Consider reducing
windowsize for smaller memory usage
- Consider reducing
- Adjust resources in
-
Cell Barcodes Don't Match
- Check cell barcode format in metadata vs BAM files
- Verify batch handling settings (
batch_position,batch_column)
- Verify batch handling settings (
- Check cell barcode format in metadata vs BAM files
Check specific log files for detailed error messages:
logs/{rule_name}/{sample}_{chrom}.log- Main pipeline log:
nohup.out
- Increase
windowsize for fewer, larger parallel jobs - Adjust memory allocations based on your data size
- Consider using faster storage for temporary files
- Increase
jobsparameter in SLURM profile - Use cluster's high-memory nodes for memory-intensive steps
If you use this pipeline, please cite:
- SComatic: Muyas et al. (2023) Nature Genetics, https://doi.org/10.1038/s41587-023-01863-z
For questions and issues:
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.~