A Snakemake workflow for running CRAMboozle on Fred Hutch servers. CRAMboozle is a modified version of BAMboozle that de-identifies alignment data and accepts either BAM or CRAM for both input and output (CRAM to CRAM by default).

CRAMboozle Snakemake Workflow

A Snakemake workflow for de-identifying BAM and CRAM files using CRAMboozle.py.

Overview

This workflow processes multiple BAM/CRAM files in parallel, stripping identifying information from the sequencing data while preserving the alignment data needed for downstream analysis.
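
Conceptually, the per-sample fan-out follows the standard Snakemake pattern sketched below. This is an illustrative sketch only: the rule names, paths, and the CRAMboozle.py invocation are placeholders, not the rules in CRAMboozle.snakefile.

```
# Illustrative Snakemake sketch (not the actual CRAMboozle.snakefile):
# one rule instance per sample, run in parallel up to the -j limit.
SAMPLES = ["patient001", "patient002", "control001"]

rule all:
    input:
        expand("results/{sample}_deidentified.cram", sample=SAMPLES)

rule deidentify:
    input:
        "/path/to/{sample}.bam"
    output:
        cram="results/{sample}_deidentified.cram",
        crai="results/{sample}_deidentified.cram.crai"
    log:
        "results/logs/{sample}_cramboozle.log"
    threads: 8
    shell:
        # Stand-in command; see CRAMboozle.snakefile for the real invocation.
        "python CRAMboozle.py {input} {output.cram} &> {log}"
```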

Requirements

  • Python 3.10+
  • Snakemake
  • pysam (0.21.0+ for CRAM v3.1 support)
  • Reference genome FASTA file (indexed with samtools faidx)
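
A quick way to sanity-check the version requirements above is to compare dotted version strings numerically rather than lexically (so that, for example, 0.9 does not sort above 0.21). The helper below is a standalone illustration, not part of the workflow:

```python
# Compare dotted version strings as integer tuples so that
# "0.21.0" correctly sorts above "0.9.2".

def version_tuple(version):
    """Convert a dotted version string like '0.21.0' to (0, 21, 0)."""
    return tuple(int(part) for part in version.split(".")[:3])

def meets_minimum(installed, minimum):
    """True if the installed version is at least the required minimum."""
    return version_tuple(installed) >= version_tuple(minimum)

# pysam 0.21.0+ is needed for CRAM v3.1 support.
print(meets_minimum("0.21.0", "0.21.0"))  # True
print(meets_minimum("0.9.2", "0.21.0"))   # False
```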

Setup

1. Configure your samples

Edit config/samples.yaml to specify your samples and their input file paths:

samples:
  patient001: /path/to/patient001.bam
  patient002: /path/to/patient002.cram
  control001: /path/to/control001.bam

Edit config/config.yaml to specify:

  • Reference genome location
  • Output directory
  • Processing options
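
A config/config.yaml covering these settings might look like the sketch below. The option names match those documented under Configuration Options; the reference/output key names and all values are illustrative guesses, so consult the config/config.yaml shipped with the repository for the exact schema.

```yaml
# Illustrative config.yaml sketch -- check the repository's
# config/config.yaml for the actual key names.
reference: /path/to/reference.fasta   # must be indexed with samtools faidx
outdir: results
strict_mode: false      # additional tag sanitization
keep_unmapped: false    # keep unmapped reads in output
keep_secondary: false   # keep secondary alignments
cramboozle:
  ncpus: 8              # CPUs per job
```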

2. Configure cluster settings (Fred Hutch)

The config/cluster_slurm.yaml is pre-configured for Fred Hutch SLURM cluster. Modify if using a different cluster system.
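
For a different SLURM cluster, the file needs to define the fields referenced by the {cluster.*} placeholders in the sbatch command shown under Cluster execution. The field names below come from those placeholders; the __default__ convention is standard Snakemake cluster config, and all values are illustrative:

```yaml
# Illustrative cluster_slurm.yaml sketch; field names mirror the
# {cluster.*} placeholders in the sbatch command, values are examples.
__default__:
  partition: campus-new
  mem: 16G
  time: "24:00:00"
  ncpus: 8
  ntasks: 1
  output: results/logs/slurm-%j.out
  JobName: cramboozle
```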

Running the Workflow

Local execution (small datasets)

snakemake -s CRAMboozle.snakefile -j 4

Cluster execution (Fred Hutch)

# Load required modules
ml snakemake/7.32.3-foss-2022b
ml Python/3.10.8-GCCcore-12.2.0
ml Pysam/0.21.0-GCC-12.2.0

# Run workflow
snakemake -s CRAMboozle.snakefile \
    --latency-wait 60 \
    --keep-going \
    --cluster-config config/cluster_slurm.yaml \
    --cluster "sbatch -p {cluster.partition} --mem={cluster.mem} -t {cluster.time} -c {cluster.ncpus} -n {cluster.ntasks} -o {cluster.output} -J {cluster.JobName}" \
    -j 40

Dry run (recommended first)

Add the -np flags (-n for a dry run, -p to print the shell commands) to any command above to see what would be executed without running it.

Output Files

For each sample, the workflow generates:

  • {sample}_deidentified.cram - De-identified CRAM file
  • {sample}_deidentified.cram.crai - CRAM index file
  • logs/{sample}_cramboozle.log - Processing log
  • CRAMboozle_summary.txt - Summary of all processed files
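
The per-sample naming convention can be expressed as a small helper (hypothetical, not part of the workflow) that lists the files to expect for a given sample:

```python
def expected_outputs(sample, outdir="results"):
    """Return the per-sample files the workflow is documented to produce.

    The 'results' default and the log location mirror the example
    directory structure below; the configured output directory may differ.
    """
    return [
        f"{outdir}/{sample}_deidentified.cram",
        f"{outdir}/{sample}_deidentified.cram.crai",
        f"{outdir}/logs/{sample}_cramboozle.log",
    ]

print(expected_outputs("patient001"))
```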

Configuration Options

In config/config.yaml:

  • strict_mode: Enable additional tag sanitization (default: false)
  • keep_unmapped: Keep unmapped reads in output (default: false)
  • keep_secondary: Keep secondary alignments (default: false)
  • cramboozle.ncpus: CPUs per job (default: 8, auto-detected if available)

File Formats

  • Input: BAM or CRAM files (auto-detected by extension)
  • Output: CRAM files by default (smaller than BAM thanks to reference-based compression)
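
Extension-based auto-detection amounts to checking the file suffix, as in the sketch below. CRAMboozle's actual detection logic may differ; this function only illustrates the idea:

```python
def detect_format(path):
    """Guess the alignment-file format from the file extension.

    Case-insensitive, so 'sample.CRAM' is treated the same as
    'sample.cram'. Raises ValueError for unrecognized extensions.
    """
    lowered = path.lower()
    if lowered.endswith(".cram"):
        return "cram"
    if lowered.endswith(".bam"):
        return "bam"
    raise ValueError(f"unrecognized alignment format: {path}")

print(detect_format("patient001.bam"))   # bam
print(detect_format("control001.CRAM"))  # cram
```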

Monitoring

  • Check cluster job status: squeue -u $USER
  • View logs in: results/logs/
  • Monitor progress: snakemake -s CRAMboozle.snakefile --summary

Troubleshooting

  1. Missing reference: Ensure reference genome FASTA is indexed

    samtools faidx /path/to/reference.fasta
  2. Permission errors: Check file permissions and paths

  3. Memory issues: Increase memory in cluster_slurm.yaml

  4. Failed jobs: Check individual log files in results/logs/

Example Directory Structure

CRAMboozle/
├── CRAMboozle.py
├── CRAMboozle.snakefile
├── config/
│   ├── config.yaml
│   └── cluster_slurm.yaml
└── results/
    ├── sample1_deidentified.cram
    ├── sample1_deidentified.cram.crai
    ├── logs/
    └── CRAMboozle_summary.txt
