CRAMboozle Snakemake Workflow

A Snakemake workflow for de-identifying BAM and CRAM files using CRAMboozle.py.

Overview

This workflow processes multiple BAM/CRAM files in parallel, de-identifying sequencing data by removing identifying information while preserving the essential alignment data for analysis.

Requirements

Python 3.10+
Snakemake
pysam (0.21.0+ for CRAM v3.1 support)
Reference genome FASTA file (indexed with samtools faidx)

Setup

1. Configure your samples

Edit config/samples.yaml to specify your samples and their input file paths:

samples:
  patient001: /path/to/patient001.bam
  patient002: /path/to/patient002.cram
  control001: /path/to/control001.bam
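
Before launching jobs, it can help to confirm that every listed input exists and is readable. The sketch below is not part of the workflow; it mirrors the samples mapping above as a plain Python dict, and the helper name unreadable_inputs is hypothetical.

```python
import os

# Mirrors the mapping in config/samples.yaml as a plain dict (placeholder paths).
samples = {
    "patient001": "/path/to/patient001.bam",
    "patient002": "/path/to/patient002.cram",
    "control001": "/path/to/control001.bam",
}


def unreadable_inputs(samples):
    """Return the sorted sample names whose input file is missing or unreadable."""
    return sorted(
        name
        for name, path in samples.items()
        if not (os.path.isfile(path) and os.access(path, os.R_OK))
    )


# Report any sample that would fail at the first rule.
for name in unreadable_inputs(samples):
    print(f"cannot read input for sample {name}: {samples[name]}")
```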

Edit config/config.yaml to specify:

Reference genome location
Output directory
Processing options

2. Configure cluster settings (Fred Hutch)

The config/cluster_slurm.yaml file is pre-configured for the Fred Hutch SLURM cluster. Modify it if you are using a different cluster system.

Running the Workflow

Local execution (small datasets)

snakemake -s CRAMboozle.snakefile -j 4

Cluster execution (Fred Hutch)

# Load required modules
ml snakemake/7.32.3-foss-2022b
ml Python/3.10.8-GCCcore-12.2.0
ml Pysam/0.21.0-GCC-12.2.0

# Run workflow
snakemake -s CRAMboozle.snakefile \
    --latency-wait 60 \
    --keep-going \
    --cluster-config config/cluster_slurm.yaml \
    --cluster "sbatch -p {cluster.partition} --mem={cluster.mem} -t {cluster.time} -c {cluster.ncpus} -n {cluster.ntasks} -o {cluster.output} -J {cluster.JobName}" \
    -j 40
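
To illustrate how the placeholders in the --cluster string are expanded per job, the sketch below fills the same template with Python's str.format. The resource values are hypothetical placeholders, not the contents of config/cluster_slurm.yaml, and Snakemake's own substitution may differ in detail.

```python
from types import SimpleNamespace

# Hypothetical settings; the real values come from config/cluster_slurm.yaml.
cluster = SimpleNamespace(
    partition="example-partition",
    mem="16G",
    time="12:00:00",
    ncpus=8,
    ntasks=1,
    output="logs/slurm-%j.out",
    JobName="cramboozle",
)

# Same template as the --cluster argument in the command above.
TEMPLATE = (
    "sbatch -p {cluster.partition} --mem={cluster.mem} -t {cluster.time} "
    "-c {cluster.ncpus} -n {cluster.ntasks} -o {cluster.output} -J {cluster.JobName}"
)


def render_submit_command(template, cluster):
    """Substitute {cluster.<key>} placeholders via attribute lookup."""
    return template.format(cluster=cluster)


print(render_submit_command(TEMPLATE, cluster))
```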

Dry run (recommended first)

Add the -np flag to the end of either command above to see what would be executed without running it.

Output Files

For each sample, the workflow generates:

{sample}_deidentified.cram - De-identified CRAM file
{sample}_deidentified.cram.crai - CRAM index file
logs/{sample}_cramboozle.log - Processing log
CRAMboozle_summary.txt - Summary of all processed files
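
The naming scheme above can be turned into a quick completeness check after a run. This is a minimal sketch, assuming the output directory is results (as in the example directory structure); the helper names expected_outputs and missing_outputs are hypothetical.

```python
from pathlib import Path


def expected_outputs(sample, outdir="results"):
    """Return the per-sample files the workflow is expected to produce."""
    out = Path(outdir)
    return {
        "cram": out / f"{sample}_deidentified.cram",
        "index": out / f"{sample}_deidentified.cram.crai",
        "log": out / "logs" / f"{sample}_cramboozle.log",
    }


def missing_outputs(samples, outdir="results"):
    """List expected output paths that do not exist yet."""
    return [
        path
        for sample in samples
        for path in expected_outputs(sample, outdir).values()
        if not path.exists()
    ]


for path in missing_outputs(["patient001", "patient002", "control001"]):
    print(f"missing: {path}")
```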

Configuration Options

In config/config.yaml:

strict_mode: Enable additional tag sanitization (default: false)
keep_unmapped: Keep unmapped reads in output (default: false)
keep_secondary: Keep secondary alignments (default: false)
cramboozle.ncpus: CPUs per job (default: 8, auto-detected if available)
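
As an illustration of how these defaults combine with user settings, the sketch below merges a partial configuration over the documented default values. It assumes cramboozle.ncpus denotes a key ncpus nested under cramboozle; the helper with_defaults is hypothetical and is not how the Snakefile necessarily handles its config.

```python
# Defaults as documented above; the nesting of "cramboozle" is an assumption.
DEFAULTS = {
    "strict_mode": False,
    "keep_unmapped": False,
    "keep_secondary": False,
    "cramboozle": {"ncpus": 8},
}


def with_defaults(user_config):
    """Return user_config layered over DEFAULTS without mutating either."""
    merged = {
        key: dict(value) if isinstance(value, dict) else value
        for key, value in DEFAULTS.items()
    }
    for key, value in user_config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key].update(value)  # merge nested sections one level deep
        else:
            merged[key] = value
    return merged


# Enable strict mode but keep every other default.
print(with_defaults({"strict_mode": True}))
```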

File Formats

Input: BAM or CRAM files (auto-detected by extension)
Output: CRAM files by default (typically smaller than the equivalent BAM)
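
Extension-based detection can be pictured as follows. This is a minimal sketch of the idea, not the code used by CRAMboozle.py; the function name detect_format and the case-insensitive matching are assumptions.

```python
from pathlib import Path


def detect_format(path):
    """Classify an input file as BAM or CRAM from its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".bam":
        return "BAM"
    if suffix == ".cram":
        return "CRAM"
    raise ValueError(f"unsupported input extension: {path}")


print(detect_format("/path/to/patient001.bam"))
print(detect_format("/path/to/patient002.cram"))
```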

Monitoring

Check cluster job status: squeue -u $USER
View logs in: results/logs/
Monitor progress: snakemake -s CRAMboozle.snakefile --summary

Troubleshooting

Missing reference: Ensure the reference genome FASTA is indexed:
```
samtools faidx /path/to/reference.fasta
```
Permission errors: Check file permissions and paths
Memory issues: Increase memory in cluster_slurm.yaml
Failed jobs: Check individual log files in results/logs/
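
The missing-reference case can be checked up front. samtools faidx writes its index next to the FASTA with a .fai suffix appended, so a sketch like the one below (the helper name reference_is_indexed is hypothetical) can confirm that both files are present before submitting jobs.

```python
from pathlib import Path


def reference_is_indexed(fasta):
    """Return True if the FASTA and its samtools faidx index (.fai) both exist."""
    fasta = Path(fasta)
    index = Path(str(fasta) + ".fai")  # e.g. reference.fasta -> reference.fasta.fai
    return fasta.is_file() and index.is_file()


if not reference_is_indexed("/path/to/reference.fasta"):
    print("reference or its .fai index is missing; run samtools faidx first")
```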

Example Directory Structure

CRAMboozle/
├── CRAMboozle.py
├── CRAMboozle.snakefile
├── config/
│   ├── config.yaml
│   ├── samples.yaml
│   └── cluster_slurm.yaml
└── results/
    ├── sample1_deidentified.cram
    ├── sample1_deidentified.cram.crai
    ├── logs/
    └── CRAMboozle_summary.txt
