A Snakemake workflow for de-identifying BAM and CRAM files using CRAMboozle.py.
This workflow processes multiple BAM/CRAM files in parallel, de-identifying sequencing data by removing identifying information while preserving the essential alignment data for analysis.
- Python 3.10+
- Snakemake
- pysam (0.21.0+ for CRAM v3.1 support)
- Reference genome FASTA file (indexed with samtools faidx)
Edit config/samples.yaml to specify your samples and their input file paths:
samples:
  patient001: /path/to/patient001.bam
  patient002: /path/to/patient002.cram
  control001: /path/to/control001.bam
Edit config/config.yaml to specify:
- Reference genome location
- Output directory
- Processing options
The config/cluster_slurm.yaml file is pre-configured for the Fred Hutch SLURM cluster.
Modify it if you are using a different cluster system.
snakemake -s CRAMboozle.snakefile -j 4
# Load required modules
ml snakemake/7.32.3-foss-2022b
ml Python/3.10.8-GCCcore-12.2.0
ml Pysam/0.21.0-GCC-12.2.0
# Run workflow
snakemake -s CRAMboozle.snakefile \
--latency-wait 60 \
--keep-going \
--cluster-config config/cluster_slurm.yaml \
--cluster "sbatch -p {cluster.partition} --mem={cluster.mem} -t {cluster.time} -c {cluster.ncpus} -n {cluster.ntasks} -o {cluster.output} -J {cluster.JobName}" \
-j 40
Add the -np flag to the end of any command to see what would be executed without running it.
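To make the {cluster.*} placeholders in the --cluster string above concrete, the following minimal sketch mimics the substitution with Python's str.format. The setting values and the helper name render_submit_command are hypothetical and are not taken from config/cluster_slurm.yaml; Snakemake performs the real expansion itself.

```python
from types import SimpleNamespace

# The submit template, copied from the --cluster argument above.
TEMPLATE = (
    "sbatch -p {cluster.partition} --mem={cluster.mem} -t {cluster.time} "
    "-c {cluster.ncpus} -n {cluster.ntasks} -o {cluster.output} -J {cluster.JobName}"
)

# Hypothetical values; the real ones live in config/cluster_slurm.yaml.
SETTINGS = {
    "partition": "example",
    "mem": "16G",
    "time": "04:00:00",
    "ncpus": 8,
    "ntasks": 1,
    "output": "slurm-%j.out",
    "JobName": "cramboozle",
}


def render_submit_command(template: str, settings: dict) -> str:
    """Fill each {cluster.key} placeholder from a settings mapping."""
    # str.format resolves {cluster.partition} as attribute access on `cluster`.
    return template.format(cluster=SimpleNamespace(**settings))


print(render_submit_command(TEMPLATE, SETTINGS))
```

Printing the rendered command this way can help spot a missing key in the cluster config before any job is submitted.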
For each sample, the workflow generates:
- {sample}_deidentified.cram: De-identified CRAM file
- {sample}_deidentified.cram.crai: CRAM index file
- logs/{sample}_cramboozle.log: Processing log
- CRAMboozle_summary.txt: Summary of all processed files
In config/config.yaml:
- strict_mode: Enable additional tag sanitization (default: false)
- keep_unmapped: Keep unmapped reads in output (default: false)
- keep_secondary: Keep secondary alignments (default: false)
- cramboozle.ncpus: CPUs per job (default: 8, auto-detected if available)
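As an illustration of what keep_unmapped and keep_secondary control, here is a minimal sketch of a read filter based on the SAM flag bits (0x4 for unmapped, 0x100 for secondary alignment). The function keep_read is hypothetical and is not CRAMboozle.py's actual implementation; it only shows how such options could map onto flag checks.

```python
# Flag bits as defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100


def keep_read(flag: int, keep_unmapped: bool = False, keep_secondary: bool = False) -> bool:
    """Return True if a read with this SAM flag should be written to the output."""
    if flag & FLAG_UNMAPPED and not keep_unmapped:
        return False
    if flag & FLAG_SECONDARY and not keep_secondary:
        return False
    return True


# 0: mapped primary read, 4: unmapped, 256: secondary, 99: properly paired first mate
flags = [0, 4, 256, 99]
print([f for f in flags if keep_read(f)])  # [0, 99]
```

With both options left at their default of false, only mapped primary reads pass the filter.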
- Input: BAM or CRAM files (auto-detected by extension)
- Output: CRAM files by default (typically smaller than BAM)
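The extension-based detection and the output naming described above can be sketched as follows. Both helper names are hypothetical, and the case-insensitive comparison is an assumption rather than documented behavior; the output file names follow the {sample}_deidentified.cram pattern listed earlier.

```python
from pathlib import Path


def detect_format(path: str) -> str:
    """Classify an input file as BAM or CRAM from its extension."""
    suffix = Path(path).suffix.lower()  # assumption: extension case is ignored
    if suffix == ".bam":
        return "BAM"
    if suffix == ".cram":
        return "CRAM"
    raise ValueError(f"unsupported input extension: {path}")


def output_names(sample: str) -> tuple[str, str]:
    """Return the de-identified CRAM and index file names for a sample."""
    cram = f"{sample}_deidentified.cram"
    return cram, cram + ".crai"


print(detect_format("/path/to/patient001.bam"))  # BAM
print(output_names("patient001"))
```

Running a check like this over every entry in config/samples.yaml before starting the workflow would catch a mistyped path early.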
- Check cluster job status: squeue -u $USER
- View logs in: results/logs/
- Monitor progress: snakemake -s CRAMboozle.snakefile --summary
- Missing reference: Ensure the reference genome FASTA is indexed: samtools faidx /path/to/reference.fasta
- Permission errors: Check file permissions and paths
- Memory issues: Increase memory in cluster_slurm.yaml
- Failed jobs: Check individual log files in results/logs/
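The missing-reference case can be caught before any job is submitted. The sketch below relies only on the fact that samtools faidx writes its index next to the FASTA with a .fai suffix; the function name reference_problems is hypothetical and not part of the workflow.

```python
from pathlib import Path


def reference_problems(fasta: str) -> list[str]:
    """Return human-readable problems found with the reference FASTA."""
    problems = []
    if not Path(fasta).is_file():
        problems.append(f"reference not found: {fasta}")
    elif not Path(fasta + ".fai").is_file():
        # samtools faidx creates <fasta>.fai alongside the FASTA file.
        problems.append(f"index missing, run: samtools faidx {fasta}")
    return problems


for problem in reference_problems("/path/to/reference.fasta"):
    print(problem)
```

An empty list means both the FASTA and its index were found.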
CRAMboozle/
├── CRAMboozle.py
├── CRAMboozle.snakefile
├── config/
│   ├── config.yaml
│   └── cluster_slurm.yaml
└── results/
    ├── sample1_deidentified.cram
    ├── sample1_deidentified.cram.crai
    ├── logs/
    └── CRAMboozle_summary.txt