Excessive memory usage (>1 TB) in COMBINE_ANNOTATIONS step with protein FASTA #459

@LSHillary

Description of the bug

Hi DRAM team,

First off, thanks for all your work on DRAM2 — it’s been a really useful tool for my projects!

I ran into a scaling issue with the COMBINE_ANNOTATIONS step when running Pfam on a large protein FASTA from metagenomic data (~8.1M sequences). I managed to work around it with a custom script, but thought it was worth flagging here in case it’s something you’d want to address upstream.

Description:
I was annotating a protein FASTA of ~8.1 M proteins (8,176,875 sequences) using DRAM2 with Pfam. The pipeline worked, but the COMBINE_ANNOTATIONS step ballooned to >1 TB memory usage before failing under Nextflow, despite the raw inputs totaling ~61 GB and the combined filtered output being ~917 MB.

When re-implementing the combine logic myself (streaming per-sample, selecting only the top Pfam hit per query), the job completed successfully with expected memory usage (<10 GB). This suggests the current implementation is trying to hold all the hits in memory rather than filtering as it goes.

Steps to reproduce:
1. Annotate a protein FASTA (~8.1 M sequences) with DRAM2 using Pfam.
2. Allow the pipeline to reach the COMBINE_ANNOTATIONS step.
3. Observe runaway memory growth (>1 TB) until the job fails.

Suggested fix:
Rather than holding all rows in memory, COMBINE_ANNOTATIONS could stream each per-batch file and apply best-hit filtering as it goes. This would keep memory usage at most proportional to the input size (on the order of tens of GB for datasets like mine) instead of ballooning into the terabyte range.
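
To illustrate the idea, here is a minimal sketch of a streaming best-hit combine. It is not DRAM2's actual code or my exact workaround script. It assumes tab-separated part files with a header row, the hypothetical column names `query_id`, `pfam_id`, and `bitscore`, and that a higher bitscore means a better hit; the function names are also made up for illustration. Only one entry per query is ever held in memory, so usage scales with the number of queries rather than the total number of hits.

```python
import csv
from typing import Dict, Iterable, Tuple

# Hypothetical column names; DRAM2's real Pfam output schema may differ.
QUERY_COL = "query_id"
HIT_COL = "pfam_id"
SCORE_COL = "bitscore"


def combine_best_hits(part_paths: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Stream each part file row by row, keeping only the best hit per query."""
    best: Dict[str, Tuple[str, float]] = {}
    for path in part_paths:
        with open(path, newline="") as handle:
            # DictReader yields one row at a time, so a whole file is never loaded.
            for row in csv.DictReader(handle, delimiter="\t"):
                score = float(row[SCORE_COL])
                current = best.get(row[QUERY_COL])
                # Assumption: a higher score is a better hit.
                if current is None or score > current[1]:
                    best[row[QUERY_COL]] = (row[HIT_COL], score)
    return best


def write_combined(best: Dict[str, Tuple[str, float]], out_path: str) -> None:
    """Write the filtered hits as one tab-separated table, sorted by query."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow([QUERY_COL, HIT_COL, SCORE_COL])
        for query in sorted(best):
            hit, score = best[query]
            writer.writerow([query, hit, score])
```

A call such as `write_combined(combine_best_hits(sorted(glob.glob("parts/*.tsv"))), "combined.tsv")` would process the part files one after another; the `parts/` directory and file names here are placeholders, not DRAM2 paths.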

Additional context:

  • Input: protein FASTA.
  • Each Pfam part file has ~40,000 rows, 205 total parts.
  • Input size across all Pfam annotation files: ~61 GB.
  • Output after filtering top hits: ~917 MB.

System information

  • DRAM version: DRAM2 beta (Nextflow revision 42fdba0, container dram2-beta1.tar.img)
  • Nextflow version: 24.10.4
  • Hardware: HPC cluster
  • Executor: slurm
  • Container engine: Singularity/Apptainer
  • OS: Ubuntu 22.04.5 LTS

Command used and terminal output

# Command

nextflow run DRAM -profile singularity_slurm --annotate --input_genes InputFastas --use_pfam --distill_ecosystem 'eng_sys ag' --outdir Dram2_pfam -resume

# terminal output is a failed run

Relevant files

No response

Metadata

Labels

bug: Something isn't working

Projects

Status

Todo
