Excessive memory usage (>1 TB) in COMBINE_ANNOTATIONS step with protein FASTA #459

### Description of the bug

Hi DRAM team,

First off, thanks for all your work on DRAM2 — it’s been a really useful tool for my projects!

I ran into a scaling issue with the COMBINE_ANNOTATIONS step when running Pfam on a large protein FASTA from metagenomic data (~8.1M sequences). I managed to work around it with a custom script, but thought it was worth flagging here in case it’s something you’d want to address upstream.

**Description:**
I was annotating a protein FASTA of 8,176,875 sequences using DRAM2 with Pfam. The pipeline ran, but the COMBINE_ANNOTATIONS step ballooned to >1 TB of memory before failing under Nextflow, even though the raw inputs total only ~61 GB and the combined, filtered output is ~917 MB.

When re-implementing the combine logic myself (streaming per-sample, selecting only the top Pfam hit per query), the job completed successfully with expected memory usage (<10 GB). This suggests the current implementation is trying to hold all the hits in memory rather than filtering as it goes.

**Steps to reproduce:**
1. Annotate a protein FASTA (~8.1 M sequences) with DRAM2 using Pfam.
2. Allow the pipeline to reach the COMBINE_ANNOTATIONS step.
3. Observe runaway memory growth (>1 TB) until the job fails.

**Suggested fix:**
Rather than holding all rows in memory, COMBINE_ANNOTATIONS could stream each per-batch file and apply best-hit filtering as it goes. Memory usage would then be bounded by the retained best hits (one per query) instead of the full set of hits. My streaming re-implementation stayed under 10 GB on this dataset, so this should avoid ballooning into the terabyte range.
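
To illustrate the idea, here is a minimal Python sketch of streaming best-hit filtering. It is not DRAM2's code and not my exact script: it assumes each part file is a tab-separated table with a header row, and the column names `query_id`, `pfam_id`, and `bit_score` are hypothetical placeholders, as is the choice of "highest bit score" as the best-hit criterion.

```python
import csv
from typing import Dict, Iterable, Tuple

# Hypothetical column names; DRAM2's real Pfam part files may differ.
QUERY_COL = "query_id"
HIT_COL = "pfam_id"
SCORE_COL = "bit_score"


def stream_best_hits(part_paths: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Return {query_id: (pfam_id, bit_score)}, keeping only the top hit per query.

    Rows are read one at a time, so memory grows with the number of
    distinct queries rather than with the total number of hits.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for path in part_paths:
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                score = float(row[SCORE_COL])
                current = best.get(row[QUERY_COL])
                # Keep the row only if it beats the best score seen so far.
                if current is None or score > current[1]:
                    best[row[QUERY_COL]] = (row[HIT_COL], score)
    return best


def write_best_hits(best: Dict[str, Tuple[str, float]], out_path: str) -> None:
    """Write the retained best hits as a tab-separated table."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow([QUERY_COL, HIT_COL, SCORE_COL])
        for query_id, (pfam_id, score) in best.items():
            writer.writerow([query_id, pfam_id, score])
```

With this shape, the 205 part files are processed one after another and only one entry per query is ever held, which matches the behaviour I saw in my workaround.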

**Additional context:**

- Input: protein FASTA.
- Each Pfam part file has ~40,000 rows, 205 total parts.
- Input size across all Pfam annotation files: ~61 GB.
- Output after filtering top hits: ~917 MB.

### System information

- DRAM version: DRAM2 beta (Nextflow revision 42fdba0110, container dram2-beta1.tar.img)
- Nextflow version: 24.10.4
- Hardware: HPC cluster
- Executor: slurm
- Container engine: Singularity/Apptainer
- OS: Ubuntu 22.04.5 LTS

### Command used and terminal output

```console
# Command

nextflow run DRAM -profile singularity_slurm --annotate --input_genes InputFastas --use_pfam --distill_ecosystem 'eng_sys ag' --outdir Dram2_pfam -resume

# terminal output is a failed run
```

### Relevant files

_No response_
