I’m running MLPerf Llama 3.1 small LLM pretraining (benchmarks/small_llm_pretraining) on our H200 DGX cluster.
Single-node training works correctly inside the provided Docker container (mlperf-h200:latest), but multi-node pretraining does not appear to be supported or documented.
The pretrain_llama31.py script only uses a LocalExecutor, while the large-model pretraining example includes a working SlurmExecutor.
I would like to benchmark 2–8 H200 nodes with SLURM (still inside the Docker container, as per the MLPerf setup), but there is no documented way to do so; the sketch after this paragraph shows roughly what I am trying to adapt.
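For reference, this is the kind of multi-node setup I am attempting, following the SlurmExecutor pattern from the large-model pretraining example and the NeMo-Run docs. The account, partition, host, mount, and path values are placeholders for my cluster, and I am not certain these are the arguments pretrain_llama31.py expects, so please treat it as a sketch rather than a working config:

```python
# Sketch only: swapping the LocalExecutor in pretrain_llama31.py for a
# multi-node SlurmExecutor, following the pattern from the large-model
# pretraining example. All cluster-specific values below are placeholders.
import nemo_run as run


def make_slurm_executor(nodes: int = 2, gpus_per_node: int = 8) -> run.SlurmExecutor:
    """Build a SLURM executor for 2-8 DGX H200 nodes, one rank per GPU."""
    return run.SlurmExecutor(
        account="my_account",                 # placeholder SLURM account
        partition="h200",                     # placeholder partition
        nodes=nodes,
        ntasks_per_node=gpus_per_node,        # 8 ranks per node (one per H200)
        gpus_per_node=gpus_per_node,
        mem="0",
        exclusive=True,
        time="04:00:00",
        container_image="mlperf-h200:latest",
        container_mounts=["/data:/data"],     # placeholder dataset/output mount
        tunnel=run.SSHTunnel(
            user="myuser",                    # placeholder login credentials
            host="slurm-login-node",
            job_dir="/scratch/mlperf_jobs",   # placeholder remote job dir
        ),
    )


# Tentatively, I would pass this executor wherever pretrain_llama31.py
# currently constructs its LocalExecutor, e.g.:
#   run.run(pretrain_recipe, executor=make_slurm_executor(nodes=2))
```

I am assuming the SSHTunnel pattern from the NeMo-Run quickstart here; when submitting from a login node inside the cluster, a LocalTunnel would presumably be used instead. If there is already a supported launcher or script for multi-node runs of this benchmark that I missed, a pointer would be enough.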
Environment
- Cluster: DGX H200 (8× H200 141 GB GPUs per node)
- Nodes: 2–8 planned
- Container: mlperf-h200:latest (from the MLPerf reference)
- Inside container: Python 3.10, NeMo 1.25+, PyTorch 2.3+, CUDA 12.4