I’m running MLPerf Llama 3.1 small LLM pretraining (benchmarks/small_llm_pretraining) on our H200 DGX cluster.
Single-node training works correctly inside the provided Docker container (mlperf-h200:latest), but multi-node pretraining does not appear to be supported or documented.
The pretrain_llama31.py script only uses a LocalExecutor, while the large-model pretraining example includes a working SlurmExecutor.
I would like to benchmark 2–8 H200 nodes with SLURM (still inside the Docker container, as per the MLPerf setup), but there is no documented way to do so; the sketch after this paragraph shows roughly what I am trying to adapt.
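For reference, this is the kind of multi-node setup I am attempting, following the SlurmExecutor pattern from the large-model pretraining example and the NeMo-Run docs. The account, partition, host, mount, and path values are placeholders for my cluster, and I am not certain these are the arguments pretrain_llama31.py expects, so please treat it as a sketch rather than a working config:

```python
# Sketch only: swapping the LocalExecutor in pretrain_llama31.py for a
# multi-node SlurmExecutor, following the pattern from the large-model
# pretraining example. All cluster-specific values below are placeholders.
import nemo_run as run


def make_slurm_executor(nodes: int = 2, gpus_per_node: int = 8) -> run.SlurmExecutor:
    """Build a SLURM executor for 2-8 DGX H200 nodes, one rank per GPU."""
    return run.SlurmExecutor(
        account="my_account",                 # placeholder SLURM account
        partition="h200",                     # placeholder partition
        nodes=nodes,
        ntasks_per_node=gpus_per_node,        # 8 ranks per node (one per H200)
        gpus_per_node=gpus_per_node,
        mem="0",
        exclusive=True,
        time="04:00:00",
        container_image="mlperf-h200:latest",
        container_mounts=["/data:/data"],     # placeholder dataset/output mount
        tunnel=run.SSHTunnel(
            user="myuser",                    # placeholder login credentials
            host="slurm-login-node",
            job_dir="/scratch/mlperf_jobs",   # placeholder remote job dir
        ),
    )


# Tentatively, I would pass this executor wherever pretrain_llama31.py
# currently constructs its LocalExecutor, e.g.:
#   run.run(pretrain_recipe, executor=make_slurm_executor(nodes=2))
```

I am assuming the SSHTunnel pattern from the NeMo-Run quickstart here; when submitting from a login node inside the cluster, a LocalTunnel would presumably be used instead. If there is already a supported launcher or script for multi-node runs of this benchmark that I missed, a pointer would be enough.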
Environment
- Cluster: DGX H200 (8× H200 141 GB GPUs per node)
- Nodes: 2–8 planned
- Container: mlperf-h200:latest (from the MLPerf reference)
- Inside container: Python 3.10, NeMo 1.25+, PyTorch 2.3+, CUDA 12.4