H200 MLPerf small_llm_pretraining: Unable to perform multi-node pretraining #840

@Ypp6657

Description

I’m running MLPerf Llama 3.1 small LLM pretraining (benchmarks/small_llm_pretraining) on our H200 DGX cluster.
Single-node training works correctly inside the provided Docker container (mlperf-h200:latest), but multi-node pretraining does not appear to be supported or documented.

The pretrain_llama31.py script uses only a LocalExecutor, whereas the large-model pretraining example includes a working SlurmExecutor.
I would like to benchmark 2–8 H200 nodes via SLURM (inside Docker, per the MLPerf setup), but there is no documented way to do so.
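For reference, here is a rough sketch of what I would expect to work, adapting the SlurmExecutor pattern from the large-model pretraining example to this benchmark. The account, partition, host, and mount values are placeholders, and `build_recipe` is a hypothetical helper standing in for however pretrain_llama31.py actually constructs its NeMo recipe:

```python
# Sketch only: swap the LocalExecutor for a SlurmExecutor, following the
# pattern used in the large-model pretraining example. Account, partition,
# host, and paths below are placeholders; `build_recipe` is hypothetical.
import nemo_run as run


def slurm_executor(nodes: int, devices: int = 8) -> run.SlurmExecutor:
    return run.SlurmExecutor(
        account="ACCOUNT",                     # placeholder SLURM account
        partition="PARTITION",                 # placeholder SLURM partition
        nodes=nodes,
        ntasks_per_node=devices,               # one task per GPU
        gpus_per_node=devices,
        time="04:00:00",
        container_image="mlperf-h200:latest",  # same image as the single-node runs
        container_mounts=["/data:/data"],      # placeholder dataset mount
        tunnel=run.SSHTunnel(                  # submits sbatch jobs via SSH to the login node
            user="USER",
            host="LOGIN_NODE",
            job_dir="/path/to/remote/job_dir",
        ),
    )


if __name__ == "__main__":
    # Assumes the recipe follows the standard NeMo 2.0 layout
    # (trainer.num_nodes / trainer.devices); build_recipe is hypothetical.
    from pretrain_llama31 import build_recipe

    recipe = build_recipe()
    recipe.trainer.num_nodes = 2
    recipe.trainer.devices = 8

    run.run(recipe, executor=slurm_executor(nodes=2))
```

If something along these lines is the intended path, it would be great to have it documented in benchmarks/small_llm_pretraining; if not, guidance on the supported multi-node launch method would be appreciated.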


Environment

  • Cluster: DGX H200 (8× H200 141 GB GPUs per node)
  • Nodes: 2–8 (planned)
  • Container: mlperf-h200:latest from MLPerf reference
  • Inside container: Python 3.10, NeMo 1.25+, PyTorch 2.3+, CUDA 12.4
