
[Bug/Help] Commands in the job script are overridden by the form options below when submitting a job #1509

@Remielyzs

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

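What happened

When submitting a job, the GPU count and the CPU core count cannot be customized in the form options (the GPU count cannot be set to zero, and the CPU core count is determined solely by the node count and the GPU count).
After specifying the resources with commands (#SBATCH lines) in the script, I ran

sacct -j 92 --format=JobID,JobName,AllocCPUS,NNodes,NTasks,CPUTime,Elapsed,State

to check the resources the job actually used: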

JobID           JobName  AllocCPUS   NNodes   NTasks    CPUTime    Elapsed      State
------------ ---------- ---------- -------- -------- ---------- ---------- ----------
92           job-20251+         19        1            00:01:16   00:00:04  COMPLETED
92.batch          batch         19        1        1   00:01:16   00:00:04  COMPLETED
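
To see whether the 19 CPUs were already part of the request that reached Slurm, the requested and allocated counts can be printed side by side. This is a minimal sketch: the helper name `show_req_vs_alloc` is hypothetical, and it assumes the pipe-delimited output that sacct produces with `--parsable2`, using the `ReqCPUS` and `AllocCPUS` fields.

```shell
#!/bin/bash
# Print requested vs. allocated CPUs for each row of sacct output.
# Expects `sacct --parsable2 --format=JobID,ReqCPUS,AllocCPUS` on stdin:
# a header line followed by one pipe-delimited row per job step.
show_req_vs_alloc() {
  awk -F'|' 'NR > 1 { printf "%s requested=%s allocated=%s\n", $1, $2, $3 }'
}

# Usage on the cluster (job ID 92 is the one from this report):
# sacct -j 92 --parsable2 --format=JobID,ReqCPUS,AllocCPUS | show_req_vs_alloc
```

If the requested value already matches the allocated one, the number came from the submitted script rather than from a scheduler-side adjustment.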

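Below are the test submission scripts. The first sbatch block is generated automatically by the system; the blocks after it were filled in manually.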

#!/bin/bash
#SBATCH -A a_yzs
#SBATCH --partition=debug
#SBATCH --qos=normal
#SBATCH -J job-20251107-185424
#SBATCH --nodes=1
#SBATCH -c 19
#SBATCH --time=30
#SBATCH --chdir=/data/home/yzs/scow/jobs/job-20251107-185424
#SBATCH --output=job.%j.out
#SBATCH --error=job.%j.err
#SBATCH --gres=gpu:1

source /data/software/module/tools/modules/init/profile.sh
#!/bin/bash
#SBATCH -A a_yzs
#SBATCH --partition=debug
#SBATCH --qos=normal
#SBATCH -J job-20251106-204211
#SBATCH --nodes=1
#SBATCH -c 1
#SBATCH --time=30
#SBATCH --chdir=/data/home/yzs
#SBATCH --output=job.%j.out
#SBATCH --error=job.%j.err
#SBATCH --gres=gpu:0

source /data/software/module/tools/modules/init/profile.sh
#!/bin/bash

#SBATCH --job-name=pytorch-gpu-test    # job name
#SBATCH --output=gpu_test_%j.out      # output log file
#SBATCH --error=gpu_test_%j.err       # error log file
#SBATCH --partition=debug

#SBATCH --time=00:10:00               # run time limit (10 minutes)
#SBATCH --mem=8G                      # memory limit
source /data/software/module/tools/modules/init/profile.sh
source /data/software/conda/etc/profile.d/conda.sh
# Load the conda module (adjust the module name to match your cluster)
#module load conda

# Initialize conda (works around "conda: command not found")
#source "$(conda info --base)/etc/profile.d/conda.sh" || { echo "Failed to initialize conda"; exit 1; }
conda init bash
# Activate the pytorch environment
conda activate /data/software/conda/envs/pytorch || { echo "Failed to activate the pytorch environment"; exit 1; }

# Check that the environment was activated
echo "Currently active environment: $(conda info --envs | grep '*' | awk '{print $1}')"

# Check the Python path
which python || { echo "python command not found"; exit 1; }

# Check that PyTorch is available
python -c "import torch; print('PyTorch version:', torch.__version__)" || { echo "Failed to import PyTorch"; exit 1; }

# Run the test script
echo "Starting the GPU test..."
python gpu_test.py

# Leave the environment
conda deactivate
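
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the sbatch documentation states that sbatch stops processing `#SBATCH` directives once it reaches the first non-comment, non-whitespace line of the script. If the blocks above are submitted as a single file, the `source` line that ends the first block would be that line, and the directives in the later blocks would not be read. The sketch below lists the directives that would still take effect; the function name `effective_sbatch_directives` is hypothetical, and the awk rules only approximate sbatch's parsing.

```shell
#!/bin/bash
# List the #SBATCH lines that appear before the first executable line.
# Approximation of the rule documented for sbatch, not sbatch's real parser.
effective_sbatch_directives() {
  awk '
    /^[[:space:]]*$/ { next }         # blank line: keep scanning
    /^#SBATCH/       { print; next }  # directive still inside the header
    /^[[:space:]]*#/ { next }         # shebang or ordinary comment
    { exit }                          # first executable line: stop
  ' "$1"
}

# Usage (the path is a placeholder for the combined script):
# effective_sbatch_directives ./job.sh
```

Under that assumption, only the system-generated values (`-c 19`, `--gres=gpu:1`) would be printed for the combined script, which is consistent with the AllocCPUS value of 19 reported by sacct.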

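What did you expect to happen

Resources should be determined by the commands in the script, not by the form options.

Did this work before?

This is the first time I have run it.

Steps To Reproduce

Deploy following the documentation. The only deviation is that logging was switched from fluentd to Docker's native logging because of a driver initialization problem.

Environment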

- OS: Ubuntu 24.04
- Scheduler: Slurm 25.05.1
- Docker: 28.5.1, build e180ab8
- Docker-compose: v2.40.3
- SCOW cli: 1.6.4
- SCOW: 1.6.4
- Adapter: 1.6.0

Anything else?

No response

Labels: bug (Something isn't working)