Open
Labels
bug (Something isn't working)
Description
Is there an existing issue / discussion for this?
- I have searched the existing issues / discussions
What happened
When submitting a job, the GPU count and the CPU core count cannot be customized in the form options: the GPU count cannot be set to zero, and the CPU core count is determined solely by the node count and the GPU count.
After specifying the resources by command, I ran
sacct -j 92 --format=JobID,JobName,AllocCPUS,NNodes,NTasks,CPUTime,Elapsed,State
to check the resources the job used:
JobID           JobName  AllocCPUS   NNodes   NTasks    CPUTime    Elapsed      State
------------ ---------- ---------- -------- -------- ---------- ---------- ----------
92           job-20251+         19        1            00:01:16   00:00:04  COMPLETED
92.batch          batch         19        1        1   00:01:16   00:00:04  COMPLETED
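The sacct columns above show CPUs but not GPUs. As a minimal sketch of how the GPU allocation could be checked as well, the helper below pulls the `gres/gpu` count out of a TRES string such as the one sacct reports in its `AllocTRES` field. The function name `gpu_count_from_tres` and the sample string are illustrative assumptions, not output from this cluster.

```shell
#!/bin/bash
# Hypothetical helper: print the gres/gpu count found in a Slurm TRES string.
# Prints 0 when the string has no gres/gpu entry.
gpu_count_from_tres() {
  printf '%s\n' "$1" \
    | tr ',' '\n' \
    | awk -F= '$1 == "gres/gpu" { n = $2 } END { print n + 0 }'
}

# Illustrative TRES string (assumed, not taken from job 92).
gpu_count_from_tres 'billing=19,cpu=19,gres/gpu=1,mem=8G,node=1'
```

On the cluster, the string would come from something like `sacct -j 92 -X --parsable2 --noheader --format=AllocTRES`, passed to the helper as its first argument.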
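Below are the submission scripts used for testing. The first sbatch block was generated automatically by the system; the later ones were filled in manually.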
#!/bin/bash
#SBATCH -A a_yzs
#SBATCH --partition=debug
#SBATCH --qos=normal
#SBATCH -J job-20251107-185424
#SBATCH --nodes=1
#SBATCH -c 19
#SBATCH --time=30
#SBATCH --chdir=/data/home/yzs/scow/jobs/job-20251107-185424
#SBATCH --output=job.%j.out
#SBATCH --error=job.%j.err
#SBATCH --gres=gpu:1
source /data/software/module/tools/modules/init/profile.sh
#!/bin/bash
#SBATCH -A a_yzs
#SBATCH --partition=debug
#SBATCH --qos=normal
#SBATCH -J job-20251106-204211
#SBATCH --nodes=1
#SBATCH -c 1
#SBATCH --time=30
#SBATCH --chdir=/data/home/yzs
#SBATCH --output=job.%j.out
#SBATCH --error=job.%j.err
#SBATCH --gres=gpu:0
source /data/software/module/tools/modules/init/profile.sh
#!/bin/bash
#SBATCH --job-name=pytorch-gpu-test # job name
#SBATCH --output=gpu_test_%j.out # output log file
#SBATCH --error=gpu_test_%j.err # error log file
#SBATCH --partition=debug
#SBATCH --time=00:10:00 # time limit (10 minutes)
#SBATCH --mem=8G # memory limit
source /data/software/module/tools/modules/init/profile.sh
source /data/software/conda/etc/profile.d/conda.sh
# Load the conda module (adjust the module name for your cluster)
#module load conda
# Initialize conda (works around "conda: command not found")
#source "$(conda info --base)/etc/profile.d/conda.sh" || { echo "Failed to initialize conda"; exit 1; }
conda init bash
# Activate the pytorch environment
conda activate /data/software/conda/envs/pytorch || { echo "Failed to activate the pytorch environment"; exit 1; }
# Check that the environment was activated
echo "Currently active environment: $(conda info --envs | grep '*' | awk '{print $1}')"
# Check the Python path
which python || { echo "python command not found"; exit 1; }
# Check that PyTorch is available
python -c "import torch; print('PyTorch version:', torch.__version__)" || { echo "Failed to import PyTorch"; exit 1; }
# Run the test script
echo "Starting the GPU test..."
python gpu_test.py
# Leave the environment
conda deactivate
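One thing worth checking here: sbatch stops processing `#SBATCH` directives at the first non-comment, non-blank line of the script. If the blocks above end up in a single submitted file (an assumption; the report does not say how they are combined), every directive after the first `source` line would be ignored. The sketch below lists the directives sbatch would still honor in a given file; the function name `effective_sbatch_directives` and the sample file are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the #SBATCH lines that appear before the first
# command in a script, i.e. the ones sbatch would still process.
effective_sbatch_directives() {
  awk '
    /^[[:space:]]*$/ { next }    # blank lines do not end the header
    /^#SBATCH/       { print; next }
    /^#/             { next }    # other comments, including the shebang
                     { exit }    # first command: later directives are ignored
  ' "$1"
}

# Illustrative file: a generated header, a command, then a hand-written header.
sample=$(mktemp)
cat > "$sample" <<'EOF'
#!/bin/bash
#SBATCH -c 19
#SBATCH --gres=gpu:1
source /data/software/module/tools/modules/init/profile.sh
#!/bin/bash
#SBATCH -c 1
#SBATCH --gres=gpu:0
EOF
effective_sbatch_directives "$sample"
rm -f "$sample"
```

For this sample only the `-c 19` and `--gres=gpu:1` lines are printed, which would be consistent with the 19 allocated CPUs reported by sacct if the scripts were in fact concatenated.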
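What did you expect to happen
The resources should be determined by the commands rather than by the form options.
Did this work before?
This is the first run.
Steps To Reproduce
Deploy following the documentation. The only deviation is that logging was switched from fluentd to Docker's native logging because of a driver initialization problem.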
Environment
- OS: Ubuntu 24.04
- Scheduler: Slurm 25.05.1
- Docker: 28.5.1, build e180ab8
- Docker-compose: v2.40.3
- SCOW cli: 1.6.4
- SCOW: 1.6.4
- Adapter: 1.6.0
Anything else?
No response