bsub

2016-11-1 16:52 | Personal category: LSF | System category: Chip design

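Introduction to the LSF system
http://scc.ustc.edu.cn/zh_CN/ USTC Supercomputing Center
http://www.sccas.cn/gb/index.html Supercomputing Center of the Chinese Academy of Sciences
http://www.ssc.NET.cn/ Shanghai Supercomputer Center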

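LSF overview
LSF (Load Sharing Facility) is a distributed resource management tool used to schedule, monitor, and analyze the load of networked computers.
Purpose
Through centralized monitoring and scheduling, fully share computer resources such as CPU, memory, disk, and licenses.
A group of computers with the LSF software installed forms a Cluster.
Resources within the Cluster are monitored and scheduled in a unified way.
Components of an LSF Cluster
LSF terminology
Cluster
A group of computers running the LSF software (interconnected over a TCP/IP network, of course). This is unrelated to the term "cluster" as used for compute clusters.
Commands
bhosts: list the hosts in the cluster
lsid: show the cluster name
lsclusters: show cluster status and size
Server Host
A computer in the Cluster that both submits and executes jobs
Client Host
A computer in the Cluster that only submits jobs
In the USTC Cluster, node1-node32 are Server Hosts
Job
A command submitted to LSF
LSF schedules, controls, and tracks the job
Commands
bjobs: view the jobs in the system
bsub: submit a job
bhist: view job history
bkill: kill a job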

Using Platform LSF

LSF usage overview
Troubleshooting
Job submission and management
Resource management
System monitoring

LSF usage overview
Setting the LSF environment variables
% login as: test
Using keyboard-interactive authentication.
Password:
Last login: Mon Dec 21 09:31:29 2009 from 11.11.11.241

test@node69:~> env | grep LSF
LSF_SERVERDIR=/public/software/lsf/7.0/linux2.6-glibc2.3-x86_64/etc
LSF_LIBDIR=/public/software/lsf/7.0/linux2.6-glibc2.3-x86_64/lib
LSF_VERSION=7.0
LSF_BINDIR=/public/software/lsf/7.0/linux2.6-glibc2.3-x86_64/bin
XLSF_UIDDIR=/public/software/lsf/7.0/linux2.6-glibc2.3-x86_64/lib/uid
LSF_ENVDIR=/public/software/lsf/conf
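
Before submitting anything it can help to confirm that the variables listed above are actually set in the current shell. The following is a minimal sketch, not part of LSF: the function name check_lsf_env is hypothetical, and the assumption that these four variables are the ones worth checking is taken only from the env output above.

```shell
#!/bin/sh
# check_lsf_env (hypothetical helper): report which of the LSF variables
# from the `env | grep LSF` output above are unset or empty.
# Assumption: these four are the ones a session needs; sites may differ.
check_lsf_env() {
    missing=""
    for var in LSF_ENVDIR LSF_BINDIR LSF_SERVERDIR LSF_LIBDIR; do
        # Read the variable whose name is stored in $var.
        eval "val=\${$var:-}"
        if [ -z "$val" ]; then
            missing="$missing $var"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "LSF environment looks set"
}
```

Calling check_lsf_env right after login prints either a confirmation or the names that still need to be set.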

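Job submission: ordinary parallel jobs (a program must itself be parallel before it can be submitted as a parallel job)
Job submission: Gauss job
test@node69:~/gauss-test> bsub -W 60 -n 32 -q QN_Norm g03.lsf test397.com
Job <716> is submitted to queue <QN_Norm>.
Parameters: g03.lsf, the keyword for running Gauss
-W 60: the job may run for at most 60 minutes
-n 32: request 32 CPUs
-q QN_Norm: submit to the QN_Norm queue
Implicit parameter: output.%J, the standard output file (including related error messages)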
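
The job ID in bsub's confirmation line is what bjobs, bhist, bpeek, and bkill take as an argument later, so scripts often need to capture it. A minimal sketch, assuming the message format shown above; parse_job_id is a hypothetical helper name, not an LSF command.

```shell
#!/bin/sh
# parse_job_id (hypothetical helper): pull the numeric ID out of a
# "Job <N> is submitted to ..." line. Prints nothing if the line
# does not match that format.
parse_job_id() {
    printf '%s\n' "$1" | sed -n 's/^Job <\([0-9][0-9]*\)> is submitted to .*$/\1/p'
}

# Sample message copied from the Gauss submission above.
parse_job_id 'Job <716> is submitted to queue <QN_Norm>.'

# On a cluster the message would come from bsub itself, for example:
#   jobid=$(parse_job_id "$(bsub -W 60 -n 32 -q QN_Norm g03.lsf test397.com)")
```

The same pattern also matches the "default queue" variant of the message.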

Job submission: Dock job
test@node69:~/dock6-test> bsub -W 12:00 -a openmpi -n 4 mpirun.lsf /public/software/dock6-openmpi/bin/dock6.mpi -i test.in -o test.out
Job <818> is submitted to default queue <QS_Norm>.
Parameters: -a openmpi: run with openmpi
-W 12:00: run for at most 12 hours
-n 4: request 4 CPUs
mpirun.lsf: the keyword used with openmpi
No -q parameter: the job goes to the default queue, QS_Norm
Implicit parameter: output.%J, the standard output file (including related error messages)
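
The examples use two forms of the -W run limit: plain minutes (60, 360) and hours:minutes (12:00). A small sketch of how the two relate; wlimit_to_minutes is a hypothetical helper written for illustration, not something LSF provides.

```shell
#!/bin/sh
# wlimit_to_minutes (hypothetical helper): convert a -W value of the
# form [hours:]minutes into total minutes.
wlimit_to_minutes() {
    printf '%s\n' "$1" | awk -F: '{ if (NF == 2) print $1 * 60 + $2; else print $1 + 0 }'
}

wlimit_to_minutes 12:00   # the Dock job limit
wlimit_to_minutes 360     # the blast job limit
```

The first call prints 720 and the second 360, so "-W 12:00" allows twice the run time of "-W 360".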

Job submission: blast job
test@node69:~/mpiblast-test> bsub -W 360 -n 32 -q QN_Norm -a openmpi mpirun.lsf ./blast.sh
Job <819> is submitted to queue <QN_Norm>.

Parameters: -a openmpi: run with openmpi
-W 360: run for at most 360 minutes
-n 32: request 32 CPUs
mpirun.lsf: the keyword used with openmpi
-q QN_Norm: submit to the QN_Norm queue
Implicit parameter: output.%J, the standard output file (including related error messages)

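Job submission: ordinary serial job
test@node69:~> bsub -W 60 a.out ./bowtie-build.sh <arguments>
Job <820> is submitted to default queue <QS_Norm>.
Parameters: -W 60: run for at most 60 minutes
No -n parameter: the job uses 1 CPU
No -q parameter: the job goes to the default queue, QS_Norm
Implicit parameter: output.%J, the standard output file (including related error messages)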

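Job submission: ordinary MPI job using openmpi
Same as for DOCK and blastmpi:
test@node69:~/mpiblast-test> bsub -W 360 -n 32 -q QN_Norm -a openmpi mpirun.lsf ./blast.sh
Job <819> is submitted to queue <QN_Norm>.
Parameters: -a openmpi: run with openmpi
-W 360: run for at most 360 minutes
-n 32: request 32 CPUs
mpirun.lsf: the keyword used with openmpi
-q QN_Norm: submit to the QN_Norm queue
Implicit parameter: output.%J, the standard output file (including related error messages)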

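Job submission: interactive graphical jobs and job arrays
test@node69:~/mpiblast-test> bsub -Ip xclock
Job <819> is submitted to queue <QS_Norm>.

Parameters: -Ip: run interactively with a pseudo-terminal, used here for a graphical program
Job arrays:
> bsub -J "Jobname[1-100]" -i input.%I -o output.%I Exec.out
%I is replaced by the index of each array element, so element 7 reads input.7 and writes output.7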
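
A job array over indices 1-100 expects one input.<index> file per element, and an element whose input file is absent will fail on its own. The sketch below lists the absent files before submission; missing_inputs is a hypothetical helper, and it assumes the input.<index> naming used in the array example above.

```shell
#!/bin/sh
# missing_inputs (hypothetical helper): print every input.<i> in the
# range first..last that does not exist in the current directory.
missing_inputs() {
    first=$1
    last=$2
    i=$first
    while [ "$i" -le "$last" ]; do
        [ -e "input.$i" ] || echo "input.$i"
        i=$((i + 1))
    done
}

# Typical use from the submission directory (prints nothing if all exist):
#   missing_inputs 1 100
```

An empty result means every element of the array has its input file in place.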

Online job monitoring

test@node69:~> bjobs -w
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
818 test RUN QS_Norm node69 4*node10 mpirun.lsf /public/software/dock6-openmpi/bin/dock6.mpi -i test.in -o test.out Dec 21 19:29

test@node69:~> bjobs -l 818

Job <818>, User <test>, Project <default>, Status <RUN>, Queue <QS_Norm>, Comma
nd <mpirun.lsf /public/software/dock6-openmpi/bin/dock6.mp
i -i test.in -o test.out>, Share group charged </test>
Mon Dec 21 19:29:35: Submitted from host <node69>, CWD <$HOME/dock6-test>, Outp
ut File </home/test/dock6-test/output.%J>, 4 Processors Re
quested;
Mon Dec 21 19:29:39: Started on 4 Hosts/Processors <4*node10>, Execution Home <
/home/test>, Execution CWD </home/test/dock6-test>;
Mon Dec 21 19:47:18: Resource usage collected.
The CPU time used is 4174 seconds.
MEM: 96 Mbytes; SWAP: 703 Mbytes; NTHREAD: 23
PGID: 28118; PIDs: 28131 28133 28137 28138 28139 28118
28140 28141 28129
PGID: 28143; PIDs: 28143
PGID: 28142; PIDs: 28142
PGID: 28144; PIDs: 28144
PGID: 28145; PIDs: 28145
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -

Online job monitoring

test@node69:~> bjobs -aw
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
818 test RUN QS_Norm node69 4*node10 mpirun.lsf /public/software/dock6-openmpi/bin/dock6.mpi -i test.in -o test.out Dec 21 19:29
115 test DONE lost_and_found node70 node1 sleep 1000 Dec 19 16:31
116 test DONE lost_and_found node70 node62 sleep 100 Dec 19 17:07
117 test DONE lost_and_found node70 node62 sleep 100 Dec 19 17:07
119 test DONE lost_and_found node70 node61 sleep 100 Dec 19 17:07
118 test DONE lost_and_found node70 node69 sleep 100 Dec 19 17:07
120 test DONE lost_and_found node70 node70 sleep 100 Dec 19 17:07
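
With many jobs in the list, a per-status count is often easier to read than the full table. A minimal sketch that summarizes listings like the one above; count_by_status is a hypothetical helper, and it assumes STAT is the third whitespace-separated column, as in the header shown.

```shell
#!/bin/sh
# count_by_status (hypothetical helper): read a bjobs-style listing on
# stdin, skip the header line, and print "<STAT> <count>" per status.
count_by_status() {
    awk 'NR > 1 { count[$3]++ } END { for (s in count) print s, count[s] }' | sort
}

# On a cluster the listing would be piped in directly:
#   bjobs -aw | count_by_status
```

For the listing above this would report the DONE jobs and the one RUN job on separate lines.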

Online job monitoring
Checking job history

test@node69:~> bhist -aw
Summary of time in seconds spent in various states:
JOBID USER JOB_NAME PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
223 test mpirun.lsf ./cpi-openmpi 6 0 5 0 0 0 11
224 test mpirun.lsf ./cpi-openmpi 5 0 5 0 0 0 10
225 test mpirun.lsf openmpi 3 0 4 0 0 0 7
226 test mpirun.lsf ./cpi-openmpi 5 0 7 0 0 0 12
227 test mpirun.lsf ./cpi-mpich 2 0 0 0 0 0 2
228 test mpirun.lsf ./cpi-mpich 4 0 1994 0 0 0 1998
229 test mpirun.lsf ./cpi-openmpi 6 0 7 0 0 0 13
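
Because JOB_NAME can contain spaces, the time columns in this table are easier to pick out by counting from the right. A minimal sketch, assuming the seven trailing columns shown in the header (PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL); bhist_pend_run is a hypothetical helper name.

```shell
#!/bin/sh
# bhist_pend_run (hypothetical helper): read a bhist summary table on
# stdin and print "<JOBID> <PEND> <RUN>" for each job row.
# Rows are recognized by a numeric first field; PEND is the 7th column
# from the right and RUN is the 5th from the right.
bhist_pend_run() {
    awk 'NF >= 10 && $1 ~ /^[0-9]+$/ { print $1, $(NF - 6), $(NF - 4) }'
}

# On a cluster:
#   bhist -aw | bhist_pend_run
```

For job 228 in the table above this gives 4 seconds pending and 1994 seconds running.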
Online job monitoring
Checking job history

test@node69:~> bhist -l 223

Job <223>, User <test>, Project <default>, Command <mpirun.lsf ./cpi-openmpi>
Sun Dec 20 15:05:40: Submitted from host <node69>, to Queue <default>, CWD <$HO
ME>, Output File </home/test/output.%J>, 16 Processors Req
uested;
Sun Dec 20 15:05:46: Dispatched to 16 Hosts/Processors <16*node62>;
Sun Dec 20 15:05:46: Starting (Pid 30493);
Sun Dec 20 15:05:46: Running with execution home </home/test>, Execution CWD </
home/test>, Execution Pid <30493>;
Sun Dec 20 15:05:51: Done successfully. The CPU time used is 9.6 seconds;
Sun Dec 20 15:05:51: Post job process done successfully;

Summary of time in seconds spent in various states by Sun Dec 20 15:05:51
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
6 0 5 0 0 0 11
Online job monitoring
Checking the output of a running job

test@node69:~> bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
818 test RUN QS_Norm node69 4*node10 * test.out Dec 21 19:29

test@node69:~> bpeek -f 818
<< output from stdout >>
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
Killing a job:
test@node69:~> bkill 1480

Job management
Checking job history

Host load status

Host job status

Host groups

Queue status

Troubleshooting
Job exit analysis

LSF records each job's exit status
"bhist -l <jobid>" and "bjobs -l <jobid>" show the job's exit code
Submit the job with "-o %J.out" and check the output file <jobid>.out
Typical User Problems
"My job dies under LSF"
Check the resource limits on the queues
Check that the application and its data files are accessible from the execution host(s)
Is an application license available from the execution host?
Check the exit code reported by bjobs -l
Common exit codes
126 - Command invoked cannot execute
127 - Command not found
130 - Script terminated by Control-C
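
The codes above follow the usual shell convention, in which values above 128 mean the process was ended by signal (code - 128); 130 is therefore signal 2, the interrupt sent by Control-C. A minimal sketch of that mapping, assuming the job's exit code comes from a shell-launched command; describe_exit_code is a hypothetical helper and the wording of its messages is illustrative.

```shell
#!/bin/sh
# describe_exit_code (hypothetical helper): turn an exit code, such as
# the one reported by "bjobs -l", into a short description using the
# common shell convention (126, 127, and 128+N for signal N).
describe_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -eq 126 ]; then
        echo "command invoked cannot execute"
    elif [ "$code" -eq 127 ]; then
        echo "command not found"
    elif [ "$code" -gt 128 ]; then
        echo "terminated by signal $((code - 128))"
    else
        echo "application-defined exit code $code"
    fi
}

describe_exit_code 130
```

Any other nonzero value is defined by the application itself, so its meaning has to come from that program's documentation.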
Typical User Problems (cont.d)
"My job was rejected by LSF"
Check the resource requirement string and the run time limit
Was the job submitted to an unauthorized queue or host?
Do the requested soft limits exceed the queue's hard limits?
Typical User Problems (cont.d)
"My job PENDs forever under LSF"
Has the user requested unrealistic resources?
For example, more memory than any host has
The resource requirements may be too stringent
Is the user's ID valid on the execution host(s)?
The user may have requested exclusive execution
If FCFS scheduling is used, the user may be last in line
If fairshare scheduling is used, the user may have exhausted their fairshare allocation