test@node69:~> bjobs -w
JOBID USER STAT QUEUE   FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
818   test RUN  QS_Norm node69    4*node10  mpirun.lsf /public/software/dock6-openmpi/bin/dock6.mpi -i test.in -o test.out Dec 21 19:29
test@node69:~> bjobs -l 818
Job <818>, User <test>, Project <default>, Status <RUN>, Queue <QS_Norm>,
                     Command <mpirun.lsf /public/software/dock6-openmpi/bin/dock6.mpi
                     -i test.in -o test.out>, Share group charged </test>
Mon Dec 21 19:29:35: Submitted from host <node69>, CWD <$HOME/dock6-test>,
                     Output File </home/test/dock6-test/output.%J>,
                     4 Processors Requested;
Mon Dec 21 19:29:39: Started on 4 Hosts/Processors <4*node10>,
                     Execution Home </home/test>,
                     Execution CWD </home/test/dock6-test>;
Mon Dec 21 19:47:18: Resource usage collected.
                     The CPU time used is 4174 seconds.
                     MEM: 96 Mbytes;  SWAP: 703 Mbytes;  NTHREAD: 23
                     PGID: 28118;  PIDs: 28131 28133 28137 28138 28139 28118
                     28140 28141 28129
                     PGID: 28143;  PIDs: 28143
                     PGID: 28142;  PIDs: 28142
                     PGID: 28144;  PIDs: 28144
                     PGID: 28145;  PIDs: 28145

SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched     -    -     -   -   -   -   -   -    -    -    -
loadStop      -    -     -   -   -   -   -   -    -    -    -
Online Job Monitoring
test@node69:~> bjobs -aw
JOBID USER STAT QUEUE          FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
818   test RUN  QS_Norm        node69    4*node10  mpirun.lsf /public/software/dock6-openmpi/bin/dock6.mpi -i test.in -o test.out Dec 21 19:29
115   test DONE lost_and_found node70    node1     sleep 1000 Dec 19 16:31
116   test DONE lost_and_found node70    node62    sleep 100 Dec 19 17:07
117   test DONE lost_and_found node70    node62    sleep 100 Dec 19 17:07
119   test DONE lost_and_found node70    node61    sleep 100 Dec 19 17:07
118   test DONE lost_and_found node70    node69    sleep 100 Dec 19 17:07
120   test DONE lost_and_found node70    node70    sleep 100 Dec 19 17:07
Online Job Monitoring: Checking Job History Status
test@node69:~> bhist -aw
Summary of time in seconds spent in various states:
JOBID USER JOB_NAME                 PEND PSUSP  RUN USUSP SSUSP UNKWN TOTAL
223   test mpirun.lsf ./cpi-openmpi    6     0    5     0     0     0    11
224   test mpirun.lsf ./cpi-openmpi    5     0    5     0     0     0    10
225   test mpirun.lsf openmpi          3     0    4     0     0     0     7
226   test mpirun.lsf ./cpi-openmpi    5     0    7     0     0     0    12
227   test mpirun.lsf ./cpi-mpich      2     0    0     0     0     0     2
228   test mpirun.lsf ./cpi-mpich      4     0 1994     0     0     0  1998
229   test mpirun.lsf ./cpi-openmpi    6     0    7     0     0     0    13
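The summary rows printed by bhist -aw can be post-processed by a script, for example to find the job that ran longest. The following is a minimal Python sketch, not part of LSF: the function name parse_bhist_row is hypothetical, and it assumes the column layout shown above (JOBID, USER, a JOB_NAME that may contain spaces, then seven integer time columns). Because JOB_NAME can contain spaces, the row is split from the right.

```python
# Column layout assumed from the bhist -aw listing above.
STATE_COLUMNS = ["PEND", "PSUSP", "RUN", "USUSP", "SSUSP", "UNKWN", "TOTAL"]


def parse_bhist_row(line):
    """Parse one bhist summary row into a dict (hypothetical helper)."""
    fields = line.split()
    n = len(STATE_COLUMNS)
    if len(fields) < n + 3:
        raise ValueError("not a bhist summary row: %r" % line)
    # The last seven fields are the time columns; everything between
    # USER and those columns is the job name.
    times = [int(value) for value in fields[-n:]]
    row = {
        "JOBID": int(fields[0]),
        "USER": fields[1],
        "JOB_NAME": " ".join(fields[2:-n]),
    }
    row.update(zip(STATE_COLUMNS, times))
    return row


# Two rows copied from the listing above.
sample = """\
223 test mpirun.lsf ./cpi-openmpi 6 0 5 0 0 0 11
228 test mpirun.lsf ./cpi-mpich 4 0 1994 0 0 0 1998
"""

rows = [parse_bhist_row(line) for line in sample.splitlines()]
longest = max(rows, key=lambda row: row["RUN"])
print(longest["JOBID"], longest["JOB_NAME"], longest["RUN"])
```

Splitting from the right keeps the parser independent of how many words the job command contains, at the cost of assuming that the set of time columns never changes.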
test@node69:~> bhist -l 223
Job <223>, User <test>, Project <default>, Command <mpirun.lsf ./cpi-openmpi>
Sun Dec 20 15:05:40: Submitted from host <node69>, to Queue <default>,
                     CWD <$HOME>, Output File </home/test/output.%J>,
                     16 Processors Requested;
Sun Dec 20 15:05:46: Dispatched to 16 Hosts/Processors <16*node62>;
Sun Dec 20 15:05:46: Starting (Pid 30493);
Sun Dec 20 15:05:46: Running with execution home </home/test>,
                     Execution CWD </home/test>, Execution Pid <30493>;
Sun Dec 20 15:05:51: Done successfully. The CPU time used is 9.6 seconds;
Sun Dec 20 15:05:51: Post job process done successfully;

Summary of time in seconds spent in various states by Sun Dec 20 15:05:51
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
   6     0   5     0     0     0    11
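The PEND, RUN and TOTAL figures in the summary follow directly from the event timestamps in the bhist -l listing for job 223. As a worked check, here is a small Python sketch; the helper to_seconds is hypothetical, and it assumes all three events fall on the same day, as they do here.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS clock time to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Timestamps taken from the bhist -l 223 output above.
submitted = to_seconds("15:05:40")   # Submitted from host <node69>
dispatched = to_seconds("15:05:46")  # Dispatched to 16 Hosts/Processors
finished = to_seconds("15:05:51")    # Done successfully

pend = dispatched - submitted   # time spent waiting in the queue
run = finished - dispatched     # time spent executing
total = pend + run

print(pend, run, total)  # prints: 6 5 11
```

The result matches the PEND, RUN and TOTAL columns of the summary. Note that the reported CPU time (9.6 seconds) is larger than the 5 seconds of run time because it is summed over the job's processes.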
test@node69:~> bjobs
JOBID USER STAT QUEUE   FROM_HOST EXEC_HOST JOB_NAME   SUBMIT_TIME
818   test RUN  QS_Norm node69    4*node10  * test.out Dec 21 19:29
LSF keeps each job's exit code, so it can be checked after the job ends:
  - Run "bhist -l <jobid>" or "bjobs -l <jobid>" to see the job's exit code.
  - Submit the job with "-o %J.out" and check the output file <jobid>.out.

Typical User Problems (cont'd): "My job dies under LSF"
  - Check the resource limits on the queue.
  - Check that the application and its data files are accessible from the execution host(s).
  - Is an application license available from the execution host?
  - Check the exit code reported by bjobs -l.
  - Common exit codes:
      126 - Command invoked cannot execute
      127 - Command not found
      130 - Script terminated by Control-C (128 + signal 2)

Typical User Problems: "My job was rejected by LSF"
  - Check the resource requirement string and the run time limit.
  - The job may have been submitted to an unauthorized queue or host.
  - The requested soft limits may exceed the queue's hard limits.

Typical User Problems (cont'd): "My job PENDs forever under LSF"
  - Has the user requested unrealistic resources, for example more memory than any host has?
  - The resource requirements may be too stringent.
  - Is the user's ID valid on the execution host(s)?
  - The user may have requested exclusive execution.
  - If FCFS scheduling is used, the user may be last in the queue.
  - If fairshare scheduling is used, the user may have exhausted their fairshare allocation.
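When a job dies, the exit code shown by bjobs -l or bhist -l is the first clue. The following Python sketch turns a numeric exit code into a short description. It is an illustration only: the function describe_exit_code is hypothetical, and the mapping is the usual shell convention (126 for a command that cannot be executed, 127 for a command that was not found, and values above 128 meaning 128 plus the number of the terminating signal), which is assumed rather than taken from the slides.

```python
import signal


def describe_exit_code(code):
    """Describe a job exit code using the common shell convention."""
    if code == 0:
        return "success"
    if code == 126:
        return "command found but cannot execute"
    if code == 127:
        return "command not found"
    if code > 128:
        # Convention: exit code = 128 + signal number.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal known on this platform.
            name = "signal %d" % signum
        return "terminated by " + name
    return "application-specific exit code %d" % code


for code in (0, 127, 130):
    print(code, describe_exit_code(code))
```

Exit codes between 1 and 125 are defined by the application itself, so for those the application's own documentation and the job output file are the places to look.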