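Python can submit jobs in batches, but if someone else submits a job over ssh while those jobs are running, the runs collide and the result is a disaster. Installing Torque, a job scheduler, on Ubuntu 18.04 avoids this.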
1. Set up passwordless ssh login
Generate the key files
$ ssh-keygen -t rsa
Enter the .ssh directory
$ cd ~/.ssh
Append the public key to the list of authorized keys
$ cat id_rsa.pub >> authorized_keys
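The key steps above can be wrapped in a small function that is safe to run more than once. This is only a sketch under stated assumptions: the function name authorize_key and its optional directory argument are hypothetical, and the key pair is assumed to exist already. It appends the public key only when it is not yet listed, and sets the restrictive permissions that sshd commonly expects.

```shell
#!/bin/sh
# authorize_key [DIR]: hypothetical helper; DIR defaults to ~/.ssh.
authorize_key() {
    dir="${1:-$HOME/.ssh}"
    pub="$dir/id_rsa.pub"
    auth="$dir/authorized_keys"
    [ -f "$pub" ] || { echo "missing $pub; run ssh-keygen first" >&2; return 1; }
    touch "$auth"
    # Append only if the exact key line is not already present.
    grep -qxF -f "$pub" "$auth" || cat "$pub" >> "$auth"
    # Keep the directory and the key list private to the owner.
    chmod 700 "$dir"
    chmod 600 "$auth"
}
```

Calling authorize_key with no argument works on ~/.ssh; afterwards "ssh localhost" should log in without asking for a password.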
2. Add the static IP to /etc/hosts
#127.0.0.1 localhost (commented out)
#127.0.1.1 chiustin (commented out)
140.112.182.xxx chiustin
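Since the Torque configuration refers to the server by hostname, it can help to confirm that the name maps to the static address and not to a loopback address. The following is a minimal sketch; the function name hosts_ip is hypothetical, and it reads a hosts-format file directly instead of querying the system resolver.

```shell
#!/bin/sh
# hosts_ip FILE NAME: print the first address that FILE maps NAME to,
# skipping commented-out lines and trailing comments.
hosts_ip() {
    awk -v name="$2" '
        /^[ \t]*#/ { next }                # ignore comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break       # ignore a trailing comment
                if ($i == name) { print $1; exit }
            }
        }
    ' "$1"
}
```

For example, "hosts_ip /etc/hosts chiustin" should print the static address. "getent hosts chiustin" asks the resolver itself and should agree.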
Reference
https://www.douban.com/note/239136237/
3. Temporarily add the xenial repositories
$ sudo vi /etc/apt/sources.list
Add the following two lines:
deb http://dk.archive.ubuntu.com/ubuntu/ xenial main
deb http://dk.archive.ubuntu.com/ubuntu/ xenial universe
4. Install Torque
Reference
https://blog.csdn.net/jideljd_2010/article/details/46575137
Install
$ sudo apt-get update
$ sudo apt-get install torque-server torque-client torque-mom torque-pam
Configure
$ sudo /etc/init.d/torque-mom stop
$ sudo /etc/init.d/torque-scheduler stop
$ sudo /etc/init.d/torque-server stop
$ sudo pbs_server -t create
$ sudo killall pbs_server
$ echo "$HOSTNAME" | sudo tee /etc/torque/server_name
$ echo "$HOSTNAME" | sudo tee /var/spool/torque/server_priv/acl_svr/acl_hosts
$ echo "root@$HOSTNAME" | sudo tee /var/spool/torque/server_priv/acl_svr/operators
$ echo "root@$HOSTNAME" | sudo tee /var/spool/torque/server_priv/acl_svr/managers
$ echo "$HOSTNAME np=12" | sudo tee /var/spool/torque/server_priv/nodes
$ echo "$HOSTNAME" | sudo tee /var/spool/torque/mom_priv/config
np is the number of processors on the compute node.
ps. Here $HOSTNAME always stands for chiustin, and root@$HOSTNAME for root@chiustin. The output is piped through sudo tee because a redirection written after sudo echo would be performed by the unprivileged calling shell.
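The np value does not have to be typed by hand. As a hedged sketch, the two helpers below (both names are hypothetical) build the nodes entry from the processor count reported by nproc, which is assumed to be available from GNU coreutils, and write a file with root privileges.

```shell
#!/bin/sh
# nodes_line HOST [NP]: print one server_priv/nodes entry.
# NP defaults to the processor count reported by nproc.
nodes_line() {
    host="$1"
    np="${2:-$(nproc)}"
    printf '%s np=%s\n' "$host" "$np"
}

# write_as_root FILE: write standard input to FILE as root.
# tee runs under sudo, so the write itself has root privileges.
write_as_root() {
    sudo tee "$1" > /dev/null
}
```

A possible use is: nodes_line "$HOSTNAME" | write_as_root /var/spool/torque/server_priv/nodes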
Start the PBS services
$ sudo /etc/init.d/torque-server start
$ sudo /etc/init.d/torque-scheduler start
$ sudo /etc/init.d/torque-mom start
Configure scheduling
$ sudo qmgr -c 'set server scheduling = true'
$ sudo qmgr -c 'set server keep_completed = 300'
$ sudo qmgr -c 'set server mom_job_sync = true'
Create a queue named batch (the name can be changed)
$ sudo qmgr -c 'create queue batch'
$ sudo qmgr -c 'set queue batch queue_type = execution'
$ sudo qmgr -c 'set queue batch started = true'
$ sudo qmgr -c 'set queue batch enabled = true'
Default walltime
$ sudo qmgr -c 'set queue batch resources_default.walltime = 1440:00:00'
There is only one compute node
$ sudo qmgr -c 'set queue batch resources_default.nodes = 1'
$ sudo qmgr -c 'set server default_queue = batch'
Use double quotes here so that the shell expands $HOSTNAME
$ sudo qmgr -c "set server submit_hosts = $HOSTNAME"
$ sudo qmgr -c 'set server allow_node_submit = true'
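If the queue name or host changes, the queue directives can be generated instead of retyped. The sketch below is an illustration only: the function name queue_commands is hypothetical, it covers just the queue-related directives from the list above, and it prints them without running anything.

```shell
#!/bin/sh
# queue_commands QUEUE HOST: print qmgr directives for one execution queue.
# Nothing is applied; the output is plain text for review.
queue_commands() {
    queue="$1"
    host="$2"
    cat <<EOF
create queue $queue
set queue $queue queue_type = execution
set queue $queue started = true
set queue $queue enabled = true
set server default_queue = $queue
set server submit_hosts = $host
EOF
}
```

Since qmgr reads directives from standard input when no -c option is given, the output can be applied with: queue_commands batch "$HOSTNAME" | sudo qmgr. Afterwards, qmgr -c 'print server' shows the resulting configuration.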
Reference
http://forum.ubuntu.org.cn/viewtopic.php?t=451723
Configuration is finished; now start everything for the first time.
First, stop all services:
$ sudo pkill "pbs_*"
First start
$ sudo qterm -t quick
If the following error appears:
Cannot connect to default server host 'chiustin' - check pbs_server daemon.
qterm: could not connect to server '' (111) Connection refused
$ sudo reboot
$ sudo killall pbs_server
$ sudo pbs_server -t create
Start all services
$ sudo pbs_server
$ sudo pbs_sched
$ sudo pbs_mom
If the following error appears after a reboot:
pbs_mom: LOG_ERROR::Resource temporarily unavailable (11) in pbs_mom, cannot lock '/var/spool/torque/mom_priv/mom.lock' - another mom running
restart pbs_mom:
$ sudo /etc/init.d/torque-mom stop
$ sudo /etc/init.d/torque-mom start
Reference
http://bbs.keinsci.com/thread-10134-1-1.html
https://www.cnblogs.com/zjyx/p/8448279.html
6. Testing
$ echo 'sleep 20' | qsub
$ qstat
Example job script (script.txt):
#!/bin/sh
#PBS -N test
#PBS -o out.txt
#PBS -e err.txt
##PBS -q chiustin
##PBS -l nodes=chiustin:ppn=8
#PBS -l walltime=192:00:00
#PBS -mea
#PBS -r n
#PBS -V
cd $PBS_O_WORKDIR
lmp_ubuntu -in in.shear
$ qsub script.txt
$ qstat
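For repeated runs, a job script like the one above can be generated from a few parameters. This is a minimal sketch, not the original author's workflow: the function name make_job_script is hypothetical, and it keeps only a subset of the #PBS directives shown in script.txt.

```shell
#!/bin/sh
# make_job_script NAME WALLTIME COMMAND...: print a PBS job script.
make_job_script() {
    name="$1"
    walltime="$2"
    shift 2
    # \$PBS_O_WORKDIR is escaped so it is expanded when the job runs,
    # not when the script is generated.
    cat <<EOF
#!/bin/sh
#PBS -N $name
#PBS -o out.txt
#PBS -e err.txt
#PBS -l walltime=$walltime
#PBS -V
cd \$PBS_O_WORKDIR
$*
EOF
}
```

For example: make_job_script test 192:00:00 lmp_ubuntu -in in.shear > script.txt, followed by qsub script.txt.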
Reference
https://icme.hpc.msstate.edu/mediawiki/index.php/Running_PBS_script_with_LAMMPS