How to use the kongull linux cluster


Obtaining a user and password

Contact Erlend Våtevik (e-mail: erlend.vatevik@ntnu.no) to obtain a login name and password for kongull.

Logging into kongull

Assuming you are using a linux system, the command to connect to kongull is:


ssh -X kongull.hpc.ntnu.no

This might look like

barn@barn-desktop:~$ ssh -X kongull.hpc.ntnu.no
barn@kongull.hpc.ntnu.no's password:
Last login: Mon Mar 8 14:55:42 2010 from ipt126.ipt.ntnu.no
Rocks 5.2 (Chimichanga)
Profile built 11:08 27-Jan-2010

Kickstarted 12:25 27-Jan-2010
[barn@kongull ~]$

Available software

To see what software is available, use the module command with the subcommand avail


module avail

which might look something like

[barn@kongull ~]$ module avail

------------------------------------------------------- /share/apps/modulefiles/ -------------------------------------------------------- 
4store/4.4.3(default)    gcc/3.3.6                intel/compilers/11.1.059 paraview/3.6.2           rasqal/0.9.19(default) 
adf/adf2009.01           gcc/3.4.6                jdk/jdk1.6.0_18(default) pcre/8.01(default)       vtk/5.4.2 
cmake/2.8.0              gcc/4.4.3(default)       matlab/R2009b            pgi/8.0-3                           
fftw/3.2.2(default)      hdf5/1.6.10(default)     openfoam/1.6.100218      raptor/1.4.21(default)         
[barn@kongull ~]$ 

To set up the user environment for a specific piece of software (in this case the Intel compiler), type

module load intel/compilers
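
As a small sketch of a typical session, the lines below load one specific compiler version, list what is loaded and unload it again. The version string is copied from the listing above. The guard around the commands and the variable module_status are additions for illustration only, so that the snippet does nothing harmful on a machine without the module system.

```shell
# Load a specific version of the Intel compilers, inspect, and unload again.
# The guard and module_status are illustrative additions, not part of kongull's setup.
if command -v module >/dev/null 2>&1; then
    module load intel/compilers/11.1.059    # version taken from the listing above
    module list                             # show the currently loaded modules
    module unload intel/compilers/11.1.059  # remove it when no longer needed
    module_status="module command used"
else
    module_status="module command not found"
    echo "$module_status" >&2
fi
```

Naming the full version instead of just intel/compilers makes a job reproducible even if the default version changes later.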

Seismic Unix and spl

Seismic Unix and spl (which contains finite-difference seismic modeling and migration) are not controlled by the module system and must be set up separately. To set up the necessary environment variables, copy the following text to the file ".profile" in your home directory:

#---- SU 
CWPROOT=/home/barn/sim/src/su
export CWPROOT
PATH=$CWPROOT/bin:$PATH

#--- MPI
PATH=/home/barn/sim/src/mpich-64/bin:$PATH

#--- SPL
SYS=x86_64
export SYS
SPL=/home/barn/sim/src/spl
PATH=/home/barn/sim/src/spl/bin/$SYS:$PATH
export SPL

#---- PBS
MANPATH=/opt/torque/man:$MANPATH
export MANPATH

To activate the profile, type

. .profile
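
To confirm that the profile took effect, a quick check along the following lines may help. The function name check_env is hypothetical and only serves this sketch; the variable names are the ones defined in the profile text above.

```shell
# Report which of the variables named as arguments are unset or empty.
# Prints one line per missing variable and returns non-zero if any is missing.
# (check_env is a hypothetical helper, not something installed on kongull.)
check_env() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name"
            missing=1
        fi
    done
    return $missing
}

# Variables defined by the profile text above
check_env CWPROOT SYS SPL || echo "re-run '. .profile' and check for typos"

# The modelling program should now be found on the PATH
command -v splfd2dmod || echo "splfd2dmod not found on PATH"
```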

The batch system

kongull is a cluster of approximately 80 separate machines (nodes) and is set up as a batch system. Jobs are submitted using a queuing system. To submit a job, type

  qsub < file.sh

where file.sh is a shell script containing the job and additional commands to the batch system. An example of a job file is shown below:


#!/bin/sh
#PBS -N fdmod1
#PBS -l nodes=401
#PBS -q bigmem


#
#-- Mpi resources
#
#
np=401   # Number of MPI processes (compute nodes)
nodes=8  # Number of MPI processes per physical node
Wrkdir=$HOME/Project/multi-shot  # Working directory

. $HOME/.profile
cd $Wrkdir

#-- Run modeling using mpi
mpirun -np $np  -nodes $nodes -machinefile $PBS_NODEFILE  $SPL/bin/$SYS/splfd2dmod \
  v=1                          \
  logfile=log                  \
  mp=1                         \
  dt=0.00025                   \
  lx=8000.0                    \
  time=4.0                     \
  smap=smap.m                  \
  kmap=kmap.m                  \
  max=1000,3000000,3000000     \
  min=1,0,0                     

The three lines beginning with #PBS are commands to the batch system. The first gives the name of the job, which can be anything you like; it is used as a tag for identification. The second is important and states how many compute nodes (CPUs) you want to use; the maximum is around 1000. The third identifies the batch queue to use. At the moment (March 2010) there are four queues:
  1. express: High priority, but the maximum time is 1 hour.
  2. default: This is where the job goes if no queue is specified. The maximum time is 168 hours. Only 50% of the cluster nodes are accessible.
  3. optimist: Low priority; the job may be killed. The maximum time is 628 hours. All nodes are accessible.
  4. bigmem: All nodes with large memory (approx. 50%) are accessible. The maximum time is 168 hours.
The rest of the file consists of commands specific to the type of software you are using. After a job is submitted, type

qstat

to see the status of the job. To get more detailed output, type

qstat -f
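
Submitting a job and checking on it can be combined in a small helper, sketched below. The function names job_number and submit_and_show are hypothetical. The sketch assumes that qsub prints the job identifier on standard output (as in the example response 1483.kongull.hpc.ntnu.no shown later in this guide) and that qstat -f accepts that identifier as an argument.

```shell
# Strip the server suffix from a job identifier such as 1483.kongull.hpc.ntnu.no
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Submit a job script and print the full status of the new job.
# Only works where the batch system commands are installed.
submit_and_show() {
    script=$1
    if ! command -v qsub >/dev/null 2>&1; then
        echo "qsub not found; run this on the cluster" >&2
        return 1
    fi
    jobid=$(qsub "$script") || return 1
    echo "submitted job $(job_number "$jobid")"
    qstat -f "$jobid"
}
```

On the cluster this would be called as submit_and_show file.sh.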

Full documentation of PBS is provided by the command

man pbs
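
As a variation on the job file above, a short test job for the express queue might look like the sketch below. The job name quicktest and the 30 minute walltime are made-up values chosen to stay under the 1 hour limit of that queue, and it is an assumption that kongull honours a walltime request in this form. PBS_O_WORKDIR and PBS_NODEFILE are set by the batch system when the job runs; outside the batch system the script falls back to the current directory.

```shell
#!/bin/sh
#PBS -N quicktest
#PBS -l nodes=1
#PBS -l walltime=00:30:00
#PBS -q express

# PBS_O_WORKDIR is the directory from which qsub was run
cd "${PBS_O_WORKDIR:-.}"
echo "running on $(uname -n)"

# PBS_NODEFILE lists the nodes assigned to the job, one line per process slot
if [ -n "${PBS_NODEFILE:-}" ]; then
    echo "nodes assigned to this job:"
    sort "$PBS_NODEFILE" | uniq -c
fi
```

Running a small job like this first is a cheap way to check the queue name and the environment before asking for several hundred nodes.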

Running a seismic modelling job

Copy an example job by typing

cp -R $SPL/demos/splfd2dmod/fdmod1 .

You will now have a directory containing scripts and a density and velocity model. Follow the instructions in the README file and modify the mod.sh and mod-small.sh files. When you are ready, type:

./job.sh

You should see an output approximately like this:

*** This is splfd2dgeom of July 2007
  
 *** Input parameters
 --- nfldr     :         400
 --- nrec      :         150
 --- dsx       :   25.00000    
 --- dgx       :   25.00000    
 --- sx_pos    :   3725.000    
 --- gx_pos    :  0.0000000E+00
 --- rectime   :   4.000000    
 --- dt        :  1.0000000E-03
 --- scalco    :          -1
 --- direction :           1
 === i, hwpos:            1              9
 === i, hwpos:            2             73
 === i, hwpos:            3             81
   percentage completed:            1
   percentage completed:            2
   percentage completed:            3
   percentage completed:            4
   percentage completed:            5
   percentage completed:            6
   percentage completed:            7
   percentage completed:            8
   percentage completed:            9
   percentage completed:           10
   percentage completed:           11
   percentage completed:           12
   percentage completed:           13
   percentage completed:           14
   percentage completed:           15
   percentage completed:           16
   percentage completed:           17
   percentage completed:           18
   percentage completed:           19
   percentage completed:           20
   percentage completed:           21
   percentage completed:           22
   percentage completed:           23
   percentage completed:           24
   percentage completed:           25
   percentage completed:           26
   percentage completed:           27
   percentage completed:           28
   percentage completed:           29
   percentage completed:           30
   percentage completed:           31
   percentage completed:           32
   percentage completed:           33
   percentage completed:           34
   percentage completed:           35
   percentage completed:           36
   percentage completed:           37
   percentage completed:           38
   percentage completed:           39
   percentage completed:           40
   percentage completed:           41
   percentage completed:           42
   percentage completed:           43
   percentage completed:           44
   percentage completed:           45
   percentage completed:           46
   percentage completed:           47
   percentage completed:           48
   percentage completed:           49
   percentage completed:           50
   percentage completed:           51
   percentage completed:           52
   percentage completed:           53
   percentage completed:           54
   percentage completed:           55
   percentage completed:           56
   percentage completed:           57
   percentage completed:           58
   percentage completed:           59
   percentage completed:           60
   percentage completed:           61
   percentage completed:           62
   percentage completed:           63
   percentage completed:           64
   percentage completed:           65
   percentage completed:           66
   percentage completed:           67
   percentage completed:           68
   percentage completed:           69
   percentage completed:           70
   percentage completed:           71
   percentage completed:           72
   percentage completed:           73
   percentage completed:           74
   percentage completed:           75
   percentage completed:           76
   percentage completed:           77
   percentage completed:           78
   percentage completed:           79
   percentage completed:           80
   percentage completed:           81
   percentage completed:           82
   percentage completed:           83
   percentage completed:           84
   percentage completed:           85
   percentage completed:           86
   percentage completed:           87
   percentage completed:           88
   percentage completed:           89
   percentage completed:           90
   percentage completed:           91
   percentage completed:           92
   percentage completed:           93
   percentage completed:           94
   percentage completed:           95
   percentage completed:           96
   percentage completed:           97
   percentage completed:           98
   percentage completed:           99
   percentage completed:          100
1483.kongull.hpc.ntnu.no

The last line is the response from the batch system and contains the job number. Typing

qstat

should give output similar to the following:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1447.kongull              cfbcycleb        yuefa           55:18:06 R default        
1448.kongull              cfbcyclea        yuefa           55:17:28 R optimist       
1461.kongull              job.sh           arnemort        268:04:1 R optimist       
1468.kongull              job.sh           arnemort        49:08:30 R optimist       
1469.kongull              job.sh           arnemort        46:10:27 R optimist       
1470.kongull              job.sh           arnemort        39:55:08 R optimist       
1471.kongull              job.sh           arnemort        39:51:30 R optimist       
1472.kongull              job.sh           arnemort        35:54:34 R optimist       
1473.kongull              job.sh           arnemort        38:25:09 R optimist       
1474.kongull              job.sh           arnemort        31:53:19 R optimist       
1475.kongull              job.sh           arnemort        26:24:05 R optimist       
1483.kongull              fdmod1           barn                   0 R bigmem 

This shows all jobs running at the moment and the status of each. An R in the S column indicates that the job is running. You should also see some log files with names like log-1, log-2, etc. These contain information generated by the modelling program. The modelling is parallelized and runs on 400 CPUs, so there is one log file for each CPU.
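
One way to follow the progress of a running job is to count the log files that have appeared so far. The helper below is a sketch; the name count_logs is hypothetical, and the log-N file name pattern is taken from the description above.

```shell
# Count the files named log-* in a directory (default: current directory).
# (count_logs is a hypothetical helper for illustration.)
count_logs() {
    dir=${1:-.}
    n=0
    for f in "$dir"/log-*; do
        # When nothing matches, the pattern is left unexpanded, so test for existence
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

echo "log files so far: $(count_logs .)"
```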

When the job is finished, two files will appear in the directory from which the job was started:

[barn@kongull fdmod1-sol]$ ls fdmod*
fdmod1.e1485  fdmod1.o1485

The first file contains what the program wrote to the Unix standard error stream, and the second contains what it wrote to standard output. In our case most of the output is captured in the log files, so these two files contain useful information only after a crash or serious error.
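
Since an empty error file normally means that nothing went wrong, a quick check after the job has finished could look like the sketch below. The function name nonempty_error_files is hypothetical, and the file name pattern is inferred from the example fdmod1.e1485 above.

```shell
# Print the names of non-empty batch error files (NAME.eJOBNUMBER) in a directory.
# (nonempty_error_files is a hypothetical helper for illustration.)
nonempty_error_files() {
    dir=${1:-.}
    for f in "$dir"/*.e[0-9]*; do
        # -s is true only for files that exist and are larger than zero bytes
        if [ -s "$f" ]; then
            echo "$f"
        fi
    done
}

nonempty_error_files .
```

Any file name printed here is worth opening; no output suggests that the job ended without messages on standard error.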


B. Arntsen NTNU May 2010