Howto

Get an account

All jobs run on BASS must be associated with a project (User_Groups) for accounting purposes. In addition, each user must provide the NIH grant and PI they are working under. A proposed paper title and author list are also required. This information is used to ensure that BASS is fulfilling its Purpose, and it is reported to the Advisory_Board regularly.

To get an account on BASS, send an email to bassaccounts@cs.unc.edu asking for an account on the machine. You must specify the information below. Incomplete requests will be delayed until all information is provided.

  • Desired Username
  • Email address
  • UNC PID
  • User group (selected from User_Groups)
  • NIH Grant number (may be empty for non-NIH researchers)
  • PI name
  • Paper working title ("in preparation for XYZ" is fine)
  • Proposed paper author list

Get on the mailing list

You'll probably want to be on the bass@cs.unc.edu mailing list so that you can get updates about machine status and such. You can get onto the list using the interface at https://fafnir.cs.unc.edu/mailman/listinfo/bass.

Set up your environment

Setting environment for MPI

See more at Using MPI

Some of the versions of the Message-Passing Interface (MPI) that we use launch parallel jobs over secure-shell (SSH) connections. This means that you need to set up an SSH key pair so that you can log on from one grid node to another without a password. This is done using the ssh-keygen program; run man ssh-keygen for details on how this works. The basic approach is as follows:

cd
mkdir -p .ssh                  # -p: no error if the directory already exists
chmod 0700 .ssh
ssh-keygen -t dsa              # Press return to accept all defaults
cd .ssh
touch authorized_keys
cat id_dsa.pub >> authorized_keys

Also, you need to be using the bash shell to run large MPI jobs; tcsh and other shells have too-small limits on the environment for this to work.
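
One way to sanity-check the result of the key setup above is a small helper like the following. This is only a sketch: the function name check_ssh_setup is made up for illustration, and it assumes the default key file name id_dsa.pub produced by the commands above. It checks that the files exist and that the public key appears in authorized_keys; it does not test an actual login.

```bash
#!/bin/bash
# check_ssh_setup [DIR]
# Hypothetical helper: verifies that DIR (default ~/.ssh) holds a public key
# and that the key has been appended to authorized_keys.
check_ssh_setup() {
    local dir="${1:-$HOME/.ssh}"
    local pub="$dir/id_dsa.pub"
    local auth="$dir/authorized_keys"

    [ -d "$dir" ]  || { echo "missing directory: $dir"; return 1; }
    [ -s "$pub" ]  || { echo "missing or empty public key: $pub"; return 1; }
    [ -f "$auth" ] || { echo "missing file: $auth"; return 1; }

    # -F fixed strings, -x whole-line match, -q quiet, -f read patterns from file
    if grep -qxF -f "$pub" "$auth"; then
        echo "ok: public key is listed in $auth"
    else
        echo "public key not found in $auth"
        return 1
    fi
}
```

Calling check_ssh_setup with no argument checks ~/.ssh. To confirm that the login itself works, ssh from one node to another and check that no password is requested.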

Selecting an MPI environment

There is a locally-built utility called pathmunge that can be used to set up your environment in several ways, one of which is to select an MPI version. To set the MPI version to openmpi-1.3.3, put the following in your .bash_profile file right after your PATH is set:

. /usr/local/bin/pathmunge.sh
pathmunge usempi openmpi-1.3.3

To see which versions are currently available, type

. /usr/local/bin/pathmunge.sh
pathmunge usempi list

Selecting a compiler

More details at Compiling

The default compiler is the version of GCC that the operating system shipped with (GCC 4.1 as of 2009/01/31). Other compilers are also available. To use the GCC 4.3 compilers, use gcc43 rather than gcc, and likewise g++43 and gfortran43 in place of g++ and gfortran.

To use the Sun Studio compilers using our locally-built pathmunge utility, place the following in your .bash_profile after your PATH is set:

. /usr/local/bin/pathmunge.sh
pathmunge prepend /opt/sunstudioceres/bin

Access AFS files

You can access files in AFS on the compile node, but not on other nodes (so don't rely on this for submitted jobs). This is because the AFS client has to be tightly coupled with the kernel and tends to make things unstable. Go ahead and copy files to your home directory that need to be accessed from the compute nodes. You can make a link to your AFS home directory from your bass home directory using a command like:

ln -s /afs/cs.unc.edu/home/`whoami` ~/unc_afs

There are two kinds of nodes:

  • The compile node (bass.cs.unc.edu) has access to your AFS directories.
  • The compute nodes (where parallel jobs run) do not have access to your AFS directories. The compute nodes run on the NFS file system local to the BASS, so all files needed to run your jobs must be copied to your BASS home directory space or other scratch space before execution.

Disk space configuration, access and usage

Each bass system has access to the following NFS3-mounted file systems:

/stage       8.0 TB, located on file server bass-thor.cs.unc.edu
/nanoscratch 1.0 TB, located on file server bass-files.cs.unc.edu
/home        1.5 TB, located on file server bass-files.cs.unc.edu

NOTE: only the /home space is backed up to tape!

The bass-files server uses a Sun StorageTek 6140 disk array on a SAN. The /stage data space is located on a Sun Model X4540 storage server. Each bass node mounts these file systems over NFS 3 from the bass-files and bass-thor file servers.

Use your home directory for compiling and general work, including source files and executables; it is the only space that is backed up. If you need to read or write large amounts of input and output data, use the /stage space.

Getting data files to and from the bass system

You can use an SFTP (secure FTP) client to connect to host bass.cs.unc.edu from anywhere and upload or download data to your home directory, /stage, or /nanoscratch space. You may want to try a GUI utility like FileZilla, which has clients available for Linux, Windows, and Mac OS X.

You can access files in /afs/cellname space on host bass.cs.unc.edu. Note that bass.cs.unc.edu is the only bass node that runs AFS. For example, you can copy files to and from your /afs/cs.unc.edu/home/user account on bass.cs.unc.edu.

You can connect from a Windows client to the Samba servers running on hosts bass-files.cs.unc.edu and bass-thor.cs.unc.edu:

\\bass-files.cs.unc.edu\home
\\bass-files.cs.unc.edu\nanoscratch
\\bass-thor.cs.unc.edu\stage

You can use the Linux smbclient utility (an FTP-like client) to access these disk shares directly from a Linux machine.

Note that the Windows SMB disk-share protocol is firewalled from outside the .cs.unc.edu domain. That is, you must have an IP address in the .cs.unc.edu domain to access Samba or any Windows share. For example, to connect to bass-files from a Windows machine, click Start->Run and enter "\\bass-files.cs.unc.edu". If you are logged into a Computer Science machine, you will not be prompted for a user name and password. Otherwise, you will need to enter them; enter your user name as "user@cs.unc.edu".

Users in the NSRG group can access nanodata at /nanodata on all nodes. If you are in this group, but don't have access to /nanodata, send an email to the mailing list.

Getting a shell for thread-parallel or CUDA jobs

If you want to develop and run thread-parallel programs that don't require running multiple jobs, and you want an interactive shell on a 16-way node to work on so that you don't clobber the compile node, you can use

qlogin -pe smp 16

to get a shell on one of the comp nodes and allocate 16 CPUs for your use. If you only are going to use 4 CPUs, you can use -pe smp 4 (for example, if you're running Matlab). If you want to run on a particular host (for example, bass-gpu35), use -l hostname="bass-gpu35".

You can request the high-memory node (128GB) by doing:

qlogin -l himem
qlogin -pe smp 16 -l himem

The first entry will allocate one processor for your job. The second will allocate 16. You can put a number other than 16 to reserve only some of the processors. Remember to reserve as many processors as you will use to avoid having other jobs placed on the node to compete with your job.

If you want a shell on a node that has two (or four) GPUs for CUDA work, you can use:

qlogin -l gpus=2,gpu_host
qlogin -l gpus=4,gpu_host

Note: Idle shells will be killed after 1 hour.

Submitting parallel jobs

For an X-Windows-based Grid Engine GUI, run the qmon command after setting your DISPLAY variable.

The best way to run jobs with the Sun Grid Engine is to make a special script file that describes the parameters of the run and then submit that script using the qsub command to the grid engine. The contents of that script depend on how the job should communicate: examples are provided here for several common cases.

Running a set of independent non-parallel jobs on the GRID

If you want to run many copies of the same program with different inputs, the script could look like the following, which runs 10 copies (from /home/examples/scripts/independent_jobs.bash):

# Special comment lines to the grid engine start
# with #$ and they should contain the command-line arguments for the
# qsub command.  See 'man qsub' for more options.
#
#$ -S /bin/bash
#$ -t 1-10
#$ -o /home/taylorr/tmp/$JOB_NAME.$JOB_ID.$TASK_ID
#$ -j y
#$ -cwd
# The above arguments mean:
#       -S /bin/bash : Run this set of jobs using /bin/bash
#       -t 1-10 : Run 10 separate instances with the SGE_TASK_ID set from 1 through 10
#       -o : Put the output files in ~/tmp, named by job name and ID, and task ID
#       -j y : Join the error and output files for each job
#       -cwd : Run the job in the Current Working Directory (where the script is)

# The following are among the useful environment variables set when each
# job is run:
#       $SGE_TASK_ID : Which job I am from the above range
#       $SGE_TASK_LAST : Last number from the above range
#               (Equal to the number of tasks if range starts with 1
#                and has a stride of 1.)

# This will be run once on each of the compute nodes selected, with the variable "$SGE_TASK_ID" set
# to the correct instance.
echo "This is job $SGE_TASK_ID"

This will produce a number of files in ~/tmp, named after the script with the grid-engine job ID in the name, that list the output from each job. If you want to run from a set of input files, you can name them file1 through file100, change the task range to -t 1-100, and use the following in place of echo:

myprogram < file$SGE_TASK_ID

The Sun Grid Engine scheduler will release each job as resources become available. The jobs will not be able to communicate with each other via either shared memory or MPI, and the jobs must not use multiple threads. If you want to use multiple threads, see the section below on running shared-memory parallel jobs.
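
If the input files do not follow a numbered naming pattern, one option is to list them in a text file, one per line, and have each task pick the line matching its task ID. The sketch below assumes a list file named inputs.txt (a hypothetical name) in the working directory; the helper name nth_line and the INPUT_LIST variable are also made up for illustration. SGE_TASK_ID defaults to 1 so that the script can be tried by hand outside the grid engine.

```bash
#!/bin/bash
#$ -S /bin/bash
#$ -t 1-10
#$ -j y
#$ -cwd

# nth_line N FILE: print line N of FILE (lines are numbered from 1).
nth_line() {
    sed -n "${1}p" "$2"
}

# SGE_TASK_ID is set by the grid engine for array jobs; default to 1 when
# the script is run by hand and the variable is unset.
task="${SGE_TASK_ID:-1}"
list="${INPUT_LIST:-inputs.txt}"   # hypothetical list file, one input path per line

if [ -r "$list" ]; then
    input=$(nth_line "$task" "$list")
    echo "Task $task would process: $input"
    # myprogram < "$input"
else
    echo "Task $task: list file $list not found"
fi
```

The task range given to -t should match the number of lines in the list file.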

Available queues

These are the queues available for user job submission on the machine:

  • comp.q : The default CPU queue, with 15 16-way shared-memory nodes, each with 32GB of RAM. This is where groups of single-processor jobs should normally be submitted.
  • himem.q : A 16-way shared-memory node with 128GB of RAM.
  • gpu.q : A queue with the same number of slots as there are GPUs on that node. Some GPU nodes have 4 GPUs; most have two.
  • gpu1.q : A queue with one slot per GPU node, no matter how many GPUs are on that node. Jobs will be allocated in a round-robin fashion on this queue.
  • gpucomp.q : The graphics-processor queue (actually, the CPUs associated with this queue). CPU jobs should not normally be submitted to this queue.
  • all.q : All of the above processors. Special permission is required to submit to this queue.

Run an MPI program on the GRID

The following is a sample bash shell script for a "hello world" MPI program. Some MPI examples are in the directory /home/examples/mpi on bass-comp0; this script is saved there under the name mpi.sh. Use qsub mpi.sh to run the hello_world MPI program. The output will be placed in your home directory as mpi.sh.o<jobid>; run the qstat command to determine the job ID. These examples can also be downloaded from File:Mpi talk.tar.gz. They were provided by Todd Gamblin from RENCI.

  • Run the following using qsub scriptname, where scriptname is the name you save the script under.
#$ -S /bin/bash
#$ -pe MPI 20
#$ -V
#$ -j y
#$ -cwd
# If using a starred MPI environment (See Using MPI):
mpirun hello_world
# Otherwise, comment out the line above and uncomment this one:
# mpirun -np $NSLOTS -hostfile $TMPDIR/machines hello_world
# ---------------------------

Compiling

To compile an MPI program, you must first set up your environment. To compile C/C++ programs so that they can run on BASS, use the mpiCC command in place of g++ or CC; it knows where to find all of the needed include files and libraries. More information about compiling is available at Using MPI.

Running an MPI job on the grid requires a slightly more complicated launch script. The script itself is run using the qsub command with the script as an argument. The following is an example TCSH script that will run a parallel ray-tracer from the examples directory.

  • Run the following using qsub scriptname, where scriptname is the name you save the script under (or /home/examples/scripts/mpi_raytrace.tcsh).
#$ -S /bin/tcsh -pe MPI 20 -V -o $HOME/tmp/$JOB_NAME.$JOB_ID -j y
#   -S /bin/tcsh : Run the jobs using /bin/tcsh on this script
#   -pe MPI 20 : Run in the "MPI" parallel environment, with 20 job slots
#       (MPI is the compute queue, gMPI is the GPU queue, aMPI is the all queue).
#   -V : All environment variables active within qsub should be exported to the job
#   -o : Put the output files in ~/tmp, named by job name and job ID
#   -j y : Join the error and output files for each job (must come after -o).
# NOTE: If your main shell is tcsh, you will only be able to submit jobs of up
# to about 350 slots until the environment-variable length limit is increased.
# To send larger jobs for now, both your login shell and the shell used to run
# the job must be bash.

# $TMPDIR/machines is filled in by the Grid Engine.
# $NSLOTS holds the number of slots (processes) allocated to the job.
setenv WDIR /home/examples/tests/sge/mpi
cd $WDIR
# If using a starred MPI environment (See Using MPI):
mpirun $WDIR/shootmpi -s 80 20 -r 1 $WDIR/1M1J.opt.wld ~/MPI_test.ppm
# Otherwise, comment out the line above and uncomment this one:
# mpirun -np $NSLOTS -hostfile $TMPDIR/machines $WDIR/shootmpi -s 80 20 -r 1 $WDIR/1M1J.opt.wld ~/MPI_test.ppm

It should complete in about five minutes once it has begun to run, producing a file named MPI_test.ppm in your home directory and an output file in a tmp directory under your home directory. The image file can be viewed with the GIMP program or with IrfanView.

The following is a BASH-shell script to run the same program.

  • Run the following using qsub scriptname, where scriptname is the name you save the script under (or /home/examples/scripts/mpi_raytrace.bash).
#$ -S /bin/bash -pe MPI 20 -V -o $HOME/tmp/$JOB_NAME.$JOB_ID -j y
#   -S /bin/bash : Run the jobs using /bin/bash on this script
#   -pe MPI 20 : Run in the "MPI" parallel environment, with 20 job slots
#       (MPI is the compute queue, gMPI is the GPU queue, aMPI is the all queue).
#   -V : All environment variables active within qsub should be exported to the job
#   -o : Put the output files in ~/tmp, named by job name and job ID
#   -j y : Join the error and output files for each job (must come after -o).

# $TMPDIR/machines is filled in by the Grid Engine.
# $NSLOTS holds the number of slots (processes) allocated to the job.
WDIR=/home/examples/tests/sge/mpi
cd $WDIR
# If using a starred MPI environment (See Using MPI):
mpirun $WDIR/shootmpi -s 80 20 -r 1 $WDIR/1M1J.opt.wld ~/MPI_test.ppm
# Otherwise, comment out the line above and uncomment this one:
# mpirun -np $NSLOTS -hostfile $TMPDIR/machines $WDIR/shootmpi -s 80 20 -r 1 $WDIR/1M1J.opt.wld ~/MPI_test.ppm

If you are going to submit the script from the same directory it should run in, you can avoid the whole $WDIR setting above by adding the line #$ -cwd to the script -- that will run it in the current working directory at the time it is submitted.

Available MPI parallel environments

These are the parallel environments (-pe option) available for MPI user job submission on the machine:

  • MPI : This is where MPI jobs should normally be submitted. It attempts to fill a node up with job instances before moving to the next node.
  • rrMPI : This parallel environment distributes job instances in a round-robin fashion to all of the available nodes.

Running a set of shared-memory parallel jobs on the GRID

To submit a number of independent jobs, each of which uses more than one shared-memory thread, submit using the '-t' option, but submit to a parallel environment rather than to a queue. If you want 10 jobs, each of which requires a 16-processor machine, use the following (/home/examples/scripts/independent_smp.bash):

#$ -S /bin/bash
#$ -t 1-10
#$ -o $HOME/tmp/$JOB_NAME.$JOB_ID.$TASK_ID
#$ -j y
#$ -pe smp 16

# Run the job.
echo "I am job $SGE_TASK_ID, and am running with 16 reserved processors"

OpenMP

The g++ compiler supports OpenMP when the -fopenmp flag is used on both the compile and link lines. If you want an OpenMP job to use only a specific number of processors (for example, 8) rather than all of the available processors, add the line

export OMP_NUM_THREADS=8

to your script before the program-execution line. This is useful if you need to be able to run on a host that already has one or two jobs running on it, or if you want to limit the size of each of your jobs so that several can fit on the same node. Make the number of processors requested match the number your job will use (-pe smp 8).
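
To keep the two numbers in step, the thread count can be taken from the slot count instead of being written twice. The following is a minimal sketch, assuming the grid engine sets $NSLOTS for jobs submitted to the smp parallel environment as it does in the MPI examples; my_openmp_program is a placeholder name.

```bash
#!/bin/bash
#$ -S /bin/bash
#$ -pe smp 8
#$ -j y
#$ -cwd

# NSLOTS is set by the grid engine to the number of slots granted (8 here).
# Fall back to 1 thread when the script is run outside the grid engine.
export OMP_NUM_THREADS="${NSLOTS:-1}"
echo "Using $OMP_NUM_THREADS OpenMP thread(s)"

# ./my_openmp_program    # placeholder for the real program
```

With this arrangement, changing the number after -pe smp is the only edit needed to resize the job.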

Compile and run CUDA programs

CUDA is the C-like language and environment developed by NVidia to enable general-purpose programming of the G80 and later series graphics cards. To get the CUDA SDK running on BASS, do the following:

  • Add /usr/local/cuda/open64/bin to your PATH (in tcsh and csh, this can be done by putting 'setenv PATH ${PATH}:/usr/local/cuda/open64/bin' into your .cshrc file).
  • Add /usr/local/cuda/lib64 to your LD_LIBRARY_PATH (in tcsh and csh, this can be done by putting 'setenv LD_LIBRARY_PATH /usr/local/cuda/lib64' into your .cshrc file, or if you already have an LD_LIBRARY_PATH then 'setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/usr/local/cuda/lib64').

You should be able to compile and link CUDA programs on bass-comp0.

Testing

  • Download and run the NVIDIA_CUDA_SDK_1.1_Linux.run script. Tell it where you would like to put the resulting SDK source code. You should not have to tell it where to find CUDA, because it is installed in the standard location.

Now you're ready to make the SDK example projects.

  • Type 'make'.

Then you should be able to run the programs (which are placed in bin/linux/release under your CUDA SDK directory).

Submitting your own jobs

Submit jobs to the GPU nodes by requesting a GPU host in your qsub invocation (-l gpu_host=TRUE).

To determine which GPU to run on, we assume that jobs are allocated sequentially and that only one parallel job is on each node. In that case, you can use your task ID or rank (available from SGE or MPI) and the number of GPUs on the node (available from CUDA) to determine which GPU to use:

  • SGE: $SGE_TASK_ID % GPUS_on_this_node.
  • MPI: MPI_rank % GPUS_on_this_node.
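
For the SGE case, the modulo rule above can be sketched in a job script as follows. The helper name pick_gpu and the GPUS_ON_THIS_NODE variable are hypothetical, and the default of 2 GPUs is only a placeholder; a real job would obtain the count from CUDA or from the node itself.

```bash
#!/bin/bash
# pick_gpu ID NGPUS: print ID modulo NGPUS, the GPU index this task should use.
pick_gpu() {
    local id="$1" ngpus="$2"
    if [ "$ngpus" -le 0 ]; then
        echo "no GPUs on this node" >&2
        return 1
    fi
    echo $(( id % ngpus ))
}

task="${SGE_TASK_ID:-1}"            # set by the grid engine for array jobs
ngpus="${GPUS_ON_THIS_NODE:-2}"     # hypothetical variable; 2 is a placeholder
gpu=$(pick_gpu "$task" "$ngpus") && echo "Task $task would use GPU $gpu"
```

The program itself still has to select the device, for example by passing the index to cudaSetDevice; how the index reaches the program (a command-line argument or an environment variable) is left open here.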

Details on CUDA mutual exclusion

(Comments from John Stone at UIUC.) There's presently no mutual exclusion mechanism in CUDA at all. Every process that wants to use the CUDA devices has to "fend for itself". In essence, this is the same problem as when one runs two programs on a single node, and they both want to allocate all of one of the shared resources (e.g. RAM, /tmp space, etc). With modern OSs, one can deal with most of these issues now by using kernel-enforced process limits that restrict how much physical/virtual memory a process or process group can use, and these features are now built into most of the queueing systems and enforce resource usage policies. Presently, there's no analogous mechanism on CUDA, though it would certainly be nice to have one. I'd previously suggested to NVIDIA that it'd be nice to have driver flags or other configurable settings to control what CUDA devices show up as available, when processes query for them. At the time my thought was mainly to avoid using GPUs that were already under heavy graphics load for CUDA calculations, but the situation you have both described shows that there would be a benefit to having some form of "limit" system that could interact with queueing systems and such, much like we already have for CPU/memory/disk resources.

Determine The Node Type

It can be useful to know what type of node a process is running on and to alter the process accordingly. Csh/tcsh provide a Perl-like =~ operator that matches a string against a glob pattern:

#!/bin/tcsh
set NCPUS=1

if ( $HOSTNAME =~ *gpu* ) then
    echo "this is a gpu node!"
    set NCPUS=4
else if ( $HOSTNAME =~ *comp* ) then
    echo "this is a comp node!"
    set NCPUS=16
else if ( $HOSTNAME =~ *himem* ) then
    echo "this is a himem node!"
    set NCPUS=16
endif 

Be careful with the syntax of the script, as csh/tcsh are picky (especially with where the 'then' goes).

In Linux, the actual processor count on a given host is available via the file /proc/cpuinfo.

#!/bin/tcsh
@ cpuc = `grep processor /proc/cpuinfo | wc -l`

For the Nvidia GPU hosts, /proc/bus/pci/devices can be grepped for the string, "nvidia":

#!/bin/tcsh
@ gpuc = `grep -i nvidia /proc/bus/pci/devices | wc -l`

The bass-comp nodes return 0; the bass-gpu nodes return either 2 or 4, depending on how many cards/quadroplexes are installed.
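
Since bash is the recommended shell for large jobs, the same checks can be written there with a case statement. This is a sketch that mirrors the tcsh examples above; the function name node_cpus is made up, and the CPU counts are the ones assumed by the tcsh script.

```bash
#!/bin/bash
# node_cpus HOST: print the CPU count assumed for HOST, based on its name.
node_cpus() {
    case "$1" in
        *gpu*)   echo 4 ;;
        *comp*)  echo 16 ;;
        *himem*) echo 16 ;;
        *)       echo 1 ;;
    esac
}

host="${HOSTNAME:-$(hostname)}"
NCPUS=$(node_cpus "$host")
echo "$host: assuming $NCPUS CPUs"

# Actual processor count, read from /proc/cpuinfo (Linux only).
if [ -r /proc/cpuinfo ]; then
    cpuc=$(grep -c '^processor' /proc/cpuinfo)
    echo "$host: /proc/cpuinfo lists $cpuc processors"
fi
```

As in the tcsh version, the patterns are tried in order, so a host name containing "gpu" is treated as a GPU node even if it also contains "comp".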

Resource limits

Wallclock time

Maximum wallclock time limits allow the grid engine to be smarter about how it assigns jobs, and they also make it possible to reserve parts of the cluster for periods in the future.

The default maximum wallclock time is 2 days. If your job takes longer than this allotted time, it will be killed by the grid engine. If you require more than 2 days per job, send an email to the mailing list. If you know your job will require less time, it's good manners to request a smaller window of time from the grid engine. To do this, use the -l h_rt=<time> argument to qsub. The time argument can either be given in seconds or Hours:Minutes:Seconds. It benefits everybody to make sure your time estimates are conservative, but not too conservative.
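
Because the limit can be given either way, it may help to see how the two forms relate. The helper below (hms_to_seconds is a made-up name) converts Hours:Minutes:Seconds to seconds; it is only an illustration and is not needed for submitting jobs.

```bash
#!/bin/bash
# hms_to_seconds H:M:S: print the equivalent number of seconds.
hms_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so that fields such as "08" are not read as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

hms_to_seconds 48:00:00    # the default 2-day limit: prints 172800
hms_to_seconds 1:30:00     # prints 5400
```

So -l h_rt=1:30:00 and -l h_rt=5400 request the same 90-minute window.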

Accessing files from a job

  • The compute nodes do not have access to AFS space.
  • They do have access to your BASS home directory, via NFS mounts, using the same paths as the compile node.
  • They also have access to temporary storage local to each node, available in the $TMPDIR environment variable. This points to a local disk partition on each compute node. You can create files within the $TMPDIR directory to store temporary results that do not need to persist beyond the end of the job.
  • They also have access to a grid-wide temporary scratch space in /stage. This can be used to send data between the compile node and the compute nodes during a run, but the data must be copied to permanent storage if it is to persist. The /stage partition is not backed up.
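
A common pattern that follows from the points above is to do intermediate work under $TMPDIR and copy only the final results to storage that outlives the job. The sketch below uses a made-up function name (run_with_scratch) and a trivial stand-in for the real computation; it falls back to /tmp when $TMPDIR is unset so that it can be tried outside the grid engine.

```bash
#!/bin/bash
# run_with_scratch RESULTS_DIR
# Writes intermediate data under $TMPDIR and copies the final file to
# RESULTS_DIR, which should be on storage that outlives the job.
run_with_scratch() {
    local results="$1"
    local work
    work=$(mktemp -d "${TMPDIR:-/tmp}/work.XXXXXX") || return 1

    # Stand-in for the real computation: write and transform a small file.
    echo "intermediate data" > "$work/partial.txt"
    tr 'a-z' 'A-Z' < "$work/partial.txt" > "$work/final.txt"

    # Keep only the final result; remember whether the copy succeeded.
    mkdir -p "$results" && cp "$work/final.txt" "$results/"
    local status=$?
    rm -rf "$work"      # scratch files are not needed after the job
    return $status
}
```

In a job script this might be called as run_with_scratch "$HOME/results", where the results directory name is again only an example.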

Watching the progress of your jobs

qstat: You can watch how the jobs you have submitted to the queue progress using the command qstat. It will show status 'qw' when the job is queued and waiting, and status 'r' when the job is running. To see running and queued jobs from all users, run

qstat -u "*"

To see why one of your jobs (say job number 5534) is not running, use

qalter -w p 5534

This describes what queues have been tried and why they didn't get used.

tail: Also, you can tail -f on the output files in your home directory to see what output each is producing when it is running.

Ganglia: Finally, you can point your web browser at the ganglia server from within the computer-science department to see how busy the whole machine is, or how busy parts of it are. Note that there are background jobs running on the machine, so it may look full when in fact there is little or no wait for new jobs; clicking on one of the subcluster nodes (GPU or CPU) will show which jobs are foreground (blue) and which are background (yellow).

Email Notification

To receive email notification when your job starts and ends, add

-m be -M user@example

to your qsub script or command line.

Killing a job

If you realize that your job is running amok, you can kill it using qdel with the job ID listed when you submitted it (also shown in qstat).
