Math Department Computing Cluster
Math Department Computing operates a general-purpose computing cluster, which is managed using the Torque/Maui job scheduling software. The cluster is available to all Math faculty and graduate students.
It currently comprises 16 dedicated compute nodes built on quad-core Intel Xeon X5355 2.66 GHz CPUs running in 64-bit mode. Each node contains 2 CPUs, providing a total of 24 CPUs in the cluster. Most nodes have 16 GB of RAM, though two nodes have 32 GB. Nodes are connected by a high-bandwidth, low-latency InfiniBand interconnect, as well as 10/100 Ethernet.
The cluster runs a Linux-based operating system and supports a variety of software, including MATLAB, Mathematica, Macaulay2, and an MPI implementation (MVAPICH2). Compiled C and Fortran programs can also be run in either 64-bit or 32-bit mode.
A cluster is a group of computers that are networked together and are managed by software so that they can be treated as one large machine. The cluster is managed by a program called the scheduler, which determines how best to use the resources (CPU, memory, disk space) provided by the cluster.
When you want to run something on the cluster, you need to let the scheduler know what you want to run and what resources it needs. This is called submitting a job. Your job may not run immediately. If the resources it needs are not currently available, the scheduler will keep your job in a queue until they become available.
To get access to the cluster, you will need to log in to the frontend server. Use your favorite ssh program to connect to cluster.math.vt.edu with your Math PID and password. You will be placed in your cluster home directory, which is separate from your Math home directory. A file share for your cluster home directory is available; please see the file share help page for information on using it.
On the cluster, your Math home directory is available only on the frontend. It is located at /math/PID, where PID is your Math PID. If you want to make files available to jobs running on the cluster, you will need to copy them to your cluster home directory, /home/PID.
Once your job is running, it can also create or copy files in scratch space, /scratch. Scratch space is on faster drives and should be used for temporary files or frequently accessed data files your programs are working on. Anything you want to keep should be copied to your home directory before your job finishes.
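As a rough illustration of the scratch workflow, the following shell sketch stages input into a per-job scratch directory, runs a command there, and copies the results back before cleaning up. The function name run_in_scratch, the SCRATCH_ROOT variable, and the directory naming scheme are assumptions made for this example; only /scratch itself comes from this page.

```shell
#!/bin/sh
# Hypothetical helper (not part of the cluster software): stage files into
# scratch space, run a command there, then save the results.
# SCRATCH_ROOT defaults to /scratch, the location named on this page.
SCRATCH_ROOT="${SCRATCH_ROOT:-/scratch}"

# usage: run_in_scratch INPUT_DIR RESULT_DIR COMMAND [ARGS...]
run_in_scratch() {
    input_dir=$1
    result_dir=$2
    shift 2

    # Per-job working directory; the naming scheme is an assumption.
    # PBS_JOBID is set by Torque inside a job; fall back to the shell PID.
    work_dir="$SCRATCH_ROOT/${USER:-user}.${PBS_JOBID:-$$}"
    mkdir -p "$work_dir" "$result_dir" || return 1

    # Copy the input files onto the faster scratch drives.
    cp -R "$input_dir"/. "$work_dir"/ || return 1

    # Run the command inside the scratch directory.
    ( cd "$work_dir" && "$@" )
    status=$?

    # Copy everything back before the job ends, then clean up scratch.
    cp -R "$work_dir"/. "$result_dir"/
    rm -rf "$work_dir"
    return $status
}
```

In a PBS file this might be called as run_in_scratch "$HOME/myproject" "$HOME/results" ./a.out, where the paths and program name are placeholders.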
You will need to do a few things before you are able to use the cluster. You will need access to your Math Home directory, you will need to create an ssh key, and you will need to enable password-less logins to the cluster nodes.
Creating an ssh key
On cluster.math you will want to run the following command:
Just hit the return key when asked for a passphrase; do not set one. When finished, you need to do the following:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
This will allow you to use your ssh key in place of your password.
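For the whole key setup, here is a minimal sketch that assumes the standard OpenSSH ssh-keygen tool and the default RSA key location ~/.ssh/id_rsa, which matches the id_rsa.pub file name used above. The function name setup_cluster_key is hypothetical, and the exact ssh-keygen options intended for this cluster may differ.

```shell
#!/bin/sh
# Hypothetical helper: create a passphrase-less RSA key if none exists and
# authorize it for logins. Assumes OpenSSH's ssh-keygen is installed.

# usage: setup_cluster_key [SSH_DIR]   (SSH_DIR defaults to ~/.ssh)
setup_cluster_key() {
    ssh_dir=${1:-"$HOME/.ssh"}
    key="$ssh_dir/id_rsa"
    auth="$ssh_dir/authorized_keys"

    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1

    # Generate the key only if it is missing; -N "" sets an empty passphrase.
    if [ ! -f "$key" ]; then
        ssh-keygen -t rsa -N "" -f "$key" || return 1
    fi

    # Append the public key unless that exact line is already present.
    touch "$auth"
    if ! grep -qxF "$(cat "$key.pub")" "$auth"; then
        cat "$key.pub" >> "$auth"
    fi

    # With its default StrictModes setting, sshd rejects an
    # authorized_keys file that other users can write to.
    chmod 600 "$auth"
}
```

Appending with a duplicate check means the function can be run more than once without overwriting other authorized keys.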
Host keys for all the cluster servers are automatically copied to your cluster home directory when you first log in to the cluster. If this file (.ssh/known_hosts) is lost or corrupted, you can download a new copy here.
Matlab Parallel Configuration
To run MATLAB jobs using the Distributed Computing Environment, you need to set up the Parallel Configuration. A sample is provided here and is also located in /opt/matlab/mathcluster.mat on the frontend machine.
To use this configuration:
- Run matlab on the frontend machine, and select Manage Configurations on the Parallel menu.
- In the Configurations Manager window, select File->Import.
- Browse to the mathcluster.mat file and then click the Open button.
- Back in the Configurations Manager window, select the Math Cluster configuration and then click the Start Validation button to make sure you have everything set up correctly.
Job Submission Files
A job submission file, referred to as a PBS file, is a simple text file that does two things: it tells the cluster what resources you will need, and it tells the cluster what to run. A sample file that runs a job on a single CPU looks like this:
#PBS -N MYJOB
#PBS -S /bin/sh
#PBS -M PID@math.vt.edu
#PBS -m ea
matlab -nodisplay -nojvm -r MATLAB_MFILE > OUTPUT_FILE 2>&1
PID is your Math PID. With -m ea you will get an email message when your job finishes or is aborted; add b to the -m option to also be notified when it starts. Any output from the job will be in your home directory.
A more complicated job that runs in parallel on 2 nodes, using 3 processors on each, would look something like this:
#PBS -N MYJOB
#PBS -S /bin/sh
#PBS -l nodes=2:ppn=3
#PBS -M PID@math.vt.edu
#PBS -m ea
myMPIprogram > OUTPUT_FILE 2>&1
nodes should be set to how many different machines (max of 3 currently) you want your job to run on. ppn is how many processors (cores) you want to use on each requested machine. Each machine has up to 8 available.
The Math cluster supports some specialty options such as high memory servers. Please see Cluster Reference for more information on making use of these.
Jobs can be submitted to the cluster only from the frontend (cluster.math.vt.edu).
You need these things:
- A job submission file
- Program and/or data files copied to your cluster home directory
Assuming that your PBS file and all necessary program files are in place under your home directory, log in to cluster.math.vt.edu and then run something like
qsub job.pbs
Job Status and Control
Run the following command on cluster.math.vt.edu
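Torque's standard tools for checking and controlling jobs are qstat and qdel. A minimal sketch of the usual invocations follows, wrapped in hypothetical helper functions (my_jobs, job_details, cancel_job) and assuming the stock Torque command-line clients are on the frontend's PATH:

```shell
#!/bin/sh
# Hypothetical wrappers around standard Torque commands; the function
# names are made up for this example, the qstat/qdel options are Torque's.

# List jobs belonging to a user (defaults to the current user).
my_jobs() {
    qstat -u "${1:-$USER}"
}

# Show the full status record for one job ID, as printed by qsub.
job_details() {
    qstat -f "$1"
}

# Remove a queued or running job from the scheduler.
cancel_job() {
    qdel "$1"
}
```

For example, my_jobs lists your own jobs, and cancel_job followed by a job ID removes that job.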
MPI is available on the Math cluster through a package called MVAPICH2, which supports communication over InfiniBand connections. The InfiniBand interfaces should be faster than Ethernet, but if you are having problems with MPI, it is worth trying the Ethernet interfaces.
In general, you will use an mpd running on each node. A helper script, mpisetup.sh, has been created to set up and run an mpd on each node the scheduler assigns to your job. Each mpd uses an InfiniBand interface for its communications by default. To use a different interface, specify eth (Ethernet) or ib1 (alternative InfiniBand) as an argument to mpisetup.sh.
#PBS -l nodes=3:ppn=2
mpisetup.sh
mpdtrace -l
mpirun -np 6 ./a.out
This should run using 2 processor cores on each of 3 nodes, for a total of 6 processor cores.
mpdtrace is an optional step that may help in debugging problems.
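Rather than hard-coding the process count, a job script can derive it from the node file that Torque writes for each job. The sketch below assumes Torque's usual behavior of listing one line per allocated core in the file named by $PBS_NODEFILE; the function name count_slots is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count the processor slots assigned to a job.
# Assumes Torque's node file has one line per allocated core, so
# nodes=3:ppn=2 yields 6 lines.

# usage: count_slots [NODEFILE]   (defaults to $PBS_NODEFILE)
count_slots() {
    nodefile=${1:-$PBS_NODEFILE}
    # Count non-empty lines; grep -c prints the number of matching lines.
    grep -c . "$nodefile"
}

# Inside a job script this could replace the literal 6:
#   mpirun -np "$(count_slots)" ./a.out
```

This keeps the mpirun line in step with the #PBS -l request if you later change nodes or ppn.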
Don't forget to have a .mpd.conf file in your home directory with a
line setting the MPD_SECRETWORD variable to some value. You will know
you have to do this if you get an error from mpdboot_frontend like this:
mpdboot_frontend (handle_mpd_output 406): from mpd on 10.10.0.7, invalid port info:
To do this easily, try the following:
echo MPD_SECRETWORD=xxxxxxxxxxx > ~/.mpd.conf
chmod 600 ~/.mpd.conf
where xxxxxxxxxxx is some secret password that you want to use. Be careful if using characters like quotes, *, or other special characters in your password.
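One way to sidestep quoting problems is to generate a secret word from hexadecimal digits only. This is a sketch under that assumption; the function name write_mpd_conf and the 16-byte length are arbitrary choices for illustration, not requirements of MVAPICH2.

```shell
#!/bin/sh
# Hypothetical helper: write an .mpd.conf with a random secret word made
# only of hex digits, so no shell-special characters are involved.
# Note: like the echo command above, this overwrites an existing file.

# usage: write_mpd_conf [FILE]   (defaults to ~/.mpd.conf)
write_mpd_conf() {
    conf=${1:-"$HOME/.mpd.conf"}

    # 16 random bytes rendered as 32 hex characters.
    secret=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')

    # For a new file, umask 077 keeps it private from the moment it is
    # created; the chmod covers a file that already existed.
    ( umask 077 && echo "MPD_SECRETWORD=$secret" > "$conf" )
    chmod 600 "$conf"
}
```

Calling write_mpd_conf with no arguments writes ~/.mpd.conf with owner-only permissions.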
Please see the MVAPICH Users Guide for more information on using MPI.