Using a CPU farm
Last Time and Today
Last time we discussed strings in the context
of PERL.
Other advanced scripting languages have
similarly powerful tools.
Today we talk about CPU farms.
To maximise their power we will take advantage
of advanced scripting.
What is a CPU farm?
A CPU farm is a collection of processors that
can be used to process many jobs in parallel.
It is not, strictly speaking, parallel processing,
as the jobs can be carried out in series.
CPU farms can allow you to carry out much
more intensive analysis than is possible if you
just have a single CPU.
What does a farm possess?
A front end
This will be the machine that you log onto.
Disk
There should be “a lot” of disk space available that you can
access from the front end and the nodes.
Many farm nodes.
These are the CPUs that do the work.
They will often also have their own local disk.
A network
The nodes will be connected to the front end by a network.
The network capacity can be the limiting factor.
The front end
When using the farm you will spend most of your
time on the front end.
Typically this will have the same OS as the nodes.
You can compile code here.
You submit jobs from the front end to the nodes.
You manage the disk on the front end.
You might take a quick look at the output here.
Remember the front end will have many other users,
so try to be as undisruptive as possible.
The nodes
The nodes are where your CPU time is spent.
Usually they will have local disk.
Using this will cut down on network traffic.
This improves farm performance.
Be careful about how much space is available.
On some farms the same box may be several
nodes.
Dual CPU machines
Hyperthreading.
They will have high memory, but watch out for
programs with very high memory usage: they may
not play well together.
Jobs on a farm
Jobs on a farm may be thought of a bit like
files on a file system.
There are commands that can
list them
delete them
They have an owner
You cannot
copy them
do (many) operations on them.
An example.
To illustrate the use of a farm I will use an
example.
The CPU farm at RAL PPD
Generally you will be able to get an account
here if you need it (hep).
The commands are one implementation of
the grid engine.
Most farms use the same commands.
All farms have commands that do the same thing.
Submitting a job
You use the command qsub
This will return a report to the screen, that
includes a unique job id for the job you just
submitted.
You may need this job id later.
qsub my_job.scr
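Because the job id is needed later, it can help to capture it when you submit. The sketch below is written for a POSIX sh rather than tcsh. The function name extract_job_id and the file submitted_jobs.txt are hypothetical. It assumes the id is the first run of digits in whatever qsub prints, which is not guaranteed on every farm, so check the output on your own system first.

```shell
#!/bin/sh
# Extract a numeric job id from the text that qsub prints.
# Assumption: the id is the first run of digits in the output, as in
# "12345.farm.example" or "Your job 12345 has been submitted".
extract_job_id() {
    printf '%s\n' "$1" | sed -n 's/^[^0-9]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Possible use on a front end where qsub exists:
#   job_id=$(extract_job_id "$(qsub my_job.scr)")
#   echo "$job_id" >> submitted_jobs.txt
```

Keeping the ids in a file gives a record that can be passed to other commands later.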
Listing the jobs submitted to a farm.
To list the jobs use the command qstat
This will tell you
The job name
The job ID
Its status
The running time
The owner.
Use qstat -u username to see the jobs belonging to
a particular user.
There are other useful switches
See the man page.
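A script often needs only the number of jobs a user has, not the full listing. The sketch below counts lines of qstat-style text that contain a given username as a whole field. The function name count_user_jobs is hypothetical, and the column layout of qstat differs between farms, so treat the matching rule as an assumption to check against your own output.

```shell
#!/bin/sh
# Count the lines on standard input that contain username $1 as a
# whole whitespace-separated field.
# Assumption: the listing has one line per job and shows the owner
# as a separate column.
count_user_jobs() {
    awk -v user="$1" '
        { for (i = 1; i <= NF; i++) if ($i == user) { n++; break } }
        END { print n + 0 }'
}

# Possible use on a front end where qstat exists:
#   qstat | count_user_jobs "$USER"
```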
Job Status
Running
A job that is currently running on a node.
You will be able to see how long it has been running with qstat.
Queued
A job that is waiting for a free node.
Terminated
A job that is finished. You won’t see these in qstat
Suspended/Error
Something has happened to the job and it is in an error state.
This is probably your fault.
But it could be a system error, so it is worth restarting these
once.
Deleting a job
You can use the qdel command
qdel jobid
The job id can be obtained from qsub or
qstat.
You will only be able to delete your own jobs.
Unless you have manager privileges.
Different queues
Some farms (like RAL) have different queues.
These are to separate resources for different
groups (experiments) from the main queue.
There may also be a fast queue
A queue with few nodes and a short maximum
duration
Good for testing.
The queue is specified in the qsub command.
Maximum duration
Some queues expect a maximum duration.
When exceeded the job will be terminated.
Set using qsub
At RAL: qsub -l cput=24:00:00 myjob.scr
The duration you set can control which queue
you use and the resources available.
Be careful when setting the maximum
duration: try to keep it short, but long
enough that your job will finish.
The local disk
Use the local disk whenever possible
Copy data files to it at the beginning.
Use it for output and temp files.
Copy output to main disk at the end.
Take care however not to fill the disk.
On many systems the local disk is available
as $TMPDIR
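One way to avoid filling the local disk is to check the free space before copying data in. The sketch below is written for a POSIX sh, not the tcsh used elsewhere in these slides. The function names free_kb and has_room and the 1000000 KB threshold are made up for illustration. It relies on the portable output layout of df -P, where the fourth column is the available space.

```shell
#!/bin/sh
# Free space, in kilobytes, on the file system that holds directory $1.
# "df -P" forces one line per file system, so the second line of the
# output describes the file system and its fourth column is the space
# still available.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds if directory $1 has at least $2 kilobytes free.
has_room() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Possible use at the top of a job script (the limit is an example value):
#   has_room "$TMPDIR" 1000000 || { echo "local disk nearly full" >&2; exit 1; }
```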
General advice
Use a shell script to control your job.
Don’t directly submit your executable.
This gives you more control.
Use an advanced script to control your queue
usage.
Write command files
Write job shell scripts.
Submit jobs.
Don’t submit too many jobs at once.
A typical job script.
#!/bin/tcsh
# Setup my environment
source ~/env_script.csh
# Use the local disk.
cd $TMPDIR
# Copy needed data to local disk
cp ~/input_data/my_data .
# Run my analysis code
$SNO_CODE/snoman.exe -c mycmd.cmd
# Copy my results back to the data disk
cp result.ntp ~/output_data/
Job Master Script
You can put much of what we discussed in the
last two lectures into action.
Writing multiple command files and shell scripts.
Running system programs and analysing their
output.
Examining the output of your analysis programs.
You can put limits on the number of jobs in
two ways.
Using the sleep command when too many jobs
are submitted.
Using a cronjob
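The sleep approach can be sketched as a loop that waits while too many of your jobs are on the farm and submits the next one when there is room. This is a POSIX sh sketch under stated assumptions, not a definitive master script. The names MAX_JOBS, POLL_SECONDS, my_job_count and submit_throttled are hypothetical. It assumes qstat -u prints one line per job containing your username, and no other lines containing it.

```shell
#!/bin/sh
# Hypothetical limits: at most MAX_JOBS of my jobs on the farm at once,
# checking again every POLL_SECONDS seconds.
MAX_JOBS=${MAX_JOBS:-20}
POLL_SECONDS=${POLL_SECONDS:-60}

# Number of jobs I currently have on the farm.
# Assumption: each of my jobs appears as one line containing $USER.
my_job_count() {
    qstat -u "$USER" | grep -c "$USER"
}

# Submit each job script given as an argument, sleeping while the
# number of my jobs is at or above the limit.
submit_throttled() {
    for script in "$@"; do
        while [ "$(my_job_count)" -ge "$MAX_JOBS" ]; do
            sleep "$POLL_SECONDS"
        done
        qsub "$script"
    done
}

# Possible use:
#   submit_throttled job_001.scr job_002.scr job_003.scr
```

The limit counts both queued and running jobs, so the farm never holds more than MAX_JOBS of your jobs at a time.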
Cronjobs
A cronjob is a job that runs at a scheduled
time.
Your cronjobs are controlled by your crontab.
Not allowed on all systems (including RAL).
To edit your crontab use
crontab -e
Your $EDITOR variable decides which editor is used.
You need to exit the editor for the change to take
effect.
A typical crontab.
# My program
0 * * * * my_program > /dev/null 2>> /dev/null
#End of crontab.
The first five fields give the time to run the job.
Redirect output. Otherwise you’ll get an email.
Add a comment to end the crontab. You need a newline
at the end of each command.
Time is specified by five fields:
minute, hour, day of month, month, day of week.
* is a wild card that means any value.
When the system time matches these fields the job will
run.
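As an illustration of how the five fields combine, here are two crontab entries. The program name weekly_report is hypothetical and my_program is the placeholder used above. Day of week 1 means Monday in the standard crontab format.

```
# minute hour day-of-month month day-of-week command
# Run at minute 0 of every hour, every day.
0 * * * * my_program > /dev/null 2>> /dev/null
# Run at 06:30 every Monday.
30 6 * * 1 weekly_report > /dev/null 2>> /dev/null
#End of crontab.
```

A field holding a number restricts the job to that value, while * leaves the field unrestricted.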
The GRID
The GRID is an extension of single site
farms.
A farm of farms.
It will be used extensively in LHC and other
experiments.
The use of this is beyond the scope of this
course.
When the time comes to use it you can talk to
your collaborators.
Exercises
Adapt your multiple command file script to
control job submission to a farm.
Write a program to send you a simple email,
and add it to your crontab so it will send it at
midnight.