GridX1
From ATLAS-TRIUMF
GridX1 is a Canadian grid project. The main project page is here. This area is for the development of documentation, installation and user instructions.
Installation Instructions
User Interface
- Download the 6.7.10 development release for your OS from Condor downloads. A simple registration is required.
- The rpm installs into /opt/condor-6.7.10. Make a soft link:
ln -s /opt/condor-6.7.10 /opt/condor
- The following directories store the persistent queue information and should be kept when upgrading, which is why they live outside the versioned install directory:
mkdir -p /opt/condor_var/spool /opt/condor_var/log
- Then edit /opt/condor/etc/condor_config:
CONDOR_HOST = condorg.triumf.ca
RELEASE_DIR = /opt/condor
LOCAL_DIR = /opt/condor_var
CONDOR_ADMIN = whatever
ENABLE_GRID_MONITOR = TRUE
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 5000
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE = 5
GRIDMANAGER_MAX_PENDING_REQUESTS = 1000
GRIDMANAGER_GAHP_CALL_TIMEOUT = 900
- Paste the following into /etc/init.d/condor:
#! /bin/sh
#
# condor Start/Stop condor processes
#
# chkconfig: 345 90 10
# description: all condor related processes are bootstrapped here.

# Source function library.
. /etc/init.d/functions

export CONDOR_CONFIG=/opt/condor/etc/condor_config
export PATH=$PATH:/opt/condor/bin:/opt/condor/sbin
export CONDOR_IDS=0.0

prog=condor
lock=/var/lock/subsys/condor
RETVAL=0

pids=`/sbin/pidof condor_master`
if [ "$pids" != "" ] ; then
    running=1
else
    running=0
fi

start() {
    echo -n $"Starting $prog: "
    ## being extra careful : check both lock file and PIDs
    startit=0
    if [ ! -f "$lock" ]; then
        if [ "$running" = "1" ] ; then
            echo $" is already running..."
        else
            startit=1
        fi
    else ## lock file exists - should it?
        if [ "$running" = "1" ] ; then
            echo $" is already running..."
        else
            startit=1
        fi
    fi
    if [ "$startit" = "1" ] ; then
        daemon /opt/condor/sbin/condor_master
        RETVAL=$?
        echo
        [ $RETVAL -eq 0 ] && touch $lock
        return $RETVAL
    fi
}

stop() {
    echo -n $"Stopping $prog: "
    stopit=0
    if [ "$running" = "1" ] ; then
        stopit=1
    else
        echo $" is not running..."
        rm -f $lock 2>/dev/null
    fi
    if [ "$stopit" = "1" ] ; then
        condor_off -master
        RETVAL=$?
        echo
        [ $RETVAL -eq 0 ] && rm -f $lock
        return $RETVAL
    fi
}

restart() {
    stop
    start
}

# See how we were called.
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    status)
        if [ "$running" = "1" ] ; then
            echo $"$prog is running ..."
            condor_status -master
        else
            echo $"$prog is not running..."
        fi
        ;;
    restart)
        restart
        ;;
    condrestart)
        if [ -f $lock ] ; then
            stop
            start
        fi
        ;;
    *)
        echo $"Usage: $0 {start|stop|status|restart|condrestart}"
        exit 1
esac
exit $?
- Start Condor and make it start on reboot:
/etc/init.d/condor start
chkconfig condor on
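Before starting the daemons it can help to confirm that the layout described in the steps above is in place. The following is only a minimal sketch and is not part of Condor or GridX1: the function name check_condor_install and its optional root-prefix argument are assumptions made for illustration, and the list of keys is just a subset of the settings shown in the condor_config step.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Condor): check the install layout.
# Usage: check_condor_install [ROOT]
#   ROOT is an optional prefix (default: empty, i.e. the real filesystem root),
#   which makes it possible to try the check against a scratch directory.
check_condor_install() {
    root="${1:-}"
    status=0

    # /opt/condor should be a soft link to the versioned install directory.
    if [ ! -L "$root/opt/condor" ]; then
        echo "missing symlink: $root/opt/condor"
        status=1
    fi

    # The persistent spool and log directories live outside the release tree.
    for dir in spool log; do
        if [ ! -d "$root/opt/condor_var/$dir" ]; then
            echo "missing directory: $root/opt/condor_var/$dir"
            status=1
        fi
    done

    # Each of these settings should be assigned at least once in condor_config.
    config="$root/opt/condor/etc/condor_config"
    for key in CONDOR_HOST RELEASE_DIR LOCAL_DIR CONDOR_ADMIN ENABLE_GRID_MONITOR; do
        if ! grep -Eq "^[[:space:]]*$key[[:space:]]*=" "$config" 2>/dev/null; then
            echo "missing setting: $key in $config"
            status=1
        fi
    done

    return $status
}
```

Called without arguments, the function checks the real /opt tree and returns non-zero if anything listed above is missing.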
Compute Element
Condor
- Download the 6.7.10 development release for your OS from Condor downloads. A simple registration is required.
- The rpm installs into /opt/condor-6.7.10. Make a soft link:
ln -s /opt/condor-6.7.10 /opt/condor
BLAHP
This is the interface between Condor and the local batch system (only PBS and LSF are supported). If the local batch system is Condor itself, no interface is needed.
- Install glite
Submitting jobs
This section deals with submitting jobs to GridX1 via CondorG. There are two ways that one can do this: (1) defining one's own JDL (job definition language) file and job executable script, and (2) using some predefined templates.
Method I
The JDL file is used by CondorG to submit the job executable. Included below is a rather basic template that can be used to submit a job to hep.westgrid.ca. To use this template, replace anything between << >>. Note that the job executable is sent over the network by CondorG to a worker node, so it is best to use a small shell script which takes care of fetching input files, executables, etc. once the job has landed on a node.
Executable = <<local path to executable>>
Dir = <<location for job output>>
Output = $(Dir)/<<stdout output>>.$(Cluster)
Error = $(Dir)/<<stderr output>>.$(Cluster)
Log = $(Dir)/<<CondorG logfile for this job>>
globusscheduler = hep.westgrid.ca:2119/jobmanager-pbs
globusrsl = (maxWalltime=<<max walltime, in minutes>>)
periodic_release = ((CurrentTime-EnteredCurrentStatus) > 10) && (HoldReason =!= "via condor_hold (by user $(USER))")
globus_resubmit = NumGlobusSubmits <= NumSystemHolds
leave_in_queue = jobstatus == 4
Universe = Globus
Notification = Never
Copy_to_Spool = False
Environment = <<environment variables, specified as NAME=VALUE>>
Arguments = <<arguments to pass to the executable at runtime>>
Transfer_Executable = True
+stream_output = false
+stream_error = false
+Type = "job"
queue
To submit a job to different job-managers on GridX1, find the appropriate URL in the GridMonitor/ClassAd section on the GridX1 website, and use this value for the JDL variable 'globusscheduler' in the above template.
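Filling in the << >> placeholders by hand becomes tedious when many similar jobs are submitted. One possible approach, sketched here under assumptions, is to keep a template with substitution tokens and render it with sed. The tokens (@EXECUTABLE@, @OUTDIR@, @WALLTIME@), the function name make_jdl, and the output file names stdout, stderr and condorg.log are all illustrative choices, not CondorG conventions, and only a subset of the template lines is reproduced.

```shell
#!/bin/sh
# Hypothetical helper: render a JDL by replacing @TOKEN@ markers in a template.
# Usage: make_jdl EXECUTABLE OUTPUT_DIR WALLTIME_MINUTES
# Caveat: '|' is used as the sed delimiter, so the values must not contain
# '|', '&' or backslashes.
make_jdl() {
    sed -e "s|@EXECUTABLE@|$1|g" \
        -e "s|@OUTDIR@|$2|g" \
        -e "s|@WALLTIME@|$3|g" <<'EOF'
Executable = @EXECUTABLE@
Dir = @OUTDIR@
Output = $(Dir)/stdout.$(Cluster)
Error = $(Dir)/stderr.$(Cluster)
Log = $(Dir)/condorg.log
globusscheduler = hep.westgrid.ca:2119/jobmanager-pbs
globusrsl = (maxWalltime=@WALLTIME@)
Universe = Globus
Transfer_Executable = True
queue
EOF
}

# Example: print a JDL for a 60-minute job (paths are placeholders).
make_jdl "$HOME/jobs/run_job.sh" "$HOME/jobs/output" 60
```

Redirecting the output to a file such as JOB.jdl gives something that can be passed to condor_submit. The quoted heredoc delimiter keeps the shell from expanding $(Dir) and $(Cluster), which must reach CondorG unchanged.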
Because jobs vary so widely, there is no single recipe for what a job executable should do. For ATLAS jobs, however, it should include at least the following:
- setting up the ATLAS environment
- staging input datasets
- running Athena
- staging the output data
- reporting detailed log information in a meaningful way
When a job lands on an ATLAS node, an environment variable $LCG_GC_ENV is available, which points to a setup script. Sourcing this setup script defines, among others, the following self-explanatory environment variables:
- $VO_ATLAS_SW_DIR
- $VO_ATLAS_DEFAULT_SE
- $X509_CERT_DIR
ATLAS releases are stored in $VO_ATLAS_SW_DIR/software/<<VERSION>>. To set up the entire release, add the lines
source $VO_ATLAS_SW_DIR/software/<<VERSION>>/setup.sh
source $VO_ATLAS_SW_DIR/software/<<VERSION>>/dist/<<VERSION>>/AtlasRelease/*/cmt/setup.sh
to the job executable script. This sets the $LD_LIBRARY_PATH, $PATH, $CMTPATH and other ATLAS environment settings, and also makes athena.py accessible.
To keep track of what the job is doing, it is a good idea to be very verbose in echoing environment variables, command outputs, etc. to standard output (STDOUT). All STDOUT and STDERR is captured by the job-manager and, once the job has completed, stored on the local machine that submitted the job.
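Putting the points above together, a job executable might start out like the following skeleton. This is a sketch under assumptions rather than a tested production wrapper: the log() helper, the setup_atlas function name and the default release number (10.0.1) are illustrative, and the staging and Athena steps are left as a placeholder comment because they depend on the job.

```shell
#!/bin/sh
# Skeleton job executable (sketch). Names introduced here (log, setup_atlas,
# ATLAS_VERSION) are assumptions for illustration only.

ATLAS_VERSION="${ATLAS_VERSION:-10.0.1}"   # assumed release; override as needed

log() {
    # Timestamped message on STDOUT, so it ends up in the saved job output.
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
}

setup_atlas() {
    # The site setup script is announced through $LCG_GC_ENV.
    if [ -z "${LCG_GC_ENV:-}" ] || [ ! -r "$LCG_GC_ENV" ]; then
        log "ERROR: LCG_GC_ENV is not set or not readable"
        return 1
    fi
    . "$LCG_GC_ENV"

    # Be verbose: record the variables the rest of the job depends on.
    log "VO_ATLAS_SW_DIR=${VO_ATLAS_SW_DIR:-}"
    log "VO_ATLAS_DEFAULT_SE=${VO_ATLAS_DEFAULT_SE:-}"
    log "X509_CERT_DIR=${X509_CERT_DIR:-}"

    # Release setup, following the two 'source' lines given in the text.
    . "$VO_ATLAS_SW_DIR/software/$ATLAS_VERSION/setup.sh"
    . "$VO_ATLAS_SW_DIR"/software/"$ATLAS_VERSION"/dist/"$ATLAS_VERSION"/AtlasRelease/*/cmt/setup.sh
}

log "job started on $(uname -n) in $(pwd)"
if setup_atlas; then
    log "ATLAS environment ready"
    # ... stage input, run athena.py, stage output (job-specific) ...
else
    log "skipping ATLAS steps: environment not available"
fi
```

Run on a machine without the grid environment, the script only reports that LCG_GC_ENV is unavailable, which makes it safe to try locally before submitting.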
To submit a job simply use
condor_submit JOB.jdl
Other useful commands:
- condor_q user - show job queue for user
- condor_q -l cluster - show detailed information for cluster
- condor_rm user - remove all of user's jobs from the queue
- condor_rm cluster - remove job cluster from the queue
Method II
You can download a tarball from here which contains a number of predefined templates and shell scripts to modify and submit jobs via CondorG. The templates included in this package are geared towards production-like jobs and rely extensively on JobTransforms. Unpack the tarball in a directory on the machine that you wish to submit jobs from. The directory structure is
- condorg/
- scripts/ - contains scripts for defining and submitting jobs
- templates/ - contains template files
Begin by setting up the package, and trying to define a job
~/some-path> mkdir -p $HOME/condorg/jobs
~/some-path> echo "export CONDORG_JOBS=\$HOME/condorg/jobs" >> $HOME/condorg/setup.sh
~/some-path> cd condorg/
~/some-path/condorg> source setup.sh
~/some-path/condorg> mk_jobwrap
Configures an ATLAS transformation job wrapper for CondorG.
Usage: mk_jobwrap -t transform executable [ -i include(s) ]
[ -d input file(s) ] [ -o output file(s) ]
[ -s script(s) ] [ -n ]
-t transform executable: the transformation to run
mk_jobwrap attaches the executable to the job wrapper, which
executes it after setting up the ATLAS environment
[ -d input file(s) ]: data to stage in from the grid
[ -o output file(s) ]: expected output files to stage out from the node
[ -i include file(s) ]: files to include with the transformation
[ -s script(s) ]: pre-job scripts to include with the wrapper
if 'bootstrap.def' is found in the path CONDORG_TEMPLATES it
is automatically added as a pre-job script. This is useful for
defining required functions 'stagein' and 'stageout'
[ -n ]: specifies that STD, ERR and LOG from job not be saved
~/some-path/condorg>
The mk_jobwrap command used above is just a shell script that creates a job executable script (wrapper) and a JDL file to submit the job with. Note: the job wrapper that mk_jobwrap creates does not run Athena. Instead, you specify a JobTransform with the [ -t ] flag. The specified JobTransform is attached to the wrapper and is executed on the worker node after the job wrapper has
- set up the ATLAS release
- run pre-job scripts
- staged the input files
See the condorg/templates/transforms for details on using the JobTransforms package. There are a couple of things to note about the job wrapper:
- the wrapper expects three aliases/functions to be defined named 'stagein', 'stageout', and 'savelog' which take arguments <source> and <destination>. Both stagein and stageout should be defined either in 'bootstrap.def' (in condorg/templates) or in some other pre-job script that is specified with the [ -s ] flag in mk_jobwrap. These functions are responsible for retrieving and storing job data on the grid.
- job output (data and logfiles) is staged as a tarball to the location specified by savelog (it's important to verify that the location is accessible - it would be a shame to run a long job and then lose the output)
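To make the wrapper's expectations concrete, here is a sketch of what a pre-job script such as bootstrap.def could define. It is an illustration only: plain cp stands in for whatever grid transfer command the site actually uses (that command is not specified here), STAGE_CMD is a hypothetical variable, and the directory check in savelog only makes sense for destinations visible as local paths.

```shell
# bootstrap.def (sketch): define the three functions the job wrapper expects.
# ASSUMPTION: 'cp' is a stand-in for the real transfer tool; set STAGE_CMD to
# the appropriate command for your storage before relying on this.

STAGE_CMD="${STAGE_CMD:-cp}"

stagein() {
    # stagein <source> <destination>: fetch input data onto the worker node.
    $STAGE_CMD "$1" "$2"
}

stageout() {
    # stageout <source> <destination>: store output data from the node.
    $STAGE_CMD "$1" "$2"
}

savelog() {
    # savelog <source> <destination>: store the tarball of logs and output.
    # Fail early with a clear message if the destination directory is missing,
    # rather than losing the output of a long job.
    if [ ! -d "$(dirname "$2")" ]; then
        echo "savelog: destination directory does not exist: $(dirname "$2")" >&2
        return 1
    fi
    $STAGE_CMD "$1" "$2"
}
```

Keeping the transfer command in a single variable means that switching to a different tool requires changing one line, while the function signatures the wrapper relies on stay the same.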
An example is the best way to illustrate some key points. Begin by setting up the package
~/some-path/condorg> source setup.sh
Now update your grid proxy certificate
~/some-path/condorg> grid-proxy-init -valid 24:00
Your identity: /C=CA/O=Grid/OU=westgrid.ca/CN=...
Enter GRID pass phrase for this identity:
Creating proxy ..................................................... Done
Your proxy is valid until: Wed
Define a job that simply runs the Hello World! test using Athena and retrieves an input file 'mypoolfile.pool.root' (before running the job make sure that mypoolfile.pool.root is accessible in PATH on gridstore.westgrid.ca, or whatever location you choose to fetch input from):
~/some-path/condorg> mk_jobwrap -t templates/transforms/AthExHelloWorld.trf -d /home/myusername/data/mypoolfile.pool.root -o /home/myusername/data/mypoolfile.root
Generating 0927134405 ... done.
Attaching file: templates/transforms/AthExHelloWorld.trf
Attaching file: templates/bootstrap.def
Transformation is: AthExHelloWorld.trf
JobID: 0927134405
Of course, make sure that /home/myusername/data exists on (in this case) gridstore.westgrid.ca. The mk_jobwrap script will prompt for a number of parameters. Just hit ↵ to accept the defaults shown in brackets:
Configure job submission...
+Gridmanager (hep.westgrid.ca:2119/jobmanager-pbs):
+Max walltime, in minutes (240): 60
+ATLAS release (10.0.1):
+JobTransforms version (10.0.1.7):
+Job description: a Hello World! test job in Athena
The job wrapper and JDL files are created and stored in condorg/jobs (note that the job id number is created from the computer date and time):
~/some-path/condorg> ls jobs/0927134405
-rwxrwxr-x 1 dschoute dschoute 19K Sep 27 13:46 0927134658
-rwxrwxr-x 1 dschoute dschoute 108 Sep 27 13:49 0927134658.cfg
-rw-rw-r-- 1 dschoute dschoute 120 Sep 27 13:46 0927134658.log
drwxrwxr-x 2 dschoute dschoute 4.0K Sep 27 13:46 err/
-rw-rw-r-- 1 dschoute dschoute 92 Sep 27 13:49 job.id
drwxrwxr-x 2 dschoute dschoute 4.0K Sep 27 13:46 out/
STDOUT output from the job is saved in out/ and STDERR output is saved in err/. The job.id file contains information about the job
~/some-path/condorg> cat jobs/0927134405/job.id
0927134658 => AthExHelloWorld.trf
Input: mypoolfile.root
Output: mypoolfile.root
Comment: a Hello World! test job in Athena
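The job IDs in this example (0927134405, 0927134658) are consistent with a month-day-hour-minute-second timestamp. Assuming that is how mk_jobwrap derives them (the script itself is not reproduced here, so this is a guess), an ID of the same shape can be generated with date:

```shell
# Assumed format: MMDDhhmmss, e.g. 0927134405 for Sep 27, 13:44:05.
jobid=$(date +%m%d%H%M%S)
echo "$jobid"
```

If the format really is second-granular, two jobs defined within the same second would get the same ID, so it is worth pausing briefly between definitions when scripting them.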
Now submit the job using submit_job
~/some-path/condorg> submit_job -j 0927134658
0927134658.jdl.1 submitted to cluster 4050.
Verify that the job is queued
~/some-path/condorg> condor_q ${USER}
-- Submitter: condorg.triumf.ca : <142.90.97.148:36536> : condorg.triumf.ca
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4050.0 9/27 13:54 0+00:00:00 I 0 0.0 0927134658 ''
1 jobs; 1 idle, 0 running, 0 held
Now wait a while for the job to complete and look at the standard output. The logfiles and job output are stored at the location specified by stageout defined above (in this case )
~/some-path/condorg> cat out/out.4050
**************************************************************************
* Welcome to UBC/TRIUMF WestGrid: glacier.westgrid.ca *
* Please report any issues or comments to: support@westgrid.ca *
* Local Contacts: roman@chem.ubc.ca, brent@guide.westgrid.ca *
**************************************************************************
Documentation: http://guide.westgrid.ca (comments & suggestions are welcome)
**************************************************************************
Of 840 compute nodes, the following are unavailable:
ice41-14 ice60_6 (testing)
**************************************************************************
CURRENT NOTICES:
**************************************************************************
**** CHECK transformation arguments ****
''
**** CHECK shell environment ****
TEC100HOME=/global/software/tecplot-10.0
MANPATH=/export/LHC/software/LCG-2_6_0/globus/man:...
HOSTNAME=ice25_3.westgrid.ubc
PVM_RSH=/usr/bin/rsh
LCG_LOCATION_VAR=/export/LHC/software/LCG-2_6_0/lcg/var
SHELL=/bin/bash
HISTSIZE=1000
GLOBUS_PATH=/export/LHC/software/LCG-2_6_0/globus
SSH_CLIENT=192.168.25.3 53583 22
GLOBUS_LOCATION=/export/LHC/software/LCG-2_6_0/globus
EDG_WL_SCRATCH=/scratch
LCG_GC_ENV=/export/LHC/software/lcg_env.sh
EDG_TMP=/tmp
GMXMAN=/global/software/gromacs-3.2/man
QTDIR=/usr/lib/qt-3.1
X509_CERT_DIR=/export/LHC/software/certificates
MPICH=/global/software/mpich-1.2.5.2/ssh
T_PACKAGE=10.0.1.7/JobTransforms
NCPUS=2
GLITE_LOCATION_LOG=/export/LHC/software/LCG-2_6_0/glite/log
USER=dschoute
JAVA_INSTALL_PATH=/usr/java/j2sdk1.4.2_04
LS_COLORS=
LD_LIBRARY_PATH=/export/LHC/software/LCG-2_6_0/lcg/lib:...
LCG_LOCATION=/export/LHC/software/LCG-2_6_0/lcg
GLITE_LOCATION_TMP=/export/LHC/software/LCG-2_6_0/glite/tmp
GMXLIB=/global/software/gromacs-3.2/share/top
EDG_WL_TMP=/var/edgwl
PVM_ROOT=/usr/share/pvm3
CLASSADJ_INSTALL_PATH=/usr
LIBPATH=/export/LHC/software/LCG-2_6_0/globus/lib:/usr/lib:/lib
USERNAME=
GMXDATA=/global/software/gromacs-3.2/share
VERBOSE_LEVEL=LOG
VO_ATLAS_DEFAULT_SE=bigmac-lcg-se.physics.utoronto.ca
EDG_WL_USER=edguser
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://hep.westgrid.ca:58030/
PGI=/global/software/pgi-6.0
NLSPATH=:/global/software/intel/fortran-8.0/lib/ifcore_msg.cat
MAIL=/var/spool/mail/dschoute
PATH=/export/LHC/software/LCG-2_6_0/lcg/bin:/export/LHC/software/LCG-2_6_0/globus/bin:...
EDG_WL_LOCATION=/export/LHC/software/LCG-2_6_0/edg
VO_DTEAM_DEFAULT_SE=bigmac-lcg-se.physics.utoronto.ca
LCG_TMP=/tmp
EDG_LOCATION=/export/LHC/software/LCG-2_6_0/edg
GL_SWAP_TYPE=NODAMAGE
LCG_JAVA_HOME=/global/software/j2sdk1.4.2_02
JOB=0927142934
INPUTRC=/etc/inputrc
PWD=/global/home/dschoute/gram_scratch_sgvUr0X5Vu
JAVA_HOME=/global/software/j2sdk1.4.2_02
LANG=en_CA
GLOBUS_REMOTE_IO_URL=/global/home/dschoute/.globus/.gass_cache...
SASL_PATH=/export/LHC/software/LCG-2_6_0/globus/lib/sasl
ABSOFT=/global/software/absoft-8.2/client
PERLLIB=/export/LHC/software/LCG-2_6_0/edg/lib/perl:/export/LHC/software/LCG-2_6_0/glite/lib/perl5
LM_LICENSE_FILE=/global/software/pgi-6.0/license.dat
TLMHOST=@zodiac.chem.ubc.ca
CREX_ROOT=/global/software/deMon.1.5/deMon
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
VO_LHCB_SW_DIR=/export/LHC/software/lhcb
EDG_WL_LOCATION_VAR=/export/LHC/software/LCG-2_6_0/edg/var
GLITE_LOCATION_VAR=/export/LHC/software/LCG-2_6_0/glite/var
TEC80HOME=/global/software/tecplot-8.0
SHLVL=3
HOME=/global/home/dschoute
GLOBUS_TCP_PORT_RANGE=20000 25000
XPVM_ROOT=/usr/share/pvm3/xpvm
X509_USER_PROXY=/global/home/dschoute/.globus/.gass_cache/local/md5/5e/...
LUMERICAL_LICENSE_DIR=/global/software/lumerical-3.1
COG_INSTALL_PATH=/usr
EDG_LOCATION_VAR=/export/LHC/software/LCG-2_6_0/edg/var
BASH_ENV=/global/home/dschoute/.bashrc
SCRATCH_DIRECTORY=/global/home/dschoute//gram_scratch_sgvUr0X5Vu
LCG_GFAL_INFOSYS=lcg-bdii.cern.ch:2170
PYTHONPATH=/export/LHC/software/LCG-2_6_0/edg/lib:/export/LHC/software/LCG-2_6_0/edg/lib/python
GMXBIN=/global/software/gromacs-3.2/intel-fftw-2.1/i686-pc-linux-gnu/bin
LOGNAME=dschoute
GMXLDLIB=/global/software/gromacs-3.2/intel-fftw-2.1/i686-pc-linux-gnu/lib
SSH_CONNECTION=192.168.25.3 53583 192.168.25.3 22
NPX_PLUGIN_PATH=/global/software/j2re1.4.2_02/plugin/i386/ns4:/usr/lib/netscape/plugins
GL_OPTIONS=DEFAULT
LESSOPEN=|/usr/bin/lesspipe.sh %s
ATLAS_RELEASE=10.0.1
SHLIB_PATH=/export/LHC/software/LCG-2_6_0/globus/lib
VO_ATLAS_SW_DIR=/export/LHC
LOG4J_INSTALL_PATH=/usr
GRACE_HOME=/global/software/grace-5.1.18
GLITE_LOCATION=/export/LHC/software/LCG-2_6_0/glite
GLOBUS_GRAM_JOB_CONTACT=https://hep.westgrid.ca:58029/11012/1127856648/
G_BROKEN_FILENAMES=1
GMXFONT=10x20
T_OUTPUTID=1
_=/bin/env
# 0927142934 # DEBUG 0927142934 landed on ice25_3 (Tue Sep 27 14:31:04 PDT 2005)
# 0927142934 # DEBUG running job from directory: /scratch/atlas_job.ZE5WPx
# 0927142934 # DEBUG sourcing pre-job script bootstrap.def
# 0927142934 # DEBUG siteroot is: /global/home/LHC/software/10.0.1
# 0927142934 # DEBUG pacman: http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-2.116.tar.gz
# 0927142934 # DEBUG transformations: https://classis01.roma1.infn.it/pacman/cache
# 0927142934 # DEBUG preparing ATHENA runtime environment
# 0927142934 # DEBUG retrieving mypoolfile.pool.root
<------ POOLFILECATALOG.XML ------->
<------ ------->
---------- /scratch/atlas_job.ZE5WPx listing:
total 140K
-rw-rw-r-- 1 dschoute dschoute 720 Sep 27 14:31 0927142934.err
drwxrwxr-x 2 dschoute dschoute 4.0K Sep 27 14:31 0927142934-IN
-rw-rw-r-- 1 dschoute dschoute 361 Sep 27 14:31 0927142934.log
drwxrwxr-x 2 dschoute dschoute 4.0K Sep 27 14:31 0927142934-OUT-1
-rwxrwxr-x 1 dschoute dschoute 752 Sep 27 14:29 AthExHelloWorld.trf
-rw-rw-r-- 1 dschoute dschoute 192 Sep 27 14:29 bootstrap.def
-rw-rw-r-- 1 dschoute dschoute 694 Sep 27 14:31 caches
drwxrwxr-x 2 dschoute dschoute 4.0K Sep 27 14:31 doc
drwxrwxr-x 3 dschoute dschoute 4.0K Sep 27 14:31 JobTransforms
drwxr-xr-x 5 dschoute dschoute 4.0K Oct 28 2003 pacman-2.116
-rw-rw-r-- 1 dschoute dschoute 75K Nov 29 2004 pacman-2.116.tar.gz
-rw-rw-r-- 1 dschoute dschoute 1.1K Sep 27 14:31 Pacman.db
-rw-rw-r-- 1 dschoute dschoute 15 Sep 27 14:31 platform
-rw-rw-r-- 1 dschoute dschoute 389 Sep 27 14:31 setup.csh
-rw-rw-r-- 1 dschoute dschoute 428 Sep 27 14:31 setup.ksh
-rw-rw-r-- 1 dschoute dschoute 428 Sep 27 14:31 setup.sh
---------- /scratch/atlas_job.ZE5WPx/0927142934-IN listing:
total 0
# 0927142934 # DEBUG running ./AthExHelloWorld.trf &> AthExHelloWorld.trf.log
# 0927142934 # DEBUG transformation returned with exit status 0
# 0927142934 # DEBUG saving /scratch/atlas_job.ZE5WPx/0927142934.err, /scratch/atlas_job.ZE5WPx/0927142934.log
# 0927142934 # DEBUG mypoolfile.root not found in /scratch/atlas_job.ZE5WPx!
# 0927142934 # DEBUG 0927142934 finished on ice25_3 (Tue Sep 27 14:31:46 PDT 2005)
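In an output file of this size, the lines of most interest are usually the wrapper's own DEBUG messages, for example the transformation's exit status and the warning that mypoolfile.root was not found. The helpers below are a small sketch for pulling those out. They assume that the "# <jobid> # DEBUG" prefix seen above is stable across jobs, and the function names job_summary and job_exit_status are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for reading a saved job STDOUT file.

job_summary() {
    # Usage: job_summary FILE
    # Keep only the wrapper's DEBUG messages and strip the common prefix,
    # dropping the login banner and the environment dump.
    grep '^# [0-9]* # DEBUG ' "$1" | sed 's/^# [0-9]* # DEBUG //'
}

job_exit_status() {
    # Usage: job_exit_status FILE
    # Print the exit status the wrapper reported for the transformation.
    # Prints nothing if the job never reached that point.
    sed -n 's/^# [0-9]* # DEBUG transformation returned with exit status \([0-9][0-9]*\).*$/\1/p' "$1"
}
```

For instance, job_summary out/out.4050 would reduce the output above to its DEBUG lines, which makes the "not found" message for the expected output file much easier to spot.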

