GridX1

From ATLAS-TRIUMF

GridX1 is a Canadian grid project. The main project page is here. This area is for the development of documentation, installation and user instructions.

Installation Instructions

User Interface

  • Download the 6.7.10 development release for your OS from Condor downloads. A simple registration is required.
  • The RPM installs into /opt/condor-6.7.10. Create a symbolic link:
ln -s /opt/condor-6.7.10 /opt/condor
  • Create the directories that store the persistent queue information. They should be kept when upgrading, which is why they live outside the versioned install path:
mkdir -p /opt/condor_var/spool /opt/condor_var/log
  • Then edit /opt/condor/etc/condor_config and set the following (replace the CONDOR_ADMIN value with the administrator's email address):
CONDOR_HOST=condorg.triumf.ca
RELEASE_DIR             = /opt/condor
LOCAL_DIR               = /opt/condor_var
CONDOR_ADMIN            = whatever
ENABLE_GRID_MONITOR = TRUE
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE=5000
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE=5
GRIDMANAGER_MAX_PENDING_REQUESTS=1000
GRIDMANAGER_GAHP_CALL_TIMEOUT = 900
  • Paste the following into /etc/init.d/condor:
#! /bin/sh
#
# condor       Start/Stop condor processes
#
# chkconfig: 345 90 10
# description: all condor related processes are bootstrapped here.

# Source function library.
. /etc/init.d/functions

export CONDOR_CONFIG=/opt/condor/etc/condor_config
export PATH=$PATH:/opt/condor/bin:/opt/condor/sbin
export CONDOR_IDS=0.0

prog=condor
lock=/var/lock/subsys/condor

RETVAL=0

pids=`/sbin/pidof condor_master`
if [ "$pids" != "" ] ; then
        running=1
else
        running=0
fi
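
start() {
        echo -n $"Starting $prog: "
        ## re-check the PIDs here: "restart" calls stop first, so the
        ## value computed when the script was loaded may be stale
        pids=`/sbin/pidof condor_master`
        if [ "$pids" != "" ] ; then
                running=1
        else
                running=0
        fi
        ## being extra careful : check both lock file and PIDs
        startit=0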

        if [ ! -f "$lock" ]; then
          if [ "$running" = "1" ] ; then
            echo $" is already running..."
          else
            startit=1
          fi
        else ## lock file exists - should it?
          if [ "$running" = "1" ] ; then
            echo $" is already running..."
          else
            startit=1
          fi
        fi
        if [ "$startit" = "1" ] ; then
          daemon /opt/condor/sbin/condor_master
          RETVAL=$?
          echo
          [ $RETVAL -eq 0 ] && touch $lock
          return $RETVAL
        fi
}


stop() {
        echo -n $"Stopping $prog: "
        stopit=0
        if [ "$running" = "1" ] ; then
          stopit=1
        else
          echo $" is not running..."
          rm -f $lock 2>/dev/null
        fi
        if [ "$stopit" = "1" ] ; then
          condor_off -master
          RETVAL=$?
          echo
          [ $RETVAL -eq 0 ] && rm -f $lock
          return $RETVAL
        fi
}

restart() {
        stop
        start
}

# See how we were called.
case "$1" in
  start)
        start
        ;;
  stop)
        stop
        ;;
  status)
        if [ "$running" = "1" ] ; then
          echo $"$prog is running ..."
          condor_status -master
        else
          echo $"$prog is not running..."
        fi
        ;;
  restart)
        restart
        ;;
  condrestart)
        if [ -f $lock ]  ; then
           stop
           start
        fi
        ;;
  *)
        echo $"Usage: $0 {start|stop|status|restart|condrestart}"
        exit 1
esac

exit $?
  • Start Condor and make it start on reboot:
/etc/init.d/condor start
chkconfig condor on
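
As a quick sanity check after editing the configuration, the following sketch reports which of the settings listed above are missing from a config file. The function name check_condor_config is hypothetical, and the check only looks for the setting names; it does not validate their values.

```shell
#!/bin/bash
# Hypothetical helper: report which of the settings from this page are
# missing from a Condor config file. Only checks that each name is assigned.
check_condor_config() {
    local config="${1:-/opt/condor/etc/condor_config}"
    local missing=0 key
    if [ ! -r "$config" ]; then
        echo "cannot read $config" >&2
        return 2
    fi
    for key in CONDOR_HOST RELEASE_DIR LOCAL_DIR CONDOR_ADMIN \
               ENABLE_GRID_MONITOR \
               GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE \
               GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE \
               GRIDMANAGER_MAX_PENDING_REQUESTS \
               GRIDMANAGER_GAHP_CALL_TIMEOUT; do
        # Match "KEY = value" or "KEY=value" at the start of a line,
        # ignoring case.
        if ! grep -Eiq "^[[:space:]]*${key}[[:space:]]*=" "$config"; then
            echo "missing: $key"
            missing=1
        fi
    done
    return $missing
}
```

Calling check_condor_config with no argument checks /opt/condor/etc/condor_config; it returns 0 only when every setting is present.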

Compute Element

Condor

  1. Download the 6.7.10 development release for your OS from Condor downloads. A simple registration is required.
  2. The RPM installs into /opt/condor-6.7.10. Create a symbolic link:

ln -s /opt/condor-6.7.10 /opt/condor

BLAHP

This is the interface between Condor and the local batch system (only PBS and LSF are supported). A local Condor batch system doesn't need an interface.

  1. Install gLite

Submitting jobs

This section deals with submitting jobs to GridX1 via CondorG. There are two ways that one can do this: (1) defining one's own JDL (job definition language) file and job executable script, and (2) using some predefined templates.

Method I

The JDL file is used by CondorG to submit the job executable. Included below is a rather basic template that can be used to submit a job to hep.westgrid.ca. To use this template, replace anything between << >>. Note that the job executable is sent over the network by CondorG to a worker node, so it is best to use a small shell script which takes care of fetching input files, executables, etc. once the job has landed on a node.

Executable = <<local path to executable>>
Dir = <<location for job output>> 
Output = $(Dir)/<<stdout output>>.$(Cluster) 
Error = $(Dir)/<<stderr output>>.$(Cluster) 
Log = $(Dir)/<<CondorG logfile for this job>> 
globusscheduler = hep.westgrid.ca:2119/jobmanager-pbs
globusrsl = (maxWalltime=<<max walltime, in minutes>>)
periodic_release = ((CurrentTime-EnteredCurrentStatus) > 10) && (HoldReason =!= "via condor_hold (by user $(USER))")
globus_resubmit = NumGlobusSubmits <= NumSystemHolds
leave_in_queue  = jobstatus == 4
Universe = Globus
Notification = Never
Copy_to_Spool = False
Environment = <<environment variables, specified as NAME=VALUE>>
Arguments = <<arguments to pass to the executable at runtime>>
Transfer_Executable = True
 
+stream_output = false
+stream_error = false
+Type = "job"
 
queue
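
Filling in the template by hand for every job gets tedious, so a small shell function can generate the JDL instead. The sketch below is only an illustration: the name write_jdl, its argument order, and the output file names (out, err, job.log) are assumptions, and it writes an abbreviated subset of the template (the hold/release and resubmit expressions, Environment and Arguments are left out).

```shell
#!/bin/bash
# Hypothetical helper: print an abbreviated JDL based on the template above.
# Usage: write_jdl EXECUTABLE OUTPUT_DIR WALLTIME_MINUTES [GLOBUSSCHEDULER]
write_jdl() {
    local exe="$1" dir="$2" walltime="$3"
    local scheduler="${4:-hep.westgrid.ca:2119/jobmanager-pbs}"
    # \$(...) is escaped so the shell leaves the Condor macros untouched.
    cat <<EOF
Executable = ${exe}
Dir = ${dir}
Output = \$(Dir)/out.\$(Cluster)
Error = \$(Dir)/err.\$(Cluster)
Log = \$(Dir)/job.log
globusscheduler = ${scheduler}
globusrsl = (maxWalltime=${walltime})
Universe = Globus
Notification = Never
Copy_to_Spool = False
Transfer_Executable = True

queue
EOF
}
```

For example, `write_jdl ./myjob.sh ./jobs 60 > JOB.jdl` would write a JDL for a 60 minute job on the default job-manager.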

To submit a job to different job-managers on GridX1, find the appropriate URL in the GridMonitor/ClassAd section on the GridX1 website, and use this value for the JDL variable 'globusscheduler' in the above template.

Because jobs vary so widely, there is no single recipe for what a job executable should do. For ATLAS jobs, however, it should include a number of things:

  • setting up the ATLAS environment
  • staging input datasets
  • running Athena
  • staging the output data
  • reporting detailed log information in a meaningful way

When a job lands on an ATLAS node, an environment variable $LCG_GC_ENV is available, which points to a setup script. Sourcing this setup script defines, among others, the following self-explanatory environment variables:

  • $VO_ATLAS_SW_DIR
  • $VO_ATLAS_DEFAULT_SE
  • $X509_CERT_DIR

ATLAS releases are stored in $VO_ATLAS_SW_DIR/software/<<VERSION>>. To set up the entire release, add the lines

source $VO_ATLAS_SW_DIR/software/<<VERSION>>/setup.sh
source $VO_ATLAS_SW_DIR/software/<<VERSION>>/dist/<<VERSION>>/AtlasRelease/*/cmt/setup.sh

to the job executable script. This sets the $LD_LIBRARY_PATH, $PATH, $CMTPATH and other ATLAS environment settings, and also makes athena.py accessible.
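
In a job executable it can help to wrap these steps in a function that fails loudly when a setup script is missing. The following is a minimal sketch, assuming the directory layout described above; the function name setup_atlas_release is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper for a job executable: source the site setup script
# ($LCG_GC_ENV) and then the two release setup scripts described above.
setup_atlas_release() {
    local _version="$1" _base _script
    # Clear the function's positional parameters so the sourced scripts
    # do not see the version string as their own argument.
    set --
    if [ -z "${LCG_GC_ENV:-}" ] || [ ! -r "$LCG_GC_ENV" ]; then
        echo "LCG_GC_ENV is not set or not readable" >&2
        return 1
    fi
    . "$LCG_GC_ENV"
    _base="$VO_ATLAS_SW_DIR/software/$_version"
    # The second path contains a wildcard, so let the shell expand it.
    for _script in "$_base"/setup.sh \
                   "$_base"/dist/"$_version"/AtlasRelease/*/cmt/setup.sh; do
        if [ ! -r "$_script" ]; then
            echo "missing setup script: $_script" >&2
            return 1
        fi
        echo "sourcing $_script"
        . "$_script"
    done
}
```

A job script would then call, for example, `setup_atlas_release 10.0.1 || exit 1` before running Athena.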

In order to keep track of what the job is doing, it is a good idea to be very verbose in echoing environment variables, command outputs, etc. to standard out (STDOUT). All the STDOUT and STDERR is trapped by the job-manager and stored - when the job is completed - on the local machine that submitted the job.
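
One way to keep the output verbose but readable is to route messages through a couple of small helpers. This is a sketch only: the helper names are hypothetical, JOB is assumed to hold the job ID, and the line format simply mimics the DEBUG lines in the sample job output on this page.

```shell
#!/bin/bash
# Hypothetical logging helpers for a job executable.
# JOB is assumed to hold the job ID; "unknown" is used when it is unset.
log_debug() {
    echo "# ${JOB:-unknown} # DEBUG $*"
}

# Run a command, echoing the command line and its exit status to STDOUT.
run_logged() {
    log_debug "running $*"
    "$@"
    local status=$?
    log_debug "command returned with exit status $status"
    return $status
}

log_debug "landed on $(uname -n) ($(date))"
# Dump the environment so it ends up in the job's STDOUT.
run_logged env
```

Because run_logged returns the command's own exit status, it can be used in conditions such as `run_logged ./mystep.sh || exit 1`.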

To submit a job simply use

condor_submit JOB.jdl

Other useful commands:

  1. condor_q user - show job queue for user
  2. condor_q -l cluster - show detailed information for cluster
  3. condor_rm user - remove all of the user's jobs from the queue
  4. condor_rm cluster - remove job cluster from the queue
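
Since the template sets leave_in_queue = jobstatus == 4, completed jobs stay in the queue with JobStatus 4, so a script can poll for completion. The sketch below is an illustration under that assumption: cluster_done and the CONDOR_Q override are hypothetical names, and the function relies on condor_q's -format option to print the JobStatus attribute.

```shell
#!/bin/bash
# Hypothetical helper: succeed once every job in a cluster is Completed
# (JobStatus 4). CONDOR_Q may name a stand-in command for testing.
cluster_done() {
    local cluster="$1" statuses s
    statuses=$("${CONDOR_Q:-condor_q}" -format '%d\n' JobStatus "$cluster") || return 2
    # Nothing listed: the cluster is unknown or no longer in the queue,
    # so it cannot be confirmed as completed.
    [ -n "$statuses" ] || return 1
    for s in $statuses; do
        [ "$s" -eq 4 ] || return 1
    done
    return 0
}
```

A submit script could then wait with `until cluster_done 4050; do sleep 60; done`. Note that this loop never ends if a job stays held, so a real script would probably also want a timeout.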

Method II

You can download a tarball from here which contains a number of predefined templates and shell scripts to modify and submit jobs via CondorG. The templates included in this package are geared towards production-like jobs and rely extensively on JobTransforms. Unpack the tarball in a directory on the machine that you wish to submit jobs from. The directory structure is

  • condorg/
    • scripts/ - contains scripts for defining and submitting jobs
    • templates/ - contains template files

Begin by setting up the package, and trying to define a job

~/some-path> mkdir -p $HOME/condorg/jobs
~/some-path> echo "export CONDORG_JOBS=\$HOME/condorg/jobs" >> $HOME/condorg/setup.sh
~/some-path> cd condorg/
~/some-path/condorg> source setup.sh
~/some-path/condorg> mk_jobwrap
Configures an ATLAS transformation job wrapper for CondorG.
Usage: mk_jobwrap  -t transform executable  [ -i include(s) ]
                  [ -d input file(s) ]  [ -o output file(s) ]
                  [ -s script(s) ]  [ -n ]
      -t transform executable: the transformation to run
            mk_jobwrap attaches the executable to the job wrapper, which
            executes it after setting up the ATLAS environment
 
      [ -d input file(s) ]: data to stage in from the grid
 
      [ -o output file(s) ]: expected output files to stage out from the node
 
      [ -i include file(s) ]: files to include with the transformation
 
      [ -s script(s) ]: pre-job scripts to include with the wrapper
            if 'bootstrap.def' is found in the path CONDORG_TEMPLATES it
            is automatically added as a pre-job script. This is useful for
            defining required functions 'stagein' and 'stageout'
 
      [ -n ]: specifies that STD, ERR and LOG from job not be saved

~/some-path/condorg>

The mk_jobwrap command used above is just a shell script that creates a job executable script (wrapper) and a JDL file to submit the job with. Note: the job wrapper that mk_jobwrap creates does not run Athena. Instead, you specify a JobTransform with the [ -t ] flag. The specified JobTransform is attached to the wrapper and is executed on the worker node after the job wrapper has

  • set up the ATLAS release
  • run pre-job scripts
  • staged the input files

See condorg/templates/transforms for details on using the JobTransforms package. There are a couple of things to note about the job wrapper:

  1. the wrapper expects three aliases or functions named 'stagein', 'stageout', and 'savelog' to be defined, each taking the arguments <source> and <destination>. Both stagein and stageout should be defined either in 'bootstrap.def' (in condorg/templates) or in some other pre-job script that is specified with the [ -s ] flag in mk_jobwrap. These functions are responsible for retrieving and storing job data on the grid.
  2. job output (data and logfiles) is staged as a tarball to the location specified by savelog (it's important to verify that the location is accessible - it would be a shame to run a long job and then lose the output)
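
To make the expected interface concrete, here is a minimal sketch of what a 'bootstrap.def' could look like. It is not the file shipped in the tarball: the _transfer helper and the COPY_CMD variable are assumptions of this sketch, and plain cp only works for paths visible from the worker node, so a real site would substitute its own grid transfer command.

```shell
#!/bin/bash
# Hypothetical bootstrap.def: defines the three functions the wrapper
# expects, each taking <source> and <destination>.
# COPY_CMD is an assumption of this sketch; replace cp with the transfer
# command appropriate for the storage in use.
COPY_CMD="${COPY_CMD:-cp}"

_transfer() {
    local src="$1" dst="$2"
    if [ -z "$src" ] || [ -z "$dst" ]; then
        echo "usage: <source> <destination>" >&2
        return 2
    fi
    echo "copying $src -> $dst"
    $COPY_CMD "$src" "$dst"
}

stagein()  { _transfer "$1" "$2"; }   # fetch input data to the node
stageout() { _transfer "$1" "$2"; }   # store output data
savelog()  { _transfer "$1" "$2"; }   # store the tarball of logs and output
```

Each function returns the exit status of the copy, so the wrapper can detect a failed transfer.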

An example is the best way to illustrate some key points. Begin by setting up the package

~/some-path/condorg> source setup.sh

Now update your grid proxy certificate

~/some-path/condorg> grid-proxy-init -valid 24:00
Your identity: /C=CA/O=Grid/OU=westgrid.ca/CN=...
Enter GRID pass phrase for this identity:
Creating proxy ..................................................... Done
Your proxy is valid until: Wed

Define a job that simply runs the Hello World! test using Athena and retrieves an input file 'mypoolfile.pool.root' (before running the job make sure that mypoolfile.pool.root is accessible in PATH on gridstore.westgrid.ca, or whatever location you choose to fetch input from):

~/some-path/condorg> mk_jobwrap -t templates/transforms/AthExHelloWorld.trf -d /home/myusername/data/mypoolfile.pool.root -o /home/myusername/data/mypoolfile.root
Generating 0927134405 ... done.
Attaching file: templates/transforms/AthExHelloWorld.trf
Attaching file: templates/bootstrap.def
Transformation is: AthExHelloWorld.trf
JobID: 0927134405

Of course, make sure that /home/myusername/data exists on (in this case) gridstore.westgrid.ca. The mk_jobwrap script will query for a number of parameters. Press ↵ to accept the defaults shown in parentheses:

Configure job submission...
  +Gridmanager (hep.westgrid.ca:2119/jobmanager-pbs): 
  +Max walltime, in minutes (240): 60
  +ATLAS release (10.0.1): 
  +JobTransforms version (10.0.1.7):
  +Job description: a Hello World! test job in Athena

The job wrapper and JDL files are created and stored in condorg/jobs (note that the job id number is created from the computer date and time):

~/some-path/condorg> ls jobs/0927134405
-rwxrwxr-x    1 dschoute dschoute      19K Sep 27 13:46 0927134658
-rwxrwxr-x    1 dschoute dschoute      108 Sep 27 13:49 0927134658.cfg
-rw-rw-r--    1 dschoute dschoute      120 Sep 27 13:46 0927134658.log
drwxrwxr-x    2 dschoute dschoute     4.0K Sep 27 13:46 err/
-rw-rw-r--    1 dschoute dschoute       92 Sep 27 13:49 job.id
drwxrwxr-x    2 dschoute dschoute     4.0K Sep 27 13:46 out/
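
The job IDs shown here look like the local date and time formatted as MMDDhhmmss (0927134405 would be Sep 27, 13:44:05). Assuming that is how mk_jobwrap builds them, an ID of the same shape can be generated as follows; make_job_id is a hypothetical name.

```shell
#!/bin/bash
# Assumption: the job ID is the local date and time as MMDDhhmmss,
# which matches IDs such as 0927134405.
make_job_id() {
    date +%m%d%H%M%S
}

job_id=$(make_job_id)
echo "new job ID: $job_id"
```

IDs built this way are only unique to the second, so two jobs defined within the same second would collide.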

STDOUT output from the job is saved in out/ and STDERR output is saved in err/. The job.id file contains information about the job

~/some-path/condorg> cat jobs/0927134405/job.id
0927134658 => AthExHelloWorld.trf
Input: mypoolfile.root
Output: mypoolfile.root
Comment: a Hello World! test job in Athena

Now submit the job using submit_job

~/some-path/condorg> submit_job -j 0927134658
0927134658.jdl.1 submitted to cluster 4050.

Verify that the job is queued

~/some-path/condorg> condor_q ${USER}
-- Submitter: condorg.triumf.ca : <142.90.97.148:36536> : condorg.triumf.ca
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
4050.0                    9/27 13:54   0+00:00:00 I  0   0.0  0927134658 ''
 
1 jobs; 1 idle, 0 running, 0 held

Now wait a while for the job to complete and look at the standard output. The logfiles and job output are stored at the location specified by stageout defined above (in this case )

 
~/some-path/condorg> cat out/out.4050
 **************************************************************************
 *          Welcome to UBC/TRIUMF WestGrid: glacier.westgrid.ca           *
 *      Please report any issues or comments to: support@westgrid.ca      *
 *       Local Contacts: roman@chem.ubc.ca, brent@guide.westgrid.ca       *
 **************************************************************************

 Documentation: http://guide.westgrid.ca (comments & suggestions are welcome)

 **************************************************************************
 Of 840 compute nodes, the following are unavailable:
 ice41-14
 ice60_6 (testing)
 **************************************************************************
 CURRENT NOTICES: 
 **************************************************************************
**** CHECK transformation arguments ****
''
**** CHECK shell environment ****
TEC100HOME=/global/software/tecplot-10.0
MANPATH=/export/LHC/software/LCG-2_6_0/globus/man:...
HOSTNAME=ice25_3.westgrid.ubc
PVM_RSH=/usr/bin/rsh
LCG_LOCATION_VAR=/export/LHC/software/LCG-2_6_0/lcg/var
SHELL=/bin/bash
HISTSIZE=1000
GLOBUS_PATH=/export/LHC/software/LCG-2_6_0/globus
SSH_CLIENT=192.168.25.3 53583 22
GLOBUS_LOCATION=/export/LHC/software/LCG-2_6_0/globus
EDG_WL_SCRATCH=/scratch
LCG_GC_ENV=/export/LHC/software/lcg_env.sh
EDG_TMP=/tmp
GMXMAN=/global/software/gromacs-3.2/man
QTDIR=/usr/lib/qt-3.1
X509_CERT_DIR=/export/LHC/software/certificates
MPICH=/global/software/mpich-1.2.5.2/ssh
T_PACKAGE=10.0.1.7/JobTransforms
NCPUS=2
GLITE_LOCATION_LOG=/export/LHC/software/LCG-2_6_0/glite/log
USER=dschoute
JAVA_INSTALL_PATH=/usr/java/j2sdk1.4.2_04
LS_COLORS=
LD_LIBRARY_PATH=/export/LHC/software/LCG-2_6_0/lcg/lib:...
LCG_LOCATION=/export/LHC/software/LCG-2_6_0/lcg
GLITE_LOCATION_TMP=/export/LHC/software/LCG-2_6_0/glite/tmp
GMXLIB=/global/software/gromacs-3.2/share/top
EDG_WL_TMP=/var/edgwl
PVM_ROOT=/usr/share/pvm3
CLASSADJ_INSTALL_PATH=/usr
LIBPATH=/export/LHC/software/LCG-2_6_0/globus/lib:/usr/lib:/lib
USERNAME=
GMXDATA=/global/software/gromacs-3.2/share
VERBOSE_LEVEL=LOG
VO_ATLAS_DEFAULT_SE=bigmac-lcg-se.physics.utoronto.ca
EDG_WL_USER=edguser
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://hep.westgrid.ca:58030/
PGI=/global/software/pgi-6.0
NLSPATH=:/global/software/intel/fortran-8.0/lib/ifcore_msg.cat
MAIL=/var/spool/mail/dschoute
PATH=/export/LHC/software/LCG-2_6_0/lcg/bin:/export/LHC/software/LCG-2_6_0/globus/bin:...
EDG_WL_LOCATION=/export/LHC/software/LCG-2_6_0/edg
VO_DTEAM_DEFAULT_SE=bigmac-lcg-se.physics.utoronto.ca
LCG_TMP=/tmp
EDG_LOCATION=/export/LHC/software/LCG-2_6_0/edg
GL_SWAP_TYPE=NODAMAGE
LCG_JAVA_HOME=/global/software/j2sdk1.4.2_02
JOB=0927142934
INPUTRC=/etc/inputrc
PWD=/global/home/dschoute/gram_scratch_sgvUr0X5Vu
JAVA_HOME=/global/software/j2sdk1.4.2_02
LANG=en_CA
GLOBUS_REMOTE_IO_URL=/global/home/dschoute/.globus/.gass_cache...
SASL_PATH=/export/LHC/software/LCG-2_6_0/globus/lib/sasl
ABSOFT=/global/software/absoft-8.2/client
PERLLIB=/export/LHC/software/LCG-2_6_0/edg/lib/perl:/export/LHC/software/LCG-2_6_0/glite/lib/perl5
LM_LICENSE_FILE=/global/software/pgi-6.0/license.dat
TLMHOST=@zodiac.chem.ubc.ca
CREX_ROOT=/global/software/deMon.1.5/deMon
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
VO_LHCB_SW_DIR=/export/LHC/software/lhcb
EDG_WL_LOCATION_VAR=/export/LHC/software/LCG-2_6_0/edg/var
GLITE_LOCATION_VAR=/export/LHC/software/LCG-2_6_0/glite/var
TEC80HOME=/global/software/tecplot-8.0
SHLVL=3
HOME=/global/home/dschoute
GLOBUS_TCP_PORT_RANGE=20000 25000
XPVM_ROOT=/usr/share/pvm3/xpvm
X509_USER_PROXY=/global/home/dschoute/.globus/.gass_cache/local/md5/5e/...
LUMERICAL_LICENSE_DIR=/global/software/lumerical-3.1
COG_INSTALL_PATH=/usr
EDG_LOCATION_VAR=/export/LHC/software/LCG-2_6_0/edg/var
BASH_ENV=/global/home/dschoute/.bashrc
SCRATCH_DIRECTORY=/global/home/dschoute//gram_scratch_sgvUr0X5Vu
LCG_GFAL_INFOSYS=lcg-bdii.cern.ch:2170
PYTHONPATH=/export/LHC/software/LCG-2_6_0/edg/lib:/export/LHC/software/LCG-2_6_0/edg/lib/python
GMXBIN=/global/software/gromacs-3.2/intel-fftw-2.1/i686-pc-linux-gnu/bin
LOGNAME=dschoute
GMXLDLIB=/global/software/gromacs-3.2/intel-fftw-2.1/i686-pc-linux-gnu/lib
SSH_CONNECTION=192.168.25.3 53583 192.168.25.3 22
NPX_PLUGIN_PATH=/global/software/j2re1.4.2_02/plugin/i386/ns4:/usr/lib/netscape/plugins
GL_OPTIONS=DEFAULT
LESSOPEN=|/usr/bin/lesspipe.sh %s
ATLAS_RELEASE=10.0.1
SHLIB_PATH=/export/LHC/software/LCG-2_6_0/globus/lib
VO_ATLAS_SW_DIR=/export/LHC
LOG4J_INSTALL_PATH=/usr
GRACE_HOME=/global/software/grace-5.1.18
GLITE_LOCATION=/export/LHC/software/LCG-2_6_0/glite
GLOBUS_GRAM_JOB_CONTACT=https://hep.westgrid.ca:58029/11012/1127856648/
G_BROKEN_FILENAMES=1
GMXFONT=10x20
T_OUTPUTID=1
_=/bin/env

# 0927142934 # DEBUG 0927142934 landed on ice25_3 (Tue Sep 27 14:31:04 PDT 2005)
# 0927142934 # DEBUG running job from directory: /scratch/atlas_job.ZE5WPx
# 0927142934 # DEBUG sourcing pre-job script bootstrap.def
# 0927142934 # DEBUG siteroot is: /global/home/LHC/software/10.0.1
# 0927142934 # DEBUG pacman: http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-2.116.tar.gz
# 0927142934 # DEBUG transformations: https://classis01.roma1.infn.it/pacman/cache
# 0927142934 # DEBUG preparing ATHENA runtime environment
# 0927142934 # DEBUG retrieving mypoolfile.pool.root

<------ POOLFILECATALOG.XML ------->
<------                     ------->

----------
/scratch/atlas_job.ZE5WPx listing:
total 140K
-rw-rw-r--    1 dschoute dschoute      720 Sep 27 14:31 0927142934.err
drwxrwxr-x    2 dschoute dschoute     4.0K Sep 27 14:31 0927142934-IN
-rw-rw-r--    1 dschoute dschoute      361 Sep 27 14:31 0927142934.log
drwxrwxr-x    2 dschoute dschoute     4.0K Sep 27 14:31 0927142934-OUT-1
-rwxrwxr-x    1 dschoute dschoute      752 Sep 27 14:29 AthExHelloWorld.trf
-rw-rw-r--    1 dschoute dschoute      192 Sep 27 14:29 bootstrap.def
-rw-rw-r--    1 dschoute dschoute      694 Sep 27 14:31 caches
drwxrwxr-x    2 dschoute dschoute     4.0K Sep 27 14:31 doc
drwxrwxr-x    3 dschoute dschoute     4.0K Sep 27 14:31 JobTransforms
drwxr-xr-x    5 dschoute dschoute     4.0K Oct 28  2003 pacman-2.116
-rw-rw-r--    1 dschoute dschoute      75K Nov 29  2004 pacman-2.116.tar.gz
-rw-rw-r--    1 dschoute dschoute     1.1K Sep 27 14:31 Pacman.db
-rw-rw-r--    1 dschoute dschoute       15 Sep 27 14:31 platform
-rw-rw-r--    1 dschoute dschoute      389 Sep 27 14:31 setup.csh
-rw-rw-r--    1 dschoute dschoute      428 Sep 27 14:31 setup.ksh
-rw-rw-r--    1 dschoute dschoute      428 Sep 27 14:31 setup.sh

----------
/scratch/atlas_job.ZE5WPx/0927142934-IN listing:
total 0

# 0927142934 # DEBUG running ./AthExHelloWorld.trf  &> AthExHelloWorld.trf.log
# 0927142934 # DEBUG transformation returned with exit status 0
# 0927142934 # DEBUG saving /scratch/atlas_job.ZE5WPx/0927142934.err, /scratch/atlas_job.ZE5WPx/0927142934.log
# 0927142934 # DEBUG mypoolfile.root not found in /scratch/atlas_job.ZE5WPx!
# 0927142934 # DEBUG 0927142934 finished on ice25_3 (Tue Sep 27 14:31:46 PDT 2005)
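
For long outputs like the one above, it can be convenient to pull out just the transformation's exit status and any "not found" messages. The following is a small sketch based only on the DEBUG line formats visible in this sample; the function name summarise_job_log is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: summarise a job's STDOUT file using the DEBUG lines
# shown in the sample output above.
summarise_job_log() {
    local logfile="$1" status
    # Extract N from "... DEBUG transformation returned with exit status N".
    status=$(sed -n 's/.* DEBUG transformation returned with exit status \([0-9][0-9]*\).*/\1/p' "$logfile" | tail -n 1)
    echo "transformation exit status: ${status:-unknown}"
    # Expected output files that were not produced are logged as "not found".
    grep ' DEBUG .* not found' "$logfile" | sed 's/^.* DEBUG /missing output: /'
    # Succeed only if the transformation reported exit status 0.
    [ "$status" = "0" ]
}
```

In the sample above the transformation returns exit status 0 even though mypoolfile.root was not found, so it is worth checking both pieces of information, for example with `summarise_job_log out/out.4050`.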