Submit jobs

From ATLAS-TRIUMF


I still need to understand what quotas, priorities, etc. apply to private analysis or production jobs run on the Worldwide LCG grid, on the whole of the Canadian GridX1, or on WestGrid alone. By default, the LCG method of job submission will run jobs anywhere in the world that members of the atlas VO are allowed to run jobs. The GridX1/Condor method will run them in Canada, and it isn't too difficult to force them to run on WestGrid.


Submitting Jobs to the LCG Grid

You need access to a user interface (UI) machine. Even if you plan to set up your own machine as a UI, it is useful to have access to another UI that works. For example, when my certificates get inexplicably messed up and I can't submit jobs, the easiest (perhaps not the best) way to fix the problem is to blow away the whole /etc/grid-security/certificates directory and copy the one from lxplus.
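The certificate fix above can be sketched roughly as follows. This is only a sketch under my own assumptions: the host name lxplus.cern.ch, the account name yourusername, and the idea that the directory lives at the same path on lxplus are placeholders to check. It also moves the old directory aside instead of deleting it, which is a safer variant of what I actually did, and it only prints the commands unless DRY_RUN=0 is set (the real run needs root).

```shell
#!/bin/sh
# Sketch: replace the local CA certificates directory with a copy from lxplus.
# REMOTE and DRY_RUN are hypothetical knobs introduced for this example.
CERT_DIR=/etc/grid-security/certificates
REMOTE="${REMOTE:-yourusername@lxplus.cern.ch}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

refresh_certificates() {
    # Keep the old directory as a timestamped backup rather than deleting it.
    backup="$CERT_DIR.bak.$(date +%Y%m%d%H%M%S)"
    run mv "$CERT_DIR" "$backup"
    # Copy the whole directory from the working UI.
    run scp -r "$REMOTE:$CERT_DIR" "$CERT_DIR"
}

refresh_certificates
```

Once the new directory works, the backup can be removed by hand.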

Alternatively (and better, if you will submit many jobs), you can set up your own machine as a UI as explained at How to setup an LCG UI on your desktop (this one includes cron jobs for certificates etc.) or in the CERN version.

Log on to your UI. You will be in your home directory.

Create a job file helloworld.jdl containing:

#############Hello World#################
Executable = "/bin/echo";
Arguments = "Hello welcome to the Grid ";
StdOutput = "hello.out";
StdError = "hello.err";
OutputSandbox = {"hello.out","hello.err"};
#########################################

Get a grid proxy: run grid-proxy-init and enter your grid passphrase when prompted.

Submit the job to the grid: edg-job-submit --vo atlas -o jobIDfile helloworld.jdl

Watch its progress: edg-job-status -i jobIDfile (you may have to select a job from a list if you have run several jobs with the same jobIDfile...)

When it has finished, retrieve the output. This will go to a default directory, defined in /opt/edg/etc/edg_wl_ui_cmd_var.conf; when I did this test, the default directory was /tmp/jobOutput, and it didn't exist yet, so I had to create it to get the next bit to work. The retrieve command is edg-job-get-output -i jobIDfile. Then you can look at the output file in a subdirectory of /tmp/jobOutput, and it should say Hello welcome to the Grid.
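Put together, the steps above might look like the script below. It is a sketch rather than a tested recipe: it writes the same helloworld.jdl, and only talks to the grid if edg-job-submit is actually installed on the machine. The /tmp/jobOutput directory is just the default on the UI I used, so check edg_wl_ui_cmd_var.conf on yours.

```shell
#!/bin/sh
# Write the Hello World job description from this page.
cat > helloworld.jdl <<'EOF'
Executable = "/bin/echo";
Arguments = "Hello welcome to the Grid ";
StdOutput = "hello.out";
StdError = "hello.err";
OutputSandbox = {"hello.out","hello.err"};
EOF

# Only contact the grid if the EDG client tools are available here.
if command -v edg-job-submit >/dev/null 2>&1; then
    # Default output directory on my UI; it did not exist until I created it.
    mkdir -p /tmp/jobOutput
    # Prompts for the grid passphrase; submit only if the proxy was created.
    grid-proxy-init &&
        edg-job-submit --vo atlas -o jobIDfile helloworld.jdl &&
        edg-job-status -i jobIDfile
    # Once the status shows that the job has finished, fetch the output:
    #   edg-job-get-output -i jobIDfile
else
    echo "edg-job-submit not found; wrote helloworld.jdl only"
fi
```

The retrieve step is left as a comment because the job usually needs some time before there is anything to fetch.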

Now you can try again with something more useful...

I finally got the RecExCommon default job options to run as an LCG grid job. See: Running RecExCommon on the Grid.

Here is a more complicated example where I run a TopView analysis job on the grid. It involves checked-out packages, user-defined job options and retrieving ntuple files.

The Workbook has examples for AthenaHelloWorld.

There is also a nice example of submitting an analysis Job.

User guides for the LCG are available here.


Submit jobs to GridX1 with Condor

This section is still a bit confused because I had some problems (which should be solved now) and tried to get around them by following two different sets of instructions. It is probably easier to follow the instructions for setting up an LCG UI, as they will set up condor as a by-product. But there is a lot of very useful information in the GridX1/condor instructions.

Download Condor (see the GridX1 detailed instructions); you also get it when you do the UI setup above (How to setup an LCG UI on your desktop).

You should probably not do both, so skip the rpm installation if you did the full UI setup. Note, though, that the LCG UI installation may give you a much older Condor version than a direct download from Condor does.

rpm -i condor-6.7.18-linux-x86-glibc23-dynamic-1.i386.rpm (or whatever the current version is)

export CONDOR_CONFIG=/opt/condor-6.7.18/etc/condor_config (or =/home/yourusername/LCG/condor/etc/condor_config for LCG UI install)

ln -s /opt/condor-6.7.18 /opt/condor (If you did the LCG UI installation, you could do ln -s /home/yourusername/LCG/condor /opt/condor)

mkdir -p /opt/condor_var/spool /opt/condor_var/log

vi /opt/condor/etc/condor_config and edit it to make sure it contains:

CONDOR_HOST=condorg.triumf.ca
RELEASE_DIR             = /opt/condor
LOCAL_DIR               = /opt/condor_var
CONDOR_ADMIN            = your.email@triumf.ca 
UID_DOMAIN              = triumf.ca
FILESYSTEM_DOMAIN       = $(FULL_HOSTNAME)
ENABLE_GRID_MONITOR = TRUE
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE=5000
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE=5
GRIDMANAGER_MAX_PENDING_REQUESTS=1000
GRIDMANAGER_GAHP_CALL_TIMEOUT = 900

Flip through section 2 of the config file and edit things if it seems necessary.
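After editing, a quick check that none of the settings listed above got lost can be useful. The function below is a rough sketch of my own (check_condor_config is not a Condor tool): it only checks that each key is assigned somewhere in the file, not that the value is right, and it does not know about later lines overriding earlier ones.

```shell
#!/bin/sh
# check_condor_config FILE
# Report which of the settings listed on this page are assigned in FILE.
# Returns 0 if all are present, 1 otherwise. (Hypothetical helper.)
check_condor_config() {
    config_file="$1"
    missing=0
    for key in CONDOR_HOST RELEASE_DIR LOCAL_DIR CONDOR_ADMIN UID_DOMAIN \
               FILESYSTEM_DOMAIN ENABLE_GRID_MONITOR \
               GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE \
               GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE \
               GRIDMANAGER_MAX_PENDING_REQUESTS \
               GRIDMANAGER_GAHP_CALL_TIMEOUT; do
        # Match "KEY = value" or "KEY=value"; commented-out lines do not count.
        if grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$config_file"; then
            echo "ok      $key"
        else
            echo "MISSING $key"
            missing=1
        fi
    done
    return "$missing"
}

# Check the file that CONDOR_CONFIG points at, if it is readable.
cfg="${CONDOR_CONFIG:-/opt/condor/etc/condor_config}"
if [ -r "$cfg" ]; then
    check_condor_config "$cfg"
fi
```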

vi /etc/init.d/condor and paste in the contents shown on the GridX1 page.

chmod 755 /etc/init.d/condor

/etc/init.d/condor start

chkconfig condor on

Now make a test job, test.jdl:

#############Hello World#################
Executable = /bin/echo
Dir = /home/itrigger/condorg/jobs
Output = $(Dir)/hello.out.$(Cluster)
Error = $(Dir)/hello.err.$(Cluster)
Log = $(Dir)/hello.log
globusscheduler = hep.westgrid.ca:2119/jobmanager-pbs
globusrsl = (maxWalltime=5)
periodic_release = ((CurrentTime-EnteredCurrentStatus) > 10) && (HoldReason =!= "via condor_hold (by user $(USER))")
globus_resubmit = NumGlobusSubmits <= NumSystemHolds
leave_in_queue  = jobstatus == 4
Universe = Globus
Notification = Never
Copy_to_Spool = False
#Environment = <<environment variables, specified as NAME=VALUE>>
Arguments = Hello welcome to the Grid
#Transfer_Executable = True
Transfer_Executable = False
 
+stream_output = false
+stream_error = false
+Type = "job"
  
queue 
#########################################
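The Dir line in the job above points at my own directory, which will not exist on your machine. One way around that is to generate test.jdl with a directory of your choice, as in the sketch below. JOB_DIR and its default value are placeholders I made up for the example; everything else is copied from the job above with the commented-out lines dropped.

```shell
#!/bin/sh
# Sketch: generate test.jdl with a job directory that exists on this machine.
# JOB_DIR is a hypothetical variable; set it to wherever you want the output.
JOB_DIR="${JOB_DIR:-$PWD/condorg-jobs}"
mkdir -p "$JOB_DIR"

{
    echo "Executable = /bin/echo"
    echo "Dir = $JOB_DIR"
    # The quoted 'EOF' stops the shell from expanding Condor macros like $(Dir).
    cat <<'EOF'
Output = $(Dir)/hello.out.$(Cluster)
Error = $(Dir)/hello.err.$(Cluster)
Log = $(Dir)/hello.log
globusscheduler = hep.westgrid.ca:2119/jobmanager-pbs
globusrsl = (maxWalltime=5)
periodic_release = ((CurrentTime-EnteredCurrentStatus) > 10) && (HoldReason =!= "via condor_hold (by user $(USER))")
globus_resubmit = NumGlobusSubmits <= NumSystemHolds
leave_in_queue  = jobstatus == 4
Universe = Globus
Notification = Never
Copy_to_Spool = False
Arguments = Hello welcome to the Grid
Transfer_Executable = False
+stream_output = false
+stream_error = false
+Type = "job"
queue
EOF
} > test.jdl

echo "wrote test.jdl with Dir = $JOB_DIR"
```

Run it as, for example, JOB_DIR=$HOME/condorg/jobs sh make_test_jdl.sh (the script name is whatever you saved it as).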

Submit it with:

~/LCG/condor/bin/condor_submit test.jdl (or wherever condor_submit ended up in your path)

Check its status with:

~/LCG/condor/bin/condor_q yourusername

or

~/LCG/condor/bin/condor_q -l yourusername

If it keeps hanging, something is probably wrong. Check your certificates with:

globusrun -a -r hep.westgrid.ca  (globusrun should be in your path if you did the UI installation)

Otherwise, just wait until it finishes and look at the file hello.out.(some number) which should contain the line "Hello welcome to the Grid".
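If you run the test more than once you end up with several hello.out.(number) files, so a small check like the following can save some looking around. It is a sketch with a helper name of my own (check_hello_output); pass it the directory you used as Dir in test.jdl.

```shell
#!/bin/sh
# check_hello_output DIR
# Look at every hello.out.* file in DIR and report whether it contains the
# expected greeting. Returns 0 if at least one file does. (Hypothetical helper.)
check_hello_output() {
    job_dir="$1"
    status=1
    for f in "$job_dir"/hello.out.*; do
        # If the glob matched nothing, $f is the literal pattern; skip it.
        [ -e "$f" ] || continue
        if grep -q "Hello welcome to the Grid" "$f"; then
            echo "OK: $f"
            status=0
        else
            echo "unexpected contents: $f"
        fi
    done
    return "$status"
}

# Example call, using the Dir value from the test job on this page:
# check_hello_output /home/itrigger/condorg/jobs
```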

Now try something more useful.

It is probably helpful to add the following to .zshrc (or the startup file of whatever shell you use) so that condor_submit etc. end up in your path:

export CONDOR_CONFIG=/opt/condor-6.7.18/etc/condor_config
export CONDORPATH=/opt/condor
export PATH=$PATH:$CONDORPATH/bin

--Isabel 11:05, 30 March 2006 (PST)