OpenPBS Support

JavaParty supports running applications within a batch system such as OpenPBS. The support consists of three components: a batch file creation script, a batch file execution script, and the JavaParty environment itself. The following briefly describes how to submit jobs to OpenPBS and how the internal scripts work, so that you can adjust them to your needs.

Requirements

JavaParty assumes that the following commands are available and reachable from your path:

qsub

Part of OpenPBS; submits a job to the queue.

ssh

Secure shell. This is used internally to spawn the distributed runtime environment on your cluster when the batch system starts your job.

Make sure that you can log in from each cluster node to every other node without typing a password. Otherwise your batch job will fail, because the distributed runtime environment cannot be started. You can achieve this with an appropriate .shosts or /etc/ssh/shosts.equiv file. Please consult man ssh for details.
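
One way to verify this before submitting a job is to attempt a non-interactive login to every node. The following is a minimal sketch in Bourne shell syntax (not tcsh). The function name check_ssh and the idea of keeping the host names in a file are assumptions made for illustration; BatchMode=yes is a standard OpenSSH option that makes ssh fail instead of prompting for a password.

```shell
#!/bin/sh
# check_ssh NODEFILE: try a password-less login to every host listed in
# NODEFILE (one host name per line, duplicates are ignored).
# The SSH variable can be set to substitute another command for ssh.
check_ssh() {
    failed=0
    for host in $(sort -u "$1"); do
        # BatchMode=yes: never prompt; fail if a password would be needed.
        if ${SSH:-ssh} -o BatchMode=yes "$host" true </dev/null; then
            echo "ok     $host"
        else
            echo "FAILED $host"
            failed=1
        fi
    done
    return $failed
}
```

This only tests logins from the machine on which it is run. Since every node must be able to reach every other node, it would have to be repeated on each node to cover all pairs.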

Quick Tour

The following assumes that your application classes are compiled to a directory /home/user/classes, your main class is mypackage.MainClass, and you are using the tcsh shell.

  1. Log on to the front-end machine of your cluster, where OpenPBS is installed.
  2. Set your CLASSPATH environment variable to your application class path.
    setenv CLASSPATH /home/user/classes
    
  3. Submit your job using the script jpsub that can be found in the bin/ directory of the JavaParty distribution.
    jpsub -np <number of processors> mypackage.MainClass <additional arguments> 
    
  4. If everything went fine, you will find files containing the standard output and standard error of your job in the directory where you submitted the job. The standard output is contained in jpsub.o<jobnr> and the standard error in jpsub.e<jobnr>.

Batch system internals

If something did not work as expected, you may want to consult this section to adjust the JavaParty batch system support to your needs. The components of the batch system support are explained below:

jpsub

Submits a JavaParty class to the batch system for execution. After parsing its arguments, it invokes

qsub -l nodes=<number of nodes>

from OpenPBS and passes a batch script that is created on the fly. This batch script restores some environment variables and calls jpq to actually execute the application under control of the batch system. The created batch file looks like the following:

#!/bin/tcsh 
# 
# Restore environment variables 
# 
setenv CLASSPATH <value from invocation environment> 
setenv LD_LIBRARY_PATH <value from invocation environment>
#
# Invoke jpq 
# 
jpq <class name> <additional arguments> 
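
To illustrate how such a batch file can be created on the fly, here is a minimal sketch. It is not the actual jpsub code. It is written in Bourne shell syntax, and the function name make_batch_script is hypothetical; it prints a tcsh script like the one above, filling in the values from the current environment.

```shell
#!/bin/sh
# make_batch_script CLASS [ARGS...]: print a tcsh batch file that restores
# CLASSPATH and LD_LIBRARY_PATH from the current environment and calls jpq.
# Values are inserted unquoted, so this sketch does not handle arguments
# that contain spaces or shell metacharacters.
make_batch_script() {
    cat <<EOF
#!/bin/tcsh
#
# Restore environment variables
#
setenv CLASSPATH ${CLASSPATH:-}
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH:-}
#
# Invoke jpq
#
jpq $*
EOF
}
```

The generated text could then be written to a temporary file and handed to qsub. How jpsub itself passes the script is not shown here.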

jpq

Executes a JavaParty class under control of the batch system. It expects the environment variables PBS_NODEFILE and PBS_JOBID to be set accordingly. PBS_NODEFILE must point to a file containing a list of hosts that should be used for the computation. This file is created by OpenPBS at the time the job is started. PBS_JOBID is set to an arbitrary string that serves as a key to identify all parts of the distributed runtime environment that belong to this program execution.

jpq starts a runtime manager, a JavaParty virtual machine, and the main JavaParty class on the local node. It then logs into all other hosts specified in PBS_NODEFILE using ssh and starts a JavaParty virtual machine on each of them.
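
The split between the local node and the remaining hosts can be sketched as follows. This is an illustration in Bourne shell syntax and not the jpq source. It assumes that the node file contains one host name per line and that the first entry is the node on which the batch script runs; the function names master_node and slave_nodes are hypothetical.

```shell
#!/bin/sh
# master_node FILE: print the first host in the node file.
# Assumption: the first entry is the node that runs the batch script.
master_node() {
    head -n 1 "$1"
}

# slave_nodes FILE: print every host after the first, one per line.
slave_nodes() {
    tail -n +2 "$1"
}

# When run inside a batch job, show how the hosts would be divided.
if [ -n "${PBS_NODEFILE:-}" ]; then
    echo "master: $(master_node "$PBS_NODEFILE")"
    slave_nodes "$PBS_NODEFILE" | sed 's/^/slave:  /'
fi
```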

After the main class terminates, the distributed environment is shut down automatically and the batch job is finished. To learn more about how to set up a JavaParty runtime environment for a single application execution using a node file, please look in the jpq script. Basically the following commands are executed:

On the master node:

javaparty \ 
  rm -host <master> -port 1099 -code <JOBID> \
     -passive -nodefile <nodefile> \
  vm -host <master> -port 1099 -code <JOBID> \
     -nodename <nodename[0]> \
  exec -host <master> -port 1099 -code <JOBID> \
       -killonexit <class> <additional arguments> 

On each slave node:

javaparty \
  vm -host <master> -port 1099 -code <JOBID> -nodename <node[n]> 
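
As a dry run, the slave command lines can be printed for a given set of hosts before anything is started. The sketch below uses Bourne shell syntax; the function name print_slave_commands is hypothetical, and both the use of the host name as the -nodename value and the exact way jpq wraps the command in ssh are assumptions. The javaparty options are taken from the command shown above.

```shell
#!/bin/sh
# print_slave_commands MASTER JOBID NODE...: print (do not run) one ssh
# command per slave node.
# Assumption: the host name is also used as the -nodename value.
print_slave_commands() {
    master=$1
    jobid=$2
    shift 2
    for node in "$@"; do
        echo "ssh $node javaparty vm -host $master -port 1099 -code $jobid -nodename $node"
    done
}
```

For example, print_slave_commands n0 42.server n1 n2 prints two lines, one for n1 and one for n2, each pointing back to the master n0.
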
Last modified on Mar 23, 2006 12:41:23 AM