BigJob is a flexible Pilot framework that takes the headache out of job and data management for clusters, grids, and clouds. BigJob is written in the Python programming language and, as such, is easily deployable into user space. It allows user-level control of Pilots and supports a wide range of application types. It is built on top of the Simple API for Grid Applications (SAGA), a high-level, easy-to-use API for accessing distributed resources, which means BigJob works on a variety of backends such as PBS, SGE, and Amazon EC2.
Pilot-Jobs support the decoupling of workload submission from resource assignment. This results in a flexible execution model, which in turn enables the distributed scale-out of applications on multiple and possibly heterogeneous resources. Pilot-Jobs support the use of container jobs with sophisticated workflow management to coordinate the launch and interaction of actual computational tasks within the container. This allows many jobs to be executed without queuing each individual job.
BigJob's main focus is to be a flexible and extensible Pilot-based system for job submission and data management. Unlike many other Pilot-Job systems, BigJob natively supports MPI jobs and, because of its integration with saga-python, works on a variety of backend systems.
BigJob is used by scientists and engineers to solve real problems, such as protein-folding, traditional molecular dynamics umbrella sampling, replica exchange, and other data/compute intensive workloads.
Production-grade distributed cyberinfrastructure almost always has a local resource manager installed, such as a batch queuing system. A distributed application often requires many jobs to produce useful output data; these jobs often have the same executable. A traditional way of submitting these jobs would be to submit an individual job for each executable. These jobs (often hundreds) sit in the batch queuing system and may not become active at the same time. Load and scheduling variations may add unnecessary hours to a job's total time to completion due to inefficiencies in scheduling many individual jobs.
A Pilot-Job provides an alternative approach. It can be thought of as a container job for many sub-jobs. A Pilot-Job acquires the resources necessary to execute the sub-jobs (thus, it asks for all of the resources required to run the sub-jobs, rather than just one sub-job). If a system has a batch queue, the Pilot-Job is submitted to this queue. Once it becomes active, it can run the sub-jobs directly, instead of having to wait for each sub-job to queue. This eliminates the need to submit a different job for every executable and significantly reduces the time-to-completion.
The easiest way to install BigJob under your user account is via virtualenv and easy_install or pip.
$ virtualenv $HOME/.bigjob/python
$ source $HOME/.bigjob/python/bin/activate
$ easy_install bigjob
BigJob requires a working directory in which to place the output files. We will call this directory agent. This directory should reside in whichever location you wish your BigJob script to work; for the purposes of this tutorial, we will assume that your BigJob working directory is in $HOME.
$ cd $HOME
$ mkdir agent
BigJob uses a redis server for coordination and task management. If you do not have access to a redis server, easy installation instructions are provided in the manual.
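The coordination URL handed to BigJob is a standard redis URL of the form redis://[:password@]hostname:port. As a quick sketch (the remote hostname and password below are placeholders, not values from this tutorial; whether a password is needed depends on your redis configuration), the URL can be sanity-checked with Python's standard urlparse before running a BigJob script:

```python
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

# A local redis server on the default port, as used in this tutorial:
local_url = "redis://localhost:6379"

# A password-protected remote server (hostname and password are placeholders):
remote_url = "redis://:mypassword@redis.example.org:6379"

for url in (local_url, remote_url):
    parts = urlparse(url)
    # Confirm the URL decomposes into the pieces BigJob needs.
    print(parts.scheme, parts.hostname, parts.port)
```

If the scheme, hostname, or port print as None or look wrong, fix the COORDINATION_URL before debugging anything else.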
More detailed installation instructions, including how to setup SSH keys, can be found in the manual.
This example runs 4 jobs on your local machine using "fork". Each of these jobs uses only one core to execute a simple /bin/echo task, showing how arguments and environment variables can be passed with your script. The example assumes you have a redis server running on your localhost. If your redis server is located remotely, simply change localhost to your remote hostname.
In order to use BigJob for remote job submission, configure your SSH keys to allow password-less login, and change the service_url parameter in the pilot compute description to a different adaptor scheme. For example, you can use slurm[+ssh]://... to access a SLURM cluster, sge[+ssh]://... to access an SGE cluster, pbs[+ssh]://... to access a PBS cluster, condor[+ssh]://... to access a Condor(-G) gateway, or just ssh://... to access a remote host (e.g., a cloud VM) via SSH. A full list of available adaptors can be found in the manual.
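As an illustration, a pilot compute description targeting a remote machine over SSH might look like the following. The hostname, user, and remote path are placeholders, not real values; the working directory must already exist on the remote host:

```python
# Placeholder values -- replace with a host you can reach via password-less SSH.
pilot_compute_description = {
    "service_url": "ssh://user@remote.example.org",   # or e.g. "slurm+ssh://..." for a SLURM cluster
    "number_of_processes": 8,                         # cores requested for the pilot
    "working_directory": "/home/user/agent",          # agent directory on the remote host
    "walltime": 10,                                   # minutes the pilot stays alive
}

print(sorted(pilot_compute_description))
```

The rest of the script stays the same: only the service_url (and possibly the working_directory) changes when moving between backends.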
import os
import sys
import time

from pilot import PilotComputeService, ComputeDataService, State

### This is the number of jobs you want to run
NUMBER_JOBS = 4
COORDINATION_URL = "redis://localhost:6379"

if __name__ == "__main__":

    pilot_compute_service = PilotComputeService(COORDINATION_URL)

    pilot_compute_description = {
        "service_url": "fork://localhost",
        "number_of_processes": 1,
        "working_directory": os.getenv("HOME") + "/agent",
        "walltime": 10,
    }

    pilot_compute_service.create_pilot(pilot_compute_description)

    compute_data_service = ComputeDataService()
    compute_data_service.add_pilot_compute_service(pilot_compute_service)

    print("Finished Pilot-Job setup. Submitting compute units")

    # submit compute units
    for i in range(NUMBER_JOBS):
        compute_unit_description = {
            "executable": "/bin/echo",
            "arguments": ["Hello", "$ENV1", "I am CU number " + str(i)],
            "environment": ["ENV1=World"],
            "number_of_processes": 1,
            "spmd_variation": "single",
            "output": "stdout.txt",
            "error": "stderr.txt",
        }
        compute_data_service.submit_compute_unit(compute_unit_description)

    print("Waiting for compute units to complete")
    compute_data_service.wait()

    print("Terminate Pilot Jobs")
    compute_data_service.cancel()
    pilot_compute_service.cancel()
Paste the above script into a file and run it. Your command-line output should look similar to this:
$ python ./bigjob.py
Finished Pilot-Job setup. Submitting compute units
Waiting for compute units to complete
Terminate Pilot Jobs
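BigJob's native MPI support, mentioned above, is exposed through the same compute unit description: setting spmd_variation to "mpi" asks the pilot to launch the executable under the resource's MPI runtime rather than as a single process. A minimal sketch, assuming a hypothetical MPI executable at /path/to/mpi_app (a placeholder, not part of this tutorial):

```python
# Hypothetical MPI task: "/path/to/mpi_app" is a placeholder executable path.
compute_unit_description = {
    "executable": "/path/to/mpi_app",
    "arguments": [],
    "number_of_processes": 8,        # total MPI ranks requested for this task
    "spmd_variation": "mpi",         # launch under MPI instead of "single"
    "output": "stdout.txt",
    "error": "stderr.txt",
}

print(compute_unit_description["spmd_variation"],
      compute_unit_description["number_of_processes"])
```

Submitting such a description works exactly like the echo example: pass it to compute_data_service.submit_compute_unit() inside a running pilot.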
Want to learn more? Check out our application-writing guide in the manual! Or jump right in and try our tutorial!