BigJob

A Pilot-based Job and Data Management System

About

BigJob is a flexible Pilot framework that takes the headache out of job and data management for clusters, grids, and clouds. BigJob is written in the Python programming language and, as such, is easily deployable into user space. It allows user-level control of Pilots and supports a wide range of application types. It is built on top of the Simple API for Grid Applications (SAGA), a high-level, easy-to-use API for accessing distributed resources, meaning BigJob works on a variety of backends such as PBS, SGE, Amazon EC2, etc.


What are Pilot-Jobs?

Pilot-Jobs decouple workload submission from resource assignment. This results in a flexible execution model, which in turn enables the distributed scale-out of applications on multiple, possibly heterogeneous, resources. Pilot-Jobs act as container jobs with sophisticated workflow management to coordinate the launch and interaction of the actual computational tasks within the container. They allow jobs to be executed without queuing each individual job.
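The decoupling can be illustrated with a toy sketch (plain Python, not BigJob code): a stand-in "pilot" that has already acquired its resources simply pulls waiting tasks and runs them directly, with no further interaction with a batch queue.

```python
import subprocess

# Toy illustration of the Pilot-Job pattern (not BigJob code): the
# "pilot" is assumed to have been scheduled once and to be active;
# it then executes each waiting task immediately, without
# re-queueing every individual task.

def run_pilot(tasks):
    """Execute each task (an argv list) inside the running pilot
    and return the collected standard output."""
    results = []
    for argv in tasks:
        completed = subprocess.run(argv, capture_output=True, text=True)
        results.append(completed.stdout.strip())
    return results

tasks = [["/bin/echo", "task", str(i)] for i in range(4)]
print(run_pilot(tasks))
```

In a real system the pilot itself sits in the batch queue once, then dispatches tasks like this loop does as soon as it becomes active.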

BigJob's main focus is to be a flexible and extensible Pilot-based system for job submission and data management. Unlike many other Pilot-Job systems, BigJob natively supports MPI jobs and, because of its integration with saga-python, works on a variety of backend systems.

What is BigJob used for?

BigJob is used by scientists and engineers to solve real problems, such as protein folding, molecular dynamics umbrella sampling, replica exchange, and other data- and compute-intensive workloads. Typical usage patterns include:

  • Parameter Sweeps
  • Many instances of the same task (ensemble applications)
  • Chained tasks
  • Loosely-coupled, distinct tasks
  • Tasks with Data and/or Compute Dependencies
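For instance, a parameter sweep maps naturally onto a list of compute unit descriptions, one per parameter value. The sketch below is illustrative: the field names mirror the Pilot-API dictionaries used in the example script later on this page, but sweep_descriptions is a hypothetical helper, not part of the BigJob API.

```python
# Sketch: a parameter sweep expressed as one compute unit
# description per parameter value. The dictionary fields follow
# the Pilot-API style used in the BigJob example script; the
# helper function itself is hypothetical.

def sweep_descriptions(executable, parameters):
    descriptions = []
    for p in parameters:
        descriptions.append({
            "executable": executable,
            "arguments": [str(p)],        # the varying sweep parameter
            "number_of_processes": 1,
            "output": "stdout.txt",
            "error": "stderr.txt",
        })
    return descriptions

units = sweep_descriptions("/bin/echo", [0.1, 0.2, 0.3])
```

Each resulting dictionary would then be handed to submit_compute_unit, exactly as in the example script below.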


Why Pilot-Jobs?

Production-grade distributed cyberinfrastructure almost always has a local resource manager installed, such as a batch queuing system. A distributed application often requires many jobs to produce useful output data; these jobs often have the same executable. A traditional way of submitting these jobs would be to submit an individual job for each executable. These jobs (often hundreds) sit in the batch queuing system and may not become active at the same time. Load and scheduling variations may add unnecessary hours to a job's total time to completion due to inefficiencies in scheduling many individual jobs.

A Pilot-Job provides an alternative approach. It can be thought of as a container job for many sub-jobs. A Pilot-Job acquires the resources necessary to execute the sub-jobs (thus, it asks for all of the resources required to run the sub-jobs, rather than just one sub-job). If a system has a batch queue, the Pilot-Job is submitted to this queue. Once it becomes active, it can run the sub-jobs directly, instead of having to wait for each sub-job to queue. This eliminates the need to submit a different job for every executable and significantly reduces the time-to-completion.


Get Started

Installation

The easiest way to install BigJob under your user account is via virtualenv and easy_install or pip.

$ virtualenv $HOME/.bigjob/python
$ source $HOME/.bigjob/python/bin/activate
$ easy_install bigjob

BigJob requires a working directory in which to place the output files. We will call this directory agent. Create it in the location where you want your BigJob script to run; for the purposes of this tutorial, we assume that your BigJob working directory is in $HOME.

$ cd $HOME
$ mkdir agent

BigJob uses a redis server for coordination and task management. If you do not have access to a redis server, easy installation instructions are provided in the manual.
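The coordination URL you pass to BigJob is an ordinary redis:// URL. As a quick sanity check (a plain-Python sketch, not BigJob code), you can parse it to confirm the host and port; the password-protected form redis://:mypassword@host:6379 follows standard Redis URL syntax.

```python
from urllib.parse import urlparse

# The coordination URL points BigJob at your redis server. A plain
# server uses redis://host:port; a password-protected one can be
# written as redis://:mypassword@host:6379 (standard Redis URL
# syntax, shown here as an assumption).
COORDINATION_URL = "redis://localhost:6379"

parsed = urlparse(COORDINATION_URL)
print(parsed.scheme, parsed.hostname, parsed.port)
```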

More detailed installation instructions, including how to setup SSH keys, can be found in the manual.

Example Script

This example runs 4 jobs on your local machine using the "fork" adaptor. Each of these jobs uses only one core to execute a simple /bin/echo task, showing how arguments and environment variables can be passed with your script. The example assumes you have a redis server running on your localhost. If your redis server is located remotely, simply change localhost to your remote hostname.

In order to use BigJob for remote job submission, configure your SSH keys to allow password-less login, and change the service_url parameter in the pilot_compute_description to a different adaptor scheme. For example, you can use slurm[+ssh]://... to access a SLURM cluster, sge[+ssh]://... to access an SGE cluster, pbs[+ssh]://... to access a PBS cluster, condor[+ssh]://... to access a Condor(-G) gateway, or just ssh://... to access a remote host (e.g., a cloud VM) via SSH.
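As a sketch, a pilot description for remote submission might look like the following; the hostname is a placeholder, and only the service_url (plus the resource sizes) differs from the local fork example.

```python
import os

# Sketch: a pilot_compute_description for remote submission to a
# SLURM cluster over SSH. The hostname below is a placeholder, not
# a real endpoint; SSH keys must allow password-less login to the
# target machine, and the agent working directory must exist there.
pilot_compute_description = {
    "service_url": "slurm+ssh://login.cluster.example.org",
    "number_of_processes": 8,
    "working_directory": os.getenv("HOME", "/tmp") + "/agent",
    "walltime": 30,
}
```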

More information about the BigJob job API can be found in the respective manual section.

A full list of available adaptors can be found in the manual.

import os
import sys
import time
from pilot import PilotComputeService, ComputeDataService, State

# Number of jobs to run
NUMBER_JOBS = 4
# Redis server used for coordination; change "localhost" if it runs remotely
COORDINATION_URL = "redis://localhost:6379"

if __name__ == "__main__":

    pilot_compute_service = PilotComputeService(COORDINATION_URL)

    # Describe the Pilot: where it runs, how many cores it holds,
    # its working directory, and its walltime
    pilot_compute_description = { "service_url": "fork://localhost",
                                  "number_of_processes": 1,
                                  "working_directory": os.getenv("HOME") + "/agent",
                                  "walltime": 10
                                }

    pilot_compute_service.create_pilot(pilot_compute_description)

    compute_data_service = ComputeDataService()
    compute_data_service.add_pilot_compute_service(pilot_compute_service)

    print("Finished Pilot-Job setup. Submitting compute units")

    # Submit compute units (sub-jobs); each one echoes a message
    # built from an argument and an environment variable
    for i in range(NUMBER_JOBS):
        compute_unit_description = {
                "executable": "/bin/echo",
                "arguments": ["Hello", "$ENV1", "I am CU number " + str(i)],
                "environment": ["ENV1=World"],
                "number_of_processes": 1,
                "spmd_variation": "single",
                "output": "stdout.txt",
                "error": "stderr.txt",
                }
        compute_data_service.submit_compute_unit(compute_unit_description)

    print("Waiting for compute units to complete")
    compute_data_service.wait()

    print("Terminate Pilot Jobs")
    compute_data_service.cancel()
    pilot_compute_service.cancel()

Paste the above script into a file and run it. Your command-line output should look similar to this:

$ python ./bigjob.py
Finished Pilot-Job setup. Submitting compute units
Waiting for compute units to complete
Terminate Pilot Jobs
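After the run, each compute unit's stdout.txt and stderr.txt end up under the agent working directory. The helper below is a hypothetical convenience (not part of BigJob); it searches recursively rather than assuming a particular per-unit subdirectory naming scheme.

```python
import glob
import os

# Collect the output files that the compute units wrote under the
# agent working directory. The per-unit subdirectory layout may
# vary, so we search recursively instead of assuming exact names.

def collect_stdout(agent_dir):
    """Return {path: contents} for every stdout.txt below agent_dir."""
    outputs = {}
    pattern = os.path.join(agent_dir, "**", "stdout.txt")
    for path in sorted(glob.glob(pattern, recursive=True)):
        with open(path) as f:
            outputs[path] = f.read().strip()
    return outputs

for path, text in collect_stdout(os.path.join(os.getenv("HOME", "."), "agent")).items():
    print(path, "->", text)
```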

Want to learn more? Check out our application-writing guide in the manual! Or jump right in and try our tutorial!