5. Library Reference

5.1. Compute and Data Services

This section gives a hierarchical overview of the library components and how they interact. The subsections then describe the API of each component.

The main concepts and classes exposed by the Compute part of the API are:

  • PilotCompute (PC): a pilot job, which can execute some compute workload (ComputeUnit).
  • PilotComputeDescription (PCD): description for specifying the requirements of a PilotCompute.
  • PilotComputeService (PCS): a factory for creating PilotComputes.

The data side of the Pilot API is symmetric to the compute side. The exposed classes for managing Pilot Data are:

  • PilotData (PD): a pilot that manages some data workload (DataUnit).
  • PilotDataDescription (PDD): an abstract description of the requirements of the PD.
  • PilotDataService (PDS): a factory (service) that creates PilotData objects according to some specification.

The application workload is represented by so-called ComputeUnits and DataUnits:

  • ComputeUnit (CU): a work item executed on a PilotCompute.
  • DataUnit (DU): a data item managed by a PilotData.

Both Compute and Data Units are specified using an abstract description object:

  • ComputeUnitDescription (CUD): abstract description of a ComputeUnit.
  • DataUnitDescription (DUD): abstract description of a DataUnit.

The ComputeDataService represents the central entry point for the application workload:

  • ComputeDataService (CDS): a service that maps CUs and DUs to a set of PilotComputes and PilotData and takes care of their placement. The set of PilotComputes and PilotData available to the CDS can be changed during the application’s runtime. The CDS handles different data-compute affinities and co-locates compute and data for the requested workload.

5.1.1. PilotComputeService

The PilotComputeService (PCS) is a factory for creating PilotCompute objects; each PilotCompute is the individual handle to a resource. The PCS takes the COORDINATION_URL (as defined above) as an argument. This URL points to the Redis database that is used to coordinate the compute units.

class pilot.impl.pilotcompute_manager.PilotComputeService(coordination_url='redis://localhost', pcs_url=None)

PilotComputeService (PCS).

Factory for PilotComputes.

cancel()

Cancel the PilotComputeService.

This also cancels all PilotComputes that are under the control of this PCS.

Keyword arguments: None

Return value: Result of operation

create_pilot(pilot_compute_description)

Create a PilotCompute and add it to the PilotComputeService.

Keyword arguments: pilot_compute_description – the PilotComputeDescription

Return value: A PilotCompute object

list_pilots()

List managed PilotComputes.

Return value: A list of PilotCompute URLs

5.1.2. PilotComputeDescription

The PCD defines the compute resource on which the Pilot agent will be started. Recall that a Pilot-Job requests the resources required to run all of the Compute Units (subjobs). Any number of PilotComputes can be instantiated, depending on the compute resources available to the application; each one needs its own description (using two machines rather than one requires two PilotComputeDescriptions).

An example of a Pilot Compute Description is shown below:

pilot_compute_description = {
    "service_url": "pbs+ssh://india.futuregrid.org",
    "number_of_processes": 8,
    "processes_per_node": 8,
    "working_directory": "/N/u/<username>",
    "affinity_datacenter_label": "us-east-indiana",
    "affinity_machine_label": "india"
}
class PilotComputeDescription
service_url

Specifies the SAGA-Python job adaptor (often this is based on the batch queuing system) and resource hostname (for instance, pbs+ssh://lonestar.tacc.utexas.edu) on which jobs can be executed.

Type: string
number_of_processes

The number of cores that need to be allocated to run the jobs.

Type: string
processes_per_node

Optional. The number of cores per node to be requested from the resource management system.

Type: string

Note

This argument does not limit the number of processes that can run on a node! This field is required by some XSEDE/Torque clusters. If you have to specify a ppn parameter (e.g. `-lnodes=1:ppn=8`) in your qsub script, you need this field in your BigJob script.
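As a rough illustration of how number_of_processes and processes_per_node relate to a Torque-style node request, the following sketch derives a `nodes=<n>:ppn=<p>` string from a description dictionary. The helper name `torque_node_request` and the rounding-up rule are assumptions made for this example; they are not taken from BigJob, which may build the request differently.

```python
import math


def torque_node_request(description):
    """Build a Torque-style 'nodes=N:ppn=P' string (hypothetical helper)."""
    processes = int(description["number_of_processes"])
    ppn = int(description["processes_per_node"])
    if processes < 1 or ppn < 1:
        raise ValueError("process counts must be positive")
    # Assumption: enough whole nodes are requested to hold all processes.
    nodes = math.ceil(processes / ppn)
    return "nodes=%d:ppn=%d" % (nodes, ppn)


# The values from the example description above map to a single node.
print(torque_node_request({"number_of_processes": 8, "processes_per_node": 8}))
```

With the values from the example description, the sketch yields the same request as the qsub parameter quoted in the note.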

working_directory

The directory in which the Pilot-Job agent executes.

Type: string
project

Optional. The project allocation, if running on an XSEDE resource.

Type: string

Note

This field must be removed if you are running somewhere that does not require an allocation.

queue

Optional. The job queue to be used.

Type: string

Note

If you are not submitting to a batch queuing system, remove this parameter.

Note

For remote hosts, password-less login must be enabled.

wall_time_limit

Optional. The number of minutes the resources are requested for. Required for some resources (e.g. on TACC machines).

Type: string
affinity_datacenter_label

Optional. The data center label used for affinity topology.

Type: string

Note

Data centers and machines are organized in a logical topology tree (similar to the tree spanned by a DNS topology). The greater the distance between two resources, the smaller their affinity.
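The idea of distance in a topology tree can be sketched as follows. The example assumes that a label such as us-east-indiana encodes a path from the root of the tree, with components separated by hyphens; this encoding, the function name `topology_distance`, and the label us-east-virginia are assumptions for illustration only. How BigJob actually evaluates affinity is not specified in this reference.

```python
def topology_distance(label_a, label_b, separator="-"):
    """Count the tree edges between two labels (illustrative sketch).

    Assumption: each label is a root-to-node path whose components are
    joined by `separator`, e.g. "us-east-indiana" -> us / east / indiana.
    """
    path_a = label_a.split(separator)
    path_b = label_b.split(separator)
    # Length of the shared prefix, i.e. depth of the common ancestor.
    common = 0
    for part_a, part_b in zip(path_a, path_b):
        if part_a != part_b:
            break
        common += 1
    # Steps up from A to the common ancestor plus steps down to B.
    return (len(path_a) - common) + (len(path_b) - common)


print(topology_distance("us-east-indiana", "us-east-virginia"))
print(topology_distance("us-east-indiana", "eu-de-south"))
```

A smaller distance corresponds to a higher affinity, so under these assumptions two resources in the same region rank closer than resources on different continents.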

affinity_machine_label

Optional. The machine (resource) label used for affinity topology.

Type: string

Note

Data centers and machines are organized in a logical topology tree (similar to the tree spanned by a DNS topology). The greater the distance between two resources, the smaller their affinity.
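Because a PilotComputeDescription is a plain dictionary, a misspelled key can easily go unnoticed. The following sketch checks a description against the field names listed in this section. The helper `check_pilot_compute_description` is hypothetical and not part of BigJob, and treating every field that is not marked "Optional" as required is an assumption made for this example.

```python
# Field names are taken from the PilotComputeDescription reference above.
# Assumption: fields not marked "Optional" in the reference are required.
REQUIRED_FIELDS = ("service_url", "number_of_processes", "working_directory")
OPTIONAL_FIELDS = (
    "processes_per_node",
    "project",
    "queue",
    "wall_time_limit",
    "affinity_datacenter_label",
    "affinity_machine_label",
)


def check_pilot_compute_description(description):
    """Return a list of human-readable problems (hypothetical helper)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in description:
            problems.append("missing field: %s" % field)
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    # Unknown keys are usually typos, so report them in a stable order.
    for field in sorted(set(description) - known):
        problems.append("unknown field: %s" % field)
    return problems


print(check_pilot_compute_description({"service_url": "pbs+ssh://india.futuregrid.org",
                                       "queu": "batch"}))
```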

5.1.3. PilotCompute

A pilot job, which can execute some compute workload (ComputeUnit).

This is the object that is returned by the PilotComputeService when a new PilotCompute is created based on a PilotComputeDescription.

The PilotCompute object can be used by the application to keep track of active pilots.

A PilotCompute has state, can be queried, and cancelled.

class pilot.impl.pilotcompute_manager.PilotCompute(pilot_compute_service=None, bigjob_object=None, pilot_compute_description=None, pilot_url=None)

PilotCompute (PC).

Properties:

  • state: The state of the pilot.
  • id: The ID of the pilot (type: string, URL). The id may be ‘None’ if the Pilot is not yet in Running state. The returned ID can be used to connect to the PC instance later on, for example from within a different application instance.
  • pilot_compute_description: The PilotComputeDescription used to create this pilot.
cancel()

Terminates the pilot.

get_details()

Returns a dict that contains the details of the Pilot Compute (job state, description, ...).

get_free_nodes()

Returns the number of free slots available within the pilot.

get_state()

Returns the state of the pilot.

get_url()

Returns a unique URL referencing the Pilot Compute. This URL can be used to reconnect to the Pilot Compute.

list_compute_units()

Lists managed ComputeUnits.

Return value: A list of ComputeUnit IDs

The returned list can include units which have been submitted to this pilot.

submit_compute_unit(compute_unit_description)

Submit a CU to this pilot.

Keyword arguments: compute_unit_description – the ComputeUnitDescription or dictionary describing the compute task

Return value: A ComputeUnit object

The CUD is (possibly translated and) passed on to the PDS scheduler, which will attempt to instantiate the described workload process on the managed set of Pilot Computes.

On success, the returned CU is in Pending state (or moved into any state downstream from Pending).

The call will honor all attributes set on the CUD. Attributes which are not explicitly set are interpreted as having default values (see documentation of CUD), or, where default values are not specified, are ignored.

wait()

Wait until the Pilot Compute enters a final state (Done, Canceled or Failed).

It is not an error to call wait() in a final state – the call simply returns immediately.
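The semantics of wait() can be approximated by polling for a final state. The sketch below is illustrative only and does not reproduce BigJob's implementation: the function `wait_for_final_state`, its `poll_interval` and `timeout` parameters, and the use of plain strings for states are assumptions. The state names follow the State enumeration in section 5.3.

```python
import time

# Assumption: states are plain strings named as in section 5.3.
FINAL_STATES = ("Done", "Canceled", "Failed")


def wait_for_final_state(get_state, poll_interval=1.0, timeout=None):
    """Poll get_state() until it reports a final state (illustrative only)."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        state = get_state()
        if state in FINAL_STATES:
            # Already final: return at once, mirroring the documented behavior.
            return state
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("no final state reached, last state: %s" % state)
        time.sleep(poll_interval)


# Simulated state sequence standing in for a real pilot.
states = iter(["New", "Pending", "Running", "Done"])
print(wait_for_final_state(lambda: next(states), poll_interval=0))
```

In an application, the callable passed in would presumably be the get_state method of a PilotCompute object.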

5.1.4. PilotDataService

The PilotDataService (PDS) is a factory for creating Pilot-Data objects. The PDS takes the COORDINATION_URL as an argument. This URL points to the Redis database that is used to coordinate the data units.

class pilot.impl.pilotdata_manager.PilotDataService(coordination_url='redis://localhost', pds_url=None)

PilotDataService (PDS).

Factory for creating Pilot Data.

cancel()

Cancel the PilotDataService. Release all Pilot Data created by this service.

Keyword arguments: None

Return value: Result of operation

create_pilot(pilot_data_description)

Create a PilotData

Keyword arguments: pilot_data_description – PilotData Description:

{
    'service_url': "ssh://<hostname>/base-url/",               
    'size': "1000"
}

Return value: A PilotData object

get_pilot(pd_id)

Reconnect to an existing pilot.

get_url()

Returns URL of Pilot Data Service

list_pilots()

List all PDs of PDS

to_dict()

Return a Python dictionary containing the representation of the PDS (internal method not part of Pilot API)

wait()

Wait until all PDs managed by this Pilot Data Service enter a final state.

5.1.5. PilotDataDescription

PilotDataDescription objects are used to describe the requirements for a PilotData instance. Currently, the only generic property that can be set is size; all other properties are backend-specific security/authentication hints. Example:

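from pilot.impl.pilotdata_manager import PilotDataService

# COORDINATION_URL is the Redis coordination URL defined earlier.
pilot_data_service = PilotDataService(COORDINATION_URL)
pilot_data_description = {
    "service_url": "ssh://localhost/tmp/pilotdata/",
}
# Create the pilot through the service object created above.
pilot_data = pilot_data_service.create_pilot(pilot_data_description)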
service_url

Specifies the file adaptor and resource hostname on which a Pilot-Data will be created. Supported schemes:

  • SSH: ssh://localhost/tmp/pilotdata/ (Password-less login and password-less private key required)
  • iRODS: irods://gw68/${OSG_DATA}/osg/irods/<username>/?vo=osg&resource-group=osgGridFtpGroup
  • Globus Online: go://<user>:<password>@globusonline.org?ep=xsede#lonestar4&path=/work/01131/tg804093/pilot-data-go
  • Google Storage: gs://google.com
  • Amazon S3: s3://aws.amazon.com
  • Eucalyptus Walrus: walrus://<endpoint-ip>
Type: string
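The scheme of service_url selects the backend. A minimal sketch of that selection, using only the Python standard library, might look like the following. The `BACKENDS` table and the `backend_for` helper are hypothetical names introduced for this example; they are not part of the Pilot API.

```python
from urllib.parse import urlparse

# Schemes and backend names as listed for service_url above.
BACKENDS = {
    "ssh": "SSH",
    "irods": "iRODS",
    "go": "Globus Online",
    "gs": "Google Storage",
    "s3": "Amazon S3",
    "walrus": "Eucalyptus Walrus",
}


def backend_for(service_url):
    """Return the backend name for a service_url (hypothetical helper)."""
    scheme = urlparse(service_url).scheme
    if scheme not in BACKENDS:
        raise ValueError("unsupported scheme: %r" % scheme)
    return BACKENDS[scheme]


parsed = urlparse("ssh://localhost/tmp/pilotdata/")
print(backend_for("ssh://localhost/tmp/pilotdata/"), parsed.netloc, parsed.path)
```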
size

Optional. The storage space required (in megabytes) on the storage resource.

Type: int

Note

The ‘size’ attribute is not supported by all PilotData backends.

userkey

For SSH backend. The SSH private key. Attention: this key is stored in the Redis service to make it available to the Pilot agent. Use with caution and never with your production keys. Do not use it with a shared Redis server! The SSH key delegation mechanism is designed for resources where the worker nodes are not directly accessible, so the private key cannot be installed manually (e.g. OSG).

Type: string

Note

‘userkey’ is only supported by backends where worker nodes need private key access. An example of this is OSG.

access_key_id

For S3/Walrus backend. The ‘username’ for Amazon AWS compliant instances. It is an alphanumeric text string that uniquely identifies a user who owns an account. No two accounts can have the same access key.

Type: string

Note

‘access_key_id’ is only supported by AWS-compliant, EC2-based connections. This applies to Amazon AWS, Eucalyptus, and OpenStack. Please see Amazon’s documentation to learn how to obtain your access key id and password.

secret_access_key

For S3/Walrus backend. The ‘password’ for Amazon AWS compliant instances. It’s called secret because it is assumed to be known to the owner only.

Type: string

Note

‘secret_access_key’ is only supported by AWS-compliant, EC2-based connections. This applies to Amazon AWS, Eucalyptus, and OpenStack. Please see Amazon’s documentation to learn how to obtain your access key id and password.

5.1.6. PilotData

A Pilot-Data, which can store some data (DataUnit).

This is the object that is returned by the PilotDataService when a new PilotData is created based on a PilotDataDescription.

The PilotData object can be used by the application to keep track of active pilots.

class pilot.impl.pilotdata_manager.PilotData(pilot_data_service=None, pilot_data_description=None, pd_url=None)

PilotData (PD).

This is the object that is returned by the PilotDataService when a new PilotData is created based on a PilotDataDescription. A PilotData represents a finite amount of physical space on a certain resource. It can be populated with DataUnits.

The PilotData object can be used by the application to keep track of a pilot. A PilotData has state, can be queried, and can be cancelled.

cancel()

Cancel PilotData

copy_du(du, pd_new)

Copy DataUnit to another Pilot Data

create_du(du)

Create a new Data Unit within Pilot

classmethod create_pilot_data_from_dict(pd_dict)

Restore Pilot Data from dictionary

export_du(du, target_url)

Export Data Unit to a local directory

get_du(du_url)

Returns Data Unit if part of Pilot Data

get_state()

Return current state of Pilot Data

get_url()

Get URL of PilotData. Used for reconnecting to PilotData

list_data_units()

List all data units of Pilot Data

put_du(du)

Copy Data Unit to Pilot Data

remove_du(du)

Remove Data Unit from Pilot Data

submit_data_unit(data_unit_description=None, data_unit=None)

Creates a Data Unit object and initially imports the data specified in data_unit_description.

to_dict()

Internal method that returns a dict with all data contained in this Pilot Data

url_for_du(du)

Get full URL to DataUnit within PilotData

wait()

Wait until PD enters a final state (Done, Canceled or Failed).

5.1.7. ComputeDataService

The Compute Data Service (CDS) handles both Pilot Compute and Pilot Data entities in a holistic way. It represents the central entry point for the application workload. The CDS takes care of the placement of Compute and Data Units. The set of Pilot Computes and Pilot Data available to the CDS can be changed during the application’s runtime. The CDS handles different data-compute affinities and co-locates compute and data for the requested workload.

class pilot.impl.pilot_manager.ComputeDataService(cds_url=None)

ComputeDataService (CDS).

The ComputeDataService is the application’s interface to submit ComputeUnits and PilotData/DataUnit to the Pilot-Manager in the P* Model.

add_pilot_compute_service(pcs)

Add a PilotComputeService to this CDS.

Keyword arguments: pcs – the PilotComputeService to which this ComputeDataService will connect.

add_pilot_data_service(pds)

Add a PilotDataService to this CDS.

Keyword arguments: pds – the PilotDataService to add.

cancel()

Cancel the CDS. All associated PD and PC objects are canceled.

get_id()

Return value: ID of the ComputeDataService

get_state()

Return value: State of the ComputeDataService

list_data_units()

List all DUs of CDS

list_pilot_compute()

List all pilot compute of CDS

list_pilot_data()

List all pilot data of CDS

remove_pilot_compute_service(pcs)

Remove a PilotComputeService from this CDS.

Note that this does not cancel the PilotComputeService; it is just no longer connected to this CDS.

Keyword arguments: pcs – the PilotComputeService to remove from this ComputeDataService.

remove_pilot_data_service(pds)

Remove a PilotDataService from this CDS.

Keyword arguments: pds – the PilotDataService to remove.

submit_compute_unit(compute_unit_description)

Submit a CU to this Compute Data Service.

Keyword arguments: compute_unit_description – the ComputeUnitDescription from the application

Return value: A ComputeUnit object

submit_data_unit(data_unit_description)

Creates a Data Unit object and binds it to a physical resource (a PilotData).

wait()

Waits for CUs and DUs. Returns after all DUs have been placed (i.e. are in state Running) and all CUs have completed (i.e. are in state Done), or if a fault has occurred, or if the user has cancelled a CU or DU.
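The return condition of wait() can be written as a predicate over the current states. The following sketch only restates that condition; `workload_finished` is a hypothetical helper, states are assumed to be the plain strings from section 5.3, and mapping "a fault has occurred" to the Failed state is an assumption.

```python
def workload_finished(cu_states, du_states):
    """Return True when the wait() condition described above holds (sketch)."""
    cu_states = list(cu_states)
    du_states = list(du_states)
    # Assumption: a fault shows up as Failed, a user cancellation as Canceled.
    if any(state in ("Failed", "Canceled") for state in cu_states + du_states):
        return True
    # Otherwise every DU must be placed and every CU must be complete.
    return (all(state == "Running" for state in du_states)
            and all(state == "Done" for state in cu_states))


print(workload_finished(["Done", "Done"], ["Running"]))
print(workload_finished(["Running", "Done"], ["Running"]))
```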

5.2. Compute and Data Units

5.2.1. ComputeUnitDescription

The ComputeUnitDescription defines the actual compute unit that will be run. The executable specified here is what constitutes the individual jobs that will run within the Pilot. This executable can have input arguments or environment variables that must be passed with it in order for it to execute properly.

Example:

compute_unit_description = {
    "executable": "/bin/cat",
    "arguments": ["test.txt"],
    "number_of_processes": 1,
    "output": "stdout.txt",
    "error": "stderr.txt",
    "environment": ["MY_SCRATCH_DIR=/tmp"],
    # Stages the content of the data unit to the working directory of the compute unit.
    "input_data": [data_unit.get_url()],
    "affinity_datacenter_label": "eu-de-south",
    "affinity_machine_label": "mymachine-1"
}
class ComputeUnitDescription
executable

Specifies the path to the executable that will be run.

Type: string
arguments

Specifies any arguments that the executable needs. For instance, if running an executable from the command line requires a -p flag, then this -p flag can be added in this section.

Type: string
environment

Specifies any environment variables that need to be passed with the compute unit in order for the executable to work, e.g. [“MY_SCRATCH_DIR=/tmp”].

Type: string
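Each entry of the environment list has the form NAME=VALUE. As an illustration of that format, the sketch below converts such a list into a dictionary. The helper `parse_environment` is hypothetical and not part of BigJob; splitting at the first equals sign is an assumption about how values containing "=" should be treated.

```python
def parse_environment(entries):
    """Turn ["NAME=VALUE", ...] into a dict (hypothetical helper)."""
    environment = {}
    for entry in entries:
        # Split at the first "=" so that values may contain "=" themselves.
        name, separator, value = entry.partition("=")
        if not separator or not name:
            raise ValueError("expected NAME=VALUE, got %r" % entry)
        environment[name] = value
    return environment


print(parse_environment(["MY_SCRATCH_DIR=/tmp"]))
```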
working_directory

The working directory for the executable.

Type: string

Note

Recommendation: Do not set the working directory! If it is not set, the working directory is a sandbox directory of the CU (automatically created by BigJob).

input

Specifies the capture of <stdin>.

Type: string
output

Specifies the name of the file that captures the output from <stdout>. Default is stdout.txt.

Type: string
error

Specifies the name of the file that captures the output from <stderr>. Default is stderr.txt.

Type: string
number_of_processes

Defines how many CPU cores are reserved for the application process.

For instance, if you need 4 cores for 1 MPI executable, this value would be 4.

Type: string
spmd_variation

Defines how the application process is launched. Valid strings for this field are ‘single’ or ‘mpi’. If your executable is a.out, “single” executes it as ./a.out, while “mpi” executes mpirun -np <number_of_processes> ./a.out (note: aprun is used for Kraken, and srun/ibrun is used for Stampede).

Type: string
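The difference between the two spmd_variation values can be sketched as a small command builder. This is an illustration of the description above, not BigJob's launcher code: the function `launch_command` is hypothetical, and only the generic mpirun case is covered (the machine-specific launchers aprun, srun and ibrun are deliberately left out).

```python
def launch_command(executable, arguments=(), number_of_processes=1,
                   spmd_variation="single"):
    """Return the argument vector used to start a process (sketch)."""
    command = [executable] + list(arguments)
    if spmd_variation == "single":
        # Run the executable directly, e.g. ./a.out
        return command
    if spmd_variation == "mpi":
        # Prefix with mpirun, e.g. mpirun -np 4 ./a.out
        return ["mpirun", "-np", str(number_of_processes)] + command
    raise ValueError("spmd_variation must be 'single' or 'mpi'")


print(" ".join(launch_command("./a.out")))
print(" ".join(launch_command("./a.out", number_of_processes=4, spmd_variation="mpi")))
```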
input_data

Specifies the input data flow for a ComputeUnit. This is used in conjunction with PilotData. The format is [<data unit url>, ]

Type: string
output_data

Specifies the output data flow for a ComputeUnit. This is used in conjunction with PilotData. The format is [<data unit url>, ]

Type: string
affinity_datacenter_label

The data center label used for affinity topology.

Type: string

Note

Data centers and machines are organized in a logical topology tree (similar to the tree spanned by a DNS topology). The greater the distance between two resources, the smaller their affinity.

affinity_machine_label

The machine (resource) label used for affinity topology.

Type: string

Note

Data centers and machines are organized in a logical topology tree (similar to the tree spanned by a DNS topology). The greater the distance between two resources, the smaller their affinity.

ComputeUnitDescription objects are loosely typed. A dictionary containing the respective keys can be passed to the ComputeDataService instead.
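Since a plain dictionary is accepted, a small builder function can keep descriptions consistent across an application. The sketch below fills in the documented defaults for output and error; the helper `make_compute_unit_description` is hypothetical, and the default of one process is an assumption taken from the example above rather than from the library.

```python
def make_compute_unit_description(executable, arguments=(), **extra):
    """Build a ComputeUnitDescription dictionary (hypothetical helper)."""
    description = {
        "executable": executable,
        "arguments": list(arguments),
        "number_of_processes": 1,   # assumption: a single process by default
        "output": "stdout.txt",     # documented default
        "error": "stderr.txt",      # documented default
    }
    # Explicitly passed fields override the defaults above.
    description.update(extra)
    return description


cud = make_compute_unit_description("/bin/cat", ["test.txt"],
                                    environment=["MY_SCRATCH_DIR=/tmp"])
print(sorted(cud))
```

The resulting dictionary could then be handed to submit_compute_unit in place of a ComputeUnitDescription object.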

5.2.2. ComputeUnit

A ComputeUnit is a work item executed by a PilotCompute. These are what constitute the individual jobs that will run within the Pilot. Oftentimes, this will be an executable, which can have input arguments or environment variables.

A ComputeUnit is the object that is returned by the ComputeDataService when a new ComputeUnit is submitted based on a ComputeUnitDescription. The ComputeUnit object can be used by the application to keep track of ComputeUnits that are active.

A ComputeUnit has state, can be queried, and can be cancelled.

class pilot.impl.pilotcompute_manager.ComputeUnit(compute_unit_description=None, compute_data_service=None, cu_url=None)

ComputeUnit (CU).

cancel()

Terminates Compute Unit

get_details()

Returns dict with Compute Unit Details (e.g. job description, timings)

get_id()

Returns unique identifier of Compute Unit. Deprecated: Please use get_url() instead.

get_local_working_directory()

Returns the local working directory of this ComputeUnit object.

get_state()

Returns current state of Compute Unit

get_url()

Returns URL of Compute Unit. This URL can be used to reconnect to this Compute Unit later on.

wait()

Wait until the Compute Unit is in Done state (or Failed state).

5.2.3. DataUnitDescription

The data unit description defines the different files to be moved around. There is currently no support for directories.

data_unit_description = {
    "file_urls": [file1, file2, file3]
}
class DataUnitDescription
file_urls
Type: string
affinity_datacenter_label

The data center label used for affinity topology.

Type: string

Note

Data centers and machines are organized in a logical topology tree (similar to the tree spanned by a DNS topology). The greater the distance between two resources, the smaller their affinity.

affinity_machine_label

The machine (resource) label used for affinity topology.

Type: string

Note

Data centers and machines are organized in a logical topology tree (similar to the tree spanned by a DNS topology). The greater the distance between two resources, the smaller their affinity.

5.2.4. DataUnit

A DataUnit is a container for a logical group of data that is often accessed together or comprises a larger set of data; e.g. a data file or chunk.

A DataUnit is the object that is returned by the ComputeDataService when a new DataUnit is submitted based on a DataUnitDescription. The DataUnit object can be used by the application to keep track of DataUnits that are active.

A DataUnit has state, can be queried, and can be cancelled.

class pilot.impl.pilotdata_manager.DataUnit(pilot_data=None, data_unit_description=None, du_url=None)

DataUnit (DU).

State model:
  • New: DU object created
  • Pending: DU object is currently being updated
  • Running: At least 1 replica of the DU is persistent in a pilot data
add_files(file_url_list=[])

Add files referenced in list to Data Unit

add_pilot_data(pilot_data)

Adds this DU (self) to the given Pilot Data. The data will be moved into this Pilot Data.

cancel()

Cancel the Data Unit.

export(target_url)

Simple implementation of export: copies the files from the first Pilot Data to the local machine.

get_pilot_data()

Returns a list of the Pilot Data that hold a copy of this DU.

get_state()

Return current state of DataUnit

get_url()

Return URL that can be used to reconnect to Data Unit

list()

List all items contained in DU:

{
    "filename": {
        "pilot_data": [url1, url2],
        "local": url
    }
}
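To show how the nested dictionary returned by list() can be read, the following sketch collects the files that have no copy in any Pilot Data yet. The sample dictionary, its file names and URLs, and the helper `files_without_replica` are made up for this example; only the overall shape follows the structure documented above.

```python
def files_without_replica(listing):
    """Return the names of files with no pilot data copy (sketch)."""
    return sorted(
        name for name, locations in listing.items()
        if not locations.get("pilot_data")
    )


# Hypothetical listing in the documented shape.
sample_listing = {
    "test.txt": {
        "pilot_data": ["ssh://localhost/tmp/pilotdata/test.txt"],
        "local": "/home/user/test.txt",
    },
    "new.txt": {
        "pilot_data": [],
        "local": "/home/user/new.txt",
    },
}

print(files_without_replica(sample_listing))
```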

remove_files(file_urls)

Remove files from Data Unit (NOT implemented yet).

to_dict()

Internal method that returns a dict with all data contained in this DataUnit

wait()

Wait until the Data Unit is in Running state (or Failed state).

5.3. State Enumeration

Pilots, Compute Units and Data Units have state. The state can be queried using the get_state() function. States are used for PilotCompute, PilotData, ComputeUnit, DataUnit and ComputeDataService. The State class below lists the values that a state can have.

class State
State
Unknown='Unknown'
New='New'
Running='Running'
Done='Done'
Canceled='Canceled'
Failed='Failed'
Pending='Pending'