This section is meant to provide a hierarchical overview of the various library components and their interaction. The subsections then provide the API details associated with each component.
The main concepts and classes exposed by the Compute part of the API are:
The data side of the Pilot API is symmetric to the compute side. The exposed classes for managing Pilot Data are:
The application workload is represented by so-called ComputeUnits and DataUnits:
Both Compute and Data Units are specified using an abstract description object:
The ComputeDataService represents the central entry point for the application workload:
The PilotComputeService (PCS) is a factory for creating Pilot-Compute objects; each Pilot-Compute is the individual handle to a resource. The PCS takes the COORDINATION_URL (as defined above) as an argument. This URL is used to coordinate the compute units through the Redis database.
PilotComputeService (PCS).
Factory for PilotCompute objects.
Cancel the PilotComputeService.
This also cancels all the Pilot-Jobs that were under the control of this PCS.
Keyword arguments: None
Return value: Result of operation
Add a PilotCompute to the PilotComputeService.
Keyword arguments: pilot_compute_description – PilotCompute description
Return value: A PilotCompute object
List managed PilotCompute objects.
Return value: A list of PilotCompute URLs
The PilotComputeDescription (PCD) defines the compute resource on which the Pilot agent will be started. Recall that a Pilot-Job requests the resources required to run all of the Compute Units (subjobs). Any number of Pilot-Computes can be instantiated, depending on the compute resources available to the application; each resource needs its own description, so using two machines rather than one requires two PilotComputeDescriptions.
An example of a Pilot Compute Description is shown below:
pilot_compute_description = {
    "service_url": "pbs+ssh://india.futuregrid.org",
    "number_of_processes": 8,
    "processes_per_node": 8,
    "working_directory": "/N/u/<username>",
    "affinity_datacenter_label": "us-east-indiana",
    "affinity_machine_label": "india"
}
Specifies the SAGA-Python job adaptor (often this is based on the batch queuing system) and resource hostname (for instance, pbs+ssh://lonestar.tacc.utexas.edu) on which jobs can be executed.
Type: string
The number of cores that need to be allocated to run the jobs.
Type: string
Optional. The number of cores per node to be requested from the resource management system.
Type: string
Note
This argument does not limit the number of processes that can run on a node! This field is required by some XSEDE/Torque clusters. If you have to specify a ppn parameter (e.g. `-lnodes=1:ppn=8`) in your qsub script, you must set this field in your BigJob script.
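The relationship between the total core count and the cores-per-node value can be illustrated with a small sketch. It assumes that the number of nodes is the total core count divided by the cores per node, rounded up; this rule and the helper names `nodes_needed` and `torque_resource_flag` are illustrative assumptions, not part of BigJob.

```python
import math


def nodes_needed(number_of_processes, processes_per_node):
    """Return how many nodes cover the requested cores.

    Assumption: ceiling division; the real scheduler may apply other rules.
    """
    if number_of_processes < 1 or processes_per_node < 1:
        raise ValueError("both values must be positive integers")
    return math.ceil(number_of_processes / processes_per_node)


def torque_resource_flag(number_of_processes, processes_per_node):
    """Build a qsub-style resource flag such as -lnodes=1:ppn=8 (hypothetical helper)."""
    nodes = nodes_needed(number_of_processes, processes_per_node)
    return "-lnodes={}:ppn={}".format(nodes, processes_per_node)


# Values taken from the Pilot Compute Description example: 8 cores, 8 per node.
print(torque_resource_flag(8, 8))
```

With the values from the example description, the sketch yields the same flag that the note quotes for a qsub script.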
The directory in which the Pilot-Job agent executes.
Type: string
Optional. The project allocation, if running on an XSEDE resource.
Type: string
Note
This field must be removed if you are running somewhere that does not require an allocation.
Optional. The job queue to be used.
Type: string
Note
If you are not submitting to a batch queuing system, remove this parameter.
Type: string
Note
For remote hosts, password-less login must be enabled.
Optional. The number of minutes the resources are requested for. Required for some resources (e.g. on TACC machines).
Type: string
Optional. The data center label used for affinity topology.
Type: string
Note
Data centers and machines are organized in a logical topology tree (similar to the tree spanned by a DNS topology). The greater the distance between two resources, the smaller their affinity.
Optional. The machine (resource) label used for affinity topology.
Type: string
Note
Data centers and machines are organized in a logical topology tree (similar to the tree spanned by a DNS topology). The greater the distance between two resources, the smaller their affinity.
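The idea of distance in a topology tree can be sketched as follows. The sketch assumes that a data center label encodes its position in the tree as hyphen-separated components (as the example label "us-east-indiana" suggests) and that the machine label is a leaf below it. Both the encoding and the helpers `topology_path` and `tree_distance` are assumptions made for illustration; the scheduler may compute affinity differently.

```python
def topology_path(datacenter_label, machine_label):
    """Turn the two affinity labels into a path from the tree root to the machine.

    Assumption: hyphens in the data center label separate tree levels.
    """
    return datacenter_label.split("-") + [machine_label]


def tree_distance(path_a, path_b):
    """Count the edges between two leaves via their deepest common ancestor."""
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


india = topology_path("us-east-indiana", "india")
neighbour = topology_path("us-east-indiana", "sierra")  # hypothetical machine label
remote = topology_path("eu-de-south", "mymachine-1")

# A smaller distance means a higher affinity.
print(tree_distance(india, neighbour), tree_distance(india, remote))
```

Two machines in the same data center are close in the tree, while machines in different regions share no ancestor below the root and are therefore far apart.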
A Pilot-Compute (PC) is a pilot job that can execute compute workloads (ComputeUnits).
This is the object that is returned by the PilotComputeService when a new PilotCompute is created based on a PilotComputeDescription.
The PilotCompute object can be used by the application to keep track of active pilots.
A PilotCompute has state, can be queried, and can be cancelled.
Properties:
- state: The state of the pilot.
- id: The ID may be ‘None’ if the pilot is not yet in the Running state. The returned ID can be used to connect to the PC instance later on, for example from within a different application instance. Type: string (URL)
- pilot_compute_description: The PilotComputeDescription used to create this pilot.
Terminates the pilot.
Returns a dict that contains the details of the Pilot Compute: job state, description, ...
Returns the number of free slots available within the pilot.
Returns the state of the pilot.
Get the unique URL referencing the Pilot Compute. This URL can be used to reconnect to the Pilot Compute.
List managed ComputeUnits.
Return value: A list of ComputeUnit IDs
The returned list can include units which have been submitted to this pilot.
Submit a CU to this pilot.
Return value: ComputeUnit object
The CUD is (possibly translated and) passed on to the PDS scheduler, which will attempt to instantiate the described workload process on the managed set of Pilot Computes.
On success, the returned CU is in Pending state (or moved into any state downstream from Pending).
The call will honor all attributes set on the CUD. Attributes which are not explicitly set are interpreted as having default values (see documentation of CUD), or, where default values are not specified, are ignored.
Wait until the Pilot Compute enters a final state (Done, Canceled or Failed).
It is not an error to call wait() in a final state – the call simply returns immediately.
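The waiting behaviour described above can be sketched as a polling loop around get_state(). This is a minimal illustration, not the library's implementation: the state strings, the helper `wait_for_final_state`, and the stand-in class `FakePilot` are assumptions introduced here.

```python
import time

# Assumption: states are reported as these strings; the real API may use constants.
FINAL_STATES = {"Done", "Canceled", "Failed"}


def wait_for_final_state(pilot, poll_interval=1.0, timeout=None):
    """Poll pilot.get_state() until a final state is reached and return it."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        state = pilot.get_state()
        if state in FINAL_STATES:
            # Already final: return immediately, mirroring the documented behaviour.
            return state
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("pilot did not reach a final state in time")
        time.sleep(poll_interval)


class FakePilot:
    """Hypothetical stand-in that replays a fixed sequence of states."""

    def __init__(self, states):
        self._states = list(states)

    def get_state(self):
        # Advance through the sequence, then keep reporting the last state.
        if len(self._states) > 1:
            return self._states.pop(0)
        return self._states[0]


print(wait_for_final_state(FakePilot(["Running", "Running", "Done"]), poll_interval=0))
```

Calling the helper on an object that is already in a final state returns on the first poll, without sleeping.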
The PilotDataService (PDS) is a factory for creating Pilot-Data objects. The PDS takes the COORDINATION_URL as an argument. This URL is used to coordinate the data units through the Redis database.
PilotDataService (PDS).
Factory for creating Pilot Data.
Cancel the PilotDataService. Release all Pilot Data created by this service.
Keyword arguments: None
Return value: Result of operation
Create a PilotData
Keyword arguments: pilot_data_description – PilotData Description:
{
'service_url': "ssh://<hostname>/base-url/",
'size': "1000"
}
Return value: A PilotData object
Reconnect to an existing pilot.
Returns URL of Pilot Data Service
List all PDs of PDS
Return a Python dictionary containing the representation of the PDS (internal method not part of Pilot API)
Wait until all managed PD (of this Pilot Data Service) enter a final state
PilotDataDescription objects are used to describe the requirements for a PilotData instance. Currently, the only generic property that can be set is size; all other properties are backend-specific security / authentication hints. Example:
pilot_data_service = PilotDataService(COORDINATION_URL)
pilot_data_description = {
    'service_url': "ssh://localhost/tmp/pilotdata/",
}
# create_pilot() is called on the service instance created above
pilot_data = pilot_data_service.create_pilot(pilot_data_description)
Specifies the file adaptor and resource hostname on which a Pilot-Data will be created. Supported schemes:
- SSH: ssh://localhost/tmp/pilotdata/ (Password-less login and password-less private key required)
- iRODS: irods://gw68/${OSG_DATA}/osg/irods/<username>/?vo=osg&resource-group=osgGridFtpGroup
- Globus Online: go://<user>:<password>@globusonline.org?ep=xsede#lonestar4&path=/work/01131/tg804093/pilot-data-go
- Google Storage: gs://google.com
- Amazon S3: s3://aws.amazon.com
- Eucalyptus Walrus: walrus://<endpoint-ip>
Type: string
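Since the backend is selected by the URL scheme, the mapping from a service URL to one of the backends listed above can be sketched with the standard library. The table `BACKENDS` and the helper `backend_for` are hypothetical names used only for this illustration; they do not reflect how BigJob selects its adaptors internally.

```python
from urllib.parse import urlparse

# Scheme-to-backend table transcribed from the list of supported schemes.
BACKENDS = {
    "ssh": "SSH",
    "irods": "iRODS",
    "go": "Globus Online",
    "gs": "Google Storage",
    "s3": "Amazon S3",
    "walrus": "Eucalyptus Walrus",
}


def backend_for(service_url):
    """Return the backend name for a service URL, or raise ValueError."""
    scheme = urlparse(service_url).scheme
    if scheme not in BACKENDS:
        raise ValueError("unsupported service_url scheme: %r" % scheme)
    return BACKENDS[scheme]


print(backend_for("ssh://localhost/tmp/pilotdata/"))
```

Rejecting unknown schemes early gives a clearer error than a failure at pilot start-up.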
Optional. The storage space required (in megabytes) on the storage resource.
Type: int
Note
The ‘size’ attribute is not supported by all PilotData backends.
For the SSH backend. The SSH private key. Attention: This key is stored in the Redis service in order to make it available to the Pilot agent. Use it with caution, and not with your production keys. Do not use it with a shared Redis server! The SSH key delegation mechanism is designed for resources where the worker nodes are not directly accessible, so the private key cannot be installed manually (e.g. OSG).
Type: string
Note
‘userkey’ is only supported by backends where worker nodes need private key access. An example of this is OSG.
For the S3/Walrus backend. The ‘username’ for Amazon AWS-compliant instances. It is an alphanumeric text string that uniquely identifies a user who owns an account. No two accounts can have the same access key.
Type: string
Note
‘access_key_id’ is only supported by AWS-compliant, EC2-based connections. This applies to Amazon AWS, Eucalyptus, and OpenStack. Please see Amazon’s documentation to learn how to obtain your access key id and password.
For the S3/Walrus backend. The ‘password’ for Amazon AWS-compliant instances. It is called secret because it is assumed to be known to the owner only.
Type: string
Note
‘secret_access_key’ is only supported by AWS-compliant, EC2-based connections. This applies to Amazon AWS, Eucalyptus, and OpenStack. Please see Amazon’s documentation to learn how to obtain your access key id and password.
A Pilot-Data (PD) can store data (DataUnits).
This is the object that is returned by the PilotDataService when a new PilotData is created based on a PilotDataDescription. A PilotData represents a finite amount of physical space on a certain resource. It can be populated with DataUnits.
The PilotData object can be used by the application to keep track of active pilots. A PilotData has state, can be queried, and can be cancelled.
Cancel PilotData
Copy DataUnit to another Pilot Data
Create a new Data Unit within Pilot
Restore Pilot Data from dictionary
Export Data Unit to a local directory
Returns Data Unit if part of Pilot Data
Return current state of Pilot Data
Get URL of PilotData. Used for reconnecting to PilotData
List all data units of Pilot Data
Copy Data Unit to Pilot Data
Remove Data Unit from Pilot Data
Creates a Data Unit object and initially imports the data specified in data_unit_description.
Internal method that returns a dict with all data contained in this Pilot Data
Get full URL to DataUnit within PilotData
Wait until PD enters a final state (Done, Canceled or Failed).
The Compute Data Service (CDS) handles both Pilot Compute and Pilot Data entities in a holistic way. It represents the central entry point for the application workload. The CDS takes care of the placement of Compute and Data Units. The set of Pilot Computes and Pilot Data available to the CDS can be changed during the application’s runtime. The CDS takes data-compute affinities into account and handles compute/data co-location for the requested workload.
ComputeDataService (CDS).
The ComputeDataService is the application’s interface for submitting ComputeUnits and PilotData/DataUnits to the Pilot-Manager in the P* Model.
Add a PilotComputeService to this CDS.
Keyword arguments: pcs – The PilotComputeService to which this ComputeDataService will connect.
Add a PilotDataService
Keyword arguments: pds – The PilotDataService to add.
Cancel the CDS. All associated PD and PC objects are canceled.
Return value: ID of the ComputeDataService
Return value: State of the ComputeDataService
List all DUs of CDS
List all pilot compute of CDS
List all pilot data of CDS
Remove a PilotComputeService from this CDS.
Note that this does not cancel the PilotComputeService; it is simply no longer connected to this CDS.
Keyword arguments: pcs – The PilotComputeService to remove from this ComputeDataService.
Remove a PilotDataService.
Keyword arguments: pds – The PilotDataService to remove.
Submit a CU to this Compute Data Service.
Keyword arguments: compute_unit_description – The ComputeUnitDescription from the application
Return value: ComputeUnit object
Creates a Data Unit object and binds it to a physical resource (a PilotData).
Waits for CUs and DUs. Returns after all DUs have been placed (i.e. are in state Running) and all CUs have been completed (i.e. are in state Done), or if a fault has occurred or the user has cancelled a CU or DU.
The ComputeUnitDescription defines the actual compute unit that will be run. The executable specified here constitutes the individual job that will run within the Pilot. This executable can have input arguments or environment variables that must be passed with it in order for it to execute properly.
Example:
compute_unit_description = {
    "executable": "/bin/cat",
    "arguments": ["test.txt"],
    "number_of_processes": 1,
    "output": "stdout.txt",
    "error": "stderr.txt",
    "environment": ["MY_SCRATCH_DIR=/tmp"],
    # Stages the content of the data unit to the working directory of the compute unit.
    "input_data": [data_unit.get_url()],
    "affinity_datacenter_label": "eu-de-south",
    "affinity_machine_label": "mymachine-1"
}
Specifies the path to the executable that will be run
type: string
Specifies any arguments that the executable needs. For instance, if running an executable from the command line requires a -p flag, then this -p flag can be added in this section.
type: string
Specifies any environment variables that need to be passed with the compute unit in order for the executable to work, e.g. ["MY_SCRATCH_DIR=/tmp"].
type: string
The working directory for the executable
type: string
Note
Recommendation: Do not set the working directory! If none is set, the working directory is a sandbox directory of the CU (automatically created by BigJob).
Specifies the capture of <stdin>
type: string
Specifies the name of the file that captures the output from <stdout>. Default is stdout.txt
type: string
Specifies the name of the file that captures the output from <stderr>. Default is stderr.txt
type: string
Defines how many CPU cores are reserved for the application process.
For instance, if you need 4 cores for 1 MPI executable, this value would be 4.
type: string
Defines how the application process is launched. Valid strings for this field are ‘single’ or ‘mpi’. If your executable is a.out, “single” executes it as ./a.out, while “mpi” executes mpirun -np <number_of_processes> ./a.out (note: aprun is used for Kraken, and srun/ibrun is used for Stampede).
type: string
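The two launch modes described above can be sketched as a function that builds the command line. The function `launch_command` and its parameter name `launch_mode` are placeholders chosen for this illustration, and the sketch covers only the plain mpirun case, not the aprun, srun or ibrun variants used on specific machines.

```python
def launch_command(executable, number_of_processes=1, launch_mode="single"):
    """Return the argument list used to start the executable.

    'single' runs the executable directly; 'mpi' wraps it in mpirun -np <n>.
    """
    if launch_mode == "single":
        return [executable]
    if launch_mode == "mpi":
        return ["mpirun", "-np", str(number_of_processes), executable]
    raise ValueError("launch mode must be 'single' or 'mpi'")


print(" ".join(launch_command("./a.out")))
print(" ".join(launch_command("./a.out", 4, "mpi")))
```

Returning a list rather than a string keeps the arguments separate, which avoids quoting problems if the command is later passed to a process launcher.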
Specifies the input data flow for a ComputeUnit. This is used in conjunction with PilotData. The format is [<data unit url>, … ]
type: string
Specifies the output data flow for a ComputeUnit. This is used in conjunction with PilotData. The format is [<data unit url>, … ]
type: string
The data center label used for affinity topology.
Type: string
Note
Data centers and machines are organized in a logical topology tree (similar to the tree spanned by a DNS topology). The greater the distance between two resources, the smaller their affinity.
The machine (resource) label used for affinity topology.
Type: string
Note
Data centers and machines are organized in a logical topology tree (similar to the tree spanned by a DNS topology). The greater the distance between two resources, the smaller their affinity.
ComputeUnitDescription objects are loosely typed. A dictionary containing the respective keys can be passed instead to the ComputeDataService.
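Because a plain dictionary is accepted, an application can prepare descriptions with ordinary dictionary operations. The following sketch fills in the two defaults documented above (stdout.txt and stderr.txt) without modifying the caller's dictionary. The helper `with_defaults` is hypothetical; the library applies its own defaults, so this only illustrates the resulting values.

```python
# Defaults taken from the field descriptions for the output and error files.
DEFAULTS = {"output": "stdout.txt", "error": "stderr.txt"}


def with_defaults(description):
    """Return a copy of the description with missing defaults filled in."""
    merged = dict(DEFAULTS)
    merged.update(description)  # explicitly set keys win over defaults
    return merged


minimal = {"executable": "/bin/cat", "arguments": ["test.txt"]}
print(with_defaults(minimal))
```

Keys that the application sets explicitly are left untouched; only missing keys receive a default.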
A ComputeUnit (CU) is a work item executed by a PilotCompute. ComputeUnits constitute the individual jobs that run within the Pilot. Often a ComputeUnit is an executable, which can have input arguments or environment variables.
A ComputeUnit is the object that is returned by the ComputeDataService when a new ComputeUnit is submitted based on a ComputeUnitDescription. The ComputeUnit object can be used by the application to keep track of ComputeUnits that are active.
A ComputeUnit has state, can be queried, and can be cancelled.
Terminates Compute Unit
Returns dict with Compute Unit Details (e.g. job description, timings)
Returns unique identifier of Compute Unit. Deprecated: Please use get_url() instead.
Returns the local working directory of this PilotCompute object.
Returns current state of Compute Unit
Returns URL of Compute Unit. This URL can be used to reconnect to this Compute Unit later on.
Wait until the Compute Unit is in Done state (or Failed state).
The data unit description defines the different files to be moved around. There is currently no support for directories.
data_unit_description = {
'file_urls': [file1, file2, file3]
}
Type: string
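Because directories are not supported, an application may want to reject them before building a description. The sketch below assumes that the entries are local file system paths (remote URLs cannot be checked this way); the helper `build_data_unit_description` is hypothetical and not part of the Pilot API.

```python
import os


def build_data_unit_description(paths):
    """Build a description dictionary from local paths, rejecting directories."""
    for path in paths:
        if os.path.isdir(path):
            raise ValueError("directories are not supported: %s" % path)
    return {"file_urls": list(paths)}


# Hypothetical file names; they do not need to exist for the check to pass.
print(build_data_unit_description(["file1.txt", "file2.txt"]))
```

A directory entry raises an error immediately instead of failing later during staging.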
The data center label used for affinity topology.
Type: string
Note
Data centers and machines are organized in a logical topology tree (similar to the tree spanned by a DNS topology). The greater the distance between two resources, the smaller their affinity.
The machine (resource) label used for affinity topology.
Type: string
Note
Data centers and machines are organized in a logical topology tree (similar to the tree spanned by a DNS topology). The greater the distance between two resources, the smaller their affinity.
A DataUnit (DU) is a container for a logical group of data that is often accessed together or comprises a larger set of data, e.g. a data file or chunk.
A DataUnit is the object that is returned by the ComputeDataService when a new DataUnit is submitted based on a DataUnitDescription. The DataUnit object can be used by the application to keep track of DataUnits that are active.
A DataUnit has state, can be queried, and can be cancelled.
Add files referenced in list to Data Unit
Add this DU (self) to a certain Pilot Data. The data will be moved into this Pilot Data.
Cancel the Data Unit.
Simple implementation of export: copies the files from the first Pilot Data to the local machine.
Get a list of the Pilot Data that have a copy of this DU.
Return current state of DataUnit
Return URL that can be used to reconnect to Data Unit
List all items contained in DU:
{
    "filename": {
        "pilot_data": [url1, url2],
        "local": url
    }
}
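A listing of this shape can be processed with ordinary dictionary operations. The sketch below inverts it to show which files each Pilot Data holds. The helper `files_by_pilot_data` and the URLs in the sample listing are hypothetical; only the nested structure is taken from the description above.

```python
def files_by_pilot_data(listing):
    """Map each Pilot Data URL to the sorted list of file names stored there."""
    result = {}
    for filename, locations in listing.items():
        for pd_url in locations.get("pilot_data", []):
            result.setdefault(pd_url, []).append(filename)
    for names in result.values():
        names.sort()
    return result


# Hypothetical listing with two files and two Pilot Data locations.
sample_listing = {
    "a.txt": {"pilot_data": ["ssh://host1/pd", "ssh://host2/pd"], "local": "/tmp/a.txt"},
    "b.txt": {"pilot_data": ["ssh://host1/pd"], "local": "/tmp/b.txt"},
}
print(files_by_pilot_data(sample_listing))
```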
Remove files from Data Unit (NOT implemented yet).
Internal method that returns a dict with all data contained in this DataUnit
Wait until the Data Unit is in Running state (or Failed state).
Pilots and Compute Units can have state. These states can be queried using the get_state() function. States are used for PilotCompute, PilotData, ComputeUnit, DataUnit and ComputeDataService. The following table describes the values that states can have.
State |
|
|
|
|
|
|
|