API

The implementation of gwf consists of a few main abstractions. Units of work are defined by creating Target instances, which also define the files used and produced by the target. A Workflow ties targets together and allows for easy creation of targets.

When all targets have been defined on a workflow, the workflow is turned into a Graph which will compute the entire dependency graph of the workflow, checking the workflow for inconsistencies and circular dependencies.

A target in a Graph can be scheduled on a Backend using the Scheduler.

Core

gwf.core.graph_from_config(config)[source]

Return graph for the workflow specified by config.

See graph_from_path() for further information.

gwf.core.graph_from_path(path)[source]

Return graph for the workflow given by path.

Returns a Graph object containing the workflow graph of the workflow given by path. Note that calling this function computes the complete dependency graph which may take some time for large workflows.

Parameters:path (str) – Path to a workflow file, optionally specifying a workflow object in that file.
class gwf.core.AnonymousTarget(inputs, outputs, options, working_dir=None, spec='', protect=None)[source]

Represents an unnamed target.

An anonymous target is an unnamed, abstract target much like the tuple returned by function templates. Thus, AnonymousTarget can also be used as the return value of a template function.

Variables:
  • inputs (list) – A list of input paths for this target.
  • outputs (list) – A list of output paths for this target.
  • options (dict) – Options such as number of cores, memory requirements etc. Options are backend-dependent. Backends will ignore unsupported options.
  • working_dir (str) – Working directory of this target.
  • spec (str) – The specification of the target.
  • protect (set) – An iterable of protected files which will not be removed during cleaning, even if this target is not an endpoint.
is_sink

Return whether this target is a sink.

A target is a sink if it does not output any files.

is_source

Return whether this target is a source.

A target is a source if it does not depend on any files.
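
The two properties above can be illustrated with a minimal sketch. SketchTarget is a hypothetical stand-in written for this example; it is not the gwf class and only mirrors the inputs and outputs attributes described above.

```python
from dataclasses import dataclass, field


@dataclass
class SketchTarget:
    """Hypothetical stand-in for a target; not the gwf class itself."""

    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

    @property
    def is_source(self):
        # A source does not depend on any files.
        return not self.inputs

    @property
    def is_sink(self):
        # A sink does not output any files.
        return not self.outputs


download = SketchTarget(inputs=[], outputs=['raw.txt'])
report = SketchTarget(inputs=['raw.txt'], outputs=[])
```

Here download is a source but not a sink, while report is a sink but not a source.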

class gwf.core.Target(name=None, **kwargs)[source]

Represents a target.

This class inherits from AnonymousTarget.

A target is a named unit of work that declares its file inputs and outputs. Target names must be valid Python identifiers.

A script (or spec) is associated with the target. The script must be a valid Bash script and should produce the files declared as outputs and consume the files declared as inputs. Both parameters must be provided explicitly, even if no inputs or outputs are needed. In that case, provide the empty list:

Target('Foo', inputs=[], outputs=[], options={}, working_dir='/tmp')

The target can also specify an options dictionary specifying the resources needed to run the target. The options are consumed by the backend and may be ignored if the backend doesn’t support a given option. For example, we can set the cores option to set the number of cores that the target uses:

Target('Foo', inputs=[], outputs=[], options={'cores': 16}, working_dir='/tmp')

To see which options are supported by your backend of choice, see the documentation for the backend.

Variables:name (str) – Name of the target.
classmethod empty(name)[source]

Return a target with no inputs, outputs and options.

This is mostly useful for testing.

class gwf.core.Workflow(name=None, working_dir=None, defaults=None)[source]

Represents a workflow.

This is the most central user-facing abstraction in gwf.

A workflow consists of a collection of targets and has methods for adding targets to the workflow in two different ways. A workflow can be initialized with the following arguments:

Variables:
  • name (str) – The name is used for namespacing when including workflows. Defaults to None. See include() for more details on namespacing.
  • working_dir (str) – The directory containing the file where the workflow was initialized. All file paths used in targets added to this workflow are relative to the working directory.
  • defaults (dict) – A dictionary with defaults for target options.

By default, working_dir is set to the directory of the workflow file which initialized the workflow. However, advanced users may wish to set it manually. Targets added to the workflow will inherit the workflow working directory.

The defaults argument is a dictionary of option defaults for targets and overrides defaults provided by the backend. Targets can override the defaults individually. For example:

gwf = Workflow(defaults={
    'cores': 12,
    'memory': '16g',
})

gwf.target('Foo', inputs=[], outputs=[]) << """echo hello"""
gwf.target('Bar', inputs=[], outputs=[], cores=2) << """echo world"""

In this case Foo and Bar inherit the cores and memory options set in defaults, but Bar overrides the cores option.
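
The precedence described above (backend defaults, then workflow defaults, then per-target options) can be sketched with plain dictionaries. The function resolve_options and the backend default values are assumptions made for this illustration and are not part of gwf.

```python
def resolve_options(backend_defaults, workflow_defaults, target_options):
    """Merge option dictionaries; later sources take precedence."""
    resolved = dict(backend_defaults)
    resolved.update(workflow_defaults)
    resolved.update(target_options)
    return resolved


# Hypothetical backend defaults, chosen only for this example.
backend_defaults = {'cores': 1, 'memory': '1g'}
workflow_defaults = {'cores': 12, 'memory': '16g'}

foo_options = resolve_options(backend_defaults, workflow_defaults, {})
bar_options = resolve_options(backend_defaults, workflow_defaults, {'cores': 2})
```

In this sketch foo_options keeps both workflow defaults, while bar_options keeps the memory default and overrides cores.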

See include() for a description of the use of the name argument.

glob(pathname, *args, **kwargs)[source]

Return a list of paths matching pathname.

This method is equivalent to glob.glob(), but searches with relative paths will be performed relative to the working directory of the workflow.

iglob(pathname, *args, **kwargs)[source]

Return an iterator which yields paths matching pathname.

This method is equivalent to glob.iglob(), but searches with relative paths will be performed relative to the working directory of the workflow.
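
As a rough sketch of what searching relative to the working directory means for glob() and iglob(), the pattern can be joined onto the working directory before globbing. The helper glob_in is hypothetical and uses only the standard library; how gwf resolves the pattern internally is not shown here.

```python
import glob
import os
import tempfile


def glob_in(working_dir, pathname):
    """Return paths matching pathname, resolved against working_dir."""
    # os.path.join returns an absolute pathname unchanged, so only
    # relative patterns end up anchored at working_dir.
    return glob.glob(os.path.join(working_dir, pathname))


with tempfile.TemporaryDirectory() as working_dir:
    for filename in ('a.txt', 'b.txt', 'c.csv'):
        open(os.path.join(working_dir, filename), 'w').close()
    matches = sorted(os.path.basename(p) for p in glob_in(working_dir, '*.txt'))
```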

include(other_workflow, namespace=None)[source]

Include targets from another gwf.Workflow into this workflow.

This method can be given either a gwf.Workflow instance, a module or a path to a workflow file.

If a module or path is given, the workflow object to include will be determined according to the following rules:

  1. If a module object is given, the module must define an attribute named gwf containing a gwf.Workflow object.

  2. If a path is given it must point to a file defining a module with an attribute named gwf containing a gwf.Workflow object. If you want to include a workflow with another name you can specify the attribute name with a colon, e.g.:

    /some/path/workflow.py:myworkflow
    

    This will include all targets from the workflow myworkflow declared in the file /some/path/workflow.py.
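
The colon syntax in rule 2 could be parsed roughly as follows. This is a sketch only: parse_workflow_spec is a hypothetical helper, and it does not handle paths that themselves contain colons (for example Windows drive letters).

```python
def parse_workflow_spec(spec, default_attribute='gwf'):
    """Split 'path[:attribute]' into a (path, attribute) pair."""
    path, separator, attribute = spec.rpartition(':')
    if not separator:
        # No colon given: the whole spec is the path and the
        # default attribute name applies.
        return spec, default_attribute
    return path, attribute


explicit = parse_workflow_spec('/some/path/workflow.py:myworkflow')
implicit = parse_workflow_spec('/some/path/workflow.py')
```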

When a gwf.Workflow instance has been obtained, all targets will be included directly into this workflow. To avoid name clashes the namespace argument must be provided. For example:

workflow1 = Workflow()
workflow1.target('TestTarget', inputs=[], outputs=[])

workflow2 = Workflow()
workflow2.target('TestTarget', inputs=[], outputs=[])

workflow1.include(workflow2, namespace='wf1')

The workflow now contains two targets named TestTarget (defined in workflow1) and wf1.TestTarget (included from workflow2). The namespace parameter can be left out if the workflow to be included has been named:

workflow1 = Workflow()
workflow1.target('TestTarget', inputs=[], outputs=[])

workflow2 = Workflow(name='wf1')
workflow2.target('TestTarget', inputs=[], outputs=[])

workflow1.include(workflow2)

This yields the same result as before. The namespace argument can be used to override the specified name:

workflow1 = Workflow()
workflow1.target('TestTarget', inputs=[], outputs=[])

workflow2 = Workflow(name='wf1')
workflow2.target('TestTarget', inputs=[], outputs=[])

workflow1.include(workflow2, namespace='foo')

The workflow will now contain targets named TestTarget and foo.TestTarget.
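
The effect of namespacing can be sketched with plain dictionaries that map target names to targets. The helper include_targets is hypothetical, and the sketch assumes that the included targets are the ones that receive the namespace prefix.

```python
def include_targets(own_targets, other_targets, namespace):
    """Return a new name-to-target mapping with other_targets added under namespace."""
    merged = dict(own_targets)
    for name, target in other_targets.items():
        qualified = '{}.{}'.format(namespace, name)
        if qualified in merged:
            raise ValueError('target {!r} already exists'.format(qualified))
        merged[qualified] = target
    return merged


own = {'TestTarget': 'spec of the including workflow'}
other = {'TestTarget': 'spec of the included workflow'}
merged = include_targets(own, other, namespace='foo')
```

Both targets survive the merge because the included one is stored under foo.TestTarget.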

include_path(path, namespace=None)[source]

Include targets from another gwf.Workflow into this workflow.

See include().

include_workflow(other_workflow, namespace=None)[source]

Include targets from another gwf.Workflow into this workflow.

See include().

shell(*args, **kwargs)[source]

Return the output of a shell command.

This method is equivalent to subprocess.check_output(), but automatically runs the command in a shell with the current working directory set to the working directory of the workflow.

Changed in version 1.0: This function no longer returns a list of lines in the output, but a byte array with the output, exactly like subprocess.check_output(). You may specifically set universal_newlines to True to get a string with the output instead.
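
The behaviour described above can be approximated with the standard library alone. The helper shell_in is a hypothetical stand-in for the method; it shows the difference between the default byte output and the string output obtained with universal_newlines, and how the pre-1.0 list of lines can be recovered.

```python
import subprocess
import tempfile


def shell_in(working_dir, command, **kwargs):
    """Run command in a shell with working_dir as the current directory."""
    return subprocess.check_output(command, shell=True, cwd=working_dir, **kwargs)


with tempfile.TemporaryDirectory() as working_dir:
    raw = shell_in(working_dir, 'echo hello')
    text = shell_in(working_dir, 'echo hello', universal_newlines=True)
    # Splitting the string gives a list of lines, as before version 1.0.
    lines = text.splitlines()
```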

target(name, inputs, outputs, **options)[source]

Create a target and add it to the gwf.Workflow.

This is syntactic sugar for creating a new Target and adding it to the workflow. The target is also returned from the method so that the user can directly manipulate it, if necessary. For example, this allows assigning a spec to a target directly after defining it:

workflow = Workflow()
workflow.target('NewTarget', inputs=['test.txt'], outputs=['out.txt']) << '''
cat test.txt > out.txt
echo hello world >> out.txt
'''

This will create a new target named NewTarget, add it to the workflow and assign a spec to the target.

Parameters:
  • name (str) – Name of the target.
  • inputs (iterable) – List of files that this target depends on.
  • outputs (iterable) – List of files that this target produces.

Any further keyword arguments are passed to the backend.

target_from_template(name, template, **options)[source]

Create a target from a template and add it to the gwf.Workflow.

This is syntactic sugar for creating a new Target and adding it to the workflow. The target is also returned from the method so that the user can directly manipulate it, if necessary.

workflow = Workflow()
workflow.target_from_template('NewTarget', my_template())

This will create a new target named NewTarget, configure it based on the specification in the template my_template, and add it to the workflow.

Parameters:
  • name (str) – Name of the target.
  • template (tuple) – Target specification of the form (inputs, outputs, options, spec).

Any further keyword arguments are passed to the backend and will override any options provided by the template.
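
A template in the tuple form described above is just a function returning (inputs, outputs, options, spec). The sketch below uses a hypothetical compress template and a hypothetical apply_overrides helper to illustrate how keyword arguments could override the options a template provides; it does not call gwf.

```python
def compress(path, cores=1):
    """Hypothetical template returning (inputs, outputs, options, spec)."""
    inputs = [path]
    outputs = [path + '.gz']
    options = {'cores': cores}
    spec = 'gzip -c {0} > {0}.gz'.format(path)
    return inputs, outputs, options, spec


def apply_overrides(template, **overrides):
    """Return the template tuple with overrides merged into its options."""
    inputs, outputs, options, spec = template
    merged = dict(options)
    merged.update(overrides)
    return inputs, outputs, merged, spec


inputs, outputs, options, spec = apply_overrides(
    compress('data.txt'), cores=4, memory='8g'
)
```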

class gwf.core.Graph(targets, provides, dependencies, dependents, unresolved)[source]

Represents a dependency graph for a set of targets.

The graph represents the targets present in a workflow, but also their dependencies and the files they provide.

During construction of the graph the dependencies between targets are determined by looking at target inputs and outputs. If a target specifies a file as input, the file must either be provided by another target or already exist on disk. In case that the file is provided by another target, a dependency to that target will be added:

Variables:dependencies (dict) – A dictionary mapping a target to a set of its dependencies.

If the file is not provided by another target, the file is unresolved:

Variables:unresolved (set) – A set containing file paths of all unresolved files.

If the graph is constructed successfully, the following instance variables will be available:

Variables:
  • targets (dict) – A dictionary mapping target names to instances of gwf.Target.
  • provides (dict) – A dictionary mapping a file path to the target that provides that path.
  • dependents (dict) – A dictionary mapping a target to a set of all targets which depend on the target.

The graph can be manipulated in arbitrary, diabolic ways after it has been constructed. Checks are only performed at construction-time, thus introducing e.g. a circular dependency by manipulating dependencies will not raise an exception.

Raises:gwf.exceptions.WorkflowError – Raised if the workflow contains a circular dependency.
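
The construction described above can be sketched with plain dictionaries. In this simplified illustration targets are keyed by name and described by dicts with inputs and outputs lists, whereas the real Graph maps Target objects; build_graph is a hypothetical helper and cycle checking is left out.

```python
from collections import defaultdict


def build_graph(targets):
    """Compute provides, dependencies, dependents and unresolved files."""
    provides = {}
    for name, target in targets.items():
        for path in target['outputs']:
            if path in provides:
                raise ValueError('{} is provided by multiple targets'.format(path))
            provides[path] = name

    dependencies = defaultdict(set)
    dependents = defaultdict(set)
    unresolved = set()
    for name, target in targets.items():
        for path in target['inputs']:
            if path in provides:
                # Another target produces this file: record the edge both ways.
                dependencies[name].add(provides[path])
                dependents[provides[path]].add(name)
            else:
                # No target produces this file; it must already exist on disk.
                unresolved.add(path)
    return provides, dict(dependencies), dict(dependents), unresolved


targets = {
    'A': {'inputs': ['raw.txt'], 'outputs': ['a.txt']},
    'B': {'inputs': ['a.txt'], 'outputs': ['b.txt']},
}
provides, dependencies, dependents, unresolved = build_graph(targets)
```
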
dfs(root)[source]

Return the depth-first traversal path through a graph from root.

endpoints()[source]

Return a set of all targets that are not depended on by other targets.
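
Both dfs() and endpoints() can be sketched over name-based dependency mappings. The functions below are illustrations, not the gwf methods; in particular, visiting dependencies before the target itself (post-order) is an assumption of this sketch.

```python
def dfs(dependencies, root):
    """Return targets reachable from root, with dependencies listed first."""
    visited = set()
    path = []

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        # Sorting only makes the order deterministic for this example.
        for dependency in sorted(dependencies.get(node, ())):
            visit(dependency)
        path.append(node)

    visit(root)
    return path


def endpoints(targets, dependents):
    """Return the targets that no other target depends on."""
    return {name for name in targets if not dependents.get(name)}


dependencies = {'B': {'A'}, 'C': {'A', 'B'}}
dependents = {'A': {'B', 'C'}, 'B': {'C'}}
order = dfs(dependencies, 'C')
final_targets = endpoints(['A', 'B', 'C'], dependents)
```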

classmethod from_targets(targets)[source]

Construct a dependency graph from a set of targets.

When a graph is initialized it computes all dependency relations between targets, ensuring that the graph is semantically sane. Therefore, construction of the graph is an expensive operation which may raise a number of exceptions:

Raises:gwf.exceptions.FileProvidedByMultipleTargetsError – Raised if the same file is provided by multiple targets.

Since this method initializes the graph, it may also raise:

Raises:gwf.exceptions.WorkflowError – Raised if the workflow contains a circular dependency.
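
One common way to detect the circular dependencies mentioned above is a depth-first search that marks targets as "in progress" while their dependencies are being explored. The sketch below is a generic illustration of that technique over a name-based mapping; has_cycle is a hypothetical helper and may differ from what gwf does internally.

```python
def has_cycle(dependencies):
    """Return True if the dependency mapping contains a circular dependency."""
    unvisited, in_progress, done = 0, 1, 2
    state = {}

    def visit(node):
        state[node] = in_progress
        for dependency in dependencies.get(node, ()):
            dependency_state = state.get(dependency, unvisited)
            if dependency_state == in_progress:
                # We reached a target that is still being explored.
                return True
            if dependency_state == unvisited and visit(dependency):
                return True
        state[node] = done
        return False

    return any(
        state.get(node, unvisited) == unvisited and visit(node)
        for node in list(dependencies)
    )


acyclic = has_cycle({'B': {'A'}, 'C': {'A', 'B'}})
cyclic = has_cycle({'A': {'B'}, 'B': {'A'}})
```
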
class gwf.core.Scheduler(graph, backend, dry_run=False, file_cache={})[source]

Schedule one or more targets and submit to a backend.

Scheduling a target will determine whether the target needs to run based on whether it already has been submitted and whether any of its dependencies have been submitted.

Targets that should run will be submitted to backend, unless dry_run is set to True.

When scheduling a target, the scheduler checks whether any of its inputs are unresolved, meaning that during construction of the graph, no other target providing the file was found. In that case the file must already exist on disk. If it doesn’t, the following exception is raised:

Raises:gwf.exceptions.FileRequiredButNotProvidedError – Raised if a target has an input file that does not exist on the file system and that is not provided by another target.
schedule(target)[source]

Schedule a target and its dependencies.

Returns True if target was submitted to the backend (even when dry_run is True).

Parameters:target (gwf.Target) – Target to be scheduled.
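
The scheduling behaviour described above can be sketched as a recursive walk over dependencies. SketchScheduler is a hypothetical, simplified stand-in: it works on target names, takes should_run as a callable, and records submissions in a list instead of talking to a backend.

```python
class SketchScheduler:
    """Simplified illustration of scheduling; not the gwf Scheduler."""

    def __init__(self, dependencies, should_run, dry_run=False):
        self.dependencies = dependencies
        self.should_run = should_run  # callable: target name -> bool
        self.dry_run = dry_run
        self.submitted = []  # stands in for a backend's queue
        self._known = set()  # targets already handled in this session

    def schedule(self, target):
        if target in self._known:
            return True
        submitted_deps = [
            dependency
            for dependency in sorted(self.dependencies.get(target, ()))
            if self.schedule(dependency)
        ]
        if not submitted_deps and not self.should_run(target):
            return False
        if not self.dry_run:
            self.submitted.append((target, submitted_deps))
        self._known.add(target)
        return True


dependencies = {'B': {'A'}, 'C': {'B'}}
scheduler = SketchScheduler(dependencies, should_run=lambda name: name == 'A')
was_submitted = scheduler.schedule('C')

dry = SketchScheduler(dependencies, should_run=lambda name: name == 'A', dry_run=True)
dry_result = dry.schedule('C')
```

In the sketch a target is submitted when it should run itself or when any of its dependencies was submitted, and a dry run returns the same result without recording any submissions.
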
schedule_many(targets)[source]

Schedule multiple targets and their dependencies.

This is a convenience method for scheduling multiple targets. See schedule() for a detailed description of the arguments and behavior.

Parameters:targets (list) – A list of targets to be scheduled.
should_run(target)[source]

Return whether a target should be run or not.

status(target: gwf.core.Target) → gwf.core.TargetStatus[source]

Return the status of a target.

Returns the status of a target, taking into account whether the target should run or not.

Parameters:target (Target) – The target to return status for.

Backends

gwf.backends.list_backends()[source]

Return the names of all registered backends.

gwf.backends.backend_from_name(name)[source]

Return backend class for the backend given by name.

Returns the backend class registered with name. Note that the class is returned, not an instance, since not all uses require initialization of the backend (e.g. accessing the backend’s log manager), and initialization of the backend may be expensive.

Parameters:name (str) – Name of a registered backend.
gwf.backends.backend_from_config(config)[source]

Return backend class for the backend specified by config.

See backend_from_name() for further information.

class gwf.backends.Backend[source]

Base class for backends.

cancel(target)[source]

Cancel target.

Parameters:target (gwf.Target) – The target to cancel.
Raises:gwf.exceptions.TargetError – If the target does not exist in the workflow.
close()[source]

Close the backend.

Called when the backend is no longer needed and should close all resources (open files, connections) used by the backend.

log_manager = <gwf.backends.logmanager.FileLogManager object>
classmethod logs(target, stderr=False)[source]

Return log files for a target.

If the backend cannot return logs a NoLogFoundError is raised.

By default standard output (stdout) is returned. If stderr=True standard error will be returned instead.

Parameters:
  • target (gwf.Target) – Target to return logs for.
  • stderr (bool) – default: False. If true, return standard error.
Returns:

A file-like object. The user is responsible for closing the returned file(s) after use.

Raises:

gwf.exceptions.NoLogFoundError – if the backend could not find a log for the given target.

status(target)[source]

Return the status of target.

Parameters:target (gwf.Target) – The target to return the status of.
Returns:gwf.backends.Status – Status of target.
submit(target, dependencies)[source]

Submit target with dependencies.

This method must submit the target and return immediately. That is, the method must not block while waiting for the target to complete.

Parameters:
  • target (gwf.Target) – The target to submit.
  • dependencies – An iterable of gwf.Target objects that target depends on and that have already been submitted to the backend.
class gwf.backends.Status[source]

Status of a target.

A target is unknown to the backend if it has not been submitted or the target has completed and thus isn’t being tracked anymore by the backend.

A target is submitted if it has been successfully submitted to the backend and is pending execution.

A target is running if it is currently being executed by the backend.

RUNNING = 2

The target is currently running.

SUBMITTED = 1

The target has been submitted, but is not currently running.

UNKNOWN = 0

The backend is not aware of the status of this target (it may be completed or failed).
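
To make the backend interface and the three status values concrete, here is a minimal sketch of a backend that only tracks state in a dictionary. Both classes are hypothetical stand-ins defined for this example; they do not subclass gwf.backends.Backend and nothing is executed.

```python
from enum import Enum


class Status(Enum):
    UNKNOWN = 0
    SUBMITTED = 1
    RUNNING = 2


class InMemoryBackend:
    """Hypothetical backend sketch that tracks targets in a dict."""

    def __init__(self):
        self._status = {}

    def submit(self, target, dependencies):
        # Returns immediately; the target is only marked as submitted.
        self._status[target] = Status.SUBMITTED

    def status(self, target):
        # Targets that were never submitted are unknown to the backend.
        return self._status.get(target, Status.UNKNOWN)

    def cancel(self, target):
        if target not in self._status:
            raise KeyError(target)
        del self._status[target]

    def close(self):
        self._status.clear()


backend = InMemoryBackend()
before = backend.status('Foo')
backend.submit('Foo', dependencies=[])
after = backend.status('Foo')
backend.cancel('Foo')
cancelled = backend.status('Foo')
backend.close()
```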

Log Managers

class gwf.backends.logmanager.FileLogManager[source]

A file-based log manager.

This log manager stores logs on disk in the log_dir directory (which defaults to .gwf/logs).

open_stderr(target, mode='r')[source]

Return file handle to standard error log file for target.

Raises:LogError – If the log could not be found.
open_stdout(target, mode='r')[source]

Return file handle to the standard output log file for target.

Raises:LogError – If the log could not be found.
stderr_path(target_name)[source]

Return path of the log file containing standard error for target.

stdout_path(target_name)[source]

Return path of the log file containing standard output for target.

class gwf.backends.logmanager.MemoryLogManager[source]

A memory-based log manager.

This log manager stores logs in memory.

Filtering

gwf.filtering.filter_generic(targets, filters)[source]

Filter targets given a list of filters.

Return all targets from targets passing all filters. For example:

matched_targets = filter_generic(
    targets=graph.targets.values(),
    filters=[
        NameFilter(patterns=['Foo*']),
        StatusFilter(scheduler=scheduler, status='running'),
    ]
)

returns a generator yielding all targets with a name matching Foo* which are currently running.

Parameters:
  • targets – A list of targets to be filtered.
  • filters – A list of Filter instances.
gwf.filtering.filter_names(targets, patterns)[source]

Filter targets with a list of patterns.

Return all targets in targets where the target name matches one or more of the patterns in patterns. For example:

matched_targets = filter_names(graph.targets.values(), ['Foo*'])

returns a generator yielding all targets with a name matching the pattern Foo*. Multiple patterns can be provided:

matched_targets = filter_names(graph.targets.values(), ['Foo*', 'Bar*'])

returns all targets with a name matching either Foo* or Bar*.

This function is a simple wrapper around NameFilter.
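
Shell-style patterns such as Foo* can be matched with the standard fnmatch module. The sketch below is an illustration that works on plain name strings; filter_names_sketch is hypothetical, and whether gwf uses fnmatch internally is an assumption.

```python
import fnmatch


def filter_names_sketch(names, patterns):
    """Yield every name that matches at least one of the patterns."""
    for name in names:
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns):
            yield name


names = ['Foo1', 'FooBar', 'Bar', 'Baz']
only_foo = list(filter_names_sketch(names, ['Foo*']))
foo_or_bar = list(filter_names_sketch(names, ['Foo*', 'Bar*']))
```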

Helpers for filtering:

class gwf.filtering.ApplyMixin[source]

A mixin for predicate-based filters providing the apply method.

Most filters are predicate-based in the sense that they simply filter targets one by one based on a predicate function that decides whether to include the target or not. Such filters can inherit this mixin and then only need to declare a predicate() method which returns True if the target should be included and False otherwise.

For examples of using this mixin, see the StatusFilter and EndpointFilter filters.

apply(targets)[source]

Apply the filter to all targets.

This method returns a generator yielding all targets in targets for which predicate() returns True.

predicate(target)[source]

Return True if target should be included, False otherwise.

This method must be overridden by subclasses.
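
The predicate-based pattern can be sketched as follows. All names here (SketchApplyMixin, PrefixFilter, chain_filters) are hypothetical and operate on plain name strings; the sketch only illustrates how a mixin can supply apply() while subclasses supply predicate(), and how several filters can be chained.

```python
class SketchApplyMixin:
    """Illustration of a predicate-based filter mixin."""

    def apply(self, targets):
        # Lazily yield the targets for which predicate() returns True.
        return (target for target in targets if self.predicate(target))

    def predicate(self, target):
        raise NotImplementedError('predicate() must be overridden by subclasses')


class PrefixFilter(SketchApplyMixin):
    """Hypothetical filter keeping names that start with a prefix."""

    def __init__(self, prefix):
        self.prefix = prefix

    def predicate(self, target):
        return target.startswith(self.prefix)


def chain_filters(targets, filters):
    """Pass targets through each filter in turn."""
    for current_filter in filters:
        targets = current_filter.apply(targets)
    return targets


result = list(
    chain_filters(['Foo1', 'Foo2', 'Bar'], [PrefixFilter('Foo'), PrefixFilter('Foo2')])
)
```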