Factotum 0.2.0 released


We are pleased to announce release 0.2.0 of Snowplow’s DAG running tool, Factotum. This release introduces variables for jobs and the ability to start jobs from a given task.

In the rest of this post we will cover:

  1. Job configuration variables
  2. Starting a job from an arbitrary task
  3. Output improvements
  4. Downloading and running Factotum
  5. Roadmap for Factotum
  6. Contributing

1. Job configuration variables

Jobs often contain per-run information such as a target hostname or IP address. In Factotum 0.1.0 the only way to set this information was to edit the factfile manually. In Factotum 0.2.0, you can supply it at run time through a job argument. Job configurations are free-form JSON and can contain arbitrarily complex information; each value is referenced through a designated placeholder in the job specification.

Here’s a quick example of how it works:

{
  "schema": "iglu:com.snowplowanalytics.factotum/factfile/jsonschema/1-0-0",
  "data": {
    "name": "Variables demo",
    "tasks": [
      {
        "name": "Say something",
        "executor": "shell",
        "command": "echo",
        "arguments": [ "{{ message }}" ],
        "dependsOn": [],
        "onResult": {
          "terminateJobWithSuccess": [],
          "continueJob": [ 0 ]
        }
      }
    ]
  }
}

In the factfile above, the task’s arguments now contain a placeholder, written {{ message }}. Passing a configuration JSON with a “message” field causes the placeholder in this task’s arguments to be replaced with the supplied value. You supply the configuration using a new argument to Factotum, “-e” or “--env”, followed by some JSON.

Check out this example, using the above factfile:

$ factotum samples/variables.factotum -e '{ "message": "hello world" }'
Task 'Say something' was started at 2016-06-12 21:04:02.274382495 UTC
Task 'Say something' stdout:
hello world
Task 'Say something': succeeded after 0.0s
1/1 tasks run in 0.0s

This functionality is built using the mustache templating system – which we’re making a standard for Factotum going forwards.
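
To make the substitution step concrete, here is a minimal Python sketch of mustache-style variable replacement applied to a task’s arguments. It is an illustration only, not Factotum’s implementation: the `render_arguments` helper is hypothetical, and it handles only simple `{{ name }}` tags rather than the full mustache specification (sections, escaping, dotted names and so on).

```python
import re

# Matches simple mustache variable tags such as {{ message }} or {{message}}.
TAG = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_arguments(arguments, env):
    """Replace each {{ name }} tag in the arguments with env[name].

    Hypothetical helper for illustration; unknown names render as an
    empty string, which is how mustache treats missing variables.
    """
    def substitute(match):
        return str(env.get(match.group(1), ""))

    return [TAG.sub(substitute, argument) for argument in arguments]


if __name__ == "__main__":
    print(render_arguments(["{{ message }}"], {"message": "hello world"}))
```

Arguments without a tag pass through unchanged, so a factfile that uses no variables behaves the same under this sketch whether or not a configuration is supplied.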

If you find it challenging to construct the JSON for your variables on the command line, consider adding the excellent jq utility to your pipeline.
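
If you are driving Factotum from a script rather than typing the JSON by hand, another option is to let a JSON library do the quoting. The Python sketch below builds the `-e` argument with `json.dumps`; the `build_command` helper is hypothetical, and actually launching the command assumes `factotum` is on your `PATH`.

```python
import json


def build_command(factfile, variables):
    """Return the argv list for running a factfile with -e variables.

    Hypothetical helper: json.dumps takes care of quoting and escaping,
    so values containing quotes or backslashes stay valid JSON.
    """
    return ["factotum", factfile, "-e", json.dumps(variables)]


if __name__ == "__main__":
    command = build_command("samples/variables.factotum",
                            {"message": 'say "hello" to the world'})
    print(command)
    # To launch the job (assumes factotum is installed and on PATH):
    #   import subprocess; subprocess.run(command, check=True)
```

Passing the command as a list, rather than a single shell string, also avoids a second layer of shell quoting around the JSON.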

2. Starting a job from an arbitrary task

An unfortunate fact of life is that jobs occasionally fail part way through – for example, if your server loses network connectivity during a task. Factotum 0.2.0 includes functionality to (re)start a job from a given point, allowing you to skip tasks that have already been run.

This functionality is provided by the “--start” (or “-s”) command-line option. Given the factfile below:

{
  "schema": "iglu:com.snowplowanalytics.factotum/factfile/jsonschema/1-0-0",
  "data": {
    "name": "echo order demo",
    "tasks": [
      {
        "name": "echo alpha",
        "executor": "shell",
        "command": "echo",
        "arguments": [ "alpha" ],
        "dependsOn": [],
        "onResult": {
          "terminateJobWithSuccess": [],
          "continueJob": [ 0 ]
        }
      },
      {
        "name": "echo beta",
        "executor": "shell",
        "command": "echo",
        "arguments": [ "beta" ],
        "dependsOn": [ "echo alpha" ],
        "onResult": {
          "terminateJobWithSuccess": [],
          "continueJob": [ 0 ]
        }
      },
      {
        "name": "echo omega",
        "executor": "shell",
        "command": "echo",
        "arguments": [ "and omega!" ],
        "dependsOn": [ "echo beta" ],
        "onResult": {
          "terminateJobWithSuccess": [],
          "continueJob": [ 0 ]
        }
      }
    ]
  }
}

You can start from the “echo beta” task using the following:

$ factotum samples/echo.factotum --start "echo beta"
Task 'echo beta' was started at 2016-06-12 21:27:34.702377410 UTC
Task 'echo beta' stdout:
beta
Task 'echo beta': succeeded after 0.0s
Task 'echo omega' was started at 2016-06-12 21:27:34.704229360 UTC
Task 'echo omega' stdout:
and omega!
Task 'echo omega': succeeded after 0.0s
2/2 tasks run in 0.0s

This skips the task “echo alpha” and starts from “echo beta”.

In more complicated DAGs, there are some tasks which cannot currently be the starting point for jobs. Resuming a job from such tasks would be ambiguous, typically because the DAG has parallel execution branches and a single start point does not tell Factotum enough about the start state of all of the branches.

For example, given the following DAG:

[Figure: DAG resume diagram]

starting from “B” is not possible, as the dependent task “E” also depends on “C”. An error will be thrown if a user attempts to start from “B”; starting from task “D” or “E”, however, is possible if desired.
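
One way to picture this check is the Python sketch below. It is an assumption about the rule rather than Factotum’s actual code: a task is treated as a valid start point only if every task downstream of it depends solely on the start task or on other downstream tasks. The DAG in the example is hypothetical, chosen only to be consistent with the description above (E depends on C, which is not downstream of B).

```python
def downstream(tasks, start):
    """Return every task that depends on `start`, directly or transitively.

    `tasks` maps a task name to the list of names it depends on.
    """
    found = set()
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for name, deps in tasks.items():
            if current in deps and name not in found:
                found.add(name)
                frontier.append(name)
    return found


def can_start_from(tasks, start):
    """Check whether resuming from `start` is unambiguous (assumed rule)."""
    if start not in tasks:
        raise KeyError(f"unknown task: {start}")
    reachable = downstream(tasks, start) | {start}
    # Every downstream task must have all of its dependencies inside the
    # part of the DAG that will actually run.
    return all(
        set(tasks[name]) <= reachable
        for name in reachable
        if name != start
    )


if __name__ == "__main__":
    # Hypothetical DAG: A feeds B and C, B feeds D and E, C also feeds E.
    dag = {"A": [], "B": ["A"], "C": ["A"], "D": ["B"], "E": ["B", "C"]}
    for task in dag:
        print(task, can_start_from(dag, task))
```

Under this assumed rule, B is rejected in the hypothetical DAG because E would be left waiting on C, while D and E are accepted because nothing downstream of them depends on a task that would be skipped.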

In a future release of Factotum we plan on letting a user start from any set of complete coordinates within the DAG (see issue #54).

3. Output improvements

You may have noticed from the previous examples that Factotum now provides a lot more information on job execution. The main changes are:

  • Terminal colours can be switched off by setting the environment variable CLICOLOR to 0 (though we plan to move this to a command-line argument; see issue #53)
  • Task durations are now human-readable
  • A summary of the number of tasks run is printed when the job finishes/terminates
  • A task’s output on stdout is only shown if the task produces output
  • A task’s output on stderr is printed to stderr again by the Factotum process (this can simplify capturing output from Factotum)
  • Tasks now have their launch time displayed along with a tidied up summary of the result

This was based on feedback from using Factotum in production at Snowplow!
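
Because task stderr is re-emitted on Factotum’s own stderr, a wrapper script can collect the two streams separately. The Python sketch below shows one way to do that with `subprocess.run`, also setting `CLICOLOR=0` so that terminal colours are switched off in the captured text. The `run_captured` helper is hypothetical, and the demo uses a stand-in child process so that it runs without Factotum installed.

```python
import os
import subprocess
import sys


def run_captured(command):
    """Run a command with colours disabled, returning (code, stdout, stderr).

    Hypothetical helper: stdout and stderr are captured as separate strings.
    """
    env = dict(os.environ, CLICOLOR="0")
    result = subprocess.run(command, env=env, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # Stand-in child process so the sketch runs without Factotum installed;
    # swap in a real factotum invocation (assumed to be on PATH) for a job.
    demo = [sys.executable, "-c",
            "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"]
    code, out, err = run_captured(demo)
    print("exit code:", code)
    print("stdout:", out.strip())
    print("stderr:", err.strip())
```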

4. Downloading and running Factotum

Currently Factotum is only available for 64-bit Linux. Get it like so:

wget https://bintray.com/artifact/download/snowplow/snowplow-generic/factotum_0.2.0_linux_x86_64.zip
unzip factotum_0.2.0_linux_x86_64.zip
wget https://raw.githubusercontent.com/snowplow/factotum/master/samples/echo.factotum

This series of commands will download the 0.2.0 release, unzip it in your current working directory and download a sample job for you to run. You can then run Factotum in the following way:

./factotum ./echo.factotum

5. Roadmap for Factotum

We’re taking an iterative approach with Factotum – today Factotum won’t give you an entire stack for monitoring, scheduling and running data pipelines, but we plan on growing it into a set of tools that will.

Factotum will continue to be our “job executor”, but a more complete ecosystem will be developed around it – ideas include an optional scheduler, audit logging, user authentication, Mesos support and more. If you have specific features to suggest, please add a ticket to the GitHub repo.

6. Contributing

Factotum is completely open source – and has been from the start! If you’d like to get involved, or just try your hand at Rust, please check out the repository.