We’re pleased to announce the 0.3.0 release of Snowplow’s DAG running tool Factotum! This release centers around making DAGs easier to create, monitor and reason about, including adding outbound webhooks to Factotum.
In the rest of this post we will cover:
- Improving the workflow when creating DAGs
- Improving job monitoring using webhooks
- Behaviors on task failure
- Downloading and running Factotum
1. Improving the workflow when creating DAGs
To separate commands cleanly, we’ve moved to a “subcommand” style argument system. What was originally
factotum <your factfile>
is now
factotum run <your factfile>
All new features will follow this scheme.
The other improvements around workflow broadly fall into the following categories: factfile validation, dry runs and Graphviz dotfile exports. These are discussed in the following sections!
Factfiles have always been schema’d and validated against the factfile schema. It’s not always convenient to locate this schema and ensure that the factfile you’re working on is valid, so as of version 0.3.0 we’ve introduced a built-in validation command. This includes checking that: your factfile is valid JSON; that it adheres to the JSON schema; and that each task can be executed.
You can use it like this:
factotum validate <your factfile>
If the factfile is valid, Factotum will respond:
<your Factfile> is a valid Factotum Factfile!
and if it’s not, you’ll get a message explaining the problem:
<your Factfile> is not a valid Factotum Factfile: Invalid JSON at line 1 column 1
Validating a factfile ensures Factotum can process your job. Dry runs are a way to show how your job will be executed, including a full output simulation.
Dry runs can be executed in the following way:
factotum run <your factfile> --dry-run
which, depending on the factfile will return something like this:
COMMAND here is the real command Factotum will use to execute your task, which can be copy-pasted and run in a shell if desired.
Graphviz dot output
For complicated DAGs, it’s not always easy to tell the dependency tree from the text output of a program. That’s why as of 0.3.0 Factotum supports exporting your DAG as a Graphviz dotfile. This export can be used to visualise your Factfile in any of a number of programs, or a web based renderer.
factotum dot <your factfile> --output dag.dot
This will build a dotfile representation of <your Factfile> and put the result in dag.dot. Here’s what you can expect to see after you’ve rendered this dotfile:
(I used the command dot -Tsvg echo.dot -o echo-dot-output.svg to generate this image, using the graphviz package in Ubuntu.)
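If you haven’t worked with Graphviz before, the following minimal sketch shows the general shape of a dotfile for a small DAG. It is not Factotum’s exporter and won’t match its output exactly; the `to_dot` function and the task names are hypothetical, chosen only to illustrate the format that `dot -Tsvg` consumes.

```python
from typing import Dict, List


def to_dot(deps: Dict[str, List[str]], name: str = "dag") -> str:
    """Render a task -> dependencies mapping as a Graphviz digraph.

    Illustrative only: assumes task names contain no double quotes.
    """
    lines = [f"digraph {name} {{"]
    # Declare every task as a node, so tasks without edges still appear.
    for task in deps:
        lines.append(f'    "{task}";')
    # Draw one edge from each dependency to the task that needs it.
    for task, parents in deps.items():
        for parent in parents:
            lines.append(f'    "{parent}" -> "{task}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-task job: "load" waits on "extract" and "transform".
example = to_dot({"extract": [], "transform": [], "load": ["extract", "transform"]})
```

Writing `example` to a file and passing it to `dot -Tsvg` should render two root nodes pointing at a single downstream node.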
2. Improving job monitoring using webhooks
Data pipelines typically run on clusters, with a job or part of a job being assigned to one or more machines (which may or may not be virtual). It won’t necessarily be known in advance which box will run a specific job, or be straightforward to work out where a previous run was executed (or even if the box is still running).
This creates a problem unique to cluster-based software: how do I keep an auditable log of the jobs that have run, and how do I know which are currently running (and what they’re doing)?
There are a number of ways to “bridge” applications which use traditional log files for cluster use, for example using NFS and a central “log store”. However this solution isn’t perfect, and to make a log file really auditable it needs to be structured – a stream of unstructured messages is difficult to reason about (and query).
Many tools such as Airflow or Chronos would at this point bundle in MySQL or Postgres or Cassandra and use that to store state over time. This approach makes technical sense, but it does create a new and opaque data silo within your organisation; all this information is hidden away somewhere, and liable to change format between releases.
We’ve chosen a different path based on the Zen of Factotum and the idea that you should be able to depend on an abstraction rather than a specific implementation or tool. As of release 0.3.0, Factotum can emit self-describing events to an HTTP (or HTTPS) endpoint of your choice with the current state of the running job. These events are also suitable for ingesting into your existing Snowplow pipeline (though this is by no means required!).
Running with webhooks
The new functionality can be enabled by adding the --webhook <url> option. For example:
factotum run <your factfile> --webhook "http://my-endpoint.com"
You can ingest these events into Snowplow using the Iglu webhook adapter POST support (requires R83+):
factotum run <your factfile> --webhook "https://my-snowplow-collector.com/com.snowplowanalytics.iglu/v1"
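If you’d like to watch the events arrive before wiring up a real endpoint, here’s a minimal sketch of a local receiver using only the Python standard library. It assumes each POST body is a single self-describing JSON with top-level `schema` and `data` keys, where `schema` is an Iglu URI of the form `iglu:<vendor>/<name>/<format>/<version>`. The handler class, the `classify_event` helper and port 8080 are our own illustrative choices, not part of Factotum.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def classify_event(body: bytes) -> Tuple[str, dict]:
    """Split a self-describing JSON payload into (event name, data)."""
    envelope = json.loads(body)
    # e.g. iglu:com.snowplowanalytics.factotum/job_update/jsonschema/1-0-0
    schema_uri = envelope["schema"]
    # Drop the "iglu:" scheme, then take the second path segment (the name).
    name = schema_uri.split(":", 1)[-1].split("/")[1]
    return name, envelope["data"]


class FactotumWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            name, _data = classify_event(body)
        except (ValueError, KeyError, IndexError, TypeError):
            # Not the payload shape we assumed: reject it.
            self.send_response(400)
            self.end_headers()
            return
        print(f"received {name} event")
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Block forever, logging each event name as it arrives."""
    HTTPServer(("localhost", port), FactotumWebhookHandler).serve_forever()
```

Calling `serve()` in one terminal and running Factotum with `--webhook "http://localhost:8080"` in another should print one line per update received.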
When updates are sent
Updates are split into two different event types. The first is triggered when the state of the job changes, for example when the job is started or finished. The second is when the state of a specific task changes – for example, when a task is started or failed.
Job updates are described by com.snowplowanalytics.factotum/job_update/jsonschema/1-0-0 events, available in Iglu Central.
Here’s an example of a job update:
Task updates are described by com.snowplowanalytics.factotum/task_update/jsonschema/1-0-0 events, also available in Iglu Central.
Here’s an example of a task update:
Both events share many common fields. A description of all the fields in both events is given below (split up into fields common to both events, and then those specifically in task updates and job updates).
|Yes||Self describing event wrapper|
|Yes||Self describing event wrapper|
|Yes||The name of the job, as it appears in the Factfile|
|Yes||An ID unique to the Factfile for this job. If you’re using user defined tags, jobs with the same Factfile and differing tags will have different job IDs as tags are included when calculating job IDs.|
|Yes||A globally unique ID for this run|
|Yes||An object representing any user defined tags for the running job|
|Yes||A base64 encoded copy of the Factfile that’s running|
|Yes||The version of Factotum that’s executing the job|
|Yes||The current state of the job. This can be |
|Yes||The time the job started, in ISO8601 format|
|Yes||The running time of the job so far in ISO8601 duration format|
|Yes||An array of information on the state of each task|
|Yes||The name of the task, as it appears in the Factfile|
|Yes||The current state of the task. This can be |
|No||Optional. The ISO8601 start time of the task|
|No||Optional. The ISO8601 duration of the task|
|No||Optional. The output of the task to |
|No||Optional. The output of the task to |
|No||Optional. The return code of the task|
|No||Optional. The reason the task failed, or was skipped|
job_update events have the following extra fields that provide information about the current state of the job.
|Yes||An object explaining the reason this event was emitted|
|Yes||The state of the job prior to the change occurring. This can be |
|Yes||The state of the job after the change has occurred. This can be |
task_update events have the following extra fields that provide information about the changes in state of tasks.
|Yes||An array of task level changes in execution. Each element represents the change in state for a single task|
|Yes||The name of the task that has changed state (as it appears in the Factfile).|
|Yes||The state the given task was previously in. This can be |
|Yes||The state the given task is now in. This can be |
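Since both events carry a base64 encoded copy of the Factfile, a consumer can recover the exact job definition that was running. Here’s a minimal sketch, assuming the standard base64 alphabet and UTF-8 encoded JSON; `decode_factfile` is a hypothetical helper, and the sample Factfile body below is a placeholder rather than a real job.

```python
import base64
import json


def decode_factfile(encoded: str) -> dict:
    """Turn the base64 copy of a Factfile from an event back into JSON."""
    raw = base64.b64decode(encoded)
    return json.loads(raw.decode("utf-8"))


# Round trip with a placeholder document standing in for a real Factfile.
placeholder = {"name": "echo example"}
encoded = base64.b64encode(json.dumps(placeholder).encode("utf-8")).decode("ascii")
decoded = decode_factfile(encoded)
```

Storing the decoded Factfile alongside each run reference gives you an audit trail that doesn’t depend on the original file still being on disk.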
Tags are a way to add custom meta-data to your job runs. You can add any set of key-value pairs to your jobs – when using webhooks they’ll have the following effects:
- Appear in all webhook events under the
- Be used in addition to the Factfile itself to calculate the job reference
- This means that the same Factfile can generate two or more different job references if required
In both events, custom tags look like this:
You can add tags to your job with the --tag argument. To add the foo tag with the value bar:
factotum run samples/echo.factotum --webhook http://localhost --tag "foo,bar"
Multiple tags can be added by repeating the argument:
factotum run samples/echo.factotum --webhook http://localhost --tag "foo,bar" --tag "foo2,bar2"
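To see why tags change the job reference, it can help to picture the reference as a digest over both the Factfile and its tags. The sketch below is purely illustrative: SHA-256, the NUL separators and the rule of splitting each tag on its first comma are our assumptions, not Factotum’s actual algorithm, and `parse_tags` and `job_reference` are hypothetical names.

```python
import hashlib
from typing import Dict, List


def parse_tags(raw_tags: List[str]) -> Dict[str, str]:
    """Turn values like "foo,bar" into {"foo": "bar"} (split on first comma)."""
    tags = {}
    for raw in raw_tags:
        key, value = raw.split(",", 1)
        tags[key] = value
    return tags


def job_reference(factfile_text: str, tags: Dict[str, str]) -> str:
    """Illustrative digest over the Factfile contents plus its tags."""
    digest = hashlib.sha256()
    digest.update(factfile_text.encode("utf-8"))
    # Sort keys so the order of --tag arguments doesn't change the result.
    for key in sorted(tags):
        digest.update(b"\0" + key.encode("utf-8"))
        digest.update(b"\0" + tags[key].encode("utf-8"))
    return digest.hexdigest()


untagged = job_reference("{}", {})
tagged = job_reference("{}", parse_tags(["foo,bar"]))
```

Under this scheme `untagged` and `tagged` differ even though the Factfile text is identical, which mirrors the behavior described above.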
3. Behaviors on task failure
Fail fast vs continue as far as possible
In previous releases of Factotum, when a task failed, Factotum would stop processing your job as soon as possible. We call this behavior “failing fast”; it is also the default behavior of Make (unless the --keep-going flag is enabled). Failing fast is simple and predictable; however, it often means that many tasks that could have run do not run at all. It’s also difficult to reason about, because the final state of the DAG depends not just on which tasks failed, but on how long different tasks ran for.
That’s why as of this release, we’re switching to a different model. Factotum will now “keep going” and complete as many tasks as possible, with the tasks that depend on failing tasks being the only ones which are skipped.
Here are a few diagrams cataloguing the difference in behavior. On the left is the previous version(s) of Factotum, and on the right is version 0.3.0+:
In trivial DAGs (as shown above) the behavior between this version of Factotum and prior versions is the same.
In DAGs with multiple dependency trees, in prior versions Factotum would stop as soon as possible (left). In this version Factotum will complete as much as possible (right).
When DAGs split into parallel streams of execution, any sub-task that eventually depends on a failed task will now be skipped (right), compared to terminating at the first failure (prior versions, left).
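The “keep going” rule can be sketched in a few lines: a task runs only if every task it depends on succeeded, and is skipped otherwise. This is a simplified model, not Factotum’s scheduler. It assumes an acyclic graph where every dependency is itself listed as a task, and the state labels (`succeeded`, `failed`, `skipped`) and function name are ours.

```python
from typing import Dict, List, Set


def run_keep_going(deps: Dict[str, List[str]], failing: Set[str]) -> Dict[str, str]:
    """Return the final state of each task under the keep-going model.

    deps maps each task to the tasks it depends on; failing names the
    tasks that would return an error if they were executed.
    """
    states: Dict[str, str] = {}

    def resolve(task: str) -> str:
        if task not in states:
            parent_states = [resolve(parent) for parent in deps[task]]
            if any(state != "succeeded" for state in parent_states):
                # Something upstream failed or was skipped: never run this task.
                states[task] = "skipped"
            elif task in failing:
                states[task] = "failed"
            else:
                states[task] = "succeeded"
        return states[task]

    for task in deps:
        resolve(task)
    return states


# Two parallel streams after "a"; only the stream through "b" is affected.
result = run_keep_going(
    {"a": [], "b": ["a"], "c": ["a"], "d": ["b"], "e": ["c"]},
    failing={"b"},
)
```

Here `d` is skipped because it depends on the failed `b`, while `c` and `e` still complete. Unlike failing fast, the outcome is a function of the graph and the failing tasks alone, not of how long each task happened to run.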
A task that requests early DAG termination has always worked in the same way as failures (except that it’s not a failure!).
To keep things straightforward, we’ve also altered how “no operations” work in version 0.3.0 to match the new way of handling task failures (shown above). They will continue to “skip” subsequent tasks without generating an error.
Factotum 0.3.0 now ships a macOS version. You can see how to get a copy here!
Turning off terminal colours with --no-colour
Colours aren’t for everyone, and they can be distracting if you’re piping output to a file (or another destination that doesn’t understand colour codes). In version 0.2.0 we introduced support for the CLICOLOR environment variable (as described here). In this release we’re complementing that with a command line argument, --no-colour, that forces ANSI terminal colours to be turned off.
Eating our own dog food
Factotum is now released using Factotum! This means you can see a real example of using a Factfile here, including an example of “terminating early” and how it applies to builds.
5. Downloading and running Factotum
Factotum is now available for macOS and Linux (x86_64).
If you’re running Linux:
wget https://bintray.com/artifact/download/snowplow/snowplow-generic/factotum_0.3.0_linux_x86_64.zip
unzip factotum_0.3.0_linux_x86_64.zip
wget https://raw.githubusercontent.com/snowplow/factotum/master/samples/echo.factotum
If you’re running macOS:
wget https://bintray.com/artifact/download/snowplow/snowplow-generic/factotum_0.3.0_darwin_x86_64.zip
unzip factotum_0.3.0_darwin_x86_64.zip
wget https://raw.githubusercontent.com/snowplow/factotum/master/samples/echo.factotum
This series of commands will download the 0.3.0 release, unzip it in your current working directory and download a sample job for you to run. You can then run Factotum in the following way:
./factotum run ./echo.factotum
6. Roadmap for Factotum
We’re taking an iterative approach with Factotum – today Factotum won’t give you an entire stack for monitoring, scheduling and running data pipelines, but we plan on growing it into a set of tools that will.
Factotum will continue to be our “job executor”, but a more complete ecosystem will be developed around it – ideas include an optional scheduler, audit logging, user authentication, Mesos support and more. If you have specific features to suggest, please add a ticket to the GitHub repo.
Factotum is completely open source – and has been from the start! If you’d like to get involved, or just try your hand at Rust, please check out the repository.