Introducing Dataflow Runner
We are pleased to announce the release of Dataflow Runner, a new open-source system for the creation and running of AWS EMR jobflow clusters and steps. Big thanks to Snowplow intern Manoj Rajandrakumar for all of his hard work on this project!
This release signals the first step in our journey to deconstruct EmrEtlRunner into two separate applications, a Dataflow Runner and
snowplowctl, per our RFC on Discourse.
In the rest of this post we will cover:
- Why Dataflow Runner?
- Dataflow Runner 0.1.0
- Downloading and running Dataflow Runner
- Running a jobflow on EMR
1. Why Dataflow Runner?
Looking around, we could not find a tool for managing and executing data processing jobs on dataflow fabrics such as EMR in a simple, declarative way. Dataflow Runner is designed to fill this gap, and letting us move away from our existing EmrEtlRunner tooling.
Dataflow Runner is a general-purpose executor for data processing jobs on dataflow fabrics. Initially it only supports AWS EMR but will eventually support a host of other fabrics, including Google Cloud Dataproc and Azure HDInsight. We are also interested in supporting on-premise Hadoop, Spark and Flink deployments, most likely via YARN.
For the full rationale, do check out our original RFC on the topic.
The key features of the Dataflow Runner design are as follows:
Separation of concerns
Dataflow Runner aims to solve the fundamental problem of EmrEtlRunner complecting business logic with general-purpose EMR execution. This was problematic for a few reasons:
- It makes it very difficult to test EmrEtlRunner thoroughly, because the side effecting EMR execution and the business logic are mixed together
- The jobflow DAG is only constructed at runtime by code, which makes it very difficult to share with other orchestration tools such as Factotum
- EmrEtlRunner ties the Snowplow batch pipeline needlessly closely to EMR
- It makes it impossible for non-Snowplow projects to use our (in theory general-purpose) EMR execution code
Dataflow Runner is written in Golang, letting us build native binaries for various platforms. This makes Dataflow Runner a zero-dependency application, which is lightweight and places little strain on your system. This allows you to run many instances of Dataflow Runner concurrently on your orchestration cluster.
It is much easier for us to test Dataflow Runner than it was to test EmrEtlRunner. As Dataflow Runner is a general-purpose application for running jobs on dataflow fabrics, it is straightforward for us to run integration tests on AWS without depending on a functional Snowplow pipeline.
2. Dataflow Runner 0.1.0
In version 0.1.0 of Dataflow Runner you will only able to work with AWS EMR - which mimics our current support in
EmrEtlRunner. You can perform three distinct actions with this resource:
- Launching a new cluster which is ready to run custom steps, via the
- Adding steps to a newly created cluster, via the
- Terminating a cluster, via the
There is a fourth command which mimics the current behavior of EmrEtlRunner:
run-transient, which will launch, run steps and terminate in a single blocking action.
The configurations for Dataflow Runner are expressed in self-describing Avro, so they can be versioned and remain human-composable. The Avro Schema for these configs are available from Iglu Central as:
Any Avro schema validator can validate/lint a Dataflow Runner config - which should makes these configurations more manageable compared to EmrEtlRunner’s current YAML-based config.
Dataflow Runner is currently built for
Windows/amd64 - if you require a different platform, please let us know!
Crucially, Dataflow Runner has no install dependencies and doesn’t require a cluster, root access, a database, port 80 and so on. The binaries can be found at the following locations:
- Linux: http://dl.bintray.com/snowplow/snowplow-generic/dataflow_runner_0.1.0_linux_amd64.zip
- Windows: http://dl.bintray.com/snowplow/snowplow-generic/dataflow_runner_0.1.0_windows_amd64.zip
- macOS: http://dl.bintray.com/snowplow/snowplow-generic/dataflow_runner_0.1.0_darwin_amd64.zip
3. Downloading and running Dataflow Runner
To get Dataflow Runner for Linux/amd64:
This series of commands will download the 0.1.0 release and unzip it in your current working directory. You can then run Dataflow Runner in the following way:
4. Running a jobflow on EMR
To use Dataflow Runner you will need to create two configuration files. The first is an EMR cluster configuration, which will be used to launch the cluster; the second is the “playbook” containing a sequential series of steps that you want executed on a given cluster.
Here is the cluster configuration - note that this is EMR-specific:
This is a barebones configuration but you are also able to add tags, bootstrap actions and extra cluster configuration options.
credentials can be fetched from the local environment, an IAM role attached to a server or as plaintext.
To then launch this cluster:
Eventually you will see output like the following -
EMR cluster launched successfully; Jobflow ID: j-2DPBYD87LTGP8 - this means you are ready to submit some steps!
Here is a playbook containing a single step - this playbook is not EMR-specific:
To run this playbook against your cluster:
This will add the steps to your cluster queue - and the step, a file copy, will execute once all previous steps have completed on the cluster.
Once you are done with the cluster you can terminate it using the following:
To execute all of the above, including spinning up and shutting down the cluster, in one command it’s simply:
NOTE: For help and documentation on each command please see the documentation.
5. Roadmap for Dataflow Runner
We’re taking an iterative approach with Dataflow Runner - today it only has support for AWS EMR but we plan on growing it into a tool that can run on YARN, Google Cloud Dataproc, Azure HDInsight and more!
From the Snowplow perspective we are exploring how to migrate our users from EmrEtlRunner to Dataflow Runner, per our RFC. We don’t plan on doing this immediately - Dataflow Runner needs to become more mature first, and there are still important jobflow execution features in EmrEtlRunner that have no equivalent yet in Dataflow Runner. However, expect to see some initial integration efforts between Snowplow and Dataflow Runner soon!
And of course if you have specific features you’d like to suggest, please add a ticket to the GitHub repo.
Dataflow Runner is completely open source - and has been from the start! If you’d like to get involved, or just try your hand at Golang, please check out the repository.