Dataflow Runner 0.2.0 released
Building on the initial release of Dataflow Runner last month, we are proud to announce version 0.2.0, aiming to bring Dataflow Runner up to feature parity with our long-standing EmrEtlRunner application.
As a quick reminder, Dataflow Runner is a cloud-agnostic tool to create clusters and run jobflows which, for the moment, only supports AWS EMR.
If you need a refresher on the rationale behind Dataflow Runner, feel free to check out the RFC on the subject.
In the rest of this post, we will cover:
- Support for EMR Applications
- Support for Elastic Block Store
- Configurable logging level
- Other updates
1. Support for EMR Applications
EMR Applications are a way to tell EMR what you want installed on your cluster when you launch it. There are various big data applications to choose from, such as Flink, Spark or Hive.
As we’re moving the batch pipeline away from Scalding to Spark, as detailed in the Spark RFC, the need to support EMR Applications in Dataflow Runner became apparent, since Spark is not installed by default when launching an EMR cluster.
To specify which applications you want installed on your EMR cluster, you just have to add a JSON array to your cluster configuration as shown:
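A minimal sketch of what such a configuration could look like. Everything except the `applications` array is trimmed for brevity, so this fragment is not a complete cluster configuration. The `applications` key name, the cluster name and the schema URI are assumptions to check against the schema and the full example on GitHub.

```json
{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data": {
    "name": "hypothetical-cluster-name",
    "applications": [ "Hadoop", "Spark" ]
  }
}
```

The application names are passed on to EMR, so they should match the names EMR itself uses for the release you are launching.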
Note that, compared with version 0.1.0 of Dataflow Runner, the Avro schema version has changed to 1-1-0, and the schema itself has been updated to reflect the improvements made in version 0.2.0. However, the new schema is fully backward-compatible with the previous one: if you do not wish to use the new features introduced in this release, you do not have to change anything. You can find the up-to-date schema on GitHub.
You can also find a full example of a cluster configuration on GitHub.
2. Support for Elastic Block Store
In 0.2.0, you can now specify an EBS volume for each instance in your EMR cluster, whether it is a master, core or task instance. To do so, modify the EC2 instances part of your cluster configuration file and add the desired EBS configuration; an example follows.
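As a rough sketch, the instances part of the cluster configuration might look like the fragment below, with an EBS volume attached to the master instance only. The key names mirror the EMR API's `EbsConfiguration` structure in camelCase, which is an assumption about Dataflow Runner's schema. The instance types, counts and volume sizes are purely illustrative, so check everything against the schema on GitHub before relying on it.

```json
{
  "instances": {
    "master": {
      "type": "m1.medium",
      "ebsConfiguration": {
        "ebsOptimized": true,
        "ebsBlockDeviceConfigs": [
          {
            "volumesPerInstance": 1,
            "volumeSpecification": {
              "sizeInGB": 10,
              "volumeType": "gp2"
            }
          }
        ]
      }
    },
    "core": {
      "type": "c3.4xlarge",
      "count": 3
    },
    "task": {
      "type": "m1.medium",
      "count": 0
    }
  }
}
```

Core and task instances would take the same `ebsConfiguration` block if they need their own volumes.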
Again, you can also refer to the cluster configuration example on GitHub for details.
3. Configurable logging level
We’ve also added a small option to set the logging level, to keep Dataflow Runner from being too noisy. You can set it for any command with the --log-level flag. Supported log levels are
As an example, we could run:
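A sketch of such an invocation, under several assumptions: only the `--log-level` flag comes from this post. The `run-transient` subcommand, the `--emr-config` and `--emr-playbook` flags, the placement of `--log-level` before the subcommand, the level name `error` and the file paths are all illustrative and should be checked against the tool's help output.

```shell
# Hypothetical paths and level name; only --log-level is taken from this post.
# Subcommand and file flags are assumptions, verify with ./dataflow-runner --help
./dataflow-runner --log-level error run-transient \
  --emr-config ./configs/cluster.json \
  --emr-playbook ./configs/playbook.json
```

With a stricter level such as this one, routine progress messages would be suppressed and only problems reported.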
4. Other updates
Dataflow Runner 0.2.0 also brings a couple of other changes under the hood:
- It is built against Go 1.8 (issue #13)
- To increase test coverage, we adopted the excellent built-in EMR mocking capabilities of the Go AWS SDK (issue #10)
The major long-term goal for Dataflow Runner is still to support multiple cloud providers, through services such as Google Cloud Dataproc or Azure HDInsight.
In the shorter term, we’ve also started a discussion around finding ways to react to step failures; this is the only remaining feature for Dataflow Runner to reach feature parity with EmrEtlRunner (see issue #15).
If you have other features in mind, feel free to log an issue in the GitHub repository.
You can check out the repository if you’d like to get involved! In particular, any preparatory work getting other cloud providers integrated would be much appreciated.