Dataflow Runner 0.2.0 released

Building on the initial release of Dataflow Runner last month, we are proud to announce version 0.2.0, aiming to bring Dataflow Runner up to feature parity with our long-standing EmrEtlRunner application.

As a quick reminder, Dataflow Runner is a cloud-agnostic tool to create clusters and run jobflows which, for the moment, only supports AWS EMR.

If you need a refresher on the rationale behind Dataflow Runner, feel free to check out the RFC on the subject.

In the rest of this post, we will cover:

  1. Support for EMR Applications
  2. Support for Elastic Block Store
  3. Configurable logging level
  4. Other updates
  5. Roadmap
  6. Contributing

1. Support for EMR Applications

EMR Applications are a way to tell EMR what you want installed on your cluster when you launch it. There are various big data applications to choose from such as Flink, Spark or Hive.

As we’re moving the batch pipeline away from Scalding to Spark, as detailed in the Spark RFC, the need to support EMR Applications in Dataflow Runner became apparent, since Spark is not installed by default when launching an EMR cluster.

To specify which applications you want installed on your EMR cluster, you just have to add a JSON array to your cluster configuration as shown:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data": {
    "name": "dataflow-runner - cluster name",
    // omitted for brevity
    "applications": [ "Hadoop", "Spark" ]
  }
}

Note that, compared with version 0.1.0 of Dataflow Runner, the Avro schema version has been changed to 1-1-0. The schema itself has been updated to reflect the improvements made in version 0.2.0 of Dataflow Runner. However, since the two schemas are fully backward-compatible, you do not have to change anything if you do not wish to use the new features introduced in this release. You can find the up-to-date schema on GitHub.

You can also find a full example of a cluster configuration on GitHub.
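To make the shape of the new field concrete, here is a minimal Go sketch that reads the applications array out of a cluster configuration. The `clusterConfig` type and `parseApplications` function are hypothetical names chosen for illustration, not Dataflow Runner's actual code, and the sketch uses plain JSON decoding rather than the Avro schema validation the tool relies on.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clusterConfig mirrors only the fields shown in this post.
// The type and its layout are assumptions made for this example.
type clusterConfig struct {
	Schema string `json:"schema"`
	Data   struct {
		Name         string   `json:"name"`
		Applications []string `json:"applications"`
	} `json:"data"`
}

// parseApplications returns the applications listed in a cluster
// configuration, or an empty slice when the field is absent.
func parseApplications(raw []byte) ([]string, error) {
	var cfg clusterConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("invalid cluster config: %w", err)
	}
	return cfg.Data.Applications, nil
}

func main() {
	raw := []byte(`{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data": {
    "name": "dataflow-runner - cluster name",
    "applications": [ "Hadoop", "Spark" ]
  }
}`)
	apps, err := parseApplications(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(apps)
}
```

A configuration without an "applications" key decodes to an empty list in this sketch, which is one way to picture why older configurations keep working unchanged.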

2. Support for Elastic Block Store

Another feature which was recently added to EmrEtlRunner in Snowplow Chichen Itza is support for Elastic Block Store (EBS for short). We wanted to support this in Dataflow Runner as well.

In 0.2.0, you’re now able to specify an EBS volume for each instance in your EMR cluster, be it master, core or task instances. To do so, you’ll need to modify the EC2 instances part of your cluster configuration file and add the desired EBS configurations, as in the following example:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data": {
    "name": "dataflow-runner - cluster name",
    // omitted for brevity
    "ec2": {
      // omitted for brevity
      "instances": {
        "master": {
          "type": "m1.medium",
          "ebsConfiguration": {
            "ebsOptimized": true,
            "ebsBlockDeviceConfigs": [
              {
                "volumesPerInstance": 1,
                "volumeSpecification": {
                  "iops": 3000,
                  "sizeInGB": 10,
                  "volumeType": "io1"
                }
              }
            ]
          }
        },
        "core": {
          "type": "m1.medium",
          "count": 3,
          "ebsConfiguration": {
            "ebsOptimized": true,
            "ebsBlockDeviceConfigs": [
              {
                "volumesPerInstance": 1,
                "volumeSpecification": {
                  "iops": 5000,
                  "sizeInGB": 20,
                  "volumeType": "io1"
                }
              }
            ]
          }
        },
        "task": {
          "type": "m1.medium",
          "count": 3,
          "bid": "0.015",
          "ebsConfiguration": {
            "ebsOptimized": true,
            "ebsBlockDeviceConfigs": [
              {
                "volumesPerInstance": 1,
                "volumeSpecification": {
                  "iops": 5000,
                  "sizeInGB": 10,
                  "volumeType": "io1"
                }
              }
            ]
          }
        }
      }
    }
  }
}

Again, you can also refer to the cluster configuration example on GitHub for details.

3. Configurable logging level

We’ve also added a little option to set the logging level to keep Dataflow Runner from being too noisy. You can set it for any dataflow-runner command with the --log-level flag. Supported log levels are debug, info, warning, error, fatal and panic.

As an example, we could run:

> dataflow-runner up --emr-config emr-config.json --log-level fatal
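For readers unfamiliar with level-based logging, the following Go sketch shows the usual threshold behaviour: a message is emitted only if its level is at least as severe as the configured one. This is an assumption about how the flag behaves, based on common logging conventions, and `levelRank` and `shouldLog` are hypothetical helpers rather than Dataflow Runner functions.

```go
package main

import (
	"fmt"
	"strings"
)

// levels lists the supported names from most verbose to most severe.
var levels = []string{"debug", "info", "warning", "error", "fatal", "panic"}

// levelRank returns the position of a level name in levels,
// or an error if the name is not supported.
func levelRank(name string) (int, error) {
	for i, l := range levels {
		if l == strings.ToLower(name) {
			return i, nil
		}
	}
	return 0, fmt.Errorf("unsupported log level %q", name)
}

// shouldLog reports whether a message at msgLevel is emitted when
// the logger is configured with the given threshold level.
func shouldLog(configured, msgLevel string) (bool, error) {
	c, err := levelRank(configured)
	if err != nil {
		return false, err
	}
	m, err := levelRank(msgLevel)
	if err != nil {
		return false, err
	}
	return m >= c, nil
}

func main() {
	for _, l := range levels {
		ok, err := shouldLog("fatal", l)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-7s emitted=%v\n", l, ok)
	}
}
```

Under this model, a threshold of fatal lets only fatal and panic messages through, which matches the goal of keeping the tool quiet.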

4. Other updates

Dataflow Runner 0.2.0 also brings a couple of other changes under the hood:

  • It is built against Go 1.8 (issue #13)
  • To increase test coverage, we adopted the excellent built-in EMR mocking capabilities of the Go AWS SDK (issue #10)
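The general idea behind this kind of mocking is to have code depend on an interface rather than on a concrete client, so that tests can substitute a fake. The AWS SDK for Go provides an `emriface` package with an `EMRAPI` interface for that purpose; to stay free of external dependencies, the sketch below uses a small hypothetical `jobFlowAPI` interface instead, and neither `mockEMR` nor `down` is taken from Dataflow Runner's code.

```go
package main

import (
	"errors"
	"fmt"
)

// jobFlowAPI is a hypothetical stand-in for the subset of an EMR
// client that the code under test needs.
type jobFlowAPI interface {
	TerminateJobFlow(id string) error
}

// mockEMR records calls instead of contacting AWS.
type mockEMR struct {
	terminated []string
	fail       bool
}

// TerminateJobFlow remembers the requested id, or fails on demand.
func (m *mockEMR) TerminateJobFlow(id string) error {
	if m.fail {
		return errors.New("mock failure")
	}
	m.terminated = append(m.terminated, id)
	return nil
}

// down is the code under test: it only knows about the interface,
// so it works with a real client or a mock alike.
func down(api jobFlowAPI, id string) error {
	if id == "" {
		return errors.New("empty job flow id")
	}
	return api.TerminateJobFlow(id)
}

func main() {
	mock := &mockEMR{}
	// "j-EXAMPLE" is a made-up identifier for illustration.
	if err := down(mock, "j-EXAMPLE"); err != nil {
		panic(err)
	}
	fmt.Println(mock.terminated)
}
```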

5. Roadmap

The major long-term goal for Dataflow Runner is still to support multiple cloud providers such as Google Cloud Dataproc or Azure HDInsight.

In the shorter term, we’ve also started a discussion around finding ways to react to step failures; this is the only remaining feature for Dataflow Runner to reach feature parity with EmrEtlRunner (see issue #15).

If you have other features in mind, feel free to log an issue in the GitHub repository.

6. Contributing

You can check out the repository if you’d like to get involved! In particular, any preparatory work getting other cloud providers integrated would be much appreciated.

Thoughts or questions? Come join us in our Discourse forum!

Ben Fradet

Ben is a data engineer at Snowplow. You can find him on GitHub, Twitter and LinkedIn.