Dataflow Runner 0.3.0 released

30 May 2017  •  Ben Fradet

We are pleased to announce version 0.3.0 of Dataflow Runner, our cloud-agnostic tool to create clusters and run jobflows. This release is centered around new features and usability improvements.

In this post, we will cover:

  1. Preventing overlapping job runs through locks
  2. Tagging playbooks
  3. New template functions
  4. Other updates
  5. Roadmap
  6. Contributing

1. Preventing overlapping job runs through locks

This release introduces a mechanism to prevent two jobs from running at the same time. This is useful if, for example, you have an ETL process that needs to run as a singleton, or multiple jobs that each need exclusive access to the same database.

With this feature, Dataflow Runner acquires a lock before starting the job. The lock is released when:

  • the job has terminated (whether successfully or with failure) with the --softLock flag
  • the job has succeeded with the --lock flag (“hard lock”)

As the above implies, if a job run with the --lock flag fails, the lock will have to be cleaned up manually.

Two strategies for storing the lock have been made available: local and distributed.

1.1 Local lock

You can leverage a local lock when launching your playbook with ./dataflow-runner run using:

./dataflow-runner run            \
  --emr-playbook playbook.json   \
  --emr-cluster  j-21V4W2CSLYUCU \
  --lock         path/to/lock

This prevents anyone on this machine from running another playbook using path/to/lock as lock.

For example, launching the following while the steps above are running:

./dataflow-runner run           \
  --emr-playbook playbook.json  \
  --emr-cluster  j-KJC0LSX73BSF \
  --lock         path/to/lock

fails with:

WARN[0000] lock already held at path/to/lock

You can set the lock name as appropriate to set up locks across different playbooks, job names and/or cluster IDs.

In a local context, the lock will be materialized by a file at the specified path which can be relative or absolute. In case of a relative path, it will be relative to your current working directory.
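Because the local lock is just a file, a stale hard lock left behind by a failed run can be cleared by deleting that file. The underlying pattern can be sketched in plain shell (the path is illustrative; this mimics, rather than reuses, Dataflow Runner's own implementation):

```shell
#!/bin/sh
# Sketch of a file-based lock, analogous to Dataflow Runner's local lock.
mkdir -p path/to
lock=path/to/lock

# Acquire: with noclobber set, redirection fails if the file already exists,
# so creation doubles as an atomic existence check.
if ( set -o noclobber; : > "$lock" ) 2>/dev/null; then
  echo "lock acquired"
fi

# A second acquisition attempt while the lock is held fails.
( set -o noclobber; : > "$lock" ) 2>/dev/null || echo "lock already held at $lock"

# Release, or clean up a stale hard lock after a failed --lock run.
rm "$lock"
```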

1.2 Distributed lock

Another strategy is to leverage Consul to enforce a distributed lock:

./dataflow-runner run            \
  --emr-playbook playbook.json   \
  --emr-cluster  j-21V4W2CSLYUCU \
  --lock         path/to/lock    \
  --consul       127.0.0.1:8500

That way, anyone using path/to/lock as lock and this Consul server will have to respect the lock.

In a distributed context, the lock will be materialized by a key-value pair in Consul, the key being at the specified path.

2. Tagging playbooks

Much like cluster configurations, which can already be tagged, version 0.3.0 introduces the ability to tag playbooks.

As an example, we could have the following playbook.json file:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
  "data": {
    "region": "us-east-1",
    "credentials": {
      "accessKeyId": "env",
      "secretAccessKey": "env"
    },
    "steps": [
      {
        "type": "CUSTOM_JAR",
        "name": "Combine Months",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
        "arguments": [
          "--src",
          "s3n://my-output-bucket/enriched/bad/",
          "--dest",
          "hdfs:///local/monthly/"
        ]
      }
    ],
    "tags": [
      {
        "key": "environment",
        "value": "production"
      }
    ]
  }
}

However, unlike cluster configuration tags, which actually tag the EMR cluster, playbook tags have no effect in EMR.

Note that, compared with version 0.2.0 of Dataflow Runner, the playbook schema version has changed to 1-0-1. Since 1-0-1 is fully backward compatible, if you do not wish to use the tags introduced in this release you do not have to change anything.

The up-to-date playbook schema can be found on GitHub.

3. New template functions

In addition to the already existing nowWithFormat and systemEnv, the 0.3.0 release brings three new template functions: timeWithFormat, base64, and base64File.

3.1 timeWithFormat

Similarly to nowWithFormat, timeWithFormat [time] [format] formats the specified Unix time according to the format argument.

As an example, if we have the following in our cluster.json:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1",
  "data": {
    "name": "dataflow-runner {{timeWithFormat "1495622024" "Mon Jan _2 15:04:05 2006"}}",
    // omitted for brevity
  }
}

it results in:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1",
  "data": {
    "name": "dataflow-runner Wed May 24 10:33:44 2017",
    // omitted for brevity
  }
}
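The format argument follows Go's reference-time layout ("Mon Jan _2 15:04:05 2006"). Assuming the function formats in UTC, which matches the example output above, the same result can be cross-checked from the shell with GNU date:

```shell
# Format the Unix time 1495622024 the way the timeWithFormat example does.
# Requires GNU date (Linux); the -d @epoch syntax is not portable to BSD/macOS.
date -u -d @1495622024 '+%a %b %e %H:%M:%S %Y'
# Wed May 24 10:33:44 2017
```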

3.2 base64

As its name implies, the base64 template function will encode the argument using base 64 encoding.

For example:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1",
  "data": {
    "name": "{{base64 "dataflow-runner"}}",
    // omitted for brevity
  }
}

results in:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1",
  "data": {
    "name": "ZGF0YWZsb3ctcnVubmVy",
    // omitted for brevity
  }
}
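The output can be verified with the standard base64 tool; printf '%s' is used so that no trailing newline is appended, which would change the encoding:

```shell
# Base64-encode the exact string "dataflow-runner", with no trailing newline.
printf '%s' 'dataflow-runner' | base64
# ZGF0YWZsb3ctcnVubmVy
```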

3.3 base64File

Building on base64, base64File will encode the contents of the file passed as argument.

Let’s say we have a playbook-name.txt file containing:

dataflow-runner

The following cluster.json:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1",
  "data": {
    "name": "{{base64File "playbook-name.txt"}}",
    // omitted for brevity
  }
}

results in:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1",
  "data": {
    "name": "ZGF0YWZsb3ctcnVubmVyCg==",
    // omitted for brevity
  }
}
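Note that this encoding ends in Cg==, unlike the plain base64 example above: the file contains a trailing newline, and Cg== is the base64 encoding of that newline byte. This can be reproduced from the shell:

```shell
# Recreate the file, including the trailing newline most editors add,
# and encode its contents as base64File would.
printf 'dataflow-runner\n' > playbook-name.txt
base64 < playbook-name.txt
# ZGF0YWZsb3ctcnVubmVyCg==
```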

4. Other updates

Some changes have been made to improve usability regarding missing template variables:

4.1 Short-circuit execution on unset template variable

Prior to 0.3.0, if you forgot to specify a template variable, the string <no value> would be substituted into the template.

For example, launching an EMR cluster with the following cluster.json configuration:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1",
  "data": {
    "name": "{{.name}} - {{.owner}}",
    // omitted for brevity
  }
}

with:

./dataflow-runner up         \
  --emr-config config.json   \
  --vars       name,snowplow

would have resulted in an EMR cluster named: snowplow - <no value>.

With 0.3.0, a missing template variable causes execution to fail with the following error:

FATA[0000] template: cluster.json: executing "cluster.json" at <.owner>: map has no entry for key "owner"

4.2 Short-circuit execution on unset environment variable

In the same vein, referring to an unset environment variable in a template through systemEnv will result in an error instead of an empty string.

Let’s say that we have the following cluster.json:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-1",
  "data": {
    "name": "{{systemEnv "CLUSTER_NAME"}}",
    // omitted for brevity
  }
}

Before 0.3.0, launching the cluster with the CLUSTER_NAME environment variable unset would have resulted in an EMR cluster with no name. Now, it will result in the following error:

FATA[0000] template: cluster.json: executing "cluster.json" at <systemEnv "CLUSTER_NAME">: error calling systemEnv: environment variable CLUSTER_NAME not set
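This fail-fast behaviour is analogous to the shell's own ${VAR:?message} expansion, which aborts with an error when a variable is unset instead of silently expanding to an empty string:

```shell
# Fail fast on an unset variable, analogous to the new systemEnv behaviour.
unset CLUSTER_NAME
if ( : "${CLUSTER_NAME:?environment variable CLUSTER_NAME not set}" ) 2>/dev/null; then
  echo "CLUSTER_NAME is set"
else
  echo "aborted: CLUSTER_NAME not set"
fi
# aborted: CLUSTER_NAME not set
```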

5. Roadmap

As we stated in the blog post for the previous release, we are committed to supporting other clouds such as Azure HDInsight (see issue #22) and Google Cloud Dataproc.

If you have other features in mind, feel free to log an issue in the GitHub repository.

6. Contributing

You can check out the repository if you’d like to get involved! In particular, any preparatory work getting other cloud providers integrated would be much appreciated.