Dataflow Runner 0.4.0 released

16 February 2018  •  Ben Fradet

We are pleased to announce version 0.4.0 of Dataflow Runner, our cloud-agnostic tool to create batch-processing clusters and run jobflows. This small release is centered around usability improvements.

In this post, we will cover:

  1. Fetching logs for failed steps
  2. Reducing logging noise
  3. Roadmap
  4. Contributing

1. Fetching logs for failed steps

When using the run-transient or run commands, you can now retrieve the logs produced by any failed steps through the new --log-failed-steps flag.

In the following example, we run a playbook containing a couple of S3DistCp steps against an existing cluster with the following command:

./dataflow-runner run --emr-playbook playbook.json --emr-cluster j-123 --log-failed-steps
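
For reference, here is a minimal sketch of what playbook.json might look like, following Dataflow Runner's JSON playbook format (the schema version, region, bucket names and paths below are illustrative placeholders, not taken from a real setup):

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
  "data": {
    "region": "us-east-1",
    "credentials": {
      "accessKeyId": "env",
      "secretAccessKey": "env"
    },
    "steps": [
      {
        "type": "CUSTOM_JAR",
        "name": "S3DistCp step: raw S3 to HDFS",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
        "arguments": [
          "--src", "s3n://my-bucket/raw/",
          "--dest", "hdfs:///local/raw/"
        ]
      }
    ],
    "tags": []
  }
}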

Unfortunately, one of the steps fails. However, thanks to the --log-failed-steps flag, we can review its logs directly, without having to dig through the S3 bucket where EMR stores them:

ERRO[0004] Step 'step' with id 'step-id' was FAILED
ERRO[0004] Content of log file 'stderr.gz':
ERRO[0004] Exception in thread "main" java.lang.RuntimeException: Error running job
    at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:927)
    at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:705)
    ...

The log files for every step that ended up in the FAILED state will be printed out. Those log files would otherwise have to be retrieved from S3, at a path following the pattern s3://my-bucket/emr-logs/j-123/steps/s-123/, where:

  • s3://my-bucket/emr-logs is the log URI you specified when launching the cluster
  • j-123 is the cluster ID
  • s-123 is the failed step ID
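
For comparison, inspecting those logs by hand would mean downloading them from S3 yourself, for example with the AWS CLI (using the placeholder bucket, cluster ID and step ID from above):

aws s3 ls s3://my-bucket/emr-logs/j-123/steps/s-123/
aws s3 cp s3://my-bucket/emr-logs/j-123/steps/s-123/stderr.gz - | gunzip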

2. Reducing logging noise

We have also reduced the noisiness of our logging: each jobflow step now produces a single informational line over the lifetime of the cluster, reporting its final status, i.e. whether it completed successfully, was cancelled or failed.

This contrasts with the previous behavior, where Dataflow Runner would log the status of every finished step, successful or not, every fifteen seconds.
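
The underlying pattern is simple: keep polling on the same schedule, but remember which steps have already been reported and only emit a line when a step first reaches a terminal state. Here is a minimal sketch of that pattern in Go, the language Dataflow Runner is written in (this is illustrative only, not the project's actual source):

package main

import (
    "fmt"
    "time"
)

// pollSteps stands in for a call to the EMR API returning each
// step's current state, e.g. "RUNNING", "COMPLETED", "FAILED".
func pollSteps() map[string]string {
    return map[string]string{"step-1": "COMPLETED", "step-2": "FAILED"}
}

// terminal reports whether a step has reached a final state.
func terminal(state string) bool {
    return state == "COMPLETED" || state == "CANCELLED" || state == "FAILED"
}

func main() {
    logged := make(map[string]bool) // steps already reported

    // Poll every fifteen seconds, as before, but log each step
    // only once, when it first reaches a terminal state.
    for range time.Tick(15 * time.Second) {
        for step, state := range pollSteps() {
            if terminal(state) && !logged[step] {
                fmt.Printf("Step '%s' ended in state %s\n", step, state)
                logged[step] = true
            }
        }
    }
}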

3. Roadmap

Dataflow Runner continues to evolve at Snowplow.

As we stated in the blog post for the previous release, we are committed to supporting other cloud “big data services” such as Azure HDInsight (see issue #22) and Google Cloud Dataproc (see issue #33).

If you have other features in mind, feel free to log an issue in the GitHub repository.

4. Contributing

You can check out the repository if you’d like to get involved!

In particular, any help integrating other big data services such as HDInsight or Cloud Dataproc would be much appreciated.