We are pleased to announce version 0.4.0 of Dataflow Runner, our cloud-agnostic tool to create batch-processing clusters and run jobflows. This small release is centered around usability improvements.
In this post, we will cover:

1. Fetching logs for failed steps
2. Reducing logging noise
1. Fetching logs for failed steps
When leveraging the `run` and `run-transient` commands, it is now possible to access the logs produced by any failed steps through the new `--log-failed-steps` flag.
In the following example, we launch a cluster to perform a couple of S3DistCp steps.
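As an illustration, such a launch could look like the following sketch. The config filenames are placeholders, and the `--emr-config` / `--emr-playbook` options are assumptions based on earlier releases, so consult the Dataflow Runner documentation for the exact invocation; only `--log-failed-steps` is the flag introduced in this release:

```bash
./dataflow-runner run-transient \
  --emr-config cluster.json \
  --emr-playbook playbook.json \
  --log-failed-steps
```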
Unfortunately, one of the steps failed to complete successfully. However, thanks to the
`--log-failed-steps` flag, we can review its logs without having to access the S3 bucket in which they are stored.
All of the log files for all of the steps which ended up in the
`FAILED` state will be printed out. Usually, those log files can be located in a bucket path conforming to the following pattern:

`s3://my-bucket/emr-logs/j-123/steps/s-123/`

where:

- `s3://my-bucket/emr-logs` is the log URI you filled out when launching the cluster
- `j-123` is the cluster ID
- `s-123` is the failed step ID
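To make the pattern concrete, here is a minimal sketch that assembles the S3 prefix for a failed step's logs from those three components. The helper name is ours for illustration, not part of Dataflow Runner:

```python
def failed_step_log_prefix(log_uri: str, cluster_id: str, step_id: str) -> str:
    """Join the EMR log URI, cluster ID and step ID into the S3 prefix
    under which the step's log files are written."""
    return f"{log_uri.rstrip('/')}/{cluster_id}/steps/{step_id}/"

print(failed_step_log_prefix("s3://my-bucket/emr-logs", "j-123", "s-123"))
# s3://my-bucket/emr-logs/j-123/steps/s-123/
```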
2. Reducing logging noise
We have also reduced the “noisiness” of our logging: each jobflow step now produces only one informational line over the lifetime of the cluster, reporting the step’s final status, i.e. whether it completed successfully, was cancelled or failed.
This is in contrast with the previous approach, where Dataflow Runner would output the status of every step, whether it had completed successfully or not, every fifteen seconds.
Dataflow Runner continues to evolve at Snowplow.
As we stated in the blog post for the previous release, we are committed to supporting other cloud “big data services” such as Azure HDInsight (see issue #22) and Google Cloud Dataproc (see issue #33).
If you have other features in mind, feel free to log an issue in the GitHub repository.
You can check out the repository if you’d like to get involved!
In particular, any help integrating other big data services such as HDInsight or Cloud Dataproc would be much appreciated.