The Snowplow EmrEtlRunner uses Rob Slifka’s Elasticity Ruby library to interact with the Elastic MapReduce API. AWS recently altered this API for new AWS users so that it is now based on clusters rather than job flows, breaking the API calls used by Elasticity to check the status of an EMR job.
Rob moved very quickly to put out a new Elasticity release (version 6.0.2) that uses the all-new EMR APIs. Thanks a lot, Rob!
For more information about Elasticity, check out Rob’s guest post from back in 2013.
The EmrEtlRunner is no longer limited to a single in bucket. You can now specify an array of in buckets in the configuration YAML, and raw event files from all of them will be moved to the processing bucket. This is helpful when upgrading your collector version: you can process events from your old and new collectors in tandem until all event traffic has moved to the new collector.
See the repository for an example configuration file.
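To make the behavior concrete, here is a minimal Python sketch of the idea, not EmrEtlRunner's actual Ruby code. The function name `stage_raw_events`, the bucket names, and the in-memory `listing` that stands in for S3 are all hypothetical; the sketch also assumes file names do not clash across in buckets.

```python
from typing import Dict, List, Tuple


def stage_raw_events(
    in_buckets: List[str],
    listing: Dict[str, List[str]],
    processing_bucket: str,
) -> List[Tuple[str, str]]:
    """Plan (source, destination) moves from every in bucket to one processing bucket.

    `listing` is a hypothetical stand-in for an S3 listing: bucket -> file names.
    """
    moves = []
    for bucket in in_buckets:
        for key in listing.get(bucket, []):
            # Every in bucket feeds the same processing bucket.
            moves.append((f"{bucket}/{key}", f"{processing_bucket}/{key}"))
    return moves


# Hypothetical buckets: one per collector during a migration.
listing = {
    "s3://old-collector-logs": ["events-a.gz"],
    "s3://new-collector-logs": ["events-b.gz"],
}
moves = stage_raw_events(list(listing), listing, "s3://etl-processing")
for source, destination in moves:
    print(source, "->", destination)
```

With both collectors' buckets listed, a single run picks up events from each, which is what makes the side-by-side migration possible.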
More recent versions of Scala Hadoop Enrich (1.0.0 and later) are stored in a different S3 bucket from previous versions. Unfortunately, our previous EmrEtlRunner release (0.15.0 in Release 66 Oriental Skylark) always looked in the new location, no matter what version of Hadoop Enrich was specified.
The new version of EmrEtlRunner decides where to look for the jar based on the jar’s version; this means that you can use the latest EmrEtlRunner version with earlier versions of Hadoop Enrich.
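The version check can be pictured roughly as follows. This is an illustrative Python sketch rather than the code EmrEtlRunner ships: the two location prefixes and the jar file name are placeholders we made up, and it assumes purely numeric dotted versions such as 0.14.1 or 1.0.0.

```python
from typing import Tuple

# Placeholder locations, NOT the real Snowplow asset paths.
OLD_JAR_PREFIX = "s3://example-assets/old-location"
NEW_JAR_PREFIX = "s3://example-assets/new-location"

# Per the text above, 1.0.0 and later live in the new location.
RELOCATION_VERSION = (1, 0, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '0.14.1' into (0, 14, 1) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def jar_location(version: str) -> str:
    """Pick the old or new location based on the Hadoop Enrich version."""
    if parse_version(version) >= RELOCATION_VERSION:
        prefix = NEW_JAR_PREFIX
    else:
        prefix = OLD_JAR_PREFIX
    # Hypothetical jar file name, for illustration only.
    return f"{prefix}/hadoop-enrich-{version}.jar"


print(jar_location("0.14.1"))
print(jar_location("1.0.0"))
```

Comparing parsed integer tuples rather than raw strings matters here: as strings, "0.9.0" would sort after "0.14.1".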
You need to update EmrEtlRunner to the latest version (0.16.0), which is available on GitHub.
For more details on this release, please check out the r68 Turquoise Jay release on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.