We are pleased to announce the release of Snowplow 0.9.3, with a whole host of incremental improvements to EmrEtlRunner, plus two important bug fixes for Clojure Collector users.
The first Clojure Collector issue was a problem in the file move functionality in EmrEtlRunner, which was preventing Clojure Collector users from scaling beyond a single instance without data loss. Many thanks to community members Derk Busser and Ryan Doherty for identifying the issue and working with us on a fix.
The second Clojure Collector issue involved the IP address(es) of Elastic Beanstalk’s Apache proxy showing up in the atomic.events table in place of the expected end-user’s IPs. We were unable to reproduce this issue when running multiple instances, so we do not believe this problem is as widespread.
Both Clojure Collector issues related to running multiple instances of the Collector:
- Each Elastic Beanstalk Collector instance was writing its raw event files using the same timestamp-based filenames as every other instance. Because EmrEtlRunner’s file move functionality was not preserving sub-folder names when moving raw event files into the processing directory, files from different instances overwrote one another and all but one instance’s raw events were lost. So a user with three instances would only see ~33% of their data (issue #717)
- We had a report from community member Iain Gray in this email thread that he was seeing the Elastic Beanstalk Apache proxy’s IP address(es) showing up in his atomic.events table in place of the expected end-user’s IPs (issue #719)
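To make the data loss in issue #717 concrete, here is a minimal Python sketch of what happens when files from several instances are moved into one flat directory by filename alone. The instance folder names and filenames are invented for illustration, and the function is a simulation, not EmrEtlRunner’s actual code.

```python
# Hypothetical raw-event keys: three collector instances (folder names are
# made up) each write a file with the same timestamp-based name.
raw_keys = [
    "i-aaaa1111/access_log.2014-05-01-10.gz",
    "i-bbbb2222/access_log.2014-05-01-10.gz",
    "i-cccc3333/access_log.2014-05-01-10.gz",
]


def move_dropping_subfolder(keys):
    """Simulate a move that keeps only the base filename of each key."""
    processing = {}
    for key in keys:
        filename = key.rsplit("/", 1)[-1]
        processing[filename] = key  # a later file overwrites an earlier one
    return processing


processing = move_dropping_subfolder(raw_keys)
share = len(processing) / len(raw_keys)
print(f"{len(processing)} of {len(raw_keys)} files survive ({share:.0%})")
```

With three instances, only one file remains in the simulated processing directory, which matches the ~33% figure above.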
We fixed the first issue with an update to EmrEtlRunner: each raw event file’s parent sub-directory is now prepended to the filename during the move to the Processing bucket, so files from different instances no longer overwrite one another.
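As a rough illustration of the renaming idea (not the actual EmrEtlRunner implementation, which is written in Ruby), the following Python sketch prepends the immediate parent folder to each filename. The `-` separator, the function name and the example keys are all assumptions made for this sketch.

```python
import posixpath


def flattened_name(key, separator="-"):
    """Return the filename with its immediate parent folder prepended.

    The separator is an assumption; the real tool may join the two
    parts differently.
    """
    parent, filename = posixpath.split(key)
    if not parent:
        return filename  # no sub-directory, nothing to prepend
    return f"{posixpath.basename(parent)}{separator}{filename}"


# Hypothetical keys from two collector instances with identical filenames.
keys = [
    "i-aaaa1111/access_log.2014-05-01-10.gz",
    "i-bbbb2222/access_log.2014-05-01-10.gz",
]
for key in keys:
    print(key, "->", flattened_name(key))
```

Because the parent folder differs per instance, the resulting names stay distinct even when the original filenames are identical.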
We were unable to reproduce Iain’s bug (issue #719) ourselves with a multiple-instance Clojure Collector setup, but we came up with a likely fix and applied this as part of this release. This fix bumps the Snowplow Clojure Collector version to 0.6.0.
We have made a variety of other bug fixes and small improvements to EmrEtlRunner in this release. Most importantly, the bug fixes:
- We fixed an issue where EmrEtlRunner was kicking off the job on EMR even if there were no raw event files loaded into the
- We fixed a bug where it was not possible to disable Cascading’s catching of unexpected exceptions by leaving the :out_errors: bucket blank (#721)
- We fixed a very tricky issue where the threads used to move files could be killed off one-by-one if they encountered sub-directories whilst processing (#401)
We have also made some changes which affect the format of the config.yml file, so do please read these carefully:
- We replaced the Hadoop version setting with :ami_version:, because the Hadoop version on EMR is now determined exclusively by the AMI setting (#701)
- We added a compulsory new :region: field alongside the existing :ec2_subnet_id: fields, and clarified when to use each (#754)
- We updated EmrEtlRunner to use proper Ruby logging, configurable with a new
You can find the updated config.yml template here. We cover upgrading EmrEtlRunner and your existing config.yml in the Upgrading section below.
Finally, we made a few under-the-covers improvements to EmrEtlRunner, which should help with maintainability and ease-of-deployment going forwards:
- We added some initial unit tests (#672)
- We added Ruby contracts onto all function signatures (#392)
- We added the ability to bundle EmrEtlRunner as a JRuby fat jar (#674)
- We re-architected EmrEtlRunner so that it’s embeddable in other applications (#128)
Expect to hear more about JRuby and embedding EmrEtlRunner in the coming months as we bring some of these improvements to StorageLoader too.
Upgrading is a two-step process. Let’s take the steps in turn:
You need to update EmrEtlRunner to the latest code (version 0.7.0, in the Snowplow 0.9.3 release) on GitHub:
You also need to update your EmrEtlRunner’s config.yml file in a few places. First add a logging section at the top:
Next you need to replace this:
If you need to use a different Hadoop version, check out this handy table to determine the correct AMI version.
Finally, add the region in:
:region: will be your existing :placement: without the character on the end. Note that if you are running your EMR job in an EC2 subnet, you no longer need to set the
Once you have made these changes, do check your final version against the updated config.yml template here.
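The relationship between :region: and :placement: described above can be sketched in a few lines of Python. This is only an illustration of the "drop the character on the end" rule; the function name and the example value are assumptions, and the check on the final character is an extra safeguard added for this sketch.

```python
def region_from_placement(placement):
    """Derive a region name by dropping the trailing availability-zone letter."""
    if not placement or not placement[-1].isalpha():
        raise ValueError(f"expected a value ending in a zone letter, got {placement!r}")
    return placement[:-1]


# Example placement value chosen for illustration only.
print(region_from_placement("us-east-1a"))
```

If your :placement: is, say, us-east-1a, the corresponding :region: would be us-east-1.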
This release bumps the Clojure Collector to version 0.6.0. Upgrading to this release is only necessary if you have been encountering the issue with proxy IPs appearing in atomic.events, as discussed in this email thread (issue #719).
To upgrade to this release:
- Download the new WAR file by right-clicking on this link and selecting “Save As…”
- Log in to your Amazon Elastic Beanstalk console
- Browse to your Clojure Collector’s application
- Click the “Upload New Version” button and upload your WAR file
And that’s it!
For more details on this release, please check out the 0.9.3 Release Notes on GitHub.