Snowplow 0.9.3 released with Clojure Collector fixes

21 May 2014  •  Alex Dean

We are pleased to announce the release of Snowplow 0.9.3, with a whole host of incremental improvements to EmrEtlRunner, plus two important bug fixes for Clojure Collector users.

The first Clojure Collector issue was a problem in the file move functionality in EmrEtlRunner, which was preventing Clojure Collector users from scaling beyond a single instance without data loss. Many thanks to community members Derk Busser and Ryan Doherty for identifying the issue and working with us on a fix.

The second Clojure Collector issue involved the IP address(es) of Elastic Beanstalk’s Apache proxy showing up in the atomic.events table in place of the expected end-user IPs. We were unable to reproduce this issue when running multiple instances, so we do not believe it is as widespread as the first.

This release also includes some documentation updates - many thanks to community members Arthur Cinader and Peter Vandenberk for these.

Read on below the fold for:

  1. Clojure Collector fixes
  2. EmrEtlRunner enhancements
  3. Upgrading
  4. Getting help

Both Clojure Collector issues related to running multiple instances of the Collector:

  1. Each Elastic Beanstalk Collector instance was writing its raw event files with the same timestamp-based names as every other instance. Because EmrEtlRunner’s file move functionality was not preserving sub-folder names when moving raw event files into the processing directory, identically named files from different instances collided and all but one instance’s raw events were lost. So a user with three instances would only see ~33% of their data (issue #717).
  2. We had a report from community member Iain Gray in this email thread that he was seeing the IP address(es) of the Elastic Beanstalk Apache proxy showing up in his atomic.events table in place of the expected end-user IPs (issue #719).

We fixed the first issue with an update to EmrEtlRunner: each raw event file’s parent sub-directory is now prepended to the filename during the move to the Processing bucket. So for example:

// Raw files from two instances with same filename
raw-bucket/resources/environments/logs/publish/e-bgp9nsynv7/i-13aabd52/_var_log_tomcat7_localhost_access_log.txt-1400605261.gz
raw-bucket/resources/environments/logs/publish/e-bgp9nsynv7/i-ec19d9af/_var_log_tomcat7_localhost_access_log.txt-1400605261.gz

// -> are renamed safely in processing with the sub-folder name
processing-bucket/snplow2/processing/i-13aabd52-_var_log_tomcat7_localhost_access_log.txt-1400605261.gz
processing-bucket/snplow2/processing/i-ec19d9af-_var_log_tomcat7_localhost_access_log.txt-1400605261.gz
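The renaming rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not EmrEtlRunner’s actual Ruby code: the function name processing_key and the way the processing prefix is passed in are assumptions made for this example.

```python
import posixpath


def processing_key(raw_key: str, processing_prefix: str) -> str:
    """Build the destination key for one raw event file.

    The parent sub-folder (the instance ID in the listing above) is
    prepended to the filename so that files from different instances
    can no longer collide. Hypothetical helper, for illustration only.
    """
    parent_dir, filename = posixpath.split(raw_key)
    instance_dir = posixpath.basename(parent_dir)
    return posixpath.join(processing_prefix, f"{instance_dir}-{filename}")


raw_keys = [
    "resources/environments/logs/publish/e-bgp9nsynv7/i-13aabd52/"
    "_var_log_tomcat7_localhost_access_log.txt-1400605261.gz",
    "resources/environments/logs/publish/e-bgp9nsynv7/i-ec19d9af/"
    "_var_log_tomcat7_localhost_access_log.txt-1400605261.gz",
]

# Old behaviour: only the bare filename survives, so both files share one name.
old_names = {posixpath.basename(key) for key in raw_keys}

# New behaviour: the instance sub-folder keeps the two names distinct.
new_names = {processing_key(key, "snplow2/processing") for key in raw_keys}

print(len(old_names), "distinct name(s) before the fix")
print(len(new_names), "distinct name(s) after the fix")
```

With the two raw files from the listing, the old scheme yields a single destination name while the new scheme yields two, which is why data from every instance now survives the move.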

We were unable to reproduce Iain’s bug (issue #719) ourselves with a multiple-instance Clojure Collector setup, but we came up with a likely fix and applied this as part of this release. This fix bumps the Snowplow Clojure Collector version to 0.6.0.

We have made a variety of other bug fixes and small improvements to EmrEtlRunner in this release. Most importantly, the bug fixes:

  1. We fixed an issue where EmrEtlRunner was kicking off the job on EMR even if there were no raw event files loaded into the :processing: bucket (#409)
  2. We fixed a bug where it was not possible to disable Cascading’s catching of unexpected exceptions by leaving the :out_errors: bucket blank (#721)
  3. We fixed a very tricky issue where the threads used to move files could be killed off one-by-one if they encountered sub-directories whilst processing (#401)
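
To illustrate the first of these fixes (#409), here is a minimal Python sketch of a pre-flight check that skips the EMR job when nothing has been staged. It is not EmrEtlRunner’s actual Ruby code: the function name has_files_to_process and the convention that sub-directory placeholder keys end in "/" are assumptions made for this example.

```python
from typing import Iterable


def has_files_to_process(staged_keys: Iterable[str]) -> bool:
    """Return True if at least one staged key is a file.

    Keys ending in "/" are treated as sub-directory placeholders
    (an assumption for this sketch) and do not count as files.
    """
    return any(not key.endswith("/") for key in staged_keys)


examples = (
    [],
    ["snplow2/processing/"],
    ["snplow2/processing/i-13aabd52-access_log.gz"],
)
for staged in examples:
    action = "start EMR job" if has_files_to_process(staged) else "skip this run"
    print(f"{len(staged)} key(s) staged: {action}")
```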

We have also made some changes that affect the format of the config.yml file, so please read these carefully:

  • We replaced :hadoop_version: with :ami_version:, because the Hadoop version on EMR is now determined exclusively by the AMI setting (#701)
  • We added in a compulsory new :region: field alongside the existing :placement: and :ec2_subnet_id: fields, and clarified when to use each (#754)
  • We updated EmrEtlRunner to use proper Ruby logging, configurable with a new :logging: section (#194)

You can find the updated config.yml template here. We cover upgrading EmrEtlRunner and your existing config.yml in the Upgrading section below.

Finally, we made a few under-the-covers improvements to EmrEtlRunner, which should help with maintainability and ease-of-deployment going forwards:

  • We added some initial unit tests (#672)
  • We added Ruby contracts onto all function signatures (#392)
  • We added the ability to bundle EmrEtlRunner as a JRuby fat jar (#674)
  • We re-architected EmrEtlRunner so that it’s embeddable in other applications (#128)

Expect to hear more about JRuby and embedding EmrEtlRunner in the coming months as we bring some of these improvements to StorageLoader too.

Upgrading is a two-step process:

  1. Update EmrEtlRunner
  2. Update Clojure Collector - optional

Let’s take these in turn:

You need to update EmrEtlRunner to the latest code (version 0.7.0, in the Snowplow 0.9.3 release) on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.9.3
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment

You also need to update your EmrEtlRunner’s config.yml file in a few places. First add a logging section at the top:

:logging:
  :level: DEBUG # You can optionally switch to INFO for production

Next you need to replace this:

:emr:
  :hadoop_version: 1.0.3

with this:

:emr:
  :ami_version: 2.4.2

If you need to use a different Hadoop version, check out this handy table to determine the correct AMI version.

Finally, add the region in:

:emr:
  :ami_version: 2.4.2
  :region: us-east-1 # Or your region

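Your :region: will be your existing :placement: without the final character, which identifies the availability zone: for example, a :placement: of us-east-1a gives a :region: of us-east-1. Note that if you are running your EMR job in an EC2 subnet, you no longer need to set the :placement: field.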
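
If you manage several configurations, the relationship between :placement: and :region: can be sketched in Python as below. The helper name region_from_placement is hypothetical, and the pattern assumes the usual AWS naming scheme in which an availability zone is a region name followed by a single lowercase letter.

```python
import re

# Assumed shape of an availability zone: a region such as "us-east-1"
# followed by exactly one lowercase letter, e.g. "us-east-1a".
_ZONE_PATTERN = re.compile(r"([a-z]{2}(?:-[a-z]+)+-\d+)[a-z]")


def region_from_placement(placement: str) -> str:
    """Derive the :region: value from an existing :placement: value."""
    match = _ZONE_PATTERN.fullmatch(placement)
    if match is None:
        raise ValueError(f"not an availability zone: {placement!r}")
    return match.group(1)


print(region_from_placement("us-east-1a"))
```

A value that is already a region, such as us-east-1, is rejected with a ValueError rather than silently truncated.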

Once you have made these changes, do check your final version against the updated config.yml template here.
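
As a rough aid for that check, the Python sketch below flags the three changes described above in a configuration that has already been loaded into a dictionary. It is not part of EmrEtlRunner: the function name config_problems is hypothetical, the keys are written without their leading Ruby-symbol colons for readability, and the template remains the authoritative reference for everything else.

```python
def config_problems(config: dict) -> list:
    """List the 0.9.3 config.yml changes that still need to be made."""
    problems = []
    if "logging" not in config:
        problems.append("add a :logging: section")
    emr = config.get("emr", {})
    if "hadoop_version" in emr:
        problems.append("replace :hadoop_version: with :ami_version:")
    elif "ami_version" not in emr:
        problems.append("add :ami_version:")
    if "region" not in emr:
        problems.append("add :region:")
    return problems


# Only the fields discussed in this post are shown in these examples.
old_config = {"emr": {"hadoop_version": "1.0.3", "placement": "us-east-1a"}}
new_config = {
    "logging": {"level": "DEBUG"},
    "emr": {"ami_version": "2.4.2", "region": "us-east-1"},
}

for name, config in (("old", old_config), ("new", new_config)):
    print(name, config_problems(config))
```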

This release bumps the Clojure Collector to version 0.6.0. Upgrading to this release is only necessary if you have been encountering the issue with proxy IPs appearing in atomic.events, as discussed in this email thread (issue #719).

To upgrade to this release:

  1. Download the new warfile by right-clicking on this link and selecting “Save As…”
  2. Log in to your Amazon Elastic Beanstalk console
  3. Browse to your Clojure Collector’s application
  4. Click “Upload New Version” and upload your warfile

And that’s it!

As always, if you do run into any issues or don’t understand any of the above changes, please raise an issue or get in touch with us via the usual channels.

For more details on this release, please check out the 0.9.3 Release Notes on GitHub.