Snowplow 61 Pygmy Parrot released

02 March 2015  •  Alex Dean

We are pleased to announce the immediate availability of Snowplow 61, Pygmy Parrot.

This release has a variety of new features, operational enhancements and bug fixes. The major additions are:

  1. You can now parse Amazon CloudFront access logs using Snowplow
  2. The latest Clojure Collector version supports Tomcat 8 and CORS, ready for cross-domain POST from JavaScript and ActionScript
  3. EmrEtlRunner’s failure handling and Clojure Collector log handling have been improved

The rest of this post will cover the following topics:

  1. CloudFront access log parsing
  2. Clojure Collector updates
  3. Operational improvements to EmrEtlRunner
  4. Bug fixes in Scala Common Enrich
  5. Upgrading
  6. Getting help

1. CloudFront access log parsing

We have added the ability to parse Amazon CloudFront access logs (web distribution format only) to the Snowplow Hadoop-based pipeline.

If you use CloudFront as your CDN for web content, you can now use Snowplow to process your CloudFront access logs. Snowplow will enrich these logs with the user-agent, page URI fragments and geo-location as standard.

To process CloudFront access logs, first create a new EmrEtlRunner config.yml:

  1. Set your :raw:in: bucket to where your logs are written
  2. Set your :etl:collector_format: to tsv/com.amazon.aws.cloudfront/wd_access_log
  3. Provide new bucket paths and a new job name, to prevent this job from clashing with your existing Snowplow job(s)

If you are running the Snowplow batch (Hadoop) flow with Amazon Redshift, you should deploy the relevant event table into your Amazon Redshift database. You can find the table definition here:

You can either load these events into your existing atomic.events table or, if you prefer, load them into an all-new database or schema. If you load into your existing atomic.events table, make sure to schedule these loads so that they don’t clash with your existing loads.

2. Clojure Collector updates

We have updated the Clojure Collector to run using Tomcat 8, which is now the default Tomcat version when creating a new Tomcat application on Elastic Beanstalk.

As of this release the Clojure Collector supports CORS and the CORS equivalent for ActionScript; this will allow the Clojure Collector to support events being POSTed from new releases of the JavaScript and ActionScript Trackers, coming very soon.

We have also added the ability to disable the setting of third-party cookies altogether: simply configure the cookie duration to 0 and the Clojure Collector will not set its third-party cookie.

3. Operational improvements to EmrEtlRunner

We have made various operational improvements to EmrEtlRunner.

If there are no raw event files to process, EmrEtlRunner will now exit with a specific return code. This return code is then detected by snowplow-runner-and-loader.sh, which will then exit without failure. In other words: an absence of files to process no longer causes snowplow-runner-and-loader.sh to exit with failure.
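The wrapper-side behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the actual contents of snowplow-runner-and-loader.sh: the value of NO_DATA_CODE and the function name run_etl_step are assumptions made for the example, since this post does not state which return code EmrEtlRunner uses.

```shell
#!/bin/sh
# Hypothetical value: the post does not state which return code
# EmrEtlRunner uses to signal "no raw event files to process".
NO_DATA_CODE=3

# Run a command (standing in for the EmrEtlRunner invocation) and map
# the "nothing to process" code to success. Any other non-zero code is
# passed through unchanged so that real failures still fail the run.
run_etl_step() {
    ret=0
    "$@" || ret=$?
    if [ "$ret" -eq "$NO_DATA_CODE" ]; then
        echo "No raw event files to process; exiting without failure"
        return 0
    fi
    return "$ret"
}
```

For instance, `run_etl_step sh -c 'exit 3'` returns 0 under these assumptions, whereas `run_etl_step sh -c 'exit 1'` still returns 1.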

We have updated EmrEtlRunner’s handling of Clojure Collector logs. The logs now get renamed on move to:

basename.yyyy-MM-dd-HH.region.instance.txt.gz

This new filename format means that the raw logs will be archived to /yyyy-MM-dd sub-folders, just as the CloudFront Collector’s logs are.
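To make the naming scheme concrete, here is a small sketch that builds a filename in the format above and derives the matching /yyyy-MM-dd sub-folder from it. The function names, the sample values and the way the date is extracted are illustrative assumptions; EmrEtlRunner's own implementation is not shown here.

```shell
#!/bin/sh
# Build a filename of the form basename.yyyy-MM-dd-HH.region.instance.txt.gz
# Arguments: basename, timestamp (yyyy-MM-dd-HH), region, instance ID.
renamed_log_name() {
    printf '%s.%s.%s.%s.txt.gz\n' "$1" "$2" "$3" "$4"
}

# Extract the yyyy-MM-dd part of the timestamp embedded in such a
# filename, i.e. the name of the archive sub-folder it would fall under.
# Prints nothing if the filename does not contain a matching timestamp.
archive_subfolder() {
    printf '%s\n' "$1" |
        sed -n 's/.*\.\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\)-[0-9]\{2\}\..*/\1/p'
}
```

With hypothetical inputs, `renamed_log_name access_log 2015-03-02-14 eu-west-1 i-1a2b3c4d` yields access_log.2015-03-02-14.eu-west-1.i-1a2b3c4d.txt.gz, and passing that name to `archive_subfolder` yields 2015-03-02.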

Finally, we have updated EmrEtlRunner to also check that the enriched events bucket is empty prior to moving any raw logs into staging. If any enriched events are found, then the move to staging will not start. This makes for much smoother operation when you are running your enrichment process very frequently (e.g. hourly).

4. Bug fixes in Scala Common Enrich

We have fixed various bugs in Scala Common Enrich, mostly related to character encoding issues:

  1. We fixed a bug where our Base64 decoding did not specify UTF-8 charset, causing problems with Unicode text on EMR where the default character set is US_ASCII (#1403)
  2. We removed an incorrect extra layer of URL decoding from non-Base64-encoded JSONs (#1396)
  3. There was a mismatch between the Snowplow Tracker Protocol, which mandated ti_nm for transaction item names, and Scala Common Enrich, which was expecting ti_na for the same field. Scala Common Enrich now supports both options (#1401)
  4. We have updated the SnowplowAdapter component to accept “charset=UTF-8” as a content-type parameter, because some web browsers always attach this content-type parameter to their POSTs (#1424)
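As an illustration of the last fix, the sketch below shows what tolerating a “charset=UTF-8” content-type parameter can look like. It is not the SnowplowAdapter code, which is written in Scala: the function name accepts_content_type is hypothetical, and the application/json media type is an assumption chosen for the example.

```shell
#!/bin/sh
# Return 0 if the content type is application/json (an assumed media
# type for this example), either bare or with a charset=UTF-8 parameter;
# return 1 otherwise. Matching ignores case and spaces/tabs.
accepts_content_type() {
    # Lowercase and strip whitespace so that variants such as
    # "application/json; charset=UTF-8" compare equal.
    normalized=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -d ' \t')
    case "$normalized" in
        application/json|"application/json;charset=utf-8") return 0 ;;
        *) return 1 ;;
    esac
}
```

Under these assumptions, both `accepts_content_type "application/json"` and `accepts_content_type "application/json; charset=UTF-8"` succeed, while an unrelated media type is rejected.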

5. Upgrading

You need to update EmrEtlRunner to the latest code (0.12.0) on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r61-pygmy-parrot
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment

If you currently use snowplow-runner-and-loader.sh, upgrade to the latest version too.

This release bumps the Hadoop Enrichment process to version 0.13.0.

In your EmrEtlRunner’s config.yml file, update your hadoop_enrich job’s version like so:

  :versions:
    :hadoop_enrich: 0.13.0 # WAS 0.12.0

For a complete example, see our sample config.yml template.

This release bumps the Clojure Collector to version 1.0.0.

You will not be able to upgrade an existing Tomcat 7 cluster to use this version. Instead, to upgrade to this release:

  1. Download the new warfile by right-clicking on this link and selecting “Save As…”
  2. Log in to your Amazon Elastic Beanstalk console
  3. Browse to your Clojure Collector’s application
  4. Click the “Launch New Environment” action
  5. Click the “Upload New Version” option and upload your warfile

When you are confident that the new collector is performing as expected, you can choose the “Swap Environment URLs” action to put the new collector live.

6. Getting help

For more details on this release, please check out the r61 Pygmy Parrot Release Notes on GitHub.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.