Snowplow 60 Bee Hummingbird released

03 February 2015  •  Fred Blundun

We are happy to announce the release of Snowplow 60! Our sixtieth release focuses on the Snowplow Kinesis flow, and includes:

  1. A new Kinesis “sink app” that reads the Scala Stream Collector’s Kinesis stream of raw events and stores these raw events in Amazon S3 in an optimized format
  2. An updated version of our Hadoop Enrichment process that can read, as an input format, the events stored in S3 by the new Kinesis sink app

Together, these two features let you robustly archive your Kinesis event stream in S3, and process and re-process it at will using our tried-and-tested Hadoop Enrichment process. Huge thanks to community member Phil Kallos from Popsugar for his contributions to this release!

Up until now, all Snowplow releases have used semantic versioning. We will continue to use semantic versioning for Snowplow’s many constituent applications and libraries, but our releases of the Snowplow platform as a whole will be known by their release number plus a codename. This is release 60; the codenames for 2015 will be birds in ascending order of size, starting today with the Bee Hummingbird.

The rest of this post will cover the following topics:

  1. The Kinesis LZO S3 Sink
  2. Support for POSTs and webhooks in the Scala Stream Collector
  3. Scala Stream Collector no longer decodes URLs
  4. Self-describing Thrift
  5. EmrEtlRunner updates
  6. Upgrading
  7. Getting help

1. The Kinesis LZO S3 Sink

The Scala Stream Collector writes Snowplow raw events in a Thrift format to a Kinesis stream. The new Kinesis LZO S3 Sink is a Kinesis app which reads records from a stream, compresses them using splittable LZO and writes the compressed files to S3. Each .lzo file has a corresponding .lzo.index file containing the byte offsets for the LZO blocks, so that the blocks can be processed in parallel using Hadoop.

In fact this new sink is not limited to serialized Snowplow Thrift records - it can store any stream of Kinesis records as splittable LZO files in S3.
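To illustrate why an index of byte offsets makes a compressed file splittable, here is a minimal sketch. It is not the sink's actual implementation: it uses zlib from the Python standard library as a stand-in for LZO, a toy newline framing for records, and hypothetical function names (write_blocks, read_block). The point is only that each block is compressed independently, so a reader that knows where a block starts can decompress it without touching the rest of the file.

```python
import zlib

def write_blocks(records, records_per_block=2):
    """Compress records in independent blocks.

    Returns the concatenated compressed bytes plus an index holding the
    byte offset at which each block starts (the role played by .lzo.index).
    zlib is a stand-in for LZO here; the framing is a toy assumption.
    """
    data = bytearray()
    offsets = []
    for i in range(0, len(records), records_per_block):
        offsets.append(len(data))  # where this block begins in the file
        block = b"\n".join(records[i:i + records_per_block])
        data += zlib.compress(block)
    return bytes(data), offsets

def read_block(data, offsets, n):
    """Decompress block n on its own, using only the index to locate it."""
    start = offsets[n]
    end = offsets[n + 1] if n + 1 < len(offsets) else len(data)
    return zlib.decompress(data[start:end]).split(b"\n")

records = [b"event-1", b"event-2", b"event-3", b"event-4", b"event-5"]
data, offsets = write_blocks(records)

# The middle block is readable without decompressing the blocks before it,
# which is what allows separate workers to process blocks in parallel.
middle = read_block(data, offsets, 1)
```

Without the index, a reader would have to decompress the file from the beginning to find block boundaries, so the file could only be processed by a single worker.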

To accompany this new sink, we have updated the batch-based Hadoop Enrichment process so that it can now read LZO-compressed Thrift binary records. This means that you can potentially run both the Kinesis and Hadoop Enrichment processes off the same Kinesis stream. To use this feature, just set the collector_format field in the EmrEtlRunner’s YAML configuration file to thrift.

You can see the project here.

For more information on setting up the Kinesis LZO S3 Sink, please see the Snowplow wiki.

2. Support for POSTs and webhooks in the Scala Stream Collector

The Scala Stream Collector was previously limited to standard GET requests of the format historically sent by Snowplow trackers. From this release, POST requests containing one or more events are supported too. This makes the Scala Stream Collector more suitable for tracking events from mobile trackers, server-side trackers and indeed from supported webhooks.

Two further improvements to the Scala Stream Collector’s routing are worth noting:

  1. The 1x1 transparent pixel with which the Scala Stream Collector responds to GET requests has been changed to improve compatibility with webmail providers such as Gmail (#1260)
  2. Snowplow community member James Duncan Davidson added a dedicated /health route to the collector, for easier inter-op with Elastic Load Balancer (#1360). Thanks James!

3. Scala Stream Collector no longer decodes URLs

The Scala Stream Collector used to use Spray’s URI parsing to parse and percent-decode incoming GET requests. Unfortunately the enrichment process also percent-decodes querystrings. This meant that incoming non-Base64-encoded events were decoded twice, introducing errors if certain characters were present. This has now been fixed.
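The effect of decoding twice is easy to reproduce outside Snowplow. The sketch below uses Python's standard urllib.parse rather than Spray or the enrichment code, and the field value is a hypothetical example: a page URL fragment that itself contains a percent-encoded plus sign.

```python
from urllib.parse import quote, unquote

# Hypothetical field value: it legitimately contains the literal text "%2B".
original = "q=a%2Bb"

# A tracker percent-encodes the value once before putting it in a querystring.
on_the_wire = quote(original, safe="")   # "q%3Da%252Bb"

decoded_once = unquote(on_the_wire)      # "q=a%2Bb", the value the tracker sent
decoded_twice = unquote(decoded_once)    # "q=a+b", silently corrupted

assert decoded_once == original
assert decoded_twice != original
```

Values without a literal percent sign survive a second decode unchanged, which is why the problem only appeared when certain characters were present.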

4. Self-describing Thrift

The old SnowplowRawEvent Thrift struct output by the Scala Stream Collector didn’t contain all of the fields we now require, such as the body of POST requests.

We have therefore replaced it with a new and improved CollectorPayload struct. Since we wanted to accept legacy SnowplowRawEvent Thrift records and also to possibly add new fields to CollectorPayload in the future, we have implemented self-describing Thrift to ensure that it is always possible to tell which of Snowplow’s Thrift IDL files was used to generate a given event. See this blog post for more detail.
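As a rough sketch of the idea, assuming that a self-describing record carries a field naming its own schema while a legacy record carries none, a consumer could dispatch as follows. Plain dictionaries stand in for deserialized Thrift structs, and the field name "schema", the identifier string and the function name are illustrative assumptions, not taken from Snowplow's IDL files.

```python
# Illustrative identifier only; the real value is defined by Snowplow's IDL.
COLLECTOR_PAYLOAD_SCHEMA = (
    "iglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0"
)

def identify_struct(record: dict) -> str:
    """Return the name of the struct a deserialized record was generated from.

    Assumption: legacy SnowplowRawEvent records have no "schema" field,
    while newer records name their own schema.
    """
    schema = record.get("schema")
    if schema is None:
        return "SnowplowRawEvent"
    if schema == COLLECTOR_PAYLOAD_SCHEMA:
        return "CollectorPayload"
    raise ValueError(f"Unrecognized schema: {schema}")

legacy = {"timestamp": 1422921600000, "payload": "e=pv"}
current = {"schema": COLLECTOR_PAYLOAD_SCHEMA, "body": "{}"}

kinds = [identify_struct(legacy), identify_struct(current)]
```

Because the record names its own schema, adding fields later only requires a new identifier and a new branch, without breaking readers of older records.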

5. EmrEtlRunner updates

We have fixed two bugs in the EmrEtlRunner related to reporting of failed Elastic MapReduce jobs:

  • We worked around a missing dependency on the time_diff gem by re-implementing that functionality manually, as adding in the missing original gem caused cascading issues (#1310)
  • We fixed a bug where the failure reporting would crash if one or more of the jobflow steps had a missing created_at property (#1351)

Bee Hummingbird also updates the EmrEtlRunner so it is aware of the new "thrift" format for Snowplow raw events (as stored by the Kinesis LZO S3 Sink).

6. Upgrading

We recommend upgrading EmrEtlRunner to the latest version, 0.11.0, given the bugs fixed in this release. You must also upgrade if you want to use Hadoop to process the events stored by the Kinesis LZO S3 Sink.

Upgrade is as follows:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r60-bee-hummingbird
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment

This release bumps the Hadoop Enrichment process to version 0.12.0.

In your EmrEtlRunner’s config.yml file, update your hadoop_enrich job’s version like so:

  :versions:
    :hadoop_enrich: 0.12.0 # WAS 0.11.0

If you want to run the Hadoop Enrichment process against the output of the Kinesis LZO S3 Sink, you will have to change the collector_format field in the configuration file to thrift:

:collector_format: thrift

For a complete example, see our sample config.yml template.

We are steadily moving over to Bintray for hosting binaries and artifacts which don’t have to be hosted on S3. To make deployment easier, the Kinesis apps (Scala Stream Collector, Scala Kinesis Enrich, Kinesis Elasticsearch Sink, and Kinesis S3 Sink) are now all available in a single zip file here:

http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r60_bee_hummingbird.zip

7. Getting help

Documentation for the new Kinesis LZO S3 Sink is available on the Snowplow wiki.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.