Snowplow 60 Bee Hummingbird released
We are happy to announce the release of Snowplow 60! Our sixtieth release focuses on the Snowplow Kinesis flow, and includes:
- A new Kinesis “sink app” that reads the Scala Stream Collector’s Kinesis stream of raw events and stores these raw events in Amazon S3 in an optimized format
- An updated version of our Hadoop Enrichment process that supports as an input format the events stored in S3 by the new Kinesis sink app
Together, these two features let you robustly archive your Kinesis event stream in S3, and process and re-process it at will using our tried-and-tested Hadoop Enrichment process. Huge thanks to community member Phil Kallos from Popsugar for his contributions to this release!
Up until now, all Snowplow releases have used semantic versioning. We will continue to use semantic versioning for Snowplow’s many constituent applications and libraries, but our releases of the Snowplow platform as a whole will be known by their release number plus a codename. This is release 60; the codenames for 2015 will be birds in ascending order of size, starting today with the Bee Hummingbird.
The rest of this post will cover the following topics:
- The Kinesis LZO S3 Sink
- Support for POSTs and webhooks in the Scala Stream Collector
- Scala Stream Collector no longer decodes URLs
- Self-describing Thrift
- EmrEtlRunner updates
- Getting help
The Scala Stream Collector writes Snowplow raw events in a Thrift format to a Kinesis stream. The new Kinesis LZO S3 Sink is a Kinesis app which reads records from a stream, compresses them using splittable LZO and writes the compressed files to S3. Each
.lzo file has a corresponding
.lzo.index file containing the byte offsets for the LZO blocks, so that the blocks can be processed in parallel using Hadoop.
In fact this new sink is not limited to serialized Snowplow Thrift records - it can store any stream of Kinesis records as splittable LZO files in S3.
To accompany this new sink, we have updated the batch-based Hadoop Enrichment process so that it can now read LZO-compressed Thrift binary records. This means that you can potentially run both the Kinesis and Hadoop Enrichment processes off the same Kinesis stream. To use this feature, just set the
collector_format field in the EmrEtlRunner’s YAML configuration file to
You can see the project here.
For more information on setting up the Kinesis LZO S3 Sink, please see these wiki pages:
The Scala Stream Collector was previously limited to standard
GET requests of the format historically sent by Snowplow trackers. From this release
POST requests containing one or more events are now supported too. This makes the Scala Stream Collector more suitable for tracking events from mobile trackers, server-side trackers and indeed from supported webhooks.
Two further improvements to the Scala Stream Collector’s routing are worth noting:
- The 1x1 transparent pixel with which the Scala Stream Collector responds to GET requests has been changed to improve compatibility with webmail providers such as Gmail (#1260)
- Snowplow community member James Duncan Davidson added a dedicated
/healthroute to the collector, for easier inter-op with Elastic Load Balancer (#1360). Thanks James!
The Scala Stream Collector used to use Spray’s URI parsing to parse and percent-decode incoming
GET requests. Unfortunately the enrichment process also percent-decodes querystrings. This meant that incoming non-Base64-encoded events were decoded twice, introducing errors if certain characters were present. This has now been fixed.
SnowplowRawEvent Thrift struct output by the Scala Stream Collector didn’t contain all of the fields we now require, such as the body of POST requests.
We have therefore replaced it with a new and improved
CollectorPayload struct. Since we wanted to accept legacy
SnowplowRawEvent Thrift records and also to possibly add new fields to CollectorPayload in the future, we have implemented self-describing Thrift to ensure that it is always possible to tell which of Snowplow’s Thrift IDL files was used to generate a given event. See this blog post for more detail.
We have fixed two bugs in the EmrEtlRunner related to reporting of failed Elastic MapReduce jobs:
- We worked around a missing dependency on the
time_diffgem by re-implementing that functionality manually, as adding in the missing original gem caused cascading issues (#1310)
- We fixed a bug where the failure reporting would crash if one or more of the jobflow step had a missing
Bee Hummingbird also updates the EmrEtlRunner so it is aware of the new
"thrift" format for Snowplow raw events (as stored by the Kinesis LZO S3 Sink).
We recommend upgrading EmrEtlRunner to the latest version, 0.11.0, given the bugs fixed in this release. You also must upgrade if you want to use Hadoop to process the events stored by the Kinesis LZO S3 Sink.
Upgrade is as follows:
This release bumps the Hadoop Enrichment process to version 0.12.0.
In your EmrEtlRunner’s
config.yml file, update your `hadoop_enrich job’s version like so:
If you want to run the Hadoop Enrichment process against the output of the Kinesis LZO S3 Sink, you will have to change the collector_format field in the configuration file to
For a complete example, see our sample
We are steadily moving over to Bintray for hosting binaries and artifacts which don’t have to be hosted on S3. To make deployment easier, the Kinesis apps (Scala Stream Collector, Scala Kinesis Enrich, Kinesis Elasticsearch Sink, and Kinesis S3 Sink) are now all available in a single zip file here:
Documentation for the new Kinesis LZO S3 Sink is available on the Snowplow wiki: