Brandon discussed his Snowplow winternship further in his own excellent blog post.
In the rest of this post we will cover:
1. Our new Kinesis-based data flow
2. The Scala Stream Collector
3. Scala Kinesis Enrich
4. Running Snowplow in "local mode"
5. Our roadmap for the Kinesis workstream
6. A thank you to Brandon Amos
The following diagram sets out our planned data flow leveraging Kinesis. The colourized components are part of the 0.9.0 release; components in grey are yet to be implemented:
As you can see, 0.9.0 consists of two key components:
1. The Scala Stream Collector, which writes raw Snowplow events to a Kinesis stream
2. Scala Kinesis Enrich, which reads and enriches those raw events
Please note that the 0.9.0 release makes no changes to the existing, batch-based Snowplow flow leveraging Amazon S3 & Elastic MapReduce.
Let’s go over each of the two new components in a little more detail in the following two sections:
The Scala Stream Collector is a Snowplow event collector written in Scala. It enables near-real-time processing of the Snowplow raw event stream and, like the Clojure Collector, sets a third-party cookie.
The Scala Stream Collector receives raw Snowplow events over HTTP, serializes them to a Thrift record format, and then writes them to a sink. Currently supported sinks are:
1. A Kinesis stream
2. stdout, for a custom stream collection process
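As a quick sanity check once the collector is running, you can send it a test event over HTTP. The sketch below is illustrative only: it assumes the collector is listening on localhost port 8080 and serves the standard Snowplow pixel endpoint (/i), with example fields from the Snowplow Tracker Protocol in the query string:

```bash
# Send a minimal page-view event to a locally running Scala Stream Collector.
# The host, port and query-string parameters are assumptions for illustration -
# adjust them to match your own collector configuration.
curl "http://localhost:8080/i?e=pv&page=Test%20page&p=web"
```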
To set up the Scala Stream Collector, please see the Scala Stream Collector Setup Guide.
Scala Kinesis Enrich is a Kinesis application, built using the Kinesis Client Library and scalazon, which:
1. Reads raw Snowplow events off a Kinesis stream
2. Validates and enriches each raw event
3. Writes the enriched events out to a further Kinesis stream
As well as working with Kinesis streams, Scala Kinesis Enrich can also be configured to work with Unix stdin and stdout - see the section on local mode below for more details.
Under the covers, Scala Kinesis Enrich makes use of scala-common-enrich, our Snowplow Enrichment library which we extracted from our Hadoop Enrichment process in the 0.8.12 Snowplow release.
To set up Scala Kinesis Enrich, please see the Scala Kinesis Enrich Setup Guide.
When developing the new collector and enrichment components, we realized that there were strong parallels between the Kinesis stream processing paradigm and conventional Unix stdin/stdout streams. As a result, we added the ability for:
1. The Scala Stream Collector to write raw events to stdout instead of a Kinesis stream
2. Scala Kinesis Enrich to read raw events from stdin, and write enriched events to stdout
This has a nice side-effect: it is possible to run Snowplow in a “local mode”, where you simply pipe the output of the Scala Stream Collector directly into Scala Kinesis Enrich and see the generated enriched events printed to your console. You can run Snowplow in local mode with a shell script like this:
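The exact invocation depends on how you have built the two components - the jar names and configuration file paths in this sketch are placeholders:

```bash
#!/bin/bash
# Local mode sketch: pipe raw events from the Scala Stream Collector straight
# into Scala Kinesis Enrich, so enriched events are printed to the console.
# The jar names and config file paths are placeholders - replace them with
# the artifacts and configuration files from your own setup.
java -jar snowplow-stream-collector.jar --config collector.conf | \
  java -jar snowplow-kinesis-enrich.jar --config enrich.conf
```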
Make sure to set the sources and sinks in your configuration files to the relevant stdin and stdout settings.
Snowplow local mode could be helpful for debugging Snowplow tracker implementations before putting tags live - let us know how you get on with it!
As always, if you do run into any issues or don’t understand any of the above changes, please raise an issue or get in touch with us via the usual channels.
For more details on this release, please check out the 0.9.0 Release Notes on GitHub.
We have identified three major tasks required to bring our Kinesis-based, beta-quality implementation towards parity with our production-quality EMR-based implementation:
Necessarily, we have to balance these tasks against the work we are already committed to for our existing batch-based Snowplow flow leveraging Amazon S3 and Elastic MapReduce.
If you are interested in contributing to the Kinesis workstream, or in sponsoring us to accelerate this work, then do get in touch!
And that’s it - but this post would not be complete without us saying a huge “thank you” to Snowplow wintern Brandon Amos for his massive contribution to this release!
As you can see from our original winternship blog post, the original plan was for Brandon to come on board for a month and ship the Scala Stream Collector. In fact, Brandon delivered the new collector in his first two weeks and was then able to turn his attention to Scala Kinesis Enrich, which he built in the remaining fortnight. All of this was built on Kinesis, a platform he had never used before, and all with tests and wiki documentation - truly a heroic contribution to Snowplow!
Thanks again Brandon!