We are hugely excited to announce the release of Snowplow 0.9.0. This release introduces our initial beta support for Amazon Kinesis in the Snowplow Collector and Enrichment components, and was developed in close collaboration with Snowplow wintern Brandon Amos.
At Snowplow we are hugely excited about Kinesis’s potential, not just to enable near-real-time event analytics, but more fundamentally to serve as a business’s unified log, aka its “digital nervous system”. This is a concept we introduced recently in our blog post The three eras of business data processing, and further explored at the inaugural Kinesis London meetup.
Before we get further into this release, let’s see what Brandon has to say:
In the rest of this post we will cover:
- Overview of this release
- Scala Stream Collector
- Scala Kinesis Enrich
- Bonus feature: Snowplow local mode
- Getting help
- Roadmap and contributing
The following diagram sets out our planned data flow leveraging Kinesis. The coloured components are part of the 0.9.0 release; components in grey are yet to be implemented:
As you can see, 0.9.0 consists of two key components:
- A new Scala Stream Collector, which collects raw Snowplow events over HTTP and writes them to a Kinesis stream as Thrift-serialized records
- A new Scala Kinesis Enrich component, which reads raw Snowplow events from the raw Kinesis stream, validates and enriches them and writes them back to an enriched Kinesis event stream
Please note that the 0.9.0 release makes no changes to the existing, batch-based Snowplow flow leveraging Amazon S3 & Elastic MapReduce.
Let’s go over each of the two new components in a little more detail in the following two sections:
The Scala Stream Collector is a Snowplow event collector, written in Scala, which allows near-real-time processing of a Snowplow raw event stream. Like the Clojure Collector, it also sets a third-party cookie.
The Scala Stream Collector receives raw Snowplow events over HTTP, serializes them to a Thrift record format, and then writes them to a sink. Currently supported sinks are:
- Amazon Kinesis
- `stdout`, for a custom stream collection process
To set up the Scala Stream Collector, please see the Scala Stream Collector Setup Guide.
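As a rough illustration only of how the sink might be chosen (the field names below are assumptions for this sketch, not the actual schema – the setup guide has the real configuration format), the collector takes a HOCON configuration file along these lines:

```hocon
# Hypothetical sketch of a collector configuration – field names are
# illustrative assumptions; consult the Scala Stream Collector Setup Guide.
collector {
  interface = "0.0.0.0"
  port = 8080

  sink {
    # Choose where raw Thrift-serialized events are written:
    # "kinesis" for a Kinesis stream, or "stdout" for local piping
    enabled = "kinesis"

    kinesis {
      stream {
        region = "us-east-1"
        name = "snowplow-raw-events"
      }
    }
  }
}
```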
Scala Kinesis Enrich is a Kinesis application, built using the Kinesis Client Library and scalazon, which:
- Reads raw Snowplow events off a Kinesis stream populated by the Scala Stream Collector
- Validates each raw event
- Enriches each event (e.g. infers the location of the user from his/her IP address)
- Writes the enriched Snowplow event to another Kinesis stream
As well as working with Kinesis streams, Scala Kinesis Enrich can also be configured to work with Unix `stdin` and `stdout`.
Under the covers, Scala Kinesis Enrich makes use of scala-common-enrich, our Snowplow Enrichment library which we extracted from our Hadoop Enrichment process in the 0.8.12 Snowplow release.
To set up Scala Kinesis Enrich, please see the Scala Kinesis Enrich Setup Guide.
When developing the new collector and enrichment components, we realized that there were strong parallels between the Kinesis stream processing paradigm and conventional Unix stdio streams. As a result, we added the ability for:
- Scala Stream Collector to write Snowplow raw events to `stdout` instead of a Kinesis stream
- Scala Kinesis Enrich to read Snowplow raw events from `stdin`, and write enriched events to `stdout`
This has a nice side-effect: it is possible to run Snowplow in a “local mode”, where you simply pipe the output of Scala Stream Collector directly into Scala Kinesis Enrich, and can then see the generated enriched events printed to your console. You can run Snowplow in local mode with a shell script like this:
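In sketch form, local mode is just a single Unix pipe: the collector's `stdout` becomes the enricher's `stdin`. The real script would invoke the two jars (the jar names and flags in the comment below are assumptions – see the setup guides); the two shell functions are toy stand-ins that make the stdio plumbing itself visible:

```shell
# The real pipeline would look something like (jar names/flags assumed):
#   java -jar snowplow-stream-collector.jar --config collector.conf | \
#     java -jar snowplow-kinesis-enrich.jar --config enrich.conf
#
# Toy stand-ins for the two stages, to show the shape of the pipe:
collect() { printf 'raw-event-1\nraw-event-2\n'; }                  # collector with an stdout sink
enrich()  { while IFS= read -r ev; do echo "enriched:$ev"; done; }  # enrich reading from stdin

collect | enrich
```

Each raw event emitted by the first stage appears on the console as an enriched event from the second – exactly the flow the real collector and enricher follow in local mode.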
Make sure to set the sources and sinks in your configuration files to the relevant `stdin`/`stdout` options.
Snowplow local mode could be helpful for debugging Snowplow tracker implementations before putting tags live – let us know how you get on with it!
For more details on this release, please check out the 0.9.0 Release Notes on GitHub.
We have identified three major tasks required to bring our Kinesis-based, beta-quality implementation towards parity with our production-quality EMR-based implementation:
- Implementing the raw-events-to-S3 sink as a Kinesis app
- Implementing the enriched-events-to-Redshift drip-feeder as a Kinesis app
- Performance testing and tuning (e.g. reviewing our Kinesis checkpointing approach)
Necessarily, we have to balance these tasks against the work we are committed to for our existing batch-based Snowplow flow leveraging Amazon S3 and Elastic MapReduce.
If you are interested in contributing towards the Kinesis workstream, or in sponsoring us to accelerate these components' release, then do get in touch!
And that’s it – but this post would not be complete without us saying a huge “thank you” to Snowplow wintern Brandon Amos for his massive contribution to this release!
As you can see from our original winternship blog post, the original plan was for Brandon to come on board for a month and ship the Scala Stream Collector. In fact, Brandon delivered the new collector in his first two weeks, and was then able to turn his attention to Scala Kinesis Enrich, which he built in the remaining fortnight. All of this was built on Kinesis, a platform he had never used before, and all with tests and wiki documentation – truly a heroic contribution to Snowplow!
Thanks again Brandon!