Support for running Snowplow on Apache Kafka has been one of our longest-standing feature requests - dating back almost as far as our original release of Amazon Kinesis support, in Snowplow v0.9.0.
When we look at companies’ on-premise data processing pipelines, we see a huge amount of diversity in terms of stream processing frameworks, data storage systems, orchestration engines and similar. However, there is one near-constant across all these companies: Apache Kafka. Over the past few years, Kafka has become the dominant on-premise unified log technology, mirroring the role that Amazon Kinesis plays on AWS.
Adding support for Kafka to Snowplow has been a goal of ours for some time, and over the years we have had some great code contributions towards this, from community members such as Greg. Our thanks to all contributors!
Finally, at our company hackathon in Berlin, Josh Beemster and I have had an opportunity to put together our first beta release of Apache Kafka support. We have kept this deliberately very minimal - the smallest useable subset that we could build and test in a single day. This comprises:
Between these two components, it is now possible to stand up a Kafka-based pipeline, from Snowplow event tracking through to a Kafka topic containing Snowplow enriched events. This has been built and tested with Kafka v0.10.1.0. As of this release, we have not gone further than this - for example into sinking events from Kafka into our supported storage targets.
Please note that this Kafka support is extremely beta - we want you to use it and test it; do not use it in production.
In the next three sections we will set out what is available in this release, and what is coming soon.
This release brings support for a new sink target for our Scala Stream Collector in the form of a Kafka topic. This feature maps one-to-one in functionality with the current Kinesis offering.
As a typical example if you have followed the Kafka quickstart guide you would update your config to have the following values:
Note: This application assumes that you have created your topics ahead-of-time; just like with Kinesis streams.
Launching the collector in this configuration will then start sinking raw events to your configured Kafka topic, allowing them to be picked up and consumed by other applications, including Stream Enrich.
This component has also been updated to now support both a Kafka topic as a source, and as a sink. This feature maps one-to-one in functionality with the current Kinesis offering; in fact it also enables emergent configurations such as enriching from Kinesis to Kafka, or from Kafka to Kinesis!
Following on from the Stream Collector section above, you can then configure your Stream Enrich application like so:
Note: This application assumes that you have created your topics ahead-of-time; just like with Kinesis.
Events from the Stream Collector’s raw topic will then start to be picked up and enriched before being dropped back into the
As of this release, we have not yet updated the Snowplow documentation to cover our new Kafka support. Kafka support represents a far-reaching change for us: we need to revise a lot of our documentation, which still discusses Snowplow as an AWS-only platform.
Over the next fortnight we will add in our Kafka documentation and revise the overall structure of our documentation; in the meantime please use this blog post as your guide for trialling Snowplow with Kafka.
We have only made one other change in this release, for Stream Enrich:
The real-time apps for R85 Metamorphosis are available in the following zipfiles:
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_scala_stream_collector_0.9.0.zip http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_enrich_0.10.0.zip http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_elasticsearch_sink_0.8.0_1x.zip http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_elasticsearch_sink_0.8.0_2x.zip
Or you can download all of the apps together in this zipfile:
To upgrade the Stream Collector application:
collector.sink.kinesis.buffersection down to
collector.sink.buffer; as this section will be used to configure limits for both Kinesis and Kafka.
To upgrade the Stream Enrich application:
Note: The app-name defined in your config will be used as your Kafka consumer group ID.
We have renamed the upcoming milestones for Snowplow to be more flexible around the ultimate sequencing of releases. Upcoming Snowplow releases, in no particular order, include:
event_fingerprints (synthetic duplicates) in Hadoop Shred
Note that these releases are always subject to change between now and the actual release date.
We were lucky enough to capture the exact moment Josh and Alex got the whole thing running for the first time:
This is an extremely beta release of Apache Kafka support for Snowplow - we encourage you to test it out, and give us feedback on how we can improve it and extend it over the coming months. We are committed to building first-class support for Snowplow on Apache Kafka - and would love your help!
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.