Snowplow 85 Metamorphosis released with beta Apache Kafka support

We are pleased to announce the release of Snowplow 85 Metamorphosis. This release brings initial beta support for using Apache Kafka with the Snowplow real-time pipeline, as an alternative to Amazon Kinesis.

Metamorphosis is one of Franz Kafka’s most famous books, and an apt codename for this release, as our first step towards an implementation of the full Snowplow platform that can be run off the Amazon cloud, on-premise. (We’ll come up with a new non-ornithological codename series for R86 onwards.)

  1. Supporting Apache Kafka
  2. Scala Stream Collector and Kafka
  3. Stream Enrich and Kafka
  4. Kafka documentation
  5. Other changes
  6. Upgrading
  7. Roadmap
  8. Behind the scenes
  9. Getting help


1. Supporting Apache Kafka

Support for running Snowplow on Apache Kafka has been one of our longest-standing feature requests - dating back almost as far as our original release of Amazon Kinesis support, in Snowplow v0.9.0.

When we look at companies’ on-premise data processing pipelines, we see a huge amount of diversity in terms of stream processing frameworks, data storage systems, orchestration engines and similar. However, there is one near-constant across all these companies: Apache Kafka. Over the past few years, Kafka has become the dominant on-premise unified log technology, mirroring the role that Amazon Kinesis plays on AWS.

Adding support for Kafka to Snowplow has been a goal of ours for some time, and over the years we have had some great code contributions towards this, from community members such as Greg. Our thanks to all contributors!

Finally, at our company hackathon in Berlin, Josh Beemster and I have had an opportunity to put together our first beta release of Apache Kafka support. We have kept this deliberately very minimal - the smallest useable subset that we could build and test in a single day. This comprises:

  • Adding a Kafka sink to the Scala Stream Collector
  • Adding a Kafka source and a Kafka sink to Stream Enrich

Between these two components, it is now possible to stand up a Kafka-based pipeline, from Snowplow event tracking through to a Kafka topic containing Snowplow enriched events. This has been built and tested with Kafka v0.10.1.0. As of this release, we have not gone further than this - for example into sinking events from Kafka into our supported storage targets.

Please note that this Kafka support is extremely beta - we want you to use it and test it; do not use it in production.

In the next three sections we will set out what is available in this release, and what is coming soon.

2. Scala Stream Collector and Kafka

This release brings support for a new sink target for our Scala Stream Collector in the form of a Kafka topic. This feature maps one-to-one in functionality with the current Kinesis offering.

As a typical example if you have followed the Kafka quickstart guide you would update your config to have the following values:

 1 collector {
 2   ...
 4   sink {
 6     enabled = "kafka"
 7     ...
 9     kafka {
10       brokers: "localhost:9092"
12       # Data will be stored in the following topics
13       topic {
14         good: "collector-payloads"
15         bad: "bad-1"
16       }
17     }
19     ...
20 }

Note: This application assumes that you have created your topics ahead-of-time; just like with Kinesis streams.

Launching the collector in this configuration will then start sinking raw events to your configured Kafka topic, allowing them to be picked up and consumed by other applications, including Stream Enrich.

3. Stream Enrich and Kafka

This component has also been updated to now support both a Kafka topic as a source, and as a sink. This feature maps one-to-one in functionality with the current Kinesis offering; in fact it also enables emergent configurations such as enriching from Kinesis to Kafka, or from Kafka to Kinesis!

Following on from the Stream Collector section above, you can then configure your Stream Enrich application like so:

 1 enrich {
 2   source = "kafka"
 3   sink = "kafka"
 5   ...
 7   kafka {
 8     brokers: "localhost:9092"
 9   }
11   streams {
12     in: {
13       raw: "collector-payloads"
15       ...
16     }
18     out: {
19       enriched: "enriched-events"
20       bad: "bad-1"
22       ...
23     }
25   ...
26 }

Note: This application assumes that you have created your topics ahead-of-time; just like with Kinesis.

Events from the Stream Collector’s raw topic will then start to be picked up and enriched before being dropped back into the out topic.

4. Kafka documentation

As of this release, we have not yet updated the Snowplow documentation to cover our new Kafka support. Kafka support represents a far-reaching change for us: we need to revise a lot of our documentation, which still discusses Snowplow as an AWS-only platform.

Over the next fortnight we will add in our Kafka documentation and revise the overall structure of our documentation; in the meantime please use this blog post as your guide for trialling Snowplow with Kafka.

5. Other changes

We have only made one other change in this release, for Stream Enrich:

  • Fixed regression issue with parsing S3 urls in enrichment JSONs (#2921)

6. Upgrading

The real-time apps for R85 Metamorphosis are available in the following zipfiles:

Or you can download all of the apps together in this zipfile:

To upgrade the Stream Collector application:

  • Install the new Collector on each server in your auto-scaling group
  • Upgrade your config by:
    • Moving the collector.sink.kinesis.buffer section down to collector.sink.buffer; as this section will be used to configure limits for both Kinesis and Kafka.
    • Adding a new section within the collector.sink block:
 1 collector {
 2   ...
 4   sink {
 5     ...
 7     buffer {
 8       byte-limit: 
 9       record-limit:  # Not supported by Kafka; will be ignored
10       time-limit: 
11     }
12     ...
14     kafka {
15       brokers: ""
17       # Data will be stored in the following topics
18       topic {
19         good: ""
20         bad: ""
21       }
22     }
23     ...
25 }

To upgrade the Stream Enrich application:

  • Install the new Stream Enrich on each server in your auto-scaling group
  • Upgrade your config by:
    • Adding a new section within the enrich block:
 1 enrich {
 2   ...
 4   # Kafka configuration
 5   kafka {
 6     brokers: "localhost:9092"
 7   }
 9   ...
10 }

Note: The app-name defined in your config will be used as your Kafka consumer group ID.

7. Roadmap

We have renamed the upcoming milestones for Snowplow to be more flexible around the ultimate sequencing of releases. Upcoming Snowplow releases, in no particular order, include:

  • R8x [HAD] 4 webhooks, which will add support for 4 new webhooks: Mailgun, Olark, Unbounce, StatusGator
  • R8x [HAD] Synthetic dedupe, which will deduplicate events with the same event_id but different event_fingerprints (synthetic duplicates) in Hadoop Shred

Note that these releases are always subject to change between now and the actual release date.

8. Behind the scenes

We were lucky enough to capture the exact moment Josh and Alex got the whole thing running for the first time:


9. Getting help

This is an extremely beta release of Apache Kafka support for Snowplow - we encourage you to test it out, and give us feedback on how we can improve it and extend it over the coming months. We are committed to building first-class support for Snowplow on Apache Kafka - and would love your help!

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.

Thoughts or questions? Come join us in our Discourse forum!

Alex Dean

Alex is co-founder and technical lead at Snowplow. You can find in him on , Twitter and LinkedIn.