Unified Log London 3 with Apache Kafka and Samza at State
There were two talks at the meetup:
- I gave a recap of the Unified Log “manifesto” for new ULPers, using my regular presentation, “Why your company needs a Unified Log”
- Mischa Tuffield, CTO at State, gave an excellent talk on implementing a Unified Log at State to meet various operational and analytical data requirements, all using Apache Kafka and Samza
The meetup had a great mix of Unified Log practitioners and people just starting to explore the concept. It was particularly encouraging to see the discussion take on such an interactive, “salon”-style atmosphere and continue late into the evening!
1. Why your company needs a Unified Log
In this talk, I summarized the emergence of the Unified Log concept, talking through the “three eras” of data processing and explaining why it makes sense to restructure your company around a Unified Log. Regular readers of this blog may well have seen a version of this presentation already; it is included here for completeness:
2. Unified Log at State
We were lucky enough to have Mischa Tuffield and Dan Harvey, Data Architect at State, talk us through their implementation of the Unified Log concept at State. Learning about the real-world experience of implementing ULP is a key part of Unified Log London, so it was great to hear Mischa and Dan’s story. Mischa’s slides are here:
Key building blocks of State’s Unified Log implementation are:
- Apache Kafka to act as the distributed commit log
- A custom “tailer” app to mirror their MongoDB oplog to Kafka as entity snapshots
- Apache Samza for stream-stream joins and other use cases
- The Confluent Schema Registry (which shares some similarities with our own Iglu) for storing Avro schemas
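To make the “tailer” idea a little more concrete, here is a minimal Python sketch of turning a stream of database changes into entity snapshots. It is not State's code: the change-record shape (`collection`, `op`, `id`, `doc`, `set`), the function name `tail_to_snapshots` and the choice of one topic per collection are all assumptions made for illustration, and a real tailer would read the MongoDB oplog and write to Kafka rather than work on in-memory lists.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple

Snapshot = Dict[str, Any]
# (topic, key, snapshot); a snapshot of None marks a deleted entity
Message = Tuple[str, Any, Optional[Snapshot]]


def tail_to_snapshots(changes: Iterable[Dict[str, Any]]) -> Iterator[Message]:
    """Turn simplified, hypothetical change records into full entity snapshots."""
    state: Dict[Tuple[str, Any], Snapshot] = {}  # latest known document per entity
    for change in changes:
        collection, op, doc_id = change["collection"], change["op"], change["id"]
        key = (collection, doc_id)
        if op == "insert":
            state[key] = dict(change["doc"])
        elif op == "update":
            # Merge the changed fields into the last known version of the entity
            state.setdefault(key, {"_id": doc_id}).update(change["set"])
        elif op == "delete":
            state.pop(key, None)
            yield (collection, doc_id, None)
            continue
        else:
            continue  # ignore operations this sketch does not model
        # Emit the whole entity, not just the fields that changed
        yield (collection, doc_id, dict(state[key]))


changes = [
    {"collection": "users", "op": "insert", "id": 1, "doc": {"_id": 1, "name": "Ann"}},
    {"collection": "users", "op": "update", "id": 1, "set": {"email": "ann@example.com"}},
    {"collection": "users", "op": "delete", "id": 1},
]
messages = list(tail_to_snapshots(changes))
```

Each emitted message carries the complete entity as of that change, so a downstream consumer never needs to have seen earlier messages to know the entity's current state.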
Given our focus at Snowplow on the various analytical uses of the Unified Log, it was really helpful for me to get Mischa and Dan’s more operational/transactional-focused perspective on the Unified Log.
3. Big themes
There were some really interesting themes that emerged during the talks and the subsequent discussion. To highlight just three:
- Stream design - specifically, whether to create an individual stream (a topic, in Kafka parlance) for each entity, or to have streams that carry every entity type and are tied only to the processing stage. State follows the first approach, Snowplow the second
- Event sourcing versus entity snapshotting - this really warrants a full blog post, but there was some healthy debate about whether an individual event should capture complete entity snapshots or just deltas (i.e. just the properties that have changed). There was a general feeling (which we share at Snowplow) that entity snapshots are much safer in the face of potentially lossy systems
- The importance of a schema registry - in the Unified Log model, your events’ schemas form the sole contract between your various stream processing applications, and so having a single source of truth for these schemas - a registry/repository - becomes essential
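The snapshots-versus-deltas point above is easy to demonstrate with a toy example. The Python sketch below is purely illustrative (the event contents and the helper names `apply_deltas`, `apply_snapshots` and `drop` are made up for this post): it replays the same history in both styles and then simulates a lossy system by dropping one event.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]


def apply_deltas(events: List[Event]) -> Event:
    """Rebuild an entity by merging each delta into the running state."""
    state: Event = {}
    for event in events:
        state.update(event)
    return state


def apply_snapshots(events: List[Event]) -> Event:
    """Rebuild an entity by keeping only the most recent snapshot."""
    state: Event = {}
    for event in events:
        state = dict(event)
    return state


def drop(events: List[Event], index: int) -> List[Event]:
    """Simulate a lossy system by losing the event at the given position."""
    return [e for i, e in enumerate(events) if i != index]


# The same three changes to one user, expressed in both styles
deltas = [{"name": "Ann", "plan": "free"}, {"plan": "pro"}, {"name": "Ann B"}]
snapshots = [
    {"name": "Ann", "plan": "free"},
    {"name": "Ann", "plan": "pro"},
    {"name": "Ann B", "plan": "pro"},
]

# Lose the second event (the upgrade to "pro") in each stream
from_deltas = apply_deltas(drop(deltas, 1))
from_snapshots = apply_snapshots(drop(snapshots, 1))

print(from_deltas)     # the plan upgrade is lost for good
print(from_snapshots)  # the next snapshot repairs the gap
```

With deltas, the lost upgrade never reappears and the consumer's view stays wrong until the plan changes again. With snapshots, the next event to arrive carries the full entity and the consumer converges on the correct state. The trade-off, of course, is larger events.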
4. Thanks and next event
It was a great meetup - in particular it’s exciting to see the Unified Log patterns becoming such a hot discussion topic. A big thank you to Raj Singh, Peter Mounce and the Just Eat Engineering team for being such excellent hosts, and a warm thanks to Mischa and Dan for giving us the inside track on Unified Log at State!