Schema registries and Strata + Hadoop World NYC 2016

23 October 2016  •  Alex Dean

In late September the Snowplow team attended Strata + Hadoop World in New York City. It was a great opportunity to check in on the US data science and engineering scenes, and I was pleased to also have the opportunity to give a talk on schema registries.

In this blog post we will briefly cover:

  1. What Crimean War gunboats teach us about the need for schema registries
  2. Alex’s session picks
  3. Christophe’s session picks
  4. Some closing thoughts

1. What Crimean War gunboats teach us about the need for schema registries

It was super-exciting to give my first talk at Strata, on the importance of schema registries, drawing parallels with Britain’s industrial standardization of the Crimean War era.

At the start of the Crimean War in 1853, Britain’s Royal Navy needed 90 new gunboats ready to fight in the Baltic in just 90 days. They were able to build the boats in record time thanks to industrial standardization - specifically the Whitworth thread, the world’s first national screw thread standard.

In my talk, I drew on the story of the Crimean War gunboats to argue that our data processing architectures urgently require a standardization of their own, in the form of schema registries. Like the Whitworth screw thread, a schema registry, such as Snowplow’s own Iglu or Confluent Schema Registry, allows enterprises to standardize on a set of business entities which can be used throughout their batch and stream processing architectures.
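To make that concrete, here is a minimal sketch of the self-describing JSON convention that Iglu builds on: every payload carries an Iglu schema URI, so any consumer, batch or streaming, can look up the exact contract the event was written against. The vendor, event name and fields below are hypothetical.

```python
# A self-describing event: the "schema" key holds an Iglu schema URI of the
# form iglu:{vendor}/{name}/{format}/{version}; the "data" key holds the payload.
# The com.acme vendor and checkout_started event are made up for illustration.
event = {
    "schema": "iglu:com.acme/checkout_started/jsonschema/1-0-0",
    "data": {
        "cart_id": "c-1234",
        "total": 42.50,
        "currency": "USD",
    },
}
```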

My closing thought was that every organization should implement a schema registry, whether Iglu, Confluent Schema Registry or an in-house system. The schemas in this registry will provide a common language for all data processing throughout your organization, and will let you assemble your data pipeline from many smaller micro-services, just as the Whitworth standard let the Royal Navy assemble its gunboats from the output of disparate machine shops.

I really enjoyed giving the talk, and appreciated the audience’s in-depth questions afterwards. Putting the talk together also gave me the chance to step back and take a broader look at the whole schema technology landscape. I am increasingly convinced that Iglu’s support for schema resolution across multiple schema registries (plus associated features such as schema URIs) is going to prove an essential feature in the future.
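To illustrate what that resolution might look like, here is a minimal sketch, in Python rather than Iglu’s own Scala, of resolving a schema URI against a prioritized list of registries. The lookup path follows the Iglu HTTP repository layout; the in-house registry URL is hypothetical, and a real resolver adds caching and validation on top.

```python
import urllib.error
import urllib.request

# Registries are tried in priority order: a private in-house registry first
# (hypothetical URL), falling back to the public Iglu Central registry.
REGISTRIES = [
    "https://schemas.acme.internal",  # hypothetical in-house registry
    "http://iglucentral.com",         # public Iglu Central
]

def resolve(schema_uri: str) -> bytes:
    """Fetch the JSON Schema behind an Iglu URI (iglu:vendor/name/format/version),
    trying each registry in turn at {base}/schemas/{vendor}/{name}/{format}/{version}."""
    path = schema_uri.removeprefix("iglu:")
    for base in REGISTRIES:
        try:
            with urllib.request.urlopen(f"{base}/schemas/{path}") as resp:
                return resp.read()
        except urllib.error.URLError:
            continue  # not found here (or registry unreachable): try the next one
    raise LookupError(f"schema not found in any registry: {schema_uri}")

# A real schema hosted on Iglu Central:
schema = resolve("iglu:com.snowplowanalytics.snowplow/screen_view/jsonschema/1-0-0")
```

The priority ordering is what makes multi-registry resolution useful: an organization can shadow or extend public schemas with its own registry without forking the public one.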

2. Alex’s session picks

  • Karthik Ramasamy from Twitter introduced Heron, Twitter’s Storm replacement, and DistributedLog, an alternative to Kafka. It was a great talk, full of detail, such as how lagging consumers in Kafka can badly impact non-lagging consumers; it also inspired me to take a second look at Heron, which was recently open-sourced
  • Maxime Beauchemin gave an engaging talk introducing Caravel, Airbnb’s data engineer-friendly open source BI tool. Caravel’s impressive traction and feature set should grow even faster with four additional Airbnb engineers joining Maxime to work on Caravel. To find out more on Caravel, check out Rob Kingston’s great tutorial, Visualise Snowplow data using Airbnb Caravel & Redshift
  • Xavier Léauté reprised his Strata London talk on the Kafka, Samza and Druid stack at Metamarkets. Metamarkets’ scale (300 billion events a day) is certainly inspiring, and Xavier made a great case for using Kafka, Samza, Druid and Spark at that scale. Metamarkets’ Spark usage is particularly encouraging: they use Spark exclusively with S3 (no HDFS in sight), and only on spot instances

3. Christophe’s session picks

4. Some closing thoughts

It was great to attend our first Strata + Hadoop World in New York - we will definitely be coming back. It’s an impressive event - bigger and more diverse than the London one. All the major players and vendors are there, and it’s a good opportunity to catch up with big data’s big personalities!

In terms of big trends: the general buzz around Spark seems to be dying down - presumably because Spark usage is so pervasive now, and the platform is maturing. Kafka talks were plentiful and well-attended, with a lot of user appetite to understand how to get the best out of the tool. It was also nice to see the Apache Flink project steadily gaining mind share.