A couple of weeks ago I was very lucky to attend, and speak at, Crunch Conference, a practical big data conference in Budapest organised by the folks at Ustream and Prezi and headlined by some of the titans of the data industry, including Doug Cutting, the creator of Hadoop (not to mention Lucene and Nutch), and Martin Kleppmann, the creator of Samza.
Emerging best practices in event data pipelines
Being invited to speak gave me the opportunity to step back from my day-to-day focus at Snowplow on:
- building event data pipelines and
- helping our users to get the most out of them,
and think more broadly about what distinguishes good event data pipelines from bad.
Three years ago, when we started Snowplow, our focus, and that of the industry as a whole, was on using frameworks like Hadoop and cloud services like EMR to make pipelines linearly scalable, robust and cost-effective. Today these are all a given. The things that mark out best-in-class event data pipelines from the rest are:
- A focus on data quality, which in practice means making the data pipeline auditable and validating data early using schemas.
- The ability for businesses to evolve their data pipelines as the businesses themselves evolve: changing the schemas for the events and entities tracked, and introducing new events and entities, as the activities that take place change and as the questions we need to answer in a data-driven way change.
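To make those two points a little more concrete, here is a minimal sketch in Python of validating events against versioned schemas as early as possible. It is an illustration only, not how Snowplow or any particular pipeline does it: the `SCHEMAS` registry, the `page_view` event, its fields and the `validate` and `route` functions are all hypothetical names made up for this example.

```python
# Hypothetical schema registry: (event name, version) -> required fields and types.
# A new version is added alongside the old one, so existing events remain valid.
SCHEMAS = {
    ("page_view", 1): {"url": str, "user_id": str},
    ("page_view", 2): {"url": str, "user_id": str, "referrer": str},
}


def validate(event):
    """Return a list of validation errors; an empty list means the event is valid."""
    key = (event.get("name"), event.get("version"))
    schema = SCHEMAS.get(key)
    if schema is None:
        return [f"unknown schema: {key}"]
    data = event.get("data", {})
    errors = []
    for field, expected in schema.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            errors.append(f"wrong type for {field}: expected {expected.__name__}")
    return errors


def route(events):
    """Split events into good and bad, keeping each failure with its reasons."""
    good, bad = [], []
    for event in events:
        errors = validate(event)
        if errors:
            # Bad events are kept, not dropped, so the pipeline stays auditable.
            bad.append({"event": event, "errors": errors})
        else:
            good.append(event)
    return good, bad
```

Keeping the failed events together with the reasons they failed is what makes a pipeline like this auditable: every event that came in is accounted for, on one side or the other.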
My talk focused on these issues in the context of providing an overview of event pipelines in general from a data processing perspective. The talk was videoed – you can view it below:
I wasn’t the only person talking about event data pipelines. Scott Krueger from Skyscanner gave an excellent talk on the Unified Log Infrastructure at Skyscanner, where they make extensive use of both Kafka and Samza.
Sergii Khomenko from StyLight gave a great presentation on the different cloud technologies available for building event data pipelines.
Data reservoirs, lakes, and swamps. And data provenance
One of the great things about conferences is that they bring together people attacking similar problems from different angles. For me, one of the most interesting talks was that given by Stephen Brobst from Teradata and Scott Gnau from Hortonworks on “Unified Data Architecture”. They distinguished between a data lake, where data from all parts of the business is accumulated, and “Enterprise Data Products”, which are derived from the data lake, and where data is accessible for production purposes.
This view of the world makes sense if you’re Hortonworks (in which case you sell “data lakes”) or Teradata (in which case you sell “enterprise data products”). But it’s a little bit puzzling if you look at the space from the event data pipeline perspective, because data is taken as a ‘given’, i.e. you have a lot of data, you accumulate it in your data lake, and then over time you use that data to build out your enterprise data product. In practice, many companies do have lots of data they could do more with. But I believe that at least as much effort should be spent on capturing good quality data at source as on accumulating what you’ve already got.
The other interesting aspect of this talk was the exploration of the difference between a “data swamp” and a “data reservoir”. Metadata management, i.e. understanding the source and structure of the data, including what you are and are not allowed to do with it, is key to ensuring that the data can actually be used effectively. They referred to this as “data provenance”. Again, it seems to me essential to capture this metadata with the data at source, and to keep that metadata with the data itself wherever the data happens to be. Viewing data infrastructure as pipelines seems a much more useful paradigm to me than a focus on the parts of the pipelines where the data accumulates.
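As a rough illustration of what keeping the metadata with the data could look like, here is a small Python sketch. It was not presented in the talk and is only one possible approach: the `Envelope` class, its fields and the `transform` and `can_use` helpers are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class Envelope:
    """A record plus the metadata captured with it at source (hypothetical)."""
    payload: Dict[str, Any]
    source: str                      # where the data was captured
    schema: str                      # which structure the payload follows
    permitted_uses: FrozenSet[str]   # what we are allowed to do with it
    lineage: Tuple[str, ...] = ()    # processing steps applied so far


def transform(env: Envelope, step_name: str,
              fn: Callable[[Dict[str, Any]], Dict[str, Any]]) -> Envelope:
    """Apply fn to the payload, carrying the metadata along and recording the step."""
    return replace(env, payload=fn(env.payload), lineage=env.lineage + (step_name,))


def can_use(env: Envelope, purpose: str) -> bool:
    """Check a purpose against the permissions that travelled with the data."""
    return purpose in env.permitted_uses
```

Because every step returns a new envelope with the same source, schema and permissions, the provenance is available wherever the record ends up, rather than living only in the place where the data accumulates.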
Unfortunately a video of that presentation is not currently available – check out the Crunch Conference website to see if that changes.
Building a culture of A/B testing at Pinterest
Speakers at data conferences tend to focus on technology and analytics. Often, those challenges are a lot easier to solve than the organisational challenges associated with getting people to use data to drive decision making in intelligent ways.
It was therefore enormously refreshing to hear Andrea Burbank from Pinterest give a superb presentation on how she’d built a culture of A/B testing at Pinterest: this is essential viewing for anyone looking to make their company data-driven.
Thank you Prezi and Ustream!
Enormous thanks to the folks at Prezi and Ustream for organising this awesome event. I hope this is the first of many :-).