Budapest Data round-up

13 June 2014  •  Alex Dean

So the Budapest Data event (aka Budapest DW Forum) is over for another year - a huge thanks to Bence Arató and the whole team for organizing another excellent conference!

In this blog post I want to share my two talks and my “Zero to Hadoop” workshop with the wider Snowplow community.

My first talk was on the Wednesday afternoon. I spoke about our process of porting Snowplow to Kinesis, which gives our users access to their Snowplow event stream in near real time. Areas I talked about included:

  • “Hero” use cases for event streaming which drove our adoption of Kinesis
  • Why we waited for Kinesis, and thoughts on how Kinesis fits into the wider streaming ecosystem
  • How Snowplow achieved a lambda architecture with minimal code duplication, allowing Snowplow users to choose either platform, or to run both
  • Key considerations when moving from a batch mindset to a streaming mindset – including aggregate windows, recomputation, backpressure

Here are the slides:

Many thanks to Gergely Daróczi of Rapporter for facilitating.

Read on after the fold for the slides from my second talk and workshop…

On the Thursday evening I gave a flash talk about data schemas and Snowplow at the Budapest Big Data Meetup, alongside other talks by Wouter De Bie (Spotify), Claudio Martella (Apache Giraph) and Stephan Ewen (Stratosphere).

Here are the slides:

You can read some more commentary on these slides in Yali’s blog post from last week.

It was great hearing more about the Apache Giraph and (soon to be Apache) Stratosphere projects - we hope to try out both of these for Snowplow use cases soon!

Hadoop is everywhere these days, but it can seem like a complex, intimidating ecosystem to those who have yet to jump in. On the Friday afternoon I gave a three-hour Hadoop workshop, with the goal of getting conference attendees with no prior Hadoop experience to write and run jobs on Elastic MapReduce.

It was a lot of fun - setting up VirtualBox and Vagrant took a lot longer than I had expected, but once this was done we were able to work through the first of my three example Hadoop jobs together. Unfortunately we ran out of time to tackle the two tutorial Scalding jobs - next time!

Here are the slides for the workshop (any credentials etc. have been deleted since the workshop):

Many thanks to Tamás Izsák for his help organizing the workshop!

I had a great time at Budapest Data - met many new people, learnt about some great open source projects, and had an opportunity to talk about some of the most exciting aspects of what we’re doing with Snowplow.

Giving the Hadoop workshop was also a great experience: it made me realize that the big data and Hadoop communities need to do a lot more in terms of outreach to help seasoned BI and data warehousing practitioners to “jump the fence” into the world of big data, MapReduce and stream processing.

Thanks again to Bence Arató and the whole team for organizing Budapest Data!