The Snowplow development roadmap for the ETL step - from ETL to enrichment

09 January 2013

In this blog post, we outline our plans to develop the ETL (“extract, transform and load”) part of the Snowplow stack. Although in many respects the least sexy element of the stack, it is critical to Snowplow, and we intend to re-architect the ETL step in quite significant ways. In this post, we discuss our plans and the rationale behind them, in the hope to get:

  1. Feedback from the community on them
  2. Ideas for alternative approaches or new features

We will cover:

  1. Recap: the point of the ETL step
  2. Limitations with the current, Hive-based ETL process
  3. From ETL to enrichment: what we want the ETL step to achieve
  4. Towards a real-time ETL: speeding things up
  5. Moving to Cascading / Scalding: what we plan to do
  6. Benefits of this approach: both in the short and long term

To get the conversation started, a conceptual map of the new ETL process is shown below:

The primary purpose of the ETL step is to parse the logs generated by the Snowplow collector(s) and push the data stored into one or more storage facilities (e.g. S3, Infobright) where it can be accessed by analytic tools. However, there are two complexities that have to be dealt with:

  1. Checking data quality and resolving any issues. Sometimes, the Snowplow tracker has not been correctly configured; sometimes, there may even be a bug in a tracker or collector, which means that the log files contain errors. In an ideal world, the ETL step should validate the lines of data in the logs, push data through to storage when the data is good quality, and initiate a process for handling malformed data in the unfortunate cases when it is not. (Note: most web analytics programmes do not support this, so if you haven’t set your tracking up properly and haven’t been logging data correctly for a couple of months - tough - there’s no way of fixing it.) By flagging malformed data quickly, the ETL step should also provide the ops team with a good guide to review the tracker and collector setup, and correct any mistakes.
  2. Supporting multiple storage options. We want Snowplow to support the widest range of analytics: encompassing OLAP style aggregations slicing and dicing of data, Mahout-like machine learning and Sky-like event stream analytics. The ETL step has to be powerful enough to push data into multiple locations in an efficient manner, and support pushing different cuts and structures of the data into each of those different storage options as required.

The current ETL process is based on Hive, which processes Cloudfront-formatted log files containing querystrings matching the Snowplow tracker protocol using a custom deserializer. This was a good option to build an initial prototype of the ETL step: it enabled us to query data in the raw logs directly, and made it relatively straightforward to transfer the data from the Snowplow log format into a more standard format suitable for faster querying in Hive or importing into Infobright.

However, there are a number of limitations to the Hive-based ETL process:

  1. It makes error handling very difficult. Either a row is processed, or it is not. There’s no option to build more sophisticated data processing pipelines including flows to divert malformed data, spot the source of the data quality problem and address it.
  2. It is a tightly-coupled process: all the parsing on the entire row is performed by the custom deserializer. If something goes wrong, it is hard to debug what went wrong. If we want to extend part of the ETL process, we have to go in and upgrade the deserializer or the HiveQL wrapper scripts. As the conceptual map of our proposed ETL shown at the top of this post demonstrates, our ideal ETL process consists of multiple steps. These should be decoupled for robustness and ease of extension.
  3. It is hard to extend the ETL process to build enrichments of the data. (See the next section.)

The initial purpose of the ETL step was quite narrow: to move data generated by the collectors into the different storage options for analytics. Since then, we have realised that there are a number of important enrichments that can be performed on the data, that are best done as part of the ETL step, so that they are available when the data comes to be analysed. Examples include:

  1. Inferring location from user_ipaddress e.g. using Maxmind or Digital Element
  2. Inferring marketing parameters (source, medium, keywords) by processing referrer url and query strings using referer-parser. This would include identifying search engine originated traffic and social network originated traffic, for example

In addition, decomposing some of the fields into constituent elements can make analysis easier: for example, breaking up page_url and referrer_url into host, domain, path and query string, can enable us to easily group visits by referer domain or path, depending on granularity of analysis we’re performing.

The majority of Snowplow users run their ETL process daily, so that yesterday’s data is available today.

We need to move the whole Snowplow stack so that data is available for analytics faster. Doing so will be welcomed by analysts crunching Snowplow data, but perhaps more significantly, it will open up the possibility of building real-time response engines based on Snowplow data: these might include things like retargeting users who’ve performed specific actions with display ads or emails, or personalising the content shown to a user based on their recent browsing history, on the fly.

There is limited scope to speed up the current Hive-based ETL process. However, there are lots of interesting opportunities that arise if we consider an alternative archtiecture, especially one that moves us closer to a stream-based data processing model.

We intend to replace the current Hive-based ETL process with one based on the Scala library that runs on top of Cascading, known as Scalding.

Cascading is an application framework specifically designed to build robust data pipelines using Hadoop. We intend to use it to build the pipeline sketched above.

By rearchitecting the ETL using Scalding / Cascading, we hope to realise the following benefits in the short-term:

  1. Deliver enrichments on the data: in particular, classify visits based on referer, and locate users via geo-ip
  2. Improved handling of malformed data: making it easier to spot bugs in Snowplow, mistakes in tracker or collector setup, and the ability to fix and reprocess malformed data
  3. Make it easier to run the ETL process more frequently, so that Snowplow data is more up-to-date

In the long term there are a number of important benefits we hope moving to Scalding will help us realise:

  1. Expand the ETL to output data in a format suitable for OLAP reporting. Currently, users who want to use OLAP tools e.g. Tableau, Pentaho or Microstrategy, to report on Snowplow data, need to transform that data prior to running those tools on top of it. We want to build out the ETL process to output two versions of the data: the raw event field (as it currently does) and a cube-formatted version that can be used directly with these tools. Delivering this with the current Hive-based process would be incredibly difficult.
  2. Move towards a real-time engine. In order to deliver data in real-time, Snowplow ETL would need to move from a Hadoop, batch-based process into a stream-based process, likely using Storm. Porting the data pipeline from Cascading to Storm should be significantly easier than porting it from Hive to Storm: as such, Cascading provides a useful stepping stone on our journey to deliver real time event-level analytics.
  3. Make it easier to support a wider range of collector log formats. Because the ETL process is decoupled, handling a different log file format means only updating the first processing step in the data pipeline that parses the raw collector logs. That means building out the ETL to support other collectors (e.g. SnowCannon) should be much simpler.
  4. Make it easier to support a growing range of event types. As should be clear from the data pipeline flowchart, seven event types are currently supported, each with their own set of fields. (Page views, page pings, link clicks, custom events, ad impressions, transaction events and transaction items.) That list is only likely to grow over time. By clearly differentiating each of them in the data pipeline, a Scalding-based ETL process should be easier to extend to support a greater range of events.

We want your feedback

We’ve been very lucky to have community members contribute an enormous number of fantastic ideas and code that we’ve been able to incorporate into Snowplow. We’ve shared our roadmap for the ETL step and our rationale for that roadmap to see what you think. Does our approach sound sensible? What should we do differently? What can we add to it to make it more robust and valuable?