We are excited to announce the immediate availability of Snowplow 64, Palila. This is a major release which adds a new data modeling stage to the Snowplow pipeline, as well as fixes a small number of important bugs across the rest of Snowplow.
In this post, we will cover:
- Why model your Snowplow data?
- Understanding how the data modeling takes place
- The basic Snowplow data model
- Implementing the SQL Runner data model
- Implementing the Looker data model
- SQL Runner
- Other updates in this release
- Upgrading your Snowplow pipeline
- Data modeling roadmap
- Documentation and help
The data collection and enrichment process produces an event stream, a long list of packets of data where each packet represents a single event. While it is possible to do analysis directly on this event stream, it is common to:
- join Snowplow data with other data sets (e.g. customer, marketing, CMS, financial data)
- aggregate the event-level data into data tables that represent entities that are commonly queried e.g. visitors, sessions, applications, products, articles etc.
This serves three purposes:
- The same underlying business logic is applied to the data before different users query it, so that all users who query the data are working from the same underlying set of assumptions and definitions. When we join and aggregate the event level data, we are actually applying business logic to the underlying event-level data. How we aggregate the data will be a function of, for example:
- How we identify users
- How we define sessions
- What dimensions we are interested in, for the key entities that matter to our business
- It makes querying more efficient. The aggregate tables are typically simpler to compose queries against, and can be integrated directly into a business intelligence or pivoting tool
- It makes querying faster because the aggregate data sets are smaller than the event-level data, with
sortkeysoptimized to support fast joins.
We call this process data modeling.
The data modeling process will be very familiar to existing Snowplow users who are also using Looker. That is because Looker provides a powerful and flexible framework to enable users to develop and iterate on their data models. Indeed, working with joint Snowplow-Looker customers has been very helpful in driving our thinking about how best to perform data modeling.
It should also be familiar to many other Snowplow users. We’ve worked with a number of Snowplow clients who do not use Looker, and nearly all of them end up creating aggregate tables to power queries.
In spite of Looker’s great data modeling functionality, there are a number of reasons we chose to make data modeling a core step in the Snowplow pipeline:
- For users who are not using Looker, having a formal framework for managing their data modeling process is very valuable, as we have seen with the users we have implemented this for to date.
- Looker users with very large data volumes, or data pipelines that run frequently, will struggle to generate their aggregate tables (persistent derived tables in Looker terminology) fast enough if the process is managed via Looker. The reason is that Looker rebuilds these tables from scratch each time they need to be updated (i.e. with every data load). By moving the process of generating those tables outside of Looker, it is possible to update an existing set of tables based just on the new data that has landed with the last data pipeline run, which is significantly faster than reprocessing based on the entire data set. (We’ve seen queries data modeling queries that took hours compress down to executing in less than 10 minutes.)
- We found that Looker users often wanted to make their derived tables available to other applications and processes.
So this release offers something for Looker and non-Looker customers alike.
The data modeling takes place in the data warehouse, once new data has been loaded by the StorageLoader.
The process works as follows:
- Instead of loading new data directly into the
atomicschema, data is loaded into a new
- A set of SQL statements are run on the incremental data, that effectively update a set of aggregate tables in the
- On completion of those SQL statements, the data is moved from the
snowplow_landingschema to the
atomicschema. (This is done directly via SQL.)
This release comes with a basic data model, a set of SQL queries which aggregate event-level data in Redshift into:
- A sessions table (using client-side sessionization)
- A visitors table (using cookies for identity stitching)
- A page views table
This basic model is meant as an exemplar: it can be useful place for new Snowplow users to start modeling their data. In general, however, we expect data models to be pretty company-specific, reflecting the fact that:
- Different companies record different events across different channels
- Different companies track the state of different entities over time
- Different companies have different business questions they want to ask of the data
Palila also comes with an updated Looker data model, which is based on the same set of SQL queries and can be implemented and modified from the Looker UI.
Both models make minimal assumptions about the internal business logic. What tables are produced and what fields available in each one of them, varies widely between companies in different sectors, and surprisingly even within the same vertical.
The basic data model comes with 2 different sets of SQL queries:
- In full mode, the derived tables are recalculated from scratch (i.e. using all events) each time the pipeline runs
- In incremental mode, the tables are updated using only the most recent events
These two version are usually used at different stages in the development/implementation process. We recommend users start with the basic model setup in full mode. Although it is less efficient to recompute the tables from scratch each time, it is easier to iterate the business logic and underlying SQL in the data modeling process when you recompute the data from scratch. We find that users typically iterate on the models very frequently to start off with, but that this frequency decreases markedly over time.
At the point where the models become relatively stable, it then becomes sensible to migrate to the incremental model. This migration can be delayed until such time that the data volume gets too big to make recomputing the tables from scratch each time practical.
Snowplow users who are getting started or trialling Looker may wish to use the Looker models as a starting point to develop their own models.
It should be reasonably straightforward to copy the model into your LookML repository. (This is easiest if it can be done locally and then pushed to Git, rather than through the Looker UI.) Don’t forget to update the connection name to reflect the name you’ve given to your Snowplow data connection.
SQL Runner is an open source app, written in Go, that makes it easy to execute SQL statements programmatically as part of the Snowplow data pipeline.
To use SQL Runner, you assemble a playbook i.e. a YAML file that lists the different
.sql files to be run and the database they are to be run against. It is possible to specify which sequence the files should be run, and to run files in parallel.
The Palila release includes both the underlying SQL and the associated playbooks for running them. For more information on SQL Runner please view the repo.
There are four important changes in this release:
- Since 6th April 2015, all new Elastic MapReduce users have been required to use IAM roles with EMR. EmrEtlRunner did not previously support this requirement – but thanks to Elasticity author Rob Slifka’s fantastic support, EmrEtlRunner now supports IAM roles for EMR, and we require these to be used (#1232)
- A bug in our new useragent parsing enrichment has been causing jobs to fail with out of memory errors. This has now been fixed for the Hadoop pipeline – many thanks to community member Dani Solà for identifying this so quickly (#1616). We will fix this for the Kinesis pipeline in the next Kinesis release
- Dani also flagged that our new
atomic.eventsis too short to support Google’s
gclidparameter, which can range from 25 to 100 chars. We have widened this column in both Redshift and Postgres (#1606 and #1603 respectively)
- Thanks to community member Morten Petersen for spotting that r63’s Postgres migration script was missing its
user_idfield. This has now been fixed (#1620)
In the next section we will cover how to upgrade Snowplow to include these fixes.
You need to update EmrEtlRunner to the latest code (0.14.0) on GitHub:
From this release onwards, you must specify IAM roles for Elastic MapReduce to use. If you have not already done so, you can create these default EMR roles using the AWS Command Line Interface, like so:
Now update your EmrEtlRunner’s
config.yml file to add the default roles you just created:
This release also bumps the Hadoop Enrichment process to version 0.14.1. Update
config.yml like so:
For a complete example, see our sample
This release widens the
mkt_clickid field in
atomic.events. You need to use the appropriate migration script to update to the new table definition:
And that’s it – you should be fully upgraded.
The data modeling step in Snowplow
64 is still very new and experimental — we’re excited to see how it plays out and look forward to the community’s feedback.
There are a number of ways that we can improve the data modeling functionality – these are just some of our ideas, and we’ve love to bounce them off you, our users:
- Move the data modeling out of SQL (and Redshift in particular) into EMR (for batch-based processing) or Spark Streaming (for users on the real-time pipeline). This would take a lot of load of the database, and mean that we could express the data modeling in a better suited language. We’ve been impressed by users who’ve shown us how they’ve performed this process in tools including Scalding and Cascalog
- Building on the above, we’re very interested to figure out what the best way is of expressing the data modeling process. Potentially we could develop a DSL for this. Ideally, we would want to make it possible to express once, and then implement in a range of environments (i.e. stream processing, batch-processing and in-database)
In the shorter term we also plan to extend our data modeling documentation to cover common design patterns, including:
- Cross platform identity stitching for logged-in users
- Custom sessionization logic, including for users tracked from mobile apps and marrying server and client-side session data
- Integrating third party user identification services
Keep checking the blog and Analytics Cookbook for updates!
For more details on this release, please check out the R64 Palila Release Notes on GitHub.