The data collection and enrichment process produces an event stream, a long list of packets of data where each packet represents a single event. While it is possible to do analysis directly on this event stream, it is common to:
This serves three purposes:
`sortkeys` optimized to support fast joins.
We call this process data modeling.
The data modeling process will be very familiar to existing Snowplow users who are also using Looker. That is because Looker provides a powerful and flexible framework to enable users to develop and iterate on their data models. Indeed, working with joint Snowplow-Looker customers has been very helpful in driving our thinking about how best to perform data modeling.
It should also be familiar to many other Snowplow users. We’ve worked with a number of Snowplow clients who do not use Looker, and nearly all of them end up creating aggregate tables to power queries.
In spite of Looker’s great data modeling functionality, there are a number of reasons we chose to make data modeling a core step in the Snowplow pipeline:
So this release offers something for Looker and non-Looker customers alike.
The data modeling takes place in the data warehouse, once new data has been loaded by the StorageLoader.
The process works as follows:
`atomic` schema, data is loaded into a new
`snowplow_landing` schema to the
`atomic` schema. (This is done directly via SQL.)
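As an illustration of this landing-schema pattern, here is a minimal sketch of the final move from `snowplow_landing` to `atomic`. The table names follow the description above, but the exact SQL shipped with the release may differ:

```sql
-- Hypothetical sketch: after the data-modeling queries have run
-- against snowplow_landing, append the new events to atomic and
-- clear the landing area, inside a single transaction.
BEGIN;

INSERT INTO atomic.events
  SELECT * FROM snowplow_landing.events;

DELETE FROM snowplow_landing.events;

COMMIT;
```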
The process is orchestrated using our SQL Runner application, described later in this blog post.
This release comes with a basic data model, a set of SQL queries which aggregate event-level data in Redshift into:
This basic model is meant as an exemplar: it can be useful place for new Snowplow users to start modeling their data. In general, however, we expect data models to be pretty company-specific, reflecting the fact that:
Palila also comes with an updated Looker data model, which is based on the same set of SQL queries and can be implemented and modified from the Looker UI.
Both models make minimal assumptions about internal business logic. Which tables are produced, and which fields are available in each of them, varies widely between companies in different sectors, and surprisingly even between companies in the same vertical.
The basic data model comes with 2 different sets of SQL queries:
These two versions are usually used at different stages of the development/implementation process. We recommend users start with the basic model in full mode. Although it is less efficient to recompute the tables from scratch each time, it is easier to iterate on the business logic and the underlying SQL when the data is recomputed from scratch. We find that users typically iterate on the models very frequently to start off with, but that this frequency decreases markedly over time.
Once the models become relatively stable, it makes sense to migrate to the incremental model. This migration can be delayed until the data volume grows too large for recomputing the tables from scratch to remain practical.
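The difference between the two modes can be sketched as follows. The `derived.sessions` table and its columns are illustrative, not the tables shipped in the release, although `domain_sessionid` and `collector_tstamp` are standard `atomic.events` fields:

```sql
-- Full mode: recompute the derived table from scratch on every run.
DROP TABLE IF EXISTS derived.sessions;
CREATE TABLE derived.sessions AS
  SELECT domain_sessionid, MIN(collector_tstamp) AS session_start
  FROM atomic.events
  GROUP BY 1;

-- Incremental mode: aggregate only the events that arrived since the
-- last run (here, those still in the landing schema) and append them.
INSERT INTO derived.sessions
  SELECT domain_sessionid, MIN(collector_tstamp) AS session_start
  FROM snowplow_landing.events
  GROUP BY 1;
```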
This whole process is described in more detail in the setup guide and in our Analytics Cookbook.
Snowplow users who are getting started or trialling Looker may wish to use the Looker models as a starting point to develop their own models.
It should be reasonably straightforward to copy the model into your LookML repository. (This is easiest if it can be done locally and then pushed to Git, rather than through the Looker UI.) Don’t forget to update the connection name to reflect the name you’ve given to your Snowplow data connection.
SQL Runner is an open source app, written in Go, that makes it easy to execute SQL statements programmatically as part of the Snowplow data pipeline.
To use SQL Runner, you assemble a playbook, i.e. a YAML file that lists the different `.sql` files to be run and the database they are to be run against. It is possible to specify the sequence in which the files should be run, and to run files in parallel.
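A playbook might look something like the sketch below. The field names here are indicative only; please check the SQL Runner repo for the authoritative playbook format:

```yaml
:targets:
  - :name: "My Redshift database"        # friendly name for this target
    :type: redshift
    :host: my-cluster.example.us-east-1.redshift.amazonaws.com
    :database: snowplow
    :port: 5439
    :username: storageloader
    :password: my-password
:steps:
  - :name: "Recompute sessions"          # steps run in sequence
    :queries:
      - :name: "Sessions"                # queries within a step can run in parallel
        :file: sql/recalculate-sessions.sql
```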
The Palila release includes both the underlying SQL and the associated playbooks for running it. For more information on SQL Runner, please see the repo.
There are four important changes in this release:
The `mkt_clickid` field in `atomic.events` is too short to support Google's
`gclid` parameter, which can range from 25 to 100 characters. We have widened this column in both Redshift and Postgres (#1606 and #1603 respectively)
`user_id` field. This has now been fixed (#1620)
In the next section we will cover how to upgrade Snowplow to include these fixes.
You need to update EmrEtlRunner to the latest code (0.14.0) on GitHub:
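Assuming you run EmrEtlRunner from a checkout of the snowplow/snowplow repository, the update looks something like this (the tag name below is illustrative; use the tag for this release):

```shell
# From your existing checkout of the Snowplow repo
git pull
# Check out the tag for this release (tag name is illustrative)
git checkout r64-palila
```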
From this release onwards, you must specify IAM roles for Elastic MapReduce to use. If you have not already done so, you can create these default EMR roles using the AWS Command Line Interface, like so:
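With the AWS CLI installed and configured, the default roles can be created with a single command:

```shell
# Creates EMR_DefaultRole and EMR_EC2_DefaultRole if they do not already exist
aws emr create-default-roles
```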
Now update your EmrEtlRunner's `config.yml` file to add the default roles you just created:
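A sketch of the relevant fragment, sitting under the `:emr:` section of the file (your other EMR settings will differ and are omitted here):

```yaml
:emr:
  # Default roles created via the AWS CLI
  :jobflow_role: EMR_EC2_DefaultRole
  :service_role: EMR_DefaultRole
```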
This release also bumps the Hadoop Enrichment process to version 0.14.1. Update `config.yml` like so:
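In the `:versions:` section of the file, bump the Hadoop Enrich entry (the previous value shown is an assumption):

```yaml
:versions:
  :hadoop_enrich: 0.14.1  # was e.g. 0.14.0
```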
For a complete example, see our sample
This release widens the `mkt_clickid` field in `atomic.events`. You need to use the appropriate migration script to update to the new table definition:
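For Postgres, the widening can also be expressed as a simple `ALTER TABLE`; the target length here is an assumption, so treat the migration script shipped with the release as authoritative (Redshift users in particular should use the script, as Redshift does not support this form of column alteration):

```sql
-- Widen mkt_clickid so it can hold Google's gclid values
-- (target length is illustrative)
ALTER TABLE atomic.events
  ALTER COLUMN mkt_clickid TYPE varchar(128);
```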
And that’s it - you should be fully upgraded.
The data modeling step in Snowplow 64 is still very new and experimental — we’re excited to see how it plays out and look forward to the community’s feedback.
There are a number of ways that we can improve the data modeling functionality. These are just some of our ideas, and we'd love to bounce them off you, our users:
In the shorter term we also plan to extend our data modeling documentation to cover common design patterns, including:
Keep checking the blog and Analytics Cookbook for updates!
For more details on this release, please check out the R64 Palila Release Notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.