Introducing a new generation of our web data model

Cara Baestlein

We are very excited to announce the release of Redshift Web model v1.1. This is the first in a series of planned releases intended to address a hugely important need for Snowplow: extensible, scalable, incremental data modeling. Next, we will be working on BigQuery and Snowflake versions of this model, as well as a standard mobile model.

A brief history of data modeling with Snowplow

One of the biggest advantages of using Snowplow is that it gives you full ownership of your raw, unopinionated data. This data is then aggregated and transformed using business logic in order to produce insights. It is hugely advantageous to have full control over this modeling logic so that you can tailor it specifically to the nuances of your business. Therefore, one of the key drivers of deriving value from Snowplow is building out a data modeling process.

Our initial approach to helping our customers and community with data modeling was to release some example drop-and-recompute web models that aggregated the out-of-the-box tracking from the Snowplow JavaScript tracker into a set of derived tables (page views, sessions and users), with the expectation that users could adapt them to their needs. Over time, however, some of the challenges with this approach have become apparent. The drop-and-recompute structure isn’t suitable for large datasets, and developing an incremental structure isn’t necessarily straightforward. Additionally, without a solid understanding of some of the nuances of how the data is tracked and processed, certain parts of the logic are difficult to reason about.

Over the years, we have developed various incremental models to address these challenges for Snowplow Insights customers. In guiding our customers through customizing and expanding on these models, further challenges have come to light. Firstly, maintaining SQL is incredibly difficult: once a model has been edited, rolling out changes or bug fixes is almost impossible. Secondly, building on a complex structure that has been written by someone else is hard even for the most skilled Analytics Engineer, creating a barrier to customers benefiting from our models to the full extent.

Why data modeling is important

There have been a few shifts in recent years that have hugely impacted the role data modeling plays in a company’s data strategy:

These shifts have meant that data modeling has become a core part of a company’s data infrastructure. Therefore, the ability to test models properly, to easily maintain and upgrade models as tracking and business goals change, and to keep track of model versions is now crucial.

What the new model brings

This new generation of the web model attempts to address these challenges. Specifically, it is designed to implement a SQL-as-software structure:

This structure allows us to separate out the ‘heavy lifting’ of an incremental Snowplow module by extracting the incremental logic into its own ‘base’ module. The base module produces a table that contains only the events relevant to this run of the incremental logic: both new events and those events that require recomputing (for example, because they are part of an ongoing session). The same structure can then be applied to all three tables, i.e. the page views model acts as the base module for the sessions model, and so on.
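To make the idea of a base module concrete, here is a minimal sketch in Python rather than the SQL the model itself is written in. It is an illustration under stated assumptions, not the model’s actual logic: the Event fields, the events_this_run function and the single last_run_tstamp watermark are all hypothetical names chosen for this example, and a real implementation would also need to bound how far back it looks for ongoing sessions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    # Hypothetical, simplified event record (not the real Snowplow schema).
    event_id: str
    session_id: str
    collector_tstamp: int  # simplified to an integer timestamp


def events_this_run(all_events: List[Event], last_run_tstamp: int) -> List[Event]:
    """Select the events relevant to this run of the incremental logic."""
    # Sessions that received at least one event since the previous run.
    touched = {
        e.session_id for e in all_events if e.collector_tstamp > last_run_tstamp
    }
    # Keep every event of those sessions: the new events plus the older
    # events that need recomputing because their session is still ongoing.
    return [e for e in all_events if e.session_id in touched]


events = [
    Event("e1", "s1", 10),  # old event, but its session continues below
    Event("e2", "s1", 20),  # new event since the last run
    Event("e3", "s2", 5),   # old event in a session with no new activity
]

base = events_this_run(events, last_run_tstamp=15)
print([e.event_id for e in base])  # ['e1', 'e2']
```

Session s2 is left out entirely because nothing about it has changed, which is what keeps each run small even when the full events table is large.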

This structure and approach have two key benefits. First, they remove the complexity from customization: all subsequent logic can operate on this input as if it were a simple drop-and-recompute model, while the model’s structure ensures an efficient incremental update. This means that the end user only needs to be concerned with the aggregation logic they care about, rather than expending effort on making that logic work within a complex structure. Second, they simplify maintenance and upgrades, as the standard (Snowplow-maintained) and custom aspects of the model are separate modules.
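The separation between standard and custom logic can be sketched as follows. This is again an illustrative Python example, not the model’s SQL: standard_sessions, custom_sessions, upsert and the row fields are hypothetical names, and the base rows are assumed to have already been limited to the events relevant to this run.

```python
from typing import Dict, List

BaseRow = Dict[str, str]
Sessions = Dict[str, Dict[str, int]]


def standard_sessions(base: List[BaseRow]) -> Sessions:
    """Stand-in for a maintained module: counts page views per session."""
    out: Sessions = {}
    for row in base:
        session = out.setdefault(row["session_id"], {"page_views": 0})
        session["page_views"] += 1
    return out


def custom_sessions(base: List[BaseRow]) -> Sessions:
    """Stand-in for a user-owned module: counts checkout views per session."""
    out: Sessions = {}
    for row in base:
        session = out.setdefault(row["session_id"], {"checkout_views": 0})
        if row["page"] == "/checkout":
            session["checkout_views"] += 1
    return out


def upsert(derived: Sessions, recomputed: Sessions) -> None:
    """Overwrite only the sessions recomputed in this run; leave the rest."""
    for session_id, values in recomputed.items():
        derived.setdefault(session_id, {}).update(values)


# Derived table from earlier runs; s2 has no new events in this run.
derived: Sessions = {"s2": {"page_views": 1, "checkout_views": 0}}

# Base rows for this run (assumed output of the incremental base module).
base = [
    {"session_id": "s1", "page": "/home"},
    {"session_id": "s1", "page": "/checkout"},
]

# Each module is written as a plain recompute over the base rows.
for module in (standard_sessions, custom_sessions):
    upsert(derived, module(base))

print(derived)
```

Neither aggregation function contains any incremental logic, and the custom function can be changed or replaced without touching the standard one.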

Additional features introduced

We have also introduced some smaller but promising features, such as feature flags, metadata logging, and a more robust testing framework built on the excellent Great Expectations framework.

More information

For more information on the model structure and a quickstart guide, take a look at the technical documentation as well as the README in the GitHub repository.

For a general introduction to Snowplow’s approach to data modeling, check out our 4-part webinar series.
