The current data models were introduced with Snowplow 64 Palila and aggregate web data into page views, sessions and visitors (the derived tables). This release contains an improved version of the incremental model, which updates the derived tables using only the events from the most recent batch.
The SQL queries were rewritten from the ground up. This, together with other changes, reduces disk usage by up to 10% and makes execution up to 5 times faster. For a user with 10 to 20 million events per batch, we have seen the execution time drop from 30 minutes to 6.
The following diagram illustrates how the incremental model updates the derived tables:
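The update pattern can be sketched as follows. This is a minimal illustration, not the actual Snowplow queries; the scratch and derived table and column names here are hypothetical:

```sql
-- Sketch of the incremental update pattern (hypothetical table names).
-- Assumes scratch.sessions_new holds session aggregates computed from
-- the most recent batch of events only.

BEGIN;

-- Remove the old rows for any session touched by the new batch
DELETE FROM derived.sessions
WHERE session_id IN (SELECT session_id FROM scratch.sessions_new);

-- Insert the freshly recomputed rows for those sessions
INSERT INTO derived.sessions
SELECT * FROM scratch.sessions_new;

COMMIT;
```

Because only sessions present in the latest batch are deleted and reinserted, the cost of each run scales with the batch size rather than with the full history of events.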
This release includes a new mobile SQL data model.
This model takes events from the mobile context and aggregates them into sessions and users. It’s an example of a data model that uses server-side, rather than client-side, sessionization. The techniques used to sessionize events will be discussed in a future blog post.
This release also includes a new model that deduplicates events in atomic.events, and in doing so ensures that the event ID is unique. This addresses an issue where a small percentage of rows have the same event ID.
Duplicate events are either natural or synthetic copies. Natural copies are true duplicates (i.e. the entire event is duplicated) and are introduced because the Snowplow pipeline is set up to guarantee that each event is processed at least once. Synthetic copies are produced external to Snowplow by, for example, browser pre-cachers and web scrapers. These copies have the same event ID, but parts of the rest of the event can be different.
This new model deduplicates natural copies and moves synthetic copies to atomic.duplicated_events. This ensures that the event ID in atomic.events is unique. The issue of duplicate events will be discussed in more detail in a subsequent blog post.
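In outline, natural copies can be removed by keeping one of each set of fully identical rows, after which any event ID that still occurs more than once must belong to synthetic copies, which are moved aside. A minimal sketch of that logic, assuming a single deduplication run against an empty atomic.duplicated_events table (the actual queries ship with the release):

```sql
-- Sketch only, not the shipped queries.

-- Step 1: natural copies are identical in every column, so keeping
-- distinct rows removes them.
CREATE TABLE atomic.events_deduped AS
SELECT DISTINCT * FROM atomic.events;

-- Step 2: any event ID still occurring more than once now marks
-- synthetic copies (same ID, different contents); move them out.
INSERT INTO atomic.duplicated_events
SELECT * FROM atomic.events_deduped
WHERE event_id IN (
  SELECT event_id
  FROM atomic.events_deduped
  GROUP BY event_id
  HAVING COUNT(*) > 1
);

DELETE FROM atomic.events_deduped
WHERE event_id IN (
  SELECT event_id FROM atomic.duplicated_events
);
```

After these steps, event_id is unique in the deduplicated table, which can then replace atomic.events.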
The SQL data models are a standalone and optional part of the Snowplow pipeline. Users who don’t use the SQL data models are therefore not affected by this release.
To implement the SQL data models, first execute the relevant setup queries in Redshift. Then use SQL Runner to execute the queries on a regular basis. SQL Runner is an open source app that makes it easy to execute SQL statements programmatically as part of the Snowplow data pipeline.
The web and mobile data models come in two variants:
- recalculate models drop and recalculate the derived tables using all events, and can therefore be replaced without having to upgrade the tables.
- incremental models update the derived tables using only the events from the most recent batch. The updated incremental model comes with a migration script.
For more details on this release, check the R69 Blue-Bellied Roller release on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels. For more information on the Blue-Bellied Roller itself, visit Wikipedia.