We are pleased to announce the release of Snowplow 69, Blue-Bellied Roller, which contains new and updated SQL data models. The blue-bellied roller is a beautiful African bird that breeds in a narrow belt from Senegal to the northeast of the Congo. It has a dark green back, a white head, neck and breast, and a blue belly and tail.
This post covers:
- Updated data model: incremental
- New data model: mobile
- New data model: deduplicate
- Implementing and upgrading SQL data models
- Details and questions
1. Updated data model: incremental
The current data models were introduced with Snowplow 64 Palila and aggregate web data into page views, sessions and visitors (the derived tables). This release contains an improved version of the incremental model, which updates the derived tables using only the events from the most recent batch.
The SQL queries were rewritten to:
- Keep intermediate results in memory, rather than writing them out to disk
- Reduce the disk IO needed to update existing entries in the derived table
This, together with other changes, results in a decrease in disk usage of up to 10% and execution times that are up to 5 times faster. For a user with 10 to 20 million events per batch, we have seen the execution time drop from 30 to 6 minutes.
The following diagram illustrates how the incremental models updates the derived tables:
2. New data model: mobile
This release includes a new mobile SQL data model.
This model takes events from the mobile context and aggregates them into sessions and users. It’s an example of a data model that uses server-side, rather than client-side, sessionization. The techniques used to sessionize events will be discussed in a future blog post.
3. New data model: deduplicate
This release also includes a new model that deduplicates events in
atomic.events, and in doing so ensures that the event ID is unique. This addresses an issue where a small percentage of rows have the same event ID.
Duplicate events are either natural or synthetic copies. Natural copies are true duplicates (i.e. the entire event is duplicated) and are introduced because the Snowplow pipeline is set up to guarantee that each event is processed at least once. Synthetic copies are produced external to Snowplow by, for example, browser pre-cachers and web scrapers. These copies have the same event ID, but parts of the rest of the event can be different.
This new model deduplicates natural copies and moves synthetic copies from
atomic.duplicated_events. This ensures that the event ID in
atomic.events is unique. The issue of duplicate events will be discussed in more detail in a subsequent blogpost.
4. Implementing and upgrading SQL data models
The SQL data models are a standalone and optional part of the Snowplow pipeline. Users who don’t use the SQL data models are therefore not affected by this release.
To implement the SQL data models, first execute the relevant setup queries in Redshift. Then use SQL Runner to execute the queries on a regular basis. SQL Runner is an open source app that makes it easy to execute SQL statements programmatically as part of the Snowplow data pipeline.
The web and mobile data models come in two variants:
recalculate models drop and recalculate the derived tables using all events, and can therefore be replaced without having to upgrade the tables.
5. Details and questions
For more details on this release, check the R69 Blue-Bellied Roller release on GitHub.