We are pleased to announce the release of Snowplow 86 Petra. This release introduces additional event de-duplication functionality for our Redshift load process, plus a brand new data model that makes it easier to get started with web data. This release also adds support for AWS’s newest regions: Ohio, Montreal and London.
Having exhausted the bird population, we needed a new set of names for our Snowplow releases. We have decided to name this release series after archaeological sites, starting with Petra in Jordan.
Read on after the fold for:
- Synthetic deduplication
- New data model for web data
- Support for new regions
- Upgrading
- Roadmap
- Getting help
1. Synthetic deduplication
1.1 Event duplicates 101
Snowplow users will be familiar with the idea that they can find duplicate events flowing through their pipeline. These duplicate events originate in a few places, including:
- The Snowplow pipeline provides at-least-once delivery semantics: an event can hit a collector twice; a Kinesis worker can be restarted from the last checkpoint
- Some third-party software, such as anti-virus tools or adult-content screeners, can pre-cache HTTP requests, so the same request is sent twice
- UUID generation algorithm flaws cause collisions in the event IDs for totally independent events
We can divide these duplicates into two groups:
- Natural: duplicates with the same event ID and the same payload (which we call the event’s “fingerprint”). These are in fact “real” duplicates, caused mostly by the absence of exactly-once semantics
- Synthetic: duplicates with the same event ID but a different fingerprint, caused by third-party software and UUID clashes
Duplicates significantly skew data modelling: they distort counts and confuse event pathing, and in Redshift, SQL JOINs on duplicated event IDs result in a Cartesian product.
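To make the Cartesian product concrete, here is a minimal sketch using an in-memory SQLite database rather than Redshift. The `events` and `contexts` tables and their columns are simplified, hypothetical stand-ins, not the actual Snowplow table definitions.

```python
import sqlite3

# In-memory database with two simplified, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id TEXT, page TEXT);
    CREATE TABLE contexts (root_id TEXT, browser TEXT);
    -- Event 'e1' is duplicated in both tables; 'e2' is not.
    INSERT INTO events VALUES ('e1', '/home'), ('e1', '/home'), ('e2', '/pricing');
    INSERT INTO contexts VALUES ('e1', 'Firefox'), ('e1', 'Firefox'), ('e2', 'Chrome');
""")

# Join each event to its context and count the resulting rows per event ID.
rows = conn.execute("""
    SELECT e.event_id, COUNT(*) AS joined_rows
    FROM events AS e
    JOIN contexts AS c ON c.root_id = e.event_id
    GROUP BY e.event_id
    ORDER BY e.event_id
""").fetchall()

print(rows)  # [('e1', 4), ('e2', 1)]
```

The duplicated event yields 2 x 2 = 4 joined rows instead of 1, so any count or sum computed over the join is inflated.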
For our original thinking on duplicates, please see the blog post Dealing with duplicate event IDs.
1.2 In-batch synthetic deduplication in Scala Hadoop Shred
The natural duplicates problem in Redshift was initially addressed by the R76 Changeable Hawk-Eagle release, which deletes all but one of each set of natural duplicates. This logic sits in the Scala Hadoop Shred component, which prepares events and contexts for loading into Redshift.
In R86 Petra we’re now introducing new in-batch synthetic deduplication, again as part of Scala Hadoop Shred.
The new functionality eliminates synthetic duplicates through the following steps:
- Group events with the same event ID but different event fingerprint
- Generate a new random UUID to use as each event’s new event ID
- Attach a new duplicate context with the original event ID to provide data lineage
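The steps above can be sketched roughly as follows. This is an illustration in Python, not the Scala Hadoop Shred implementation: the dictionary keys (`event_id`, `event_fingerprint`, `duplicate_context`, `original_event_id`) are hypothetical names chosen for the example, and the sketch assumes natural duplicates have already been removed from the batch.

```python
import uuid
from collections import defaultdict


def deduplicate_synthetic(events):
    """Give every synthetic duplicate a fresh event ID and keep the original ID.

    `events` is a list of dicts with hypothetical keys `event_id` and
    `event_fingerprint`. Natural duplicates are assumed to be gone already.
    """
    # Step 1: group events by event ID.
    by_id = defaultdict(list)
    for event in events:
        by_id[event["event_id"]].append(event)

    result = []
    for original_id, group in by_id.items():
        fingerprints = {e["event_fingerprint"] for e in group}
        if len(fingerprints) < 2:
            # Only one fingerprint for this ID: nothing synthetic here.
            result.extend(group)
            continue
        for event in group:
            updated = dict(event)  # copy, so the input is left untouched
            # Step 2: assign a new random UUID as the event ID.
            updated["event_id"] = str(uuid.uuid4())
            # Step 3: attach the original ID for data lineage.
            updated["duplicate_context"] = {"original_event_id": original_id}
            result.append(updated)
    return result
```

Given two events that share an ID but differ in fingerprint, both come back with distinct new IDs and a context pointing at the shared original ID, while events with unique IDs pass through unchanged.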
Using this approach, we have seen synthetic duplicates almost disappear from our event warehouses.
The next step in our treatment of duplicates will be removing duplicates across ETL runs - also known as cross-batch deduplication. Stay tuned for our upcoming release R8x [HAD] Cross-batch natural deduplication.
2. New data model for web data
This release introduces a new SQL data model that makes it easier to get started with web data. It aggregates the page view and page ping events to create a set of derived tables that contain a lot of detail, including: time engaged, scroll depth, and page performance (three dimensions we often get asked about). The model comes in three variants:
- A straightforward set of SQL queries
- A variant optimized for SQL Runner
- A variant optimized for Looker
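To give a feel for the kind of aggregation involved, here is a rough Python sketch that rolls page view and page ping events up into one row per page view. It is not the SQL from the data model itself: the field names (`page_view_id`, `event`, `tstamp`, `scroll_y`), the 10-second heartbeat and the approach of counting distinct heartbeat buckets are all assumptions made for illustration.

```python
from collections import defaultdict

HEARTBEAT_SECONDS = 10  # assumed page ping interval, not taken from the model


def summarize_page_views(events):
    """Aggregate events into time engaged and maximum scroll depth per page view.

    Each event is a dict with hypothetical keys: `page_view_id`, `event`
    ("page_view" or "page_ping"), `tstamp` (integer epoch seconds) and an
    optional `scroll_y` (pixels scrolled).
    """
    ping_buckets = defaultdict(set)
    max_scroll = defaultdict(int)

    for e in events:
        pv = e["page_view_id"]
        max_scroll[pv] = max(max_scroll[pv], e.get("scroll_y", 0))
        if e["event"] == "page_ping":
            # Pings falling in the same heartbeat window count only once.
            ping_buckets[pv].add(e["tstamp"] // HEARTBEAT_SECONDS)

    return {
        pv: {
            "time_engaged_in_s": HEARTBEAT_SECONDS * len(ping_buckets[pv]),
            "max_scroll_y": max_scroll[pv],
        }
        for pv in max_scroll
    }
```

Counting distinct heartbeat windows, rather than raw pings, keeps the engagement estimate from being inflated when two pings land in the same window.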
3. Support for new regions
We are delighted to be adding support for three new AWS regions:
- Ohio, USA (us-east-2)
- Montreal, Canada (ca-central-1)
- London, UK (eu-west-2)
AWS has a healthy roadmap of new data center regions opening over the coming months; we are committed to Snowplow supporting these new regions as they become available.
4. Upgrading
Upgrading is simple - update the hadoop_shred job version in your configuration YAML like so:
For a complete example, see our sample
You will also need to deploy the following table for Redshift:
5. Roadmap
As well as the cross-batch deduplication mentioned above, upcoming Snowplow releases include:
- R87 Chichen Itza, with various stability improvements for EmrEtlRunner and StorageLoader
- R8x [HAD] 4 webhooks, which will add support for 4 new webhooks (Mailgun, Olark, Unbounce, StatusGator)
- R8x [HAD] DashDB support, the first phase of our support for IBM’s dashDB, per our dashDB RFC
Note that these releases are always subject to change between now and the actual release date.
6. Getting help
For more details on this release, please check out the release notes on GitHub.