Snowplow users will be familiar with the idea that they can find duplicate events flowing through their pipeline. These duplicate events originate in a few places, including trackers retrying sends whose acknowledgment was lost, third-party tools such as bots and spiders replaying requests, and the at-least-once delivery semantics of the pipeline itself.
We can divide these duplicates into two groups:

- Natural duplicates: events which share the same event ID and the same payload (i.e. the same event fingerprint) - true copies of one another
- Synthetic duplicates: events which share the same event ID but have different payloads (different event fingerprints)
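As a rough illustration, a query along the following lines distinguishes the two groups in Redshift. This is an ad-hoc check rather than part of the release, and it assumes the event fingerprint enrichment is enabled so that the `event_fingerprint` column in `atomic.events` is populated:

```sql
-- Ad-hoc check: classify duplicate event IDs in atomic.events.
-- Assumes the event fingerprint enrichment is enabled, so the
-- event_fingerprint column is populated.
SELECT
  event_id,
  COUNT(*) AS copies,
  COUNT(DISTINCT event_fingerprint) AS distinct_fingerprints
  -- distinct_fingerprints = 1 -> natural duplicates (identical payloads)
  -- distinct_fingerprints > 1 -> synthetic duplicates (different payloads)
FROM atomic.events
GROUP BY event_id
HAVING COUNT(*) > 1;
```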
Duplicates cause significant problems in data modelling: they skew counts, confuse event pathing and, in Redshift, SQL JOINs between tables containing duplicates produce a partial Cartesian product.
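To see why, consider a sketch like the following, where `atomic.com_acme_my_context_1` is a hypothetical shredded context table (shredded tables join back to `atomic.events` on `root_id` and `root_tstamp`):

```sql
-- Sketch: duplicate event IDs fan out under a JOIN.
-- If an event ID appears twice in atomic.events and twice in the
-- context table, this join returns 2 x 2 = 4 rows for that event
-- instead of 1.
SELECT
  e.event_id,
  c.my_field
FROM atomic.events e
JOIN atomic.com_acme_my_context_1 c
  ON  c.root_id     = e.event_id
  AND c.root_tstamp = e.collector_tstamp;
```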
For our original thinking on duplicates, please see the blog post Dealing with duplicate event IDs.
The natural duplicates problem in Redshift was initially addressed by the R76 Changeable Hawk-Eagle release, which deletes all but one of each set of natural duplicates. This logic sits in the Scala Hadoop Shred component, which prepares events and contexts for loading into Redshift.
In R86 Petra we’re now introducing new in-batch synthetic deduplication, again as part of Scala Hadoop Shred.
The new functionality eliminates synthetic duplicates through the following steps:

1. Identify all events in the batch which share an event ID but have different event fingerprints
2. Generate a new random UUID and assign it to each of these events as its new event ID
3. Attach a duplicate context to each affected event, preserving the original event ID so that the events can still be tied back together
Using this approach we have seen synthetic duplicates all but disappear from our own event warehouses.
The next step in our treatment of duplicates will be removing duplicates across ETL runs - also known as cross-batch deduplication. Stay tuned for our upcoming release R8x [HAD] Cross-batch natural deduplication.
This release introduces a new SQL data model that makes it easier to get started with web data. It aggregates the page view and page ping events to create a set of derived tables that contain a lot of detail, including time engaged, scroll depth and page performance - three dimensions we are often asked about. The model comes in three variants.
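As a flavor of the kind of logic involved (this is an illustrative sketch, not the model itself), time engaged can be approximated in Redshift by counting page pings per page view. This assumes the web page context is enabled, so every event carries a page view ID in `atomic.com_snowplowanalytics_snowplow_web_page_1`, and a heartbeat of one page ping per 10 seconds (your tracker configuration may differ):

```sql
-- Sketch: approximate time engaged per page view from page pings.
-- Assumes the web page context is enabled and a 10-second page
-- ping heartbeat.
SELECT
  wp.id AS page_view_id,
  COUNT(*) * 10 AS time_engaged_in_s
FROM atomic.events e
JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 wp
  ON  wp.root_id     = e.event_id
  AND wp.root_tstamp = e.collector_tstamp
WHERE e.event = 'page_ping'
GROUP BY 1;
```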
We are delighted to be adding support for three new AWS regions:

- US East (Ohio), us-east-2
- Canada (Montréal), ca-central-1
- EU (London), eu-west-2
AWS has a healthy roadmap of new data center regions opening over the coming months; we are committed to supporting these new regions in Snowplow as they become available.
Upgrading is simple - update the hadoop_shred job version in your configuration YAML like so:
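A sketch of the relevant section of the EmrEtlRunner config follows; the version numbers shown are indicative, so please take the exact values from the release notes:

```yaml
# In your EmrEtlRunner config.yml - versions are indicative;
# confirm the exact hadoop_shred version for R86 in the release notes.
enrich:
  versions:
    hadoop_enrich: 1.8.0   # unchanged by this release (assumed value)
    hadoop_shred: 0.10.0   # the version introducing synthetic deduplication
```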
For a complete example, see our sample config.yml.
You will also need to deploy the following table for Redshift:
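This is the table for the new duplicate context. A sketch of its definition is below, following the standard layout Snowplow uses for shredded context tables in Redshift; treat the DDL that ships with the release as authoritative:

```sql
-- Sketch of the duplicate context table, following the standard
-- layout of Snowplow shredded context tables in Redshift.
CREATE TABLE IF NOT EXISTS atomic.com_snowplowanalytics_snowplow_duplicate_1 (
  -- Schema of this type
  schema_vendor     VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
  schema_name       VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
  schema_format     VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
  schema_version    VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
  -- Parent event
  root_id           CHAR(36)      ENCODE RAW       NOT NULL,
  root_tstamp       TIMESTAMP     ENCODE LZO       NOT NULL,
  ref_root          VARCHAR(255)  ENCODE RUNLENGTH NOT NULL,
  ref_tree          VARCHAR(1500) ENCODE RUNLENGTH NOT NULL,
  ref_parent        VARCHAR(255)  ENCODE RUNLENGTH NOT NULL,
  -- The event ID this event carried before deduplication
  original_event_id CHAR(36)      ENCODE RAW       NOT NULL
)
DISTSTYLE KEY
DISTKEY (root_id)
SORTKEY (root_tstamp);
```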
As well as the cross-batch deduplication mentioned above, we have a number of further Snowplow releases in the works.
Note that these releases are always subject to change between now and the actual release date.
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.