Snowplow 86 Petra released

We are pleased to announce the release of Snowplow 86 Petra. This release introduces additional event de-duplication functionality for our Redshift load process, plus a brand new data model that makes it easier to get started with web data. This release also adds support for AWS’s newest regions: Ohio, Montreal and London.

Having exhausted the bird population, we needed a new set of names for our Snowplow releases. We have decided to name this release series after archaeological sites, starting with Petra in Jordan.

Read on after the fold for:

  1. Synthetic deduplication
  2. New data model for web data
  3. Support for new regions
  4. Upgrading
  5. Roadmap
  6. Getting help


1. Synthetic deduplication

1.1 Event duplicates 101

Snowplow users will be familiar with the idea that they can find duplicate events flowing through their pipeline. These duplicate events originate in a few places, including:

  • The Snowplow pipeline provides at-least-once delivery semantics: an event can hit a collector twice, or a Kinesis worker can be restarted from its last checkpoint
  • Some third-party software, such as anti-virus tools or adult-content screeners, can pre-cache HTTP requests, causing them to be sent twice
  • Flaws in UUID generation algorithms can cause event ID collisions between totally independent events

We can divide these duplicates into two groups:

  1. Natural: duplicates with the same event ID and the same payload (which we call the event’s “fingerprint”); these are “real” duplicates, caused mostly by the absence of exactly-once semantics
  2. Synthetic: duplicates with the same event ID but different fingerprints, caused by third-party software and UUID collisions

Duplicates cause significant problems in data modelling: they skew counts, confuse event pathing and, in Redshift, cause SQL JOINs to produce Cartesian products.
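
To make the distinction concrete, here is a minimal Scala sketch that splits a batch of events into the two groups. The Event shape and its field names are assumptions for illustration only, not Snowplow's internal types:

// Illustrative only: partition a batch into natural and synthetic duplicates
// based on event ID and fingerprint. The Event shape is assumed for this example.
case class Event(eventId: String, fingerprint: String)

def findDuplicates(batch: List[Event]): (List[Event], List[Event]) = {
  // Only event IDs that occur more than once can contain duplicates
  val byId = batch.groupBy(_.eventId).values.filter(_.size > 1)

  // Natural duplicates: same event ID and same fingerprint
  val natural = byId
    .flatMap(group => group.groupBy(_.fingerprint).values.filter(_.size > 1).flatten)
    .toList

  // Synthetic duplicates: same event ID but more than one distinct fingerprint
  val synthetic = byId
    .filter(group => group.map(_.fingerprint).distinct.size > 1)
    .flatten
    .toList

  (natural, synthetic)
}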

For our original thinking on duplicates, please see the blog post Dealing with duplicate event IDs.

1.2 In-batch synthetic deduplication in Scala Hadoop Shred

The natural duplicates problem in Redshift was initially addressed in the R76 Changeable Hawk-Eagle release, which deletes all but one of each set of natural duplicates. This logic sits in the Scala Hadoop Shred component, which prepares events and contexts for loading into Redshift.
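
As a rough illustration of what in-batch natural de-duplication does (a minimal sketch under an assumed Event shape, not the actual Scala Hadoop Shred implementation):

// Illustrative sketch of in-batch natural de-duplication: keep exactly one
// event per (event ID, fingerprint) pair. Not the actual Scala Hadoop Shred code.
case class Event(eventId: String, fingerprint: String)

def dedupeNatural(batch: List[Event]): List[Event] =
  batch
    .groupBy(e => (e.eventId, e.fingerprint)) // natural duplicates share both
    .values
    .map(_.head)                              // keep one copy per group
    .toList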

In R86 Petra we’re now introducing new in-batch synthetic deduplication, again as part of Scala Hadoop Shred.

The new functionality eliminates synthetic duplicates through the following steps (sketched in code after the list):

  1. Group events with the same event ID but different event fingerprints
  2. Generate a new random UUID to use as each event’s new event ID
  3. Attach a new duplicate context with the original event ID to provide data lineage
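
The three steps above could look roughly like the following Scala sketch. Again, this is an illustration under assumed types (Event, DuplicateContext and their fields are made up for the example), not the actual Scala Hadoop Shred code:

import java.util.UUID

// Illustrative sketch of in-batch synthetic de-duplication: events that share
// an event ID but carry different fingerprints each get a fresh event ID,
// plus a duplicate context recording the original event ID for lineage.
case class Event(eventId: String, fingerprint: String)
case class DuplicateContext(originalEventId: String)

def dedupeSynthetic(batch: List[Event]): List[(Event, Option[DuplicateContext])] =
  batch
    .groupBy(_.eventId)
    .values
    .flatMap { group =>
      if (group.map(_.fingerprint).distinct.size > 1)
        // Steps 2 and 3: mint a new random UUID and keep the original ID as lineage
        group.map { e =>
          (e.copy(eventId = UUID.randomUUID().toString),
           Some(DuplicateContext(originalEventId = e.eventId)))
        }
      else
        group.map(e => (e, Option.empty[DuplicateContext]))
    }
    .toList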

Using this approach, we have seen synthetic duplicates all but disappear from our event warehouses.

The next step in our treatment of duplicates will be removing duplicates across ETL runs - also known as cross-batch deduplication. Stay tuned for our upcoming release R8x [HAD] Cross-batch natural deduplication.

2. New data model for web data

The most common tracker for Snowplow users to get started with is the JavaScript Tracker. Like all our trackers, it can be used to track the self-describing events and entities that our users have defined themselves. In addition, we provide built-in support for the web-native events that most users will want to track. This includes events such as page views, page pings, and link clicks.

This release introduces a new SQL data model that makes it easier to get started with web data. It aggregates the page view and page ping events to create a set of derived tables containing a lot of detail, including time engaged, scroll depth, and page performance (three dimensions we often get asked about); a rough sketch of this aggregation follows the list below. The model comes in three variants:

  1. A straightforward set of SQL queries
  2. A variant optimized for SQL Runner
  3. A variant optimized for Looker
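
To give a flavour of what the model derives for each page view, here is a rough Scala sketch of the aggregation logic. The record shapes, field names and the 10-second heartbeat interval are assumptions for this example; the actual model is implemented in SQL, as described above:

// Illustrative only: the kind of per-page-view aggregation the web data model
// performs. The real model is SQL; the record shapes and the 10-second
// heartbeat interval are assumptions for this example.
case class WebEvent(
  pageViewId: String,
  eventType: String,   // "page_view" or "page_ping"
  scrollDepthPx: Int
)

case class PageViewSummary(
  pageViewId: String,
  timeEngagedInS: Int, // page pings observed * heartbeat interval
  maxScrollDepthPx: Int
)

def summarise(events: List[WebEvent], heartbeatSeconds: Int = 10): List[PageViewSummary] =
  events.groupBy(_.pageViewId).map { case (id, evs) =>
    PageViewSummary(
      pageViewId       = id,
      timeEngagedInS   = evs.count(_.eventType == "page_ping") * heartbeatSeconds,
      maxScrollDepthPx = evs.map(_.scrollDepthPx).max
    )
  }.toList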

3. Support for new regions

We are delighted to be adding support for three new AWS regions:

  1. Ohio, USA (us-east-2)
  2. Montreal, Canada (ca-central-1)
  3. London, UK (eu-west-2)

AWS has a healthy roadmap of new data center regions opening over the coming months; we are committed to supporting these new regions in Snowplow as they become available.

4. Upgrading

Upgrading is simple - update the hadoop_shred job version in your configuration YAML like so:

versions:
  hadoop_enrich: 1.8.0        # UNCHANGED
  hadoop_shred: 0.10.0        # WAS 0.9.0
  hadoop_elasticsearch: 0.1.0 # UNCHANGED

For a complete example, see our sample config.yml template.

You will also need to deploy the Redshift table for the new duplicate context introduced in this release.

5. Roadmap

As well as the cross-batch deduplication mentioned above, we have a number of other Snowplow releases in the works.

Note that these releases are always subject to change between now and the actual release date.

6. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.

Thoughts or questions? Come join us in our Discourse forum!

Anton Parkhomenko

Anton is a data engineer at Snowplow. You can find him on GitHub, Twitter and on his personal blog.