Snowplow 76 Changeable Hawk-Eagle released


We are pleased to announce the release of Snowplow 76 Changeable Hawk-Eagle. This release introduces an event de-duplication process which runs on Hadoop, and also includes an important bug fix for our recent SendGrid webhook support (#2328).

R76 Changeable Hawk-Eagle

Here are the sections after the fold:

  1. Event de-duplication in Hadoop Shred
  2. SendGrid webhook bug fix
  3. Upgrading
  4. Roadmap and contributing
  5. Getting help

1. Event de-duplication in Hadoop Shred

1.1 Event duplicates 101

Duplicate events are an unfortunate fact of life when it comes to data pipelines – for a helpful primer on this issue, see last year’s blog post Dealing with duplicate event IDs. Fortunately Snowplow makes it easy to identify duplicates, thanks to:

  1. Our major trackers (including JavaScript, iOS and Android) all generate a UUID for the event ID at event creation time, so any duplication that occurs downstream (e.g. due to spiders or anti-virus software) is easy to spot
  2. In Snowplow 71 Stork-Billed Kingfisher we introduced a new Event fingerprint enrichment, to help identify whether two events are semantically identical (i.e. contain all the same properties) – see the sketch below
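
To make this concrete, here is a minimal Scala sketch of the idea behind an event fingerprint: a stable hash over an event's properties, so that two semantically identical events share the same value. The function and the choice of MD5 are illustrative assumptions, not the enrichment's actual implementation:

    import java.security.MessageDigest
    import java.util.UUID

    // Illustrative only: canonicalize the event's properties, then hash them,
    // so two events with identical properties produce the same fingerprint
    def fingerprint(props: Map[String, String]): String = {
      val canonical = props.toSeq.sorted.map { case (k, v) => s"$k:$v" }.mkString("|")
      MessageDigest.getInstance("MD5")
        .digest(canonical.getBytes("UTF-8"))
        .map("%02x".format(_))
        .mkString
    }

    // By contrast, the event ID is a UUID minted once by the tracker at creation time
    val eventId = UUID.randomUUID().toString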

Once you have identified duplicates, it can be helpful to remove them – this is particularly important for Redshift, where we use the event ID to join between the master atomic.events table and the shredded JSON child tables. If duplicates are not removed, then JOINs between the master and child tables can become problematic.
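
To see why, consider a toy Scala illustration (this is not Snowplow code) of what an inner join does when the same event ID appears twice on each side:

    // Two natural duplicates in the parent table, and the matching rows in a
    // shredded child table; identifiers and values here are made up
    val masterEvents  = List(("e1", "page_view"), ("e1", "page_view"))
    val childContexts = List(("e1", "ctx_a"), ("e1", "ctx_a"))

    // A for-comprehension behaving like SQL's inner JOIN on event ID
    val joined = for {
      (mid, eventType) <- masterEvents
      (cid, context)   <- childContexts
      if mid == cid
    } yield (mid, eventType, context)

    println(joined.size) // 4 rows, where a de-duplicated join would return 1

Rather than merging, the duplicates multiply: every extra copy on either side inflates the joined row count.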

1.2 Limitations of event de-duplication in SQL

In Snowplow 72 Great Spotted Kiwi we released SQL queries to de-dupe Snowplow events inside Redshift. While this was a great start, Redshift is not the ideal place to de-dupe events, for two reasons:

  1. The events have already been shredded into master atomic.events and child JSON tables, potentially resulting in a lot of company-specific tables to de-dupe
  2. De-duplication is resource-intensive and can add hours to a data modeling process

For both reasons, it makes sense to bring event de-duplication upstream in the pipeline – and so as of this release we are de-duplicating events inside our Hadoop Shred module, which reads Snowplow enriched events and prepares them for loading into Redshift.

1.3 Event de-duplication in Hadoop Shred

As of this release, Hadoop Shred de-duplicates “natural duplicates” – i.e. events which share the same event ID and the same event fingerprint, meaning that they are semantically identical to each other.

For a given ETL run of events being processed, Hadoop Shred will now keep only one out of each group of natural duplicates; all others will be discarded.
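
In Scala terms, the rule amounts to grouping events by the (event ID, event fingerprint) pair and keeping a single representative of each group. Here is a minimal sketch of that rule; it is not the actual Hadoop Shred code, and the field names are hypothetical:

    // Assumed shape of an enriched event, for illustration purposes only
    case class EnrichedEvent(eventId: String, eventFingerprint: String, payload: String)

    def dedupeNaturals(events: Seq[EnrichedEvent]): Seq[EnrichedEvent] =
      events
        .groupBy(e => (e.eventId, e.eventFingerprint)) // natural duplicates share both
        .values
        .map(_.head)                                   // keep one per group, discard the rest
        .toSeq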

There is no configuration required for this functionality – de-duplication is performed automatically in Hadoop Shred, prior to shredding the events and loading them into Redshift.

2. SendGrid webhook bug fix

In the last release, Snowplow R75 Long-Legged Buzzard, we introduced support for ingesting SendGrid events into Snowplow. Since that release an important bug (#2328) was identified; it has now been fixed in R76.

Many thanks to community member Bernardo Srulzon for bringing this issue to our attention!

3. Upgrading

Upgrading to this release is simple – the only changed components are the jar versions for Hadoop Enrich and Hadoop Shred.

3.1 Upgrading your EmrEtlRunner config.yml

In the config.yml file for your EmrEtlRunner, update your hadoop_enrich and hadoop_shred job versions like so:

    versions:
      hadoop_enrich: 1.5.1 # WAS 1.5.0
      hadoop_shred: 0.7.0 # WAS 0.6.0
      hadoop_elasticsearch: 0.1.0 # Unchanged

For a complete example, see our sample config.yml template.

4. Roadmap and contributing

This event de-duplication code in Hadoop Shred represents our first piece of data modeling in Hadoop (rather than Redshift) – an exciting step for Snowplow! We plan to extend this functionality in Hadoop Shred in coming releases, in particular:

  1. Adding support for de-duplicating synthetic duplicates
  2. Adding support for de-duplicating events across ETL runs (likely using DynamoDB as our cross-batch “memory”)

In the meantime, we have a number of further Snowplow releases in the pipeline.

Note that these releases are always subject to change between now and the actual release date.

5. Getting help

As always, if you do run into any issues or don’t understand any of the new features, please raise an issue or get in touch with us via the usual channels.

For more details on this release, please check out the R76 Release Notes on GitHub.
