Snowplow 76 Changeable Hawk-Eagle released

26 January 2016  •  Alex Dean

We are pleased to announce the release of Snowplow 76 Changeable Hawk-Eagle. This release introduces an event de-duplication process which runs on Hadoop, and also includes an important bug fix for our recent SendGrid webhook support (#2328).

R76 Changeable Hawk-Eagle

Here are the sections after the fold:

  1. Event de-duplication in Hadoop Shred
  2. SendGrid webhook bug fix
  3. Upgrading
  4. Roadmap and contributing
  5. Getting help

1. Event de-duplication in Hadoop Shred

1.1 Event duplicates 101

Duplicate events are an unfortunate fact of life when it comes to data pipelines - for a helpful primer on this issue, see last year’s blog post Dealing with duplicate event IDs. Fortunately Snowplow makes it easy to identify duplicates, thanks to:

  1. Our major trackers (including JavaScript, iOS and Android) all generate a UUID for the event ID at event creation time, so any duplication that occurs downstream (e.g. due to spiders or anti-virus software) is easy to spot
  2. In Snowplow 71 Stork-Billed Kingfisher we introduced a new Event fingerprint enrichment, to help identify whether two events are semantically identical (i.e. contain all the same properties)

Once you have identified duplicates, it can be helpful to remove them - this is particularly important for Redshift, where we use the event ID to join between the master atomic.events table and the shredded JSON child tables. If duplicates are not removed, then JOINs between the master and child tables can become problematic.
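To make the idea concrete, here is a minimal sketch (in Python, for illustration only) of how a fingerprint can identify semantic duplicates: hash the event's properties while excluding fields that legitimately differ between copies, such as the collector timestamp. The field names, the excluded-field list and the choice of MD5 are assumptions for this sketch - the real Event fingerprint enrichment is configurable and its implementation may differ.

```python
import hashlib
import json

def event_fingerprint(event, excluded_fields=("event_id", "collector_tstamp")):
    """Hash an event's properties, minus volatile fields, so that two
    semantically identical events yield the same fingerprint."""
    stable = {k: v for k, v in sorted(event.items()) if k not in excluded_fields}
    payload = json.dumps(stable, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()

# Two copies of the same event, re-sent a few seconds apart (hypothetical data):
a = {"event_id": "id-1", "collector_tstamp": "2016-01-26 10:00:00",
     "page_url": "https://example.com"}
b = {"event_id": "id-1", "collector_tstamp": "2016-01-26 10:00:05",
     "page_url": "https://example.com"}
```

Because the volatile timestamp is excluded, `a` and `b` produce identical fingerprints, flagging them as semantic duplicates.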

1.2 Limitations of event de-duplication in SQL

In Snowplow 72 Great Spotted Kiwi we released SQL queries to de-dupe Snowplow events inside Redshift. While this was a great start, Redshift is not the ideal place to de-dupe events, for two reasons:

  1. The events have already been shredded into master atomic.events and child JSON tables, potentially resulting in a lot of company-specific tables to de-dupe
  2. De-duplication is resource-intensive and can add hours to a data modeling process

For both reasons, it makes sense to bring event de-duplication upstream in the pipeline - and so as of this release we are de-duplicating events inside our Hadoop Shred module, which reads Snowplow enriched events and prepares them for loading into Redshift.

1.3 Event de-duplication in Hadoop Shred

As of this release, Hadoop Shred de-duplicates “natural duplicates” - i.e. events which share the same event ID and the same event fingerprint, meaning that they are semantically identical to each other.

For a given ETL run of events being processed, Hadoop Shred will now keep only one out of each group of natural duplicates; all others will be discarded.

There is no configuration required for this functionality - de-duplication is performed automatically in Hadoop Shred, prior to shredding the events and loading them into Redshift.
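The keep-one-per-group behavior described above can be sketched as follows (a Python illustration, not the actual Hadoop Shred implementation, which runs as a Scalding job): group events by the (event ID, fingerprint) pair and keep only the first event seen for each group.

```python
def deduplicate_naturals(events):
    """Keep one event per (event_id, fingerprint) group within a batch;
    all other natural duplicates in the group are discarded."""
    seen = set()
    kept = []
    for event in events:
        key = (event["event_id"], event["fingerprint"])
        if key not in seen:
            seen.add(key)
            kept.append(event)
    return kept

# Hypothetical batch: e1 appears twice as a natural duplicate
events = [
    {"event_id": "e1", "fingerprint": "f1"},
    {"event_id": "e1", "fingerprint": "f1"},  # natural duplicate - discarded
    {"event_id": "e2", "fingerprint": "f2"},
]
deduped = deduplicate_naturals(events)
```

Note that this operates within a single batch only, which is why duplicates spanning ETL runs are not caught (see the notes below).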

Some notes on this:

  • The Snowplow enriched events written out to your enriched/good S3 bucket are not affected - they will continue to contain all duplicates
  • We do not yet tackle “synthetic dupes” - this is where two events have the same event ID but different event fingerprints. We are working on this, but in the meantime you can continue to use the SQL de-duplication for this if you have a major issue with bots, spiders and similar
  • If natural duplicates exist across ETL runs, these will not be de-duplicated currently. This is something we hope to explore soon
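The natural-versus-synthetic distinction from the notes above can be sketched like so (illustrative Python, with hypothetical field names): an event ID whose copies all share one fingerprint is a natural duplicate, while an event ID seen with differing fingerprints is a synthetic duplicate.

```python
from collections import defaultdict

def classify_duplicate_ids(events):
    """Split duplicated event IDs into 'natural' (all copies share one
    fingerprint) and 'synthetic' (same ID, differing fingerprints)."""
    fingerprints = defaultdict(list)
    for event in events:
        fingerprints[event["event_id"]].append(event["fingerprint"])
    natural, synthetic = [], []
    for event_id, fps in fingerprints.items():
        if len(fps) < 2:
            continue  # not duplicated at all
        (synthetic if len(set(fps)) > 1 else natural).append(event_id)
    return natural, synthetic

# Hypothetical batch mixing both kinds of duplicate:
events = [
    {"event_id": "a", "fingerprint": "f1"},
    {"event_id": "a", "fingerprint": "f1"},  # natural duplicate
    {"event_id": "b", "fingerprint": "f2"},
    {"event_id": "b", "fingerprint": "f3"},  # synthetic duplicate
    {"event_id": "c", "fingerprint": "f4"},  # unique event
]
natural, synthetic = classify_duplicate_ids(events)
```

As of this release, only the `natural` group is removed by Hadoop Shred; the `synthetic` group still requires the SQL de-duplication approach.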

2. SendGrid webhook bug fix

In the last release, Snowplow R75 Long-Legged Buzzard, we introduced support for ingesting SendGrid events into Snowplow. Since the release an important bug was identified (#2328), which has now been fixed in R76.

Many thanks to community member Bernardo Srulzon for bringing this issue to our attention!

3. Upgrading

Upgrading to this release is simple - the only changed components are the jar versions for Hadoop Enrich and Hadoop Shred.

3.1 Upgrading your EmrEtlRunner config.yml

In the config.yml file for your EmrEtlRunner, update your hadoop_enrich and hadoop_shred job versions like so:

  versions:
    hadoop_enrich: 1.5.1 # WAS 1.5.0
    hadoop_shred: 0.7.0 # WAS 0.6.0
    hadoop_elasticsearch: 0.1.0 # Unchanged

For a complete example, see our sample config.yml template.

4. Roadmap and contributing

This event de-duplication code in Hadoop Shred represents our first piece of data modeling in Hadoop (rather than Redshift) - an exciting step for Snowplow! We plan to extend this functionality in Hadoop Shred in coming releases, in particular:

  1. Adding support for de-duplicating synthetic duplicates
  2. Adding support for de-duplicating events across ETL runs (likely using DynamoDB as our cross-batch “memory”)

In the meantime, upcoming Snowplow releases include:

  • Release 77 Great Auk, which will refresh our EmrEtlRunner app, including moving Snowplow to the EMR 4.x AMI series
  • Release 78 Great Hornbill, which will bring the Kinesis pipeline up-to-date with the most recent Scala Common Enrich releases. This will also include click redirect support in the Scala Stream Collector
  • Release 79 Black Swan, which will allow enriching an event by requesting data from a third-party API

Note that these releases are always subject to change between now and the actual release date.

5. Getting help

As always, if you do run into any issues or don’t understand any of the new features, please raise an issue or get in touch with us via the usual channels.

For more details on this release, please check out the R76 Release Notes on GitHub.