We are pleased to announce the release of Snowplow 76 Changeable Hawk-Eagle. This release introduces an event de-duplication process which runs on Hadoop, and also includes an important bug fix for our recent SendGrid webhook support (#2328).
Here are the sections after the fold:
- Event de-duplication in Hadoop Shred
- SendGrid webhook bug fix
- Upgrading
- Roadmap and contributing
- Getting help
1. Event de-duplication in Hadoop Shred
1.1 Event duplicates 101
Duplicate events are an unfortunate fact of life when it comes to data pipelines - for a helpful primer on this issue, see last year’s blog post Dealing with duplicate event IDs. Fortunately Snowplow makes it easy to identify duplicates, thanks to:
- The Event fingerprint enrichment, introduced in Snowplow 71 Stork-Billed Kingfisher, which helps identify whether two events are semantically identical (i.e. contain all the same properties)
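To make the idea of a fingerprint concrete, here is a minimal Python sketch. It is not the algorithm used by the Event fingerprint enrichment: the function name fingerprint, the use of MD5, the field names and the excluded-field list are all assumptions made for illustration.

```python
import hashlib

# Hypothetical: fields ignored when comparing events, because they can differ
# between otherwise identical copies of the same event.
EXCLUDED_FIELDS = {"event_id"}


def fingerprint(event: dict) -> str:
    """Hash every property except the excluded ones, in a stable key order."""
    parts = [
        f"{key}={event[key]}"
        for key in sorted(event)
        if key not in EXCLUDED_FIELDS
    ]
    return hashlib.md5("&".join(parts).encode("utf-8")).hexdigest()


# Made-up events: a and b carry the same properties, c does not.
a = {"event_id": "e1", "page_url": "/home", "user_id": "u1"}
b = {"user_id": "u1", "page_url": "/home", "event_id": "e2"}
c = {"event_id": "e3", "page_url": "/pricing", "user_id": "u1"}
```

Under these assumptions, a and b produce the same fingerprint regardless of key order, while c produces a different one.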
Once you have identified duplicates, it can be helpful to remove them - this is particularly important for Redshift, where we use the event ID to join between the master atomic.events table and the shredded JSON child tables. If duplicates are not removed, then JOINs between the master and child tables can become problematic.
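One way to see why such JOINs become problematic is to count the rows they return. The sketch below uses plain Python lists in place of Redshift tables; the rows, the root_id and sku column names and the inner_join helper are hypothetical.

```python
# Hypothetical rows: one event delivered twice, so it appears twice in the
# master table and twice in a shredded child table.
master = [
    {"event_id": "e1", "page_url": "/home"},
    {"event_id": "e1", "page_url": "/home"},  # duplicate
]
child = [
    {"root_id": "e1", "sku": "abc"},
    {"root_id": "e1", "sku": "abc"},  # duplicate
]


def inner_join(left, right, left_key, right_key):
    """Naive nested-loop inner join on equality of the two key columns."""
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right
        if left_row[left_key] == right_row[right_key]
    ]


joined = inner_join(master, child, "event_id", "root_id")
print(len(joined))  # 4 rows for what was a single event
```

Each duplicated master row matches every duplicated child row, so the row count multiplies rather than adds.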
1.2 Limitations of event de-duplication in SQL
In Snowplow 72 Great Spotted Kiwi we released SQL queries to de-dupe Snowplow events inside Redshift. While this was a great start, Redshift is not the ideal place to de-dupe events, for two reasons:
- The events have already been shredded into master atomic.events and child JSON tables, potentially resulting in a lot of company-specific tables to de-dupe
- De-duplication is resource-intensive and can add hours to a data modeling process
For both reasons, it makes sense to bring event de-duplication upstream in the pipeline - and so as of this release we are de-duplicating events inside our Hadoop Shred module, which reads Snowplow enriched events and prepares them for loading into Redshift.
1.3 Event de-duplication in Hadoop Shred
As of this release, Hadoop Shred de-duplicates “natural duplicates” - i.e. events which share the same event ID and the same event fingerprint, meaning that they are semantically identical to each other.
For a given ETL run of events being processed, Hadoop Shred will now keep only one out of each group of natural duplicates; all others will be discarded.
There is no configuration required for this functionality - de-duplication is performed automatically in Hadoop Shred, prior to shredding the events and loading them into Redshift.
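The rule described above can be sketched in a few lines of Python. This is an illustration only, not the Hadoop Shred implementation: the dedupe_natural function and the sample batch are hypothetical, and keeping the first copy seen is an assumption (the release only states that one event per group is kept).

```python
def dedupe_natural(events):
    """Keep one event per (event_id, event_fingerprint) pair within a batch."""
    seen = set()
    kept = []
    for event in events:
        key = (event["event_id"], event["event_fingerprint"])
        if key not in seen:
            seen.add(key)
            kept.append(event)  # assumption: the first copy seen is the one kept
    return kept


batch = [
    {"event_id": "e1", "event_fingerprint": "f1"},
    {"event_id": "e1", "event_fingerprint": "f1"},  # natural duplicate, dropped
    {"event_id": "e1", "event_fingerprint": "f2"},  # same ID, new fingerprint, kept
    {"event_id": "e2", "event_fingerprint": "f3"},
]
deduped = dedupe_natural(batch)
```

Events that share an event ID but differ in fingerprint pass through untouched, which matches the scope of this release.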
Some notes on this:
- The Snowplow enriched events written out to your enriched/good S3 bucket are not affected - they will continue to contain all duplicates
- We do not yet tackle “synthetic duplicates” - this is where two events have the same event ID but different event fingerprints. We are working on this, but in the meantime you can continue to use the SQL de-duplication for these if you have a major issue with bots, spiders and similar
- Natural duplicates that occur across ETL runs are not currently de-duplicated. This is something we hope to explore soon
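To check which kind of duplicate dominates in your own data, the distinction between natural and synthetic duplicates can be expressed as a small classifier. The following Python sketch is an assumption-laden illustration: classify_duplicates, the "mixed" label and the sample events are hypothetical and not part of Snowplow.

```python
from collections import defaultdict


def classify_duplicates(events):
    """Label each event ID that occurs more than once in the input."""
    fingerprints_by_id = defaultdict(list)
    for event in events:
        fingerprints_by_id[event["event_id"]].append(event["event_fingerprint"])

    labels = {}
    for event_id, fingerprints in fingerprints_by_id.items():
        if len(fingerprints) < 2:
            continue  # unique event ID, nothing to report
        distinct = len(set(fingerprints))
        if distinct == 1:
            labels[event_id] = "natural"    # same ID, same fingerprint
        elif distinct == len(fingerprints):
            labels[event_id] = "synthetic"  # same ID, all fingerprints differ
        else:
            labels[event_id] = "mixed"      # some copies match, some do not
    return labels


events = [
    {"event_id": "e1", "event_fingerprint": "f1"},
    {"event_id": "e1", "event_fingerprint": "f1"},
    {"event_id": "e2", "event_fingerprint": "f2"},
    {"event_id": "e2", "event_fingerprint": "f3"},
    {"event_id": "e3", "event_fingerprint": "f4"},
    {"event_id": "e4", "event_fingerprint": "f5"},
    {"event_id": "e4", "event_fingerprint": "f5"},
    {"event_id": "e4", "event_fingerprint": "f6"},
]
labels = classify_duplicates(events)
```

Only the IDs labelled "natural" (and the matching copies within "mixed" groups) fall within what Hadoop Shred removes in this release.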
2. SendGrid webhook bug fix
In the last release, Snowplow R75 Long-Legged Buzzard, we introduced support for ingesting SendGrid events into Snowplow. Since that release, an important bug was identified (#2328), which has now been fixed in R76.
Many thanks to community member Bernardo Srulzon for bringing this issue to our attention!
3. Upgrading
Upgrading to this release is simple - the only changed components are the jar versions for Hadoop Enrich and Hadoop Shred.
3.1 Upgrading your EmrEtlRunner config.yml
In the config.yml file for your EmrEtlRunner, update your hadoop_enrich and hadoop_shred job versions like so:
For a complete example, see our sample config.yml
4. Roadmap and contributing
This event de-duplication code in Hadoop Shred represents our first piece of data modeling in Hadoop (rather than Redshift) - an exciting step for Snowplow! We plan to extend this functionality in Hadoop Shred in coming releases, in particular:
- Adding support for de-duplicating synthetic duplicates
- Adding support for de-duplicating events across ETL runs (likely using DynamoDB as our cross-batch “memory”)
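As a rough illustration of what a cross-batch “memory” means, the Python sketch below remembers which events earlier runs have already emitted. It does not reflect any actual or planned Snowplow design: the CrossBatchDeduper class is hypothetical, and an in-memory set stands in for an external store such as DynamoDB.

```python
class CrossBatchDeduper:
    """Remembers (event_id, event_fingerprint) pairs across ETL runs.

    The in-memory set is a stand-in for a persistent external store.
    """

    def __init__(self):
        self._seen = set()

    def process(self, batch):
        """Return only the events not already emitted by this or an earlier run."""
        kept = []
        for event in batch:
            key = (event["event_id"], event["event_fingerprint"])
            if key in self._seen:
                continue  # already emitted, drop this copy
            self._seen.add(key)
            kept.append(event)
        return kept


deduper = CrossBatchDeduper()
run_1 = deduper.process([{"event_id": "e1", "event_fingerprint": "f1"}])
run_2 = deduper.process([
    {"event_id": "e1", "event_fingerprint": "f1"},  # seen in run 1, dropped
    {"event_id": "e2", "event_fingerprint": "f2"},
])
```

The point of the sketch is that the lookup state has to outlive a single run, which is why a store outside the batch job is needed.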
In the meantime, upcoming Snowplow releases include:
- Release 77 Great Auk, which will refresh our EmrEtlRunner app, including updating Snowplow to use the EMR 4.x AMI series
- Release 78 Great Hornbill, which will bring the Kinesis pipeline up-to-date with the most recent Scala Common Enrich releases. This will also include click redirect support in the Scala Stream Collector
- Release 79 Black Swan, which will allow enriching an event by requesting data from a third-party API
Note that these releases are always subject to change between now and the actual release date.
5. Getting help
For more details on this release, please check out the R76 Release Notes on GitHub.