Duplicate events are an unfortunate fact of life when it comes to data pipelines - for a helpful primer on this issue, see last year’s blog post Dealing with duplicate event IDs. Fortunately, Snowplow makes it easy to identify duplicates, thanks to the event fingerprint enrichment, which attaches to each event a hash of its properties so that semantically identical events can be spotted.
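For illustration, here is a minimal Scala sketch of the idea behind such a fingerprint: hash an event’s fields, excluding volatile ones such as the event ID, so that two semantically identical events produce the same value. The `fingerprint` helper and its `Map`-of-fields representation are hypothetical, not the enrichment’s actual implementation:

```scala
import java.security.MessageDigest

// Hypothetical sketch: fingerprint an event by hashing its fields,
// excluding volatile ones, so identical events hash identically.
def fingerprint(fields: Map[String, String], excluded: Set[String]): String = {
  val canonical = fields
    .filter { case (key, _) => !excluded(key) }
    .toSeq
    .sortBy { case (key, _) => key } // stable ordering across events
    .map { case (key, value) => s"$key=$value" }
    .mkString("&")
  MessageDigest.getInstance("MD5")
    .digest(canonical.getBytes("UTF-8"))
    .map(b => f"$b%02x")
    .mkString
}
```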
Once you have identified duplicates, it can be helpful to remove them - this is particularly important for Redshift, where we use the event ID to join between the master `atomic.events` table and the shredded JSON child tables. If duplicates are not removed, then `JOIN`s between the master and child tables can become problematic.
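To see why, here is a hypothetical Scala illustration of the fan-out that duplicate event IDs cause: if the same event appears twice in both the master and a child table, joining on the event ID yields four rows rather than two:

```scala
// Hypothetical illustration: duplicate event IDs multiply rows in a join.
val master = Seq(("evt-1", "page_view"), ("evt-1", "page_view")) // duplicated event
val child  = Seq(("evt-1", "context-a"), ("evt-1", "context-a")) // duplicated child row

val joined = for {
  (masterId, eventType) <- master
  (childId, context)    <- child
  if masterId == childId // the join condition used in Redshift
} yield (masterId, eventType, context)

println(joined.size) // 4 - the duplicates have multiplied in the join
```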
In Snowplow 72 Great Spotted Kiwi we released SQL queries to de-dupe Snowplow events inside Redshift. While this was a great start, Redshift is not the ideal place to de-dupe events: among other drawbacks, de-duplication has to be run against the master `atomic.events` table and every shredded JSON child table, potentially resulting in a lot of company-specific tables to de-dupe.

For these reasons, it makes sense to bring event de-duplication upstream in the pipeline - and so as of this release we are de-duplicating events inside our Hadoop Shred module, which reads Snowplow enriched events and prepares them for loading into Redshift.
As of this release, Hadoop Shred de-duplicates “natural duplicates” - i.e. events which share the same event ID and the same event fingerprint, meaning that they are semantically identical to each other.
For a given ETL run of events being processed, Hadoop Shred will now keep only one out of each group of natural duplicates; all others will be discarded.
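Conceptually, this amounts to grouping events by the (event ID, event fingerprint) pair and keeping a single event from each group. Below is a minimal Scala sketch of the technique, assuming a simplified `Event` case class rather than Hadoop Shred’s actual event representation:

```scala
// Minimal sketch of natural de-duplication; the real logic runs in Hadoop Shred.
case class Event(eventId: String, fingerprint: String, payload: String)

def dedupeNaturals(events: Seq[Event]): Seq[Event] =
  events
    .groupBy(e => (e.eventId, e.fingerprint)) // natural duplicates share both keys
    .values
    .map(_.head)                              // keep one event per group
    .toSeq

val deduped = dedupeNaturals(Seq(
  Event("evt-1", "fp-a", "..."),
  Event("evt-1", "fp-a", "..."), // natural duplicate: same ID and fingerprint
  Event("evt-1", "fp-b", "...")  // same ID, different fingerprint: kept
))
// deduped contains two events: one (evt-1, fp-a) and one (evt-1, fp-b)
```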
There is no configuration required for this functionality - de-duplication is performed automatically in Hadoop Shred, prior to shredding the events and loading them into Redshift.
Some notes on this:

- Events in the `enriched/good` S3 bucket are not affected - they will continue to contain all duplicates

In the last release, Snowplow R75 Long-Legged Buzzard, we introduced support for ingesting SendGrid events into Snowplow. Since that release an important bug was identified (#2328), which has now been fixed in R76.
Many thanks to community member Bernardo Srulzon for bringing this issue to our attention!
Upgrading to this release is simple - the only changed components are the jar versions for Hadoop Enrich and Hadoop Shred.
In the `config.yml` file for your EmrEtlRunner, update your `hadoop_enrich` and `hadoop_shred` job versions like so:
    versions:
      hadoop_enrich: 1.5.1        # WAS 1.5.0
      hadoop_shred: 0.7.0         # WAS 0.6.0
      hadoop_elasticsearch: 0.1.0 # Unchanged
For a complete example, see our sample `config.yml` template.
This event de-duplication code in Hadoop Shred represents our first piece of data modeling in Hadoop (rather than Redshift) - an exciting step for Snowplow! We plan to extend this functionality in Hadoop Shred in coming releases.
In the meantime, we have more Snowplow releases on the way; note that planned releases are always subject to change between now and the actual release date.
As always, if you do run into any issues or don’t understand any of the new features, please raise an issue or get in touch with us via the usual channels.
For more details on this release, please check out the R76 Release Notes on GitHub.