Since the inception of Snowplow three years ago, our Hadoop Enrichment process has been tied to Hadoop 1 and Elastic MapReduce’s 2.4.x series AMIs. In the meantime, Elastic MapReduce has been iterating through the 3.x.x series of AMIs, introducing lots of great features including:
To take advantage of these new features, we are now upgrading our Hadoop Enrichment process to run on Hadoop 2.4 and the EMR 3.x.x series AMIs exclusively. Our testing has been with the 3.6.0 AMI, so that is the recommended version currently.
To reflect this breaking change, the new version of Hadoop Enrich is 1.0.0. Because our Hadoop Shred process works on Hadoop 2.4 without code changes, this version is unchanged at 0.4.0.
We are hugely excited about our move to Hadoop 2.x and YARN! This should allow for some powerful new capabilities in the Snowplow batch pipeline, such as mixed Hadoop/Spark event processing.
A Lambda Architecture is Nathan Marz’s term for a hybrid batch and streaming architecture for event processing. There are two reasons why users of Snowplow’s Kinesis pipeline should consider a lambda architecture, operating the Hadoop pipeline alongside their existing Kinesis flow:
To run the Hadoop pipeline alongside your Kinesis pipeline follow these steps:
This release fixes some issues with running the Kinesis-Hadoop lambda architecture which were related to Amazon’s introduction of IAM roles for Elastic MapReduce; two of these fixes were implemented in EmrEtlRunner (#1715 and #1647), so you will have to upgrade your EmrEtlRunner as per the instructions below.
This enrichment has been introduced for the Hadoop pipeline only in this release; it will be added to the Kinesis pipeline in our next release.
The JavaScript function you provide:

* Returns an array of derived contexts, which will be added to the `derived_contexts` field in the enriched event
* Returns `null` if there are no contexts to add to this event
* Can `throw` exceptions, but note that throwing an exception will cause the entire enriched event to end up in the Bad Bucket or Bad Stream
All you need to do is define a `process(event)` function somewhere in your script.
This function is actually serving two discrete roles:

1. Validating that the event's `app_id` matches our secret. This is a simple way of preventing a "bad actor" from spoofing our server-sent events
2. Assuming the `app_id` is not null, returning a new context for Acme Inc, `derived_app_id`, which contains the upper-cased `app_id`
These are of course just very simple examples - we look forward to seeing what the community comes up with!
The `process` function is passed the exact Snowplow enriched event POJO. The return value from the `process` function is converted into a JSON string.
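Putting the pieces above together, here is a minimal sketch of such a script. The POJO getter name (`getApp_id`), the secret value, and the context schema URI are all assumptions for illustration - check the enriched event POJO and your own Iglu schemas for the real names:

```javascript
// Hypothetical shared secret used to validate server-sent events
var SECRET_APP_ID = "s3cr3t";

function process(event) {
    var appId = event.getApp_id(); // assumed getter on the enriched event POJO

    // No app_id set: nothing to validate, no contexts to derive
    if (appId === null) {
        return null;
    }

    // Role 1: validation. Throwing an exception sends the entire
    // enriched event to the Bad Bucket or Bad Stream
    if (appId !== SECRET_APP_ID) {
        throw "Server-sent event has invalid app_id: " + appId;
    }

    // Role 2: return an array of self-describing contexts, which will be
    // added to the event's derived_contexts field. Schema URI is illustrative
    return [{
        schema: "iglu:com.acme/derived_app_id/jsonschema/1-0-0",
        data: {
            appId: appId.toUpperCase()
        }
    }];
}
```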
We have also:
* Updated the `README.md` - thank you Snowplow community member and intern Vincent Ohprecio! (#1669)
* Made the `refr_` fields TSV-safe - big thanks to Snowplow community member Jason Bosco for this! (#1643)
You need to update EmrEtlRunner to the latest version (0.15.0) on GitHub:
You need to update your EmrEtlRunner's `config.yml` file to reflect the new Hadoop 2.4.0 and AMI 3.6.0 support:
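As an illustration, the relevant settings could look something like the following. This is a sketch only - the key names and layout are assumed from the sample `config.yml`, so check it for the exact structure:

```yaml
# Illustrative fragment only - not a complete config.yml
:emr:
  :ami_version: 3.6.0      # was a 2.4.x series AMI prior to this release
:snowplow:
  :hadoop_enrich: 1.0.0    # version of the Hadoop Enrichment process
  :hadoop_shred: 0.4.0     # version of the Hadoop Shred process
```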
For a complete example, see our sample `config.yml`.
You can enable this enrichment by creating a self-describing JSON and adding it to your enrichments directory:
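For example, such a self-describing JSON could look like the following. Treat this as a sketch: the schema URI and parameter names are assumptions (check the Iglu schema registry for the authoritative schema), and the `script` value is a placeholder standing in for your Base64-encoded JavaScript:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/javascript_script_config/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "javascript_script_config",
    "enabled": true,
    "parameters": {
      "script": "BASE64-ENCODED-SCRIPT-GOES-HERE"
    }
  }
}
```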
For more details on this release, please check out the r66 Oriental Skylark release notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.