Table of contents:
- Our enrichment process on Hadoop 2.4
- Re-enabled Kinesis-Hadoop lambda architecture
- New JavaScript scripting enrichment
- Other changes
- Upgrading
- Getting help
Since the inception of Snowplow three years ago, our Hadoop Enrichment process has been tied to Hadoop 1 and Elastic MapReduce’s 2.4.x series AMIs. In the meantime, Elastic MapReduce has been iterating through the 3.x.x series of AMIs, introducing lots of great features including:
- Hadoop 2.x, along with YARN and new HDFS features e.g. symbolic links
- New features and important bug fixes in S3DistCp
- The ability to run Spark on an EMR cluster
To take advantage of these new features, we are now upgrading our Hadoop Enrichment process to run on Hadoop 2.4 and the EMR 3.x.x series AMIs exclusively. Our testing has been with the 3.6.0 AMI, so that is the recommended version currently.
To reflect this breaking change, the new version of Hadoop Enrich is 1.0.0. Because our Hadoop Shred process works on Hadoop 2.4 without code changes, its version is unchanged at 0.4.0.
We are hugely excited about our move to Hadoop 2.x and YARN! This should allow for some powerful new capabilities in the Snowplow batch pipeline, such as mixed Hadoop/Spark event processing.
A Lambda Architecture is Nathan Marz’s term for a hybrid batch and streaming architecture for event processing. There are two reasons why users of Snowplow’s Kinesis pipeline should consider a lambda architecture, operating the Hadoop pipeline alongside their existing Kinesis flow:
- The Hadoop pipeline allows you to re-process your raw events (e.g. when we introduce a new enrichment) long after the raw events have expired from your Kinesis stream
- The Hadoop pipeline lets you load Snowplow enriched events into Amazon Redshift (or Postgres)
To run the Hadoop pipeline alongside your Kinesis pipeline, follow these steps:
- Deploy the kinesis-s3 application and configure it to write your Kinesis stream of raw Snowplow events to Amazon S3
- Deploy the Hadoop pipeline, and configure EmrEtlRunner to read Snowplow raw events from the S3 bucket populated in the first step, as in the sketch below
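For illustration, the relevant section of EmrEtlRunner’s `config.yml` might look something like this. The bucket name is hypothetical, and the exact key layout depends on your `config.yml` version:

```yaml
:s3:
  :buckets:
    # Raw events written to S3 by the kinesis-s3 application
    # (bucket name is hypothetical)
    :in: s3://acme-snowplow-kinesis-out
    # ... other buckets (processing, archive etc) as usual
```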
This release fixes some issues with running the Kinesis-Hadoop lambda architecture which were related to Amazon’s introduction of IAM roles for Elastic MapReduce; two of these fixes were implemented in EmrEtlRunner (#1715 and #1647), so you will have to upgrade your EmrEtlRunner as per the instructions below.
This release also introduces a new JavaScript scripting enrichment, which lets you write a JavaScript function to be run against each enriched event. The enrichment has been introduced for the Hadoop pipeline only in this release; it will be added to the Kinesis pipeline in our next release. Your JavaScript function:
- Takes a Snowplow enriched event POJO (Plain Old Java Object) as its sole argument
- Can return an array of contexts, which will be added to the `derived_contexts` field in the enriched event
- Can return `null` if there are no contexts to add to this event
- Can `throw` exceptions, but note that throwing an exception will cause the entire enriched event to end up in the Bad Bucket or Bad Stream

To use the enrichment, you must define this logic in a `process(event)` function somewhere in your script, as in the sketch below.
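Here is a minimal sketch of such a function. The Acme Inc scenario, the secret `app_id` value and the `com.acme` schema URI are illustrative only, and we assume the enriched event POJO exposes bean-style getters such as `getApp_id()` and `getPlatform()`:

```javascript
var SECRET_APP_ID = "s3cr3t"; // hypothetical shared secret

function process(event) {
    var appId = event.getApp_id(); // assumed POJO getter; may be null

    // Role 1: a server-sent event must carry our secret app_id;
    // otherwise we reject it by throwing, sending the whole event
    // to the Bad Bucket or Bad Stream
    if (event.getPlatform() == "srv" && appId != SECRET_APP_ID) {
        throw "Server-sent event has invalid app_id: " + appId;
    }

    // Role 2: if we have an app_id, derive a new context from it
    if (appId == null) {
        return null; // no contexts to add to this event
    }

    return [ {
        schema: "iglu:com.acme/derived_app_id/jsonschema/1-0-0", // hypothetical schema
        data: {
            appId: String(appId).toUpperCase()
        }
    } ];
}
```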
This function is actually serving two discrete roles:
- If this is a server-sent event, we validate that the `app_id` matches our secret. This is a simple way of preventing a “bad actor” from spoofing our server-sent events
- If the `app_id` is not null, we return a new context for Acme Inc, `derived_app_id`, which contains the upper-cased `app_id`

These are of course just very simple examples – we look forward to seeing what the community comes up with!
Under the hood, the `process` function is passed the exact Snowplow enriched event POJO, and the return value from the `process` function is converted into a JSON string before being added to the event’s `derived_contexts` field.
We have also:
- Fixed the various incorrect links in Scala Common Enrich’s `README.md` – thank you Snowplow community member and intern Vincent Ohprecio! (#1669)
- Made the `refr_` fields TSV-safe – big thanks to Snowplow community member Jason Bosco for this! (#1643)
- Fixed an uncaught NPE in our JSON error handling code
- On the data modeling side of things, we have removed restrictions in the sessions and visitors-source models (#1725)
You need to update EmrEtlRunner to the latest version (0.15.0) on GitHub.
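Assuming you run EmrEtlRunner from a source checkout, the update looks something like the following; the release tag name and paths are our assumptions:

```bash
$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r66-oriental-skylark  # assumed tag name for this release
$ cd 3-enrich/emr-etl-runner         # EmrEtlRunner's home in the repository
$ bundle install --deployment        # install Ruby dependencies
```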
You need to update your EmrEtlRunner’s `config.yml` file to reflect the new Hadoop 2.4.0 and AMI 3.6.0 support:
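A minimal sketch of the relevant settings follows; the exact key names and nesting depend on your `config.yml` template, so treat this as illustrative:

```yaml
:emr:
  :ami_version: 3.6.0      # the EMR AMI we have tested against
:snowplow:
  :hadoop_enrich: 1.0.0    # new major version, now running on Hadoop 2.4
  :hadoop_shred: 0.4.0     # unchanged in this release
```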
For a complete example, see our sample `config.yml` template.
You can enable this enrichment by creating a self-describing JSON and adding it into your enrichments folder:
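Here is a sketch of what that self-describing JSON might look like. The schema URI and parameter names are our assumptions, with your JavaScript supplied Base64-encoded:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/javascript_script_config/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "javascript_script_config",
    "enabled": true,
    "parameters": {
      "script": "<your Base64-encoded JavaScript>"
    }
  }
}
```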
For more details on this release, please check out the r66 Oriental Skylark release notes on GitHub.