Snowplow 66 Oriental Skylark released

16 June 2015  •  Alex Dean

We are pleased to announce the release of Snowplow 66, Oriental Skylark. This release upgrades our Hadoop Enrichment process to run on Hadoop 2.4, re-enables our Kinesis-Hadoop lambda architecture and also introduces a new scriptable enrichment powered by JavaScript - our most powerful enrichment yet!

Table of contents:

  1. Our enrichment process on Hadoop 2.4
  2. Re-enabled Kinesis-Hadoop lambda architecture
  3. JavaScript scripting enrichment
  4. Other changes
  5. Upgrading
  6. Getting help


1. Our enrichment process on Hadoop 2.4

Since the inception of Snowplow three years ago, our Hadoop Enrichment process has been tied to Hadoop 1 and Elastic MapReduce’s 2.4.x series AMIs. In the meantime, Elastic MapReduce has been iterating through the 3.x.x series of AMIs, introducing lots of great features including:

  • Hadoop 2.x, along with YARN and new HDFS features e.g. symbolic links
  • New features and important bug fixes in S3DistCp
  • The ability to run Spark on an EMR cluster

To take advantage of these new features, we are now upgrading our Hadoop Enrichment process to run on Hadoop 2.4 and the EMR 3.x.x series AMIs exclusively. Our testing has been with the 3.6.0 AMI, so that is the recommended version currently.

To reflect this breaking change, the new version of Hadoop Enrich is 1.0.0. Because our Hadoop Shred process works on Hadoop 2.4 without code changes, its version is unchanged at 0.4.0.

We are hugely excited about our move to Hadoop 2.x and YARN! This should allow for some powerful new capabilities in the Snowplow batch pipeline, such as mixed Hadoop/Spark event processing.

2. Re-enabled Kinesis-Hadoop lambda architecture

A Lambda Architecture is Nathan Marz’s term for a hybrid batch and streaming architecture for event processing. There are two reasons why users of Snowplow’s Kinesis pipeline should consider a lambda architecture, operating the Hadoop pipeline alongside their existing Kinesis flow:

  1. The Hadoop pipeline allows you to re-process your raw events (e.g. when we introduce a new enrichment) long after the raw events have expired from your Kinesis stream
  2. The Hadoop pipeline lets you load Snowplow enriched events into Amazon Redshift (or Postgres)

To run the Hadoop pipeline alongside your Kinesis pipeline follow these steps:

  1. Deploy the kinesis-s3 application and configure it to write your Kinesis stream of raw Snowplow events to Amazon S3
  2. Deploy the Hadoop pipeline and configure EmrEtlRunner to read from the S3 bucket from #1 with collector_format set to thrift

This release fixes some issues with running the Kinesis-Hadoop lambda architecture which were related to Amazon’s introduction of IAM roles for Elastic MapReduce; two of these fixes were implemented in EmrEtlRunner (#1715 and #1647), so you will have to upgrade your EmrEtlRunner as per the instructions below.

3. JavaScript scripting enrichment

The JavaScript scripting enrichment lets you write a JavaScript function which is executed in the Enrichment process for each enriched event, and returns one or more derived contexts which are attached to the final enriched event.

Use this enrichment to apply your own business logic to your enriched events; because your JavaScript function can throw exceptions which are gracefully handled by the calling Enrichment process, you can also use this enrichment to perform simple filtering of events.

This enrichment has been introduced for the Hadoop pipeline only in this release; it will be added to the Kinesis pipeline in our next release.

3.1 Usage guide

Your JavaScript must include a function, process(event), which:

  • Takes a Snowplow enriched event POJO (Plain Old Java Object) as its sole argument
  • Returns a JavaScript array of valid self-describing JSONs, which will be added to the derived_contexts field in the enriched event
  • Returns [] or null if there are no contexts to add to this event
  • Can throw exceptions, but note that throwing an exception will cause the entire enriched event to end up in the Bad Bucket or Bad Stream

Note that you can also include other top-level functions and variables in your JavaScript script - but you must include a process(event) function somewhere in your script.
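As a minimal sketch of these rules, the script below combines top-level variables, a helper function and a process(event) function that filters events by throwing. The blocked app_id values, the selfDescribing helper and the com.acme schema are all hypothetical, chosen only for illustration:

```javascript
// Hypothetical values, for illustration only.
var BLOCKED_APP_IDS = ["test", "staging"];
var SCHEMA = "iglu:com.acme/app_id_info/jsonschema/1-0-0";

// Top-level helper: wraps a data object in a self-describing JSON.
function selfDescribing(schema, data) {
    return { schema: schema, data: data };
}

function process(event) {
    var rawAppId = event.getApp_id();

    // Nothing to derive from: add no contexts ([] would work equally well).
    if (rawAppId == null) {
        return null;
    }

    // Coerce to a JavaScript string (the getter may hand back a Java string).
    var appId = String(rawAppId);

    // Simple filtering: throwing sends the whole event to the Bad Bucket or Bad Stream.
    if (BLOCKED_APP_IDS.indexOf(appId) !== -1) {
        throw "Event filtered out, blocked app_id: " + appId;
    }

    return [ selfDescribing(SCHEMA, { appIdLength: appId.length }) ];
}
```

You can exercise a script like this outside the pipeline by calling process with a stub object that exposes only the getters the script uses, for example { getApp_id: function () { return "shop"; } }.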

For a more detailed usage guide, please see the JavaScript script enrichment wiki page.

3.2 Example

Here is an example JavaScript script for this enrichment:

const SECRET_APP_ID = "Joshua";

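function process(event) {
    var appId = event.getApp_id();
    // Read the platform via its getter, named in the same bean style as getApp_id()
    var platform = event.getPlatform();

    if (platform == "server" && appId != SECRET_APP_ID) {
        throw "Server-side event has invalid app_id: " + appId;
    }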

    if (appId == null) {
        return [];
    }

    var appIdUpper = new String(appId.toUpperCase());
    return [ { schema: "iglu:com.acme/derived_app_id/jsonschema/1-0-0",
               data:  { appIdUpper: appIdUpper } } ];
}

This function serves two distinct roles:

  1. If this is a server-sent event, we validate that the app_id matches our secret. This is a simple way of preventing a “bad actor” from spoofing our server-sent events
  2. If app_id is not null, we return a new context for Acme Inc, derived_app_id, which contains the upper-cased app_id

These are of course just very simple examples - we look forward to seeing what the community come up with!

3.3 How this enrichment works

This enrichment uses the Rhino JavaScript engine to execute your JavaScript. Your JavaScript is pre-compiled, so your code should approach native Java speeds.

The process function is passed the exact Snowplow enriched event POJO. The return value from the process function is converted into a JSON string (using JSON.stringify) in JavaScript before being retrieved in our Scala code. Our Scala code confirms that the return value is either null or an empty or non-empty array of Objects. No validation of the self-describing JSONs inside the array is performed.
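The shape check described above can be approximated in plain JavaScript. This is only a sketch of the behaviour as described in this post, not the actual Scala code: the function name checkReturnValue and the error messages are made up for illustration, and the handling of values that cannot be serialised (such as undefined) is an assumption.

```javascript
// Approximates the described handling of the value returned by process(event).
function checkReturnValue(returned) {
    // Step 1: serialise inside JavaScript, as the enrichment is described as doing.
    var json = JSON.stringify(returned);
    if (json === undefined) {
        // Assumption: a value with no JSON form (e.g. undefined) is rejected.
        throw new Error("Return value cannot be serialised to JSON");
    }

    // Step 2: confirm the value is null or an array of objects.
    var parsed = JSON.parse(json);
    if (parsed === null) {
        return [];
    }
    if (!Array.isArray(parsed)) {
        throw new Error("Expected null or an array, got: " + json);
    }
    parsed.forEach(function (item, i) {
        if (item === null || typeof item !== "object" || Array.isArray(item)) {
            throw new Error("Element " + i + " is not an object");
        }
    });

    // The self-describing JSONs inside the array are deliberately not validated here.
    return parsed;
}
```

Note that an array such as [ { anything: true } ] passes this check even though it is not a valid self-describing JSON, so it is worth validating your contexts against their schemas yourself.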

If you are interested in learning more about Rhino and the JVM, check out our earlier R&D blog post, Scripting Hadoop, Part One - Adventures with Scala, Rhino and JavaScript.

4. Other changes

We have also:

  • Fixed the various incorrect links in Scala Common Enrich’s README.md, thank you Snowplow community member and intern Vincent Ohprecio! (#1669)
  • Made the mkt_ and refr_ fields TSV safe - big thanks to Snowplow community member Jason Bosco for this! (#1643)
  • Fixed an uncaught NullPointerException in our JSON error handling code’s stripInstanceEtc function (#1622)
  • On the data modeling side of things, we have removed restrictions in sessions and visitors-source (#1725)

5. Upgrading

5.1 Upgrading your EmrEtlRunner

You need to update EmrEtlRunner to the latest version (0.15.0) on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r66-oriental-skylark
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment

5.2 Updating EmrEtlRunner's configuration

You need to update your EmrEtlRunner’s config.yml file to reflect the new Hadoop 2.4.0 and AMI 3.6.0 support:

:emr:
  :ami_version: 3.6.0 # WAS 2.4.2

And:

  :versions:
    :hadoop_enrich: 1.0.0 # WAS 0.14.1

For a complete example, see our sample config.yml template.

5.3 JavaScript scripting enrichment

You can enable this enrichment by creating a self-describing JSON and adding it to your enrichments folder. The configuration JSON should validate against the javascript_script_config schema.

The configuration JSON for the JavaScript example above would be as follows, with the script itself supplied Base64-encoded in the script parameter:

{
    "schema": "iglu:com.snowplowanalytics.snowplow/javascript_script_config/jsonschema/1-0-0",
    "data": {
        "vendor": "com.snowplowanalytics.snowplow",
        "name": "javascript_script_config",
        "enabled": true,
        "parameters": {
            "script": "Y29uc3QgU0VDUkVUX0FQUF9JRCA9ICJKb3NodWEiOw0KDQovKioNCiAqIFBlcmZvcm1zIHR3byByb2xlczoNCiAqIDEuIElmIHRoaXMgaXMgYSBzZXJ2ZXItc2lkZSBldmVudCwgd2UNCiAqICAgIHZhbGlkYXRlIHRoYXQgdGhlIGFwcF9pZCBpcyBvdXINCiAqICAgIHZhbGlkIHNlY3JldC4gUHJldmVudHMgc3Bvb2Zpbmcgb2YNCiAqICAgIG91ciBzZXJ2ZXItc2lkZSBldmVudHMNCiAqIDIuIElmIGFwcF9pZCBpcyBub3QgbnVsbCwgcmV0dXJuIGEgbmV3DQogKiAgICBBY21lIGNvbnRleHQsIGRlcml2ZWRfYXBwX2lkLCB3aGljaA0KICogICAgY29udGFpbnMgdGhlIHVwcGVyLWNhc2VkIGFwcF9pZA0KICovDQpmdW5jdGlvbiBwcm9jZXNzKGV2ZW50KSB7DQogICAgdmFyIGFwcElkID0gZXZlbnQuZ2V0QXBwX2lkKCk7DQoNCiAgICBpZiAocGxhdGZvcm0gPT0gInNlcnZlciIgJiYgYXBwSWQgIT0gU0VDUkVUX0FQUF9JRCkgew0KICAgICAgICB0aHJvdyAiU2VydmVyLXNpZGUgZXZlbnQgaGFzIGludmFsaWQgYXBwX2lkOiAiICsgYXBwSWQ7DQogICAgfQ0KDQogICAgaWYgKGFwcElkID09IG51bGwpIHsNCiAgICAgICAgcmV0dXJuIFtdOw0KICAgIH0NCg0KICAgIHZhciBhcHBJZFVwcGVyID0gbmV3IFN0cmluZyhhcHBJZC50b1VwcGVyQ2FzZSgpKTsNCiAgICByZXR1cm4gWyB7IHNjaGVtYTogImlnbHU6Y29tLmFjbWUvZGVyaXZlZF9hcHBfaWQvanNvbnNjaGVtYS8xLTAtMCIsDQogICAgICAgICAgICAgICBkYXRhOiAgeyBhcHBJZFVwcGVyOiBhcHBJZFVwcGVyIH0gfSBdOw0KfQ=="
        }
    }
}
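The script parameter holds the Base64-encoded source of your script. A minimal Node.js sketch for producing it is shown below. It assumes Node's built-in Buffer is available, and the encodeScript and decodeScript helper names and the trivial script are hypothetical:

```javascript
// Source of the enrichment script to deploy (a trivial placeholder here).
var script = [
    "function process(event) {",
    "    return [];",
    "}"
].join("\n");

// Encode script source as Base64 for the "script" parameter.
function encodeScript(source) {
    return Buffer.from(source, "utf8").toString("base64");
}

// Decode it again, which is handy for checking what a config actually contains.
function decodeScript(encoded) {
    return Buffer.from(encoded, "base64").toString("utf8");
}

var config = {
    schema: "iglu:com.snowplowanalytics.snowplow/javascript_script_config/jsonschema/1-0-0",
    data: {
        vendor: "com.snowplowanalytics.snowplow",
        name: "javascript_script_config",
        enabled: true,
        parameters: { script: encodeScript(script) }
    }
};

// Print the configuration JSON, ready to save into your enrichments folder.
console.log(JSON.stringify(config, null, 4));
```

Decoding the script value of an existing configuration with decodeScript is a quick way to confirm that it matches the script you intended to deploy.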

6. Getting help

For more details on this release, please check out the r66 Oriental Skylark release notes on GitHub.

Documentation on the new JavaScript script enrichment is available on the wiki.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.