Snowplow R102 Afontova Gora released with EmrEtlRunner improvements

03 April 2018  •  Anton Parkhomenko

We are pleased to announce the release of Snowplow R102. This Snowplow batch pipeline release brings several long-awaited improvements and bugfixes to EmrEtlRunner, increasing the efficiency and stability of the batch pipeline.

Read on for more information on R102 Afontova Gora, named after the complex of Upper Paleolithic sites near my hometown of Krasnoyarsk, Central Siberia:

  1. Support for Stream Enrich'ed events
  2. RDB Loader R29 compatibility
  3. Other improvements
  4. Upgrading
  5. Roadmap
  6. Getting help

1. Support for Stream Enrich'ed events

1.1 Snowplow Lambda architecture 101

Broadly speaking, the Snowplow platform has two primary flavors: the original batch pipeline, and the newer realtime pipeline.

The Snowplow realtime pipeline is not yet a strict superset of the batch pipeline: in particular, it is missing the batch pipeline’s functionality to prepare and load enriched events into Amazon Redshift.

For Snowplow realtime users wanting to load Redshift, we support a so-called Lambda architecture, which serves as a scalable and fault-tolerant combination of batch and realtime layers within a single pipeline.

In the Snowplow Lambda architecture, the Scala Stream Collector writes raw collector payloads to Kinesis, and this raw stream is then sunk to S3. It is at this point that the pipeline splits into two independent flows, with:

  1. Stream Enrich reading from Kinesis, and then onwards into further Kinesis streams
  2. Spark Enrich reading from S3, and then onwards into RDB Shredder and RDB Loader

Although this architecture is widely used and works well, there is some inefficiency here: we are running the same enrichment process twice - in the realtime and batch layers.

1.2 EmrEtlRunner's new Stream Enrich support

To remove this duplication, in theory we could set up a Snowplow S3 Loader downstream of Stream Enrich’s enriched event stream, sinking those enriched events to S3. But unfortunately EmrEtlRunner didn’t support processing those enriched event files, leading creative members of the community to try workarounds involving our Dataflow Runner app.

R102 Afontova Gora fixes this, introducing a “Stream Enrich mode” for EmrEtlRunner; this mode of operation effectively forces EmrEtlRunner to skip the staging of the collector payloads and the running of Spark Enrich - instead, EmrEtlRunner kicks off by staging enriched data written by S3 Loader.

In “Stream Enrich mode”, the EmrEtlRunner steps are as follows (an example invocation is shown after the list):

  1. Stage the enriched events which were written to S3 by the realtime pipeline
  2. Prepare the enriched events for Redshift using RDB Shredder
  3. Load the events into Redshift using RDB Loader
  4. Archive the events in S3
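
Enabling this mode does not add any new command-line flags - EmrEtlRunner detects it from your configuration, as described in the Upgrading section below. As a minimal sketch, assuming the conventional config.yml and iglu_resolver.json file names, a Stream Enrich mode run is therefore invoked exactly like any other run:

# Stream Enrich mode is switched on by the aws.s3.buckets.enriched.stream
# property in config.yml, so the invocation itself is unchanged:
./snowplow-emr-etl-runner run -c config.yml -r iglu_resolver.json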

One important difference: in Stream Enrich mode the enriched event files are the master copy of the data you will load into Redshift, so make sure never to manually delete data from the enriched.good folder.

If you are an existing Lambda architecture user, please check out the Upgrading section below for help moving to the new architecture.

2. RDB Loader R29 compatibility

The upcoming release for RDB Loader, R29, focuses on improving stability, and guarding against problems such as S3 eventual consistency and accidental double-loading.

This release’s EmrEtlRunner update prepares for the upcoming RDB Loader release by passing additional information to the RDB Shredder and RDB Loader EMR steps, if the specified versions of those artifacts are from R29 or above.

Stay tuned for the RDB Loader R29 release, where the new functionality will be explained.

3. Other improvements

This release of EmrEtlRunner also brings multiple bugfixes and improvements to the batch pipeline’s operational stability.

EmrEtlRunner now tries to recover from common but intermittent failures such as RequestTimeout (#3468) and ServiceUnavailable (#3539). This should reduce the need for manual recoveries.

Additionally, for AMI 5 clusters, EmrEtlRunner now uses a specific bootstrap action which tweaks network settings to make a cluster fully compatible with AWS NAT Gateway.

4. Upgrading

The latest version of EmrEtlRunner is available from our Bintray.

4.1 Upgrading for batch pipeline users

If you are only using the Snowplow batch pipeline, then it is still important to upgrade EmrEtlRunner, to prepare for the next RDB Loader release.

You won’t have to make any configuration file updates as part of this upgrade.

4.2 Upgrading for Lambda architecture users

If you currently run a Lambda architecture (realtime plus batch), then you will most likely want to upgrade to EmrEtlRunner’s new “Stream Enrich mode”.

To turn this mode on, you need to add a new aws.s3.buckets.enriched.stream property to your config.yml file. This should point to the bucket where you have configured the Snowplow S3 Loader to write enriched events. Add it like so:

aws:
  s3:
    buckets:
      enriched:
        stream: s3://path-to-kinesis/output/

In Stream Enrich mode, some properties in your config.yml file, such as aws.s3.buckets.raw, aws.s3.buckets.enriched.bad and aws.s3.buckets.enriched.errors, are ignored by EmrEtlRunner and other batch applications.
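
As a rough illustration only - the bucket names below are hypothetical, and the dedicated template mentioned next is the authoritative reference - the buckets section of a Stream Enrich mode config.yml might look like:

aws:
  s3:
    buckets:
      # Hypothetical bucket names - see stream_config.yml for the full template
      log: s3://my-pipeline/logs
      enriched:
        stream: s3://path-to-kinesis/output/     # written by the Snowplow S3 Loader
        good: s3://my-pipeline/enriched/good     # where enriched events are staged
        archive: s3://my-pipeline/enriched/archive
      shredded:
        good: s3://my-pipeline/shredded/good
        bad: s3://my-pipeline/shredded/bad
        errors: s3://my-pipeline/shredded/errors
        archive: s3://my-pipeline/shredded/archive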

For a complete example, we now have a dedicated sample stream_config.yml template - this shows what you need to set, and what you can remove.

Notice that in Stream Enrich mode the staging, enrich and archive_raw steps are effectively no-ops. To avoid staging enriched data a second time during a recovery in this mode, you need to skip the new staging_stream_enrich step, as shown below.
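
For example, assuming the conventional file names again, a recovery run would skip that step like so:

# Recovery invocation: skip staging_stream_enrich so that
# already-staged enriched data is not staged a second time
./snowplow-emr-etl-runner run -c config.yml -r iglu_resolver.json --skip staging_stream_enrich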

An important point: you need to be careful when making this switchover to avoid either missing events from Redshift, or duplicating them. Our preferred switchover approach is to:

  1. Prevent missing events, by building in some overlap between the time periods covered by the raw and enriched folders in S3, and
  2. Prevent duplication in Redshift, by temporarily enabling cross-batch deduplication if it’s not already enabled (see the sketch below)
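
If you do need to turn on cross-batch deduplication for the switchover, it is configured via a DynamoDB-backed duplicate-storage target rather than a config.yml property. The snippet below is an illustrative sketch only - the schema version and all field values are placeholders, so please check them against the cross-batch deduplication setup guide:

{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/amazon_dynamodb_config/jsonschema/1-0-0",
  "data": {
    "name": "Event manifest for cross-batch deduplication",
    "accessKeyId": "PLACEHOLDER",
    "secretAccessKey": "PLACEHOLDER",
    "awsRegion": "us-east-1",
    "dynamodbTable": "snowplow-event-manifest"
  }
}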

5. Roadmap

Upcoming Snowplow releases will include:

  • R103 Paestum, an urgent update of our IP Lookups Enrichment, moving us away from using the legacy MaxMind database format, which won’t be updated after 2nd April 2018
  • R10x [STR] PII Enrichment phase 2, enhancing our recently-released GDPR-focused PII Enrichment for the realtime pipeline

6. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please visit our Discourse forum.