Broadly speaking, the Snowplow platform has two primary flavors: the original batch pipeline, and the newer realtime pipeline.
The Snowplow realtime pipeline is not a strict superset of the batch pipeline's capabilities: it lacks the batch pipeline's functionality for preparing and loading enriched events into Amazon Redshift.
For Snowplow realtime users wanting to load Redshift, we support a so-called Lambda architecture, which serves as a scalable and fault-tolerant combination of batch and realtime layers within a single pipeline.
In the Snowplow Lambda architecture, the Snowplow S3 Loader writes raw collector payload data from Kinesis to S3, and it is at this point that the pipeline splits into two independent flows, with:
Although this architecture is widely used and works well, there is some inefficiency here: we are running the same enrichment process twice - in the realtime and batch layers.
To remove this duplication, in theory we could set up a Snowplow S3 Loader downstream of Stream Enrich's enriched event stream, sinking those enriched events to S3. Unfortunately, EmrEtlRunner did not support processing those enriched event files, leading creative members of the community to try workarounds involving our Dataflow Runner app.
R102 Afontova Gora fixes this, introducing a “Stream Enrich mode” for EmrEtlRunner; this mode of operation effectively forces EmrEtlRunner to skip the staging of the collector payloads and the running of Spark Enrich - instead, EmrEtlRunner kicks off by staging enriched data written by S3 Loader.
In “Stream Enrich mode”, the EmrEtlRunner steps are as follows:
One important difference: in Stream Enrich mode, the enriched event files are the master copy of the data you will load into Redshift, so make sure never to manually delete this data.
If you are an existing Lambda architecture user, please check out the Upgrading section below for help moving to the new architecture.
The upcoming release for RDB Loader, R29, focuses on improving stability, and guarding against problems such as S3 eventual consistency and accidental double-loading.
This release's EmrEtlRunner update prepares for the upcoming RDB Loader release by passing additional information to the relevant EMR steps if the specified versions of the artifacts are from R29 or above.
Stay tuned for the RDB Loader R29 release, where the new functionality will be explained.
This release of EmrEtlRunner also brings multiple bugfixes and improvements to the batch pipeline’s operational stability.
EmrEtlRunner now tries to recover from common but intermittent failures such as RequestTimeout (#3468) or ServiceUnavailable (#3539). This should reduce the need for manual recoveries.
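The recovery pattern at work here can be illustrated with a minimal retry-with-backoff sketch. Note this is only an illustration, not EmrEtlRunner's actual code (EmrEtlRunner is written in Ruby), and the function and error-signalling conventions are assumptions for the sketch:

```python
import time

# Transient AWS error codes worth retrying (illustrative subset)
TRANSIENT_ERRORS = {"RequestTimeout", "ServiceUnavailable"}

def with_retries(operation, max_attempts=5, base_delay=2.0):
    """Run `operation`, retrying transient failures with exponential backoff.

    For this sketch, `operation` is assumed to raise RuntimeError(code),
    where `code` is the AWS error code string.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RuntimeError as exc:
            code = str(exc)
            # Non-transient errors, or exhausted retries, surface to the operator
            if code not in TRANSIENT_ERRORS or attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

The key design point is distinguishing transient from permanent failures: only the former are retried, so genuine misconfigurations still fail fast.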
Additionally, for AMI 5 clusters, EmrEtlRunner now uses a specific bootstrap action which tweaks network settings to make a cluster fully compatible with AWS NAT Gateway.
The latest version of EmrEtlRunner is available from our Bintray.
If you are only using the Snowplow batch pipeline, then it is still important to upgrade EmrEtlRunner, to prepare for the next RDB Loader release.
You won’t have to make any configuration file updates as part of this upgrade.
If you currently run a Lambda architecture (realtime plus batch), then you will most likely want to upgrade to EmrEtlRunner’s new “Stream Enrich mode”.
To turn this mode on, you need to add a new aws.s3.buckets.enriched.stream property to your config.yml file. This should point to the bucket where you have configured the Snowplow S3 Loader to write enriched events. Add this like so:
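A sketch of the relevant config.yml fragment follows; the bucket path is an example placeholder, to be replaced with the bucket your S3 Loader writes to:

```yaml
aws:
  s3:
    buckets:
      enriched:
        # Example bucket path - point this at the location where
        # Snowplow S3 Loader sinks your enriched events
        stream: s3://my-enriched-events-bucket/
```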
In Stream Enrich mode, some properties in your config.yml file, such as aws.s3.buckets.enriched.errors, are ignored by EmrEtlRunner and other batch applications.
For a complete example, we now have a dedicated sample stream_config.yml template - this shows what you need to set, and what you can remove.
Notice that in Stream Enrich mode, archive_raw steps are effectively a no-op. To avoid staging enriched data during recovery in this mode, you need to skip the new
An important point: you need to be careful when making this switchover to avoid either missing events from Redshift or duplicating them. Our preferred switchover approach is to:
Upcoming Snowplow releases will include:
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.