A recurring request from the Snowplow community has been for increased control over how the Snowplow batch pipeline runs on Elastic MapReduce.
Over time, we intend to give you total control over this through our planned migration from EmrEtlRunner to our new Dataflow Runner, as per our RFC. However, this migration will take some time, and in the meantime we are continuing to invest in improving EmrEtlRunner.
In this release we add the ability to specify an EBS volume to attach to each core instance in your EMR cluster. This is particularly useful if you want to use EBS-only instance types, such as the c4 series, for your EMR jobs.
EmrEtlRunner lets you attach one EBS volume to each node, broadly exposing the functionality described in the EMR documentation for Amazon EBS volumes. For an example, please see the upgrade section 5.2 Updating config.yml below.
We have made a variety of "under-the-hood" improvements to EmrEtlRunner.
Most noticeably, we have migrated the archival code for raw collector payloads from EmrEtlRunner into the EMR cluster itself, where the work is performed by the S3DistCp distributed tool. This should reduce the strain on the server running EmrEtlRunner, and should speed up that step. Note that as a result, the raw files are now archived in the same way as the enriched and shredded files.
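To make the idea of archiving inside the cluster more concrete, here is a minimal Python sketch of what an S3DistCp archival step can look like when expressed as an EMR step definition. It is not EmrEtlRunner's actual code: the function name, the step name and the bucket paths are hypothetical, and it assumes an EMR release where S3DistCp is invoked through `command-runner.jar`.

```python
def s3distcp_archive_step(src, dest, name="Archive raw collector payloads"):
    """Build an EMR step definition that moves files from src to dest with S3DistCp.

    The step name and the choice of ActionOnFailure are illustrative assumptions.
    """
    return {
        "Name": name,
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            # Assumes an EMR release that runs S3DistCp via command-runner.jar
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", src,
                "--dest", dest,
                "--deleteOnSuccess",  # remove the source files once the copy succeeds
            ],
        },
    }


# Hypothetical bucket layout, for illustration only
step = s3distcp_archive_step(
    "s3://my-bucket/raw/processing/",
    "s3://my-bucket/raw/archive/",
)
```

Because the copy runs as a step on the cluster, the machine that launches the job only submits the step definition and no longer moves the files itself.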
For more robust monitoring of EMR while waiting for jobflow completion, EmrEtlRunner now anticipates and recovers from additional Elasticity errors.
For users running Snowplow in a Lambda architecture, we have removed the UnmatchedLzoFilesError check, which would prevent EMR from starting even though an LZO index file missing from the processing folder is in fact benign.
If a previous run has failed or is still ongoing, EmrEtlRunner now exits with a dedicated return code.
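A dedicated return code makes it easier for a scheduler or wrapper script to tell "a previous run is blocking this one" apart from a genuine failure. The sketch below shows one way a wrapper could branch on it. The function name and the outcome labels are hypothetical, and the return code is passed in as a parameter rather than hard-coded, so take the real value from the EmrEtlRunner documentation.

```python
import subprocess
import sys


def run_pipeline(command, blocked_return_code):
    """Run a command and classify the outcome by its return code.

    blocked_return_code is the code signalling that a previous run has
    failed or is still ongoing; its real value is not assumed here.
    """
    result = subprocess.run(command)
    if result.returncode == 0:
        return "ok"
    if result.returncode == blocked_return_code:
        # Nothing was started: skip downstream steps rather than alerting
        return "blocked"
    raise RuntimeError(f"pipeline failed with return code {result.returncode}")


# Stand-in child process that exits with a placeholder code (42 is NOT the real value)
outcome = run_pipeline([sys.executable, "-c", "import sys; sys.exit(42)"], 42)
```

In a real deployment the command would be your EmrEtlRunner invocation, and a "blocked" outcome would typically mean skipping the subsequent StorageLoader run.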
Finally, we have bumped the JRuby version for EmrEtlRunner to 18.104.22.168, and upgraded the key Elasticity dependency to 6.0.10.
As of this release, StorageLoader now populates a manifest table as part of the Redshift load. The table is simply called manifest and lives in the same schema as your events and other tables.
Here are the last 5 loads for one of our internal pipelines:
The fields are as follows:
etl_tstamp is the time at which the Snowplow pipeline run started
commit_tstamp is the time at which the load transaction started
event_count is the number of events loaded into the events table as part of this load transaction
shredded_cardinality is how many different self-describing event and context tables were loaded as part of this load transaction
At the moment this manifest table is only informational. In the future, however, we want to use it proactively, for example to prevent a batch of events from being accidentally double-loaded into Redshift.
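To illustrate how a manifest like this could guard against double-loading, here is a small Python sketch that uses an in-memory SQLite database as a stand-in for Redshift. Only the four column names come from the description above. The column types, the helper functions and the sample rows are invented for illustration, and this is not how StorageLoader itself works.

```python
import sqlite3


def create_manifest(conn):
    # Column names follow the field list above; the types are assumptions
    conn.execute(
        "CREATE TABLE manifest ("
        " etl_tstamp TEXT, commit_tstamp TEXT,"
        " event_count INTEGER, shredded_cardinality INTEGER)"
    )


def already_loaded(conn, etl_tstamp):
    """Return True if a load for this pipeline run is already recorded."""
    row = conn.execute(
        "SELECT COUNT(*) FROM manifest WHERE etl_tstamp = ?", (etl_tstamp,)
    ).fetchone()
    return row[0] > 0


def record_load(conn, etl_tstamp, commit_tstamp, event_count, shredded_cardinality):
    """Record a load, refusing to record the same pipeline run twice."""
    if already_loaded(conn, etl_tstamp):
        raise ValueError(f"run {etl_tstamp} has already been loaded")
    conn.execute(
        "INSERT INTO manifest VALUES (?, ?, ?, ?)",
        (etl_tstamp, commit_tstamp, event_count, shredded_cardinality),
    )


def last_loads(conn, limit=5):
    """Return the most recent loads, newest first."""
    return conn.execute(
        "SELECT etl_tstamp, commit_tstamp, event_count, shredded_cardinality"
        " FROM manifest ORDER BY etl_tstamp DESC LIMIT ?",
        (limit,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
create_manifest(conn)
# Invented sample rows, not real pipeline data
record_load(conn, "2017-02-01 03:00:00", "2017-02-01 04:10:00", 1000, 3)
record_load(conn, "2017-02-02 03:00:00", "2017-02-02 04:05:00", 1200, 4)
```

Keying the check on etl_tstamp works here because each pipeline run has its own start time. A production guard would also need to perform the check and the load inside the same transaction.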
As with EmrEtlRunner, we have bumped the JRuby version for StorageLoader to 22.214.171.124.
We have also fixed a critical bug for loading events into Postgres via StorageLoader (#2888).
The latest versions of EmrEtlRunner and StorageLoader are available from our Bintray here.
To make use of the new ability to specify EBS volumes for your EMR cluster’s core nodes, update your configuration YAML like so:
The above configuration will attach an EBS volume of 200 GiB to each core instance in your EMR cluster; the volumes will be Provisioned IOPS (SSD), provisioned at 400 IOPS. The core instances will not be EBS optimized. Note that this configuration has finally allowed us to use the EBS-only c4 instance types for our core nodes.
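For readers curious how settings like these relate to the underlying EMR API, the following Python sketch builds the `EbsConfiguration` structure that EMR accepts for an instance group, using the values described above. The function and its parameter names are hypothetical, and this is an illustration of the EMR API shape rather than EmrEtlRunner's internal code.

```python
def core_ebs_configuration(size_gib, volume_type, ebs_optimized, iops=None):
    """Build an EMR EbsConfiguration with one EBS volume per instance."""
    spec = {"VolumeType": volume_type, "SizeInGB": size_gib}
    if iops is not None:
        # Provisioned IOPS volumes ("io1") take an explicit IOPS figure
        spec["Iops"] = iops
    return {
        "EbsBlockDeviceConfigs": [
            {"VolumeSpecification": spec, "VolumesPerInstance": 1}
        ],
        "EbsOptimized": ebs_optimized,
    }


# 200 GiB Provisioned IOPS (SSD) volume, 400 IOPS, not EBS optimized
ebs = core_ebs_configuration(200, "io1", ebs_optimized=False, iops=400)
```

A structure of this shape sits under the core instance group's definition when a cluster is created through the EMR API.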
For a complete example, see our sample
You will also need to deploy the following manifest table for Redshift:
This table should be deployed into the same schema as your events and other tables.
Upcoming Snowplow releases include:
Note that these releases are always subject to change between now and the actual release date.
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.