We are pleased to announce the immediate availability of Snowplow 87 Chichen Itza.
This release contains a wide array of new features, stability enhancements and performance improvements for EmrEtlRunner and StorageLoader. As of this release EmrEtlRunner lets you specify EBS volumes for your Hadoop worker nodes; meanwhile StorageLoader now writes to a dedicated manifest table to record each load.
Continuing with this release series named for archaelogical sites, Release 87 is Chichen Itza, the ancient Mayan city in the Yucatan Peninsula in Mexico.
Read on after the fold for:
- Specifying EBS volumes for Hadoop in EmrEtlRunner
- EmrEtlRunner stability and performance improvements
- A load manifest for Redshift
- StorageLoader stability improvements
- Getting help
1. Specifying EBS volumes for Hadoop in EmrEtlRunner
A recurring request from the Snowplow community has been for increased control over how the Snowplow batch pipeline runs on Elastic MapReduce.
Over time, our plan is to give you total control over this, with our planned migration from EmrEtlRunner to our new Dataflow Runner, as per our RFC. However, this plan will take some time, and in the meantime we are continuing to invest in improving EmrEtlRunner.
In this release we add the ability to specify an EBS volume to attach to each core instance in your EMR cluster. This is particularly powerful for two scenarios:
- When you want to use new EBS-only instance types, such as the
c4series, for your EMR jobs
- When you have significant event volumes and you would like much more finegrained control over the amount of disk you are allocating to the HDFS cluster that is formed out of the task nodes
EmrEtlRunner lets you attach one EBS volume to each node, broadly exposing the functionality described in the EMR documentation for Amazon EBS volumes. For an example, please see the upgrade section 5.2 Updating config.yml below.
2. EmrEtlRunner stability and performance improvements
We have made a variety of “under-the-hood” improvements to the EmrEtlRunner.
Most noticeably, we have migrated the archival code for raw collector payloads from EmrEtlRunner into the EMR cluster itself, where the work is performed by the S3DistCp distributed tool. This should reduce the strain on your server running EmrEtlRunner, and should improve the speed of that step. Note that as a result of this, the raw files are now archived in the same way as the enriched and shredded files, using
For more robust monitoring of EMR while waiting for jobflow completion, EmrEtlRunner now anticipates and recovers from additional Elasticity errors (
For users running Snowplow in a Lambda architecture we have removed the
UnmatchedLzoFilesError check, which would prevent EMR from starting even though an LZO index file missing from the processing folder is in fact benign.
In the case that a previous run has failed or is ungoing, EmrEtlRunner now exits out with a dedicated return code (
Finally, we have bumped the JRuby version for EmrEtlRunner to 22.214.171.124, and upgraded the key Elasticity dependency to 6.0.10.
3. A load manifest for Redshift
As of this release, StorageLoader now populates a manifest table as part of the Redshift load. The table is simply called
manifest and lives in the same schema as your
events and other tables.
Here are the last 5 loads for one of our internal pipelines:
The fields are as follows:
etl_tstampis the time at which the Snowplow pipeline run started
commit_tstampis the time at which the load transaction started
event_countis the number of events loaded into the
eventstable as part of this load transaction
shredded_cardinalityis how many different self-describing event and context tables were loaded as part of this load transaction
At the moment this manifest table is only informational, however in the future we want to use it proactively – for example to prevent a batch of events from being accidentally double-loaded into Redshift.
4. StorageLoader stability improvements
As with EmrEtlRunner, we have bumped the JRuby version for StorageLoader to 126.96.36.199.
We have also fixed a critical bug for loading events into Postgres via StorageLoader (#2888).
5.1 Upgrading EmrEtlRunner and StorageLoader
The latest version of the EmrEtlRunner and StorageLoader are available from our Bintray here.
5.2 Updating config.yml
To make use of the new ability to specify EBS volumes for your EMR cluster’s core nodes, update your configuration YAML like so:
The above configuration will attach an EBS volume of 200 GiB to each core instance in your EMR cluster; the volumes will be Provisioned IOPS (SSD), with performance of 400 IOPS/GiB. The volumes will not be EBS optimized. Note that this configuration has finally allowed us to use the EBS-only
c4 instance types for our core nodes.
For a complete example, see our sample
5.3 Upgrading Redshift
You will also need to deploy the following manifest table for Redshift:
This table should be deployed into the same schema as your
events and other tables.
Upcoming Snowplow releases include:
- R8x [HAD] Cross-batch natural deduplication, removing duplicates across ETL runs using an event manifest in DynamoDB
- R9x [HAD] Spark port, which will port our Hadoop Enrich and Hadoop Shred jobs from Scalding to Apache Spark
- R9x [HAD] 4 webhooks, which will add support for 4 new webhooks (Mailgun, Olark, Unbounce, StatusGator)
Note that these releases are always subject to change between now and the actual release date.
7. Getting help
For more details on this release, please check out the release notes on GitHub.