21 February 2017  •  Releases  •  Alex Dean

Snowplow 87 Chichen Itza released

We are pleased to announce the immediate availability of Snowplow 87 Chichen Itza.

This release contains a wide array of new features, stability enhancements and performance improvements for EmrEtlRunner and StorageLoader. As of this release EmrEtlRunner lets you specify EBS volumes for your Hadoop worker nodes; meanwhile StorageLoader now writes to a dedicated manifest table to record each load.

Continuing with this release series named for archaelogical sites, Release 87 is Chichen Itza, the ancient Mayan city in the Yucatan Peninsula in Mexico.

Read on after the fold for:

  1. Specifying EBS volumes for Hadoop in EmrEtlRunner
  2. EmrEtlRunner stability and performance improvements
  3. A load manifest for Redshift
  4. StorageLoader stability improvements
  5. Upgrading
  6. Roadmap
  7. Getting help


1. Specifying EBS volumes for Hadoop in EmrEtlRunner

A recurring request from the Snowplow community has been for increased control over how the Snowplow batch pipeline runs on Elastic MapReduce.

Over time, our plan is to give you total control over this, with our planned migration from EmrEtlRunner to our new Dataflow Runner, as per our RFC. However, this plan will take some time, and in the meantime we are continuing to invest in improving EmrEtlRunner.

In this release we add the ability to specify an EBS volume to attach to each core instance in your EMR cluster. This is particularly powerful for two scenarios:

  1. When you want to use new EBS-only instance types, such as the c4 series, for your EMR jobs
  2. When you have significant event volumes and you would like much more finegrained control over the amount of disk you are allocating to the HDFS cluster that is formed out of the task nodes

EmrEtlRunner lets you attach one EBS volume to each node, broadly exposing the functionality described in the EMR documentation for Amazon EBS volumes. For an example, please see the upgrade section 5.2 Updating config.yml below.

2. EmrEtlRunner stability and performance improvements

We have made a variety of “under-the-hood” improvements to the EmrEtlRunner.

Most noticeably, we have migrated the archival code for raw collector payloads from EmrEtlRunner into the EMR cluster itself, where the work is performed by the S3DistCp distributed tool. This should reduce the strain on your server running EmrEtlRunner, and should improve the speed of that step. Note that as a result of this, the raw files are now archived in the same way as the enriched and shredded files, using run= sub-folders.

For more robust monitoring of EMR while waiting for jobflow completion, EmrEtlRunner now anticipates and recovers from additional Elasticity errors (ThrottlingException and ArgumentError).

For users running Snowplow in a Lambda architecture we have removed the UnmatchedLzoFilesError check, which would prevent EMR from starting even though an LZO index file missing from the processing folder is in fact benign.

In the case that a previous run has failed or is ungoing, EmrEtlRunner now exits out with a dedicated return code (4).

Finally, we have bumped the JRuby version for EmrEtlRunner to, and upgraded the key Elasticity dependency to 6.0.10.

3. A load manifest for Redshift

As of this release, StorageLoader now populates a manifest table as part of the Redshift load. The table is simply called manifest and lives in the same schema as your events and other tables.

Here are the last 5 loads for one of our internal pipelines:

snplow=# select * from atomic.manifest order by etl_tstamp desc limit 5;
       etl_tstamp        |       commit_tstamp        | event_count | shredded_cardinality
 2017-02-21 19:01:49.648 | 2017-02-21 19:28:33.394898 |        4340 |                    8
 2017-02-21 15:02:02.03  | 2017-02-21 15:27:24.534884 |        2891 |                    7
 2017-02-21 11:02:09.63  | 2017-02-21 11:28:00.317476 |        2210 |                    7
 2017-02-21 08:02:14.957 | 2017-02-21 08:27:59.829461 |        1945 |                    6
 2017-02-21 03:01:39.476 | 2017-02-21 03:27:03.493547 |        1986 |                    8
(5 rows)

The fields are as follows:

  • etl_tstamp is the time at which the Snowplow pipeline run started
  • commit_tstamp is the time at which the load transaction started
  • event_count is the number of events loaded into the events table as part of this load transaction
  • shredded_cardinality is how many different self-describing event and context tables were loaded as part of this load transaction

At the moment this manifest table is only informational, however in the future we want to use it proactively - for example to prevent a batch of events from being accidentally double-loaded into Redshift.

4. StorageLoader stability improvements

As with EmrEtlRunner, we have bumped the JRuby version for StorageLoader to

We have also fixed a critical bug for loading events into Postgres via StorageLoader (#2888).

5. Upgrading

5.1 Upgrading EmrEtlRunner and StorageLoader

The latest version of the EmrEtlRunner and StorageLoader are available from our Bintray here.

5.2 Updating config.yml

To make use of the new ability to specify EBS volumes for your EMR cluster’s core nodes, update your configuration YAML like so:

      master_instance_type: m1.medium
      core_instance_count: 1
      core_instance_type: c4.2xlarge
      core_instance_ebs:   # Optional. Attach an EBS volume to each core instance.
        volume_size: 200    # Gigabytes
        volume_type: "io1"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true

The above configuration will attach an EBS volume of 200 GiB to each core instance in your EMR cluster; the volumes will be Provisioned IOPS (SSD), with performance of 400 IOPS/GiB. The volumes will not be EBS optimized. Note that this configuration has finally allowed us to use the EBS-only c4 instance types for our core nodes.

For a complete example, see our sample config.yml template.

5.3 Upgrading Redshift

You will also need to deploy the following manifest table for Redshift:

This table should be deployed into the same schema as your events and other tables.

6. Roadmap

Upcoming Snowplow releases include:

Note that these releases are always subject to change between now and the actual release date.

7. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.

Thoughts or questions? Come join us in our Discourse forum!
Alex Dean
Alex is co-founder and technical lead at Snowplow. You can find in him on , Twitter and LinkedIn.