Snowplow 70 Bornean Green Magpie released

19 August 2015  •  Fred Blundun

We are happy to announce the release of Snowplow version 70 Bornean Green Magpie. This release focuses on improving our StorageLoader and EmrEtlRunner components and is the first step towards combining the two into a single CLI application.

The rest of this post will cover the following topics:

  1. Combined configuration
  2. Move to JRuby
  3. Improved retry logic
  4. App monitoring with Snowplow
  5. Compression support
  6. Environment variables in configuration files
  7. Loading Postgres via stdin
  8. Multiple in buckets
  9. New safety checks
  10. Other changes
  11. Upgrading
  12. Getting help


1. Combined configuration

This release unifies the configuration file format for the EmrEtlRunner and the StorageLoader. This means that you only need a single configuration file, shared between the two apps.

An example configuration file is available in the repository.

This release also includes a script named combine_configurations.rb which can be used to combine your existing configuration files into one. Use it like this:

ruby combine_configurations.rb runner.yml loader.yml combined.yml resolver.json

This will result in the two configuration files being combined into a single file named combined.yml. It will also extract the Iglu resolver into a JSON file named resolver.json. This is because the Iglu resolver is now passed to the EmrEtlRunner as a self-describing JSON using a dedicated command-line argument (i.e. the same format as Scala Kinesis Enrich uses).
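For reference, the resolver file follows the standard self-describing resolver-config shape. A minimal example, with Iglu Central as the only repository (the cacheSize value is illustrative):

```json
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      }
    ]
  }
}
```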

Please note that the combine_configurations.rb script has only been tested on Snowplow R64 and upwards.

2. Move to JRuby

We have ported the EmrEtlRunner and StorageLoader to JRuby. Both apps are now deployable as single “fat jars” which can be run like any other jarfile: no Ruby installation is required, just a Java Runtime Environment (1.7+). This is an exciting step forward for Snowplow, because dealing with Ruby, Bundler and RVM has been a major pain point for Snowplow users.

Both applications are now available pre-built in a single zipfile hosted in our Bintray:

http://dl.bintray.com/snowplow/snowplow-generic/snowplow_emr_r70_bornean_green_magpie.zip

If you prefer to build the apps yourself, each now has a build.sh script which installs the necessary dependencies and saves the fat jar in the “deploy” directory.

Now that both apps use the same configuration file and are built in the same way, our next goal is to combine them into a single “Snowplow CLI” application, again built in JRuby.

3. Improved retry logic

Occasionally an EMR job will fail before any step has begun due to a “bootstrap failure”. In these cases, since no data has been moved or processed, it is always safe to restart the job. Rather than crashing, the new EmrEtlRunner version will detect that the job has halted due to a bootstrap failure and will keep attempting to restart the job until the job succeeds, the job fails for another reason, or the new bootstrap_failure_tries configuration setting is exceeded.
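The retry behaviour can be sketched as follows - this is a simplified illustration, not the actual EmrEtlRunner code, and submit_emr_job is a stand-in that simulates two bootstrap failures before succeeding:

```ruby
# Stand-in for launching the EMR job: simulates two bootstrap
# failures before the job finally starts successfully.
$attempts = 0
def submit_emr_job
  $attempts += 1
  $attempts < 3 ? :bootstrap_failure : :success
end

# Bounded retry: keep resubmitting on bootstrap failures only,
# up to the bootstrap_failure_tries limit.
def run_with_bootstrap_retries(bootstrap_failure_tries)
  tries = 0
  loop do
    tries += 1
    status = submit_emr_job
    # Any outcome other than a bootstrap failure is returned as-is
    return status unless status == :bootstrap_failure
    # A bootstrap failure means no data was moved or processed,
    # so restarting is always safe - unless we have hit the limit
    raise "bootstrap_failure_tries exceeded" if tries >= bootstrap_failure_tries
  end
end
```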

Additionally, the process of polling the EMR job to check its status is now resilient to more errors; Dani Sola from Simply Business contributed error handling to prevent the connection timeouts from crashing the EmrEtlRunner. Thanks Dani!

4. App monitoring with Snowplow

You can now configure both apps to turn on internal Snowplow tracking - this is another step towards making Snowplow “self-hosting”, meaning that one Snowplow instance can be used to monitor the performance of another.

The EmrEtlRunner will fire an event whenever an EMR job starts, succeeds, or fails. These events include data about the name and status of the job and its individual steps. The StorageLoader will fire an event whenever a database load succeeds or fails. In the case of failure, the event will include the error message.

The new tags configuration field can hold a dictionary of name-value pairs. These will get attached to all the above Snowplow events as a context; in a future release we plan to also attach these tags to the running job on EMR.
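In the unified configuration file this looks roughly like the following (the collector endpoint, app_id and tag values below are illustrative):

```yaml
monitoring:
  tags:
    pipeline: main                    # attached to the monitoring events as a context
  snowplow:
    method: get
    app_id: snowplow-monitoring       # illustrative values
    collector: my-collector.example.com
```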

5. Compression support

Dani Sola has added support for compressing enriched events using gzip. Redshift can automatically handle loading gzipped files, so this is a good way to reduce the total storage your enriched events require. It will also significantly speed up the related file move operations. Thanks again Dani!
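Based on the example configuration file, compression is controlled by a single setting in the enrich section (if your copy of the sample file differs, treat this fragment as illustrative):

```yaml
enrich:
  # Set to GZIP to compress enriched events; NONE leaves them uncompressed
  output_compression: GZIP
```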

6. Environment variables in configuration files

If you don’t want to hardcode your AWS credentials in the configuration file, you can now read them in from environment variables by using Ruby ERB templates:

aws:
  access_key_id: <%= ENV['AWS_SNOWPLOW_ACCESS_KEY'] %>
  secret_access_key: <%= ENV['AWS_SNOWPLOW_SECRET_KEY'] %>

Thanks to Snowplow community member Eric Pantera from Viadeo for contributing this feature!

7. Loading Postgres via stdin

When the StorageLoader is loading a PostgreSQL database, it now performs the COPY via stdin rather than directly from the downloaded local event files. This means that the StorageLoader doesn’t have to be run on the same physical machine as the Postgres database. This has two main advantages:

  1. The StorageLoader can now load Postgres databases running on Amazon RDS
  2. Many users had problems setting up the correct permissions for Postgres to read local files. This is no longer required

Big thanks to Matt Walker from Radico for contributing this feature!
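The underlying mechanism can be sketched with psql’s COPY … FROM STDIN. The copy_command helper below is hypothetical, as are the host, database and table names - it only illustrates the shape of the command the events are piped into:

```ruby
# Hypothetical helper: builds a psql invocation that reads the
# tab-separated event rows from stdin rather than from a local file.
def copy_command(host, port, db, table)
  ["psql", "-h", host, "-p", port.to_s, "-d", db,
   "-c", "COPY #{table} FROM STDIN WITH DELIMITER E'\\t'"]
end

cmd = copy_command("my-rds-instance.example.com", 5432, "snowplow", "atomic.events")
# The downloaded events file would then be piped to this command, e.g.:
#   IO.popen(cmd, "w") { |psql| psql.write(File.read("events.tsv")) }
```

Because the data travels over the connection instead of being read from the database server’s filesystem, the loader and the database no longer need to share a machine.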

8. Multiple in buckets

The “in” bucket section of the configuration YAML is now an array, allowing you to process raw events from multiple buckets at once. Currently, all “in” buckets must use the same collector logging format.
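For example (the bucket names here are placeholders):

```yaml
aws:
  s3:
    buckets:
      raw:
        in:
          - s3://first-collector-logs       # placeholder bucket names
          - s3://second-collector-logs
```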

9. New safety checks

When the EmrEtlRunner is launched, it will now immediately abort if the bucket for good shredded events is non-empty.

If the collector format is set to “thrift”, the processing bucket should always contain an even number of files when the job starts. This is because for every .lzo file there should be exactly one .lzo.index file containing the metadata on how to split it. (See the hadoop-lzo project for more information on splittable lzo.) If the number of files is odd, then at least one pair of files is incomplete, and the EmrEtlRunner will fail early.
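The pairing check can be illustrated like this - a simplified sketch, not the actual EmrEtlRunner code:

```ruby
# Simplified sketch of the Thrift/LZO safety check: every .lzo file
# must be accompanied by its .lzo.index file, so the bucket must
# contain an even number of files made up of complete pairs.
def complete_lzo_pairs?(filenames)
  lzo     = filenames.select { |f| f.end_with?(".lzo") }
  indexes = filenames.select { |f| f.end_with?(".lzo.index") }
  lzo.size == indexes.size &&
    lzo.all? { |f| indexes.include?("#{f}.index") }
end
```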

10. Other changes

We have also:

  • Added the ability to read the config file via stdin in EmrEtlRunner and StorageLoader by setting the --config option to “-” (#1772, #1773)
  • Moved the folder of sample enrichment configuration JSONs out of the EmrEtlRunner subproject (#1574)
  • Allowed the “bootstrap” configuration field to be nil (#1575)
  • Updated the sample configuration file to use m1.medium instead of m1.small (thanks Iain Gray)
  • Updated the Vagrant quickstart to automatically install Postgres (#1767)

11. Upgrading

Installing EmrEtlRunner and StorageLoader

Download the EmrEtlRunner and StorageLoader from our Bintray:

http://dl.bintray.com/snowplow/snowplow-generic/snowplow_emr_r70_bornean_green_magpie.zip

Unzip this file to a sensible location (e.g. /opt/snowplow-r70).

Check that you have a compatible JRE (1.7+) installed by invoking one of the two apps:

./snowplow-emr-etl-runner --version
snowplow-emr-etl-runner 0.17.0

That’s it - you are ready to update the configuration files.

Updating the configuration files

Your two old configuration files will no longer work. Use the aforementioned combine_configurations.rb script to turn them into a unified configuration file and a resolver JSON.

For reference, note that field names in the unified configuration file no longer start with a colon - so region: us-east-1, not :region: us-east-1.

Using the new command-line options

The EmrEtlRunner now requires a --resolver argument which should be the path to your new resolver JSON.

Also note that when specifying steps to skip using the --skip option, the “archive” step has been renamed to “archive_raw” in the EmrEtlRunner and “archive_enriched” in the StorageLoader. This is in preparation for merging the two applications into one.

12. Getting help

For more details on this release, please check out the R70 Bornean Green Magpie release notes on GitHub.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.