Snowplow 70 Bornean Green Magpie released
We are happy to announce the release of Snowplow version 70 Bornean Green Magpie. This release focuses on improving our StorageLoader and EmrEtlRunner components and is the first step towards combining the two into a single CLI application.
The rest of this post will cover the following topics:
- Combined configuration
- Move to JRuby
- Improved retry logic
- App monitoring with Snowplow
- Compression support
- Environment variables in configuration files
- Loading Postgres via stdin
- Multiple in buckets
- New safety checks
- Other changes
- Upgrading
- Getting help
1. Combined configuration
This release unifies the configuration file format for the EmrEtlRunner and the StorageLoader. This means that you only need a single configuration file, shared between the two apps.
An example configuration file is available in the repository.
This release also includes a script named combine_configurations.rb which can be used to combine your existing configuration files into one. Use it like this:
This will result in the two configuration files being combined into a single file named combined.yml. It will also extract the Iglu resolver into a JSON file named resolver.json. This is because the Iglu resolver is now passed to the EmrEtlRunner as a self-describing JSON using a dedicated command-line argument (i.e. the same format as Scala Kinesis Enrich uses).
Please note that the combine_configurations.rb script has only been tested on Snowplow R64 and upwards.
2. Move to JRuby
We have ported the EmrEtlRunner and StorageLoader to JRuby. Both apps are now deployable as single “fat jars” which can be run like any other jarfile: no Ruby installation is required, just a Java Runtime Environment (1.7+). This is an exciting step forward for Snowplow, because dealing with Ruby, Bundler and RVM has been a major pain point for Snowplow users.
Both applications are now available pre-built in a single zipfile hosted in our Bintray:
If you prefer to build the apps yourself, both now have a build.sh script which installs the necessary dependencies and saves the fat jar in the “deploy” directory.
Now that both apps use the same configuration file and are built in the same way, our next goal is to combine them into a single “Snowplow CLI” application, again built in JRuby.
3. Improved retry logic
Occasionally an EMR job will fail before any step has begun due to a “bootstrap failure”. In these cases, since no data has been moved or processed, it is always safe to restart the job. Rather than crashing, the new EmrEtlRunner version will detect that the job has halted due to a bootstrap failure and will keep attempting to restart it until the job succeeds, the job fails for another reason, or the limit set by the new bootstrap_failure_tries configuration setting is exceeded.
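As a rough illustration of the retry behaviour described above, here is a minimal Ruby sketch. It is not the EmrEtlRunner's actual code: the method name, the status symbols, and the reading of bootstrap_failure_tries as a cap on the total number of launch attempts are all assumptions made for the example.

```ruby
# Hypothetical sketch: none of these names come from the EmrEtlRunner source.
# Launches a job via the given block and relaunches it only when it halted
# because of a bootstrap failure.
#
# The block must return one of :success, :bootstrap_failure or :failed.
def run_with_bootstrap_retries(bootstrap_failure_tries)
  attempts = 0
  loop do
    attempts += 1
    status = yield(attempts)
    # Success, or a failure unrelated to bootstrapping: stop immediately.
    return status unless status == :bootstrap_failure
    # Bootstrap failure: give up once the configured number of tries is used up
    # (assumed here to count total launch attempts).
    return :bootstrap_failure if attempts >= bootstrap_failure_tries
  end
end

# Usage: the first two launches fail during bootstrap, the third succeeds.
outcomes = [:bootstrap_failure, :bootstrap_failure, :success]
result = run_with_bootstrap_retries(3) { |attempt| outcomes[attempt - 1] }
puts result # prints "success"
```

The key design point is that only the bootstrap failure status triggers a relaunch; any other failure is returned straight away, since data may already have been moved or processed.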
Additionally, the process of polling the EMR job to check its status is now resilient to more errors: Dani Sola from Simply Business contributed error handling to prevent connection timeouts from crashing the EmrEtlRunner. Thanks Dani!
4. App monitoring with Snowplow
You can now configure both apps to turn on internal Snowplow tracking - this is another step in us making Snowplow “self-hosting”, meaning that one Snowplow instance can be used to monitor the performance of another Snowplow instance.
The EmrEtlRunner will fire an event whenever an EMR job starts, succeeds, or fails. These events include data about the name and status of the job and its individual steps. The StorageLoader will fire an event whenever a database load succeeds or fails. In the case of failure, the event will include the error message.
The tags configuration field can hold a dictionary of name-value pairs. These will get attached to all the above Snowplow events as a context; in a future release we plan to also attach these tags to the running job on EMR.
5. Compression support
Dani Sola has added support for compressing enriched events using gzip. Redshift can automatically handle loading gzipped files, so this is a good way to reduce the total storage your enriched events require. It will also significantly speed up the related file move operations. Thanks again Dani!
6. Environment variables in configuration files
If you don’t want to hardcode your AWS credentials in the configuration file, you can now read them in from environment variables by using Ruby ERB templates:
Thanks to Snowplow community member Eric Pantera from Viadeo for contributing this feature!
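To show how ERB templating of a YAML file works in general, here is a small self-contained Ruby sketch. The key names and the environment variable names (AWS_SNOWPLOW_ACCESS_KEY, AWS_SNOWPLOW_SECRET_KEY) are assumptions chosen for illustration, and the loading code is not taken from either app.

```ruby
require 'erb'
require 'yaml'

# Illustrative fragment only: key names and variable names are assumptions.
template = <<~YAML
  aws:
    access_key_id: <%= ENV['AWS_SNOWPLOW_ACCESS_KEY'] %>
    secret_access_key: <%= ENV['AWS_SNOWPLOW_SECRET_KEY'] %>
YAML

# Set here only so the example runs on its own; normally your shell
# would export these before the app is launched.
ENV['AWS_SNOWPLOW_ACCESS_KEY'] ||= 'example-access-key'
ENV['AWS_SNOWPLOW_SECRET_KEY'] ||= 'example-secret-key'

# Render the ERB tags first, then parse the rendered text as YAML.
config = YAML.load(ERB.new(template).result)
puts config['aws']['access_key_id']
```

Because the template is rendered before the YAML is parsed, the credentials never need to be written into the file itself.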
7. Loading Postgres via stdin
When the StorageLoader is loading a PostgreSQL database, it now performs the COPY via stdin rather than directly from the downloaded local event files. This means that the StorageLoader doesn’t have to be run on the same physical machine as the Postgres database. This has two main advantages:
- The StorageLoader can now load Postgres databases running on Amazon RDS
- Many users had problems setting up the correct permissions for Postgres to read local files. This is no longer required
Big thanks to Matt Walker from Radico for contributing this feature!
8. Multiple in buckets
The “in” bucket section of the configuration YAML is now an array, because you can now process raw events from multiple buckets at once. For now, all “in” buckets must use the same collector logging format.
9. New safety checks
When the EmrEtlRunner is launched, it will now immediately abort if the bucket for good shredded events is non-empty.
If the collector format is set to “thrift”, the processing bucket should always contain an even number of files when the job starts. This is because for every .lzo file there should be exactly one .lzo.index file containing the metadata on how to split it. (See the hadoop-lzo project for more information on splittable LZO.) If the number of files is odd, then at least one pair of files is incomplete, and the EmrEtlRunner will fail early.
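The pairing rule can be sketched in a few lines of Ruby. This is a hypothetical helper rather than the EmrEtlRunner's own check: the check described above only looks at whether the file count is odd, whereas this sketch goes one step further and names the files whose partner is missing.

```ruby
# Hypothetical helper, not taken from the EmrEtlRunner source.
# Returns the names that are missing their partner: an .lzo file without
# an .lzo.index file, or an .lzo.index file without its .lzo file.
def unpaired_lzo_files(filenames)
  index_files = filenames.select { |f| f.end_with?('.lzo.index') }
  lzo_files   = filenames.select { |f| f.end_with?('.lzo') }

  missing_index = lzo_files.reject { |f| index_files.include?("#{f}.index") }
  missing_data  = index_files.reject { |f| lzo_files.include?(f.sub(/\.index\z/, '')) }
  missing_index + missing_data
end

files = ['a.lzo', 'a.lzo.index', 'b.lzo']

# The simple check described above: an odd count means an incomplete pair.
puts files.length.odd?                  # prints "true"

# The stricter per-file check from this sketch.
puts unpaired_lzo_files(files).inspect  # prints ["b.lzo"]
```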
10. Other changes
We have also:
- Added the ability to read the config file via stdin in EmrEtlRunner and StorageLoader by setting the --config option to “-” (#1772, #1773)
- Moved the folder of sample enrichment configuration JSONs out of the EmrEtlRunner subproject (#1574)
- Allowed the “bootstrap” configuration field to be
- Updated the sample configuration file to use m1.medium instead of m1.small (thanks Iain Gray)
- Updated the Vagrant quickstart to automatically install Postgres (#1767)
11. Upgrading
Installing EmrEtlRunner and StorageLoader
Download the EmrEtlRunner and StorageLoader from our Bintray:
Unzip this file to a sensible location (e.g.
Check that you have a compatible JRE (1.7+) installed by invoking one of the two apps:
That’s it - you are ready to update the configuration files.
Updating the configuration files
Your two old configuration files will no longer work. Use the aforementioned combine_configurations.rb script to turn them into a unified configuration file and a resolver JSON.
- config/iglu_resolver.json - example resolver JSON
- emr-etl-runner/config/config.yml.sample - example unified configuration YAML
Note that field names in the unified configuration file no longer start with a colon, so write region: us-east-1 with no colon in front of the field name.
Using the new command-line options
The EmrEtlRunner now requires a --resolver argument which should be the path to your new resolver JSON.
Also note that when specifying steps to skip using the --skip option, the “archive” step has been renamed to “archive_raw” in the EmrEtlRunner and “archive_enriched” in the StorageLoader. This is in preparation for merging the two applications into one.
12. Getting help
For more details on this release, please check out the R70 Bornean Green Magpie release notes on GitHub.