This release unifies the configuration file format for the EmrEtlRunner and the StorageLoader. This means that you only need a single configuration file, shared between the two apps.
An example configuration file is available in the repository.
This release also includes a script named combine_configurations.rb which can be used to combine your existing configuration files into one. Use it like this:
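A hypothetical invocation might look like the following — the file names are illustrative, and the script's exact argument interface may differ, so check its usage message before running it:

```bash
ruby combine_configurations.rb \
  emr-etl-runner-config.yml \
  storage-loader-config.yml
```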
This will result in the two configuration files being combined into a single file named combined.yml. It will also extract the Iglu resolver into a JSON file named resolver.json. This is because the Iglu resolver is now passed to the EmrEtlRunner as a self-describing JSON using a dedicated command-line argument (i.e. the same format as Scala Kinesis Enrich uses).
Please note that the combine_configurations.rb script has only been tested on Snowplow R64 and upwards.
We have ported the EmrEtlRunner and StorageLoader to JRuby. Both apps are now deployable as single “fat jars” which can be run like any other jarfile: no Ruby installation is required, just a Java Runtime Environment (1.7+). This is an exciting step forward for Snowplow, because dealing with Ruby, Bundler and RVM has been a major pain point for Snowplow users.
Both applications are now available pre-built in a single zipfile hosted in our Bintray:
If you prefer to build them yourself, both apps now have a build.sh script which installs the necessary dependencies and saves the fat jar in the “deploy” directory.
Now that both apps use the same configuration file and are built in the same way, our next goal is to combine them into a single “Snowplow CLI” application, again built in JRuby.
Occasionally an EMR job will fail before any step has begun due to a “bootstrap failure”. In these cases, since no data has been moved or processed, it is always safe to restart the job. Rather than crashing, the new EmrEtlRunner version will detect that the job has halted due to a bootstrap failure and will keep attempting to restart the job until the job succeeds, the job fails for another reason, or the new bootstrap_failure_tries configuration setting is exceeded.
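In the unified configuration file this might look like the following — the placement and the value shown are illustrative, so check the example configuration in the repository for the exact structure:

```yaml
emr:
  # Give up after this many restart attempts following bootstrap failures
  bootstrap_failure_tries: 3
```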
Additionally, the process of polling the EMR job to check its status is now resilient to more errors: Dani Sola from Simply Business contributed error handling that prevents connection timeouts from crashing the EmrEtlRunner. Thanks Dani!
You can now configure both apps to turn on internal Snowplow tracking - another step towards making Snowplow “self-hosting”, meaning that one Snowplow instance can be used to monitor the performance of another Snowplow instance.
The EmrEtlRunner will fire an event whenever an EMR job starts, succeeds, or fails. These events include data about the name and status of the job and its individual steps. The StorageLoader will fire an event whenever a database load succeeds or fails. In the case of failure, the event will include the error message.
The new tags configuration field can hold a dictionary of name-value pairs. These will get attached to all the above Snowplow events as a context; in a future release we plan to also attach these tags to the running job on EMR.
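A sketch of what the relevant monitoring section might look like — the field names, placement and values here are illustrative, so consult the example configuration in the repository for the real structure:

```yaml
monitoring:
  tags:
    pipeline: "my-pipeline"        # attached to every event as a context
  snowplow:
    method: get
    app_id: snowplow-internal      # illustrative app ID
    collector: snplow.example.com  # your own collector endpoint
```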
Dani Sola has added support for compressing enriched events using gzip. Redshift can automatically handle loading gzipped files, so this is a good way to reduce the total storage your enriched events require. It will also significantly speed up the related file move operations. Thanks again Dani!
If you don’t want to hardcode your AWS credentials in the configuration file, you can now read them in from environment variables by using Ruby ERB templates:
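As a minimal plain-Ruby illustration of the mechanism: the configuration file is rendered as an ERB template before being parsed as YAML, so ENV lookups are substituted in. The key names below are just examples, not the apps' required field names:

```ruby
require "erb"
require "yaml"

# The configuration file is rendered as an ERB template before being
# parsed as YAML, so <%= ENV[...] %> lookups are substituted in.
template = <<~YAML
  aws:
    access_key_id: <%= ENV['AWS_ACCESS_KEY_ID'] %>
    secret_access_key: <%= ENV['AWS_SECRET_ACCESS_KEY'] %>
YAML

# Illustrative fallbacks so the sketch runs without real credentials
ENV["AWS_ACCESS_KEY_ID"]     ||= "example-key"
ENV["AWS_SECRET_ACCESS_KEY"] ||= "example-secret"

config = YAML.load(ERB.new(template).result)
```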
Thanks to Snowplow community member Eric Pantera from Viadeo for contributing this feature!
When the StorageLoader is loading a PostgreSQL database, it now performs the COPY via stdin rather than directly from the downloaded local event files. This has two main advantages: the StorageLoader no longer has to be run on the same physical machine as the Postgres database, and the database user no longer needs the elevated privileges that a server-side file COPY requires.
Big thanks to Matt Walker from Radico for contributing this feature!
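Conceptually, the load now streams each event file to the database through a statement along these lines — the table name and options here are illustrative, not necessarily the exact statement the StorageLoader issues:

```sql
COPY atomic.events FROM STDIN
  WITH (FORMAT text, DELIMITER E'\t', NULL '');
```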
The “in” bucket section of the configuration YAML is now an array, because you can now process raw events from multiple buckets at once. Currently, all “in” buckets must use the same collector logging format.
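An “in” section that previously named a single bucket might now look like this — the bucket names and the surrounding placement are illustrative:

```yaml
raw:
  in:
    - s3://my-collector-bucket-1/raw
    - s3://my-collector-bucket-2/raw
```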
When the EmrEtlRunner is launched, it will now immediately abort if the bucket for good shredded events is non-empty.
If the collector format is set to “thrift”, the processing bucket should always contain an even number of files when the job starts. This is because for every .lzo file there should be exactly one .lzo.index file containing the metadata on how to split it. (See the hadoop-lzo project for more information on splittable LZO.) If the number of files is odd, then at least one pair of files is incomplete, and the EmrEtlRunner will fail early.
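The pairing invariant can be sketched like this — a simplification for illustration, not the actual EmrEtlRunner code:

```ruby
# Every .lzo data file must be accompanied by exactly one .lzo.index
# file; if the counts differ, at least one pair is incomplete.
def lzo_pairs_complete?(filenames)
  index_files = filenames.count { |f| f.end_with?(".lzo.index") }
  data_files  = filenames.count { |f| f.end_with?(".lzo") }
  data_files == index_files
end
```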
We have also added support for setting the --config option to “-”, to read the configuration from stdin (#1772, #1773).
Download the EmrEtlRunner and StorageLoader from our Bintray:
Unzip this file to a sensible location (e.g.
Check that you have a compatible JRE (1.7+) installed by invoking one of the two apps:
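For example — the executable name below assumes the unzipped layout, so adjust the path to wherever you unzipped the file; if your JRE is missing or too old, the JVM will print an error instead:

```bash
./snowplow-emr-etl-runner --version
```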
That’s it - you are ready to update the configuration files.
Your two old configuration files will no longer work. Use the aforementioned combine_configurations.rb script to turn them into a unified configuration file and a resolver JSON.
Note that field names in the unified configuration file no longer start with a colon - so region: us-east-1, not :region: us-east-1.
The EmrEtlRunner now requires a --resolver argument, which should be the path to your new resolver JSON.
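A typical invocation therefore looks something like this, using the file names produced by combine_configurations.rb (paths illustrative):

```bash
./snowplow-emr-etl-runner \
  --config combined.yml \
  --resolver resolver.json
```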
Also note that when specifying steps to skip using the --skip option, the “archive” step has been renamed to “archive_raw” in the EmrEtlRunner and “archive_enriched” in the StorageLoader. This is in preparation for merging the two applications into one.
For more details on this release, please check out the R70 Bornean Green Magpie release notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.