Snowplow can now run on version 4.x of Amazon EMR. The 4.x series of Amazon EMR releases has some great new features, including support for Apache Spark 1.6, Apache Zeppelin and private VPCs.
Achieving this involved four changes:
NullPointerException thrown by the
To get up to date with the latest AMI version, change the “ami_version” field of your configuration YAML to “4.3.0”. Make sure you also change the “hadoop_shred” field to at least “0.8.0” to get a compatible version of Scala Hadoop Shred.
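As a rough illustration of the compatibility rule above, the following Python sketch checks the two fields mentioned. The flat dictionary layout and the `check_emr4_compat` helper are hypothetical and are not part of EmrEtlRunner; only the field names “ami_version” and “hadoop_shred” and the version numbers come from this post.

```python
def parse_version(version):
    """Turn a dotted version string such as '0.8.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def check_emr4_compat(config):
    """Return a list of problems found in a (hypothetical, flat) config dict."""
    problems = []
    ami = parse_version(config["ami_version"])
    shred = parse_version(config["hadoop_shred"])
    # Running on a 4.x AMI requires Scala Hadoop Shred 0.8.0 or later.
    if ami[0] == 4 and shred < (0, 8, 0):
        problems.append("hadoop_shred must be at least 0.8.0 on a 4.x AMI")
    return problems


if __name__ == "__main__":
    print(check_emr4_compat({"ami_version": "4.3.0", "hadoop_shred": "0.7.0"}))
    print(check_emr4_compat({"ami_version": "4.3.0", "hadoop_shred": "0.8.0"}))
```

Comparing versions as integer tuples avoids the classic string-comparison trap, where "0.10.0" would sort before "0.8.0".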
At the moment, processing raw data using Snowplow involves two commands: you need to run both EmrEtlRunner, to process the data on Elastic MapReduce, and StorageLoader, to load the processed data into Redshift or Postgres.
In the future, StorageLoader will be invisible to the end user: it will simply become a custom jar step in the jobflow on EMR. In this release we have moved towards this goal in two ways.
When running StorageLoader on EC2, you no longer need to configure it with your AWS credentials. Instead, you can set the credentials fields to “iam”.
StorageLoader will then look up the credentials using the EC2 instance metadata.
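To make the lookup concrete, here is a minimal Python sketch of how a tool might resolve “iam” placeholders through the EC2 instance metadata service. It is not StorageLoader's actual (Ruby) code: the `resolve_credentials` and `http_get` names are hypothetical, and the sketch assumes the classic unauthenticated metadata requests rather than the newer token-based flow.

```python
import json
import urllib.request

# Well-known EC2 instance metadata endpoint for IAM role credentials.
METADATA_URL = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"


def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=2) as response:
        return response.read().decode("utf-8")


def resolve_credentials(access_key_id, secret_access_key, fetch=http_get):
    """Return (access key, secret key), looking them up when both are 'iam'."""
    if access_key_id != "iam" or secret_access_key != "iam":
        # Explicit credentials were configured, so use them unchanged.
        return access_key_id, secret_access_key
    # The first request lists the role attached to the instance...
    role = fetch(METADATA_URL).strip().splitlines()[0]
    # ...and the second returns that role's temporary credentials as JSON.
    document = json.loads(fetch(METADATA_URL + role))
    return document["AccessKeyId"], document["SecretAccessKey"]
```

The `fetch` parameter exists only so the function can be exercised off EC2 with a stub. The credentials returned by the metadata service are temporary, so a real client would also need to pass along the accompanying `Token` value and refresh the credentials before they expire.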
It is now possible to pass a Base64-encoded configuration string as a command line argument instead of the path to the configuration file.
This will make it easier for us to invoke StorageLoader from Hadoop in the future.
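The Base64 round trip described above can be sketched in a few lines of Python. This only illustrates the encoding itself; the helper names and the one-line sample configuration are placeholders, and the StorageLoader command line option that accepts the string is deliberately not shown.

```python
import base64


def encode_config(yaml_text):
    """Encode configuration text as a single-line Base64 string."""
    return base64.b64encode(yaml_text.encode("utf-8")).decode("ascii")


def decode_config(encoded):
    """Recover the original configuration text from its Base64 form."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    # Placeholder content; a real configuration file has many more fields.
    sample = "ami_version: 4.3.0\n"
    encoded = encode_config(sample)
    print(encoded)
    assert decode_config(encoded) == sample
```

Because the encoded string contains no spaces or newlines, a whole multi-line YAML document can travel as a single argument, which is what makes this form convenient for a job step that has no local configuration file to point at.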
Sometimes the process of bootstrapping the cluster before the job starts can fail. This release improves the ability of EmrEtlRunner to recognise these bootstrap failures and restart the job.
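The general pattern of restarting a job only when bootstrapping failed can be sketched as follows. This is a simplified Python illustration, not EmrEtlRunner's implementation: the `BootstrapFailure` exception, the `run_with_bootstrap_retries` helper and the default of three attempts are all assumptions made for the example.

```python
class BootstrapFailure(Exception):
    """Hypothetical error raised when a cluster fails while bootstrapping."""


def run_with_bootstrap_retries(run_job, max_attempts=3):
    """Call run_job, restarting it only when bootstrapping failed."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_job()
        except BootstrapFailure:
            # Other exceptions propagate immediately; only bootstrap
            # failures are retried, and only up to max_attempts times.
            if attempt == max_attempts:
                raise
```

The important design point is the classification step: a failure during bootstrapping happens before any data has been processed, so restarting is safe, whereas a failure partway through the job itself usually needs manual attention.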
EmrEtlRunner’s internal Snowplow monitoring can be configured with name-value tags which are sent to a Snowplow collector with every monitoring event. In this release, we also attach those tags to the EMR job itself, so that you can see them in the EMR web UI.
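As an illustration of what attaching these tags involves, the sketch below converts name-value tags into the list of Key/Value pairs that AWS SDKs commonly use for resource tags. The `to_emr_tags` helper and the example tag names are made up for this post, and the exact call EmrEtlRunner makes is not shown here.

```python
def to_emr_tags(tags):
    """Convert name-value tags into a list of Key/Value pairs."""
    # Sorting by name keeps the output stable between runs.
    return [{"Key": name, "Value": str(value)} for name, value in sorted(tags.items())]


if __name__ == "__main__":
    monitoring_tags = {"environment": "production", "team": "data"}  # made-up tags
    print(to_emr_tags(monitoring_tags))
```

Reusing the same tags for monitoring events and for the EMR job means a job seen in the EMR web UI can be matched to its monitoring events without any extra bookkeeping.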
We have also upgraded both apps to the latest version (0.5.2) of the Snowplow Ruby Tracker.
These scripts were originally used to run EmrEtlRunner and StorageLoader as native Ruby apps using RVM. Now that those apps are available on Bintray as easy-to-deploy JRuby jars, these scripts are no longer necessary.
Running EmrEtlRunner and StorageLoader as Ruby (rather than JRuby) apps is no longer actively supported.
Snowplow R77 Great Auk also includes some important bug fixes and improvements:
Previously, ANALYZE statements were run immediately after the COPY statements (in fact in the same transaction), and before any VACUUM statements. It is more correct to perform ANALYZE after VACUUM, so we have reversed the order. Many thanks to Ryan Doherty for flagging this! (#1361)
A new field in config.yml, which you can use to access beta EMR features (#2211)
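The corrected statement order for the load can be pictured with a small Python sketch. It only models the ordering: the `load_plan` helper and the table name are hypothetical, and the step labels are abbreviations rather than complete SQL statements.

```python
def load_plan(tables):
    """Return the ordered steps for loading the given tables."""
    steps = ["BEGIN"]
    # All COPY statements run together inside one transaction.
    steps += ["COPY {}".format(table) for table in tables]
    steps.append("COMMIT")
    # VACUUM runs after the load, outside the transaction...
    steps += ["VACUUM {}".format(table) for table in tables]
    # ...and ANALYZE now comes last, so statistics reflect the vacuumed tables.
    steps += ["ANALYZE {}".format(table) for table in tables]
    return steps


if __name__ == "__main__":
    for step in load_plan(["events"]):
        print(step)
```

Since VACUUM cannot run inside a transaction block, moving ANALYZE after it also means ANALYZE no longer shares the transaction with the COPY statements.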
The latest versions of EmrEtlRunner and StorageLoader are available from our Bintray here.
The recommended AMI version to run Snowplow is now 4.3.0, so set the “ami_version” field of your configuration YAML to “4.3.0”.
You will also need to update the jar versions in the same section; in particular, the “hadoop_shred” field must be at least “0.8.0”.
For a complete example, see our sample configuration file.
Note also that the snowplow-runner-and-loader.sh script has been updated to use the JRuby binaries rather than the raw Ruby project.
Upcoming Snowplow releases include:
Note that these releases are always subject to change between now and the actual release date.
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.