Snowplow 0.8.5 released with ETL bug fixes
We are pleased to announce the immediate availability of Snowplow 0.8.5. This is a bug-fix release, following on from our launch last week of Snowplow 0.8.4 with geo-IP lookups.
This release fixes one showstopper issue with Snowplow 0.8.4, and also includes a set of smaller enhancements to help the Scalding ETL better handle “bad quality” event data from webpages. We recommend everybody on the Snowplow 0.8.x series upgrade to this version.
In this post we will cover:
Many thanks to Peter van Wesep for spotting the showstopper issue in the Snowplow 0.8.4 release: when the Snowplow ETL process was run from an Amazon Web Services account other than Snowplow’s own, the Hadoop ETL code was unable to read the MaxMind geo-IP data file from an S3:// link hosted from a Snowplow public bucket. This issue did not affect users who are self-hosting the ETL assets.
This has now been fixed: we provide the MaxMind geo-IP file on an HTTP:// link, and the Scalding ETL downloads it and adds it to HDFS before running.
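To make the download-then-stage flow concrete, here is a minimal sketch in Python. It is an illustration only, not the Scalding ETL's actual code (which is written in Scala): the function names, the working directory and the HDFS target directory are all hypothetical, and we assume the `hadoop` command-line client is available on the machine.

```python
import shutil
import subprocess
import urllib.request
from pathlib import Path
from typing import Callable, List
from urllib.parse import urlparse


def fetch_to_local(url: str, work_dir: Path) -> Path:
    """Download `url` into `work_dir`, keeping the remote file name."""
    dest = work_dir / Path(urlparse(url).path).name
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def hdfs_put_command(local_path: Path, hdfs_dir: str) -> List[str]:
    """Build the `hadoop fs -put` call that copies a local file into HDFS."""
    return ["hadoop", "fs", "-put", str(local_path), hdfs_dir]


def stage_geoip_file(
    url: str,
    work_dir: Path,
    hdfs_dir: str,
    run: Callable[[List[str]], object] = subprocess.check_call,
) -> Path:
    """Fetch the geo-IP file over HTTP, then add it to HDFS before the job runs.

    `run` is injectable so the flow can be exercised without a Hadoop cluster.
    """
    local_path = fetch_to_local(url, work_dir)
    run(hdfs_put_command(local_path, hdfs_dir))
    return local_path
```

The point of the sketch is the ordering: the file is fetched over plain HTTP first, so no S3 credentials or cross-account bucket permissions are involved, and only then is it placed on HDFS for the job to read.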
We have made a series of other improvements to the Scalding ETL, to make it more robust. These improvements are:
- We have widened the
- We now strip control characters (e.g. nulls) from fields alongside tabs and newlines, to prevent Redshift load errors
- The ETL no longer dies if a huge (larger than an integer) numeric value is sent in for a screen/view dimension
- We have increased the size of se_value from a float to a double
- se_value is now always output as a plain string, never in scientific notation, to prevent Redshift load errors
- It is now possible to build the ETL locally (we added a missing dependency to the project configuration)
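Three of the fixes above (stripping control characters, surviving oversized screen/view dimensions, and writing se_value without scientific notation) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Scalding ETL's Scala code: the function names are hypothetical, we assume tabs and newlines are replaced with a space while other control characters are dropped, and we assume an out-of-range dimension is rejected rather than raising an error.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional, Tuple

_WHITESPACE_BREAKS = re.compile(r"[\t\r\n]")
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")
_DIMENSION = re.compile(r"^([0-9]+)x([0-9]+)$")
INT_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold


def clean_field(value: str) -> str:
    """Make a field safe for a tab-separated load file.

    Assumption: tabs and newlines become a space; nulls and other
    control characters are removed outright.
    """
    spaced = _WHITESPACE_BREAKS.sub(" ", value)
    return _CONTROL_CHARS.sub("", spaced)


def parse_dimension(raw: str) -> Optional[Tuple[int, int]]:
    """Parse 'WIDTHxHEIGHT'; return None instead of failing on bad input."""
    match = _DIMENSION.match(raw)
    if match is None:
        return None
    width, height = int(match.group(1)), int(match.group(2))
    if width > INT_MAX or height > INT_MAX:
        return None  # too large for an integer column, so reject it
    return (width, height)


def plain_number(raw: str) -> Optional[str]:
    """Render a numeric string in plain notation, never with an exponent."""
    try:
        number = Decimal(raw)
    except InvalidOperation:
        return None
    if not number.is_finite():
        return None  # NaN and Infinity are not loadable values
    return format(number, "f")
```

For example, `clean_field("a\tb\x00c\nd")` gives `"a bc d"`, `parse_dimension("99999999999x768")` gives `None` rather than raising, and `plain_number("1E+7")` gives `"10000000"`.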
There are three components to upgrade in this release:
- The Scalding ETL, to version 0.3.1
- EmrEtlRunner, to version 0.2.1
- The Redshift events table, to version 0.2.1
Alternatively, if you are still using Infobright with the legacy Hive ETL, you can upgrade your Infobright events table, to version 0.0.9.
Let’s take these in turn:
If you are using EmrEtlRunner, you need to update your configuration file, config.yml, to the latest version of the Hadoop ETL:
You need to upgrade your EmrEtlRunner installation to the latest code (0.8.5 release) on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.8.5
Redshift events table
We have updated the Redshift table definition - you can find the latest version in the GitHub repository here.
If you already have your Snowplow data in the previous version of the Redshift events table, we have written a migration script to handle the upgrade. Please review this script carefully before running and check that you are happy with how it handles the upgrade.
Infobright events table
If you are storing your events in Infobright Community Edition, you can also update your table definition. To make this easier for you, we have created a script:
Running this script will create a new table, events_009 (version 0.0.9 of the Infobright table definition) in your snowplow database, copying across all your data from your existing events_008 table, which will not be modified in any way.
Once you have run this, don’t forget to update your StorageLoader’s config.yml to load into the new events_009 table, not your old events_008 table.
You can see the full list of issues delivered in Snowplow 0.8.5 on GitHub.