We’re very pleased to announce the release of Snowplow 0.8.11. This release includes two different sets of updates:
- Critical update: support for Amazon’s new CloudFront log file format (rolled out by Amazon on 21st October 2013)
- Nice-to-have additions - the most significant of which is IP anonymization
We’ll discuss the updates one at a time, before covering how to upgrade to the latest version.
- Critical upgrade: support for Amazon’s new CloudFront log file format
- IP address anonymization
- Other updates
Since August, Amazon has made a number of changes to their CloudFront log file format, the most recent of which was pushed live yesterday:
|CloudFront log file format|Description|
|---|---|
|Original format|The original CloudFront log file format, around which Snowplow was originally developed.|
|12 Sep 2012 - 17 Aug 2013 format|The original format with three new fields appended.|
|August 17 unannounced change|A surprise change to the URI encoding of fields. See the Google Group for details.|
|September 14 resolution|A new approach to URI encoding, different from the previous two. See this forum thread for details.|
|October 21 update|Amazon updated the latest log file format with three new fields. See this post for details.|
The latest version of Snowplow supports all the different versions of the file format listed above, including the new format that was rolled out yesterday. It is important to note that the October 21st CloudFront file format is not supported by previous Snowplow versions: as a result, we’d expect existing Snowplow users using the CloudFront collector to see a significant number of lines in their bad rows bucket in S3 with the following format:
Once you have upgraded your Snowplow installation to the latest version, you will need to reprocess those bad rows. Instructions on how to do so are given in this blog post.
As well as the critical update, there are a number of nice-to-have features bundled in this release. Chief amongst them is IP anonymization. The enrichment process can now be configured to mask IP addresses, so that privacy-conscious Snowplow users can prevent IP addresses being visible to analysts.
Snowplow administrators can set up IP masking via the EmrEtlRunner config file. Instructions on how to do this can be found in the section on upgrading below.
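As a sketch of what that configuration looks like (using the `anon_ip` settings described below; the exact layout of your `config.yml` may differ, so check the repo’s example config):

```yaml
:enrichments:
  :anon_ip:
    :enabled: true     # switch IP anonymization on
    :anon_octets: 2    # number of octets to mask, between 0 and 4
```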
The most important of these updates is to the StorageLoader, making loading into PostgreSQL more robust by fixing an issue where tab characters in the load file were being accidentally escaped, breaking the load. Many thanks to community member Rob Kingston for contributing this update.
There are also some additional command-line options for our two Ruby apps which should make the Snowplow Enrichment process more flexible:
- Run EmrEtlRunner with `--debug` to make Elastic MapReduce’s job debugging available
- Run StorageLoader with `--include vacuum` if you want to include a `VACUUM` step after your table load
- Run StorageLoader with `--skip analyze` if you don’t need to run a table `ANALYZE` step after your table load
- Run StorageLoader with `--include compupdate` if you want to (re-)generate the compression encodings on your table’s fields. This setting uses the new `:comprows:` parameter in the `config.yml` file - see section 4.2 below for details
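The way `--include` and `--skip` modify a run can be sketched as follows (illustrative Python only, not StorageLoader’s actual implementation):

```python
def resolve_steps(include=(), skip=()):
    """Return the ordered list of StorageLoader steps to execute.

    Illustrative sketch: the load always runs; vacuum and compupdate
    are opt-in via --include, while analyze runs unless --skip'd.
    """
    steps = ["load"]
    if "vacuum" in include:
        steps.append("vacuum")
    if "analyze" not in skip:
        steps.append("analyze")
    if "compupdate" in include:
        steps.append("compupdate")
    return steps
```

For example, `resolve_steps(include=["vacuum"])` yields a load followed by a `VACUUM` and then the default `ANALYZE`.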
Finally, there are a set of “under the hood” stability and performance improvements in this release.
For the definitive list of updates in this release, please see the v0.8.11 Release Notes on GitHub.
Upgrading is a three step process:
1. Upgrade EmrEtlRunner
2. Upgrade StorageLoader
3. Reprocess the bad rows generated by the CloudFront log file format change
Let’s take these in turn:
You need to update EmrEtlRunner to the latest code (0.8.11 release) on GitHub:
You also need to update the `config.yml` file for EmrEtlRunner to use the latest version of the Hadoop ETL (0.3.5):
In addition, you need to add a new “enrichments” section in the `config.yml` file:
To enable the IP anonymization enrichment, you need to set `anon_ip.enabled` to true, and specify the level of anonymization with the `anon_ip.anon_octets` field. If, for example, my IP address is ‘22.214.171.124’, then setting it to different values between 0 and 4 would anonymize my IP address as follows:
|`anon_octets` value|IP address displayed in Snowplow|
|---|---|
|0|22.214.171.124|
|1|22.214.171.x|
|2|22.214.x.x|
|3|22.x.x.x|
|4|x.x.x.x|
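The masking rule is simple enough to express in a few lines; here is a hypothetical Python equivalent (not Snowplow’s actual implementation, which runs inside the Hadoop-based enrichment process):

```python
def anonymize_ip(ip, anon_octets):
    """Replace the last `anon_octets` octets of an IPv4 address with 'x'.

    Hypothetical sketch of the anon_ip enrichment's behaviour;
    anon_octets must be between 0 and 4.
    """
    octets = ip.split(".")
    keep = len(octets) - anon_octets
    return ".".join(octets[:keep] + ["x"] * anon_octets)
```

So with `anon_octets` set to 2, `anonymize_ip("22.214.171.124", 2)` returns `"22.214.x.x"`.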
To see a complete example of the EmrEtlRunner `config.yml` file, see the GitHub repo.
You need to upgrade your StorageLoader installation to the latest code (0.8.11) on GitHub:
The `config.yml` file includes a new `:comprows:` option for Redshift users. This determines the number of rows that Amazon analyzes in order to determine the best compression encoding to use for each of the fields in your Redshift events table. Note that this is only used if the `--include compupdate` option is specified when running StorageLoader. For more information on Amazon’s `COMPROWS` functionality, see the Redshift documentation.
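For context, `COMPROWS` is a parameter on Redshift’s `COPY` statement, alongside `COMPUPDATE`. A sketch of the kind of statement this drives (bucket, table and credentials below are placeholders, not StorageLoader’s exact SQL):

```sql
-- Illustrative only: placeholder bucket, table and credentials
COPY events FROM 's3://my-out-bucket/events'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER '\t'
COMPUPDATE ON
COMPROWS 200000;
```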
A complete example of the `config.yml` file for StorageLoader for Redshift users, including the new setting, can be found on GitHub.
As described above, if you have been using the CloudFront collector, you will have a number of rows of data in your “bad bucket” on S3 generated after the new CloudFront log file format was rolled out on October 21st, because these data rows were not supported by older versions of Snowplow.
You need to reprocess these rows so they are not missing from your final data set. For detailed instructions on how to do this, see our guide to reprocessing bad data in Snowplow.