Snowplow 0.8.4 released with MaxMind geo-IP lookups

16 May 2013  •  Alex Dean

We are pleased to announce the immediate availability of Snowplow 0.8.4. This is a big release, which adds geo-IP lookups to the Snowplow Enrichment stage, using the excellent GeoLite City database from MaxMind, Inc. This has been one of the most requested features from the Snowplow community, so we are delighted to launch it. Now you can determine the location of your website visitors directly from the Snowplow events table, and plot that data on a wide range of mapping tools including Tableau or Vincent:

Click on the image above to enlarge it

Here is some example geo-IP data:

Click on the image above to enlarge it

As well as geo-IP enrichment, there are a number of other code improvements to the Hadoop ETL, plus some minor improvements to EmrEtlRunner and some corresponding updates to the Redshift table. In this post we will cover:

  1. The new geo-IP capabilities
  2. Other changes
  3. Upgrading
  4. Getting help

1. The new geo-IP capabilities

When we released Snowplow 0.8.0 back in April, we promised that the new Scalding-based ETL process would provide a solid bedrock on which we could build a bunch of data enrichments to perform on the raw Snowplow logs to make Snowplow data more interesting to analyse. With version 0.8.1, we included a referer parsing enrichment, which looked up external page referers against a database of search engines and social networks, and used associatd rules to infer additional data about what drove those visitors to your site. This release adds our second big enrichment: the ETL process now lookups every IP address against MaxMind’s GeoLite City database in order to determine the location of a visitors - specifically (wherever possible):

  • geo_country - the two-letter ISO 3166-1 alpha-2 country code associated with the IP address
  • geo_region - the two letter ISO-3166-2 or FIPS 10-4 code for the state or region associated with the IP address
  • geo_city - the city or town name associated with the IP address
  • geo_zipcode - the zip or postal code associated with the IP address
  • geo_latitude - the latitude associated with the IP address
  • geo_longitude - the longitude associated with the IP address

For more information on these six fields we recommend reading the GeoIP CSV Databases technical reference on the MaxMind website.

Providing all this data directly in the Snowplow events table makes it easy to create geographic maps like the one below using Snowplow data directly: we will cover how to do this in a forthcoming blog post.

2. Other changes

In this release we have also made some functional improvements to the Hadoop (Scalding) ETL, plus some minor improvements to EmrEtlRunner and some updates to the Redshift table. To take these in turn:

Hadoop ETL

We have made various improvements to the Hadoop ETL:

  • We have bumped the version of referer-parser - the latest version includes a fix to better attribute Google referer URLs
  • We have added truncation of refr_urlpath, refr_urlquery and urlfragment, to prevent Redshift load errors
  • We now remove tabs and newlines from referer search terms (refr_term), again to prevent Redshift load errors
  • We have fixed a nasty bug where the client timestamp was being inaccurately localised to the Hadoop cluster’s local time (issue #238) - thanks Gabor for spotting this
  • We have made the code around page URL extraction more robust in the case that a page URL cannot be extracted
  • If you are running the latest version of the Clojure Collector, then the specific version number will now be extracted into the v_collector field

EmrEtlRunner

We have updated EmrEtlRunner to supply the location of the MaxMind GeoLite City database to the Scalding ETL.

We have also improved the notification messages when the ETL job is started on Elastic MapReduce, and the notification message if the job should fail.

Redshift events table

We have updated the Redshift events table to include new fields for the geo-IP location - see above for the six new field names.

Also, we have renamed the five ev_ fields in the Redshift table definition to start with se_, e.g. se_action. This is to make these column names consistent with our structured events terminology.

3. Upgrading

There are three components to upgrade in this release:

  • The Scalding ETL, to version 0.3.0
  • EmrEtlRunner, to version 0.2.0
  • The Redshift events table, to version 0.2.0

Let’s take these in turn:

Hadoop ETL

If you are using EmrEtlRunner, you need to update your configuration file, config.yml, to the latest version of the Hadoop ETL:

:snowplow:
  :hadoop_etl_version: 0.3.0 # Version of the Hadoop ETL

EmrEtlRunner

You need to upgrade your EmrEtlRunner installation to the latest code (0.8.4 release) on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ git checkout 0.8.4

Redshift events table

We have updated the Redshift table definition - you can find the latest version in the GitHub repository here.

If you already have your Snowplow data in the previous version of the Redshift events table, we have written a migration script to handle the upgrade. Please review this script carefully before running and check that you are happy with how it handles the upgrade.

4. Getting help

As always, if you do run into any issues or don’t understand any of the above changes, please raise an issue or get in touch with us via the usual channels.

You can see the full list of issues delivered in Snowplow 0.8.4 on GitHub.