Click on the image above to enlarge it
Here is some example geo-IP data:
Click on the image above to enlarge it
As well as geo-IP enrichment, there are a number of other code improvements to the Hadoop ETL, plus some minor improvements to EmrEtlRunner and some corresponding updates to the Redshift table. In this post we will cover:
When we released Snowplow 0.8.0 back in April, we promised that the new Scalding-based ETL process would provide a solid bedrock on which we could build a bunch of data enrichments to perform on the raw Snowplow logs to make Snowplow data more interesting to analyse. With version 0.8.1, we included a referer parsing enrichment, which looked up external page referers against a database of search engines and social networks, and used associatd rules to infer additional data about what drove those visitors to your site. This release adds our second big enrichment: the ETL process now lookups every IP address against MaxMind’s GeoLite City database in order to determine the location of a visitors - specifically (wherever possible):
geo_country- the two-letter ISO 3166-1 alpha-2 country code associated with the IP address
geo_region- the two letter ISO-3166-2 or FIPS 10-4 code for the state or region associated with the IP address
geo_city- the city or town name associated with the IP address
geo_zipcode- the zip or postal code associated with the IP address
geo_latitude- the latitude associated with the IP address
geo_longitude- the longitude associated with the IP address
For more information on these six fields we recommend reading the GeoIP CSV Databases technical reference on the MaxMind website.
Providing all this data directly in the Snowplow events table makes it easy to create geographic maps like the one below using Snowplow data directly: we will cover how to do this in a forthcoming blog post.
In this release we have also made some functional improvements to the Hadoop (Scalding) ETL, plus some minor improvements to EmrEtlRunner and some updates to the Redshift table. To take these in turn:
We have made various improvements to the Hadoop ETL:
urlfragment, to prevent Redshift load errors
refr_term), again to prevent Redshift load errors
We have updated EmrEtlRunner to supply the location of the MaxMind GeoLite City database to the Scalding ETL.
We have also improved the notification messages when the ETL job is started on Elastic MapReduce, and the notification message if the job should fail.
We have updated the Redshift events table to include new fields for the geo-IP location - see above for the six new field names.
Also, we have renamed the five
ev_ fields in the Redshift table definition to start with
se_action. This is to make these column names consistent with our structured events terminology.
There are three components to upgrade in this release:
Let’s take these in turn:
If you are using EmrEtlRunner, you need to update your configuration file,
config.yml, to the latest version of the Hadoop ETL:
:snowplow: :hadoop_etl_version: 0.3.0 # Version of the Hadoop ETL
You need to upgrade your EmrEtlRunner installation to the latest code (0.8.4 release) on GitHub:
$ git clone git://github.com/snowplow/snowplow.git $ git checkout 0.8.4
We have updated the Redshift table definition - you can find the latest version in the GitHub repository here.
If you already have your Snowplow data in the previous version of the Redshift events table, we have written a migration script to handle the upgrade. Please review this script carefully before running and check that you are happy with how it handles the upgrade.
As always, if you do run into any issues or don’t understand any of the above changes, please raise an issue or get in touch with us via the usual channels.
You can see the full list of issues delivered in Snowplow 0.8.4 on GitHub.