We are pleased to announce the release of Snowplow 0.9.6. This release does four things:
- It fixes some important bugs discovered in Snowplow 0.9.5, related to our new shredding functionality
- It introduces new JSON-based configurations for Snowplow’s existing enrichments
- It extends our geo-IP lookup enrichment to support all five of MaxMind’s commercial databases
- It extends our referer-parsing enrichment to support a user-configurable list of internal domains
We are really excited about our new JSON-configurable enrichments. This is the first step on our roadmap to make Snowplow enrichments completely pluggable. In the short term, it means that we can release new enrichments which won’t require you to install a new version of EmrEtlRunner. It also means we can support enrichments with much more complex configurations than we could previously. Finally, it means we can share the same enrichment configurations between our Hadoop- and Kinesis-based flows.
The support for the various paid-for MaxMind databases is exciting too – we’ve been using this internally to see which companies are browsing the Snowplow website! We are very pleased to have MaxMind as our first commercial data partner and would encourage you to check out their IP database offerings.
Below the fold we will cover:
- Important bug fixes for 0.9.5
- New format for enrichment configuration
- An example: configuring the anon_ip enrichment
- The referer_parser enrichment
- The ip_lookups enrichment
- Changes to the atomic.events table
- Other changes
- Documentation and help
We have identified several bugs in our new shredding functionality released in 0.9.5 a fortnight ago, now fixed in 0.9.6. These are:
- Trailing empty fields in an enriched event TSV row would cause shredding for that row to fail with a “Line does not match Snowplow enriched event” error. Now fixed (#921)
- The StorageLoader now knows to look in Amazon’s eu-west-1 region for the `snowplow-hosted-assets` S3 bucket, regardless of which region the user has specified for their own JSON Path files (#895)
- We fixed the contract on the `partition_by_run` function in EmrEtlRunner. This bug was causing issues if `:continue_on_unexpected_error:` was set and the `:errors:` buckets were left empty (#894)
The new version of Snowplow supports three configurable enrichments: the `anon_ip` enrichment, the `ip_lookups` enrichment, and the `referer_parser` enrichment. Each of these can be configured using a self-describing JSON. The enrichment configuration JSONs follow a common pattern: the `"enabled"` field lets you switch the enrichment on or off, and the `"parameters"` field contains the data specific to the enrichment.
These JSONs should be placed in a single directory, and that directory’s filepath should be passed to EmrEtlRunner via a new command-line option, `--enrichments`.
For example, if you want to configure all three enrichments, your config directory might have this structure:
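As a sketch (the filenames are illustrative, since as noted below they do not matter), the layout might look like:

```
config/
├── config.yml
└── enrichments/
    ├── anon_ip.json
    ├── ip_lookups.json
    └── referer_parser.json
```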
The JSON files in `config/enrichments` will then be packaged up by EmrEtlRunner and sent to the Hadoop job. Some notes on this:
- The filenames do not matter, but only files with the `.json` file extension will be packaged up and sent to Hadoop
- Any enrichment for which no JSON can be found will be disabled (i.e. not run) in the Hadoop enrichment code
- Thus the `anon_ip`, `ip_lookups`, and `referer_parser` enrichments no longer happen automatically – you must provide configuration JSONs with the `"enabled"` field set to `true` if you want them. Sensible default configuration JSONs are available on GitHub here.
The new JSON-based configurations are discussed in further detail on the Configuring enrichments wiki page.
The functionality of the IP anonymization enrichment remains unchanged: it lets you anonymize part (or all) of each user’s IP address. Here’s an example configuration JSON for this enrichment:
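A minimal sketch of such a configuration (the schema URI follows Snowplow’s self-describing JSON pattern; check the wiki for the exact current version):

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/anon_ip/jsonschema/1-0-0",
  "data": {
    "name": "anon_ip",
    "enabled": true,
    "parameters": {
      "anonOctets": 3
    }
  }
}
```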
This is a simple enrichment: the only field in `"parameters"` is `"anonOctets"`, which is the number of octets of each IP address to anonymize. In this case it is set to 3, so 18.104.22.168 would be anonymized to 18.x.x.x.
Snowplow uses the Referer-Parser to extract useful information from referer URLs. For example, the referer `http://www.google.com/search?q=snowplow+enrichments` would be identified as a Google search using the terms “snowplow” and “enrichments”.
If the referer URI’s host is the same as the current page’s host, the referer will be counted as internal.
The latest version of the referer-parser project adds the option to pass in a list of additional domains which should count as internal. The referer_parser enrichment can now be configured to take advantage of this:
Using the above configuration will ensure that all referrals from the internal subdomains “mysubdomain1.acme.com” and “mysubdomain2.acme.com” will be counted as internal rather than unknown.
Previous versions of Snowplow used a free MaxMind database to look up a user’s geographic location based on their IP address. This version expands on that functionality by adding the option to use other, paid-for, MaxMind databases to look up additional information. The full list of supported databases:
1) GeoIPCity and its free version, GeoLiteCity, look up a user’s geographic location. The ip_lookups enrichment uses this information to populate the geographical fields, including `geo_region_name`. The paid-for database is more accurate than the free version. This blog post from MaxMind has more background information
2) GeoIP ISP looks up a user’s internet service provider. This populates the new `ip_isp` field
3) GeoIP Organization looks up a user’s organization. This populates the new `ip_organization` field
4) GeoIP Domain looks up the second-level domain name associated with a user’s IP address. This populates the new `ip_domain` field
5) GeoIP Netspeed estimates a user’s connection speed. This populates the new `ip_netspeed` field
Here is an example configuration JSON, using the free GeoLiteCity database and the proprietary GeoIP ISP database only:
For each lookup, the `database` field contains the name of the database file, while the `uri` field contains the URI of the bucket in which the database file is found. The GeoLiteCity database is freely hosted by Snowplow at the supplied URI. In this example, the user has purchased MaxMind’s commercial “GeoIPISP.dat” database and is hosting it in their own private S3 bucket.
We have updated the table definitions to support the extended MaxMind enrichment – see above for the new field names. We have also applied run-length encoding to all Redshift fields which are derived from the IP address (#883).
To bring the tables in line with the design changes made to contexts and unstructured events in recent releases, we have deleted the `ue_name` field and renamed `ue_properties` to `unstruct_event`.
Finally, we have created a new `etl_tstamp` field. This is populated by a timestamp created in EmrEtlRunner, and describes when ETL for a particular row began.
We have also made some small but valuable improvements to the Hadoop-based Enrichment process:
- We are now extracting the `network_userid` if set, thanks to community member Phil Kallos! (#855)
- We are now validating that the transaction ID field is an integer (#428)
- We can now extract the `event_id` UUID from the incoming querystring if set. This should prove very helpful for the Kinesis flow wherever at-least-once processing is in effect (#723)
- We have upgraded the version of user-agent-utils we are using (thanks again Phil!)
You need to update EmrEtlRunner and StorageLoader to the latest code (0.9.6 release) on GitHub:
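A sketch of the standard approach, assuming the canonical `snowplow/snowplow` repository and a `0.9.6` release tag:

```bash
git clone git://github.com/snowplow/snowplow.git
cd snowplow
git checkout 0.9.6
```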
Update your EmrEtlRunner’s `config.yml` file. First, update both of your Hadoop job versions to, respectively:
Next, completely delete the `:enrichments:` section at the bottom:
For a complete example, see our sample `config.yml` file.
Finally, if you wish to use any of the configurable enrichments, you need to create a directory of configuration JSONs and pass that directory to EmrEtlRunner using the new `--enrichments` option.
Important: don’t forget to update any Bash script that you use to run your EmrEtlRunner job to include the `--enrichments` argument. If you forget to do this, then all of your enrichments will be switched off. You can see updated versions of these Bash files here:
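As a sketch (the script name and paths are illustrative and will depend on your own setup), an invocation might look like:

```bash
./snowplow-emr-etl-runner \
  --config config/config.yml \
  --enrichments config/enrichments
```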
You need to use the appropriate migration script to update to the new table definition:
And that’s it – you should be fully upgraded.
Documentation relating to enrichments is available on the wiki:
For more details on this release, please check out the 0.9.6 Release Notes on GitHub.