Snowplow 63 Red-Cheeked Cordon-Bleu released

02 April 2015  •  Alex Dean

We are pleased to announce the immediate availability of Snowplow 63, Red-Cheeked Cordon-Bleu. This is a major release which adds two new enrichments, upgrades existing enrichments and significantly extends and improves our Canonical Event Model for loading into Redshift, Elasticsearch and Postgres.

red-cheeked-cordon-bleu

The new and upgraded enrichments are as follows:

  1. New enrichment: parsing useragent strings using the ua_parser library
  2. New enrichment: converting the money amounts in e-commerce transactions into a base currency using Open Exchange Rates
  3. Upgraded: extracting click IDs in our campaign attribution enrichment, so that Snowplow event data can be more precisely joined with campaign data
  4. Upgraded: our existing MaxMind-powered IP lookups
  5. Upgraded: useragent parsing using the user_agent_utils library can now be disabled

This release has been a huge team effort - with particular thanks going to Snowplow winterns Aalekh Nigam (2014/15) and Jiawen Zhou (2013/14) for their work on the new enrichments and the foundational scala-forex library respectively.

Table of contents:

  1. New enrichment: useragent parsing using ua_parser
  2. New enrichment: currency conversion for e-commerce transaction
  3. Upgraded enrichment: click ID extraction for campaign attribution
  4. Upgraded enrichment: timezone lookup from IP address using MaxMind
  5. Upgraded enrichment: useragent parsing using user_agent_utils
  6. Other improvements to Scala Common Enrich
  7. Updates to atomic.events
  8. Updates to the Kinesis applications
  9. Upgrading your Snowplow pipeline
  10. Getting help

1. New enrichment: useragent parsing using ua_parser

Since close to its inception, Snowplow has used the user-agent-utils Java library to perform useragent parsing. Various limitations with that library have led us to explore and evaluate other options, including the ua-parser project with its uap-java library for the JVM. Testing suggests that this library handles some useragent strings (such as mobile app useragents) better than user-agent-utils. We particularly like that the ‘database’ the ua-parser looks user agent strings up against is a YAML file, making it straightforward for users to update the treatment of new useragents as they emerge themselves, rather than waiting on a new release of user-agent-utils.

As part of our move towards pluggable enrichments, from this release Snowplow users can employ user-agent-utils, or ua-parser, or both, or neither. We believe we are the first analytics platform to give users such a high degree of choice in their Enrichment process.

The behavior of user-agent-utils if enabled is unchanged; if ua-parser is enabled, then the Snowplow Enrichment process will write its results into a new context, ua_parser_context.

For more details on this enrichment, see the ua parser enrichment wiki page.

2. New enrichment: currency conversion for e-commerce transactions

Since early 2014, Snowplow trackers including the JavaScript Tracker have let you record the currency in which an e-commerce transaction took place. This is the first release of Snowplow which extracts those currency fields into our Canonical Event Model - but we have gone further and this release also includes a new enrichment which will automatically convert these transactions into your “base” or “home” currency for reporting and analytics.

Working with Open Exchange Rates, our new currency conversion enrichment will look up the end-of-day (“EOD”) rate between your transaction’s currency and your preferred base currency for the day prior to the e-commerce transaction, and use this to convert all monetary amounts in your e-commerce currency into your base currency. The converted values are all stored in new fields, so you can continue to work with the original amounts as well.

To take advantage of this new enrichment, you will need to sign up for an account at Open Exchange Rates and provide your API key in the enrichment’s JSON configuration file.

For more details on this enrichment, see the Currency conversion enrichment wiki page.

3. Upgraded enrichment: click ID extraction for campaign attribution

Many advertising systems attach a globally unique “click ID” tracking parameter to destination URIs to help advertisers attribute clicks to campaigns. The most well known of these click IDs are gclid (Google), msclkid (Microsoft) and dclid (DoubleClick).

We have extended our existing campaign attribution enrichment to look for and extract the value assigned to this click ID, populating the mkt_clickid field with the click ID and the mkt_network field with the name of the network. This should make it possible to join Snowplow data back to campaign data from marketing channel that drove the user to your site in a much more precise way than is possible using the existing source / medium / term / campaign / content parameters, so you can report exactly what you paid for that click, and then calculate the return on investment for it, based on the user’s subsequent actions.

By default the campaign attribution enrichment identifies the three click IDs given above, but you can also configure the enrichment with your own list of click IDs and network names:

{
    "schema": "iglu:com.snowplowanalytics.snowplow/campaign_attribution/jsonschema/1-0-1",

    "data": {

        "name": "campaign_attribution",
        "vendor": "com.snowplowanalytics.snowplow",
        "enabled": false,
        "parameters": {
            "mapping": "static",
            "fields": {
                ...
                "mktClickId": {
                    "gclid": "Google AdWords",
                    "customclid": "Private Network"
                }
            }
        }
    }
}

If you know of another click ID that would be useful to the wider Snowplow community, please do add it to this ticket.

For more details on this enrichment, see the campaign attribution enrichment wiki page.

4. Upgraded enrichment: timezone lookup from IP address using MaxMind

We have extended the IP lookup enrichment to extract the timezone information that MaxMind provides about an IP address. This timezone information is stored in a new field in the Canonical Event Model, geo_timezone.

For more details on this enrichment, see the IP lookups enrichment wiki page.

5. Upgraded enrichment: useragent parsing using user_agent_utils

Our existing useragent parsing enrichment built on user-agent-utils is no longer hardcoded to run - instead, it is now a user-configurable enrichment. The fields that it populates in atomic.events are unchanged.

To enable it to run as before, you must add in a JSON configuration file into your folder of enrichments. See 9.1.1 Configuring enrichments for details.

For more details on this enrichment, see the ua parser enrichment wiki page.

6. Other improvements to Scala Common Enrich

A set of smaller new features and capabilities have been to Scala Common Enrich in this release:

  • Netaporter’s more permissive URI library is used to parse querystrings if the Apache Commons httpclient fails. Many thanks to Dani Solà for this contribution! (#1429)
  • The refr_domain_userid and refr_dvce_tstamp fields as set by the JavaScript Tracker’s new cross-domain linker are now extracted (#1461). This makes it possible for users running a network of websites to stitch user actions across domains using only first party cookies. (We will cover how to do this in a future blog post.)
  • The session_id field is now populated based on the “sid” parameter. Session ID is a client-side generated UUID to complement the existing session index (#1541)
  • The dvce_sent_tstamp field is now populated based on the “stm” parameter. This is useful for determining when a tracker sent an event (versus creating that event) ([#1383][issue1383-])
  • bumped referer-parser to 0.2.3 (#670)
  • extracted original IP address from CollectorPayload headers (#1372)

7. Updates to atomic.events

This release makes a comprehensive set of updates to the atomic.events table (whether Redshift or Postgres), specifically:

  1. New fields as per the updated Canonical Event Model. These new fields are largely for the new enrichments, but we are also aiming to somewhat “future-proof” atomic.events by adding new fields which we plan on using in the near future
  2. Updates to existing fields, primarily so Snowplow can record a wider range of values in those fields

The new fields required in atomic.events (whether Redshift or Postgres) are as follows:

Column name Data type (1) From (2) Description
tr_currency char(3) Tracker Currency for e-commerce transaction
tr_total_base dec(18, 2) CCE Conversion to base currency
tr_tax_base dec(18, 2) CCE Conversion to base currency
tr_shipping_base dec(18, 2) CCE Conversion to base currency
ti_currency char(3) Tracker Currency for e-commerce transaction item
ti_price_base dec(18, 2) CCE Conversion to base currency
base_currency char(3) CCE CCE configuration option
geo_timezone varchar(64) ILE Timezone for IP address
mkt_clickid varchar(64) CAE Unique ID for advertising click
mkt_network varchar(64) CAE Advertising network of click ID
etl_tags varchar(500) Enrich Tags describing this run. Not yet implemented
dvce_sent_tstamp timestamp Tracker When device sent event
refr_domain_userid varchar(36) Tracker Extracted from cross-domain linker
refr_dvce_tstamp timestamp Tracker Extracted from cross-domain linker
derived_contexts varchar(15000) Enrich Contexts derived in the Enrich process
domain_sessionid char(36) Tracker Client-side session ID, complements index
derived_tstamp timestamp Enrich Calculated timestamp. Not yet implemented

(1) The data type is taken from Redshift; data types for some columns in Postgres are different

(2) Where:

  • CCE = Currency conversion enrichment
  • ILE = IP lookups enrichment
  • CAE = Campaign attribution enrichment

We have also made the following changes to the table definitions:

Column name New data type (1) Old data type Reason
user_ipaddress</code> varchar(45) varchar(19) To support IPv6 addresses
unstruct_event varchar(15000)* varchar(10000) To support larger JSONs
domain_userid varchar(36) varchar(16) To support UUIDs as domain user IDs
contexts varchar(15000)* varchar(10000) To support more custom contexts
page_urlpath varchar(3000) varchar(1000) To support longer page URL paths
page_urlquery varchar(6000) varchar(3000) To support longer page URL queries (e.g. from ad servers)
page_urlfragment varchar(3000) varchar(255) To support longer page fragments
refr_urlpath varchar(3000) varchar(1000) To support longer page URL queries (e.g. from ad servers)
refr_urlquery varchar(6000) varchar(3000)
refr_urlfragment varchar(3000) varchar(255) To support longer page fragments
se_category varchar(1000) varchar(255) To support longer category labels
se_action varchar(1000) varchar(255)
se_label varchar(1000) varchar(255)
se_property varchar(1000) varchar(255)

* Also changed column encoding in Redshift from raw to lzo

(1) The data type is taken from Redshift; data types for some columns in Postgres are different

In addition to these changes, for Postgres we have removed the primary key constraint on event_id (#1187).

Finally, we have also added a foreign key constraint to all Redshift shredded JSON tables to make the joins back to the parent atomic.events table more performant (#1365).

8. Updates to the Kinesis applications

The main update to both Kinesis applications is to support the new enriched event format (see 7. Updates to atomic.events for details). Other noteworthy updates to the Scala Kinesis Enrich:

  • The Scala Kinesis Enrich application now uses Scala Common Enrich 0.13.0, the latest version (#1369). Previously it was using Scala Common Enrich 0.11.0. This means that you can take advantage of all the enrichment updates in the Kinesis flow, and it also brings the Kinesis flow up-to-date with the various encoding-related fixes implemented in Scala Common Enrich 0.12.0
  • Community member Kacper Bielecki updated the Scala Kinesis Enrich’s logging configuration (#1367)

There is also an important update to the Kinesis Elasticsearch Sink: we have stopped verifying the number of fields found in enriched event (#1333). This should make the Elasticsearch Sink more tolerant of potential future updates to Scala Kinesis Enrich.

This section contains upgrading instructions which are common to both our Elastic MapReduce and Kinesis pipelines.

To continue parsing useragent strings using the user_agent_utils library, you must add a new JSON configuration file into your folder of enrichment JSONs:

{
	"schema": "iglu:com.snowplowanalytics.snowplow/user_agent_utils_config/jsonschema/1-0-0",

	"data": {

		"vendor": "com.snowplowanalytics.snowplow",
		"name": "user_agent_utils_config",
		"enabled": true,
		"parameters": {}
	}
}

The name of the file is not important but must end in .json.

Configuring other enrichments is at your discretion. Useful resources here are:

There are two steps to upgrading the EMR pipeline:

  1. Upgrade your EmrEtlRunner to use the latest Hadoop job versions
  2. Upgrade your Redshift and/or Postgres atomic.events table to the latest definitions

This release bumps:

  1. The Hadoop Enrichment process to version 0.14.0
  2. The Hadoop Shredding process to version 0.4.0

In your EmrEtlRunner’s config.yml file, update your Hadoop jobs versions like so:

  :versions:
    :hadoop_enrich: 0.14.0 # WAS 0.13.0
    :hadoop_shred: 0.4.0 # WAS 0.3.0

For a complete example, see our sample config.yml template.

You need to use the appropriate migration script to update to the new table definition:

And that’s it - you should be fully upgraded.

If you want to make use of the new ua_parser based useragent parsing enrichment in Redshift, you must also deploy the new table into your atomic schema:

This release updates:

  1. Scala Kinesis Enrich, to version 0.4.0
  2. Kinesis Elasticsearch Sink, to version 0.2.0

The new version of the Kinesis pipeline is available on Bintray as snowplow_kinesis_r61_red_cheeked_cordon_bleu.zip. The download contains the latest versions of all of the Kinesis apps (Scala Stream Collector, Scala Kinesis Enrich, Kinesis Elasticsearch Sink, and Kinesis S3 Sink).

The components in the Kinesis topology updated in this release are highlighted in this graph:

r63-kinesis-changes

Our recommended approach for upgrading is as follows:

  1. Kill your Scala Kinesis Enrich cluster
  2. Leave your Kinesis Elasticsearch Sink cluster running until all remaining enriched events are loaded, then kill this cluster too
  3. Upgrade your Scala Kinesis Enrich cluster to the new version
  4. Upgrade your Kinesis Elasticsearch Sink cluster to the new version
  5. Restart your Scala Kinesis Enrich cluster
  6. Restart your Kinesis Elasticsearch Sink cluster

10. Getting help

For more details on this release, please check out the r63 Red-Cheeked Cordon-Bleu Release Notes on GitHub.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.