Snowplow 0.9.9 released with campaign attribution enrichment


We are pleased to announce the release of Snowplow 0.9.9. This is primarily a comprehensive bug fix release, although it also adds the new campaign_attribution enrichment to our enrichment registry. Here are the sections after the fold:

  1. The campaign_attribution enrichment
  2. Clojure Collector fixes
  3. StorageLoader fixes
  4. EmrEtlRunner fixes and enhancements
  5. Hadoop Enrich fixes and enhancements
  6. Upgrading
  7. Documentation and help

1. The campaign_attribution enrichment

Snowplow has five fields relating to campaign attribution: mkt_medium, mkt_source, mkt_term, mkt_content, and mkt_campaign. In previous versions of Snowplow, the values of these fields were based on the corresponding five utm_ fields supported by Google for campaign manual tagging.

The new campaign_attribution enrichment allows you to alter this behavior. For each of the five fields, you can specify an array of querystring fields to check for the appropriate value.

This is the configuration to use if you want to duplicate the functionality of previous Snowplow versions, populating the campaign fields from the standard utm_ querystring parameters:

    {
        "schema": "iglu:com.snowplowanalytics.snowplow/campaign_attribution/jsonschema/1-0-0",
        "data": {
            "name": "campaign_attribution",
            "vendor": "com.snowplowanalytics.snowplow",
            "enabled": true,
            "parameters": {
                "mapping": "static",
                "fields": {
                    "mktMedium": ["utm_medium"],
                    "mktSource": ["utm_source"],
                    "mktTerm": ["utm_term"],
                    "mktContent": ["utm_content"],
                    "mktCampaign": ["utm_campaign"]
                }
            }
        }
    }

The JSON has the same format as the JSONs for the other enrichments: static name and vendor fields, an enabled field which can be used to turn the enrichment off, and a parameters field containing data specific to the enrichment.

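With the above configuration, if the querystring contained utm_medium=cpc, utm_source=google, utm_term=shoes, utm_content=logolink and utm_campaign=april_sale, then the fields would be populated like this: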

    Field          Value
    mkt_medium     "cpc"
    mkt_source     "google"
    mkt_term       "shoes"
    mkt_content    "logolink"
    mkt_campaign   "april_sale"

You can have more than one querystring field in each array:

    {
        "schema": "iglu:com.snowplowanalytics.snowplow/campaign_attribution/jsonschema/1-0-0",
        "data": {
            "name": "campaign_attribution",
            "vendor": "com.snowplowanalytics.snowplow",
            "enabled": true,
            "parameters": {
                "mapping": "static",
                "fields": {
                    "mktMedium": ["utm_medium", "medium"],
                    "mktSource": ["utm_source", "source"],
                    "mktTerm": ["utm_term", "legacy_term"],
                    "mktContent": ["utm_content"],
                    "mktCampaign": ["utm_campaign", "cid", "legacy_campaign"]
                }
            }
        }
    }

The first field name found takes precedence. In this example, if there is a “utm_medium” field in the querystring, its value will be used for mkt_medium; otherwise, if there is a “medium” field in the querystring, its value will be used; otherwise, the mkt_medium field will be null.
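
To make the precedence rule concrete, here is a minimal Python sketch of first-match lookup. It is not the enrichment's actual implementation (which runs inside Hadoop Enrich); the function name attribute_campaign is hypothetical, and dropping blank querystring values is an assumption of this sketch, not something the enrichment is documented to do.

```python
from urllib.parse import parse_qs

# Field mapping copied from the multi-field configuration in this post.
FIELDS = {
    "mktMedium": ["utm_medium", "medium"],
    "mktSource": ["utm_source", "source"],
    "mktTerm": ["utm_term", "legacy_term"],
    "mktContent": ["utm_content"],
    "mktCampaign": ["utm_campaign", "cid", "legacy_campaign"],
}


def attribute_campaign(querystring, fields=FIELDS):
    """Hypothetical helper: pick one value per campaign field.

    For each field, the candidate names are tried in array order and the
    first one present in the querystring wins; None means nothing matched.
    """
    # parse_qs maps each parameter name to a list of its values and,
    # by default, drops parameters with blank values (sketch assumption).
    params = parse_qs(querystring)
    result = {}
    for field, candidates in fields.items():
        result[field] = next(
            (params[name][0] for name in candidates if name in params),
            None,
        )
    return result
```

Calling attribute_campaign("medium=email&utm_campaign=april_sale&cid=123") picks "email" for mktMedium (no utm_medium is present) and "april_sale" for mktCampaign (utm_campaign is listed before cid), and leaves the remaining fields as None.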

We plan to extend the campaign_attribution enrichment to also extract the advert’s click ID, if found (#1073). This will serve as a good basis for more granular campaign analytics.

We have also sketched out a potential option to set the "mapping" field to “script” to enable JavaScript scripting support (#436). This would allow the use of more complex custom transformations to extract campaign attribution values from the querystring.

2. Clojure Collector fixes

We have fixed a pair of bugs which caused issues with the IP addresses recorded by the Clojure Collector, especially when running in a VPC with multiple nodes. The tickets are here:

Thank you for your patience in the resolution of these issues – we have had the updated version in test with various respondents and everything seems to be functioning correctly now.

3. StorageLoader fixes

There was an issue (#1012) where the StorageLoader was attempting to fetch JSON Path files from the main Snowplow Hosted Assets bucket, which is in eu-west-1. For users trying to load shredded JSONs into a Redshift instance in another region, the COPY FROM JSON was failing because any JSON Path files must be in the same region as the target table.

We have fixed this by mirroring all of our hosted assets (including JSON Path files) to per-region buckets (s3://snowplow-hosted-assets-us-east-1 etc). StorageLoader now chooses the correct Snowplow Hosted Assets bucket to use, based on the region of the target Redshift database.
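
The bucket selection can be sketched roughly as follows. This is an illustration in Python and not StorageLoader's own code; the function name is hypothetical, and the name of the main eu-west-1 bucket is an assumption inferred from the per-region naming pattern quoted above.

```python
MAIN_BUCKET_REGION = "eu-west-1"             # stated in this post
MAIN_BUCKET = "s3://snowplow-hosted-assets"  # assumed name of the main bucket


def hosted_assets_bucket(redshift_region):
    """Hypothetical helper: choose the hosted assets bucket for a region.

    The main bucket serves eu-west-1; every other region uses a mirror
    whose name carries the region as a suffix.
    """
    if redshift_region == MAIN_BUCKET_REGION:
        return MAIN_BUCKET
    return f"{MAIN_BUCKET}-{redshift_region}"
```

For a Redshift cluster in us-east-1 this returns s3://snowplow-hosted-assets-us-east-1, which matches the example bucket named above.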

4. EmrEtlRunner fixes and enhancements

We have resolved two issues which should facilitate the smoother running of EmrEtlRunner:

  1. We fixed a regression with --process-enrich. Thanks to community member Rob Kingston for spotting this (#1089)
  2. Now if there are no rows to process, EmrEtlRunner correctly returns a 0 status code at the command-line, not a 1 as before (#1018)

To make EmrEtlRunner more robust in scenarios where it is run very frequently (e.g. every hour), we have added in checks that the :enriched:good and :shredded:good folders are empty before starting jobflow steps that would write additional data to them. Please see issue #1124 for more details on this.
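
As a rough illustration of this kind of pre-flight check (the real logic lives in EmrEtlRunner's Ruby code and is described in #1124), the Python sketch below refuses to proceed when a target folder already holds files. The list_keys callable, the function name and the bucket paths are all hypothetical stand-ins.

```python
def assert_targets_empty(list_keys, targets):
    """Raise before any jobflow step runs if a target folder has files.

    list_keys is a hypothetical callable returning the keys found under a
    location; targets maps a config name such as ":enriched:good" to the
    location configured for it.
    """
    for name, location in targets.items():
        existing = list(list_keys(location))
        if existing:
            raise RuntimeError(
                f"{name} at {location} is not empty "
                f"({len(existing)} files); not starting the jobflow"
            )


# In-memory stand-in for a bucket listing (paths are made up).
fake_bucket = {
    "s3://my-bucket/enriched/good/": [],
    "s3://my-bucket/shredded/good/": ["part-00000"],
}
```

With this stand-in, checking only the enriched location passes silently, while including the shredded location raises a RuntimeError because one file is already present.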

5. Hadoop Enrich fixes and enhancements

0.9.9 fixes a bug in how Snowplow’s Hadoop Enrichment process validates an incoming (i.e. tracker-generated) event_id UUID. According to the specification, UUIDs with capital letters are valid on read. This release fixes the bug by downcasing all incoming UUIDs.
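
The idea of downcasing before validating can be sketched in a few lines of Python. This is only an illustration; the actual fix is in the Scala code of Hadoop Enrich, and the function name and regular expression here are assumptions made for the example.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal layout, lowercase only.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def normalize_event_id(raw):
    """Hypothetical helper: downcase an incoming event_id, then validate it.

    Returns the lowercase UUID, or None if the value is not a UUID.
    """
    lowered = raw.lower()
    return lowered if UUID_PATTERN.fullmatch(lowered) else None
```

Lowercasing first means that a tracker sending an uppercase UUID is accepted and stored in a single consistent form, while malformed values are still rejected.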

This release also supports trackers sending in the original client’s useragent via the &ua= parameter (new in the Snowplow Tracker Protocol). This is useful for situations where your tracker does not reflect the true source of the event, e.g. with the Ruby Tracker reporting a user’s checkout event in Rails.
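
A minimal Python sketch of the intended behaviour is shown below. It assumes that the enrichment falls back to the useragent recorded by the collector when no &ua= parameter is sent; that fallback and the function name are assumptions of this example, not a description of the Scala implementation.

```python
from urllib.parse import parse_qs


def effective_useragent(querystring, collector_useragent):
    """Hypothetical helper: prefer the tracker-supplied ua parameter.

    Falls back to the useragent seen by the collector when the
    querystring carries no ua value (assumed behaviour).
    """
    values = parse_qs(querystring).get("ua")  # values are URL-decoded
    return values[0] if values else collector_useragent
```

In the Rails scenario, a server-side event sent with ua=Mozilla%2F5.0 would be attributed to the browser string "Mozilla/5.0" and not to the HTTP client used by the Ruby Tracker.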

Finally, this version of the Hadoop Enrichment process introduces some more robust handling of numeric field validation (#570 and #1062).

6. Upgrading

You need to update EmrEtlRunner and StorageLoader to the latest code (0.9.2 and 0.3.3 respectively) on GitHub:

    $ git clone git://
    $ git checkout 0.9.9
    $ cd snowplow/3-enrich/emr-etl-runner
    $ bundle install --deployment
    $ cd ../../4-storage/storage-loader
    $ bundle install --deployment

This release bumps the Hadoop Enrichment process to version 0.8.0.

In your EmrEtlRunner’s config.yml file, update your Hadoop enrich job’s version to 0.8.0, like so:

    :versions:
      :hadoop_enrich: 0.8.0 # WAS 0.7.0

For a complete example, see our sample config.yml template.

If you upgrade Hadoop Enrich to version 0.8.0 as above, you MUST also follow these steps, or else campaign attribution will be disabled.

To use the new enrichment, add a “campaign_attribution.json” file containing a campaign_attribution enrichment JSON to your enrichments directory. Note that the previously automatic behaviour of populating the mkt_ fields based on the utm_ querystring fields no longer occurs by default. To reproduce it you must use the Google-like manual tagging configuration.

This release bumps the Clojure Collector to version 0.8.0.

To upgrade to this release:

  1. Download the new warfile by right-clicking on this link and selecting “Save As…”
  2. Log in to your Amazon Elastic Beanstalk console
  3. Browse to your Clojure Collector’s application
  4. Click the “Upload New Version” button and upload your warfile

7. Documentation and help

Documentation relating to enrichments is available on the wiki:

As always, if you do run into any issues or don’t understand any of the above changes, please raise an issue or get in touch with us via the usual channels.

