Snowplow 72 Great Spotted Kiwi released

15 October 2015  •  Alex Dean

We are pleased to announce the release of Snowplow version 72 Great Spotted Kiwi. This release adds the ability to track clicks through the Snowplow Clojure Collector, adds a cookie extractor enrichment and introduces new deduplication queries leveraging R71’s event fingerprint.

The rest of this post will cover the following topics:

  1. Click tracking
  2. New cookie extractor enrichment
  3. New deduplication queries
  4. Upgrading
  5. Getting help
  6. Upcoming releases


1. Click tracking

Although the Snowplow JavaScript Tracker offers link click tracking, there are scenarios where you want to track a link click without having access to JavaScript. Two common examples are: tracking clicks on ad units, and users downloading files using curl or wget.

To support these use cases we have added a new URI redirect mode into the Clojure Collector. You update your link’s URI to point to your event collector, and the collector receives the click, logs a URI redirect event and then performs a 302 redirect to the intended URI. This is the exact model followed by ad servers to track ad clicks.

To use this functionality:

  • Set your collector path to /r/tp2? - the /r/tp2 tells Snowplow that you are attempting a URI redirect
  • Add a &u= argument to your collector URI, whose value is your URL-encoded final URI to redirect to
  • On clicking this link, the collector will register the click and then do a 302 redirect to the URI supplied in the u parameter
  • As well as the &u= parameter, you can populate the collector URI with any other fields from the Snowplow Tracker Protocol
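As a rough illustration of the steps above, the following Python sketch assembles such a link. The helper name build_redirect_uri and the host collector.example.com are hypothetical and not part of Snowplow; only the /r/tp2 path and the u parameter come from this post.

```python
from urllib.parse import urlencode


def build_redirect_uri(collector_host, target_uri, extra_fields=None, scheme="http"):
    """Build a click-tracking link for the collector's /r/tp2 redirect path.

    collector_host and extra_fields are illustrative inputs; extra_fields is
    assumed to hold additional Snowplow Tracker Protocol fields, if any.
    """
    params = {"u": target_uri}         # urlencode percent-encodes the final URI
    params.update(extra_fields or {})  # optional extra Tracker Protocol fields
    return f"{scheme}://{collector_host}/r/tp2?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical collector host and download link
    print(build_redirect_uri("collector.example.com", "https://example.com/file.zip?a=1"))
```

The final URI is percent-encoded as a whole, so any query string it carries is not confused with the collector's own parameters.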

The URI redirection will be recorded using the JSON Schema com.snowplowanalytics.snowplow/uri_redirect/jsonschema/1-0-0.

For more information on how this functionality works, check out the Click tracking section in our Pixel Tracker documentation.

We will be adding this capability into the Scala Stream Collector in Release 74.

2. New cookie extractor enrichment

One powerful attribute of having Snowplow event collection on your own domain (e.g. events.snowplowanalytics.com) is the ability to capture first-party cookies set by other services on your domain such as ad servers or CMSes; these cookies are stored as HTTP headers in the Thrift raw event payload by the Scala Stream Collector.

Prior to this release there was no way of accessing these cookies in the Snowplow Enrichment process. That changes with Snowplow community member Kacper Bielecki’s new Cookie Extractor Enrichment. This is our first community-contributed enrichment - a huge milestone and hopefully the first of many! Thanks so much Kacper.

The example configuration JSON for this enrichment is as follows:

{
    "schema": "iglu:com.snowplowanalytics.snowplow/cookie_extractor_config/jsonschema/1-0-0",
    "data": {
        "name": "cookie_extractor_config",
        "vendor": "com.snowplowanalytics.snowplow",
        "enabled": true,
        "parameters": {
            "cookies": ["sp"]
        }
    }
}

This default configuration captures the Scala Stream Collector’s own sp cookie - in practice you would probably extract other more valuable cookies available on your company domain. Each extracted cookie will end up as a single derived context following the JSON Schema org.ietf/http_cookie/jsonschema/1-0-0.
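To make the shape of the output more concrete, here is a minimal Python sketch of the idea. It is not the enrichment's actual implementation (that lives in Scala Common Enrich), the function name extract_cookies is hypothetical, and the name and value fields in the context's data are an assumption about the http_cookie schema rather than something stated in this post.

```python
from http.cookies import SimpleCookie

HTTP_COOKIE_SCHEMA = "iglu:org.ietf/http_cookie/jsonschema/1-0-0"


def extract_cookies(cookie_header, wanted):
    """Turn the configured cookies found in a Cookie header into derived contexts.

    The "name"/"value" keys are assumed field names for the http_cookie schema.
    """
    jar = SimpleCookie()
    jar.load(cookie_header)  # parse "k1=v1; k2=v2" into individual cookies
    return [
        {"schema": HTTP_COOKIE_SCHEMA, "data": {"name": name, "value": jar[name].value}}
        for name in wanted
        if name in jar  # cookies missing from the header produce no context
    ]


if __name__ == "__main__":
    # "sp" mirrors the "cookies" parameter in the example configuration
    print(extract_cookies("sp=abc-123; session=xyz", ["sp"]))
```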

For more information see the Cookie extractor enrichment page on the Snowplow wiki.

Please note that this enrichment only works with events recorded by the Scala Stream Collector - the CloudFront and Clojure Collectors do not capture HTTP headers.

3. New deduplication queries

This release comes with 3 new SQL scripts that deduplicate events in Redshift using the event fingerprint that was introduced in Snowplow R71. For more information on duplicates, see the recent blogpost that explores the phenomenon in more detail.

The first script deduplicates rows with the same event_id and event_fingerprint. Because these events are identical, the script leaves the earliest one in atomic and moves all others to a separate schema. There is an optional last step that also moves all remaining duplicates (same event_id but different event_fingerprint). Note that this could delete legitimate events from atomic.
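The logic of the first script can be sketched in Python as follows. This is only an illustration of the grouping rule, not the SQL that ships with the release: the function name split_duplicates is hypothetical, and ordering by collector_tstamp to decide which row is "earliest" is an assumption.

```python
from collections import defaultdict


def split_duplicates(events):
    """Split events into (kept, moved) by (event_id, event_fingerprint).

    Assumes "earliest" means the smallest collector_tstamp, which this
    sketch chooses for illustration.
    """
    groups = defaultdict(list)
    for event in events:
        groups[(event["event_id"], event["event_fingerprint"])].append(event)

    kept, moved = [], []
    for rows in groups.values():
        rows.sort(key=lambda event: event["collector_tstamp"])
        kept.append(rows[0])    # earliest row stays in atomic
        moved.extend(rows[1:])  # identical later rows go to the separate schema
    return kept, moved
```

Rows that share an event_id but differ in event_fingerprint fall into different groups here, so they all stay in kept; that corresponds to skipping the optional last step.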

The second is an optional script that deduplicates rows with the same event_id where at least one row has no event_fingerprint (older events). The script is identical to the first script, except that an event fingerprint is generated in SQL.

The third script is a template that can be used to deduplicate unstructured event or custom context tables. Note that contexts can have legitimate duplicates (e.g. 2 or more product contexts that join to the same parent event). If that is the case, make sure that the context is defined in such a way that no 2 identical contexts are ever sent with the same event. The script combines rows when all fields but root_tstamp are equal. There is an optional last step that moves all remaining duplicates (same root_id but at least one field other than root_tstamp is different) from atomic to duplicates. Note that this could delete legitimate events from atomic.
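The "all fields but root_tstamp are equal" rule can be illustrated with a short Python sketch. It is not the SQL template itself: the function name dedupe_context_rows is hypothetical, column values are assumed to be simple scalars, and keeping the row with the earliest root_tstamp is an assumption made for the example.

```python
def dedupe_context_rows(rows):
    """Collapse context rows that are equal in every column except root_tstamp.

    Keeps the row with the earliest root_tstamp in each group (an assumption
    of this sketch). Column values must be hashable scalars.
    """
    earliest = {}
    for row in rows:
        # Identity of a row: every column except root_tstamp
        key = tuple(sorted((k, v) for k, v in row.items() if k != "root_tstamp"))
        best = earliest.get(key)
        if best is None or row["root_tstamp"] < best["root_tstamp"]:
            earliest[key] = row
    return list(earliest.values())
```

Two product contexts attached to the same parent event survive this step as long as at least one of their fields differs, which is why the context definition should guarantee that.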

These scripts can be run after each load using SQL Runner. Make sure to run the setup queries first.

4. Upgrading

Upgrading the Clojure Collector

This release bumps the Clojure Collector to version 1.1.0.

To upgrade to this release:

  1. Download the new warfile by right-clicking on this link and selecting “Save As…”
  2. Log in to your Amazon Elastic Beanstalk console
  3. Browse to your Clojure Collector’s application
  4. Click “Upload New Version” and upload your warfile

Updating the configuration files

You need to update the version of the Enrich jar in your configuration file:

    hadoop_enrich: 1.2.0 # Version of the Hadoop Enrichment process

If you wish to use the new cookie extractor enrichment, write a configuration JSON and add it to your enrichments folder. The example JSON can be found here.

Updating your database

Install the following tables in Redshift as required:

5. Getting help

For more details on this release, please check out the R72 Great Spotted Kiwi release notes on GitHub. Specific documentation on the two new features is available here:

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.

6. Upcoming releases

By popular request, we are adding a section to these release blog posts to trail upcoming Snowplow releases. Note that these releases are always subject to change between now and the actual release date.

Upcoming releases are:

  • Release 73 Cuban Macaw, which removes the JSON fields from atomic.events and adds the ability to load bad rows into Elasticsearch
  • Release 74 Bird TBC, which brings the Kinesis pipeline up-to-date with the most recent Scala Common Enrich releases. This will also include click redirect support in the Scala Stream Collector

Other milestones being actively worked on include Avro support #1, Weather enrichment and Snowplow CLI #2.