Snowplow 71 Stork-Billed Kingfisher released

02 October 2015  •  Fred Blundun

We are pleased to announce the release of Snowplow version 71 Stork-Billed Kingfisher. This release significantly overhauls Snowplow’s handling of time and introduces event fingerprinting to support deduplication efforts. It also brings our validation of unstructured events and custom context JSONs “upstream” from our Hadoop Shred process into our Hadoop Enrich process.

The rest of this post will cover the following topics:

  1. Better handling of event time
  2. JSON validation in Scala Common Enrich
  3. New unstructured event fields in enriched events
  4. New event fingerprint enrichment
  5. New CloudFront access log fields
  6. More performant handling of missing schemas
  7. Using SSL in the StorageLoader
  8. New approach to atomic.events upgrades
  9. Other improvements
  10. Upgrading
  11. Getting help

1. Better handling of event time

This release implements our new approach to determining when events occurred, as introduced in the recent blog post Improving Snowplow’s understanding of time.

Specifically, this release:

  • Renames dvce_tstamp to dvce_created_tstamp to remove ambiguity
  • Adds the derived_tstamp field to our Canonical Event Model
  • Adds the true_tstamp field, in readiness for our trackers adding support for this
  • Implements the algorithm set out in that blog post to calculate the most accurate derived_tstamp available
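
As a rough illustration of the idea, here is a minimal Python sketch (not the Scala Common Enrich code itself). The function name derive_tstamp and the exact order of the fallbacks are assumptions made for this example: it assumes a true_tstamp is used as-is when present, that otherwise the collector timestamp is corrected by the gap between the device creating and sending the event, and that the collector timestamp is the last resort.

```python
from datetime import datetime
from typing import Optional


def derive_tstamp(
    collector_tstamp: datetime,
    dvce_created_tstamp: Optional[datetime] = None,
    dvce_sent_tstamp: Optional[datetime] = None,
    true_tstamp: Optional[datetime] = None,
) -> datetime:
    """Illustrative only: pick the most accurate event time available."""
    # Assumption: a tracker-supplied true_tstamp is trusted as-is.
    if true_tstamp is not None:
        return true_tstamp
    # Assumption: the device clock may be wrong in absolute terms, but the gap
    # between creating and sending the event is reliable, so we subtract that
    # gap from the collector's timestamp.
    if dvce_created_tstamp is not None and dvce_sent_tstamp is not None:
        return collector_tstamp - (dvce_sent_tstamp - dvce_created_tstamp)
    # Otherwise fall back to the collector's clock.
    return collector_tstamp
```

For example, an event created on the device 30 seconds before it was sent would get a derived_tstamp 30 seconds earlier than its collector_tstamp, regardless of how far off the device clock is.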

2. JSON validation in Scala Common Enrich

Previously, validation of unstructured events and custom context self-describing JSONs was only performed in our Hadoop Shred process, in preparation for loading Redshift. With self-describing JSONs growing more and more central to event modeling within Snowplow, it became increasingly important to bring this validation “upstream” into Scala Common Enrich.

Thanks to Dani Sola, the Scala Hadoop Shred validation code for unstructured event and custom context JSONs is now also executed from within Scala Common Enrich.

This means that Scala Hadoop Enrich now validates unstructured event and custom context JSONs; in the next Kinesis pipeline release, Scala Kinesis Enrich will validate these JSONs too.

Please note: if the unstructured event or any of the custom contexts fails validation against its JSON Schema in Iglu, the whole event will be rejected and written to the bad bucket.

3. New unstructured event fields in enriched events

Now that we are validating unstructured events in Scala Common Enrich (rather than simply passing them through), we can extract some key information about the unstructured event for storage in our Canonical Event Model.

Therefore, Dani has added event_vendor, event_name, event_format, and event_version fields to our enriched event model. This makes it a lot easier to analyze the distribution of your event types just by looking at atomic.events. Many thanks Dani!

These are the values of the new event fields for our five “legacy” event types which aren’t (yet) modeled using self-describing JSON:

Legacy event type  event_name        event_vendor                    event_format  event_version
Page view          page_view         com.snowplowanalytics.snowplow  jsonschema    1-0-0
Page ping          page_ping         com.snowplowanalytics.snowplow  jsonschema    1-0-0
Transaction        transaction       com.snowplowanalytics.snowplow  jsonschema    1-0-0
Transaction item   transaction_item  com.snowplowanalytics.snowplow  jsonschema    1-0-0
Structured event   event             com.google.analytics            jsonschema    1-0-0
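
For unstructured events, the four fields correspond to the four parts of the self-describing JSON's schema URI. The following Python sketch shows that mapping; it is an illustration rather than the Scala Common Enrich implementation, and the names SchemaKey and parse_schema_uri, as well as the com.acme example schema, are hypothetical.

```python
from typing import NamedTuple


class SchemaKey(NamedTuple):
    """The four components of an Iglu schema URI (illustrative type)."""
    vendor: str
    name: str
    format: str
    version: str


def parse_schema_uri(uri: str) -> SchemaKey:
    """Split 'iglu:vendor/name/format/version' into its four parts."""
    prefix = "iglu:"
    if not uri.startswith(prefix):
        raise ValueError(f"not an Iglu schema URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected vendor/name/format/version in {uri!r}")
    return SchemaKey(*parts)


# Hypothetical unstructured event schema, used purely as an example
key = parse_schema_uri("iglu:com.acme/ad_click/jsonschema/1-0-0")
```

Here key.vendor, key.name, key.format and key.version would populate event_vendor, event_name, event_format and event_version respectively.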

4. New event fingerprint enrichment

Duplicate events are a hot topic in the Snowplow community - see the recent blog post Dealing with duplicate event IDs for a detailed exploration of the phenomenon.

As a first step in making it easier to identify and quarantine duplicates, this release introduces a new Event fingerprint enrichment.

The new enrichment creates a fingerprint from a hash of the Tracker Protocol fields set in an event’s querystring (for GET requests) or body (for POST requests). You can configure a list of Tracker Protocol fields to exclude from the hash generation. For example, in our default configuration we exclude:

  1. “eid” (event_id), because we will typically review event IDs separately when investigating duplicates
  2. “stm” (dvce_sent_tstamp), since this field could change between two different attempts to send the same event
  3. “nuid” (network_userid), because a single event that is sent twice to a collector on a computer that does not accept third party cookies would be assigned different network_userids (despite being a duplicate)
  4. “cv” (v_collector), because this is attached by the Clojure Collector rather than by the tracker

The example configuration JSON for this enrichment is as follows:

{
  "schema": "iglu:com.snowplowanalytics.snowplow/event_fingerprint_config/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "event_fingerprint_config",
    "enabled": true,
    "parameters": {
      "excludeParameters": ["cv", "eid", "nuid", "stm"],
      "hashAlgorithm": "MD5"
    }
  }
}
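
To make the idea concrete, here is a minimal Python sketch of fingerprinting a GET querystring with the default exclusions above. It is not the enrichment's actual code: the function name event_fingerprint is hypothetical, and the way the remaining fields are canonicalised before hashing (sorted by key, then re-encoded as a querystring) is an assumption made for this example.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode

# Mirrors "excludeParameters" in the example configuration
DEFAULT_EXCLUDED = frozenset({"cv", "eid", "nuid", "stm"})


def event_fingerprint(querystring: str, excluded=DEFAULT_EXCLUDED) -> str:
    """Hash the Tracker Protocol fields that are not excluded."""
    fields = parse_qsl(querystring, keep_blank_values=True)
    # Drop excluded fields, then sort so that field order does not matter.
    kept = sorted((k, v) for k, v in fields if k not in excluded)
    # Assumption: re-encode the surviving fields as a canonical querystring.
    canonical = urlencode(kept)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


first_try = event_fingerprint("e=pv&p=web&url=http%3A%2F%2Fexample.com&eid=111&stm=1000")
retry = event_fingerprint("e=pv&p=web&url=http%3A%2F%2Fexample.com&eid=222&stm=2000")
```

In this sketch first_try and retry are equal, because the two payloads differ only in the excluded eid and stm fields; changing any other field would produce a different fingerprint.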

5. New CloudFront access log fields

In July, an AWS update added four new fields to the CloudFront access log format.

The Snowplow CloudFront access log input format (not to be confused with the CloudFront Collector) now supports these new fields. You can use this migration script to upgrade your Redshift table accordingly.

6. More performant handling of missing schemas

Previously the Scala Hadoop Shred process would take an extremely long time to complete if a JSON Schema referenced across many events could not be found in any Iglu repository.

This was because, although our underlying Iglu client cached successfully-found schemas, it did not remember which schemas it had already failed to find; this led to an expensive HTTP lookup on every missing schema instance. The latest release fixes this problem.
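
The fix amounts to caching failures as well as successes. The Python sketch below illustrates the technique only; it is not the Iglu client's code, and the class name CachingResolver and the fetch callback are hypothetical.

```python
from typing import Callable, Dict, Optional


class CachingResolver:
    """Caches both found and missing schemas (illustrative only)."""

    def __init__(self, fetch: Callable[[str], Optional[dict]]):
        # fetch stands in for the expensive HTTP lookup against the Iglu
        # repositories; it returns None when no repository has the schema.
        self._fetch = fetch
        self._cache: Dict[str, Optional[dict]] = {}

    def lookup(self, schema_uri: str) -> Optional[dict]:
        # Testing key membership (rather than the stored value) means a
        # cached None, i.e. a known-missing schema, also counts as a hit.
        if schema_uri not in self._cache:
            self._cache[schema_uri] = self._fetch(schema_uri)
        return self._cache[schema_uri]
```

With this structure, a schema that is missing from every repository costs one lookup per job rather than one lookup per event. The sketch never expires its entries; a long-running process would probably want missing entries to be retried after some interval.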

7. Using SSL in the StorageLoader

Snowplow community member Dennis Waldron has contributed the ability to connect to Postgres and Redshift using SSL. To do this, add an “ssl_mode” field to each target in your configuration YAML:

  targets:
    - name: "My Redshift database"
      type: redshift
      host: ADD HERE # The endpoint as shown in the Redshift console
      database: ADD HERE # Name of database
      port: 5439 # Default Redshift port
      ssl_mode: disable # One of disable (default), require, verify-ca or verify-full
      table: atomic.events
      username: ADD HERE
      password: ADD HERE
      maxerror: 1 # Stop loading on first error, or increase to permit more load errors
      comprows: 200000 # Default for a 1 XL node cluster. Not used unless --include compupdate specified

Thanks Dennis!
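
The four permitted values match a subset of the sslmode options understood by libpq, the Postgres client library. As a hedged illustration, the Python sketch below turns one target from the YAML above into libpq-style connection parameters. The function name connection_params is hypothetical, and the assumption that ssl_mode is passed straight through as sslmode is ours for this example; the StorageLoader's own code may differ.

```python
VALID_SSL_MODES = ("disable", "require", "verify-ca", "verify-full")


def connection_params(target: dict) -> dict:
    """Map a StorageLoader-style target onto libpq connection keywords."""
    # "disable" is documented above as the default value.
    ssl_mode = target.get("ssl_mode", "disable")
    if ssl_mode not in VALID_SSL_MODES:
        raise ValueError(
            f"ssl_mode must be one of {VALID_SSL_MODES}, got {ssl_mode!r}"
        )
    return {
        "host": target["host"],
        "port": target["port"],
        "dbname": target["database"],
        "user": target["username"],
        "password": target["password"],
        "sslmode": ssl_mode,  # assumption: passed through unchanged
    }
```

Rejecting unknown values up front, as this sketch does, surfaces a typo such as "verify_full" at configuration time rather than at connection time.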

8. New approach to atomic.events upgrades

Starting in this release, we are taking a new approach to upgrading the atomic.events table. Previous upgrades would typically rename the existing table as “atomic.events_{{old version}}”, create a new table with the new structure and copy all events over.

From this release onwards, our upgrades to atomic.events will only ever mutate the existing table in place using ALTER statements. This is intended to make upgrades to existing Redshift databases much faster.

To prevent confusion about the version of a particular atomic.events table, the table creation and migration scripts now add the version to the table as a comment using the COMMENT statement.
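
As an illustration of what such an in-place migration looks like, the Python sketch below generates a few statements of this kind for the timestamp changes described earlier. It is not the actual migration script: the function name migration_statements is hypothetical, the column list is deliberately incomplete, and the column types and the version string are placeholders.

```python
from typing import List


def migration_statements(table: str, new_version: str) -> List[str]:
    """Build an illustrative, incomplete in-place migration for a table."""
    return [
        # Rename in place rather than copying every event to a new table
        f"ALTER TABLE {table} RENAME COLUMN dvce_tstamp TO dvce_created_tstamp;",
        # Placeholder column types; the real script may use different
        # types or encodings
        f"ALTER TABLE {table} ADD COLUMN derived_tstamp timestamp;",
        f"ALTER TABLE {table} ADD COLUMN true_tstamp timestamp;",
        # Record the table version so it can be checked later
        f"COMMENT ON TABLE {table} IS '{new_version}';",
    ]


# "X.Y.Z" is a placeholder, not the real table version
statements = migration_statements("atomic.events", "X.Y.Z")
```

Because every statement alters the existing table, no events are copied, which is where the speed-up over the old rename-and-copy approach comes from.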

9. Other improvements

We have also:

  • Upgraded Scala Hadoop Shred to use Hadoop version 2.4 (#1720)
  • Added validation for v_collector and collector_tstamp (#1611)
  • Upgraded to version 0.2.4 of the referer-parser (#1839)
  • Upgraded to version 1.16 of user-agent-utils (#1905)
  • Changed the BadRow class to use ProcessingMessages rather than Strings (#1936)
  • Added an exception handler around the whole of Scala Common Enrich (#1954)
  • Updated our web-incremental data models so that failure is recoverable (#1974)
  • Fixed a bug where Scala Hadoop Enrich didn’t correctly attach the original Thrift payloads to bad rows (#1950)

10. Upgrading

Installing EmrEtlRunner and StorageLoader

The latest versions of the EmrEtlRunner and StorageLoader are available from our Bintray here.

Unzip this file to a sensible location (e.g. /opt/snowplow-r71).

Updating the configuration files

You should update the versions of the Enrich and Shred jars in your configuration file:

    hadoop_enrich: 1.1.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.5.0 # Version of the Hadoop Shredding process

You should also update the AMI version field:

    ami_version: 3.7.0

For each of your database targets, you must add the new ssl_mode field:

  targets:
    - name: "My Redshift database"
      ...
      ssl_mode: disable # One of disable (default), require, verify-ca or verify-full

If you wish to use the new event fingerprint enrichment, write a configuration JSON and add it to your enrichments folder. The example JSON can be found here.

Updating your database

Use the appropriate migration script to update your version of the atomic.events table to the latest schema:

If you are ingesting CloudFront access logs with Snowplow, use the CloudFront access log migration script to update your com_amazon_aws_cloudfront_wd_access_log_1 table.

11. Getting help

For more details on this release, please check out the R71 Stork-Billed Kingfisher release notes on GitHub.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.