Snowplow 0.9.1 released with initial JSON support

11 April 2014  •  Alex Dean

We are hugely excited to announce the immediate availability of Snowplow 0.9.1. This release introduces initial support for JSON-based custom unstructured events and custom contexts in the Snowplow Enrichment and Storage processes; this is the most-requested feature from our community and a key building block for mobile and app event tracking in Snowplow.

Snowplow’s event trackers have supported custom unstructured events and custom contexts for some time, but prior to 0.9.1 there was no way of working with these JSON-based objects “downstream” in the rest of the Snowplow data pipeline. This release adds preliminary support, as follows:

  1. Parse incoming custom unstructured events and contexts to ensure that they are valid JSON
  2. Where possible, clean up the JSON (e.g. remove whitespace)
  3. Store the JSON as json-type fields in Postgres, and in large varchar fields in Redshift
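As a rough sketch, steps 1 and 2 amount to the following in Python (an illustration only; the real Enrichment process is not written in Python, and the event payload here is hypothetical):

```python
import json

def clean_json(raw):
    """Validate a raw JSON string and return a minified copy,
    or None if the string is not valid JSON."""
    try:
        parsed = json.loads(raw)
    except ValueError:
        return None
    # Re-serialize with no whitespace between tokens
    return json.dumps(parsed, separators=(",", ":"))

# A hypothetical unstructured event payload
cleaned = clean_json('{ "product_id": "ASO01043", "price": 49.95 }')
# cleaned == '{"product_id":"ASO01043","price":49.95}'
```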

As well as this new JSON-based functionality, 0.9.1 includes a host of additional features and updates, discussed below.

In the rest of this post we will cover:

  1. Unstructured events and custom contexts
  2. VPC support on EMR
  3. Tracker Protocol-related improvements
  4. Other improvements
  5. Upgrading
  6. Getting help
  7. Roadmap

1. Unstructured events and custom contexts

Unstructured events are stored in two new fields:

  • ue_name holds the name of the unstructured event
  • ue_properties holds the JSON object containing the name: value properties for this event

Custom contexts are stored in one new field: contexts.

In Postgres, ue_properties and contexts are columns of data type json, which is available in PostgreSQL 9.2 upwards. In Redshift, they are columns of data type varchar(10000), which should be plenty for most purposes. If an incoming JSON is longer than 10,000 characters, the row is rejected, to prevent truncated (i.e. corrupted) JSONs from being loaded into Redshift.
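The length check amounts to something like this (a simplified sketch, not the actual Storage code):

```python
REDSHIFT_VARCHAR_LIMIT = 10000  # matches the varchar(10000) column size

def fits_in_redshift(json_string):
    """Return True if the JSON can be loaded without truncation."""
    return len(json_string) <= REDSHIFT_VARCHAR_LIMIT
```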

If you want to try out the new functionality, the first step is to start generating unstructured events and/or custom contexts from your tracker; for more information, see the documentation for your tracker.

Once you have your unstructured events and contexts flowing through into Postgres or Redshift, you can use those databases’ native JSON capabilities to explore the data.

2. VPC support on EMR

In December 2013, Amazon rolled out a new Virtual Private Cloud (VPC) system, and Elastic MapReduce now runs on top of it; this change has been causing problems with EmrEtlRunner for some Snowplow users. We have updated EmrEtlRunner with a new :ec2_subnet_id: setting:

:emr:
  # Can bump the below as EMR upgrades Hadoop
  :hadoop_version: 1.0.3
  :placement: ADD HERE     # Set even if running in VPC
  :ec2_subnet_id: ADD HERE # Leave blank if not running in VPC

Please set :ec2_subnet_id: if you are running Elastic MapReduce inside a named VPC, and please continue to set :placement: even if running within a VPC.

As an added bonus, in this release EmrEtlRunner now runs all jobs with the visible_to_all_users flag set, which should make debugging your jobs a little easier. Many thanks to community member Ryan Doherty for this suggestion.

3. Tracker Protocol-related improvements

We have made a small number of improvements around the Snowplow Tracker Protocol:

  • Platform codes - we have now added support in the Enrichment process for the full range of platform codes specified in the Snowplow Tracker Protocol. Many thanks to community member Andrew Lombardi for this contribution!
  • Tracker namespacing - the new Tracker Protocol field tna is populated as name_tracker in our Storage targets. This is to support tracker namespacing, which is coming soon to our JavaScript Tracker.
  • Event vendoring - the new Tracker Protocol field evn populates through to event_vendor. Previously, event_vendor was hardcoded to “com.snowplowanalytics”.
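To make the new fields concrete, here is a hedged sketch of how these querystring parameters could map to storage columns (the tna, evn, and p parameter names come from the Tracker Protocol; the helper function and example values are ours, not Snowplow code):

```python
from urllib.parse import parse_qsl

# Tracker Protocol parameter -> storage column, per the notes above
FIELD_MAP = {
    "tna": "name_tracker",  # tracker namespace
    "evn": "event_vendor",  # event vendor
    "p":   "platform",      # platform code
}

def map_params(querystring):
    """Translate raw Tracker Protocol parameters to storage column names,
    dropping anything unrecognized."""
    return {FIELD_MAP[k]: v for k, v in parse_qsl(querystring) if k in FIELD_MAP}

row = map_params("tna=my-tracker&evn=com.acme&p=mob")
# row == {"name_tracker": "my-tracker", "event_vendor": "com.acme", "platform": "mob"}
```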

4. Other improvements

The other updates in this release are as follows:

  • We have added the raw page_url and page_referrer URIs into the Storage targets, alongside the existing URI-component fields
  • We have updated the StorageLoader so that dvce_timestamp values outside of the standard range can be loaded into Redshift
  • We have updated the event_id field, which contains a UUID, from varchar(38) to char(36)
  • We have changed the DISTKEY for atomic.events in Redshift to be event_id, to optimize for table JOINs which are coming in future Snowplow releases
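The event_id change follows from the canonical UUID text format, which is always exactly 36 characters (32 hex digits plus 4 hyphens); a quick Python check:

```python
import uuid

# Canonical UUID string, e.g. "f81d4fae-7dec-11d0-a765-00a0c91e6bf6"
event_id = str(uuid.uuid4())
assert len(event_id) == 36  # hence char(36), not varchar(38)
```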

5. Upgrading

Upgrading is a three-step process:

  1. Update EmrEtlRunner
  2. Update StorageLoader
  3. Upgrade atomic.events

Let’s take these in turn:

You need to update EmrEtlRunner to the latest code (the 0.9.1 release) on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.9.1
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment

You also need to update the config.yml file for EmrEtlRunner to use the latest version of the Hadoop ETL (0.4.0):

:snowplow:
  :hadoop_etl_version: 0.4.0

Don’t forget to add in the new subnet (VPC) argument too:

:emr:
  ...
  :ec2_subnet_id: ADD HERE # Leave blank if not running in VPC

To see a complete example of the EmrEtlRunner config.yml file, see the GitHub repository.

You need to upgrade your StorageLoader installation to the latest code (0.9.1) on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.9.1
$ cd 4-storage/storage-loader
$ bundle install --deployment

We have updated the Redshift and Postgres table definitions for atomic.events. You can find the latest versions in the GitHub repository, along with migration scripts to handle the upgrade from recent prior versions. Please review any migration script carefully before running and check that you are happy with how it handles the upgrade.

  Database   Table definition   Migration script
  Redshift   0.3.0              Migrate from 0.2.2
  Postgres   0.2.0              Migrate from 0.1.x

And that’s it! Your upgrade should now be complete.

6. Getting help

As always, if you do run into any issues or don’t understand any of the above changes, please raise an issue or get in touch with us via the usual channels.

For more details on this release, please check out the 0.9.1 Release Notes on GitHub.

7. Roadmap

We are just getting started with our support for custom unstructured events and custom contexts in Snowplow! In coming releases we plan to:

  • Allow you to define the structure of your unstructured events and custom contexts using JSON Schema
  • Add support for validating your unstructured events and contexts against your own JSON Schemas
  • Automatically “shred” your unstructured events and contexts into dedicated Redshift and Postgres tables using JSON Path
  • Add new event types (e.g. link clicks) to Snowplow using custom unstructured events, rather than by extending the Tracker Protocol further
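To give a flavour of where this is heading, here is a toy illustration of JSON Schema-style validation; it handles only a tiny subset of the spec ("type": "object" and "required") and is in no way Snowplow's planned implementation:

```python
import json

def validate(instance, schema):
    """Toy check of an instance against a minimal JSON Schema subset."""
    if schema.get("type") == "object" and not isinstance(instance, dict):
        return False
    return all(key in instance for key in schema.get("required", []))

# Hypothetical schema for a product-view unstructured event
schema = {"type": "object", "required": ["product_id", "price"]}
event = json.loads('{"product_id": "ASO01043", "price": 49.95}')
assert validate(event, schema)
assert not validate({"product_id": "ASO01043"}, schema)
```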

So there is a huge amount planned! We are super excited about Snowplow being the first open-source analytics platform to make the leap into unstructured event analytics.

Stay tuned for further updates on this - and if you would like to read up on what is coming soon, we would encourage you to check out this excellent guide to JSON Schema (PDF).