Snowplow 0.9.1 released with initial JSON support

We are hugely excited to announce the immediate availability of Snowplow 0.9.1. This release introduces initial support for JSON-based custom unstructured events and custom contexts in the Snowplow Enrichment and Storage processes; this is the most-requested feature from our community and a key building block for mobile and app event tracking in Snowplow.

Snowplow’s event trackers have supported custom unstructured events and custom contexts for some time, but prior to 0.9.1 there was no way of working with these JSON-based objects “downstream” in the rest of the Snowplow data pipeline. This release adds preliminary support as follows:

  1. Parse incoming custom unstructured events and contexts to ensure that they are valid JSON
  2. Where possible, clean up the JSON (e.g. remove whitespace)
  3. Store the JSON as json-type fields in Postgres, and in large varchar fields in Redshift
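
The first two steps can be sketched in a few lines of Python. This is only an illustration of the validate-then-compact idea, not the actual Snowplow Enrichment code, which is not shown in this post; the function name clean_json is hypothetical.

```python
import json
from typing import Optional


def clean_json(raw: str) -> Optional[str]:
    """Return a compact re-serialisation of raw, or None if it is not valid JSON.

    Hypothetical helper for illustration only.
    """
    try:
        parsed = json.loads(raw)  # step 1: parse to confirm the input is valid JSON
    except (TypeError, ValueError):
        return None  # invalid JSON (or not a string at all)
    # Step 2: re-serialise without the insignificant whitespace
    return json.dumps(parsed, separators=(",", ":"))
```

For example, clean_json('{ "a" : 1,  "b": [1, 2] }') returns '{"a":1,"b":[1,2]}', while clean_json('{not json') returns None.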

Alongside this new JSON-based functionality, 0.9.1 includes a host of additional features and updates, which are also discussed below.

In the rest of this post we will cover:

  1. Unstructured events and custom contexts
  2. VPC support on EMR
  3. Tracker Protocol-related improvements
  4. Other improvements
  5. Upgrading
  6. Getting help
  7. Roadmap

Unstructured events are stored in two new fields:

Custom contexts are stored in one new field: contexts.

In Postgres, ue_properties and contexts are columns of data type json, which is available in PostgreSQL 9.2 upwards. In Redshift, ue_properties and contexts are columns of data type varchar(10000), which should be plenty for most purposes. If an incoming JSON is longer than 10,000 characters, the row is rejected to prevent truncated (i.e. corrupted) JSONs from being loaded into Redshift.
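
The "reject rather than truncate" rule for Redshift can be sketched as a simple size check, which may be handy if you want to spot oversized JSONs before they reach the loader. The 10,000 limit comes from the column definition above; the function name fits_redshift is hypothetical, and this is not the check Snowplow itself performs. Note that Redshift measures varchar lengths in bytes, so the sketch counts the UTF-8 encoding rather than Python characters.

```python
REDSHIFT_JSON_LIMIT = 10000  # from the varchar(10000) column definition


def fits_redshift(json_text: str, limit: int = REDSHIFT_JSON_LIMIT) -> bool:
    """Return True if json_text would fit in the column without truncation.

    Hypothetical helper for illustration only.
    """
    # Redshift varchar lengths are counted in bytes, so measure the UTF-8 encoding
    return len(json_text.encode("utf-8")) <= limit
```

A row whose ue_properties or contexts value fails this check would be rejected outright, because a JSON cut off at the limit would no longer parse.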

If you want to try out the new functionality, the first step is to start generating unstructured events and/or custom contexts from your tracker. For more information:

Once you have your unstructured events and contexts flowing through into Postgres or Redshift, you can then use those databases’ JSON capabilities to explore the data:

In December 2013 Amazon implemented a new VPC system and Elastic MapReduce now maps to this; this has been causing problems with EmrEtlRunner for some Snowplow users. We have updated EmrEtlRunner to have a new setting, :ec2_subnet_id:

    :emr:
      # Can bump the below as EMR upgrades Hadoop
      :hadoop_version: 1.0.3
      :placement: ADD HERE # Set even if running in VPC
      :ec2_subnet_id: ADD HERE # Leave blank if not running in VPC

Please set :ec2_subnet_id: if you are running Elastic MapReduce inside a named VPC. Also, please continue to set :placement: even if running within a VPC.

As an added bonus, in this release EmrEtlRunner now runs all jobs with the visible_to_all_users flag set, which should make debugging your jobs a little easier. Many thanks to community member Ryan Doherty for this suggestion.

We have made a small number of improvements around the Snowplow Tracker Protocol:

The other updates in this release are as follows:

Upgrading is a three-step process:

  1. Update EmrEtlRunner
  2. Update StorageLoader
  3. Upgrade atomic.events

Let’s take these in turn:

You need to update EmrEtlRunner to the latest code (0.9.1 release) on GitHub:

    $ git clone git://github.com/snowplow/snowplow.git
    $ cd snowplow
    $ git checkout 0.9.1
    $ cd 3-enrich/emr-etl-runner
    $ bundle install --deployment

You also need to update the config.yml file for EmrEtlRunner to use the latest version of the Hadoop ETL (0.4.0):

    :snowplow:
      :hadoop_etl_version: 0.4.0

Don’t forget to add in the new subnet (VPC) argument too:

    :emr:
      ...
      :ec2_subnet_id: ADD HERE # Leave blank if not running in VPC

For a complete example of the EmrEtlRunner config.yml file, see the GitHub repo.

You need to upgrade your StorageLoader installation to the latest code (0.9.1) on GitHub:

    $ git clone git://github.com/snowplow/snowplow.git
    $ cd snowplow
    $ git checkout 0.9.1
    $ cd 4-storage/storage-loader
    $ bundle install --deployment

We have updated the Redshift and Postgres table definitions for atomic.events. You can find the latest versions in the GitHub repository, along with migration scripts to handle the upgrade from recent prior versions. Please review any migration script carefully before running and check that you are happy with how it handles the upgrade.

    Database   Table definition   Migration script
    Redshift   0.3.0              Migrate from 0.2.2
    Postgres   0.2.0              Migrate from 0.1.x

And that’s it! Your upgrade should now be complete.

As always, if you do run into any issues or don’t understand any of the above changes, please raise an issue or get in touch with us via the usual channels.

For more details on this release, please check out the 0.9.1 Release Notes on GitHub.

We are just getting started with our support for custom unstructured events and custom contexts in Snowplow! In coming releases we plan to:

So a huge amount planned! We are super excited about Snowplow being the first open source analytics platform to make the leap into unstructured event analytics.

Stay tuned for further updates on this, and if you would like to read up on what is coming soon, we would encourage checking out this excellent guide to JSON Schema (PDF).
