Snowplow 0.9.1 released with initial JSON support
We are hugely excited to announce the immediate availability of Snowplow 0.9.1. This release introduces initial support for JSON-based custom unstructured events and custom contexts in the Snowplow Enrichment and Storage processes; this is the most-requested feature from our community and a key building block for mobile and app event tracking in Snowplow.
Snowplow’s event trackers have supported custom unstructured events and custom contexts for some time, but prior to 0.9.1 there had been no way of working with these JSON-based objects “downstream” in the rest of the Snowplow data pipeline. This release adds preliminary support as follows:
- Parse incoming custom unstructured events and contexts to ensure that they are valid JSON
- Where possible, clean up the JSON (e.g. remove whitespace)
- Store the JSON as `json`-type fields in Postgres, and in large `varchar` fields in Redshift
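To make the first two steps concrete, here is a minimal Python sketch of validating a JSON string and stripping its insignificant whitespace. It is an illustration only, not the Snowplow Enrichment code, and the function name `clean_json` is hypothetical.

```python
import json


def clean_json(raw):
    """Return a compact JSON string, or None if `raw` is not valid JSON.

    `clean_json` is a hypothetical helper for illustration; it is not
    part of Snowplow.
    """
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError; TypeError
        # covers non-string input such as None.
        return None
    # Compact separators drop the whitespace json.dumps adds by default.
    return json.dumps(parsed, separators=(",", ":"))
```

Invalid input yields `None` rather than raising, so the caller can decide how to handle a bad row.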
Alongside this new JSON-based functionality, 0.9.1 includes a host of additional features and updates, which are discussed below.
In the rest of this post we will cover:
- Unstructured events and custom contexts
- VPC support on EMR
- Tracker Protocol-related improvements
- Other improvements
- Getting help
Unstructured events are stored in two new fields:
- `ue_name` holds the name of the unstructured event
- `ue_properties` holds the JSON object containing the name: value properties for this event
Custom contexts are stored in one new field:
- `contexts` holds the JSON containing the custom contexts attached to this event
In Postgres, `ue_properties` and `contexts` are columns of data type `json`, which is available in PostgreSQL 9.2 upwards. In Redshift, `ue_properties` and `contexts` are columns of data type `varchar(10000)`, which should be plenty for most purposes. If an incoming JSON is longer than 10,000 characters, the row is rejected to prevent truncated (i.e. corrupted) JSON from being loaded into Redshift.
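The Redshift length rule can be sketched in Python as a simple pre-load check. This is an illustration under assumptions, not the StorageLoader's actual logic: the constant and function names are hypothetical, and the sketch counts characters as described above. Redshift itself measures `varchar` lengths in bytes, so a stricter check would measure the UTF-8 encoded length instead.

```python
# Hypothetical names for illustration; the limit mirrors varchar(10000).
REDSHIFT_JSON_MAX_CHARS = 10000


def fits_redshift_json_column(json_text, limit=REDSHIFT_JSON_MAX_CHARS):
    """Return True if the JSON string is short enough to load untruncated."""
    return len(json_text) <= limit


def split_loadable(json_rows):
    """Partition JSON strings into (loadable, rejected) lists."""
    loadable, rejected = [], []
    for row in json_rows:
        target = loadable if fits_redshift_json_column(row) else rejected
        target.append(row)
    return loadable, rejected
```

Rejecting the whole row, rather than truncating the value, keeps every JSON that does reach the table parseable.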
If you want to try out the new functionality, the first step is to start generating unstructured events and/or custom contexts from your tracker. For more information:
Once you have your unstructured events and contexts flowing through into Postgres or Redshift, you can then use those databases’ JSON capabilities to explore the data:
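As one illustration for Redshift, the Python sketch below builds a query that pulls a single property out of `ue_properties` using Redshift's `json_extract_path_text` function. The helper name `build_property_query`, the example property key and event name, and the `%s` placeholder style (as used by drivers such as psycopg2) are all assumptions made for illustration.

```python
import re

# Only allow simple keys, because the key is interpolated into the SQL text.
_SAFE_KEY = re.compile(r"[A-Za-z0-9_]+")


def build_property_query(property_key):
    """Build a Redshift query extracting one property from ue_properties.

    Hypothetical helper: the event name is left as a %s placeholder for
    the database driver to bind.
    """
    if not _SAFE_KEY.fullmatch(property_key):
        raise ValueError("unsupported property key: %r" % (property_key,))
    return (
        "SELECT event_id, "
        "json_extract_path_text(ue_properties, '{key}') AS property_value "
        "FROM atomic.events "
        "WHERE ue_name = %s"
    ).format(key=property_key)


# Example usage with a hypothetical property and event name:
# sql = build_property_query("price")
# cursor.execute(sql, ("add_to_basket",))
```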
In December 2013 Amazon implemented a new VPC system and Elastic MapReduce now maps to this; this has been causing problems with EmrEtlRunner for some Snowplow users. We have updated EmrEtlRunner with a new setting, `:ec2_subnet_id:`, which you should set if you are running Elastic MapReduce inside a named VPC. Also, please continue to set `:placement:` even if running within a VPC.
As an added bonus, in this release EmrEtlRunner now runs all jobs with the `visible_to_all_users` flag set, which should make debugging your jobs a little easier. Many thanks to community member Ryan Doherty for this suggestion.
We have made a small number of improvements around the Snowplow Tracker Protocol:
- Platform codes - we have now added support in the Enrichment process for the full range of platform codes specified in the Snowplow Tracker Protocol. Many thanks to community member Andrew Lombardi for this contribution!
- Tracker namespacing - new Tracker Protocol field `tna` is populated as
- Event vendoring - new Tracker Protocol field `evn` populates through to `event_vendor`; previously `event_vendor` was hardcoded to “com.snowplowanalytics”
The other updates in this release are as follows:
- We have added the raw `page_referrer` URIs into the Storage targets, alongside the existing URI-component fields
- We have updated the StorageLoader so that `dvce_timestamp` values outside of the standard range can be loaded into Redshift
- We have updated the `event_id` field, which contains a UUID, from
- We have changed the `atomic.events` in Redshift to be `event_id`, to optimize for table JOINs which are coming in future Snowplow releases
Upgrading is a three-step process:
Let’s take these in turn:
You need to update EmrEtlRunner to the latest code (0.9.1 release) on GitHub:
You also need to update the `config.yml` file for EmrEtlRunner to use the latest version of the Hadoop ETL (0.4.0):
Don’t forget to add in the new subnet (VPC) argument too:
For a complete example of the EmrEtlRunner `config.yml` file, see the GitHub repo.
You need to upgrade your StorageLoader installation to the latest code (0.9.1) on GitHub:
We have updated the Redshift and Postgres table definitions for `atomic.events`. You can find the latest versions in the GitHub repository, along with migration scripts to handle the upgrade from recent prior versions. Please review any migration script carefully before running it, and check that you are happy with how it handles the upgrade.
| Database | Table definition | Migration script |
|----------|------------------|------------------|
| Redshift | 0.3.0 | Migrate from 0.2.2 |
| Postgres | 0.2.0 | Migrate from 0.1.x |
And that’s it! Your upgrade should now be complete.
For more details on this release, please check out the 0.9.1 Release Notes on GitHub.
We are just getting started with our support for custom unstructured events and custom contexts in Snowplow! In coming releases we plan to:
- Allow you to define the structure of your unstructured events and custom contexts using JSON Schema
- Add support for validating your unstructured events and contexts against your own JSON Schemas
- Automatically “shred” your unstructured events and contexts into dedicated Redshift and Postgres tables using JSON Path
- Add new event types (e.g. link clicks) to Snowplow using custom unstructured events, rather than by extending the Tracker Protocol further
So a huge amount planned! We are super excited about Snowplow being the first open source analytics platform to make the leap into unstructured event analytics.
Stay tuned for further updates on this - and if you would like to read up on what is coming soon, we would encourage you to check out this excellent guide to JSON Schema (PDF).