With the release of Snowplow 0.9.1 back in April, we were able to load unstructured events and custom contexts as JSONs into dedicated fields in our “fat” events table in Postgres and Redshift. This was a good start, but we wanted to do much more, particularly:
We designed Snowplow 0.9.5 to deliver on these goals, working in concert with our recent Iglu 0.1.0 release.
Read on below the fold for:
There are three great use cases for our new shredding functionality:
If you are not interested in using the new shredding functionality, that's fine too: both EmrEtlRunner and StorageLoader now support a new `--skip shred` option.
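As a sketch, opting out might look like the following; the `--skip shred` flag is from this release, but the binary and config paths are assumptions you should adjust for your own installation:

```shell
# Assumed locations: adjust to wherever your Snowplow apps and config live.
RUNNER=./snowplow-emr-etl-runner
LOADER=./snowplow-storage-loader
CONFIG=config/config.yml

# Pass --skip shred to both apps to bypass the new shredding step:
ETL_CMD="$RUNNER --config $CONFIG --skip shred"
LOAD_CMD="$LOADER --config $CONFIG --skip shred"
echo "$ETL_CMD"
echo "$LOAD_CMD"
```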
It may be helpful to understand how our shredding process is architected. You can see it laid out on the right-hand side of this diagram, highlighted in blue:
The shredding process consists of two parts:

1. Scala Hadoop Shred, a new Hadoop job which validates the incoming JSONs and shreds them ready for loading
2. An updated StorageLoader, which loads the shredded types into Redshift using Redshift's `COPY FROM JSON` functionality
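To illustrate the loading side: what StorageLoader issues is a standard Redshift `COPY` pointed at a JSONPaths file. The table name, bucket paths and credentials below are placeholders, not Snowplow's actual names:

```sql
-- Illustrative only: table, S3 paths and credentials are placeholders.
COPY atomic.com_acme_link_click_1
FROM 's3://my-shredded-bucket/good/run=2014-07-26/com.acme/link_click/jsonschema/1-'
CREDENTIALS 'aws_access_key_id=<ACCESS-KEY>;aws_secret_access_key=<SECRET-KEY>'
JSON AS 's3://my-jsonpaths-bucket/com.acme/link_click_1.json';
```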
The architecture is discussed further in the Shredding wiki page.
One of the most exciting things about the new JSON validation and shredding functionality in 0.9.5 is the ability for us to add new events to Snowplow without having to modify the existing codebase.
In this release we are bundling the following new event types and contexts:
Each of these new events/contexts includes:
In Upgrading below we cover how to add support for these new event types to your Snowplow installation.
We have deliberately kept other new functionality in this release to a minimum.
In the 1-trackers sub-folder on the Snowplow repo, we have updated git submodules to point to the latest tracker releases, and also added new entries for the new trackers released recently, namely the Ruby and Java Trackers.
We have made a small number of enhancements to EmrEtlRunner:
The new `--skip s3distcp` option lets you skip reading from and writing to HDFS, i.e. the Hadoop jobs will read from and write directly to S3
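For example, a run that bypasses the HDFS copy steps might be invoked as follows; the flag is from this release, while the binary and config paths are assumptions:

```shell
# Assumed paths; substitute your own EmrEtlRunner location and config.
RUNNER=./snowplow-emr-etl-runner
CMD="$RUNNER --config config/config.yml --skip s3distcp"
echo "$CMD"
```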
We have made one small improvement to StorageLoader: Redshift `COPY` statements now include the `ACCEPTINVCHARS` option, so that event data can be loaded into VARCHAR columns even if it contains invalid UTF-8 characters.
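The effect is the kind of `COPY` sketched below, where everything except the `ACCEPTINVCHARS` clause is a placeholder. Without the clause, a row containing invalid UTF-8 aborts the load; with it, Redshift substitutes a replacement character and continues:

```sql
-- Illustrative only: table, S3 path and credentials are placeholders.
COPY atomic.events
FROM 's3://my-events-bucket/good/run=2014-07-26/'
CREDENTIALS 'aws_access_key_id=<ACCESS-KEY>;aws_secret_access_key=<SECRET-KEY>'
ACCEPTINVCHARS;
```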
You need to update EmrEtlRunner to the latest code (0.9.5 release) on GitHub:
You also need to update the `config.yml` file for EmrEtlRunner. You can find an example of this file in GitHub here. For more information on how to populate the new configuration file correctly, see the Configuration section of the EmrEtlRunner setup guide.
You need to upgrade your StorageLoader installation to the latest code (0.9.5) on GitHub:
You also need to update the `config.yml` file for StorageLoader. You can find an example of this file in GitHub:
For more information on how to populate the new configuration file correctly, see the Configuration section of the StorageLoader setup guide.
If you want to add support for the new Snowplow-authored events (e.g. link clicks) to your Snowplow installation, this is a two-step process:
Snowplow 0.9.5 lets you define your own custom unstructured events and contexts, and configure Snowplow to process these from collection through to Redshift and even Looker.
Setting this up is outside of the scope of this release blog post. We have documented the process on our wiki, split into two pages:
Validating and shredding JSONs is a young and fast-evolving area - Snowplow 0.9.5 is only our first release here and so it is important to manage expectations on what it can and cannot do yet.
First off: while the validation and shredding process (Scala Hadoop Shred) works regardless of the ultimate storage target (Redshift, Postgres or S3), at this time we are only able to load shredded types into Redshift. Postgres does not have an analog to `COPY FROM JSON`, so significant additional work would be required to support loading shredded types into Postgres.
Secondly, as you will see from the documentation, setting up a new shredded type (from JSON Schema through to Redshift table definition) is a very manual process. We hope to simplify this in the future.
Thirdly, our shredding process does not (yet) support nested objects. In other words, we can only shred a given JSON instance into one row in one table, not N tables (plus potentially multiple rows per table for arrays). This is something we plan to explore soon.
Finally, although this is more a limitation of Iglu: we do not currently support private (i.e. authenticated) Iglu schema repositories. In the meantime, we recommend practising "privacy through obscurity", i.e. hosting your schema repository at a URI nobody else knows.
The shredding functionality in Snowplow 0.9.5 is very new and experimental - we’re excited to see how it plays out and look forward to the community’s feedback.
The main documentation on shredding is all on the wiki:
As always, if you do run into any issues or don’t understand any of the above changes, please raise an issue or get in touch with us via the usual channels.
For more details on this release, please check out the 0.9.5 Release Notes on GitHub.