We are hugely excited to announce the release of Iglu, our first new product since launching our Snowplow prototype two and a half years ago.
Iglu is a machine-readable schema repository initially supporting JSON Schemas. It is a key building block of the next Snowplow release, 0.9.5, which will validate incoming unstructured events and custom contexts using JSON Schema.
As far as we know, Iglu is the first machine-readable schema repository for JSON Schema, and the first technology which supports schema resolution from multiple public and (soon) private schema repositories.
In the rest of this post we will cover:
Snowplow is evolving from a web analytics platform into a general event analytics platform, supporting events coming from mobile apps, the internet of things, games, cars, connected TVs and so forth. This means an explosion in the variety of events that Snowplow needs to support: games saved, clusters started, tills opened, cars started - in fact the potential variety of events is almost infinite.
Historically, there have been two approaches to dealing with the explosion of possible event types:
Custom variables as used by Google Analytics, SiteCatalyst, Piwik and other web analytics packages are extremely limited - we plan to explore these limitations in a future blog post. Schema-less JSONs as offered by Mixpanel, KISSmetrics and others are much more powerful, but they have a different set of problems:
The issues illustrated above primarily relate to the lack of a defined schema for these events as they flow into and then through the analytics system. More generally, we could say that the problem is that the original schemas have been lost. The entities snapshotted in an event typically started life as Active Record models, Protocol Buffers, Backbone.js models, (N)Hibernate objects or similar (and before that, often as RDBMS or NoSQL records). In other words, they started life with a schema, but that schema has been discarded on ingest into the analytics system.
As a result, the business analyst or data scientist typically has to maintain a mental model of the source data schemas when using the analytics system:
This is a hugely error-prone and wasteful exercise:
The obvious answer was to introduce JSON Schema support for all JSONs sent in to Snowplow - i.e. unstructured events and custom contexts. JSON Schema is a standard for describing a JSON data format; it supports validating that a given JSON conforms to a given JSON Schema.
But as we started to experiment with JSON Schema, it became obvious that JSON Schema was just one building block: there were several other pieces we needed, none of which seemed to exist already. In defining and building these missing pieces, Iglu was born.
As you’ve seen, we made the design decision that whenever a developer or analyst wanted to send in any JSON to Snowplow, they should first create a JSON Schema for that event. Here is an example JSON Schema for a video_played event based on the Mixpanel example above:
(Note that this is actually a self-describing JSON Schema.)
We made a further design decision that the JSON sent in to Snowplow should report the exact JSON Schema that could be used to validate it. Rather than embed the JSON Schema inside the JSON, which would be extremely wasteful of space, we came up with a convenient short-hand that looked like this:
We called this format a self-describing JSON. The iglu: entry is what we are calling an Iglu “schema key”, consisting of the following parts:
We explained the origins of SchemaVer, our schema versioning system, in our blog post Introducing SchemaVer for semantic versioning of schemas.
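To make the schema key concrete, here is a minimal Python sketch of pulling apart a self-describing JSON. It assumes the key layout iglu:vendor/name/format/version, a SchemaVer version of the form MODEL-REVISION-ADDITION, and a wrapper object with "schema" and "data" fields. The com.acme vendor, the payload, and the names SchemaKey, parse_schema_key and split_self_describing are hypothetical, chosen for illustration rather than taken from any Iglu client.

```python
import json
import re
from typing import Any, NamedTuple, Tuple


class SchemaKey(NamedTuple):
    """The parts of an Iglu schema key (hypothetical helper type)."""
    vendor: str
    name: str
    schema_format: str
    model: int
    revision: int
    addition: int


# Assumed layout: iglu:<vendor>/<name>/<format>/<MODEL>-<REVISION>-<ADDITION>
_KEY_PATTERN = re.compile(
    r"^iglu:(?P<vendor>[A-Za-z0-9_.-]+)/(?P<name>[A-Za-z0-9_-]+)/"
    r"(?P<fmt>[A-Za-z0-9_-]+)/(?P<model>\d+)-(?P<revision>\d+)-(?P<addition>\d+)$"
)


def parse_schema_key(uri: str) -> SchemaKey:
    """Split a schema key string into vendor, name, format and version parts."""
    match = _KEY_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Not a valid Iglu schema key: {uri!r}")
    parts = match.groupdict()
    return SchemaKey(
        parts["vendor"],
        parts["name"],
        parts["fmt"],
        int(parts["model"]),
        int(parts["revision"]),
        int(parts["addition"]),
    )


def split_self_describing(raw: str) -> Tuple[SchemaKey, Any]:
    """Return the parsed schema key and the payload of a self-describing JSON."""
    doc = json.loads(raw)
    return parse_schema_key(doc["schema"]), doc["data"]


# Hypothetical event; the vendor and the payload field are made up.
example = (
    '{"schema": "iglu:com.acme/video_played/jsonschema/1-0-0",'
    ' "data": {"length": 213}}'
)
key, data = split_self_describing(example)
print(key.vendor, key.name, key.schema_format, key.model, key.revision, key.addition)
```

Keeping the version as three integers rather than a string makes it straightforward to compare two versions of the same schema later on.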
Next, we needed somewhere to store JSON Schemas like video_played above - a home for schemas where:
It became obvious that we needed some kind of “registry” or “repository” of schemas:
As we worked on Snowplow 0.9.5, we were able to firm up a set of core requirements for our schema repository:
With this laundry list of requirements, we started to look at what open-source software was already available.
We looked to see if there were any existing solutions around schema registries or repositories for JSON Schema or other schema systems.
We found very little in the way of schema systems for JSON Schema or XML: for JSON Schema we only found this static repository sample-json-schemas by Francis Galiegue, one of the JSON Schema authors. Googling for “XML schema repository” turned up very little: only xml.org, but this seemed to be article-oriented rather than machine-readable.
By contrast, the Apache Avro community seemed ahead of the pack. We found two projects to develop machine-readable schema repositories for Avro:
The main differences we could ascertain between our requirements and the Avro efforts were as follows:
Given these differences, we decided to take the learnings from the Avro community and start work on our own repository technology designed to meet Snowplow’s specific requirements around schemas: Iglu.
Iglu is a machine-readable, open-source (Apache License 2.0) schema repository, initially for JSON Schema only. A schema repository (sometimes called a schema registry) is like npm or Maven or git but holds data schemas instead of software or code.
Iglu consists of three key technical aspects:
These pieces fit together like this:
Iglu Central is a public repository of JSON Schemas. Think of Iglu Central as being like RubyGems.org or Maven Central, but for storing publicly available JSON Schemas.
We are using Iglu Central to host all of the JSON Schemas which are used in different parts of Snowplow; the schemas for Iglu Central are stored in GitHub, in snowplow/iglu-central.
Here is an illustration of various Iglu clients talking to Iglu Central; we also show an Iglu Central mirror for a client working behind a firewall:
As far as we know, Iglu Central is the first public machine-readable schema repository - all prior efforts we have seen are human-browsable directories of articles about schemas (e.g. schema.org).
Iglu Central is hosted by Snowplow at http://iglucentral.com. Although Iglu Central is primarily designed to be consumed by Iglu clients, the root index page for Iglu Central links to all schemas currently hosted on Iglu Central.
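The lookup that a client performs against repositories such as Iglu Central can be sketched roughly as follows. This is an illustrative Python sketch, not the Scala client's implementation. It assumes that a repository serves each schema at a path of the form /schemas/vendor/name/format/version under its base URL, and that repositories are tried in priority order until one returns the schema. The mirror URL, the com.acme schema, the function names and the injected fetch callable are all hypothetical; the in-memory dictionary stands in for real HTTP requests so the example runs offline.

```python
from typing import Callable, Optional, Sequence


def schema_url(base_url: str, schema_key: str) -> str:
    """Map a schema key onto a URL in a repository (assumed path layout)."""
    prefix = "iglu:"
    if not schema_key.startswith(prefix):
        raise ValueError(f"Not an Iglu schema key: {schema_key!r}")
    path = schema_key[len(prefix):]
    return f"{base_url.rstrip('/')}/schemas/{path}"


def resolve(
    schema_key: str,
    repositories: Sequence[str],
    fetch: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Try each repository in order; return the first schema body found."""
    for base_url in repositories:
        body = fetch(schema_url(base_url, schema_key))
        if body is not None:
            return body
    return None


# Fake "web": pretend only Iglu Central hosts this made-up schema.
fake_web = {
    "http://iglucentral.com/schemas/com.acme/video_played/jsonschema/1-0-0":
        '{"type": "object"}',
}

# A hypothetical internal mirror is consulted first, then Iglu Central.
repos = ["http://mirror.example.internal", "http://iglucentral.com"]

print(resolve("iglu:com.acme/video_played/jsonschema/1-0-0", repos, fake_web.get))
```

Passing the fetch function in as a parameter keeps the resolution order separate from the transport, so the same logic could sit in front of HTTP, a local cache or schemas embedded in an application.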
While we have deliberately engineered Iglu as a standalone product, we expect that most initial usage of Iglu will be in conjunction with Snowplow.
Based on our early internal testing of Iglu, we envisage that a Snowplow user will want to:
Separately, we hope that software vendors, analysts and data scientists will contribute their own schemas to Iglu Central; it would be awesome in particular if companies offering streaming APIs or webhooks would publish JSON Schemas for their event streams into Iglu Central. Let’s schema everything!
We will discuss how to use Iglu with Snowplow in much more detail following the release of Snowplow 0.9.5.
While Iglu is heavily influenced by our requirements for Snowplow, we have deliberately created it as a standalone product, one which we hope will be broadly useful as a schema repository technology.
If you are interested in using Iglu without Snowplow, then we would recommend reading the Iglu wiki in detail. Wherever you find blocking gaps in the documentation, do please raise an issue in GitHub.
For an in-depth understanding of how Iglu works, we recommend browsing through the source for the Iglu Scala client. The next Snowplow release, 0.9.5, will make heavy use of our new Scala client for Iglu, so the client code is a good starting point for understanding the underlying design of Iglu.
We have deliberately tried to keep the scope of Iglu 0.1.0 as minimal as possible. The major known technical limitations at this time are:
Our first development priority for Iglu is creating a RESTful schema repository server which allows users to publish new schemas to the repository, and has some basic authentication to keep schemas private. For more details on what is coming next in Iglu, check out the Product roadmap on the wiki.
When we created Snowplow at the beginning of 2012, it didn’t need a lot of explanation - as an open source web analytics system, it fitted into a well-understood software category. As a schema repository, Iglu is a much more unusual beast - so do please get in touch and tell us your feedback, ask any questions or contribute!
The key resource for learning more about Iglu is the Iglu wiki on GitHub - do check it out. Wherever you find blocking gaps in the documentation, please raise an issue in GitHub.
We are hugely excited about the release of Iglu - we hope that the Snowplow community shares our excitement. Let’s work together to make end-to-end schemas a reality for web and event analytics. And stay tuned for the Snowplow 0.9.5 release (coming soon) for some more guidance on using Iglu with Snowplow!