Introducing SchemaVer for semantic versioning of schemas

Alex Dean

Initial SchemaVer draft. Date: 13 March 2014. Draft authors: Alexander Dean, Frederick Blundun.

As we start to re-structure Snowplow away from implicit data models and wiki-based tracker protocols towards formal schemas (initially Thrift and JSON Schema, later Apache Avro), we have started to think about schema versioning.

“There are only two types of developer: the developer who versions his code, and developer_new_newer_newest_v2”

Proper versioning of software is taken for granted these days – there are various different approaches, but at Snowplow we are big believers in Semantic Versioning (SemVer for short). Here is creator Tom Preston-Werner explaining the crucial semantic aspect of SemVer: “Under this scheme, version numbers and the way they change convey meaning about the underlying code and what has been modified from one version to the next.”

We looked around and couldn’t find much prior art around semantic versioning of data schemas. The Avro community seems to have gone down something of a rabbit hole with their schema versioning, which is something we are keen to avoid at Snowplow.

Our initial thought was just to fall back to SemVer for schema versioning – after all, database table definitions are a form of schema, and we have been using SemVer for ours (example) for some time. However, the more we dug into it, the more we realized that SemVer was not the right fit for semantic versioning of schemas, and we would need to come up with something new. We are calling this new versioning formula for data schemas “SchemaVer”.

In the rest of the post, I will go through:

  1. SemVer – providing some background for those who are unfamiliar with it
  2. SchemaVer – providing our formula for using SchemaVer
  3. Design considerations – explaining why SchemaVer is structured the way it is
  4. Use cases – where should we be using SchemaVer
  5. Call for feedback – SchemaVer is a draft, and we would love feedback before we formalize it in Snowplow

If you are a business/web analyst or data scientist rather than a coder, you may not be familiar with Semantic Versioning. SemVer provides a simple formula for managing the version of your software as you roll out new versions. That formula has some edge cases, but at its simplest it looks like:

Given a version number MAJOR.MINOR.PATCH, increment:

– MAJOR when you make incompatible API changes,
– MINOR when you add functionality in a backwards-compatible manner, and
– PATCH when you make backwards-compatible bug fixes.

It is important to understand what backwards compatibility means here. For SemVer, backwards compatibility is a guarantee, expressed through version numbers, that a piece of software can upgrade a SemVer-respecting dependency without either:

  1. its code interfacing with the dependency’s public API breaking, or:
  2. the semantics of the dependency’s existing functionality changing – e.g. .multiply() suddenly starts dividing
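The guarantee above can be sketched as a small check. This is a minimal illustration, not part of the SemVer specification: the helper names `parse_semver` and `is_safe_upgrade` are hypothetical, and the sketch deliberately ignores SemVer's edge cases (pre-release tags, build metadata and the unstable 0.x range).

```python
from typing import Tuple


def parse_semver(version: str) -> Tuple[int, int, int]:
    """Parse a plain MAJOR.MINOR.PATCH string (no pre-release or build tags)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_safe_upgrade(current: str, candidate: str) -> bool:
    """True when SemVer promises the candidate will not break a user of current."""
    cur, new = parse_semver(current), parse_semver(candidate)
    # Same MAJOR means no incompatible API changes; a newer MINOR or PATCH
    # only adds functionality or fixes bugs in a backwards-compatible way.
    return new[0] == cur[0] and new >= cur


print(is_safe_upgrade("1.4.2", "1.5.0"))  # True: MINOR bump
print(is_safe_upgrade("1.4.2", "2.0.0"))  # False: MAJOR bump
```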

Semantic Versioning is a great fit for managing the evolution of software in a way that protects the users of that software. But it’s not a great fit for versioning schemas, because schemas are used in a fundamentally different way to software.

When versioning a data schema, we are concerned with the backwards-compatibility between the new schema and existing data represented in earlier versions of the schema. This is the fundamental building block of SchemaVer, and explains the divergence from SemVer.

Let’s propose a simple formula for SchemaVer:

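Given a version number MODEL-REVISION-ADDITION, increment the:

– MODEL when you make a breaking schema change which will prevent interaction with any historical data,
– REVISION when you make a schema change which may prevent interaction with some historical data, and
– ADDITION when you make a schema change that is compatible with all historical data.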

Syntactically this feels similar to SemVer – but as you can see from the increment rules, the semantics of each element are very different from SemVer.
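To make the mechanics concrete, here is a minimal sketch of parsing and incrementing a SchemaVer string. The `SchemaVer` class and its `bump` method are hypothetical names chosen for illustration. We assume that incrementing an element resets the elements to its right to 0, as in SemVer; that matches the 1-1-0 and 2-0-0 steps in the worked examples in this post.

```python
from typing import NamedTuple


class SchemaVer(NamedTuple):
    model: int
    revision: int
    addition: int

    @classmethod
    def parse(cls, text: str) -> "SchemaVer":
        """Parse a hyphen-separated MODEL-REVISION-ADDITION string."""
        model, revision, addition = (int(part) for part in text.split("-"))
        return cls(model, revision, addition)

    def bump(self, kind: str) -> "SchemaVer":
        """Increment one element and reset the elements to its right."""
        if kind == "MODEL":
            return SchemaVer(self.model + 1, 0, 0)
        if kind == "REVISION":
            return SchemaVer(self.model, self.revision + 1, 0)
        if kind == "ADDITION":
            return SchemaVer(self.model, self.revision, self.addition + 1)
        raise ValueError(f"unknown increment kind: {kind!r}")

    def __str__(self) -> str:
        return f"{self.model}-{self.revision}-{self.addition}"


version = SchemaVer.parse("1-0-0")
print(version.bump("ADDITION"))  # 1-0-1
print(version.bump("REVISION"))  # 1-1-0
print(version.bump("MODEL"))     # 2-0-0
```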

Let’s make SchemaVer more concrete with some examples using JSON Schema, in reverse order:

We have an existing JSON Schema, let’s call this 1-0-0:

{ "$schema": "", "type": "object", "properties": { "bannerId": { "type": "string" } }, "required": ["bannerId"], "additionalProperties": false }

Now we want to add an additional field to our schema:

{ "$schema": "", "type": "object", "properties": { "bannerId": { "type": "string" }, "impressionId": { "type": "string" } }, "required": ["bannerId"], "additionalProperties": false }

Because our new impressionId field is not a required field, and because version 1-0-0 had additionalProperties set to false, we know that all historical data will work with this new schema.

Therefore we are looking at an ADDITION, and so we bump the schema version to 1-0-1.

Let’s now make our JSON Schema support additionalProperties – this constitutes another ADDITION, so we are now on 1-0-2:

{ "$schema": "", "type": "object", "properties": { "bannerId": { "type": "string" }, "impressionId": { "type": "string" } }, "required": ["bannerId"], "additionalProperties": true }

After a while, we add a new field, cost:

{ "$schema": "", "type": "object", "properties": { "bannerId": { "type": "string" }, "impressionId": { "type": "string" }, "cost": { "type": "number", "minimum": 0 } }, "required": ["bannerId"], "additionalProperties": true }

Will this new schema validate all historical data? Unfortunately we can’t be certain, because there could be historical JSONs where the analyst added their own cost field, possibly set to a string rather than a number (or a negative number).

We are effectively making a REVISION to the data schema, so we bump the version to 1-1-0 (resetting ADDITION to 0).
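The risk with the cost field can be demonstrated with a few lines of Python. This is only a sketch: `is_valid` is a hypothetical, hand-rolled validator that covers just the keywords used in this post (type, properties, required, additionalProperties and minimum) rather than full JSON Schema, and the sample records are invented for illustration.

```python
from typing import Any, Dict

Schema = Dict[str, Any]


def _is_number(value: Any) -> bool:
    # JSON booleans are not numbers, although Python's bool subclasses int.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def is_valid(instance: Any, schema: Schema) -> bool:
    """Validate against the small JSON Schema subset used in this post."""
    expected = schema.get("type")
    if expected == "string" and not isinstance(instance, str):
        return False
    if expected == "number" and not _is_number(instance):
        return False
    if expected == "object" and not isinstance(instance, dict):
        return False
    if "minimum" in schema and _is_number(instance) and instance < schema["minimum"]:
        return False
    if isinstance(instance, dict):
        props = schema.get("properties", {})
        if any(name not in instance for name in schema.get("required", [])):
            return False
        for name, value in instance.items():
            if name in props:
                if not is_valid(value, props[name]):
                    return False
            elif schema.get("additionalProperties", True) is False:
                return False
    return True


BASE_PROPS = {"bannerId": {"type": "string"}, "impressionId": {"type": "string"}}
V1_0_2: Schema = {"type": "object", "properties": dict(BASE_PROPS),
                  "required": ["bannerId"], "additionalProperties": True}
V1_1_0: Schema = {"type": "object",
                  "properties": {**BASE_PROPS, "cost": {"type": "number", "minimum": 0}},
                  "required": ["bannerId"], "additionalProperties": True}

# Invented historical records written while 1-0-2 was current.
plain = {"bannerId": "b-42"}
string_cost = {"bannerId": "b-42", "cost": "cheap"}
negative_cost = {"bannerId": "b-42", "cost": -5}

for record in (plain, string_cost, negative_cost):
    print(is_valid(record, V1_0_2), is_valid(record, V1_1_0))
# True True
# True False
# True False
```

All three records were acceptable under 1-0-2, but only the first survives under the new schema, which is exactly the "may prevent interaction with some historical data" situation.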

Oh dear – we have just realized that we can identify our clicks through a unique clickId – no need to be storing the bannerId or impressionId. Here is our new JSON Schema:

{ "$schema": "", "type": "object", "properties": { "clickId": { "type": "string" }, "cost": { "type": "number", "minimum": 0 } }, "required": ["clickId"], "additionalProperties": false }

We have changed our MODEL – because we can have no reasonable expectation that any of the historical data can interact with this schema. That means our new version is 2-0-0.

Note that we also decided to use this “reboot” of the MODEL to change additionalProperties back to false, because (as we have learnt) it will help us to avoid unnecessary REVISIONs in the future.

At this point we should probably add a few supplementary rules around SchemaVer, especially as they differ from SemVer:

If we have designed SchemaVer right, then hopefully it should seem straightforward, perhaps even obvious. However, we evaluated and discarded many different options while designing SchemaVer. We’ll go through some of these in this section, to “show our working”.

First off, the names MODEL, REVISION and ADDITION went through many revisions. We are pretty happy with these now.

Initially we were keen to use periods to separate the version elements, to allow existing SemVer libraries to work with SchemaVer. Unfortunately, we realized that an analyst looking at a table definition versioned as 1.0.5 would have no idea if the table was schema’ed using SemVer or SchemaVer. So we needed a visual cue to indicate that this was SchemaVer – hence the hyphens.
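The separator makes the two schemes easy to tell apart mechanically as well as visually. Here is a minimal sketch; the function name `versioning_scheme` is hypothetical, and the patterns only cover the plain three-element forms (a SemVer string with a pre-release tag such as 1.0.0-rc1 is reported as unknown).

```python
import re

# Three non-negative integers, separated by periods (SemVer) or hyphens (SchemaVer).
SEMVER_RE = re.compile(r"\d+\.\d+\.\d+")
SCHEMAVER_RE = re.compile(r"\d+-\d+-\d+")


def versioning_scheme(version: str) -> str:
    """Guess which scheme a version string follows from its separators."""
    if SEMVER_RE.fullmatch(version):
        return "SemVer"
    if SCHEMAVER_RE.fullmatch(version):
        return "SchemaVer"
    return "unknown"


print(versioning_scheme("1.0.5"))  # SemVer
print(versioning_scheme("1-0-5"))  # SchemaVer
```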

We gave some serious thought to recreating SemVer’s unstable MAJOR version 0 idea. On balance, this seemed a bad idea for SchemaVer, because inevitably some MODEL version 0s will go into production, and then we lose our all-important guarantees about the relationship between schema versions and the historical data.

We experimented with ways to make SchemaVer fully deterministic – in other words, could we come up with a formula whereby a computer could correctly auto-increment the SchemaVer just by studying the new and previous schema definition?

We have succeeded in making ADDITION fully deterministic – but there are clearly shades of grey in the separation of MODEL and REVISION. We think those shades of grey are useful – because they allow schema authors to exercise their own discretion in not incrementing the MODEL unless absolutely necessary.
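As an illustration of what a deterministic ADDITION check could look like, here is a minimal Python sketch. It is one conservative rule that agrees with the worked examples earlier in this post, not necessarily the rule we will formalize. The name `is_addition` is hypothetical, and the sketch assumes flat object schemas that use only properties, required and a boolean additionalProperties.

```python
from typing import Any, Dict

Schema = Dict[str, Any]


def is_addition(old: Schema, new: Schema) -> bool:
    """True if every instance valid under old is certain to be valid under new."""
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    # Every existing property must survive with an unchanged definition.
    if any(new_props.get(name) != sub for name, sub in old_props.items()):
        return False
    # No field may become required that was optional or absent before.
    if not set(new.get("required", [])) <= set(old.get("required", [])):
        return False
    old_open = old.get("additionalProperties", True) is not False
    new_open = new.get("additionalProperties", True) is not False
    if old_open:
        # Historical data may hold arbitrary extra fields, so the new schema
        # must stay open and must not start constraining any new field.
        return new_open and not (set(new_props) - set(old_props))
    # The old schema was closed: new optional fields cannot clash with old data.
    return True


def obj(props: Schema, required: list, additional: bool) -> Schema:
    return {"type": "object", "properties": props,
            "required": required, "additionalProperties": additional}


STR = {"type": "string"}
COST = {"type": "number", "minimum": 0}
v1_0_0 = obj({"bannerId": STR}, ["bannerId"], False)
v1_0_1 = obj({"bannerId": STR, "impressionId": STR}, ["bannerId"], False)
v1_0_2 = obj({"bannerId": STR, "impressionId": STR}, ["bannerId"], True)
v1_1_0 = obj({"bannerId": STR, "impressionId": STR, "cost": COST}, ["bannerId"], True)
v2_0_0 = obj({"clickId": STR, "cost": COST}, ["clickId"], False)

print(is_addition(v1_0_0, v1_0_1))  # True
print(is_addition(v1_0_1, v1_0_2))  # True
print(is_addition(v1_0_2, v1_1_0))  # False
print(is_addition(v1_1_0, v2_0_0))  # False
```

When the check returns False, a human still has to decide between REVISION and MODEL, which is where the shades of grey come in.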

We plan to use SchemaVer throughout Snowplow to add semantic versioning to all of our data structures. In fact the process of designing SchemaVer has already helped us: it has made us much more aware of the types of schema constraints (and lack of constraints) which lead to MODEL, REVISION and ADDITION increments. We are now actively working to minimize MODEL and REVISION increments for Snowplow schemas, and we would encourage our community to do the same when creating schemas for their custom contexts and unstructured events.

We hope that SchemaVer is useful outside of just JSON Schema versioning. We are exploring approaches to versioning database table definitions with SchemaVer, and hope to start a dialog with the Apache Avro community, who have a lot of prior experience attempting to uniquely identify, validate and version data schemas (see e.g. AVRO-1006).

More broadly, we believe that there are some interesting potential use cases for SchemaVer outside of Snowplow. For example, in RESTful APIs: many of these are versioned at the API level (“API v2”), but we would like to see the data structures returned from API GET requests conforming to publicly available, SchemaVer-versioned JSON Schemas. This would make interactions with RESTful APIs much less error-prone.

We are also keen to explore adjacent use cases for SchemaVer in other document-oriented software systems, such as CMSes, ecommerce solutions and NoSQL datastores.

Above all, we would like to stress that this is a draft proposal, and we would love to get feedback from the Snowplow community and beyond on semantic schema versioning. Now is the best time for us to get feedback – before we have started to formalize SchemaVer into the coming Snowplow releases.

So do please get in touch if you have thoughts on semantic schema versioning or our proposed SchemaVer specification – we’d love to make this a more collaborative effort!
