Initial SchemaVer draft. Date: 13 March 2014. Draft authors: Alexander Dean, Frederick Blundun.
As we start to re-structure Snowplow away from implicit data models and wiki-based tracker protocols towards formal schemas (initially Thrift and JSON Schema, later Apache Avro), we have started to think about schema versioning.
"There are only two types of developer: the developer who versions his code, and developer_new_newer_newest_v2"
Proper versioning of software is taken for granted these days - there are various different approaches, but at Snowplow we are big believers in Semantic Versioning (SemVer for short). Here is creator Tom Preston-Werner explaining the crucial semantic aspect of SemVer: “Under this scheme, version numbers and the way they change convey meaning about the underlying code and what has been modified from one version to the next.”
We looked around and couldn’t find much prior art around semantic versioning of data schemas. The Avro community seems to have gone down something of a rabbit hole with their schema versioning - something we are keen to avoid at Snowplow.
Our initial thought was just to fall back to SemVer for schema versioning - after all, database table definitions are a form of schema, and we have been using SemVer for ours (example) for some time. However, the more we dug into it, the more we realized that SemVer was not the right fit for semantic versioning of schemas, and we would need to come up with something new. We are calling this new versioning formula for data schemas “SchemaVer”.
In the rest of the post, I will go through:
- SemVer - providing some background for those who are unfamiliar with it
- SchemaVer - providing our formula for using SchemaVer
- Design considerations - explaining why SchemaVer is structured the way it is
- Use cases - where should we be using SchemaVer
- Call for feedback - SchemaVer is a draft, and we would love feedback before we formalize it in Snowplow
If you are a business/web analyst or data scientist rather than coder, you may not be familiar with Semantic Versioning. SemVer provides a simple formula for managing the version of your software as you roll out new versions. That formula has some edge cases, but at its simplest it looks like:
Given a version number MAJOR.MINOR.PATCH, increment:
- MAJOR when you make incompatible API changes,
- MINOR when you add functionality in a backwards-compatible manner, and
- PATCH when you make backwards-compatible bug fixes.
It is important to understand what backwards compatibility means here. For SemVer, backwards compatibility is about providing guarantees (through version numbers) that a piece of software can upgrade a SemVer-respecting dependency without either:
- its code interfacing with the dependency’s public API breaking, or:
- the semantics of the dependency’s existing functionality changing - e.g. .multiply() suddenly starts dividing
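For readers who prefer code to prose, the SemVer increment rules can be sketched in a few lines of Python. This is only an illustration: the function name bump_semver and the change labels "incompatible", "feature" and "bugfix" are our own hypothetical names, not part of the SemVer specification, and pre-release and build metadata are ignored.

```python
def bump_semver(version: str, change: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version for a given kind of change.

    The change labels are hypothetical names chosen for this sketch.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "incompatible":   # incompatible API change
        return f"{major + 1}.0.0"
    if change == "feature":        # backwards-compatible functionality
        return f"{major}.{minor + 1}.0"
    if change == "bugfix":         # backwards-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


print(bump_semver("1.4.2", "feature"))
```

Note that the lower-order parts reset to zero whenever a higher-order part is incremented.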
Semantic Versioning is a great fit for managing the evolution of software in a way that protects the users of that software. But it’s not a great fit for versioning schemas, because schemas are used in a fundamentally different way to software.
When versioning a data schema, we are concerned with the backwards-compatibility between the new schema and existing data represented in earlier versions of the schema. This is the fundamental building block of SchemaVer, and explains the divergence from SemVer.
Let’s propose a simple formula for SchemaVer:
Given a version number MODEL-REVISION-ADDITION, increment the:
- MODEL when you make a breaking schema change which will prevent interaction with any historical data
- REVISION when you make a schema change which may prevent interaction with some historical data
- ADDITION when you make a schema change that is compatible with all historical data
Syntactically this feels similar to SemVer - but as you can see from the increment rules, the semantics of each element are very different from SemVer.
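The formula can be sketched as a small Python type. This is a minimal sketch rather than a reference implementation: the class name SchemaVer, its parse and bump methods and the change labels are hypothetical, and resetting the lower-order parts to 0 on a MODEL or REVISION bump is our assumption, by analogy with SemVer.

```python
from typing import NamedTuple


class SchemaVer(NamedTuple):
    """A MODEL-REVISION-ADDITION version (hypothetical helper type)."""
    model: int
    revision: int
    addition: int

    @classmethod
    def parse(cls, text: str) -> "SchemaVer":
        model, revision, addition = (int(part) for part in text.split("-"))
        return cls(model, revision, addition)

    def bump(self, change: str) -> "SchemaVer":
        # Assumption: lower-order parts reset to 0, as they do in SemVer.
        if change == "model":      # breaks interaction with any historical data
            return SchemaVer(self.model + 1, 0, 0)
        if change == "revision":   # may break interaction with some historical data
            return SchemaVer(self.model, self.revision + 1, 0)
        if change == "addition":   # compatible with all historical data
            return SchemaVer(self.model, self.revision, self.addition + 1)
        raise ValueError(f"unknown change type: {change!r}")

    def __str__(self) -> str:
        return f"{self.model}-{self.revision}-{self.addition}"


version = SchemaVer.parse("1-0-0")
print(version.bump("addition"))
```

Unlike the SemVer case, the question the caller has to answer before choosing a change label is about existing data, not about the code that consumes the schema.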
Let’s make SchemaVer more concrete with some examples using JSON Schema, in reverse order:
We have an existing JSON Schema, let’s call this 1-0-0. Now we want to add an additional field, impressionId, to our schema.
Because our new impressionId field is not a required field, and because version 1-0-0 had additionalProperties set to false, we know that all historical data will work with this new schema.
Therefore we are looking at an ADDITION, and so we bump the schema version to 1-0-1.
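To see why this change is safe, here is a rough Python sketch. Everything in it is illustrative: the two schemas are hypothetical stand-ins (the bannerId field and the sample records are invented for the example, and we assume the starting schema is version 1-0-0), and is_valid is a hand-rolled check covering only required, additionalProperties and string types, not a real JSON Schema validator.

```python
def is_valid(instance: dict, schema: dict) -> bool:
    """Check a flat JSON object against a tiny subset of JSON Schema."""
    properties = schema.get("properties", {})
    if any(name not in instance for name in schema.get("required", [])):
        return False
    if schema.get("additionalProperties", True) is False:
        if any(name not in properties for name in instance):
            return False
    for name, rule in properties.items():
        if name in instance and rule.get("type") == "string":
            if not isinstance(instance[name], str):
                return False
    return True


# Hypothetical starting schema: closed to unknown properties.
SCHEMA_1_0_0 = {
    "type": "object",
    "properties": {"bannerId": {"type": "string"}},
    "required": ["bannerId"],
    "additionalProperties": False,
}

# Same schema plus an optional impressionId field.
SCHEMA_1_0_1 = {
    "type": "object",
    "properties": {
        "bannerId": {"type": "string"},
        "impressionId": {"type": "string"},
    },
    "required": ["bannerId"],
    "additionalProperties": False,
}

# Anything that validated against 1-0-0 cannot contain an impressionId at all,
# so it still validates against 1-0-1.
historical = [{"bannerId": "banner-1"}, {"bannerId": "banner-2"}]
still_valid = all(is_valid(record, SCHEMA_1_0_1) for record in historical)

# Counter-example: had the old schema allowed additional properties, this
# record would have been accepted then but rejected by the new schema.
OPEN_SCHEMA = dict(SCHEMA_1_0_0, additionalProperties=True)
stray = {"bannerId": "banner-3", "impressionId": 42}
stray_was_valid = is_valid(stray, OPEN_SCHEMA)
stray_still_valid = is_valid(stray, SCHEMA_1_0_1)

print(still_valid, stray_was_valid, stray_still_valid)
```

The counter-example shows that the guarantee comes from additionalProperties being false in the old schema, not merely from the new field being optional.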
Let’s now make our JSON Schema support additionalProperties - this constitutes another ADDITION, so we are now on 1-0-2.
After a while, we add a new field, cost.
Will this new schema validate all historical data? Unfortunately we can’t be certain, because there could be historical JSONs where the analyst added their own cost field, possibly set to a string rather than a number (or a negative number).
So we are effectively making a REVISION to the data schema - so we bump the version to 1-1-0 (resetting ADDITION to 0).
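A short Python sketch makes the risk concrete. We assume here that the new cost field is defined as a non-negative number, which is what the paragraph above implies; the function name cost_is_valid and the sample records (including the bannerId field) are hypothetical and exist only for this illustration.

```python
def cost_is_valid(instance: dict) -> bool:
    """Check only the assumed rule for cost: optional, numeric, not negative."""
    if "cost" not in instance:
        return True
    cost = instance["cost"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(cost, bool) or not isinstance(cost, (int, float)):
        return False
    return cost >= 0


# Hypothetical historical data, all accepted while additional properties
# were allowed and cost was not yet part of the schema.
historical = [
    {"bannerId": "banner-1"},                  # no cost field: unaffected
    {"bannerId": "banner-2", "cost": 0.5},     # happens to fit the new rule
    {"bannerId": "banner-3", "cost": "0.50"},  # string, now rejected
    {"bannerId": "banner-4", "cost": -1},      # negative, now rejected
]

broken = [record for record in historical if not cost_is_valid(record)]
print(len(broken), "of", len(historical), "historical records no longer validate")
```

Some historical records survive and some do not, which is exactly the "may prevent interaction with some historical data" case that defines a REVISION.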
Oh dear - we have just realized that we can identify our clicks through a unique clickId - no need to be storing the impressionId. Our new JSON Schema therefore replaces impressionId with clickId.
We have changed our MODEL - because we can have no reasonable expectation that any of the historical data can interact with this schema. That means our new version is 2-0-0.
Note that we also decided to use this “reboot” of the MODEL to change additionalProperties back to false, because (as we have learnt) it will help us to avoid unnecessary REVISIONs in the future.
At this point we should probably add a few supplementary rules around SchemaVer, especially as they differ from SemVer:
- We use hyphens (-) to separate the version parts, not periods (.) as in SemVer
- Versioning starts from 1, not 0 as in SemVer
- SemVer has a “get out of jail free” card, where you start your initial development release at 0.1.0 and then increment the MINOR version for each subsequent release. There is no equivalent for SchemaVer: we don’t start on an unstable development version 0
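These rules are simple enough to check mechanically. The sketch below is one possible reading, not a formal grammar: we interpret "versioning starts from 1" as applying to MODEL only, with REVISION and ADDITION allowed to be 0, and we borrow SemVer's ban on leading zeros as an assumption. The names SCHEMAVER_RE, SEMVER_CORE_RE and classify are hypothetical.

```python
import re

# MODEL starts at 1; REVISION and ADDITION may be 0 (our interpretation).
SCHEMAVER_RE = re.compile(r"^([1-9][0-9]*)-(0|[1-9][0-9]*)-(0|[1-9][0-9]*)$")

# Plain MAJOR.MINOR.PATCH only; pre-release and build metadata are ignored.
SEMVER_CORE_RE = re.compile(r"^(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)$")


def classify(version: str) -> str:
    """Tell a SchemaVer string from a plain SemVer string by its separator."""
    if SCHEMAVER_RE.match(version):
        return "SchemaVer"
    if SEMVER_CORE_RE.match(version):
        return "SemVer"
    return "unknown"


for candidate in ("1-0-5", "1.0.5", "0-1-0"):
    print(candidate, "->", classify(candidate))
```

The last candidate is rejected because there is no unstable MODEL version 0 in SchemaVer.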
If we have designed SchemaVer right, then hopefully it should seem straightforward, perhaps even obvious. However, we evaluated and discarded many different options while designing SchemaVer. We’ll go through some of these in this section, to “show our working”.
First off, the names MODEL, REVISION and ADDITION went through many revisions. We are pretty happy with these now.
Initially we were keen to use periods to separate the version elements, to allow existing SemVer libraries to work with SchemaVer. Unfortunately, we realized that an analyst looking at a table definition versioned as 1.0.5 would have no idea if the table was schema’ed using SemVer or SchemaVer. So we needed a visual cue to indicate that this was SchemaVer - hence the hyphens.
We gave some serious thought to recreating SemVer’s unstable MAJOR version 0 idea. On balance, this seemed a bad idea for SchemaVer: inevitably some MODEL version 0s will go into production, and then we lose our all-important guarantees about the relationship between schema versions and the historical data.
We experimented with ways to make SchemaVer fully deterministic - in other words, could we come up with a formula whereby a computer could correctly auto-increment the SchemaVer just by studying the new and previous schema definition?
We have succeeded in making ADDITION fully deterministic - but there are clearly shades of grey in the separation of MODEL and REVISION. We think those shades of grey are useful - because they allow schema authors to exercise their own discretion in not incrementing the MODEL unless absolutely necessary.
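To give a flavour of what a deterministic ADDITION check could look like, here is a deliberately conservative Python sketch. It is our own illustration, not the formula used at Snowplow: the function name compatible_with_all_history is hypothetical, it only understands flat schemas with properties, required and additionalProperties, and it answers False whenever it cannot prove that every historical record still validates. The example schemas are hypothetical too.

```python
def compatible_with_all_history(old: dict, new: dict) -> bool:
    """Return True only if the change is provably an ADDITION.

    False means "needs human judgement" (REVISION or MODEL), not "invalid".
    """
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    # Existing property definitions and the required list must be untouched.
    if any(new_props.get(name) != rule for name, rule in old_props.items()):
        return False
    if sorted(new.get("required", [])) != sorted(old.get("required", [])):
        return False
    old_closed = old.get("additionalProperties", True) is False
    new_closed = new.get("additionalProperties", True) is False
    added = set(new_props) - set(old_props)
    if added and not old_closed:
        return False  # historical data may already contain these fields
    if new_closed and not old_closed:
        return False  # closing the schema may reject historical extras
    return True


# Hypothetical schema history for illustration.
v1 = {"properties": {"bannerId": {"type": "string"}},
      "required": ["bannerId"], "additionalProperties": False}
v2 = {"properties": {**v1["properties"], "impressionId": {"type": "string"}},
      "required": ["bannerId"], "additionalProperties": False}
v3 = dict(v2, additionalProperties=True)
v4 = dict(v3, properties={**v3["properties"],
                          "cost": {"type": "number", "minimum": 0}})

print(compatible_with_all_history(v1, v2))  # optional field on a closed schema
print(compatible_with_all_history(v2, v3))  # opening up additionalProperties
print(compatible_with_all_history(v3, v4))  # new field on an open schema
```

A check like this can confirm an ADDITION automatically, but once it returns False the choice between REVISION and MODEL is still left to the schema author.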
We plan to use SchemaVer throughout Snowplow to add semantic versioning to all of our data structures. In fact the process of designing SchemaVer has already helped us: the process has made us much more aware of the types of schema constraints (and lack of constraints) which lead to ADDITION increments. We are now actively working to minimize REVISION increments for Snowplow schemas - and we would encourage our community to do the same when creating schemas for their custom contexts and unstructured events.
We hope that SchemaVer is useful outside of just JSON Schema versioning. We are exploring approaches to versioning database table definitions with SchemaVer, and hope to start a dialog with the Apache Avro community, who have a lot of prior experience attempting to uniquely identify, validate and version data schemas (see e.g. AVRO-1006).
More broadly, we believe that there are some interesting potential use cases for SchemaVer outside of Snowplow. For example in RESTful APIs: many of these are versioned at the API level (“Desk.com API v2”), but we would like to see the data structures returned from API GET requests conforming to publicly available, SchemaVer-versioned JSON Schemas. This would make interactions with RESTful APIs much less error-prone.
We are also keen to explore adjacent use cases for SchemaVer in other document-oriented software systems, such as CMSes, ecommerce solutions and NoSQL datastores.
Above all, we would like to stress that this is a draft proposal, and we would love to get feedback from the Snowplow community and beyond on semantic schema versioning. Now is the best time for us to get feedback - before we have started to formalize SchemaVer into the coming Snowplow releases.
So do please get in touch if you have thoughts on semantic schema versioning or our proposed SchemaVer specification - we’d love to make this a more collaborative effort!