It is important to understand what backwards compatibility means here. For SemVer, backwards compatibility is about providing guarantees (through version numbers) that a piece of software can upgrade a SemVer-respecting dependency without either breaking or behaving differently, for example because .multiply() suddenly starts dividing.
Semantic Versioning is a great fit for managing the evolution of software in a way that protects the users of that software. But it’s not a great fit for versioning schemas, because schemas are used in a fundamentally different way to software.
When versioning a data schema, we are concerned with the backwards-compatibility between the new schema and existing data represented in earlier versions of the schema. This is the fundamental building block of SchemaVer, and explains the divergence from SemVer.
Let’s propose a simple formula for SchemaVer:
Given a version number MODEL-REVISION-ADDITION, increment the:
MODEL when you make a breaking schema change which will prevent interaction with any historical data
REVISION when you make a schema change which may prevent interaction with some historical data
ADDITION when you make a schema change that is compatible with all historical data
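The increment rules can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the SchemaVer class, the bump function, the starting version 1-0-0 and the choice to reset both lower parts on a MODEL change are all assumptions made for the example.

```python
from typing import NamedTuple


class SchemaVer(NamedTuple):
    """A MODEL-REVISION-ADDITION version triple (hypothetical helper)."""
    model: int
    revision: int
    addition: int

    def __str__(self) -> str:
        # SchemaVer separates its parts with hyphens.
        return f"{self.model}-{self.revision}-{self.addition}"


def bump(version: SchemaVer, change: str) -> SchemaVer:
    """Return the next version for a change classified by the schema author."""
    if change == "MODEL":
        # Assumption: a new MODEL resets both lower parts to 0.
        return SchemaVer(version.model + 1, 0, 0)
    if change == "REVISION":
        # A REVISION resets ADDITION to 0.
        return SchemaVer(version.model, version.revision + 1, 0)
    if change == "ADDITION":
        return SchemaVer(version.model, version.revision, version.addition + 1)
    raise ValueError(f"unknown change type: {change!r}")


# A hypothetical sequence of changes, starting from an assumed 1-0-0.
v = SchemaVer(1, 0, 0)
history = [str(v)]
for change in ["ADDITION", "ADDITION", "REVISION", "MODEL"]:
    v = bump(v, change)
    history.append(str(v))
```

Note that the function only applies a classification; deciding whether a change is a MODEL, REVISION or ADDITION is still up to the schema author.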
Syntactically this feels similar to SemVer - but as you can see from the increment rules, the semantics of each element are very different from SemVer.
Let’s make SchemaVer more concrete with some examples using JSON Schema, in reverse order:
We start with an existing JSON Schema.
Now we want to add an additional field, impressionId, to our schema.
Because our new impressionId field is not a required field, and because the existing schema has additionalProperties set to false, we know that all historical data will work with this new schema.
Therefore we are looking at an ADDITION, and so we bump the ADDITION part of the schema version.
Let’s now make our JSON Schema support additionalProperties - this constitutes another ADDITION, so we bump the ADDITION part once more.
After a while, we add a new field, cost.
Will this new schema validate all historical data? Unfortunately we can’t be certain, because there could be historical JSONs where the analyst added their own cost field, possibly set to a string rather than a number (or a negative number).
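To see why this change is risky, here is a small Python sketch. It uses a tiny hand-written stand-in for a JSON Schema validator, not a real library, and the bannerId field, the sample records and the exact constraints on cost are hypothetical values chosen for illustration.

```python
def validate(schema: dict, record: dict) -> bool:
    """Tiny stand-in for a JSON Schema validator (illustration only).

    Supports only: required, properties with type "string"/"number",
    minimum, and additionalProperties as a boolean.
    """
    types = {"string": str, "number": (int, float)}
    props = schema.get("properties", {})
    if any(key not in record for key in schema.get("required", [])):
        return False
    for key, value in record.items():
        if key not in props:
            # Unknown fields are allowed unless additionalProperties is false.
            if schema.get("additionalProperties", True) is False:
                return False
            continue
        rule = props[key]
        if "type" in rule:
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, types[rule["type"]]):
                return False
        if ("minimum" in rule and isinstance(value, (int, float))
                and value < rule["minimum"]):
            return False
    return True


# Hypothetical schemas: "bannerId" and the constraints on "cost" are invented.
before = {
    "properties": {"bannerId": {"type": "string"},
                   "impressionId": {"type": "string"}},
    "required": ["bannerId"],
    "additionalProperties": True,
}
after = {
    "properties": {**before["properties"],
                   "cost": {"type": "number", "minimum": 0}},
    "required": ["bannerId"],
    "additionalProperties": True,
}

historical = [
    {"bannerId": "4732ce23d345"},
    {"bannerId": "4732ce23d345", "cost": "cheap"},  # analyst's own field
    {"bannerId": "4732ce23d345", "cost": -1},
]

# Every record was valid before; only some remain valid afterwards.
still_valid = [validate(after, record) for record in historical]
```

All three records pass the earlier schema because it accepts arbitrary extra fields, but the two records with an analyst-defined cost fail the new one.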
We are effectively making a REVISION to the data schema, so we bump the REVISION part of the version (and reset ADDITION to 0).
Oh dear - we have just realized that we can identify our clicks through a unique clickId - no need to store the impressionId. Our new JSON Schema replaces impressionId with clickId.
We have changed our MODEL - because we can have no reasonable expectation that any of the historical data can interact with this schema. That means we bump the MODEL part for our new version.
Note that we also decided to use this “reboot” of the MODEL to change additionalProperties back to false, because (as we have learnt) it will help us to avoid unnecessary REVISIONs in the future.
At this point we should probably add a few supplementary rules around SchemaVer, especially as they differ from SemVer:
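SchemaVer uses hyphens (-s) to separate the version parts, not periods (.s) as in SemVer
SemVer recommends starting initial development at 0.1.0 and then incrementing the MINOR version for each subsequent release. There is no equivalent for SchemaVer: we don’t start on an unstable development version 0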
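These supplementary rules can be checked mechanically. The following Python sketch is one possible reading of them: the pattern name, the parse_schemaver function and the decision to reject leading zeros are assumptions for this example, not part of the proposal.

```python
import re

# MODEL starts at 1 (there is no version 0); REVISION and ADDITION may be 0.
# Rejecting leading zeros is an assumption made for this sketch.
SCHEMAVER = re.compile(r"([1-9][0-9]*)-(0|[1-9][0-9]*)-(0|[1-9][0-9]*)")


def parse_schemaver(text: str) -> tuple:
    """Parse 'MODEL-REVISION-ADDITION' into a tuple of ints, or raise ValueError."""
    match = SCHEMAVER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a SchemaVer string: {text!r}")
    return tuple(int(part) for part in match.groups())
```

Because the result is a tuple of integers, versions compare and sort in the expected order, so parse_schemaver("1-0-10") is greater than parse_schemaver("1-0-9"). A SemVer-style string such as 1.0.5 is rejected.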
If we have designed SchemaVer right, then hopefully it should seem straightforward, perhaps even obvious. However, we evaluated and discarded many different options while designing SchemaVer. We’ll go through some of these in this section, to “show our working”.
First off, the names of the version parts (REVISION among them) went through many revisions. We are pretty happy with these now.
Initially we were keen to use periods to separate the version elements, to allow existing SemVer libraries to work with SchemaVer. Unfortunately, we realized that an analyst looking at a table definition versioned as 1.0.5 would have no idea if the table was schema’ed using SemVer or SchemaVer. So we needed a visual cue to indicate that this was SchemaVer - hence the hyphens.
We gave some serious thought to recreating SemVer’s unstable MAJOR version 0 idea. On balance, this seemed a bad idea for SchemaVer, because inevitably some MODEL version 0s will go into production, and then we lose our all-important guarantees about the relationship between schema versions and the historical data.
We experimented with ways to make SchemaVer fully deterministic - in other words, could we come up with a formula whereby a computer could correctly auto-increment the SchemaVer just by studying the new and previous schema definition?
We have succeeded in making ADDITION fully deterministic - but there are clearly shades of grey in the separation of MODEL and REVISION. We think those shades of grey are useful - because they allow schema authors to exercise their own discretion in not incrementing the MODEL unless absolutely necessary.
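As an illustration of what a deterministic ADDITION check might look like, here is a deliberately conservative Python sketch. It is not the formula we arrived at: the is_addition function, the restriction to flat object schemas and the sample schemas (including the bannerId field) are assumptions made for this example. It answers True only when it can prove that every record accepted by the old schema is accepted by the new one, and otherwise leaves the decision to the schema author.

```python
def is_addition(old: dict, new: dict) -> bool:
    """Conservatively decide whether `new` accepts everything `old` accepted.

    Only flat object schemas are considered. Returns False whenever
    compatibility cannot be proven.
    """
    # Historical data can only contain unknown fields if the old schema
    # allowed them (they are allowed unless additionalProperties is false).
    if old.get("additionalProperties", True) is not False:
        return False
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    # Every existing property must be carried over unchanged.
    if any(new_props.get(name) != rule for name, rule in old_props.items()):
        return False
    # No field may become required, so newly added fields stay optional.
    if not set(new.get("required", [])) <= set(old.get("required", [])):
        return False
    # Any other keyword must be untouched.
    handled = {"properties", "required", "additionalProperties"}

    def rest(schema: dict) -> dict:
        return {k: v for k, v in schema.items() if k not in handled}

    return rest(old) == rest(new)


# Hypothetical schemas following the kinds of change in the worked examples.
first = {"properties": {"bannerId": {"type": "string"}},
         "required": ["bannerId"], "additionalProperties": False}
with_impression = {"properties": {"bannerId": {"type": "string"},
                                  "impressionId": {"type": "string"}},
                   "required": ["bannerId"], "additionalProperties": False}
open_ended = {**with_impression, "additionalProperties": True}
with_cost = {**open_ended,
             "properties": {**open_ended["properties"],
                            "cost": {"type": "number", "minimum": 0}}}

results = [is_addition(first, with_impression),
           is_addition(with_impression, open_ended),
           is_addition(open_ended, with_cost)]
```

The first two changes are provably safe. The third is not, because the previous schema allowed arbitrary extra fields, so the check declines and a human has to choose between REVISION and MODEL.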
We plan to use SchemaVer throughout Snowplow to add semantic versioning to all of our data structures. In fact the process of designing SchemaVer has already helped us: the process has made us much more aware of the types of schema constraints (and lack of constraints) which lead to ADDITION increments. We are now actively working to minimize REVISION increments for Snowplow schemas - and we would encourage our community to do the same when creating schemas for their custom contexts and unstructured events.
We hope that SchemaVer is useful outside of just JSON Schema versioning. We are exploring approaches to versioning database table definitions with SchemaVer, and hope to start a dialog with the Apache Avro community, who have a lot of prior experience attempting to uniquely identify, validate and version data schemas (see e.g. AVRO-1006).
More broadly, we believe that there are some interesting potential use cases for SchemaVer outside of Snowplow. For example in RESTful APIs: many of these are versioned at the API level (“Desk.com API v2”), but we would like to see the data structures returned from API GET requests conforming to publicly available, SchemaVer-versioned JSON Schemas. This would make interactions with RESTful APIs much less error-prone.
We are also keen to explore adjacent use cases for SchemaVer in other document-oriented software systems, such as CMSes, ecommerce solutions and NoSQL datastores.
Above all, we would like to stress that this is a draft proposal, and we would love to get feedback from the Snowplow community and beyond on semantic schema versioning. Now is the best time for us to get feedback - before we have started to formalize SchemaVer into the coming Snowplow releases.
So do please get in touch if you have thoughts on semantic schema versioning or our proposed SchemaVer specification - we’d love to make this a more collaborative effort!