Schema Guru 0.1.0 released for deriving JSON Schemas from JSONs

03 June 2015  •  Anton Parkhomenko

We’re pleased to announce the first release of Schema Guru, a tool for automatically deriving JSON Schemas from a collection of JSON instances. This release is part of a new R&D focus at Snowplow Analytics on improving the tooling available around JSON Schema, a technology used widely in our own Snowplow and Iglu projects.


Read on after the fold for:

  1. Why Schema Guru?
  2. Current features
  3. Design principles
  4. A fuller example
  5. Getting help
  6. Roadmap

1. Why Schema Guru?

If you want several different apps or services to communicate, at some point you will need to describe a protocol for this communication. JSON Schema can be very helpful here: it is a declarative format for expressing rules about JSON structures.

So you open your text editor and start writing your JSON Schema, specifying all the keys, types, validation parameters, nested objects and so on. But this quickly becomes painful - especially if your instances include lots of keys and complex structures where objects nest deeply in other objects. And things get even worse if your developers have already generated JSON instances somehow and you need to cross-check these instances against your schema.

What if we could automate this process somehow? There are a few pre-existing tools, most notably the jsonschema.net website. Unfortunately, these tools all derive your schema from just one JSON instance. This is problematic because JSONs often have very “jagged edges”: two JSON instances which should belong to the same schema may have a different subset of properties, types and formats.

So, to generate a JSON Schema safely, we need to work from as many JSON instances as possible. Schema Guru lets us derive our schema from a whole collection of JSON instances: the law of large numbers should do the rest!

2. Current features

The initial 0.1.0 release of Schema Guru has the following features:

  • Derive all types defined in the JSON Schema specification
  • Derive all string formats defined in the specification
  • Derive integer ranges according to byte size and whether values can be negative
  • Derive union types (e.g. if one field is an integer in some instances and a string in others)
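
The integer-range feature can be illustrated with a minimal Python sketch. The bucket boundaries below (signed 16-, 32- and 64-bit maxima) and the function name integer_range are assumptions made for illustration; they are chosen to agree with the 0 to 32767 range shown in the example later in this post, and are not taken from Schema Guru's source.

```python
# Assumed bucket boundaries: signed 16-, 32- and 64-bit maxima.
INT_MAXIMA = [2**15 - 1, 2**31 - 1, 2**63 - 1]


def integer_range(value: int) -> dict:
    """Return the narrowest assumed range that contains value (hypothetical helper)."""
    for upper in INT_MAXIMA:
        lower = -upper - 1
        if lower <= value <= upper:
            # Non-negative values get a minimum of 0;
            # negative values keep the signed lower bound of the bucket.
            return {"minimum": 0 if value >= 0 else lower, "maximum": upper}
    # Larger than 64 bits: leave the range unconstrained.
    return {}


print(integer_range(42))     # {'minimum': 0, 'maximum': 32767}
print(integer_range(-5))     # {'minimum': -32768, 'maximum': 32767}
print(integer_range(70000))  # {'minimum': 0, 'maximum': 2147483647}
```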

3. Design principles

Deriving JSON Schemas from multiple instances is possible thanks to the observation that JSON Schemas form a semigroup under an associative binary merge operation. For example, the merger of these two valid schemas:

{"key": {"type": "integer"}} merge {"key": {"type": "string"}}

Will result in another valid schema:

{"key": {"type": ["integer", "string"]}}

Which is basically a union type: the key may hold either an integer or a string. To put it another way: the merger of two JSON Schemas yields a third, equally- or more-permissive schema, against which any JSON instance which validates against either or both of the two parent schemas will also validate.
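
As a rough illustration of the merge above, here is a minimal Python sketch that handles only the "type" keyword. The names merge, merge_property and as_type_list are hypothetical, and the real merge has to deal with many more keywords; the point is only to show the shape of the operation and to check associativity on a small case.

```python
def as_type_list(schema: dict) -> list:
    """Normalise "type" to a list, since JSON Schema allows a string or an array."""
    t = schema.get("type", [])
    return [t] if isinstance(t, str) else list(t)


def merge_property(left: dict, right: dict) -> dict:
    """Merge two property schemas by taking the ordered union of their types."""
    types = as_type_list(left)
    for t in as_type_list(right):
        if t not in types:
            types.append(t)
    return {"type": types[0] if len(types) == 1 else types}


def merge(left: dict, right: dict) -> dict:
    """Merge two {key: property schema} maps; keys seen on one side only are kept."""
    merged = {}
    for key in list(left) + [k for k in right if k not in left]:
        if key in left and key in right:
            merged[key] = merge_property(left[key], right[key])
        else:
            merged[key] = left.get(key, right.get(key))
    return merged


a = {"key": {"type": "integer"}}
b = {"key": {"type": "string"}}
c = {"key": {"type": "null"}}

print(merge(a, b))  # {'key': {'type': ['integer', 'string']}}
# Associativity: the grouping of the merges does not change the result.
assert merge(merge(a, b), c) == merge(a, merge(b, c))
```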

The fact that this merge operation is associative means that we should be able to scale Schema Guru to massively parallel schema-derivation workloads, running in Hadoop, Spark or similar.
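
To see why associativity matters for parallelism, consider this small Python sketch. It is a simplified stand-in, not Schema Guru code: each schema is reduced to a map from key to a set of observed type names, and a local thread pool plays the role that Hadoop or Spark workers would play. The function names and the chunk size are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def merge(left: dict, right: dict) -> dict:
    """Union the observed type names per key (simplified stand-in for schema merging)."""
    return {key: left.get(key, set()) | right.get(key, set())
            for key in left.keys() | right.keys()}


def merge_all(schemas: list) -> dict:
    """Sequentially fold a non-empty list of schemas into one."""
    return reduce(merge, schemas)


def merge_in_chunks(schemas: list, chunk_size: int = 2) -> dict:
    """Merge each chunk independently, then merge the partial results."""
    chunks = [schemas[i:i + chunk_size] for i in range(0, len(schemas), chunk_size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(merge_all, chunks))
    return merge_all(partials)


schemas = [
    {"id": {"string"}},
    {"id": {"integer"}, "length": {"integer"}},
    {"length": {"null"}},
    {"id": {"string"}, "tag": {"string"}},
    {"length": {"integer"}},
]

# Because merge is associative, chunked and sequential merging agree.
assert merge_in_chunks(schemas) == merge_all(schemas)
```

Each chunk can be merged without knowing anything about the others, which is what would let the work be spread across many machines.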

4. A fuller example

From the last example we can see that Schema Guru supports JSON Schema’s various types. But Schema Guru can also detect the various JSON Schema validation properties, such as format or maximum.

Let’s give an example. Here is a JSON instance:

{ "event": {
    "id": "f1e89550-7fda-11e4-bbe8-22000ad9bf74",
    "length": 42 }}

And a second one:

{ "event": {
    "id": 123,
    "length": null }}

Running Schema Guru against both of these instances generates the following JSON Schema:

{ "type" : "object",
  "properties" : {
    "event" : {
      "type" : "object",
      "properties" : {
        "id" : {
          "type" : [ "string", "integer" ],
          "format" : "uuid",
          "minimum" : 0,
          "maximum" : 32767 },
        "length" : {
          "type" : [ "integer", "null" ],
          "minimum" : 0,
          "maximum" : 32767 } },
      "additionalProperties" : false } },
  "additionalProperties" : false }

You can see that our generated JSON Schema now contains two properties, where:

  1. The id property could be a UUID string or a small integer
  2. The length property could be a small integer or null

As you can see, we generate a pretty strict schema, where the additionalProperties setting rules out any properties not observed in the instances fed to Schema Guru. We’re planning on adding options to Schema Guru to make these types of settings more “tunable”.
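
The derive-then-merge flow behind this example can be approximated in a short Python sketch. Everything here is an assumption made for illustration: the function names derive, merge and compact, the integer buckets, the use of uuid.UUID as an approximate UUID check, and the rule for when format and range survive a merge. It is not Schema Guru's implementation, but for the two instances above it produces the same schema.

```python
import uuid


def derive(value):
    """Derive a schema from one parsed JSON value (types are kept as lists for now)."""
    if value is None:
        return {"type": ["null"]}
    if isinstance(value, bool):
        return {"type": ["boolean"]}
    if isinstance(value, int):
        for upper in (2**15 - 1, 2**31 - 1, 2**63 - 1):  # assumed buckets
            if -upper - 1 <= value <= upper:
                return {"type": ["integer"],
                        "minimum": 0 if value >= 0 else -upper - 1,
                        "maximum": upper}
        return {"type": ["integer"]}
    if isinstance(value, float):
        return {"type": ["number"]}
    if isinstance(value, str):
        try:
            uuid.UUID(value)  # approximate check; accepts a few non-canonical forms
            return {"type": ["string"], "format": "uuid"}
        except ValueError:
            return {"type": ["string"]}
    if isinstance(value, dict):
        return {"type": ["object"],
                "properties": {k: derive(v) for k, v in value.items()},
                "additionalProperties": False}
    return {"type": ["array"]}  # array items are out of scope for this sketch


def merge(a, b):
    """Merge two derived schemas into an equally- or more-permissive one."""
    out = {"type": a["type"] + [t for t in b["type"] if t not in a["type"]]}
    # Keep "format" only if every string-typed side agrees on it.
    formats = {s.get("format") for s in (a, b) if "string" in s["type"]}
    if len(formats) == 1 and None not in formats:
        out["format"] = formats.pop()
    # Widen the integer range to cover every integer-typed side.
    ints = [s for s in (a, b) if "integer" in s["type"]]
    if ints and all("minimum" in s for s in ints):
        out["minimum"] = min(s["minimum"] for s in ints)
        out["maximum"] = max(s["maximum"] for s in ints)
    if "object" in out["type"]:
        pa, pb = a.get("properties", {}), b.get("properties", {})
        out["properties"] = {
            k: merge(pa[k], pb[k]) if k in pa and k in pb else pa.get(k, pb.get(k))
            for k in list(pa) + [k for k in pb if k not in pa]}
        out["additionalProperties"] = False
    return out


def compact(schema):
    """Collapse single-element type lists to plain strings, recursively."""
    out = dict(schema)
    if len(out["type"]) == 1:
        out["type"] = out["type"][0]
    if "properties" in out:
        out["properties"] = {k: compact(v) for k, v in out["properties"].items()}
    return out


first = {"event": {"id": "f1e89550-7fda-11e4-bbe8-22000ad9bf74", "length": 42}}
second = {"event": {"id": 123, "length": None}}
schema = compact(merge(derive(first), derive(second)))
print(schema)
```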

5. Getting help

Schema Guru is of course very young - so we look forward to community feedback on what new features to prioritize. Feel free to get in touch or raise an issue on GitHub!

6. Roadmap

We have lots of features planned for Schema Guru:

  • A web UI with the ability to adjust your schema
  • Support for other output formats such as Avro
  • Enum detection
  • Warnings about suspiciously-similar keys
  • Auto-submitting generated schemas to your Iglu repository
  • Outputting self-describing JSON Schemas
  • Running Schema Guru as a Spark job on JSON collections stored in Amazon S3 (thanks to semigroups)
  • …and much more, ideas are coming up every day!