Almost a month has passed since the first release of Schema Guru, our tool for deriving JSON Schemas from multiple JSON instances. That release was something of a proof-of-concept - in this 0.2.0 release we are adding much richer functionality, plus deeper integration with the Snowplow platform.
This release post will cover the following new features:
- Web UI
- Newline-delimited JSON
- Duplicated keys warning
- Base64 pattern
- Schema segmentation
- Self-describing schemas
- Getting help
- Plans for the next release
The first big feature of version 0.2.0 is the new web UI, which you can try out at schemaguru.snowplowanalytics.com.
Sometimes you just want to create a schema quickly and don’t want to mess with a CLI. For this use case we implemented a single-page web app version of Schema Guru which embeds the same logic as the CLI.
The web UI also shows you a “diff” of how your schema changes with the addition of each extra JSON instance:
Frequently you will have multiple JSON instances stored in a single file; in fact a specification for this exists, called Newline Delimited JSON (NDJSON). The specification states that every JSON instance must occupy a single line and be separated from the others by a newline character.
The specification also states that files following this format must have the .ndjson extension; for now, if you want the Schema Guru web UI to process NDJSON, your files must have the .ndjson extension.
You can also configure the Schema Guru CLI to process NDJSON files by passing it the --ndjson flag. Again, if you want to process a whole directory of NDJSON, each file must currently have the .ndjson extension.
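As a rough illustration of the format (not Schema Guru's own parser), the sketch below reads NDJSON by treating every non-empty line as one JSON instance. The function name `parse_ndjson` and the sample events are made up for this example.

```python
import json


def parse_ndjson(text):
    """Parse newline-delimited JSON: one JSON instance per non-empty line."""
    instances = []
    # Split on "\n" only, so line-separator characters inside strings are left alone
    for line_number, line in enumerate(text.split("\n"), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. after a trailing newline
        try:
            instances.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number} is not valid JSON: {err}") from err
    return instances


# Hypothetical sample data
sample = '{"event": "view", "page": "/home"}\n{"event": "click", "target": "buy"}\n'
instances = parse_ndjson(sample)
```

Each parsed instance can then be fed to schema derivation exactly as if it had come from its own file.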
Developers are humans too and can sometimes make mistakes when generating JSONs. One common problem is inconsistent key casing, for example if the last version of your app was written in Python and used snake_case for its keys, while the new version of your app is written in Java and uses camelCase. Another common issue is typos introduced into JSON property names.
Now if Schema Guru encounters suspiciously similar keys, it will warn you; this works both in the CLI and the web UI. Under the hood we use Levenshtein distance to detect the duplicated keys.
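To make the idea concrete, here is a minimal sketch of flagging suspiciously similar keys with Levenshtein distance. It is not Schema Guru's implementation: the `similar_keys` helper, the threshold of 2 and the sample keys are all assumptions chosen for illustration.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similar_keys(keys, max_distance=2):
    """Return pairs of distinct keys within max_distance edits of each other.

    max_distance is a hypothetical threshold, not a Schema Guru setting.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(set(keys)), 2)
        if levenshtein(a, b) <= max_distance
    ]


# Hypothetical keys collected from several JSON instances
pairs = similar_keys(["user_id", "userId", "timestamp", "timestmp", "country"])
```

With these sample keys, the case conflict (user_id and userId) and the typo (timestamp and timestmp) are both flagged, while country is left alone. A fixed threshold is the simplest choice; scaling it with key length would reduce false positives on very short keys.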
In the previous release we implemented all of the string formats supported by the JSON Schema specification. Another common format for strings in JSON is Base64 encoding. From this release, if a string value matches the Base64 regular expression, Schema Guru will add this regex as the string’s pattern.
As with Schema Guru’s string format detection, if even a single input JSON instance does not match the pattern, then the pattern won’t be added to the final schema.
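The all-or-nothing rule can be sketched as follows. The regular expression is a widely used Base64 pattern and is assumed here for illustration; it is not necessarily the exact regex Schema Guru emits, and `pattern_for` is a hypothetical helper.

```python
import re

# A common Base64 regex (assumption: Schema Guru's actual pattern may differ)
BASE64_PATTERN = r"^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$"
BASE64_REGEX = re.compile(BASE64_PATTERN)


def pattern_for(values):
    """Return the Base64 pattern only if every observed string value matches it."""
    if values and all(BASE64_REGEX.fullmatch(value) for value in values):
        return BASE64_PATTERN
    return None


# Every value matches, so the pattern is kept
all_base64 = pattern_for(["aGVsbG8=", "d29ybGQ="])
# A single non-matching value is enough to drop the pattern
mixed = pattern_for(["aGVsbG8=", "not base64!"])
```

Note that a pattern like this also accepts ordinary strings whose length is a multiple of four and which use only Base64 characters, so a small sample can produce a false positive.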
We are pleased to add support for another JSON Schema feature: enums.
By default enum recognition is disabled; to enable it, specify an enum cardinality tolerance in either the CLI or the web UI. If the number of discrete values found for a JSON property is less than or equal to this cardinality, then the property will be defined using a fully-specified enum in the JSON Schema.
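The cardinality rule above can be sketched like this. The function names and sample values are hypothetical, and a tolerance of 0 stands in for "disabled"; Schema Guru's internals may differ.

```python
def derive_enum(values, cardinality):
    """Return the sorted distinct values if their count is within the tolerance."""
    distinct = set(values)
    if 0 < len(distinct) <= cardinality:
        return sorted(distinct)
    return None


def string_property_schema(values, enum_cardinality=0):
    """Build a schema fragment for a string property.

    enum_cardinality=0 models the default of enum recognition being disabled.
    """
    schema = {"type": "string"}
    enum = derive_enum(values, enum_cardinality)
    if enum is not None:
        schema["enum"] = enum
    return schema


# Hypothetical values observed for one property across many instances
statuses = ["active", "inactive", "active", "pending"]
with_enum = string_property_schema(statuses, enum_cardinality=3)
without_enum = string_property_schema(statuses, enum_cardinality=2)
```

Here three distinct values fit within a tolerance of 3, so the property gets an enum; with a tolerance of 2 it stays a plain string.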
In future versions we plan to add pre-defined enum sets such as ISO 4217, ISO 3166-1, months, days of the week, etc.
Sometimes you will have a whole collection of newline-delimited JSONs which are lumped into the same folder but represent a set of fundamentally different types. Good examples of this are the JSON event archives provided by analytics companies such as Mixpanel, Keen.io and Segment.
To derive JSON Schemas from these JSON collections, you can now use a JSON Path to specify which property in the JSON instances determines the type of the JSON instance, and thus which named JSON Schema the instance will be used to derive.
Let’s take these two JSONs:
These JSONs contain information about two different event types, so we should use them to derive two distinct schemas. We can use the new
--schema-by CLI argument to achieve this:
Now at least two schemas will be written to the
If any supplied JSON instance doesn’t contain the property at the specified JSON Path, or the property is not a string, then that instance will instead be used to derive a new
unmatched.json JSON Schema.
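The grouping step behind segmentation can be sketched as below. This is a simplification under stated assumptions: `resolve` handles only plain dotted paths such as `$.event` rather than full JSON Path, and the sample events and function names are invented for the example.

```python
from collections import defaultdict


def resolve(instance, path):
    """Follow a simple dotted path such as '$.event'; return None if absent.

    Only a small subset of JSON Path is supported in this sketch.
    """
    current = instance
    for key in path.lstrip("$").strip(".").split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def segment(instances, path):
    """Group instances by the string found at path."""
    groups = defaultdict(list)
    for instance in instances:
        name = resolve(instance, path)
        # Missing or non-string values fall into the "unmatched" group
        key = name if isinstance(name, str) else "unmatched"
        groups[key].append(instance)
    return dict(groups)


# Hypothetical events, not the ones from the original post
events = [
    {"event": "purchase", "amount": 9.99},
    {"event": "page_view", "url": "/pricing"},
    {"event": 42},
    {"url": "/about"},
]
groups = segment(events, "$.event")
```

Each group would then be used to derive its own schema, with the two instances that lack a string event property contributing to the unmatched one.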
The last new feature is support for self-describing JSON Schema. Enabling this feature will add metadata to the schema, specifically the properties: vendor, name, version and format.
For now, the format will always be
jsonschema. You can specify the other properties manually with the following CLI options:
--version 2-0-0. The default is 0-1-0 but will be changed to 1-0-0 following this bug fix
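For illustration, the sketch below attaches the four metadata properties to a derived schema. It assumes they live under a top-level `self` key, following Snowplow's self-describing JSON Schema convention; the helper name, the `com.acme` vendor and the `purchase` name are hypothetical.

```python
def make_self_describing(schema, vendor, name, version="0-1-0"):
    """Return a copy of schema with self-describing metadata attached.

    Assumes the metadata sits under a top-level "self" key; the default
    version mirrors the 0-1-0 default mentioned above.
    """
    described = {
        "self": {
            "vendor": vendor,
            "name": name,
            "format": "jsonschema",  # currently the only format
            "version": version,
        }
    }
    described.update(schema)
    return described


# Hypothetical derived schema and metadata values
derived = {"type": "object", "properties": {"amount": {"type": "number"}}}
described = make_self_describing(derived, vendor="com.acme", name="purchase")
```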
If you are segmenting schemas with
--schema-by, then the
name property will be auto-filled, so the only required option is
Simply download the latest Schema Guru from Bintray:
Assuming you have a recent JVM installed, running should be as simple as:
For more details on this release, please check out the Schema Guru 0.2.0 release on GitHub.
We will be building a dedicated wiki for Schema Guru to support its usage; in the meantime, if you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.
In our next release we are planning to:
- Implement Apache Spark support to allow the derivation of JSON schemas from much larger JSON archives stored in Amazon S3
- Make the new web UI more user-friendly and feature-rich
- Improve the integration of Schema Guru with our upcoming iglu-utils tool