Iglu Schema Registry 5 Scinde Dawk released
We are pleased to announce the fifth release of the Iglu Schema Registry System, with an initial release of igluctl, an Iglu command-line tool, and with Schema DDL now part of the Iglu project.
Read on for more information on Release 5 Scinde Dawk, named after the first postage stamp in Asia:
1. igluctl
The main feature of this release is our new igluctl command-line application, which collects various separate Iglu-related tools into a single easy-to-use CLI app.
igluctl includes three commands:
- static generate, which started life as Schema Guru’s ddl command
- static push, which originated as Iglu’s Registry Syncer bash script
- lint, a brand new command, which performs syntax and consistency checking for JSON Schemas
1.1 static generate
To centralize all tools related to Iglu in one place, we decided to factor out this functionality from Schema Guru and embed it into igluctl. In this release there are no new features and the command’s interface remains the same as it was for Schema Guru, except schema-guru ddl has been replaced with igluctl static generate.
1.2 static push
Another ported command is static push, which was previously a dedicated bash script inside the Iglu project on GitHub. It allows you to upload a set of JSON Schemas from a local static registry to Iglu’s Scala schema registry.
static push accepts three required positional arguments:
- input is a directory containing JSON Schemas
- host is the domain name or IP address of your Scala schema registry
- apikey is the master API key, which you must create manually and which will be used to create temporary read and write keys (these are deleted automatically after the command completes)
You can find out more about Iglu’s Scala schema registry and how to set it up on its dedicated wiki page.
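As a rough illustration of how the three arguments fit together, here is a minimal Python sketch that assembles a static push invocation. It assumes the positional arguments are passed in the order listed above (input, host, apikey) and that the igluctl executable sits in the current directory; the function names and the sample values are hypothetical.

```python
import subprocess
from typing import List


def build_static_push_command(input_dir: str, host: str, apikey: str,
                              executable: str = "./igluctl") -> List[str]:
    """Assemble the argument list for `igluctl static push`.

    Assumption: the three positional arguments are passed in the order
    in which the post lists them (input, host, apikey).
    """
    for name, value in (("input", input_dir), ("host", host), ("apikey", apikey)):
        if not value:
            raise ValueError(f"{name} is required")
    return [executable, "static", "push", input_dir, host, apikey]


def push_schemas(input_dir: str, host: str, apikey: str) -> int:
    """Run the command and return its exit status (needs igluctl on disk)."""
    return subprocess.run(build_static_push_command(input_dir, host, apikey)).returncode


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    print(build_static_push_command("./schemas", "iglu.example.com", "MASTER-API-KEY"))
```

Keeping the command construction separate from its execution makes it easy to log or review the exact invocation before the master API key is used.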
1.3 lint
The third and most exciting subcommand of igluctl is lint, which allows you to perform syntax and consistency checking of your JSON Schemas. lint is an all-new command, and we’re not aware of anything similar outside of the Snowplow ecosystem.
igluctl lint accepts one required argument, input, which is a path to a local static registry or a single JSON Schema, and one optional argument, --skip-warnings, which forces igluctl to omit warnings about unknown properties if required.
A typical use of the lint command would look like the following:
We strongly advise you to use igluctl lint to increase the quality of your JSON Schemas! This command can surface difficult-to-detect mistakes in the schema’ing process across the following categories:
Syntax errors. These are the most obvious errors. Each JSON Schema must conform to the JSON Meta Schema, which states, for example, that the value of the maximum property must be a number, the value of required must be a non-empty array, and so on.
Consistency check. Unfortunately, the current specification of JSON Schema does not include some checks. For example, if you have the key foo in the required property, some predefined set of keys in properties that does not include foo, and at the same time additionalProperties set to false, it’s still a valid JSON Schema, but it cannot validate any possible JSON instance. igluctl can identify these and other inconsistencies.
Iglu-specific errors. These are errors that don’t make a JSON Schema invalid or unusable in itself, but they do make its behavior unpredictable inside Iglu-aware applications. The most notable example is when the schema’s path within the Iglu static registry conflicts with the schema’s self-describing metadata.
Minor errors. These are errors that other validation tools mark as warnings (and that can be omitted with --skip-warnings), but which can in fact prove critical. The most common case is an unknown property: JSON Schema tolerates this and states that it is up to the validating application to implement some property, but this latitude can lead to confusion and further mistakes. For example, we’ve seen people confuse the maximum property with maxValue from JSON Schema pre-DraftV4, which means that the JSON Schema will validate incorrect instances, and tools like static generate will then produce incorrect DDL.
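To make three of these categories more concrete, here is a minimal Python sketch of the kind of checks involved. It is not igluctl's implementation: the function names are hypothetical, the keyword list is only a subset of JSON Schema draft 4, the unknown-property check looks at top-level keys only, and the registry path is assumed to end in vendor/name/format/version.

```python
from typing import Any, Dict, List

# Subset of JSON Schema draft 4 keywords, plus the "self" key used by
# self-describing schemas; a real linter would use the complete list.
KNOWN_KEYWORDS = {
    "$schema", "id", "self", "title", "description", "type", "properties",
    "additionalProperties", "patternProperties", "required", "items",
    "enum", "minimum", "maximum", "minLength", "maxLength", "format",
    "pattern", "minItems", "maxItems", "oneOf", "anyOf", "allOf", "not",
    "definitions", "default",
}


def unsatisfiable_required(schema: Dict[str, Any]) -> List[str]:
    """Consistency check: required keys that no instance may contain."""
    if schema.get("additionalProperties") is not False:
        return []
    if schema.get("patternProperties"):
        return []  # a key could still match a pattern, so stay conservative
    declared = set(schema.get("properties", {}))
    return [key for key in schema.get("required", []) if key not in declared]


def unknown_properties(schema: Dict[str, Any]) -> List[str]:
    """Minor errors: top-level keys that are not known keywords."""
    return sorted(key for key in schema if key not in KNOWN_KEYWORDS)


def path_conflicts_with_self(path: str, schema: Dict[str, Any]) -> bool:
    """Iglu-specific check: does the registry path disagree with `self`?

    Assumption: the path ends in vendor/name/format/version.
    """
    meta = schema.get("self", {})
    expected = [meta.get(k) for k in ("vendor", "name", "format", "version")]
    return path.strip("/").split("/")[-4:] != expected


if __name__ == "__main__":
    schema = {
        "self": {"vendor": "com.acme", "name": "click",
                 "format": "jsonschema", "version": "1-0-0"},
        "type": "object",
        "properties": {"bar": {"type": "string"}},
        "required": ["foo"],
        "additionalProperties": False,
    }
    print(unsatisfiable_required(schema))
    print(unknown_properties({"type": "integer", "maxValue": 10}))
    print(path_conflicts_with_self("schemas/com.acme/click/jsonschema/1-0-1", schema))
```

In the sample schema, foo is required but can never appear, because additionalProperties is false and properties only declares bar; that is the inconsistency described under "Consistency check" above.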
Performing these kinds of checks will help you to maintain the quality of your JSON Schemas at a high level, which can reduce data loss and increase the stability of your data pipeline. In future versions we’re planning to introduce severity levels to handle even more subtle things that can possibly lead to undesired behavior.
2. Schema DDL
The standalone Schema DDL library has been in use inside Snowplow for about a year, providing a partial abstract syntax tree for Redshift tables. As part of the restructuring of Iglu, we are moving Schema DDL into the main Iglu project.
As part of this move, the main package is now com.snowplowanalytics.iglu.schemaddl, instead of the previous com.snowplowanalytics.schemaddl. This breaking change allowed us to reorganize the existing package structure, making the Redshift DDL AST available as com.snowplowanalytics.iglu.schemaddl.redshift, the first of many. This highlights the purpose of the Schema DDL project: to contain abstract syntax trees (ASTs) and related functions for various data definition languages and schema formats.
And to double down on this, in this release we also introduce a new AST for JSON Schema, available at com.snowplowanalytics.iglu.schemaddl.jsonschema. JSON Schema’s AST can be used to parse arbitrary JSON into a typesafe AST and drives schema linting and DDL derivation for JSON Schema; in future it will be used more widely for various Iglu-related tasks.
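To give a feel for what parsing arbitrary JSON into a typed AST means, here is a minimal Python sketch. It is not the Schema DDL API, which is a Scala library: the SchemaNode class and parse_schema function are hypothetical, and only four keywords are modelled.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class SchemaNode:
    """Typed view of a small subset of JSON Schema keywords."""
    type: Optional[str] = None
    maximum: Optional[float] = None
    required: Tuple[str, ...] = ()
    properties: Dict[str, "SchemaNode"] = field(default_factory=dict)


def parse_schema(raw: Any) -> SchemaNode:
    """Convert untyped JSON into a SchemaNode, rejecting ill-typed keywords."""
    if not isinstance(raw, dict):
        raise TypeError("a schema must be a JSON object")
    schema_type = raw.get("type")
    if schema_type is not None and not isinstance(schema_type, str):
        raise TypeError("type must be a string in this sketch")
    maximum = raw.get("maximum")
    # bool is a subclass of int in Python, so exclude it explicitly
    if maximum is not None and (isinstance(maximum, bool)
                                or not isinstance(maximum, (int, float))):
        raise TypeError("maximum must be a number")
    required = raw.get("required", [])
    if not isinstance(required, list) or not all(isinstance(k, str) for k in required):
        raise TypeError("required must be an array of strings")
    properties = raw.get("properties", {})
    if not isinstance(properties, dict):
        raise TypeError("properties must be an object")
    return SchemaNode(
        type=schema_type,
        maximum=maximum,
        required=tuple(required),
        properties={key: parse_schema(value) for key, value in properties.items()},
    )
```

Once a schema has been parsed into such a structure, later stages such as linting or DDL derivation can rely on field types instead of re-checking raw JSON.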
The Schema DDL artifact is now also available on JCenter and Maven Central, and can be included in an SBT project as follows:
3. Migration guide
Given that Redshift table and JSON Paths file generation is now available as part of igluctl, we will be deprecating the schema-guru ddl command; of course, everything related to JSON Schema derivation remains. We therefore strongly encourage you to switch to igluctl as soon as possible for DDL generation.
You can download igluctl from Bintray using the following link:
Migration should be fairly easy: you just need to replace ./schema-guru ddl with ./igluctl static generate. All options remain the same with only two minor behavioral changes:
- The command now exits with status 1 if any error has been encountered in any JSON Schema
- The default Redshift encoding for BOOLEAN column is now
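For build scripts that call Schema Guru programmatically, the migration could look roughly like the following Python sketch. The helper names are hypothetical and ./schemas is a placeholder path; the only facts taken from this post are the command rename, the unchanged options, and the exit status of 1 on error.

```python
import subprocess
from typing import List


def migrate_command(argv: List[str], executable: str = "./igluctl") -> List[str]:
    """Rewrite a `schema-guru ddl ...` invocation for igluctl.

    Everything after `ddl` is kept as-is, since the options are unchanged.
    """
    if len(argv) >= 2 and argv[0].endswith("schema-guru") and argv[1] == "ddl":
        return [executable, "static", "generate"] + argv[2:]
    return list(argv)


def generate_or_fail(argv: List[str]) -> None:
    """Run the migrated command and stop the build on a non-zero exit status."""
    result = subprocess.run(migrate_command(argv))
    if result.returncode != 0:
        raise RuntimeError(f"DDL generation failed with status {result.returncode}")


if __name__ == "__main__":
    # "./schemas" is a placeholder input directory.
    print(migrate_command(["./schema-guru", "ddl", "./schemas"]))
```

Because the command now reports errors through its exit status, a wrapper like generate_or_fail can halt a deployment before faulty DDL is applied.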
4. Iglu roadmap
At this moment, we have two major independent goals for Iglu:
- First-class support for database table definitions and mappings between these definitions and corresponding schemas. This should allow users of the Snowplow platform to concentrate on schema definitions and forget about tedious table deployments and manual data migrations
- Schema inference. This is one more step towards making Iglu “just work”, without users having to do exhaustive upfront schema definition