We’re tremendously excited to announce the new 0.6.0 release of the Iglu Scala Client, the library in charge of schema resolution and data validation in all Snowplow components, including enrichment jobs and loaders. This release brings an enormous number of API changes that we’ve made to facilitate the implementation of Snowplow Platform Improvement Proposals, including the new bad rows format, Amazon Redshift automigrations and the deprecation of the batch pipeline.
In the rest of this post we will cover:
- API Changes
- Semantic Changes
- New Validator
- Roadmap and Upcoming Features
- Getting Help
1. API Changes
Iglu Scala Client 0.6.0 exposes a new class called Client, which consists of two independent entities: Resolver is responsible for schema resolution, caching and error handling, while Validator receives resolved schemas and the datum the user wants to validate, and returns the validation report. These entities can be used separately or even redefined by the user, but we recommend using the Client class as an abstraction for the most common use case: validation of self-describing entities.
Client class defines only one function:
- F[_] is an abstract effect type, requiring tagless final capabilities such as cats.effect.Clock, as well as an instance of the cats.Monad type class. The two most common effect types, including Id, have all the necessary machinery provided out of the box, but users can also define this machinery for ZIO, Monix Task or sophisticated test types
- A is the type of the self-describing entity, such as JSON. Since 0.6.0, Iglu Client is based primarily on the circe Json type, but we’re trying to keep it generic wherever possible
- ClientError is a possible unsuccessful outcome, at either the resolution or the validation step. Unlike ProcessingMessage from pre-0.6.0, it provides type-safe and well-structured information about the failure. This type is widely used in the upcoming Snowplow bad rows
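To make the split between resolution and validation more concrete, here is a minimal toy model in plain Scala. It is only a sketch: every type and method name in it (SchemaKey, Schema, SelfDescribing, ValidationError, lookup, validate, check) is a hypothetical stand-in, it uses plain Either rather than an effect type F[_], and it is not the Iglu Scala Client API.

```scala
// Toy model only: these are NOT the Iglu Scala Client types or signatures.
object ClientSketch {
  // Hypothetical stand-ins for a schema key, a schema and a self-describing datum
  final case class SchemaKey(vendor: String, name: String, version: String)
  final case class Schema(requiredFields: Set[String])
  final case class SelfDescribing(schema: SchemaKey, data: Map[String, String])

  // Either step can fail; the error type says which step it was
  sealed trait ClientError
  final case class ResolutionError(key: SchemaKey) extends ClientError
  final case class ValidationError(missing: Set[String]) extends ClientError

  // Resolver: finds a schema (an in-memory map stands in for real registries)
  final class Resolver(registry: Map[SchemaKey, Schema]) {
    def lookup(key: SchemaKey): Either[ClientError, Schema] =
      registry.get(key).toRight(ResolutionError(key))
  }

  // Validator: checks a datum against an already-resolved schema
  object Validator {
    def validate(data: Map[String, String], schema: Schema): Either[ClientError, Unit] = {
      val missing = schema.requiredFields -- data.keySet
      if (missing.isEmpty) Right(()) else Left(ValidationError(missing))
    }
  }

  // Client: resolve first, then validate; the first failure short-circuits
  final class Client(resolver: Resolver) {
    def check(instance: SelfDescribing): Either[ClientError, Unit] =
      resolver.lookup(instance.schema).flatMap(s => Validator.validate(instance.data, s))
  }
}
```

Because the two pieces only meet inside the toy Client, either one could be swapped for a custom implementation without touching the other, which is the property the real design is aiming for.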
From this very short excerpt an astute Scala developer might notice that we replaced several libraries with their modern counterparts:
- Circe is used instead of Json4s. Circe provides much better performance characteristics, does not rely on runtime reflection, provides a very clean idiomatic API and has remained one of the most popular JSON libraries in the Scala ecosystem for the last couple of years
- Cats Effect is used for managing side effects, instead of implicit effect management. One can use Iglu Client 0.6.0 without bothering about Cats Effect, but it is highly recommended in async-heavy environments
- Cats is used instead of Scalaz 7.0 as the FP library of choice. Cats is a transitive dependency of both Circe and Cats Effect, which makes it a natural choice, since none of our other dependencies bring in Scalaz
- networknt json-schema-validator is used instead of FGE JSON Validator. This is one more library change tied to our bad rows effort. networknt json-schema-validator is actively maintained, provides a clean API and shows very impressive results in benchmarks
You can find more usage examples on the dedicated wiki page.
2. Semantic Changes
In the batch ETL world, we tried to reduce the load on Iglu registries with a very simple retry-and-cache algorithm: it made a configurable number of attempts before deciding that a schema was missing or invalid, and then cached this failure. The only thing that could reset the cached value was the cacheTtl property, which forced the resolver to retry whether the cached value was a success (in case somebody had mutated the schema) or a failure (in case the registry had suffered a long outage).
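As a rough illustration of that pre-0.6.0 caching behaviour, the sketch below models a cache in which successes and failures alike are only dropped once a TTL has elapsed. The names (TtlCache, Entry, put, get) and the use of plain seconds for time are assumptions made for the example; this is not the resolver's actual cache implementation.

```scala
// Illustrative sketch only, not the resolver's real cache.
object TtlCacheSketch {
  // A cached lookup outcome (success or failure) and the time it was stored, in seconds
  final case class Entry[A](result: Either[String, A], storedAt: Long)

  final class TtlCache[K, A](ttlSeconds: Long) {
    private var entries = Map.empty[K, Entry[A]]

    // Store an outcome, overwriting whatever was cached for this key
    def put(key: K, result: Either[String, A], now: Long): Unit =
      entries = entries.updated(key, Entry(result, now))

    // Return the cached outcome unless it is older than the TTL;
    // a cached failure is served just like a cached success
    def get(key: K, now: Long): Option[Either[String, A]] =
      entries.get(key).filter(e => now - e.storedAt < ttlSeconds).map(_.result)
  }
}
```

With a ten-minute TTL, a failure cached at the start of a short registry outage keeps being returned for the full ten minutes, even if the registry recovers after a few seconds.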
This approach no longer works in a real-time-first world. There is no meaningful number of attempts that the resolver should make before considering a schema missing or invalid. A streaming application can keep running for many weeks without a restart; if during that time a registry goes down for a couple of minutes while the resolver is trying to resolve a schema, all data for that schema will be considered invalid until the next TTL eviction. Retries won’t help here, because they all happen within a short period of time.
However, we still need some retry behavior, because registries can always go offline. In a streaming world, the best practice for retries is a backoff period. In 0.6.0 the Iglu resolver attempts to refetch failed schemas with a steadily growing period of time between attempts, from subsecond delays up to approximately 20 minutes. Importantly, these re-attempts are made only for unsuccessful responses.
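One way to picture such a schedule is a capped exponential backoff. The sketch below is only an assumption about its shape: the 500 ms starting delay and the doubling factor are made up for illustration, and only the "subsecond" start and the roughly 20-minute ceiling come from the description above.

```scala
import scala.concurrent.duration._

// Illustrative backoff schedule; the real resolver's constants and growth curve may differ.
object BackoffSketch {
  val InitialDelay: FiniteDuration = 500.millis // assumed "subsecond" starting point
  val MaxDelay: FiniteDuration = 20.minutes     // ceiling mentioned in the post

  /** Delay to wait after the given number of failed attempts (1-based):
    * doubles with every attempt and never exceeds MaxDelay. */
  def delayAfter(attempts: Int): FiniteDuration = {
    require(attempts >= 1, "attempts must be positive")
    // Bound the exponent so the multiplication cannot overflow a Long
    val exponent = math.min(attempts - 1, 30)
    val millis = InitialDelay.toMillis * (1L << exponent)
    math.min(millis, MaxDelay.toMillis).millis
  }
}
```

Under these assumed constants the delay reaches the 20-minute ceiling after thirteen failed attempts and stays there.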
The ResolutionError data type (a subtype of ClientError) has two properties that reflect the history of attempts: lastAttempt, the timestamp of the most recent attempt, and attempts, the number of attempts made so far.
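These two properties are enough to decide whether another attempt is due. The sketch below shows one possible way to use them; FailureHistory, backoff, retryDue and failedAgain are hypothetical names, and the doubling schedule is an assumption rather than the resolver's actual logic.

```scala
import java.time.{Duration, Instant}

// Illustrative only: a hypothetical record carrying the two properties named above.
object RetrySketch {
  final case class FailureHistory(lastAttempt: Instant, attempts: Int)

  // Assumed schedule: 500 ms, doubling per attempt, capped at 20 minutes
  def backoff(attempts: Int): Duration = {
    val exponent = math.min(math.max(attempts - 1, 0), 30)
    Duration.ofMillis(math.min(500L << exponent, 20L * 60 * 1000))
  }

  // A retry is due once the backoff period since the last attempt has elapsed
  def retryDue(history: FailureHistory, now: Instant): Boolean =
    !now.isBefore(history.lastAttempt.plus(backoff(history.attempts)))

  // Record one more failed attempt made at the given time
  def failedAgain(history: FailureHistory, now: Instant): FailureHistory =
    FailureHistory(now, history.attempts + 1)
}
```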
3. New Validator
As mentioned before, Iglu Scala Client uses a new JSON Schema validator under the hood (however, it can be replaced with any custom one). Even though this validator also targets JSON Schema spec v4, it may still have incompatibilities with our previous JSON Schema validator. As a result, some instances that were considered valid by Iglu Scala Client pre-0.6.0 may now silently be treated as invalid.
Here’s a short list of the most widely used Snowplow components we’re planning to release with Iglu Client 0.6.0:
- Stream Enrich 0.22.0 (Snowplow R117)
- RDB Shredder 0.15.0 (RDB Loader R31)
- BigQuery Loader 0.2.0
Please monitor the bad rows produced by the above assets.
This is a huge release that overhauls a core part of Snowplow, and we have been developing and testing it since Fall 2018. During this time, we received an enormous number of contributions from outside the core Snowplow Engineering team. Huge thanks to our Summer 2018 intern Andrzej Sołtysik, Hacktoberfest 2018 participant Sajith Appukuttan, and our partner from The Globe and Mail Inc., Saeed Zareian.
5. Roadmap and Upcoming Features
This release is planned to be the last one in the 0.x series. The next release will likely include a relatively small number of user-facing improvements and will be versioned 1.0.0, marking the stability of the API. From 1.0.0 onwards, we plan to introduce MiMa compatibility checks to our libraries to make the update process more reliable.