Snowplow RDB Loader R31 released
We are pleased to announce the release of Snowplow RDB Loader R31. This updates the format of bad rows emitted to a new format, as part of a broader piece of work we have been undertaking to improve Snowplow’s capabilities around bad row handling, debugging and recovery; and fixes a data quality issue with synthetic duplicates in the RDB Shredder.
Please read on after the fold for:
1. New bad rows format
At the beginning of 2019 we at Snowplow initiated a big effort to redesign the Snowplow bad row format. The purpose of the new format is to make bad rows easier to query and therefore monitor, debug and recover, by making them much more highly structured. In the new approach, bad rows will be in a self-describing JSON format.
While the primary source of bad rows for Snowplow users is the Stream Enrich job that validates every event being processed by the pipeline, every microservice that makes up a Snowplow pipeline, including the RDB Loader, will emit a bad row if an attempt to process the event (in this case load it into Redshift) is unsuccessful. We’re happy to announce that Snowplow RDB Shredder is the first component of Snowplow pipeline released supporting the new format of bad rows.
The RDB Shredder 0.15.0 can produce the following types of bad rows:
iglu:com.snowplowanalytics.snowplow.badrows/loader_parsing_error/jsonschema/1-0-0- a generic loader error, caused if the event is not a valid “enriched event”
iglu:com.snowplowanalytics.snowplow.badrows/loader_iglu_error/jsonschema/1-0-0- an error raised by Iglu Client
iglu:com.snowplowanalytics.snowplow.badrows/loader_runtime_error/jsonschema/1-0-0- an unstructured error, raised by one of low-level shredding components, such as DynamoDB outage during cross-batch deduplication
2. Synthetic duplicates issue
In Snowplow R86 Petra we introduced an in-batch synthetic deduplication step. This step identified rows of data that were distinct (i.e. have different event fingerprints), but that had the same
com.snowplowanalytics.snowplow/duplicate/jsonschema/1-0-0 context that is attached to any event that is mutated in this way.
Recently, we discovered an anomaly in shredded data, where some events, which were clearly synthetic duplicates, were not “attached” to any context or unstructured event tables. Vice versa, some contexts and unstructured events had
root_id values which did not correspond to any rows in the
atomic.events table, effectively resulting in “orphaned events”.
It turned out that this was caused by an issue with how Apache Spark caches parts of its execution DAG. After the shred job creates an event object with a new unique
event_id, it caches this object and proceeds to the next step of creating an object containing all shredded entities with the same id. But if for some reasons there is not enough memory to store the initial cached object, a new event object would be generated, along with a new unique event id. This second event id would only be used only in the shredded entities since the original event would already be written to HDFS, resulting in the
event_id on the row in the events table not matching the
root_id on the associated shredded tables.
As a result, it was possible to have an event with an original event ID of A, end up with:
Cin all contexts and unstruct event tables
duplicate_1table (but with
Ain its column)
This problem manifests itself quite rarely, depending on cluster load (which affects cache utiliztion) and overall amount of synthetic duplicates. In an analysis of a large number of our customers, we found it occurs to ~1% of synthetic duplicates.
In order to check if your pipeline has been affected, you can count how many rows in
com_snowplowanalytics_snowplow_duplicate_1 don’t have a parent in
Whilst this issue only impacts a small % of data, and then only synthetic duplicates that are generated by bot activity, we have still treated this issue as a data quality issue and prioritised fully understanding the issue and putting out a fix.
3. Other improvements
- We bumped Apache Spark to 2.3.2, which means that the required EMR AMI version is now 5.19. This means that M5 and C5 instances are now supported
- Iglu Scala Client has been updated to 0.6.0, making the JSON validator stricter #141
- RDB Shredder does not depend on Scala Common Enrich anymore and works on top of Scala Analytics SDK
- String values in atomic events are truncated according to
atomicJSON Schema rather than hard-coded values #143
4.1 Checks to perform prior to upgrading
The new RDB Loader uses a new version of the Iglu Client library, that itself uses a new library for JSON schema validation. This is stricter than the old library. As a result, a number of issues with schemas that previously wouldn’t have prevented RDB Loader from successfully loading events will now cause events to fail to successfully load. We therefore recommend checking the following before performing upgrade.
4.1.1 All schemas have the correct
$schema value set
All Iglu Schemas are self-describing themselves, and therefore contain a
$schema property. The property should be set as follows:
i.e. the value should always be
http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#. This is the “meta-schema” that describes the schemas themselves.
The previous version of Iglu Client library was insensitive to the value that this field was set at, so it wasn’t uncommon for Snowplow users to create schemas and set alternative values. (E.g. owing to a misunderstanding about what the property meant, and what that value should be.)
Please therefore check this value in any schemas in your own Iglu Schema Registry before performing an upgrade, as any mistakes here may cause the corresponding events or contexts to not be loaded into Redshift successfully.
4.1.2 Any string fields that are specified format
uri have to be full URIs and not simply paths
The jsonschema standard supports a
uri format for strings.
The previous Iglu Client library would accept paths e.g.
/my/path as valid URIs. The new library will not, and as a result any fields defined in the schema as a URI but with paths as values will not load into Redshift successfully.
4.2 Performing the upgrade
If you are using EmrEtlRunner, you’ll need to update your
Bear in mind that the new minimum required AMI version might now support some old instance types, such as c1.medium or m1.small. In that case you need to upgrade your instance type as well.
Once you have upgraded we recommended closely monitoring the location on S3 where the RDB Loader is configured to write any bad rows, to make sure that none of the issues flagged in 4.2.1 mean that data has failed to load into Redshift as a result of the upgrade.
5. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.