1. Bad Row Support
Data quality has always been a huge focus for Snowplow. Being a non-lossy data pipeline is one of the central pieces of this data quality puzzle. Indeed, other pieces such as data validity (through our schema validation technology) or data richness (through our enrichments) connect directly with the non-lossiness piece.
In practice, for a Snowplow pipeline, non-lossiness has meant that when something goes wrong anywhere in the pipeline, the affected data is not discarded but parked as “bad” for later inspection, fixing and reprocessing. In Snowplow jargon, this bad data is called “bad rows”.
With this release, we are adding bad row support, in the new bad row format, to Snowflake Transformer. From now on, Snowflake Transformer will continue to run when it encounters unexpected data, writing the bad rows to a separate location for further inspection instead of halting with an exception. Snowflake Loader is the second component of the pipeline to introduce the new bad row format, after RDB Loader.
In order to make bad rows easy to understand, we separated them into two types: loader parsing error and Snowflake error. As the names imply, a loader parsing error represents a case where an enriched event cannot be parsed into a Snowplow Event, while a Snowflake error represents a case where the enriched event is parsed without problem but something goes wrong internally while transforming it into a format suitable for Snowflake. You can look at the schema of loader parsing error and the schema of Snowflake error if you want to learn more about their format.
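The two-stage split described above can be illustrated with a small Python sketch. This is not the Transformer's actual code: the three-column tab-separated layout, the field names and the helper functions are all hypothetical, chosen only to show how a failure at the parsing stage and a failure at the transformation stage end up as two different bad row types while processing continues.

```python
import json
from dataclasses import dataclass

# Hypothetical column count for this toy example; real enriched events have many more columns.
EXPECTED_FIELDS = 3


@dataclass
class LoaderParsingError:
    """The raw line could not be parsed into an event at all."""
    payload: str
    message: str


@dataclass
class SnowflakeError:
    """The event parsed fine, but transforming it for Snowflake failed."""
    payload: str
    message: str


def parse_event(line: str) -> dict:
    """Stage 1: split the tab-separated line into named fields (hypothetical layout)."""
    fields = line.split("\t")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    return dict(zip(("event_id", "app_id", "contexts"), fields))


def to_snowflake_row(event: dict) -> dict:
    """Stage 2: decode the JSON column; json.loads raises on malformed input."""
    row = dict(event)
    row["contexts"] = json.loads(event["contexts"])
    return row


def transform(line: str):
    """Return a good row, or a bad row whose type records which stage failed."""
    try:
        event = parse_event(line)
    except ValueError as err:
        return LoaderParsingError(line, str(err))
    try:
        return to_snowflake_row(event)
    except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
        return SnowflakeError(line, str(err))


def run(lines):
    """Process every line without halting; bad rows are collected separately."""
    good, bad = [], []
    for line in lines:
        result = transform(line)
        if isinstance(result, (LoaderParsingError, SnowflakeError)):
            bad.append(result)
        else:
            good.append(result)
    return good, bad
```

Calling `run` on a mix of valid and invalid lines returns both lists, so one malformed event no longer stops the whole batch.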
To start using Snowflake Loader with the new bad row support, you have to specify in the config file where you want to store the bad rows. You can find detailed information about this in the upgrading part.
2. Upgrading
To make use of the new versions of the Snowflake Transformer and Loader, you will need to update your Dataflow Runner configurations to use the following jar files:
Due to bad row support, an AWS S3 URL for bad rows needs to be specified in the config file. Also, you need to update the schema version of the self-describing Snowflake config JSON to 1-0-2. In the end, your schema should look like:
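As a rough aid, both upgrade requirements can be checked before deploying with a short Python sketch like the one below. It is not part of the Snowflake Loader. It assumes the config is a self-describing JSON with top-level `schema` and `data` keys, and the name of the bad rows setting (`badOutputUrl` here) is an assumption that should be verified against the release notes.

```python
import json


def check_config(config_json: str, bad_rows_key: str = "badOutputUrl") -> list:
    """Return a list of problems found; an empty list means both checks passed.

    `bad_rows_key` is an assumed setting name, not confirmed by this post.
    """
    config = json.loads(config_json)
    problems = []

    # An Iglu schema URI ends with the schema version, e.g. ".../jsonschema/1-0-2".
    version = config.get("schema", "").rsplit("/", 1)[-1]
    if version != "1-0-2":
        problems.append(f"schema version is {version!r}, expected '1-0-2'")

    # The bad rows location must be an S3 URL.
    url = config.get("data", {}).get(bad_rows_key, "")
    if not url.startswith("s3://"):
        problems.append(f"{bad_rows_key} is missing or not an S3 URL")

    return problems


if __name__ == "__main__":
    # Hypothetical vendor and bucket, for illustration only.
    example = json.dumps({
        "schema": "iglu:com.example/snowflake_config/jsonschema/1-0-2",
        "data": {"badOutputUrl": "s3://example-bucket/bad-rows/"},
    })
    print(check_config(example))
```

The function returns the problems as a list instead of raising on the first one, so a single run reports everything that still needs to be changed.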
3. Getting help
For more details on this release, check out the 0.5.0 release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.