Snowplow 81 Kangaroo Island Emu released
We are happy to announce the release of Snowplow 81 Kangaroo Island Emu! At the heart of this release is the Hadoop Event Recovery project, which allows you to fix up Snowplow bad rows and make them ready for reprocessing.
1. Hadoop Event Recovery
Snowplow was the first event data pipeline to let you discover and investigate your invalid events - now we are the first pipeline to let you actively fix those bad events!
While this is a powerful tool, using it can be quite involved. To go along with this release, we have published a tutorial on Discourse, "Using Hadoop Event Recovery to recover events with a missing schema". This tutorial walks you through one common use case for event recovery: where some of your events failed validation because you forgot to upload a particular schema.
You can also check out the wiki documentation for Hadoop Event Recovery.
2. Stream Enrich race condition
Our Scala Common Enrich library uses the Base64 class from Apache Commons Codec. Version 1.5 of this library wasn’t thread-safe. This didn’t matter when running the batch pipeline, since each worker node only uses one thread to process events. But in Stream Enrich it caused a race condition where multiple threads could simultaneously access the same Base64 object, sometimes resulting in erroneous Base64 decoding.
This issue particularly affected high-volume users running Stream Enrich on servers with 4+ vCPUs.
If this issue is affecting you, you’ll see bad rows, potentially many of them, whose error messages report corrupt-looking JSON, even though the JSON you get by Base64-decoding the bad row’s original line is valid.
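One way to confirm this symptom is to take a Base64-encoded value from a suspect bad row and check independently whether it decodes to valid JSON. The following is a minimal Python sketch, not part of Snowplow itself: the function name `decodes_to_valid_json` is hypothetical, and it assumes you have already extracted the Base64-encoded field from the bad row's original line (how you do that depends on your collector format and is not shown here).

```python
import base64
import binascii
import json


def decodes_to_valid_json(encoded: str) -> bool:
    """Return True if `encoded` is Base64 text that wraps a valid JSON document.

    Hypothetical helper for illustration only. Accepts both the standard and
    the URL-safe Base64 alphabets, with or without trailing padding.
    """
    # Map the URL-safe alphabet onto the standard one, then restore any
    # padding that was stripped, so a single decoder handles both variants.
    normalised = encoded.strip().replace("-", "+").replace("_", "/")
    normalised += "=" * (-len(normalised) % 4)
    try:
        raw = base64.b64decode(normalised, validate=True)
        json.loads(raw.decode("utf-8"))
    except (binascii.Error, UnicodeDecodeError, json.JSONDecodeError):
        # Not valid Base64, not valid UTF-8, or not valid JSON.
        return False
    return True
```

If a value from a bad row passes this check while the bad row's error message complains about corrupt JSON, that is consistent with the race condition described above rather than with a genuinely malformed event.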
In this release we have therefore upgraded our Stream Enrich component to use version 1.10 of the affected library, which makes the class thread-safe. The upgrade is not critical for the Hadoop pipeline, where each worker processes events on a single thread, but it will come to that pipeline in a future release.
3. New schemas
We have added JSON Paths files and Redshift DDLs for the following schemas:
4. Upgrading
The Kinesis apps for R81 Kangaroo Island Emu are all available in a single zip file here:
Only the Stream Enrich app has actually changed. The change is not breaking, so you don’t have to make any changes to your configuration file. To upgrade Stream Enrich:
- Install the new Stream Enrich app on each server in your Stream Enrich auto-scaling group
- Update your supervisor process to point to the new Stream Enrich app
- Restart the supervisor process on each server running Stream Enrich
5. Getting help
For more details on this release, please check out the release notes on GitHub.
The wiki has full information on how to use Hadoop Event Recovery.