In R102 Afontova Gora we presented a new Stream Enrich mode for EmrEtlRunner, evolving the Snowplow Lambda architecture towards something more performant and cost-effective.
Unfortunately, several critical bugs were introduced in the recovery process of pipelines with Stream Enrich mode enabled; these issues combined can lead to folders becoming “stalled” in
enriched.good or archived without proper shredding and loading (though no data should be lost).
In Stream Enrich mode, EmrEtlRunner has a new skippable step, staging_stream_enrich, which replaces both the staging and enrich steps from the classic Batch Enrich mode.
The problem is that EmrEtlRunner R102 running in Stream Enrich mode still accepted the inappropriate staging and enrich steps as valid skip values. Recovery scripts that had not been updated to skip staging_stream_enrich instead would therefore still run the staging_stream_enrich step, which would incorrectly stage new enriched data into enriched.good, where it was never processed.
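To make the failure mode concrete, here is a minimal Python sketch of one way a runner could reject skip values that belong to the other mode. It is illustrative only: EmrEtlRunner's own validation is not shown in this post, and the function name and the two step sets below are assumptions based on the step names mentioned above.

```python
# Hypothetical step sets, taken from the step names mentioned in this post.
BATCH_ONLY_STEPS = {"staging", "enrich"}
STREAM_ONLY_STEPS = {"staging_stream_enrich"}


def invalid_skip_values(skip_values, stream_enrich_mode):
    """Return, sorted, the skip values that do not belong to the active mode."""
    wrong_mode_steps = BATCH_ONLY_STEPS if stream_enrich_mode else STREAM_ONLY_STEPS
    return sorted(set(skip_values) & wrong_mode_steps)


# A recovery script written for Batch Enrich mode, run against a Stream Enrich pipeline:
print(invalid_skip_values(["staging", "enrich"], stream_enrich_mode=True))
# prints ['enrich', 'staging']
```

A check along these lines would fail fast instead of silently running staging_stream_enrich during a recovery.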
Another related bug was EmrEtlRunner returning a false negative for the “ongoing run” check when enriched event folders had stalled in
These issues have been addressed in R104 Stoplesteinan.
The bugs described above affect only Stream Enrich mode and do not cause issues in classic Batch Enrich mode. A pipeline running in Stream Enrich mode was likely affected by these bugs if a recovery attempt was made with R102:
enrich - you should check enriched.good for leftover folders
rdb_load - you should check if Redshift is missing any data from folders present in
If you find that you are missing data in Redshift and in
shredded.archive, then first upgrade to R104.
To recover the data, you can simply restage data from the run folders to the
enriched.stream folder, to be staged and processed during your next launch.
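As a rough illustration of the restaging step, the Python sketch below only plans the copies and touches no storage. The key layout, the example prefixes and the choice to prefix each file with its run folder name are all assumptions made for this example, not something EmrEtlRunner prescribes, so adapt them to your own bucket structure.

```python
import posixpath


def plan_restage(keys, stream_prefix):
    """Return (source, destination) pairs that move run-folder files under stream_prefix."""
    plan = []
    for key in keys:
        run_folder, file_name = posixpath.split(key)
        run_name = posixpath.basename(run_folder)
        # Prefix the file name with its run folder so files from different runs cannot collide.
        destination = posixpath.join(stream_prefix, f"{run_name}-{file_name}")
        plan.append((key, destination))
    return plan


# Hypothetical keys; review the plan before copying anything.
for source, destination in plan_restage(
    ["enriched/good/run=2018-04-01-00-00-00/part-00000"], "enriched/stream"
):
    print(source, "->", destination)
```

Reviewing the plan before running any copy makes it easier to confirm that only the stalled run folders are being restaged.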
The latest version of EmrEtlRunner is available from our Bintray here.
There are no configuration-level changes in this release.
When you upgrade, make sure to update any recovery scripts you have which previously featured
--skip staging,enrich and change them to either
--resume-from shred or
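If you keep many recovery scripts, a small helper can locate and rewrite the outdated option. The Python sketch below uses the --resume-from shred replacement named above; the function name is hypothetical, and the command line in the example (including the config and resolver paths) is a placeholder rather than a recommended invocation.

```python
import re

# Matches the outdated option "--skip staging,enrich" (either order), but not longer skip lists.
OUTDATED_SKIP = re.compile(r"--skip\s+(?:staging,enrich|enrich,staging)(?![\w,])")


def update_recovery_command(command):
    """Replace the outdated skip option with --resume-from shred."""
    return OUTDATED_SKIP.sub("--resume-from shred", command)


old = "./snowplow-emr-etl-runner run --config config.yml --resolver resolver.json --skip staging,enrich"
print(update_recovery_command(old))
```

Commands that skip additional steps are deliberately left untouched, since they need a manual decision about where to resume.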
Upcoming Snowplow releases will include:
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.