When we ported the RDB Loader across from the StorageLoader in R90, we made it skip loading the shredded data (self-describing events and contexts) whenever the shred step was skipped.
This was a mistake (#3403): if you needed to resume your pipeline because of, for example, a Redshift problem, the atomic.events table would be loaded but the shredded types (events and contexts) would not.
We have a comprehensive guide to this problem on Discourse, in case you have been affected by it.
This bug has been corrected in R92.
Prior to R92, the archive_enriched step covered archiving both the enriched events and the shredded ones. This was confusing and also difficult to work with:
If you skipped shred but did not skip archive_enriched, then the S3DistCp step trying to archive the shredded events would fail because there would be no shredded events.
If you skipped archive_enriched while also skipping shred, the enriched events would be left in place, which would prevent the next EmrEtlRunner run from starting due to an enriched:good bucket not empty no-op, as described below.
As a result, a standalone archive_shredded step has been introduced, which is skippable as usual through the --skip EmrEtlRunner option.
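To make the difference concrete, here is a minimal Python sketch of how the set of skipped steps maps to the archive copies that get attempted, before and after the split. It is only an illustrative model: the function name plan_archive and the r92 flag are hypothetical, and none of this is EmrEtlRunner's actual code.

```python
def plan_archive(skip, r92=True):
    """Return the archive copies that would be attempted.

    `skip` is a set of skipped step names. This is a toy model of the
    behavior described above, not EmrEtlRunner's implementation.
    """
    copies = []
    if r92:
        # From R92: two independent, individually skippable steps.
        if "archive_enriched" not in skip:
            copies.append("enriched")
        if "archive_shredded" not in skip:
            copies.append("shredded")
    else:
        # Before R92: archive_enriched covered both copies at once.
        if "archive_enriched" not in skip:
            copies.extend(["enriched", "shredded"])
    return copies


# Before R92, skipping only shred still attempted the shredded copy,
# which would fail because there were no shredded events to archive.
print(plan_archive({"shred"}, r92=False))

# From R92, the shredded archive can be skipped on its own, so the
# enriched events are still archived and the next run is not blocked.
print(plan_archive({"shred", "archive_shredded"}))
```

In this model, the pre-R92 choice was all or nothing, which is what produced the two failure cases listed above.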
When running EmrEtlRunner, there are a few situations that will prevent it from launching an EMR cluster:
We refer to those situations as “no-ops” (for no operations to perform).
The locking mechanism introduced in R91 suffered from a bug (#3396): it failed to release the lock in cases of a no-op. This has been fixed in R92.
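The class of bug is easy to reproduce in miniature. The Python sketch below shows one common way to guarantee that a lock is released on every exit path, including an early no-op return. The ToyLock class and the run_with_lock function are hypothetical stand-ins; the source does not say how the fix was implemented in EmrEtlRunner, so treat this as an illustration of the pattern only.

```python
class ToyLock:
    """Hypothetical in-memory stand-in for the run lock."""

    def __init__(self):
        self.held = False

    def acquire(self):
        if self.held:
            raise RuntimeError("lock already held")
        self.held = True

    def release(self):
        self.held = False


def run_with_lock(lock, has_work, launch):
    """Acquire the lock, run if there is work, and always release."""
    lock.acquire()
    try:
        if not has_work():
            # No-op: nothing to do, so return early. Without the
            # finally clause below, this path would leak the lock.
            return "no-op"
        launch()
        return "ran"
    finally:
        lock.release()


lock = ToyLock()
result = run_with_lock(lock, has_work=lambda: False, launch=lambda: None)
print(result, lock.held)
```

Placing the release in a finally clause means the no-op path cannot leave a stale lock behind to block the next run.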
The logs produced by RDB Loader are stored in S3 and downloaded by EmrEtlRunner to be displayed as log messages. This release improves on this process with the following measures:
Following the release of the RDB Loader v0.13.0, we have now removed the RDB Shredder and RDB Loader components from the Snowplow “mono-repo”. This represents an important milestone in us decoupling database-specific loader applications from the core Snowplow release process.
The latest version of EmrEtlRunner is available from our Bintray.
In order to use the recently released RDB Loader, remember to make the following update to your configuration YAML:
Upcoming Snowplow releases include:
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.