We are pleased to announce the release of Snowplow 92 Maiden Castle.
This release is a direct follow-up to Snowplow 91 Stonehenge, incorporating various improvements from seeing R90 and R91 operate in the wild. In particular, this release fixes some important gotchas in EmrEtlRunner’s --skip behavior, as well as a bug in the handling of run locks.
If you’d like to know more about R92 Maiden Castle, named after [the Iron Age hill fort in England][maiden-castle], please read on:
- Fixing the skip shred bug
- The new archive_shredded step
- Fixing the run lock bug
- Better RDB Loader logs management
- Removal of RDB Shredder and Loader
- Upgrading
- Roadmap
- Getting help
1. Fixing the skip shred bug
When we ported the RDB Loader across from the StorageLoader in R90, we made it skip loading the shredded data (self-describing events and contexts) whenever the shred step was skipped.
This was a mistake (#3403): if you needed to resume your pipeline because of, for example, a Redshift problem, the atomic.events table would be loaded but the shredded types (events and contexts) would not.
We have a comprehensive guide to this problem on Discourse, in case you have been affected by it.
This bug has been corrected in R92.
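To make the difference concrete, here is a minimal Python sketch of the decision described above. It is a model for illustration only, not the EmrEtlRunner or RDB Loader code: the function name `data_to_load` and the `r92_fix` flag are hypothetical.

```python
def data_to_load(skipped_steps, r92_fix=True):
    """Return which data sets a load would cover (illustrative model only)."""
    loaded = ["atomic.events"]
    # Pre-R92 behavior: the shredded types were left out of the load whenever
    # the shred step was skipped. With the fix they are loaded regardless.
    if r92_fix or "shred" not in skipped_steps:
        loaded.append("shredded types")
    return loaded


# Resuming a run in which shredding already happened, so shred is skipped.
print(data_to_load({"shred"}, r92_fix=False))  # ['atomic.events']
print(data_to_load({"shred"}, r92_fix=True))   # ['atomic.events', 'shredded types']
```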
2. The new archive_shredded step
Prior to R92, the archive_enriched step archived both the enriched events and the shredded ones. This was confusing and also difficult to work with:
- If you skipped shred but did not skip archive_enriched, then the S3DistCp step trying to archive the shredded events would fail because there would be no shredded events.
- Conversely, if you skipped archive_enriched while also skipping shred, the enriched events would be left in place, which would prevent the next EmrEtlRunner run from starting due to an "enriched:good bucket not empty" no-op, as described below.
As a result, a standalone archive_shredded step has been introduced, which is skippable as usual through the --skip EmrEtlRunner option.
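The effect of splitting the two steps can be sketched in a few lines of Python. This is an illustrative model under our own assumptions, not EmrEtlRunner code: the `ARCHIVE_STEPS` mapping and the `planned_archives` helper are hypothetical names.

```python
# Each archive step now covers exactly one kind of data.
ARCHIVE_STEPS = {
    "archive_enriched": "enriched events",
    "archive_shredded": "shredded events",
}


def planned_archives(skipped_steps):
    """Return what would be archived, given the set of skipped step names."""
    return [data for step, data in ARCHIVE_STEPS.items() if step not in skipped_steps]


# Skipping shred and archive_shredded together still archives the enriched
# events, so nothing is left behind in enriched:good for the next run.
print(planned_archives({"shred", "archive_shredded"}))  # ['enriched events']
print(planned_archives(set()))  # ['enriched events', 'shredded events']
```

Because each step maps to a single data set, skipping one no longer affects the other.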
3. Fixing the run lock bug
When running EmrEtlRunner, there are a few situations that will prevent it from launching an EMR cluster:
- There are no log files in the in buckets
- There are files present in the enriched:good bucket
- There are files present in the shredded:good bucket
We refer to those situations as “no-ops” (for no operations to perform).
The locking mechanism introduced in R91 suffered from a bug (#3396): it failed to release the lock in cases of a no-op. This has been fixed in R92.
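The general shape of such a fix can be sketched in Python as follows. This is a toy model, not the actual EmrEtlRunner implementation: `RunLock`, `NoOp`, `check_for_no_op` and `run` are hypothetical names, and the lock is a simple in-memory flag. The point it illustrates is that releasing the lock in a `finally` block covers the no-op path as well as the normal one.

```python
class NoOp(Exception):
    """Signals that there is nothing for this run to do."""


class RunLock:
    """Toy in-memory lock; a real lock would live outside the process."""

    def __init__(self):
        self.held = False

    def acquire(self):
        if self.held:
            raise RuntimeError("another run holds the lock")
        self.held = True

    def release(self):
        self.held = False


def check_for_no_op(in_files, enriched_good_files, shredded_good_files):
    """Raise NoOp in any of the three situations listed above."""
    if not in_files:
        raise NoOp("no log files in the in buckets")
    if enriched_good_files:
        raise NoOp("files present in the enriched:good bucket")
    if shredded_good_files:
        raise NoOp("files present in the shredded:good bucket")


def run(lock, in_files, enriched_good_files, shredded_good_files):
    lock.acquire()
    try:
        check_for_no_op(in_files, enriched_good_files, shredded_good_files)
        return "launch EMR cluster"
    except NoOp as reason:
        return f"no-op: {reason}"
    finally:
        # Runs on every path, so a no-op can no longer leave the lock held.
        lock.release()


lock = RunLock()
print(run(lock, in_files=[], enriched_good_files=[], shredded_good_files=[]))
print(lock.held)  # False: the lock was released despite the no-op
```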
4. Better RDB Loader logs management
The logs produced by RDB Loader are stored in S3 and downloaded by EmrEtlRunner to be displayed as log messages. This release improves on this process with the following measures:
- An attempt to retrieve those logs will happen even if the RDB Loader EMR step is cancelled
- These log messages will be output using an appropriate log level, according to the state of the RDB Loader EMR step (i.e. error if failed, warning if cancelled, info if successful)
- After they have been displayed they will be removed from the box running EmrEtlRunner
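The last two measures can be illustrated with a short Python sketch. It is not how EmrEtlRunner is written: `display_and_remove` and `LEVEL_BY_STEP_STATE` are hypothetical names, the state strings are assumed to follow the EMR step states FAILED, CANCELLED and COMPLETED, and the log line is a placeholder rather than real RDB Loader output.

```python
import logging
import os
import tempfile

# Assumed step state names; the text only says failed, cancelled and successful.
LEVEL_BY_STEP_STATE = {
    "FAILED": logging.ERROR,
    "CANCELLED": logging.WARNING,
    "COMPLETED": logging.INFO,
}


def display_and_remove(log_path, step_state, logger):
    """Log each line at the level matching the step state, then delete the file."""
    level = LEVEL_BY_STEP_STATE[step_state]
    with open(log_path) as log_file:
        for line in log_file:
            logger.log(level, line.rstrip("\n"))
    # Clean up the local copy once it has been displayed.
    os.remove(log_path)
    return level


logging.basicConfig(level=logging.INFO)
# Stand-in for a log file that has already been downloaded from S3.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("placeholder log line\n")
display_and_remove(tmp.name, "FAILED", logging.getLogger("rdb_loader"))
print(os.path.exists(tmp.name))  # False: the local copy has been removed
```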
5. Removal of RDB Shredder and Loader
Following the release of the RDB Loader v0.13.0, we have now removed the RDB Shredder and RDB Loader components from the Snowplow “mono-repo”. This represents an important milestone in us decoupling database-specific loader applications from the core Snowplow release process.
6. Upgrading
The latest version of EmrEtlRunner is available from our Bintray.
To use the recently released RDB Loader, remember to make the following update to your configuration YAML:
    storage:
      versions:
        rdb_loader: 0.13.0 # Was 0.12.0
7. Roadmap
Upcoming Snowplow releases include:
- R93 [STR] Virunum, a general upgrade of the apps constituting our stream processing pipeline
- [R94 [BAT] ZSTD support][r94], enhancing our Redshift event storage with the ZSTD encoding
- R9x [STR] Priority fixes, removing the potential for data loss in the stream processing pipeline
- R9x [BAT] 4 webhooks, which will add support for 4 new webhooks (Mailgun, Olark, Unbounce, StatusGator)
8. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.