Snowplow 92 Maiden Castle released

11 September 2017  •  Ben Fradet

We are pleased to announce the release of Snowplow 92 Maiden Castle.

This release is a direct follow-up to Snowplow 91 Stonehenge, incorporating various improvements based on seeing R90 and R91 operate in the wild. In particular, this release fixes some important gotchas in EmrEtlRunner’s --skip behavior, as well as a bug in the handling of run locks.

If you’d like to know more about R92 Maiden Castle, named after the Iron Age hill fort in England, please read on:

  1. Fixing the skip shred bug
  2. The new archive_shredded step
  3. Fixing the run lock bug
  4. Better RDB Loader logs management
  5. Removal of RDB Shredder and Loader
  6. Upgrading
  7. Roadmap
  8. Getting help

1. Fixing the skip shred bug

When we ported across the RDB Loader from the StorageLoader in R90, we implemented a behavior of skipping the loading of the shredded data (self-describing events and contexts) if the shred step was skipped.

This was a mistake (#3403): it meant that if you needed to resume your pipeline after, for example, a Redshift problem, the atomic.events table would be loaded but the shredded types (events and contexts) would not.

We have a comprehensive guide to this problem on Discourse, in case you have been affected by it.

This bug has been corrected in R92.

2. The new archive_shredded step

Prior to R92, the archive_enriched step archived both the enriched events and the shredded ones. This was confusing and also difficult to work with:

  1. If you skipped shred but did not skip archive_enriched, then the S3DistCp step trying to archive the shredded events would fail because there would be no shredded events.
  2. Conversely, if you skipped archive_enriched while also skipping shred, the enriched events would be left in place, which would prevent the next EmrEtlRunner run from starting due to an "enriched:good bucket not empty" no-op, as described below.

As a result, a standalone archive_shredded step has been introduced, which can be skipped as usual through EmrEtlRunner’s --skip option.
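To illustrate how a standalone step makes skipping more predictable, here is a minimal Python sketch. It is not EmrEtlRunner's actual code: the `steps_to_run` helper is hypothetical, and the step list is illustrative and does not necessarily match the full set of steps EmrEtlRunner supports.

```python
# Illustrative step list. The post names shred, archive_enriched and
# archive_shredded; the other entries are assumptions for this sketch.
ALL_STEPS = ["enrich", "shred", "rdb_load", "archive_enriched", "archive_shredded"]


def steps_to_run(skip):
    """Return the steps left to run after applying a --skip style list."""
    unknown = set(skip) - set(ALL_STEPS)
    if unknown:
        raise ValueError(f"Unknown steps: {sorted(unknown)}")
    return [step for step in ALL_STEPS if step not in skip]


# Skipping shred and archive_shredded together still archives the
# enriched events, so the next run is not blocked by leftover files.
print(steps_to_run(["shred", "archive_shredded"]))
```

Because each archive step is its own entry, skipping shred no longer forces a choice between a failing archive of shredded events and leaving the enriched events in place.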

3. Fixing the run lock bug

When running EmrEtlRunner, there are a few situations that will prevent it from launching an EMR cluster:

  • There are no log files in the in buckets
  • There are files present in the enriched:good bucket
  • There are files present in the shredded:good bucket

We refer to those situations as “no-ops” (for no operations to perform).

The locking mechanism introduced in R91 suffered from a bug (#3396): it failed to release the lock in cases of a no-op. This has been fixed in R92.
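The general shape of such a fix can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual EmrEtlRunner implementation: the in-memory `locks` set stands in for whatever lock store is really used, and `NoOpError`, `run_lock`, `check_no_op` and `run` are hypothetical names.

```python
from contextlib import contextmanager


class NoOpError(Exception):
    """Raised when there is nothing for the run to do."""


@contextmanager
def run_lock(locks):
    """Hold a hypothetical in-memory lock and release it however the run ends."""
    if "run" in locks:
        raise RuntimeError("another run holds the lock")
    locks.add("run")
    try:
        yield
    finally:
        # Releasing here covers success, failure and no-op alike.
        locks.discard("run")


def check_no_op(raw_in_count, enriched_good_count, shredded_good_count):
    """Raise NoOpError for any of the three situations listed above."""
    if raw_in_count == 0:
        raise NoOpError("no log files in the in buckets")
    if enriched_good_count > 0:
        raise NoOpError("enriched:good bucket is not empty")
    if shredded_good_count > 0:
        raise NoOpError("shredded:good bucket is not empty")


def run(locks, raw_in_count, enriched_good_count, shredded_good_count):
    """Acquire the lock, check for a no-op, then pretend to launch a cluster."""
    try:
        with run_lock(locks):
            check_no_op(raw_in_count, enriched_good_count, shredded_good_count)
            return "launched"
    except NoOpError as error:
        return f"no-op: {error}"
```

The key point is that the release sits in a `finally` clause, so an early exit caused by a no-op cannot leave a stale lock behind.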

4. Better RDB Loader logs management

The logs produced by RDB Loader are stored in S3 and downloaded by EmrEtlRunner to be displayed as log messages. This release improves on this process with the following measures:

  • An attempt to retrieve those logs will happen even if the RDB Loader EMR step is cancelled
  • These log messages will be output using an appropriate log level, according to the state of the RDB Loader EMR step (i.e. error if failed, warning if cancelled, info if successful)
  • After they have been displayed they will be removed from the box running EmrEtlRunner
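As a rough illustration of the second and third measures, the Python sketch below maps a step state to a log level and deletes the local copy once it has been displayed. The state labels, the `LEVEL_BY_STATE` table and the `display_and_remove` helper are assumptions made for this example, not EmrEtlRunner's real code.

```python
import logging
from pathlib import Path

# Log level per step state, following the list above.
# The state strings themselves are assumed labels.
LEVEL_BY_STATE = {
    "FAILED": logging.ERROR,
    "CANCELLED": logging.WARNING,
    "COMPLETED": logging.INFO,
}


def display_and_remove(logger, step_state, log_path):
    """Log each line of a downloaded log file, then delete the local copy."""
    path = Path(log_path)
    level = LEVEL_BY_STATE.get(step_state, logging.INFO)
    try:
        for line in path.read_text().splitlines():
            logger.log(level, line)
    finally:
        # Clean up the local copy even if reading or logging fails.
        path.unlink(missing_ok=True)
    return level
```

For example, calling `display_and_remove(logging.getLogger("rdb_loader"), "CANCELLED", "/tmp/rdb-loader.log")` (with a hypothetical path) would emit each line as a warning and then remove the file.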

5. Removal of RDB Shredder and Loader

Following the release of RDB Loader v0.13.0, we have now removed the RDB Shredder and RDB Loader components from the Snowplow “mono-repo”. This is an important milestone in decoupling database-specific loader applications from the core Snowplow release process.

6. Upgrading

The latest version of EmrEtlRunner is available from our Bintray.

To use the recently released RDB Loader, remember to make the following update to your configuration YAML:

storage:
  versions:
    rdb_loader: 0.13.0        # Was 0.12.0

7. Roadmap

Upcoming Snowplow releases include:

  • R93 [STR] Virunum, a general upgrade of the apps constituting our stream processing pipeline
  • R94 [BAT] ZSTD support, enhancing our Redshift event storage with the ZSTD encoding
  • R9x [STR] Priority fixes, removing the potential for data loss in the stream processing pipeline
  • R9x [BAT] 4 webhooks, which will add support for 4 new webhooks (Mailgun, Olark, Unbounce, StatusGator)

8. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please visit our Discourse forum.