We are pleased to announce the release of Snowplow 91 Stonehenge.
This release revolves around making EmrEtlRunner, the component launching the EMR steps for the batch pipeline, significantly more robust. Most notably, this release fixes a long-standing bug in the way the staging step was performed, which affected all users of the Clojure Collector (issue #3085).
This release also lays important groundwork for our planned migration away from EmrEtlRunner towards separate snowplowctl and Dataflow Runner tools, per our RFC.
If you’d like to know more about R91 Stonehenge, named after [the prehistoric monument in England][stonehenge], please read on:
- Moving the staging step to S3DistCp
- New EmrEtlRunner commands
- New EmrEtlRunner CLI options
1. Moving the staging step to S3DistCp
This release overhauls the initial staging step which moves data from the raw input bucket(s) in S3 to a processing bucket for further processing.
Part of the staging operation for Clojure Collector logs involved transforming timestamps from a format to another. Unfortunately, a race condition in JRuby’s multi-threaded environment could result in files overwriting each other, leading to data loss. Many thanks to community member Victor Ceron for isolating this bug.
In our experiments, the loss rate approached 1% of all Clojure Collector raw event files.
The fix introduced in this release (issue #276) delegates the staging step to S3DistCp and doesn’t do any renaming.
1.2 Illustration of problem and fix
To illustrate the previous point, let’s take the example of having a couple of Clojure instances and log files according to the following folder structure in your input bucket:
Previously, once the staging step took place, these files would end up as shown below:
This renaming is where the occasional overwrite could occur.
From now on, the S3DistCp file moves will leave the filenames as-is, except for the leading underscores which will be removed:
1.3 Post mortem
We take data quality and the potential for data loss extremely seriously at Snowplow.
With any fast-evolving project like Snowplow, new issues and regressions will be introduced, and old underlying issues (such as this one) can be uncovered - but the important thing is how we handle these issues. In this case, we should have triaged and scheduled a fix for Victor’s issue much sooner. Our sincere apologies for this.
Going forwards, we are putting pipeline quality and safety issues front and center with dedicated labels for data-loss, data-quality and security. Please make sure to flag any such issues to us, and we will work to always prioritise these issues in our release roadmap (see below for details).
2. New EmrEtlRunner commands
We’ve decided to refactor EmrEtlRunner into different subcommands in order to improve its modularity. The following subsections detail those new commands.
2.1 Run command
The previous EmrEtlRunner behavior has been incorporated into a
run command. Because of this, an old EmrEtlRunner launch which looked like this:
will now look like the following:
2.2 Lint command
You can now lint your resolver file as well as your enrichments thanks to a
Those commands will check that the provided files are valid with respect to their schemas.
There are plans to support linting for storage targets in issue #3364.
2.3 Backend for a generate command
This release also introduces a backend for a
generate command which will be able to generate the necessary Dataflow Runner configuration files.
This command will be formally introduced in a subsequent release when we start to smoothly transition away from EmrEtlRunner, read our RFC on splitting EmrEtlRunner for more background.
3. New and retired EmrEtlRunner CLI options
This release also introduces and retires a few options to the
In order to prevent overlapping job runs, this release introduces a locking mechanism. This translates into a
--lock flag to the
run command. When specifying this flag, a lock will be acquired at the start of the job and released upon its successful completion.
This is much more robust than our previous approach, which involved checking folder contents in S3 to attempt to determine if a previous run was ongoing (or had failed partway through).
There are two strategies for storing the lock: local and distributed.
3.1.1 Local lock
You can leverage a local lock when launching EmrEtlRunner with:
This prevents anyone on this machine from launching another run of EmrEtlRunner with
path/to/lock as lock. The lock will be represented by a file on a disk at the specifed path.
3.1.2 Distributed lock
Anoter strategy is to leverage Consul to enforce a distributed lock:
That way, anyone using
path/to/lock as lock and this Consul server will have to respect the lock.
In this case, the lock will be materialized by a key-value pair in Consul, the key being at the specified path.
3.2 Resume from step
This release introduces a
--resume-from flag to be able to resume the EMR job from a particular step, it’s particularly useful when recovering from a failed step.
The steps you can resume from are, in order:
For example if your Redshift load failed due to a maintenance window, you might want to relaunch EmrEtlRunner with:
3.3 Removal of the start and end flags
--end flags, which used to allow you to process files for a specific time period have been removed. This is because the new S3DistCp-based staging process doesn’t inspect the timestamps in the filenames.
3.4 Removal of the process-enrich and process-shred flags
--process-shred options, which let you run only the
enrich step, and
shred step respectively, have also been retired, to simplify the EmrEtlRunner.
The latest version of EmrEtlRunner is available from our Bintray.
Upgrading is straightforward:
- Make use of the new
runcommand when launching EmrEtlRunner
- Set up and configure one of the two locking options (see above)
We strongly recommend setting up and configuring one of the two locking options. This is the most secure way of preventing a race condition, whereby a second scheduled EmrEtlRunner run starts while the last run is still partway through.
Upcoming Snowplow releases include:
- R92 [STR] Virunum, a general upgrade of the apps constituting our stream processing pipeline
- R9x [BAT] Priority fixes and ZSTD support, working on data quality and security issues and enhancing our Redshift event storage with the ZSTD encoding
- R9x [STR] Priority fixes, removing the potential for data loss in the stream processing pipeline
- R9x [BAT] 4 webhooks, which will add support for 4 new webhooks (Mailgun, Olark, Unbounce, StatusGator)
6. Getting help
For more details on this release, please check out the release notes on Github.
If you have any questions or run into any problems, please visit our Discourse forum.