Snowplow 91 Stonehenge released with important bug fix

17 August 2017  •  Ben Fradet

We are pleased to announce the release of Snowplow 91 Stonehenge.

This release revolves around making EmrEtlRunner, the component launching the EMR steps for the batch pipeline, significantly more robust. Most notably, this release fixes a long-standing bug in the way the staging step was performed, which affected all users of the Clojure Collector (issue #3085).

This release also lays important groundwork for our planned migration away from EmrEtlRunner towards separate snowplowctl and Dataflow Runner tools, per our RFC.

If you’d like to know more about R91 Stonehenge, named after [the prehistoric monument in England][stonehenge], please read on:

  1. Moving the staging step to S3DistCp
  2. New EmrEtlRunner commands
  3. New EmrEtlRunner CLI options
  4. Upgrading
  5. Roadmap
  6. Help

1. Moving the staging step to S3DistCp

1.1 Overview

This release overhauls the initial staging step which moves data from the raw input bucket(s) in S3 to a processing bucket for further processing.

Before this release, this step was run on the EmrEtlRunner host machine, using our Sluice library for S3 operations. From this release, this step is run as an EMR step using S3DistCp.

Part of the staging operation for Clojure Collector logs involved transforming timestamps from a format to another. Unfortunately, a race condition in JRuby’s multi-threaded environment could result in files overwriting each other, leading to data loss. Many thanks to community member Victor Ceron for isolating this bug.

In our experiments, the loss rate approached 1% of all Clojure Collector raw event files.

The fix introduced in this release (issue #276) delegates the staging step to S3DistCp and doesn’t do any renaming.

1.2 Illustration of problem and fix

To illustrate the previous point, let’s take the example of having a couple of Clojure instances and log files according to the following folder structure in your input bucket:

$ aws s3 ls --recursive s3://raw/in/
i-1/
  _var_log_tomcat8_rotated_localhost_access_log.txt1502463662.gz
i-2/
  _var_log_tomcat8_rotated_localhost_access_log.txt1502467262.gz

Previously, once the staging step took place, these files would end up as shown below:

$ aws s3 ls --recursive s3://raw/processing/
var_log_tomcat8_rotated_localhost_access_log.2017-08-11-15.eu-west-1.i-1.txt.raw.gz
var_log_tomcat8_rotated_localhost_access_log.2017-08-11-16.eu-west-1.i-2.txt.raw.gz

This renaming is where the occasional overwrite could occur.

From now on, the S3DistCp file moves will leave the filenames as-is, except for the leading underscores which will be removed:

$ aws s3 ls --recursive s3://raw/processing/
i-1/
  var_log_tomcat8_rotated_localhost_access_log.txt1502463662.gz
i-2/
  var_log_tomcat8_rotated_localhost_access_log.txt1502467262.gz

1.3 Post mortem

We take data quality and the potential for data loss extremely seriously at Snowplow.

With any fast-evolving project like Snowplow, new issues and regressions will be introduced, and old underlying issues (such as this one) can be uncovered - but the important thing is how we handle these issues. In this case, we should have triaged and scheduled a fix for Victor’s issue much sooner. Our sincere apologies for this.

Going forwards, we are putting pipeline quality and safety issues front and center with dedicated labels for data-loss, data-quality and security. Please make sure to flag any such issues to us, and we will work to always prioritise these issues in our release roadmap (see below for details).

2. New EmrEtlRunner commands

We’ve decided to refactor EmrEtlRunner into different subcommands in order to improve its modularity. The following subsections detail those new commands.

2.1 Run command

The previous EmrEtlRunner behavior has been incorporated into a run command. Because of this, an old EmrEtlRunner launch which looked like this:

./snowplow-emr-etl-runner -c config.yml -r resolver.json

will now look like the following:

./snowplow-emr-etl-runner run -c config.yml -r resolver.json

2.2 Lint command

You can now lint your resolver file as well as your enrichments thanks to a lint command:

./snowplow-emr-etl-runner lint resolver    -r resolver.json
./snowplow-emr-etl-runner lint enrichments -r resolver.json -n enrichments/directory/

Those commands will check that the provided files are valid with respect to their schemas.

There are plans to support linting for storage targets in issue #3364.

2.3 Backend for a generate command

This release also introduces a backend for a generate command which will be able to generate the necessary Dataflow Runner configuration files.

This command will be formally introduced in a subsequent release when we start to smoothly transition away from EmrEtlRunner, read our RFC on splitting EmrEtlRunner for more background.

3. New and retired EmrEtlRunner CLI options

This release also introduces and retires a few options to the run command.

3.1 Lock

In order to prevent overlapping job runs, this release introduces a locking mechanism. This translates into a --lock flag to the run command. When specifying this flag, a lock will be acquired at the start of the job and released upon its successful completion.

This is much more robust than our previous approach, which involved checking folder contents in S3 to attempt to determine if a previous run was ongoing (or had failed partway through).

There are two strategies for storing the lock: local and distributed.

3.1.1 Local lock

You can leverage a local lock when launching EmrEtlRunner with:

./snowplow-emr-etl-runner run \
  -c     config.yml \
  -r     resolver.json \
  --lock path/to/lock

This prevents anyone on this machine from launching another run of EmrEtlRunner with path/to/lock as lock. The lock will be represented by a file on a disk at the specifed path.

3.1.2 Distributed lock

Anoter strategy is to leverage Consul to enforce a distributed lock:

./snowplow-emr-etl-runner run \
  -c       config.yml \
  -r       resolver.json \
  --lock   path/to/lock \
  --consul http://127.0.0.1:8500

That way, anyone using path/to/lock as lock and this Consul server will have to respect the lock.

In this case, the lock will be materialized by a key-value pair in Consul, the key being at the specified path.

3.2 Resume from step

This release introduces a --resume-from flag to be able to resume the EMR job from a particular step, it’s particularly useful when recovering from a failed step.

The steps you can resume from are, in order:

  • enrich
  • shred
  • elasticsearch
  • archive_raw
  • rdb_load
  • analyze
  • archive_enriched

For example if your Redshift load failed due to a maintenance window, you might want to relaunch EmrEtlRunner with:

./snowplow-emr-etl-runner run \
  -c            config.yml \
  -r            resolver.json \
  --resume-from rdb_load

3.3 Removal of the start and end flags

The --start and --end flags, which used to allow you to process files for a specific time period have been removed. This is because the new S3DistCp-based staging process doesn’t inspect the timestamps in the filenames.

3.4 Removal of the process-enrich and process-shred flags

The --process-enrich and --process-shred options, which let you run only the enrich step, and shred step respectively, have also been retired, to simplify the EmrEtlRunner.

4. Upgrading

The latest version of EmrEtlRunner is available from our Bintray.

Upgrading is straightforward:

  1. Make use of the new run command when launching EmrEtlRunner
  2. Set up and configure one of the two locking options (see above)

We strongly recommend setting up and configuring one of the two locking options. This is the most secure way of preventing a race condition, whereby a second scheduled EmrEtlRunner run starts while the last run is still partway through.

5. Roadmap

Upcoming Snowplow releases include:

6. Getting help

For more details on this release, please check out the release notes on Github.

If you have any questions or run into any problems, please visit our Discourse forum.