This release overhauls the initial staging step, which moves data from the raw input bucket(s) in S3 to a dedicated processing bucket.
Before this release, this step was run on the EmrEtlRunner host machine, using our Sluice library for S3 operations. From this release, this step is run as an EMR step using S3DistCp.
Part of the staging operation for Clojure Collector logs involved transforming timestamps from one format to another. Unfortunately, a race condition in JRuby’s multi-threaded environment could result in files overwriting each other, leading to data loss. Many thanks to community member Victor Ceron for isolating this bug.
In our experiments, the loss rate approached 1% of all Clojure Collector raw event files.
The fix introduced in this release (issue #276) delegates the staging step to S3DistCp and doesn’t do any renaming.
To illustrate the previous point, let’s take the example of a couple of Clojure Collector instances whose log files are laid out in your input bucket according to the following folder structure:
Previously, once the staging step took place, these files would end up as shown below:
This renaming is where the occasional overwrite could occur.
From now on, the S3DistCp file moves will leave the filenames as-is, except for the leading underscores which will be removed:
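As a purely hypothetical illustration (the file name below is invented, not the actual Clojure Collector naming scheme), the move now amounts to nothing more than stripping the leading underscore:

```text
_var_log_tomcat8_access.txt.gz   ->   var_log_tomcat8_access.txt.gz
```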
We take data quality and the potential for data loss extremely seriously at Snowplow.
With any fast-evolving project like Snowplow, new issues and regressions will be introduced, and old underlying issues (such as this one) can be uncovered - but the important thing is how we handle these issues. In this case, we should have triaged and scheduled a fix for Victor’s issue much sooner. Our sincere apologies for this.
Going forwards, we are putting pipeline quality and safety issues front and center with dedicated labels for data-loss, data-quality and security. Please make sure to flag any such issues to us, and we will work to always prioritise these issues in our release roadmap (see below for details).
We’ve decided to refactor EmrEtlRunner into different subcommands in order to improve its modularity. The following subsections detail those new commands.
The previous EmrEtlRunner behavior has been incorporated into a run command. Because of this, an old EmrEtlRunner launch which looked like this:
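The original command listing isn’t reproduced here; as a sketch (config.yml and resolver.json are placeholder file names, and the exact option names may differ in your setup):

```bash
./snowplow-emr-etl-runner \
  -c config.yml \
  -r resolver.json
```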
will now look like the following:
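As a sketch (config.yml and resolver.json are placeholder file names):

```bash
./snowplow-emr-etl-runner run \
  -c config.yml \
  -r resolver.json
```

Note the new run subcommand preceding the usual options.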
You can now lint your resolver file as well as your enrichments thanks to a new lint command.
Those commands will check that the provided files are valid with respect to their schemas.
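As a sketch of the new subcommands (file and directory names are placeholders, and the exact argument syntax may differ):

```bash
# Lint the Iglu resolver file
./snowplow-emr-etl-runner lint resolver \
  -r resolver.json

# Lint a directory of enrichment configurations
./snowplow-emr-etl-runner lint enrichments \
  -r resolver.json \
  -n enrichments/
```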
There are plans to support linting for storage targets in issue #3364.
This release also introduces a backend for a generate command, which will be able to generate the necessary Dataflow Runner configuration files.
This command will be formally introduced in a subsequent release, when we start to smoothly transition away from EmrEtlRunner; for more background, read our RFC on splitting EmrEtlRunner.
This release also introduces and retires a few options on the run command.
In order to prevent overlapping job runs, this release introduces a locking mechanism. This translates into a --lock flag on the run command. When specifying this flag, a lock will be acquired at the start of the job and released upon its successful completion.
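Conceptually, the mechanism behaves like a lock file. Here is a minimal sketch of those semantics (illustrative only, not EmrEtlRunner’s actual implementation):

```bash
# Illustrative sketch of lock-file semantics; not EmrEtlRunner's code.
# The presence of a file at the lock path means a run is in progress.
LOCK="$(mktemp -d)/emr-etl-runner.lock"

acquire_lock() {
  if [ -e "$LOCK" ]; then
    echo "another run is in progress" >&2
    return 1
  fi
  touch "$LOCK"   # acquired at the start of the job
}

release_lock() {
  rm -f "$LOCK"   # released upon successful completion
}

acquire_lock && echo "lock acquired"
acquire_lock || echo "second run refused"
release_lock
```

Because the lock is only released on successful completion, a failed run leaves it in place, signalling that the previous run needs attention before relaunching.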
This is much more robust than our previous approach, which involved checking folder contents in S3 to attempt to determine if a previous run was ongoing (or had failed partway through).
There are two strategies for storing the lock: local and distributed.
You can leverage a local lock when launching EmrEtlRunner with:
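For instance (config.yml and resolver.json are placeholder file names):

```bash
./snowplow-emr-etl-runner run \
  -c config.yml \
  -r resolver.json \
  --lock path/to/lock
```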
This prevents anyone on this machine from launching another run of EmrEtlRunner with path/to/lock as lock. The lock will be represented by a file on disk at the specified path.
Another strategy is to leverage Consul to enforce a distributed lock:
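As a sketch (assuming the Consul server address is passed alongside the lock path via a --consul option; config.yml and resolver.json are placeholder file names):

```bash
./snowplow-emr-etl-runner run \
  -c config.yml \
  -r resolver.json \
  --lock path/to/lock \
  --consul http://127.0.0.1:8500
```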
That way, anyone using path/to/lock as lock and this Consul server will have to respect the lock.
In this case, the lock will be materialized by a key-value pair in Consul, the key being at the specified path.
This release introduces a --resume-from flag to resume the EMR job from a particular step. It is particularly useful when recovering from a failed step.
The steps you can resume from are, in order:
For example, if your Redshift load failed due to a maintenance window, you might want to relaunch EmrEtlRunner with:
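As a sketch (assuming the Redshift load step is named rdb_load; config.yml and resolver.json are placeholder file names):

```bash
./snowplow-emr-etl-runner run \
  -c config.yml \
  -r resolver.json \
  --resume-from rdb_load
```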
The --start and --end flags, which used to allow you to process files for a specific time period, have been removed. This is because the new S3DistCp-based staging process doesn’t inspect the timestamps in the filenames.
The --process-enrich and --process-shred options, which let you run only the enrich and shred steps respectively, have also been retired to simplify EmrEtlRunner.
The latest version of EmrEtlRunner is available from our Bintray.
Upgrading is straightforward:
Make sure to use the new run command when launching EmrEtlRunner.
We strongly recommend setting up and configuring one of the two locking options. This is the most secure way of preventing a race condition, whereby a second scheduled EmrEtlRunner run starts while the last run is still partway through.
Upcoming Snowplow releases include:
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.