Snowplow R112 Baalbek batch pipeline reliability improvements

20 February 2019  •  Ben Fradet

Snowplow 112 Baalbek, named after the city in Eastern Lebanon, is a release focusing on reliability improvements for the batch pipeline.

Please read on after the fold for:

  1. EmrEtlRunner improvements
  2. Clojure Collector improvement
  3. Redshift (and Postgres) data model improvement
  4. Upgrading
  5. Roadmap
  6. Getting help

Baalbek. Image credit: BlingBling10 at the English Wikipedia, CC BY-SA 3.0

1. EmrEtlRunner improvements

This release focuses on improving EmrEtlRunner, both by adding new features and by making it more robust against failures in the AWS services it depends on.

1.1 Support for persistent EMR clusters

EmrEtlRunner can now run steps on a long-running cluster through the --use-persistent-jobflow argument, which saves the time otherwise spent bringing a cluster up and down on every run. Additionally, the --persistent-jobflow-duration argument sets the time-to-live of this long-running cluster, after which a new one is spun up.

This feature, together with EmrEtlRunner’s stream enrich mode, enables drip-feeding into Redshift: for example, you can load your enriched data into Redshift every ten minutes or less.
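The time-to-live behaviour described above boils down to a simple age check. The following is a minimal sketch of that decision, not EmrEtlRunner's actual code: the function name should_reuse_cluster and its parameters are hypothetical, and the way EmrEtlRunner tracks the cluster's creation time is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_reuse_cluster(
    created_at: Optional[datetime],
    max_age: timedelta,
    now: datetime,
) -> bool:
    """Return True if an existing cluster is still within its time-to-live.

    Hypothetical helper: `created_at` is None when no long-running
    cluster exists yet, and `max_age` plays the role of the duration
    passed to --persistent-jobflow-duration.
    """
    if created_at is None:
        # Nothing to reuse, so a new cluster has to be started.
        return False
    # Reuse the cluster only while it is younger than its time-to-live.
    return now - created_at < max_age


now = datetime(2019, 2, 20, 12, 0, tzinfo=timezone.utc)
ttl = timedelta(hours=24)

print(should_reuse_cluster(now - timedelta(hours=3), ttl, now))   # True
print(should_reuse_cluster(now - timedelta(hours=30), ttl, now))  # False
print(should_reuse_cluster(None, ttl, now))                       # False
```

With a short interval between runs, most runs would take the reuse branch, which is what makes frequent loads practical.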

1.2 Recovery from EMR timeouts and S3 internal errors

In recent months, we’ve observed an increase in reliability issues with Amazon EMR, in particular more and more timeouts. With this release, we aim to tackle this issue by retrying on timeouts with exponential backoff. We’re adopting the same strategy for S3 internal errors.
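To illustrate the retry strategy, here is a minimal sketch of retrying with exponential backoff. It is not the EmrEtlRunner implementation: TransientError, retry_with_backoff, and the default retry count, base delay and factor are all assumptions chosen for the example.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for an EMR timeout or an S3 internal error."""


def backoff_delays(retries: int, base: float = 1.0, factor: float = 2.0) -> List[float]:
    """Delays in seconds before each retry: base, base*factor, base*factor^2, ..."""
    return [base * factor ** i for i in range(retries)]


def retry_with_backoff(
    action: Callable[[], T],
    retries: int = 3,
    base: float = 1.0,
    factor: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (TransientError,),
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call `action`, retrying up to `retries` times on transient errors."""
    for delay in backoff_delays(retries, base, factor):
        try:
            return action()
        except retry_on:
            # Wait longer after each consecutive failure.
            sleep(delay)
    # Final attempt: any error now propagates to the caller.
    return action()


attempts = {"count": 0}


def flaky_call() -> str:
    # Fails twice, then succeeds, to simulate a transient outage.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("timeout")
    return "ok"


waits: List[float] = []
result = retry_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)  # ok [1.0, 2.0]
```

Passing the sleep function as a parameter keeps the example fast to run. A production version would typically also cap the maximum delay and add jitter.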

1.3 Compaction steps for the different jobs' output

Some users have been hitting S3 rate limits because too many small files were being written at the same time. This happens when heavy use of contexts is combined with a large volume of data. For example, a user with 50 different contexts and a shred job running concurrently on 20 Spark executors will end up with 50 × 20 = 1000 files, often very small, being written concurrently. This is just for the Snowplow pipeline and excludes any kind of processing that could happen on your end.

To solve this issue, this release consolidates those small files into bigger ones during their transfer from HDFS to S3. This results in fewer writes and should mean fewer contention issues.
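The arithmetic and the effect of consolidation can be sketched as follows. This is an illustration only: the path layout and the choice to group part files by parent directory are assumptions, not a description of how the release actually groups files.

```python
from typing import Dict, List


def concurrent_file_count(contexts: int, executors: int) -> int:
    """Each executor writes one file per context, so the counts multiply."""
    return contexts * executors


def group_by_directory(paths: List[str]) -> Dict[str, List[str]]:
    """Group part files by parent directory.

    In this sketch, each group stands for one consolidated output file.
    """
    groups: Dict[str, List[str]] = {}
    for path in paths:
        parent, _, _ = path.rpartition("/")
        groups.setdefault(parent, []).append(path)
    return groups


print(concurrent_file_count(50, 20))  # 1000

# Hypothetical layout: 2 contexts written by 3 executors each.
paths = [
    f"shredded/context_{c}/part-{e:05d}"
    for c in ("a", "b")
    for e in range(3)
]
groups = group_by_directory(paths)
print(len(paths), "files ->", len(groups), "consolidated files")  # 6 files -> 2 consolidated files
```

Under this assumed grouping, the number of writes to S3 scales with the number of contexts rather than with contexts times executors.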

1.4 Renaming of the EMR steps

We’ve taken advantage of this release to also rename the EMR steps and remove some of the clutter accumulated over the years.

2. Clojure Collector improvement

We’ve increased the number of files that can be open concurrently on the Elastic Beanstalk machines running the Clojure Collector from 1024 to 65536. This ensures that the collector is never throttled by the limit on open file handles.
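If you want to check the open-file limit that a process actually sees on a Unix machine, the following sketch uses Python's standard resource module. It is a generic check and not part of the Clojure Collector; the helper names are hypothetical, and the limit that matters is the one that applies to the collector's own process.

```python
import resource
from typing import Tuple


def open_file_limits() -> Tuple[int, int]:
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def soft_limit_at_least(required: int) -> bool:
    """Check whether the soft limit allows at least `required` open files."""
    soft, _hard = open_file_limits()
    # RLIM_INFINITY means the limit is unbounded.
    return soft == resource.RLIM_INFINITY or soft >= required


soft, hard = open_file_limits()
print(f"soft limit: {soft}, hard limit: {hard}")
print("room for 65536 open files:", soft_limit_at_least(65536))
```

The output depends on the machine the script runs on, so no expected values are shown.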

3. Redshift (and Postgres) data model improvement

The geo_region column has been widened to 3 characters to accommodate a change in the underlying library that the IP lookups enrichment uses to fill this column.

Thanks to Mike from Snowflake Analytics for this open source contribution.

4. Upgrading

4.1 Upgrading EmrEtlRunner

The latest version of the EmrEtlRunner is available from our Bintray here.

A new setting is needed in the configuration file to enable or disable the compaction of the shred job’s output discussed above:

aws:
  s3:
    consolidate_shredded_output: false

For a complete example, see our sample config.yml template.

4.2 Upgrading the Clojure Collector

The new Clojure Collector is available in S3 at:

s3://snowplow-hosted-assets/2-collectors/clojure-collector/clojure-collector-2.1.3-standalone.war

4.3 Upgrading the events table definition

We’ve put together migration scripts which are available on GitHub:

5. Roadmap

Upcoming Snowplow releases include:

Stay tuned for announcements of more upcoming Snowplow releases soon!

6. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problem, please visit our Discourse forum.