Snowplow R112 Baalbek batch pipeline reliability improvements
Please read on after the fold for:
- EmrEtlRunner improvements
- Clojure Collector improvement
- Redshift (and Postgres) data model improvement
BlingBling10 at the English Wikipedia CC BY-SA 3.0
1. EmrEtlRunner improvements
This release is focused on improving EmrEtlRunner by adding new features and to make it more robust with respect to AWS services at the same time.
1.1 Support for persistent EMR clusters
EmrEtlRunner is now able to run steps on a long-running cluster saving up and down time through a
--use-persistent-jobflow argument. Additionally it’s possible to specify the time-to-live of this long-running cluster before spinning up a new one through
This feature, together with EmrEtlRunner’s stream enrich mode, enables drip-feeding into Redshift, to, for example, load your enriched data into Redshift every ten minutes or less.
1.2 Recovery from EMR timeouts and S3 internal errors
In recent months, we’ve observed an increase in reliability issues with Amazon EMR with more and more timeouts. With this release we’re aiming to tackle this issue by retrying in case of timeouts using an exponential backoff. We’re adopting the same strategy regarding S3 internal errors.
1.3 Compaction steps for the different jobs' output
Some users have been hitting S3 rate limits because of too many small files being written at the same time. This is the result of making heavy use of contexts while having large volume of data. For example, a user using 50 different contexts with a shred job running concurrently on 20 Spark executors will result in 1000, often very small, files being written concurrently. This is just for the Snowplow pipeline and excludes any kind of processing that could happen on your end.
To solve this issue, this release will consolidate those small files into bigger ones during their transfer from HDFS to S3. This will result in less writes and should mean fewer contention issues.
1.4 Renaming of the EMR steps
We’ve taken advantage of this release to also rename the EMR steps and remove some of the clutter accumulated over the years.
2. Clojure Collector improvement
We’ve increased the number of files that can be opened concurrently on the Elastic Beanstalk machines running the Clojure Collector from 1024 to 65536. This ensures that the collector is never throttled on the number of open file handles it can accept.
3. Redshift (and Postgres) data model improvement
geo_region column has been bumped to 3 characters in order to accommodate for a change in the underlying library the IP lookups enrichment used to fill this column.
4.1 Upgrading EmrEtlRunner
The latest version of the EmrEtlRunner is available from our Bintray here.
A setting is needed to enable or disable compaction of the output of the shred job discussed above.
For a complete example, see our sample
4.2 Upgrading the Clojure Collector
The new Clojure Collector is available in S3 at:
4.3 Upgrading the events table definition
We’ve put together migration scripts which are available on GitHub:
Upcoming Snowplow releases include:
- R113 [RT] Real-time maintenance, a release focusing on the real-time pipeline with plenty of community contributions
Stay tuned for announcements of more upcoming Snowplow releases soon!
6. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problem, please visit our Discourse forum.