This release is focused on improving EmrEtlRunner by adding new features and to make it more robust with respect to AWS services at the same time.
EmrEtlRunner is now able to run steps on a long-running cluster saving up and down time through a
--use-persistent-jobflow argument. Additionally it’s possible to specify the time-to-live of this long-running cluster before spinning up a new one through
This feature, together with EmrEtlRunner’s stream enrich mode, enables drip-feeding into Redshift, to, for example, load your enriched data into Redshift every ten minutes or less.
In recent months, we’ve observed an increase in reliability issues with Amazon EMR with more and more timeouts. With this release we’re aiming to tackle this issue by retrying in case of timeouts using an exponential backoff. We’re adopting the same strategy regarding S3 internal errors.
Some users have been hitting S3 rate limits because of too many small files being written at the same time. This is the result of making heavy use of contexts while having large volume of data. For example, a user using 50 different contexts with a shred job running concurrently on 20 Spark executors will result in 1000, often very small, files being written concurrently. This is just for the Snowplow pipeline and excludes any kind of processing that could happen on your end.
To solve this issue, this release will consolidate those small files into bigger ones during their transfer from HDFS to S3. This will result in less writes and should mean fewer contention issues.
We’ve taken advantage of this release to also rename the EMR steps and remove some of the clutter accumulated over the years.
We’ve increased the number of files that can be opened concurrently on the Elastic Beanstalk machines running the Clojure Collector from 1024 to 65536. This ensures that the collector is never throttled on the number of open file handles it can accept.
geo_region column has been bumped to 3 characters in order to accommodate for a change in the underlying library the IP lookups enrichment used to fill this column.
Thanks to Mike from Snowflake Analytics for this open source contribution.
The latest version of the EmrEtlRunner is available from our Bintray here.
A setting is needed to enable or disable compaction of the output of the shred job discussed above.
For a complete example, see our sample
The new Clojure Collector is available in S3 at:
We’ve put together migration scripts which are available on GitHub:
Upcoming Snowplow releases include:
Stay tuned for announcements of more upcoming Snowplow releases soon!
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problem, please visit our Discourse forum.