We are very pleased to announce the release of Snowplow 0.8.6, with two significant performance-related improvements to the Hadoop ETL. These improvements are:
- The Hadoop ETL process is now much faster at processing raw Snowplow log files generated by the CloudFront Collector, because we have tackled the Hadoop “small files problem”
- You can now configure your ETL process on Elastic MapReduce to use Task instances alongside your Master and Core instances; optionally these task instances can be spot (bid-based) instances rather than on-demand
In this post, we will cover each of these improvements in turn.
We are very pleased in this release to finally address Hadoop’s “small files problem” for Snowplow users relying on our CloudFront Collector. As some of you may know, the CloudFront Collector can generate large numbers of very small files – and this is something that can really impede Hadoop’s performance.
With this fix in place, ETL processing speeds will be significantly faster if you were previously processing thousands of small CloudFront files. In particularly severe cases, we have seen speed-ups of 1,867%.
For more information on Hadoop’s small files problem, how badly it was slowing down our ETL process and what we did to fix it, do check out our companion blog post, Dealing with Hadoop’s small files problem.
With this release you can now add Task instances to your ETL process, alongside your existing Master instance and Core instance(s). The additional configuration options in EmrEtlRunner’s config.yml look like this:
:task_instance_bid: variable – this lets you bid an upper bound (in US Dollars) that you are willing to pay for Task instances to be added to your job. Leave this blank if you would prefer on-demand Task instances at the standard EMR prices.
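As a minimal illustrative sketch of how this setting might appear in config.yml (only :task_instance_bid: is confirmed above; the surrounding key names and example values are assumptions for illustration):

```yaml
# Hypothetical fragment of EmrEtlRunner's config.yml. Key names other
# than :task_instance_bid: are assumed, not taken from this release.
:emr:
  :jobflow:
    :task_instance_count: 2       # number of Task instances to add to the job
    :task_instance_type: m1.small
    :task_instance_bid: 0.015     # max bid in US Dollars per instance-hour;
                                  # leave blank for on-demand Task instances
```

Setting the bid too low simply means your Task instances may never be provisioned; the job still completes on your Master and Core instances.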
The best introduction to configuring the various instance groups for your job (Master, Core and Task) is the post Run Amazon Elastic MapReduce on EC2 Spot Instances on the Amazon Web Services Blog. In the language of that post, the Snowplow ETL process is typically a Data-Critical Workload:
“If the overall cost is more important than the time to completion and you don’t want to lose any partial work, run the Master and Core instance groups on On-Demand instances, making sure that you run enough Core instance groups to hold all of your data in HDFS. Add Spot Instances as needed to reduce the overall processing time and the total cost.”
We recommend you experiment with different Task instance configurations (including different bids) to find the best cost-time balance for you.
There are two components to upgrade in this release:
- The Scalding ETL, to version 0.3.2
- EmrEtlRunner, to version 0.3.0
Let’s take these in turn:
If you are using EmrEtlRunner, you need to update your configuration file, config.yml, to work with the latest version of the Hadoop ETL:
You need to upgrade your EmrEtlRunner installation to the latest code (0.8.6 release) on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.8.6
Next, you need to update the format of your config.yml, specifically the :jobflow: section. The new format looks like this:
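As a hypothetical sketch only (the exact key names shipped in this release are not reproduced here; the names below are assumed to mirror the Master/Core/Task instance groups described above, with :task_instance_bid: as documented earlier in this post):

```yaml
# Assumed shape of the new :jobflow: section, for illustration
:jobflow:
  :master_instance_type: m1.small
  :core_instance_count: 2
  :core_instance_type: m1.small
  :task_instance_count: 2
  :task_instance_type: m1.small
  :task_instance_bid: 0.015   # leave blank to use on-demand Task instances
```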
Note that almost all of these variables are either new or renamed. For recommended values for the new task instance settings, please see Using task instances in your ETL process above.
You can see the full list of issues delivered in Snowplow 0.8.6 on GitHub.