Introducing Avalanche for load-testing Snowplow

20 May 2016  •  Joshua Beemster

We are pleased to announce the very first release of Avalanche, the Snowplow load-testing project.

As the Snowplow platform matures and is adopted more and more widely, understanding how Snowplow performs under various event scales and distributions becomes increasingly important.

Our new open-source Avalanche project is our attempt to create a standardized framework for testing Snowplow batch and real-time pipelines under various loads. It will hopefully also expand ours and the community’s knowledge on what configurations work best and to discover (and then remove!) limitations that we might come across.

At launch, Avalanche is wholly focused on load-testing of the Snowplow collector components. Over time we hope to extend this to: load-testing other Snowplow components (and indeed the end-to-end pipeline); automated auditing of test runs; extending Avalanche to test other event platforms.

In the rest of this post we will cover:

  1. How to setup the environment
  2. How to access results
  3. The Clojure Collector and what we have learned
  4. Roadmap
  5. Documentation
  6. Getting help

1. How to setup the environment

Avalanche comes pre-packaged as an AMI available directly from the Community AMIs section when launching a fresh EC2 instance. Simply search for snowplow-avalanche-0.1.0 to find the required AMI and then follow these setup instructions to get started.

Under the hood Avalanche uses the excellent Gatling.io load-testing framework to send huge volumes of requests to our Snowplow event collectors.

Once the instance has been launched and you have SSH’ed onto the box you will need to setup your environment variables for the simulation:

  • SP_COLLECTOR_URL: your Snowplow Collector endpoint
  • SP_SIM_TIME: the total simulation time in minutes
  • SP_BASELINE_USERS: the base amount of users that are pinging the collector
  • SP_PEAK_USERS: the peak amount of users to load test up until

You can then go ahead and launch Gatling using either our launch script:

ubuntu$ ./snowplow/scripts/2_run.sh

Or you can launch it yourself:

ubuntu$ /home/ubuntu/snowplow/gatling/gatling-charts-highcharts-bundle-2.2.1-SNAPSHOT/bin/gatling.sh -sf /home/ubuntu/snowplow/src

After which you can select the simulation you wish to run:

Choose a simulation number:
     [0] com.snowplowanalytics.avalanche.ExponentialPeak
     [1] com.snowplowanalytics.avalanche.LinearPeak

Or to directly launch the simulation without any interaction:

ubuntu$ /home/ubuntu/snowplow/gatling/gatling-charts-highcharts-bundle-2.2.1-SNAPSHOT/bin/gatling.sh -sf /home/ubuntu/snowplow/src -s com.snowplowanalytics.avalanche.ExponentialPeak

The above can be useful if you wish to run Avalanche across many EC2 instances at the same time and would like to supply the launch command within the User-Data section in place of having to SSH onto the instance.

For very high throughputs, you will need to contact Amazon Technical Support to have them pre-warm your Load Balancer to be able to handle the throughput being generated by Gatling.

Note: in using Gatling we comfortably managed 825,000 requests per minute from a single c4.8xlarge instance. For much more than this we recommend moving to running Avalanche from multiple instances.

2. How to access results

Gatling generates results as a simple webpage. The directory these result pages are stored in is determined by the -rf flag being passed when you launch Gatling. When launching via the 2_run.sh script above, this is set to /home/ubuntu/snowplow/results.

To easily view these simulations from the actual instance:

ubuntu$ cd /home/ubuntu/snowplow/results
ubuntu$ sudo python -m SimpleHTTPServer 80

Then navigate to the instance DNS found within your EC2 console and select the relevant result file you wish to review.

3. The Clojure Collector and what we have learned

Avalanche has the ability to simulate both linear and exponential increases in load up to 825 thousand requests per minute (from a single instance - potentially more!). At Snowplow we have used this to assert that the scaling rules we have in place are working optimally and will continue to work under any amount of load.

We recently put the Clojure Collector, our most battle-tested collector, through its paces with Avalanche. The test was configured as follows:

  • Instance Type: m3.medium
  • Simulation time: 180 minutes
  • Baseline users: 100 (~6,000 requests per minute)
  • Peak users: 15000 (~900,000 requests per minute)
  • Loading type: Linear
  • Scaling based on CPU and Network Latency

This test resulted in an 150x load increase in just under 60 minutes. At this point the load was held for ~35 minutes before being scaled back down over the next 60 minutes.

Requests Over Time Average Latency CPU Average

The images above illustrate that the initial load increase gave a very high latency increase (Image 2) and that a lot of the requests were initially failing, as seen by the less than linear line of HTTP requests initially (Image 1). However, the scaling rules aggressivly began provisioning instances to deal with the influx. Within 15 minutes the latency had been reduced back to ~5-10 milliseconds, the linear line had been restored and the scaling had stabilized at 17 collectors. This number was scaled up to 55 collectors at its peak to deal with the 825,000 odd requests per minute.

This result was only attained after much trial and error with our scaling rules. Previously we had limited our scaling logic to using solely CPU and then scaling quite passively, just one instance at a time. From testing with Avalanche, we have now updated this logic to be much more aggressive and, as you can see, able to withstand huge growth comfortably so as to recover with a minimum of fuss.

The changes involved:

  1. Stepwise scaling rules so that we can better handle the situation at hand: if the collectors are hovering at roughly 60% CPU usage, then a single extra instance will bring it back to within a healthy zone. If however they are already at 90%+, then an extra instance while helpful will not bring the cluster back to a healthy state. Therefore, in this case we now bring 3 instances in to quickly resolve the extra load
  2. Latency based scaling: a Snowplow event collector is fundamentally an HTTP web server, and the most important metric for a web-server is its latency, i.e. how long it takes to get a response from it. If a collector has very high latency (regardless of its CPU usage), then something is struggling and we probably need another instance. As above, depending on how high the latency has gone we will provision a different number of extra instances to recover as quickly as possible

We will be working to feed these findings back into the standard Snowplow documentation over the coming weeks.

4. Roadmap

We’re excited about what we have already learned from the initial version of Avalanche, and have big plans for the future, including:

  • Building an interactive front-end UI with full auditing capabilities (#6)
  • Adding ability to run Avalanche on Blazemeter (#2)
  • Collaborating with other event analytics platform vendors to test their platforms using Avalanche

But above all, we are excited about the potential for the Snowplow community to use Avalanche, extend it with your own scenarios and share your findings on our forums. Let’s start a conversation about operating event data pipelines at massive scale!

5. Documentation

Please check out the Avalanche usage manual for setting up and running your own load tests.

6. Getting help

We hope that you find Avalanche useful - of course, this is only its first release, so don’t be afraid to get in touch or raise an issue on GitHub!