At Snowplow we are in the unique position of running 200+ Snowplow data pipelines across more than 150 different AWS and GCP accounts. Across our customer base there is a great variety in the volume of data processed per pipeline and the traffic patterns across them.
Because of this, we work hard on our autoscaling technology to ensure pipelines automatically scale up to handle sudden increases in traffic, and scale down as volumes decrease to avoid unnecessary infrastructure costs. The speed with which we can scale up pipelines is vitally important, because it determines how fast the pipeline can respond to sudden increases in traffic volume. If scale-up times are slow, scaling needs to be more “pre-emptive” to avoid introducing undue latency or, in the worst case, losing data during extreme traffic spikes. Data loss can occur if sufficient capacity is not provisioned in time to cope with the surge.
We have a number of customers on both AWS and GCP that have:
- Very high data volumes (some peaking at 22,000 events per second)
- Very spiky traffic patterns
At this extreme scale, very steep traffic surges can create issues for Kinesis. Due to the time taken to provision new shards and reshard a stream from, for example, 200 shards to 400 shards, it is sometimes necessary to force Kinesis to maintain a higher shard count than is necessary for normal operation, which can have cost implications.
There is no equivalent challenge on GCP, since additional Pub/Sub capacity is “hot”: it is already provisioned and only needs to be made available to the customer when required.
To meet this Kinesis-specific challenge for the largest Snowplow deployments, we have built ‘Surge Protection’.
Introducing Surge Protection
At Snowplow, we have released a new feature on AWS so our customers can have confidence that their pipeline will successfully scale to handle even the most extreme traffic spikes.
We have achieved this by adding Amazon Simple Queue Service (SQS) as a buffer. It acts as a pressure valve between the collector and Kinesis, so that messages no longer have to wait in the collector’s memory while Kinesis is scaling. Instead, messages are written to SQS, where they are queued while Kinesis is resizing. The sqs2kinesis application is then responsible for reading the queued messages and writing them to Kinesis once it is ready. With Surge Protection, customers have greater assurance that their pipeline will absorb even the most extreme data surges while Kinesis scales, without having to pre-provision capacity.
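To make the buffering pattern concrete, here is a minimal in-memory sketch in Python. It is a toy simulation, not the actual collector or sqs2kinesis code: the `Stream`, `Collector`, and `drain` names are hypothetical, a `deque` stands in for the SQS queue, and the sketch assumes the collector falls back to the queue only when a write to the stream is throttled.

```python
from collections import deque


class Stream:
    """Hypothetical stand-in for a Kinesis stream with limited write capacity per tick."""

    def __init__(self, capacity_per_tick):
        self.capacity_per_tick = capacity_per_tick
        self.records = []
        self._written_this_tick = 0

    def try_put(self, record):
        # Reject the write once this tick's capacity is used up (simulated throttling).
        if self._written_this_tick >= self.capacity_per_tick:
            return False
        self.records.append(record)
        self._written_this_tick += 1
        return True

    def next_tick(self):
        # Start a new time window with fresh write capacity.
        self._written_this_tick = 0


class Collector:
    """Hypothetical collector: writes to the stream, falls back to the buffer when throttled."""

    def __init__(self, stream, overflow):
        self.stream = stream
        self.overflow = overflow  # deque standing in for the SQS queue

    def handle(self, event):
        if not self.stream.try_put(event):
            self.overflow.append(event)


def drain(overflow, stream):
    """Plays the role described for sqs2kinesis: move buffered events into the stream."""
    moved = 0
    while overflow and stream.try_put(overflow[0]):
        overflow.popleft()
        moved += 1
    return moved


# A surge of five events hits a stream that can only accept two per tick.
stream = Stream(capacity_per_tick=2)
overflow = deque()
collector = Collector(stream, overflow)
for i in range(5):
    collector.handle(f"event-{i}")
buffered_during_surge = len(overflow)

# The stream finishes resizing; the buffered events are then drained into it.
stream.capacity_per_tick = 4
stream.next_tick()
moved = drain(overflow, stream)

print(buffered_during_surge, moved, len(stream.records), len(overflow))
```

In this sketch no event is held in the collector or dropped: whatever the stream cannot accept during the surge waits in the buffer until capacity catches up.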
- We will be rolling out this functionality for Snowplow Insights customers
- If you would like early access to this functionality, contact us at firstname.lastname@example.org and we’ll get you up and running
- For more details on how to set up Surge Protection as an Open Source user please visit our docs page
Not a Snowplow Insights customer yet? Get in touch with us here to learn more.