Unpicking the Snowplow data pipeline and how it drives AWS costs

09 July 2013  •  Yali Sassoon

Back in March, Robert Kingston suggested that we develop a Total Cost of Ownership model for Snowplow: something that would enable a user or prospective user to accurately estimate their Amazon Web Services monthly charges going forwards, and see how those costs vary with different traffic levels. We thought this was an excellent idea.

Since Rob’s suggestion, we’ve made a number of important changes to the Snowplow platform that have changed the way Snowplow costs scale with the number of events served:

  1. We replaced the Hive-based ETL with the Scalding-based enrichment process
  2. We dealt with the small files problem, dramatically reducing EMR costs
  3. We enabled support for task and spot instances

As a result of the pending updates, we held off building the model. But now that they have been delivered, we have started putting together the model. In this post we’ll walk through the Snowplow data pipeline in detail, and show how different parts of the pipeline drive different AWS costs. In a follow-on post, we will ask Snowplow users to share with us data points to help us accurately model those costs.

What drives AWS costs for Snowplow users?

It is worth distinguishing the different AWS services, and examining how each scales with volume of events per day, and over time. If we take a typical Snowplow user (i.e. one running the Cloudfront collector rather than the Clojure collector), and storing their data on Redshift for analysis, rather than analyzing their data in S3 using EMR) then we need to account for:

  1. Cloudfront costs,
  2. S3 costs,
  3. EMR costs and
  4. Redshift costs

1. Cloudfront costs

Snowplow uses Cloudfront to serve both sp.js, the Snowplow Javascript, and the Snowplow pixel i. Broadly, Cloudfront costs scale linearly with the volume of data served out of Cloudfront, with a couple of provisos:

  • The cost per MB served goes down, as volumes rise (because Amazon tier their pricing)
  • The exact cost varies by AWS account region, and the region the browser that loads the content served by Cloudfront is in (so the exact cost depends on the distribution of your visitors geographically)

Modelling the costs is therefore reasonably straightforward. The i pixel is served with every event tracked: so we can calculate the cost of serving i based on the size of i (it is 37 bytes), the number of events, and the geographic split of visitors by Amazon region. The costs scale almost linearly with the number of events tracked per day.

Modelling the cost of serving sp.js is a little trickier. As discussed in our blog post on browser caching, it is possible to set sp.js so that it is only served once per unique visitor, rather than once per event. Because sp.js is 37KB (so a lot larger than i), this has a significant impact on your Cloudfront costs. From a modelling perspective, then, we should estimate costs, based on the number of unique visitors per month, and their geographic distribution by Amazon regions. The costs scale almost linearly with the number of unique visitors to the site / network.

2. S3 costs

Snowplow uses S3 to store event data. Amazon charges for S3 based on:

  • The volume of data stored in S3
  • The number of requests for that data (in the form of GET / PUT / COPY requests)

Modelling how the volume of data grows with increasing numbers of events, and over time (as the amount of data stored in S3 grows, because Snowplow users generally never throw data away) is straightforward: we calculate how large a ‘typical’ row of Snowplow data is in the raw collector logs, and then in the enriched Snowplow event files. We then assume one row of each type for every event that has been tracked, sum to find the total required storage space, and then multiply by Amazon S3’s cost per GB.

What is more tricky is modelling the number of requests to S3 for the files. To understand why, we need to examine the Snowplow data pipeline, and in particular, the part of the pipeline that takes the raw data generated by the Snowplow collector, cleans that data, enriches it, and uploads the enriched data into Amazon Redshift for analysis.

The first part of the data pipeline is orchestrated by EmrEtlRunner. This performs the bulk of data processing work:

emr-etl-runner

This encapsulates the bulk of the data processing:

  • Raw collector log files that need to be processed are identified in the in-bucket, and moved to the processing bucket
  • EmrEtlRunner then triggers the Enrichment process to run. This spins up an EMR cluster, loads the data in the processing bucket into HDFS, loads Scalding Enrichment process (as a JAR) and uses that JAR to process the raw logs uploaded into HDFS.
  • The output of that Scalding Enrichment process is then written straight into S3. The EMR cluster is then shut down.
  • Once the job has completed, the raw logs are moved from the processing bucket to the archive bucket.

For modelling AWS costs (and S3 costs in particular), we need to note that two COPY requests are executed for each collector log file written to S3: one to move that data from the in-bucket to the processing-bucket, and then another one to move the same file from the processing-bucket to the archive bucket.

In addition, one GET request is executed per raw collector log file, when the data is read from S3 for the purposes of writing into HDFS.

The second part of the data pipeline is orchestrated by the StorageLoader:

storage-loader

This is a much simpler stage of the data processing pipeline:

  • Data from enriched event files generated by the Scalding process on EMR is read and written to Amazon Redshift
  • The enriched event files are then moved from the in-bucket (which was the archive bucket for EmrEtlRunner) to the archive bucket (for the StorageLoader)

Again, for modelling purposes, we note that a single GET request is executed for each enriched event file (when the file is read for the purposes of copying the data into Redshift). and then a COPY request is executed for that file to move it from the in to the out bucket.

This means that we can accurately forecast costs based on the number of raw collector log files generated and the number of enriched event files generated. Unfortunately, modelling how this number scales with visitors / page views and events is not straightforward because it is not clear how the number of collector log files and Snowplow event files scales with numbers of events tracked. This is one of the things we hope the community can help us with, by providing us with data points to help us unpick the relevant relationships.

To articulate our challenge: we need to understand

  1. How the number of raw collector log files (for the Cloudfront collector) scales with number of events
  2. How the number of enriched Snowplow event files scale with the number of events

We have working hypotheses for what determines both numbers. We need to validate these with the Snowplow community, and quantify the relationships mathematically, based on data points shared by members of our community:

We believe that Amazon Cloudfront generates one log file per server per edge location every hour. That means that Snowplow users who do not track large volumes of traffic, will generate a surprisingly large number of log files, each with very low volumes of data. (E.g. 1-5 rows.) As traffic levels climb, the number of log files will increase (more requests to more edge locations is bound to hit more Cloudfront servers), but that this tails off reasonably rapidly once there are enough visitors that most servers are hit at most edge locations every hour. (We guess that Amazon has lots of servers at each edge location, so this tailing off might only happen at very large volumes.) We’d therefore expect a graph like the one below, if we plotted numbers of log files vs events:

line-graph

In the case of forecasting the number of Snowplow Event files: this should be more straightforward. We believe that the Scalding Enrichment Process generates one output file for every input file that it receives. The Scalding Enrichment process does not operate on the raw collector logs: to speed up data processing (and reduce costs), these are aggregated using S3DistCopy prior to being fed into the Enrichment Process. This should aggregate the input log files so they are as close to 128MB each as possible. If one output file is produced for each input file, then they will have the same number of rows: it is likely that there is reasonably constant relationship between the size of the two, that mean they are of similar, but not identical, size. (Because they have roughly the same data, but in different formats with different levels of row-level enrichments.) If that is right, then the number of event files should scale almost linearly with the number of events recorded: almost because given the large maximum file size (roughly 128MBs) the graph would look more like a series of step functions, where a new step change occurs each time an additional 128MBs of events are recorded:

step-function

As we explain below, we hope to plot the above graphs with data points volunteered by Snowplow users, to see if we are correct, and then use to drive the model.

3. EMR costs

Snowplow uses EMR to run the Enrichment process on the raw logs created by the collector. Because the process is powered by Hadoop, it should scale linearly: double the number of lines of data to be processed, the time to process should double. Double the number of machines in the cluster, the processing time should be halved.

Amazon charges for EMR based on the number of boxes in the cluster, the size of those boxes and the time the cluster is live for. Amazon rounds up the nearest hour: as a result, we would expect the cost of the daily job to be a step function.

emr-costs

We need to work out how many lines of data we can expect each Amazon box to process in one hour. Getting a handle on these figures will also mean we can advise Snowplow users what size of cluster to spin up based on the number of events processed per run.

1.4 Redshift costs

The majority of Snowplow users store their events table in Amazon Redshift, and then plug analytics tools into Redshift to crunch that data. As a result, they run a Redshift cluster constantly.

Amazon charges per Redshift node, where each node provides a 2TB of storage (for standard XL nodes) or 16TB of storage (for 8XL nodes). As a result, we can model Redshift costs simply as a function of the volume of data stored (itself just a function of the number of events tracked per day, and the number of days Snowplow has been running). As for EMR, we expect a step function (with a big increase in cost each time an additional 2TB node is required):

redshift-costs

Wrapping up

One of the things we realised when we started the modelling exercise, was that we had not sufficiently documented the workings of the Snowplow data pipeline. This blog post is a first stab at addressing that shortfall. We hope it’s useful for those interested in how Snowplow works, and will be using the diagrams etc. in the technical documentation section of the Snowplow Wiki.

We also hope this post is interesting for anyone with an interest in modelling technology costs (and AWS costs in particular).

Lastly, we hope that this has piqued the interest of any Snowplow user who would value the Total Cost of Ownership model. As alluded to above, you can help us build that model, by providing us with the data points we need to validate the above hypothesised relationships, and model them accurately. We detail how you can help, exactly, in a follow on post.