In a previous blog post, we described how we are building a Total Cost of Ownership model for Snowplow: something that would enable a Snowplow user, or prospective user, to accurately forecast their AWS bill going forwards based on their traffic levels.
To build that model, though, we need your help. To make sure the model is accurate and robust, we need to check that the relationships we believe exist between the number of events tracked and the number and size of files generated, as detailed in the last post, are correct, and that we have modelled them accurately. To that end, we are asking Snowplow users to help us by providing the following data:
- The number of events tracked per day
- The number of times the enrichment process is run per day
- The number of CloudFront log files generated per day, and the volume of data
- The amount of time taken to enrich the data in EMR (and the size of cluster used to perform the enrichment)
- The number of files outputted back to S3, and the size of those files
- The total number of lines of data in Redshift, and the amount of Redshift capacity used
We will then share this data with the community, in anonymised form, as part of the model.
We recognise that that is a fair few data points! To thank Snowplow users for their trouble in providing them (as well as building a model for you), we will also send each person who provides data a free Snowplow T-shirt in their size.
In the rest of this post, we provide simple instructions for pulling the relevant data from Amazon.
1. The number of events tracked per day
Simply execute the following SQL statement in Redshift:
2. The number of times the enrichment process is run per day
Most Snowplow users run the enrichment process once per day.
You can confirm how many times you run Snowplow by logging into the AWS S3 console and navigating to the bucket where you archive your Snowplow event files. (This is specified in the StorageLoader config file.) Within the bucket you’ll see a single folder for each enrichment ‘run’, labelled with the timestamp of the run, so you can tell directly how many times the enrichment process is run. In our case, it is once per day.
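If there are too many run folders to count by eye, the tally can be done with a few lines of code. The following is only a minimal sketch in Python: it assumes each folder name begins with the run date in `YYYY-MM-DD` form (check this against your own bucket), and the folder names and the `runs_per_day` helper are hypothetical.

```python
from collections import Counter


def runs_per_day(folder_names):
    """Count enrichment runs per calendar day from run-folder names.

    Assumption: every folder name starts with the run date as YYYY-MM-DD.
    """
    return Counter(name[:10] for name in folder_names)


# Hypothetical folder names, for illustration only.
folders = [
    "2013-07-07-04-00-12",
    "2013-07-08-04-00-09",
    "2013-07-09-04-00-15",
]

counts = runs_per_day(folders)
for day, n in sorted(counts.items()):
    print(day, n)
```

A day with more than one entry in `counts` would indicate that the enrichment process ran more than once on that day.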
3. The number of CloudFront log files generated per day, and the volume of data
This is most easily done using an S3 front end, as the AWS S3 console is a bit limited. We use Cloudberry. In Cloudberry, you can read the number of files generated per day, and their size, directly by right-clicking on the folder with the day’s worth of log file archives and selecting properties:
In the above case we see there were 370 files generated on 2013-07-08, which occupied a total of 366.5KB.
4. The amount of time taken to enrich the data in EMR (and the size of cluster used to perform the enrichment)
You can use the EMR command line tools to generate a JSON with details of each EMR job. In the below example, we pull a JSON for a specific job:
Rather than parse the JSON yourself, we’re very happy for community members to simply save the JSON and email it to us with the other data points. We can then extract the relevant data points from the JSON directly. (We’ll use R and the RJSON package, and blog about how we do it.)
You can either generate a JSON for a specific job (you will need to enter the job ID):
Or you can fetch the data for every job run in the last two days:
Or all the data for every job in the last fortnight:
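For anyone who does want to look inside the saved JSON themselves, here is a rough sketch of how the run time and cluster size might be pulled out. It uses Python rather than the R approach mentioned above, and it assumes the file has a top-level `JobFlows` list whose entries carry `ExecutionStatusDetail` (with `StartDateTime` and `EndDateTime` as epoch seconds) and `Instances` (with `InstanceCount` and `SlaveInstanceType`). Those field names are an assumption to verify against your own file, and the sample document and the `summarise_job_flows` helper are hypothetical.

```python
import json


def summarise_job_flows(raw_json):
    """Return one summary dict per job flow found in the JSON text."""
    doc = json.loads(raw_json)
    rows = []
    for flow in doc.get("JobFlows", []):
        status = flow.get("ExecutionStatusDetail", {})
        instances = flow.get("Instances", {})
        start = status.get("StartDateTime")
        end = status.get("EndDateTime")
        # Assumption: timestamps are epoch seconds. Jobs that have not
        # finished (no end time) get a duration of None.
        if start is not None and end is not None:
            minutes = (end - start) / 60.0
        else:
            minutes = None
        rows.append({
            "job_id": flow.get("JobFlowId"),
            "minutes": minutes,
            "instance_count": instances.get("InstanceCount"),
            "slave_instance_type": instances.get("SlaveInstanceType"),
        })
    return rows


# Hypothetical sample document, not real output from the EMR tools.
sample = json.dumps({
    "JobFlows": [{
        "JobFlowId": "j-EXAMPLE",
        "ExecutionStatusDetail": {
            "StartDateTime": 1373256000.0,
            "EndDateTime": 1373257800.0,
        },
        "Instances": {"InstanceCount": 3, "SlaveInstanceType": "m1.small"},
    }]
})

summary = summarise_job_flows(sample)
print(summary)
```

To run it against a saved file, read the file contents into a string and pass that to `summarise_job_flows` in place of `sample`.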
5. The number of files outputted back to S3, and the size of those files
We can use Cloudberry again. Simply identify a folder in the archive bucket specified in the StorageLoader config, right-click on it and select properties:
In the above example, 3 files were generated for a single run, with a total size of 981.4KB.
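To see how these figures might feed a model, the short Python sketch below turns the two examples quoted in this post (370 log files totalling 366.5KB, and 3 output files totalling 981.4KB) into average file sizes. The `average_file_size_kb` helper is hypothetical, and the two examples are not necessarily from the same run, so the numbers are illustrative only.

```python
def average_file_size_kb(total_kb, file_count):
    """Average size per file in KB, given a folder's total size and file count."""
    if file_count <= 0:
        raise ValueError("file_count must be positive")
    return total_kb / file_count


# Figures quoted in this post.
log_avg = average_file_size_kb(366.5, 370)  # CloudFront log files for 2013-07-08
out_avg = average_file_size_kb(981.4, 3)    # files written back to S3 for one run

print(f"{log_avg:.2f} KB per log file")
print(f"{out_avg:.1f} KB per output file")
```

The contrast between many small input files and a few larger output files is the kind of relationship the data points above are meant to pin down.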
6. The total number of lines of data in Redshift, and the amount of Redshift capacity used
Measuring the amount of space occupied by your events in Redshift is very easy.
First, measure the number of events by executing the following query:
Then, to find out how much disk space that occupies in your Redshift cluster, execute the following query:
The used capacity (in MB) is given in the “used” column: it is 1,941MB in the example below. The total capacity is given as 1,906,184MB, i.e. 1.8TB, because we are running a single (2TB) node.
For our purposes, we only need one of the rows returned to calculate the relationship between disk space used on Redshift and the number of events stored there, which we then use to model Redshift costs.
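As a worked illustration of that calculation, the Python sketch below converts the capacity figure to TB and derives the share of the disk in use and the space per event. The used and capacity values are the ones quoted above; the event count of 1,000,000 is purely hypothetical (substitute your own count), the `redshift_usage` helper is an illustrative name, and treating disk space as scaling linearly with events is a simplifying assumption rather than a finding.

```python
MB_PER_TB = 1024 * 1024


def redshift_usage(used_mb, capacity_mb, event_count):
    """Derive simple storage ratios from the used/capacity figures (in MB)."""
    if capacity_mb <= 0 or event_count <= 0:
        raise ValueError("capacity_mb and event_count must be positive")
    return {
        "capacity_tb": capacity_mb / MB_PER_TB,
        "pct_used": 100.0 * used_mb / capacity_mb,
        "kb_per_event": used_mb * 1024.0 / event_count,
    }


# used_mb and capacity_mb come from this post; event_count is hypothetical.
usage = redshift_usage(used_mb=1941, capacity_mb=1906184, event_count=1_000_000)

print(f"capacity: {usage['capacity_tb']:.1f} TB")
print(f"used: {usage['pct_used']:.2f}% of capacity")
print(f"space per event: {usage['kb_per_event']:.2f} KB")
```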
Help us build an accurate, robust model that we can all use to forecast Snowplow AWS costs
We realise that you, our users, are busy people who have plenty to do aside from spending 20-30 minutes fetching data points related to your Snowplow installation and sending them to us. We really hope, however, that many of you will, because:
- A Total Cost of Ownership Model will be really useful for all of us!
- We’ll send you a Snowplow T-shirt, by way of thanks
If you can gather the above data points (in whatever format is most convenient) and email them to me at yali at snowplowanalytics dot com, along with your T-shirt size, we will send you your T-shirt as soon as they are printed.
So please help us help you, and keep plowing!