Using the new SQL views to perform cohort analysis with ChartIO
We wanted to follow-up our recent launch of Snowplow 0.8.10, with inbuilt SQL recipes and cubes, with a few posts demonstrating how you can use those views to quickly perform analytics on your Snowplow data. This is the first of those posts.
In this post, we’ll cover how to perform a cohort analysis using ChartIO and Snowplow.
Recap: what is Cohort Analysis
We have described cohort analysis at length in the Analyst Cookbook. To sum up, a cohort analysis is a longitudal study, that compares the behaviour or characteristics of groups of people over a long period of time. It therefore encompasses a broad range of analyses, because you can vary:
- how you group people into cohorts (cohort definition), and
- the characteristic that you’re comparing between cohort over time.
In digital media people generally use the phrase ‘cohort analysis’ to refer to measurements of retention rates for different cohorts, where cohorts are defined by when a user was acquired. In that way, SaaS companies, for example, can compare how well they retained customers acquired in October vs those acquired in September vs those acquired in August, and measure in a robust way whether they are getting better at retaining users over time. (This is key to SaaS business model being viable.)
We’ve included 10 different cohort analyses as standard with Snowplow, all of which compare retention rates between different cohorts, but which vary in how they define cohorts and what unit of time the comparison is performed over:
|Cohort definition||Unit of time comparison is performed over|
|Paid channel acquired||Months|
|Paid channel acquired||Weeks|
|Month first touch website||Months|
|Week first touch website||Weeks|
|Month first sign in to website||Months|
|Week first sign in to website||Months|
|Month first transact on website||Months|
|Week first transact on website||Weeks|
Other metrics you might want to compare between cohorts include average revenue per user and average engagement by user. (Engagement can be defined in many different ways.) For details in how to perform cohort analyses with these measures, see the detailed recipe in the Analysts Cookbook.
In this post we’re going to plot retention by month, based on the month that a user first touched the website. We’re going to perform this analysis for the Snowplow website itself.
Performing the analysis
Before we dive into ChartIO and create our plot, let’s first have a look at the raw data in Navicat, by opening up our view directly. Go into Navicat (or your SQL UI of choice) and open up
(You can click on the picture above to [zoom in] (/assets/img/blog/2013/10/cohort-analysis/1.JPG) on it.) Let’s take a good look at the actual structure of the data: this will make it much clearer how to successfully plot the data in ChartIO:
- Each line of data gives the number of uniques, for that a particular cohort, that were active each month. (This number is given in the
- Each line also gives a second metric: the
fraction_retained. This is the number of uniques that were active that month, divided by the total number of uniques in that cohort.
- We have three dimensions for each line of data: the first, called
cohort, is the month that the user first touched the website.
- The second dimension is
month_actual- this is the month that the measure was taken. So we see in our example, that we have 9 records for the February 2013 cohort, one for February, one for March, one for April etc. until October. For those 9 records, the first dimension is always equal to February 2013 - i.e. the month that these users first came to our website. But the
- As well as giving the actual month that each measure was taken, we also get a
month_rankwhich equals 1 for the first measurement for the cohort, 2 for the second etc. This will make it easy to compare the retention rates after e.g. 3 months between cohorts that signed up in January and February, for example: note that we’d want to compare the March numbers for our January cohort with our April numbers for our February cohort. Rather than work out the number of months between the cohort definition and the actual month, we can simply compare figures where
month_rank= 3 for our comparison.
- Finally, note that our
fraction_retainedfigure is always 1 (i.e. 100%) for the first month’s reading for each cohort. That is because, by definition, every member of the cohort is active on the first month that they touch the website. (Activity is defined as visiting the website at least once.)
Now - to perform our analysis, we want to plot a line graph: one per cohort, where we’re plotting month rank (on the x-axis) against fraction retained (on the y-axis). We expect a graph as follows, with retention by cohort decreasing over time for each of cohorts:
Let’s plot our cohort analysis! Log into ChartIO, go into a dashboard where you want to create your graph and click the Add Chart button. (We’re going to assume you’ve already setup a connection in ChartIO to your Snowplow data in Redshift, and in particular, the
recipes_customer schema. If you haven’t, instructions can be found here.)
Select the data source on the left which connects to the
recipes_customer schema. A long list of all the different views available in that schema will be shown below. Select the Cohort Dfn by Month First Touch view. The different measures and dimensions will be shown:
Now let’s create our plot. We want to plot Fraction Retained as our metric, so drag that from the Data Sources column to the Measures pane:
Now we want to plot this over Month Rank, so let’s drag Month Rank from the Data Sources column to the Dimensions pane:
Lastly we want to compare the fraction retained over month rank between Cohort, so drag Cohort from the Data Sources column to the dimensions pane as pane as well. ChartIO, on seeing that this is a date field, assumes you want to plot it by week. We want to plot it by month, so click on it in the dimensions pane to reveal a dropdown, and set Time bucket to Month:
Now click the Chart Query button. A table with the different values (a subset of those we saw in Navicat) will appear above. Click on the line graph icon, to its right, to plot a line chart and bingo! The cohort analysis is complete:
We can see that retention has not, sadly for us, improved over time. It does appear that the users acquired in August, however, were retained better than those acquired in other months. It would be interesting to understand why: to do this, we’d need to look at how each user was acquired (what drove them to our website), and whether some channels are better at driving “sticky” users than others, and if those channels accounted for a bigger share of user acquisition in August than the other months.
We plan to post more guides to using the recipes directly in ChartIO and other analytics tools, and as building blocks for developing your own, bespoke analysis, over the next few months. Stay tuned!