Using the new SQL views to perform cohort analysis with ChartIO


We wanted to follow-up our recent launch of Snowplow 0.8.10, with inbuilt SQL recipes and cubes, with a few posts demonstrating how you can use those views to quickly perform analytics on your Snowplow data. This is the first of those posts.


In this post, we’ll cover how to perform a cohort analysis using ChartIO and Snowplow.

Recap: what is Cohort Analysis

We have described cohort analysis at length in the Analyst Cookbook. To sum up, a cohort analysis is a longitudal study, that compares the behaviour or characteristics of groups of people over a long period of time. It therefore encompasses a broad range of analyses, because you can vary:

  1. how you group people into cohorts (cohort definition), and
  2. the characteristic that you’re comparing between cohort over time.

In digital media people generally use the phrase ‘cohort analysis’ to refer to measurements of retention rates for different cohorts, where cohorts are defined by when a user was acquired. In that way, SaaS companies, for example, can compare how well they retained customers acquired in October vs those acquired in September vs those acquired in August, and measure in a robust way whether they are getting better at retaining users over time. (This is key to SaaS business model being viable.)

We’ve included 10 different cohort analyses as standard with Snowplow, all of which compare retention rates between different cohorts, but which vary in how they define cohorts and what unit of time the comparison is performed over:

Cohort definition Unit of time comparison is performed over
Paid channel acquired Months
Paid channel acquired Weeks
Referer acquired Months
Referer acquired Weeks
Month first touch website Months
Week first touch website Weeks
Month first sign in to website Months
Week first sign in to website Months
Month first transact on website Months
Week first transact on website Weeks

Other metrics you might want to compare between cohorts include average revenue per user and average engagement by user. (Engagement can be defined in many different ways.) For details in how to perform cohort analyses with these measures, see the detailed recipe in the Analysts Cookbook.

In this post we’re going to plot retention by month, based on the month that a user first touched the website. We’re going to perform this analysis for the Snowplow website itself.

Performing the analysis

Before we dive into ChartIO and create our plot, let’s first have a look at the raw data in Navicat, by opening up our view directly. Go into Navicat (or your SQL UI of choice) and open up customer_recipes.cohort_retention_by_month_first_touch view:

Let’s take a good look at the actual structure of the data: this will make it much clearer how to successfully plot the data in ChartIO:

Now – to perform our analysis, we want to plot a line graph: one per cohort, where we’re plotting month rank (on the x-axis) against fraction retained (on the y-axis). We expect a graph as follows, with retention by cohort decreasing over time for each of cohorts:


Let’s plot our cohort analysis! Log into ChartIO, go into a dashboard where you want to create your graph and click the Add Chart button. (We’re going to assume you’ve already setup a connection in ChartIO to your Snowplow data in Redshift, and in particular, the recipes_customer schema. If you haven’t, instructions can be found here.)


Select the data source on the left which connects to the recipes_customer schema. A long list of all the different views available in that schema will be shown below. Select the Cohort Dfn by Month First Touch view. The different measures and dimensions will be shown:


Now let’s create our plot. We want to plot Fraction Retained as our metric, so drag that from the Data Sources column to the Measures pane:


Now we want to plot this over Month Rank, so let’s drag Month Rank from the Data Sources column to the Dimensions pane:


Lastly we want to compare the fraction retained over month rank between Cohort, so drag Cohort from the Data Sources column to the dimensions pane as pane as well. ChartIO, on seeing that this is a date field, assumes you want to plot it by week. We want to plot it by month, so click on it in the dimensions pane to reveal a dropdown, and set Time bucket to Month:


Now click the Chart Query button. A table with the different values (a subset of those we saw in Navicat) will appear above. Click on the line graph icon, to its right, to plot a line chart and bingo! The cohort analysis is complete:


We can see that retention has not, sadly for us, improved over time. It does appear that the users acquired in August, however, were retained better than those acquired in other months. It would be interesting to understand why: to do this, we’d need to look at how each user was acquired (what drove them to our website), and whether some channels are better at driving “sticky” users than others, and if those channels accounted for a bigger share of user acquisition in August than the other months.

We plan to post more guides to using the recipes directly in ChartIO and other analytics tools, and as building blocks for developing your own, bespoke analysis, over the next few months. Stay tuned!


Related articles