This guide is geared towards data analysts who have limited or no experience with R. It aims to get the reader started with R through concrete examples with Snowplow data - so that by the end of the guide, the reader can pull Snowplow data into R, play around with it and plot it. That should leave the reader comfortable enough to use other resources to build out her knowledge of R, and give her enough knowledge to start using R to perform productive analytics on Snowplow data.
- Why use R?
- Getting started: plotting the number of uniques per day
- Having a look at the data: an introduction to data frames in R
- Creating our first plot
- A more complicated example: comparing visit characteristics by a range of dimensions
- Introducing factors in R
- Producing some more interesting plots with ggplot2
- Where to go from here
In the first section of this guide, we outline why R is such an excellent tool to use. Next (sections 2 to 4), we work through how to pull a simple data set from Snowplow and visualize it in R. In sections 5 to 7, we work through a more complicated set of examples: one in which we pull a more involved data set from Snowplow, and visualize it in a number of ways that are difficult with traditional BI / OLAP tools in Excel.
R is a fantastic analytics tool. To date, however, only a minority of web analysts use it. To list just some of what makes it so excellent, especially when compared with traditional general purpose tools like Excel, and BI tools like Tableau:
- It is great for producing a wide variety of visualizations - including a much wider range than those supported by BI tools
- It has fantastic tools for extending existing visualizations and creating new visualizations altogether
- It is easy to document analytics performed with R and retrace your steps. (Simply by copying and pasting your steps into the R console.) This makes R a much safer and more robust environment to interrogate data than e.g. Excel or Tableau, where if you realize you made an error eight steps back, retracing your steps can be difficult and time consuming.
- It is blindingly fast to both perform complicated analytics and generate beautiful visualizations. The syntax is incredibly concise - what you can achieve in one line of R can take hours of working through menus and options on other systems.
- It helps you think through what you do with your data in a more rigorous way. R forces you to define your data types much more specifically than either Excel or BI tools like Tableau. That rigour is helpful - it means you do things faster, and you’re less liable to make mistakes.
- It is great at statistics. Traditional BI tools and Excel suck at statistics. Sure, you can calculate means, medians, quartiles etc. But actually pulling these together into meaningful distribution plots, or plotting how they change over time, is a pain in these tools, which are much better at counting and summing data. R is simply much better at statistics.
- It is a great tool for modeling data: it is straightforward to create models in R, and compare those models to your actual data sets, either formally through calculation, or by plotting your model(s) against your data(sets)
- It has an enormous library of packages available for performing just about any type of analytics imaginable
- It is free
- It works on all platforms
In spite of all the benefits listed above, people often struggle with R to start off with. The command line is a difficult place to start, and R has some idiosyncrasies that need to be understood.
In this guide, we cover them by working through practical examples with Snowplow data, paying special attention to those key concepts and idiosyncrasies.
If you have not installed R yet, instructions on doing so can be found here.
This guide assumes you have installed R, and installed the
RPostgreSQL package required to get R to talk to PostgreSQL databases, including Amazon Redshift. If you have not installed this, you can do so by executing the following at the R prompt:
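For example:

```r
# Download and install the RPostgreSQL package from CRAN
install.packages("RPostgreSQL")
```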
R will ask you to select a mirror to download the package from. Select a local mirror and the package should install automatically.
Now we need to tell R to use the package:
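Loading it is a one-liner:

```r
# Load the RPostgreSQL package into the current session
library("RPostgreSQL")
```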
The library lets us execute a SQL query against our data in Redshift, and pulls the result of the query into R as a data frame. (We’ll explain a bit about data frames shortly.) The query we want to execute, to find out how many unique visitors we’ve received to our site by day, is:
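A query along these lines should work, assuming your events are stored in Snowplow's standard atomic.events table:

```sql
-- Unique visitors (distinct domain_userid cookies) per day
SELECT
  DATE_TRUNC('day', collector_tstamp) AS "date",
  COUNT(DISTINCT domain_userid) AS uniques
FROM atomic.events
GROUP BY 1
ORDER BY 1;
```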
To execute the query in R, first we have to setup a connection to our Redshift database, by executing the following commands:
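A sketch, with placeholders for your own connection details:

```r
# Load the PostgreSQL driver and open a connection to Redshift
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,
                 host     = "<your host>",
                 port     = 5439,            # default Redshift port; yours may differ
                 dbname   = "<your dbname>",
                 user     = "<your username>",
                 password = "<your password>")
```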
(For details about where to find your host, port, dbname and username, see the setup guide.)
Now we can execute the query, and assign it to a variable called
uniquesByDay, by executing the following:
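Something like the following, assuming the uniques-per-day query shown earlier against the standard atomic.events table:

```r
# Run the query against Redshift and pull the result into a data frame
uniquesByDay <- dbGetQuery(con, "
  SELECT
    DATE_TRUNC('day', collector_tstamp) AS \"date\",
    COUNT(DISTINCT domain_userid) AS uniques
  FROM atomic.events
  GROUP BY 1
  ORDER BY 1;")
```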
Note how double inverted commas have to be escaped with a backslash.
We can have a look at the
uniquesByDay variable by simply entering it at the command line:
R responds by dumping the contents of the variable to the screen. We see that the data is in the form of a table with three columns. The first column is the row number, the second column is the
date column, and the third is the
uniques column. We can view just the first few lines of data by entering
head(uniquesByDay) at the prompt. We’ve pasted the contents of the screen, including what R returns, in our case:
We can find out more about our uniquesByDay object, by typing
summary(uniquesByDay) at the prompt:
R tells us that our data contains two columns:
date and
uniques. It tells us what type each column is:
date is currently stored as a character (i.e. text) field, while
uniques is a numeric field, for which it gives us some basic statistical information, so we can get a sense of the range of values. Note, however, that R does not yet realize that the data in the
date column really corresponds to dates. We can update the data type for this column:
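One way to do this is with as.Date():

```r
# Convert the character date column into R's Date type
uniquesByDay$date <- as.Date(uniquesByDay$date)
```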
When we now look at the summary for our data frame, the type of data in the
date column has changed:
We can get further information on the structure of the data frame by executing the
str() function on it:
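That is:

```r
# Show the internal structure of the data frame, including column types
str(uniquesByDay)
```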
This confirms that the type of the first column has been set to a date format.
There are a number of built-in and additional libraries for plotting data visualizations in R. We recommend starting out with the absolutely excellent
ggplot2.
First, install the
ggplot2 package, if you have not done so previously:
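As before:

```r
# Download and install ggplot2 from CRAN
install.packages("ggplot2")
```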
Now that it’s installed, you can load the library:
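That is:

```r
# Load ggplot2 into the current session
library("ggplot2")
```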
Now we can plot the number of uniques over time, using the
qplot (quickplot) command:
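For example (the choice of geom is ours - a line plot suits a daily time series):

```r
# Plot uniques per day as a line chart
qplot(date, uniques, data = uniquesByDay, geom = "line")
```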
The plot appears in a new window:
The previous example was a simple enough start. It didn’t enable us to do anything that could not easily be achieved using any other visualization tool.
In our second example, we’ll pull a more complicated data set from Snowplow, and try out some visualizations in ggplot2 (boxplots and jittered scattergrams) that I think are awesome at conveying distribution information, and that are not well supported by other tools.
For this example, we’re going to look at visit data by a number of different dimensions, and produce different plots to see how different visit characteristics vary with those different dimensions. This is classically something that is well supported by a BI / OLAP tool like Tableau. The example shows that R has those same capabilities, and in addition, a broader set of visualizations to enable us to unpick relationships between metrics and dimensions.
First, we need to design a SQL query to pull the cut of data we want. In this case, we want visit level data. We want to capture the following metrics for each visit:
- Number of distinct pages viewed (i.e. breadth of content engaged with)
- Total number of events (as a measure of depth of engagement)
With the following dimensions:
- Whether or not the user visited our “services” page. (A goal of ours is to drive visitors to buy our services.)
- The landing page the visitor arrived on
- The referer that drove them to our website
- The timestamp the visit started. (So we can analyse e.g. how behavior varies by day of the week.)
- Which country the visitor is located in
- The browser type
- How many times the visitor had visited our website previously
The SQL query to return our required measures by visit and some of the dimensions is straightforward:
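A sketch of such a query, assuming the standard atomic.events table; the output column names (e.g. number_of_events, previous_visits) are our own choices:

```sql
-- One row per visit (user cookie + session index), with measures and dimensions
SELECT
  domain_userid,
  domain_sessionidx,
  COUNT(DISTINCT page_urlpath)    AS distinct_pages_visited,
  COUNT(*)                        AS number_of_events,
  MIN(collector_tstamp)           AS visit_start_ts,
  MAX(geo_country)                AS geo_country,
  MAX(br_family)                  AS browser,
  MAX(domain_sessionidx) - 1      AS previous_visits
FROM atomic.events
GROUP BY 1, 2;
```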
We need to add a line about whether or not one of our services pages was visited as part of the customer journey. We can identify all the visits where the visitor did pop by one of our services pages by executing the following query:
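A sketch, assuming our services pages live under a /services path (adjust the pattern to your own site):

```sql
-- Flag every visit in which at least one services page was viewed
SELECT DISTINCT
  domain_userid,
  domain_sessionidx,
  1 AS visited_services_page
FROM atomic.events
WHERE page_urlpath LIKE '/services%';
```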
And then join the results of the above query to our first query to generate the additional dimension:
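Sketching the join, with visit_level_data and services_visits standing in for the two queries above (e.g. as subqueries or CTEs):

```sql
-- Left join so visits with no services page view get a 0 flag
SELECT
  v.*,
  COALESCE(s.visited_services_page, 0) AS visited_services_page
FROM visit_level_data v
LEFT JOIN services_visits s
  ON  v.domain_userid     = s.domain_userid
  AND v.domain_sessionidx = s.domain_sessionidx;
```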
Lastly, we add the data about who our referer is, and what the URL of the landing page is. Both of these data points can be read from the first line of data for the visit, so we add them by doing an additional join with the output of the previous query, this time on
Now that we have our query, we can pull the data into R:
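Assuming the full SQL statement assembled above is stored in a character variable called query (a name we’ve chosen for illustration):

```r
# Run the assembled visit-level query and pull the results into a data frame
visits <- dbGetQuery(con, query)
```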
The
visits data frame is more complicated than the
uniquesByDay data frame we created last time. Let’s inspect the structure:
This is a convenient moment to take a quick aside from our example and introduce factors in R.
In data analysis, we can distinguish three types of variable:
| Type of variable | Description | Example |
|------------------|-------------|---------|
| Continuous | A variable that can take any value within a range | Height of a person, session length |
| Ordinal | A discrete (not continuous) variable, but one in which there is an order | Sizes of T-shirt (small / medium / large) |
| Nominal | A discrete (not continuous) variable, where there is no inherent order between values | Landing pages |
Continuous variables in R are typically numeric. Ordinal and nominal variables are factors. We can tell R which of the columns in our data frame are factors. This will help R
- Process our data more efficiently
- Plot our data more intelligently (because R will understand the type of variable in each column)
domain_userid, for example, is a factor: it can take one of a finite (but large) number of values - one per cookie set. It is an unordered factor, because there is no hierarchy between values. We can tell R it is a factor by entering:
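For example:

```r
# Convert the user cookie ID column into an (unordered) factor
visits$domain_userid <- factor(visits$domain_userid)
```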
If we inspect the structure of that column, we’ll see it is now a factor:
Let’s update the status of our other dimensions in the same way:
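Along the following lines - the exact column names depend on the aliases used in your SQL query:

```r
# Mark each of the remaining dimensions as a factor
visits$landing_page          <- factor(visits$landing_page)
visits$refr_medium           <- factor(visits$refr_medium)
visits$geo_country           <- factor(visits$geo_country)
visits$browser               <- factor(visits$browser)
visits$visited_services_page <- factor(visits$visited_services_page)
```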
We can view the updated structure of the data frame:
Let’s start by examining the breadth and depth of engagement by visit:
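For example, plotting distinct pages visited against total events (number_of_events being whatever alias your query used for the event count):

```r
# Scatter plot: breadth (x) vs depth (y) of engagement, one point per visit
qplot(distinct_pages_visited, number_of_events, data = visits, geom = "point")
```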
That is kind of interesting: it shows that we have visitors who engage very deeply with only one page on our site (e.g. the visitor at the top left of the scatter), visitors who engage lightly across a broad range of pages (bottom right), and visitors who engage deeply with a lot of pages (top right). However, most of the points are in the bottom left, and they are too densely packed for us to see just how concentrated they are.
We can get a clearer view if we plot a jitter plot: this “jitters” points left and right about where they would otherwise be plotted, making it easier to identify areas with a high concentration of points:
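A sketch - the axis limits are assumptions you should tune to your own data:

```r
# Jittered scatter with 5% opacity, zoomed in on the dense bottom-left region
qplot(distinct_pages_visited, number_of_events, data = visits,
      geom = "jitter", alpha = I(0.05)) +
  scale_x_continuous(limits = c(0, 10)) +
  scale_y_continuous(limits = c(0, 100))
```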
Note: we have changed the plot type from “point” to “jitter” to jitter each point. The “alpha” argument controls point transparency: we have set it to 5% (i.e. 20 points in one place will be required to create a black spot). The
scale_x_continuous and
scale_y_continuous arguments scale the x and y axes so that we zoom in on the part of the graph where most of our visits lie. R has kindly reminded us that some of our points lie outside this range, and so have been removed.
The graph shows us that the majority of visitors do not engage deeply. It actually appears that engagement is likely to be deeper if a visitor visits fewer pages. That is interesting: we can explore it using a boxplot:
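For example:

```r
# Boxplot of depth of engagement at each breadth level;
# wrapping the x variable in factor() makes it discrete
qplot(factor(distinct_pages_visited), number_of_events, data = visits,
      geom = "boxplot")
```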
The boxplot shows that our earlier intuition was wrong: engagement depth increases with breadth. We had thought looking at the jittered plot that a higher proportion of people who only engaged with one page engaged more deeply. Now we can see that this was false: it looked to be the case, because the absolute number of people engaging deeply who only look at one page is higher than the actual number who engage deeply with a broader set. The boxplot controls for the difference in absolute numbers: we are instead looking at the distribution within each breadth level.
One thing to notice is that when producing the boxplot, we plot
factor(distinct_pages_visited) on the x-axis rather than simply
distinct_pages_visited. That is because whereas a scatter or jitter plot can be produced by plotting a continuous variable on the x-axis, a boxplot requires a discrete variable. Applying the
factor function converts our continuous variable into a discrete one; in general, R makes converting between variable types very simple.
The boxplot above was interesting: it showed very clearly a relationship between breadth and depth of engagement by visit. Let’s use the same plot, but this time to compare breadth of engagement by landing page:
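For example:

```r
# Boxplot of breadth of engagement (pages visited) by landing page
qplot(landing_page, distinct_pages_visited, data = visits, geom = "boxplot")
```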
Ouch - this graph is very messy: breadth of engagement does appear to vary by landing page, but there are too many (100+) landing pages to make the analysis straightforward. What we want is to plot the results just for the top 10 landing pages, by number of visits.
To do this, let’s start by identifying the top 10 landing pages by number of visits. This is a straightforward query to write in SQL: luckily a package is available for R called
sqldf that enables you to execute SQL statements against data frames in R directly, and returns the results as a new data frame:
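A sketch - we have called the resulting data frame top10:

```r
library("sqldf")

# Top 10 landing pages by number of visits, computed against the visits data frame
top10 <- sqldf("
  SELECT landing_page, COUNT(*) AS cnt
  FROM visits
  GROUP BY landing_page
  ORDER BY cnt DESC
  LIMIT 10")
```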
We’ve now created a data frame with two columns:
landing_page (with the landing page name) and
cnt (with the count of visits). There are 10 lines of data in the frame, which correspond to the top 10 landing pages by number of visits.
We can now use this to create a subset of the
visit data so we only include visits where the landing page is in our top 10 list:
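Assuming the top-10 data frame created above is called top10, and naming the subset visitsTop10 for illustration:

```r
# Keep only visits whose landing page is one of the top 10
visitsTop10 <- subset(visits, landing_page %in% top10$landing_page)
```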
Before we produce our plot, it would be nice to make
landing_page an ordered factor, where the ordering is determined by the number of visits. This is straightforward to do:
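Assuming top10 is already sorted by visit count, as above:

```r
# Re-level landing_page as an ordered factor, ordered by number of visits
visitsTop10$landing_page <- factor(visitsTop10$landing_page,
                                   levels  = top10$landing_page,
                                   ordered = TRUE)
```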
We can now plot breadth of engagement by landing page, just for the top landing pages:
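For example:

```r
# Breadth of engagement by landing page, top 10 landing pages only
qplot(landing_page, distinct_pages_visited, data = visitsTop10, geom = "boxplot")
```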
We can tidy up the x-axis by aligning the labels on the axis vertically, by adding `+ theme(axis.text.x = element_text(angle=90, vjust=0.4))` to our command. We can also hide the outliers on our plot, to make it clearer, by adding `outlier.shape=NA`:
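Putting both tweaks together:

```r
# Same boxplot, with outliers hidden and x-axis labels rotated vertically
qplot(landing_page, distinct_pages_visited, data = visitsTop10,
      geom = "boxplot", outlier.shape = NA) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.4))
```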
The results are very interesting: visitors who land on the homepage and analytics cookbook are much more likely to engage broadly with content on the site than visitors who land on specific blog posts (e.g. on writing Hive UDFs or dealing with the Hadoop small files problem).
Let’s see if there is a similar pattern in the depth of engagement by landing page. (This is likely: remember we showed in our first boxplot that visitors who engage more broadly are more likely to also engage more deeply.)
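Swapping in our depth measure (number_of_events, or whatever alias your query used):

```r
# Depth of engagement by landing page, top 10 landing pages only
qplot(landing_page, number_of_events, data = visitsTop10,
      geom = "boxplot", outlier.shape = NA) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.4))
```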
The impact of landing page on depth of engagement is not nearly as pronounced as its impact on breadth of engagement. That is likely to be because of both:
- The intent of people landing on specific blog posts is likely to be much more focused on learning specific bits of information (that may not be Snowplow specific), rather than learning about Snowplow in general (which is what someone landing on the homepage or cookbook homepage is likely to be interested in)
- The design of the homepage and analytics cookbook homepage encourages users to explore beyond the page much more forcefully than the links that surround individual blog posts
Let’s now have a look and see if breadth of engagement changes depending on the referer type. To do this, we simply swap out
landing_page as our x-axis dimension (first argument in
qplot) and swap in
refr_medium. Also note that we have switched back to using our original data frame
visits, rather than the subset we created for the landing page analysis:
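For example, keeping breadth of engagement on the y-axis:

```r
# Breadth of engagement by referer medium, across all visits
qplot(refr_medium, distinct_pages_visited, data = visits, geom = "boxplot")
```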
Visitors from email and search tend to engage more deeply with the Snowplow content. Are they more likely to buy our services, though?
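One way to sketch this is a 100% stacked bar chart, filled by the services-page flag we created earlier (visited_services_page is the column name we assumed above):

```r
# Share of visits that reached the services pages, by referer medium
ggplot(visits, aes(x = refr_medium, fill = visited_services_page)) +
  geom_bar(position = "fill")
```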
position="fill" makes each bar on our bar graph extend to 100%.
The chart is interesting - it suggests that even though e.g. direct visitors engage less deeply, overall, they are more likely to visit our services pages (and hence convert) than visitors from search. The relationship between level of engagement and conversion isn’t straightforward - we can explore it by comparing our engagement scatter plots (breadth vs depth) for visitors that did visit the services page, with those that do not:
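A sketch using qplot's facets argument:

```r
# Side-by-side jittered scatters: visits that reached the services pages vs not
qplot(distinct_pages_visited, number_of_events, data = visits,
      geom = "jitter", alpha = I(0.05),
      facets = . ~ visited_services_page)
```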
We used the
facet argument above to ask R to plot two versions of the scatter plot, one for visits where the services pages were visited, and one for visits where they were not, to see if the overall pattern of engagement for the two groups is different. The results are interesting: they suggest that in both cases, there is a strong relationship between breadth and depth of engagement (the two are highly correlated). However, visitors who visited the services pages are more likely to engage both more deeply and more broadly, overall, than those who did not, as evidenced by the lower fraction of points at the bottom right of the graph. (Further visualization work is required to specify this more precisely.)
This tutorial barely scratches the surface of what is possible with R. The purpose of it is to give enough familiarity with basic R functions around fetching data, managing data via data frames, and plotting data, that the reader will be able to get up and running with R on top of Snowplow data.
There is a wealth of material on R available. The following are some recommended resources:
- R in Action. An excellent and thorough overview of R, with lots of straightforward-to-follow examples
- Hadley Wickham’s excellent ggplot2 book. Written by the author of
ggplot2, this covers the actual graphics grammar implemented by this excellent library
- R-bloggers. Regular updates, recipes and R-related news
We will be building out the number of recipes in the Snowplow Analytics cookbook that use R. At the moment, we have:
- Performing market basket analysis with R
- More to come…