This guide is geared towards data analysts who have limited or no experience with R. It aims to get the reader started with R through concrete examples with Snowplow data - so that by the end of the guide, the reader can pull Snowplow data into R, play around with it and plot it. That should leave the reader comfortable enough to use other resources to build out her knowledge of R, and give her enough knowledge to start using R to perform productive analytics on Snowplow data.

In the first section of this guide, we outline why R is such an excellent tool to use. Next (sections 2 to 4), we work through how to pull a simple data set from Snowplow and visualize it in R. In sections 5 to 7, we work through a more complicated set of examples: one in which we pull a more involved data set from Snowplow, and visualize it in a number of ways that are difficult with traditional BI / OLAP tools in Excel.

R is a fantastic analytics tool. To date, however, only a minority of web analysts use it. To list just some of what makes it so excellent, especially when compared with traditional general purpose tools like Excel, and BI tools like Tableau:

• It is great for producing a wide variety of visualizations - including a much wider than those supported by BI tools
• It has fantastic tools for extending existing visualizations and creating new visualizations all together
• It is easy to document analytics performed with R and retrace your steps. (Simply by copying and pasting your steps into the R console.) This makes R a much safer and more robust environment to interrogate data than e.g. Excel or Tableau, where if you realize you made an error eight steps back, retracing your steps can be difficult and time consuming.
• It is blindingly fast to both perform complicated analytics and generate beautiful visualizations. The syntax is incredibly concise - what you can achieve in one line of R can take hours of working through menus and options on other systems.
• It helps you think through what you do with your data in a more rigorous way. R forces you to define your data types much more specifically than either Excel or BI tools like Tableau. That rigour is helpful - it means you do things faster, and you’re less liable to make mistakes.
• It is great at statistics. Traditional BI tools and Excel suck at statistics. Sure you can calculate means, medians, quartiles etc. But actually pulling these together into meaningful distribution plots, or plotting how they change over time, is a pain in these tools, which are much better at counting and summing data. R is simply much better at statistics
• It is a great tool for modeling data: it is straightforward to create models in R, and compare those models to your actual data sets, either formally through calculation, or by plotting you model(s) against your data(sets)
• It has an enormous library of packages available for performing just about any type of analytics imaginable
• It is free
• It works on all platforms

In spite of all the benefits listed above, people often find struggle with R to start off with. The command line is a difficult place to start, and R has some idiosyncracies that need to be understood.

In this guide, we cover them by working through practical examples with Snowplow data, paying special attention to covering some of those key concepts and idiosyncracies.

If you have not installed R yet, instructions on doing so can be found here.

This guide assumes you have installed R, and installed the RPostgreSQL package required to get R to talk to PostgreSQL databases including Amazon Redshift. If you have not, installed this, you can do so by executing the following at the R prompt:

R will ask you to select a mirror to download the package from. Select a local mirror and the package should install automatically.

Now we need to tell R to use the package:

The library lets us execute a SQL query against our data in Redshift, and pulls the result of the query into R as a data frame. (We’ll explain a bit about data frames shortly.) The query we want to execute, to find out how many unique visitors we’ve received to our site by day, is:

To execute the query in R, first we have to setup a connection to our Redshift database, by executing the following commands:

Now we can execute the query, and assign it to a variable called uniquesByDay, by executing the following:

Note how double inverted commas have to be escaped with a backslash.

We can have a look at the uniquesByDay variable by simply entering it at the command line:

R responds by dumping the contents of the variable to the screen. We see that the data is in the form of a table with three columns. The first column is the row number, the second column is the date column, and the third is the uniques column. We can view just the top 5 lines of data by entering head(uniquesByDay) at the prompt. We’ve pasted the contents of the screen, including what R returns, in our case:

We can find out more about our uniquesByDay object, by typing summary(uniquesByDay) at the prompt:

R tells us that our data contains two columns: date and uniques. It tells us what type each column is: date is of type character, and uniques is a numeric field, for which it gives us some basic statistical information, so we can get a sense of the range of values.

Our date column really corresponds to date. We can update the data type for this column:

When we now look at the summary for our data frame, the type of data in the date column has changed:

We can get further information on the structure of the data frame by executing the str() function on it:

This confirms that the type of the first column has been set to a date format.

There are a number of built in and additional libraries for plotting data visualizations in R. We recommend starting out with the absolutely excellent ggplot2.

First, install the ggplot2 package, if you have not done so previously:

Now that it’s installed, you can load the library:

Now we can plot the number of uniques over time, using the qplot (quickplot) command:

The plot appears in a new window:

The previous example a was simple enough start. It didn’t enable us to do anything that could not easily be achieved using any other visualization tool.

In our second example, we’ll pull a more complicated data set from Snowplow, and try out some visualizations in GGPlot2 (boxplot and jittered scattergrams) that I think are awesome at conveying distribution information, and are not well supported by other tools.

For this example, we’re going to look at visit data by a number of different dimensions, and produce different plots to see how different visit characteristics vary with those different dimensions. This is classically something that is well supported by a BI / OLAP tool like Tableau. The example shows that R has those same capabilities, and in addition, a broader set of visualizations to enable us to unpick relationships between metrics and dimensions.

First, we need to design a SQL query to pull the cut of data we want. In this case, we want visit level data. We want to capture the following metrics for each visit:

• Number of distinct pages viewed (i.e. breadth of content engaged with)
• Total number of events (as a measure of depth engagement)

With the following dimensions:

• Whether or not the user visited our “services” page. (A goal of ours is to drive visitors to buy our services.)
• The landing page the visitor arrived on
• The referer that drove them to our website
• The timestamp the visit started. (So we can analyse e.g. how behavior varies by day of the week.)
• Which country the visitor is located in
• The browser type
• How many times the visitor had visited our website previously

The SQL query to return our required measures by visit and some of the dimensions is straightforward:

We need to add a line about whether or not one of our services pages was visited as part of the customer journey. We can identify all the visits where the visitor did pop by one of our services pages by executing the following query:

And then join the results of the above query to our first query to generate the additional dimension:

Lastly we add the data about who are referer is, and what the URL of the landing page is. Both of these data points can be read from the first line of data for the visit, so we add them by doing an additional join with the output of the previous query, this time on domain_userid, domain_sessionidx and min(collector_tstamp):

Now that we have our query, we can pull the data into R:

Our visits data frame is more complicated than the uniquesByDay data frame we created last time. Let’s inspect the structure:

This is a convenient moment to take a quick aside from our example and introduce factors in R.

In data analysis, we can distinguish three types of variable:

Type of variable Description Example
Continuous A variable that can take any value with a range Height of a person, session length
Ordinal A discrete (not continuous) variable, but one in which there is an order Sizes of T-shirt (small / medium / large)
Nominal A discrete (not continuous) variable, where there is no inherent order between values Landing pages

Continuous variables in R are typically numeric. Ordinal and nominal variables are factors. We can tell R which of the columns in our data frame are factors. This will help R

1. Process our data more efficiently
2. Plot our data more intelligently (because R will understand the type of variable in each column)

domain_userid, for example, is a factor: it can take one of a finite (but large) number of values of cookies set. It is an unordered factor, because there is no hierarchy between values. We can tell R it is factor by entering:

If we inspect the structure of that column, we’ll see it is now a factor:

Let’s update the status of our other dimensions in the same way:

We can view the updated structure of the data frame:

Let’s start by examining the breadth and depth of engagement by visit:

That is kind of interesting: it shows that we have visitors who engage very deeply with only one page on our site (e.g. the visitor at the top left of the scatter) and we have visitors who engage lightly across a broad range of pages (bottom right) and deeply with a lot of pages (top right). However, most of the points are in the bottom left, and they are too closely spaced to see how closely packed they are.

We can get a clearer view if we plot a jitter plot: this “jitters” points left and right about where they would otherwise be plotted, making it easier to identify areas with a high concentration of points:

Note: we have changed the plot type from “point” to “jitter” to jitter each point. The “alpha” argument is responsible for point transparency: we have set it to 5% (i.e. 20 points in one place will be required to create a black spot). The scale_y_continuous and scale_x_continuous arguments scale the x and y axes so that we zoom in on the point of the graph where most of our visits lie. R has kindly reminded us that some of our points lie outside this range, and so have been been removed.

The graph shows us that the majority of visitors do not engage deeply. It actually appears that engagement is likely to be deeper if a visitor visits fewer pages. That is interesting: we can explore if using a boxplot:

The boxplot shows that our earlier intuition was wrong: engagement depth increases with breadth. We had thought looking at the jittered plot that a higher proportion of people who only engaged with one page engaged more deeply. Now we can see that this was false: it looked to be the case, because the absolute number of people engaging deeply who only look at one page is higher than the actual number who engage deeply with a broader set. The boxplot controls for the difference in absolute numbers: we are instead looking at the distribution within each breadth level.

One thing to notice is that when producing the boxplot, we plot factor(distinct_pages_visited) on the x-axis rather than simply distinct_pages_visited. That is because whereas a scatter or jitter plot can be produced by plotting a continuous variable on the x-axis, a boxplot requires a discrete variable. Applying the factor function converts our continuous variable into a discrete variable very simply: in general, R makes converting one type of variable into another very simple.

The boxplot above was interesting: it showed very clearly a relationship between breadth and depth of engagement by visit. Let’s use the same plot, but this time to compare breadth of engagement by landing page:

Ouch - this graph is very messy: breadth of engagement does appear to vary by landing page, but there are too many (100+) landing pages to make the analysis straight forward. What we want is to plot the results just for the top 10 landing pages, by number of visits.

To do this, let’s start by identifying the top 10 landing pages by number of visits. This is a straightforward query to write in SQL: luckily a package is available for R called sqldf that enables you to to execute SQL statements against data frames in R directly, and returns the results as a new data frame:

We’ve now created a data frame with two columns: landing_page (with the landing page name) and cnt (with the count of visits). There are 10 lines of data in the frame, which corresopnd to the top 10 landing pages by number of visits.

We can now use this to create a subset of the visit data so we only include visits where the landing page is in our top 10 list:

Before we produce our plot, it would be nice to make landing_page an ordered factor, where the ordering is determined by the number of visits. This is straightforward to do:

We can now plot breadth of engagement by landing page, just for the top landing page:

We can tidy up the x-axis by aligning the labels on the axis vertically, by adding  + theme(axis.text.x = element_text(angle=90, vjust=0.4)) to our command. We can also hide the outliers on our plot, to make it clear, by adding outlier.shape=NA:

The results are very interesting: visitors who land on the homepage and analytics cookbook are much more likely to engage more broadly with content on the site, rather than visitors who land on the specific blog posts (e.g. on writing Hive UDFs or dealing with Hadoop small files problem.)

Let’s see if there is a similar pattern in the depth of engagement by landing page. (This is likely: remember we showed in our first boxplot that visitors who engage more broadly are more likely to also engage more deeply.)

The impact of landing page on depth of engagement is not nearly as pronounced as that between landing page and depth of engagement. That is likely to be because of both:

• The intent of people landing on specific blog posts is likely to be much more focused on learning specific bits of information (that may not be Snowplow specific), rather than learn about Snowplow in general. (Which someone landing on the hoempage or cookbook homepage are likely to be interested in.)
• The design of the homepage and analytics cookbook homepage encourage users to explore beyond the page much more forcefully than the links that surround individual blog posts.

Let’s now have a look and see if breadth of engagement changes depending on the referer type. To do this, we simply swap out landing_page as our x-axis dimension (first argument in qplot) and swap in refr_medium. Also note that we have switched back to using our original data frame visits, rather than the subset we created for the landing page analysis:

Visitors from email and search tend to engage more deeply with the Snowplow content. Are they more likely to buy our services, though?

Note that position="fill" makes each bar on our bar graph extend 100%.

The chart is interesting - it suggests that even though e.g. direct visitors engage less deeply, overall, they are more likely to visit our services pages (and hence convert) than visitors from search. The relationship between level of engagement and conversion isn’t straightforward - we can explore it by comparing our engagement scatter plots (breadth vs depth) for visitors that did visit the services page, with those that do not:

We used the facet argument above to ask R to plot us two version of the scatter plot, one for visits where the services pages were visted, and one for visits where they were not, to see if the overall pattern of engagement for the two groups is different. The results are interesting: they suggest that in both cases, there is a strong relationship between breadth and depth of engagement. (These are highly correlated.) However, visitors that visited the services pages are more likely to engage both more deeply and more broadly, overall, than those who do not, as evidenced from the lower fraction of points at the bottom right of the graph. (Further visualization work is required to specify this more precisely.)

• Hacley Wickham’s excellent ggplot2 book. Written by the author of ggplot2, this covers the actual graphics grammar implemented by this excellent library