# Converting Snowplow data into a format suitable for OLAP reporting tools e.g. Tableau, Qlikview, Pentaho, Microstrategy

Snowplow data is stored in a log file format, where each line of data represents one “event” on a particular user journey. This data structure is unsuitable for traditional BI / analysis tools like Tableau, Qlikview, Pentaho or Microstrategy which require that the data be structured in a format suitable for OLAP analysis. (Sometimes also called pivot tables.) In this section of the cookbook, we describe how to restructure Snowplow event data into a format suitable for OLAP analysis, so that it can be interrogated using tools like Tableau and Qlikview.

Although this guide is written specifically for Snowplow data, the basic approach to converting log format data into a structure suitable for OLAP data should work for other log or event data sets as well.

In this recipe, we give an overview of OLAP reporting tools before walking through the steps necessary to convert Snowplow log file format data into a format suitable for data analysis. (So that it can be interrogated by tools like Tableau, Qlikview, Microstrategy etc.)

OLAP tools like Tableau, Qlikview, Pentaho and Microstrategy are very popular amongst data analysts because they

1. Make it easy to slice data along different dimensions, exploring different relationships over time, and answering a wealth of different business questions
2. They do not require much technical knowledge to use. (E.g. no need to know SQL or Python).
3. They are well suited to “train of thought analysis” i.e. moving quickly between one view of the data and another, as insights derived from the first immediately lead to questions that are answered by the second.

OLAP tools are especially well suited for Snowplow web analytics data:

1. There are a wide range of dimensions we might want to slice and dice web analytics data by, including time, user, visit, geography, device, browser, type of web page, web page, content and/or product, acquisition channel…
2. There are a wide variety of metrics we might want to compare between dimension combinations e.g. unique users, visits, page views, events, purchases, conversion rates, revenue…
3. Web analysts are generally very familiar with OLAP analysis / pivot tables: Google Analytics custom reports enables analysts to select metrics and dimensions like a primitive (and incredibly slow) OLAP cube, for example.

Back to [top] (#top).

#### 2.1 OLAP overview

OLAP is an approach for analysing multi-dimensional data. OLAP stands for “online analytics processing”, but it in fact relates to something much more tightly defined in data analytics: the treating of multidimensional data as a cube.

An OLAP cube is a multi-dimensional array of data. Data points are made up of one or more metrics. (In our cases, uniques, visits, page views, transactions, revenue, number of content items consumed etc.) Data can be viewed by a range of different dimensions. (In our case, examples include time of day, day in the week, time of the year, year, customer cohort, type of device, type of browser etc.) An OLAP reporting tool makes it easy for analysts to view the metrics they want, sliced by the particular dimensions they’re interested in. So, for example, if an analyst wanted to see if conversion rates had been improving over time, they might slice the conversion rates metric by the time dimension (e.g. by month), to view if there had been an improvement. If there had been an improvement, they might then drill down to see if that improvement had been across the board: was it true of all customer segments, across all device types etc.?

When we say OLAP cube, then, we visualise a “cube” of data points (i.e. metrics) at the intersection of multiple dimensions. (Three in the case of a cube, but more often there are more dimensions. Technically we should talk in terms of a “hyper-cube”, but it doens’t really matter.)

#### 2.2.1 Slicing data

We “slice” data when we pick two dimensions to view a particular metric by. The analogy is to take a “slice” through an OLAP cube to produce a 2D data set.

#### 2.2.2. Dice data

Rather than slice data into two dimensions, we might want to create a subcube with more than 2 dimensions. This operation is called “dice”.

#### 2.2.3. Drill up and drill down

OLAP dimensions are often organised in a hierarchy. To gives some examples:

Year -> Quarter -> Month -> Day -> Hour -> Minute -> Second

Organic referrers -> Google -> specific keywords

Browser -> browser version

Category -> Subcategory -> Product


Drilling down refers to moving down the dimension hierarchy - e.g. from viewing sales by month to by week, and then by day. In each case, the level the metrics are being aggregated drops, so the actual numbers reported fall.

In contrast, drilling up means moving up the dimension hierarchy - e.g. form viewing sales by month to by year. In this case, you’re aggregating bigger data sets (a whole year’s worth of data), so the the actual numbers get larger.

#### 2.2.4 Pivot

Pivoting means swaping one dimension for another. Typically this is used in two situations:

1. With an initial data set, to spot if there are interesting relationships between particular dimensions (e.g. product mix by city or conversion rate by time of day).
2. To better understand a relationship once one has been spotted. (Does product mix only vary by city? Or is it actually that user segments vary by city, and it is user segment that is predictive of product mix?)

Back to [top] (#top).

In traditional OLAP parlance, at the heart of OLAP data is a “fact table”. In Snowplow paralance, that fact table is really an “events” table. There are two key requirements to fulfil to ensure that events data is structured in a format suitable for processing by an OLAP tool like Tableau:

1. The granularity of the data is sufficient to support the sufficient “drill down”. So, for example, if you want to be able to drill down to an individual user level, for example, your data set has to have separate entries for each user, each differentiated by the user_id field. Any aggregation of users (e.g. cohorts) must happen in the OLAP tool e.g. Tableau, not in the raw data fed into Tableau
2. All the dimensional information associated with each event (or “fact”) that you want to slice / dice on must be present in the line of event data.

Of the two requirements outlined above, meeting the first using Snowplow data is easy, because Snowplow data is already stored at the most granular level. (I.e. at least one line of data per “event”.) Meeting the second is a bit more nuanced. We discuss both below:

#### 3.1 Getting the granularity of data right

The more granular our data, the more reporting flexibility we have. One of the most frequently cited complaints about Google Analytics, for example, is that the data Google provides isn’t sufficiently granular, so you cannot, for example, drill down to explore the way an individual user has engaged with your site.

With OLAP analysis, increased granularity doesn’t just support better drill down facilities. It also enables more pivotting possibilities - as you can slice and dice more combinations of metrics against one another.

So granularity is good. The good news is that Snowplow data is very granular: at least one line of data per event. If we wanted (and it’s perfectly legitimate to), we could feed Snowplow data into our OLAP reporting tool without aggregating it at all. However, there is a cost associated with this level of granularity: it means that the data volumes are greater, and so it is likely that the reporting tool will work more slowly. This used to be a much more important consideration (when RAM wasn’t so cheap, and before columnar databases like Infobright, in-memory databases and SSD drives). However, it is still a reasonable consideration today.

One example of an approach to aggregating Snowplow data: we could aggregate it at the level of the user, session and event. (Be it a particular page load, product add to basket etc.) In this case, if we had a user who had visited a particular page 3x in one session, we would only have one line of data representing those three page views. (As opposed to having three in our original Snowplow events data set.) This would reduce our volume of data somewhat (likely by a factor of 0.1 - 0.25), but still give us a lot of reporting flexibility. (We’d be able to drill down to the user and action level.) We would not, however, be able to perform any path analysis. (I.e. look at the sequence of events in a particular user session.) In any case, this type of analysis isn’t well supported by OLAP tools like Tableau.

#### 3.2 Ensure all dimensions information associated with each event is in every line of data

There are a number of dimensions that are already available on every line of Snowplow data. For example, the browser and operating system fields are populated in every line of Snowplow data.

However, there are other dimensions that are not available on every line of Snowplow data. These dimensions need to be inferred by looking at user behavior across multiple lines of data. To give two examples:

Say we are interested in bucketing users into cohorts based on the first time they performed a particular action e.g. “signup”. Maybe we bucket users by month. In order to decide which bucket a user belongs, we need to scan all their records to identify the line of data generated when they first signed up, and then read the dt field on that line and use that to determine the bucket. This can be accomplished, for example, using the following SQL:

When we feed our data into an OLAP reporting tool like Tableau, we need a column in it called “cohort”, and for each event for each user, that column needs to be populated with the results from the above query.

To take another example: say we want to bucket users again, but this time by the channel that drove them to our website in the first place. Again, to work this out, we need to look at the first event for each of those users, and see what drove him / her there. Our SQL query would look like:

In our modified data set for our OLAP reporting tool, we’d want a column for this data (maybe called “first referrer”) that was populated with the first_referrer field generated by the above query.

In both the above cases, we take data that we deduce about a dimension from scanning multiple rows of events data (in each case, about the “user” dimension), and add it into an additional column to our “event” line of data in or OLAP data set. This means that it is straightforward for the OLAP tool to identify that this event happened for a user who belongs in this particular cohort, and so when aggregating metrics for that cohort, the metric on this line of data should be included. Tools like Tableau will not scan multiple lines of data to infer the dimension value for a particular line of data computation needs to happen to the data set before it is imported in.

Back to [top] (#top).

OLAP has a long history (which means it is an old technology). In the literatue on OLAP, data structures are typically described in terms of star schemas: a denormalized data structure with a central “fact table” of the actual events that occured, linked to dimensions table (by lookups) that shed light on each individual action.

Image taken from Wikipedia

Columar databases like Infobright mean organising data formally in a star schema is no longer necessary: we still store denormalized data, but effectively collapse all the dimension tables into the fact table, to create a single “fat” fact table with all the relevant dimensions:

Columnar databases make this possible because they make “columns cheap”: query times are a function of the number of columns in the query - any that are unused are ignored. Having all the columns in a single table makes querying significantly easier: it means no joins are necessary. It also means tools like Tableau can run directly on our data table, without having to peform any joins themselves. (Which is generally a time consuming operation.)

Back to [top] (#top).

Converting Snowplow event log data into a dimensional OLAP struture is a four step process

1. Define the structure of our OLAP data table, including both metric and dimension columns
2. Work out what level of granularity to use i.e. what each line of data reprents. (One event or a higher level of aggregation?)
3. Define our table
4. Work out how to compute each line of data
5. Generate the OLAP data

### 5.1.1 Metrics

Let us start with the metrics we want to include, and make sure we have all the basics we’d expect for any website:

• Uniques
• Visits
• Page views

There are a large number of other metrics we could include, depending on the type of website / web app running Snowplow. For example, an ecommerce site might want to log:

• Number of purchases (transactions)
• Number of products sold
• Value of sales (revenue)

A content site might want to include:

• Number of articles consumed
• Time spent consuming articles
• Number of articles shared
• Number of articles liked
• Number of articles tweeted
• Number of articles commented on
• Number of videos watched
• Minutes spent streaming media
• Number of videos completed watching
• Number of subscription signups
• Subscription spend
• Number of ad units shown
• Number of ad untis engaged with

These are just basic metrics. Later on (once the cube has been created), it should be possible to define more sophisticated metrics e.g.

• Breadth of engagement. (Which we might define as the number of articles read)
• Deptch of engagement. (Which we might define based on the time spent reading articles, or decide that a more engaged user is one who has liked / commented / shared an article.)

A social network might want to include:

• Number of connections initiated
• Number of connections accepted
• Number of posts
• Number of direct messages
• Number of X way conversations
• Number of shares

To keep this tutorial a manageable length, we’ll only include the first three metrics (uniques, visits and page views) in our example. However, extending our OLAP data table to include many more metrics is possible (and recommended): one of the explicit aims of Snowplow, after all, is to enable you to report the metrics that matter to your business, however particular those metrics are. (For a video of an example cube with many more dimensions, see our blog post on analysing Snowplow data with Tableau.)

### 5.1.2 Dimensions

When deciding which dimensions to include, we need to think about all the entities that matter to our business. Some common entities that all website / webapp owners should be interested in:

• Users (or visitors)
• Visits (or sessions)
• Pages

Depending on the type of website / webapp we run, however, there are many more we might want to include. For example, an online retailer would be interested in :

• Products
• Orders
• Stage in customer funnel (e.g. window shopping vs add to basket vs checkout vs purchase)

A content site would be interested in:

• Articles
• Authors
• Topics

A social network would be interested in:

• Friends
• Groups

Again, to keep this tutorial manageable, we’ll concentrate on just the first three entities identified i.e. users, visits, pages. However, we do recommend including all the entities in your OLAP data structure that matter to you business. Now, for each entity, we should list what data points we want to capture. Some suggestions:

User

• user_id
• cohort (where maybe we define cohort as the month-year in which he / she started first visited our website)
• The traffic source for the user based on his / her first visit i.e. mkt_source, mkt_medium, mkt_term, mkt_content, mkt_campaign for the first visit and page_referrer for the very first page view of that first visit

Visit

• visit_id
• Visit date i.e. dt
• Visit time e.g. time of initial event or page view
• Traffic source for teh visit i.e. mkt_source, mkt_medium, mkt_term, mkt_content, mkt_campaign for the first visit and page_referrer for the very first page view of that visit

Page

• page_url
• page_title

We should recognise that there is a hierarchy in the three entities we’ve identified: users may visit our website once or more, and on each visit view one or more pages. This will be important when we generate our OLAP data.

Remember - we’re only including a handful of dimensions to keep this tutorial manageable. We recommend including as many dimensions as possible, to support as wide a range of OLAP analyses as possible.

### 5.2 Work out what level of granularity to use i.e. what each line of data reprents. (One event or a higher level of aggregation?)

We discussed the merits and costs or keeping more granular data in your OLAP cube. For the sakes of this example, we’re going aggregate data by page by visit. That means we will be able to look at the number of page views and number of unique page views by visit, but not analyse e.g. for users who look at a particular web page twice in one session e.g. how many page views they viewed in between the two of the same page.

### 5.3 Define our table

To return to our example, we are now in a position to define the structure of the OLAP data as it would look in Infobright:

Note: only one of our three metrics is defined in our table: page_views. We do not have a column for uniques or visits. The reason is that both of these can be derived from the data above: we can work out how many uniques on a particular day by counting the number of distinct user_ids. Similarly, we can calculate the number of visits based on the number of distinct CONCAT(user_id, "-", visit_id) in a time period.

If we were aggregating our data to a the visit level (so we have one line of data for each visit), we could have an additional metric column called visits and set the value to 1 for each line of data. (Because each line of data would represent one visit.) Then summing this value over e.g. time would have given us a visit count. However, in our case, where each line of data represents one page view, there is no way to sum across page views to get visits. (Because we don’t know in advance how many page views in a visit.) Instead, we count distinct effective visit identifiers, where our visit identifier is created by concatenating the user_id with the visit_id.

Similarly, if we were to aggregate our data to the user level (so we have one line of data for each user), we could have a column for both the uniques and visits metrics, where the uniques column would be set to 1 on every data line and the value of visits would vary depending on the number of visits by user. However, we would not be able to perform any page level analysis with our data set.

### 5.4 Work out how to compute each line of data

There are a number of tools we can use to generate out OLAP data set from our Snowplow data set. In this example, we’re going to assume our events data is in Infobright, and we’ll generate our OLAP data set using just SQL. Note: the same SQL should work equally well to generate an output data set in Hive. However, we would not want our OLAP data to live in Hive, as querying it would be much slower than if that data lived in a columnar database like Infobright. Train-of-thought analysis (which OLAP reporting tools excel at) is only possible if querying is fast, and Hive/ map reduce is not fast.

We’ve decided that we want a line of data for every web page on every visit. We can generate a line of data at this granularity using the following query:

So far so good, we now have a number of the desired columns in our OLAP data set. We’re still missing a number of dimensions, however, that relate to the visit (i.e. dt, tm, and source of traffic) and user (i.e. cohort and source of traffic).

Let us first generate the missing visit data. We want the date, time and traffic source data fields that are stored in the first line of event data for each visit. We can pull that data from our Snowplow events table by executing the following query:

And then join it with our original data set to populate the missing visit dimension fields:

Now we only need to add the missing user dimension fields i.e. cohort and traffic source fields. Both can be derived from the first event that each user takes, ever, using the following query:

Now need to join these additional data fields to our data set:

The above query will return Snowplow data in a format suitable for processing in an OLAP reporting tool like Tableau.

### 5.5 Generate the OLAP data

Generating the data required is as simple as running the above query. However, using SQL statements to generate versions of the Snowplow events data to use in OLAP reporting tools is not computationally efficient: in particular, it requires multiple passes through the entire data set and several joins between subtables derived from the events table, all of which are pretty costly.

For that reason, we recommend using SQL to generate agile data cubes: to quickly generate data for querying in Tableau / Qlikview / Pentaho with a view to experimenting with new dimensions and metrics, for example. If it turns out that a particular data cube is of value to you, and you want to continue to use it (and keep it updated), we’d recommend using Cascading in a Hadoop environment to keep you OLAP data set up-to-date. We will document how to do this in due course.

If you’re running Infobright Community Edition, you wont be able to write the output of the query directly into a new table because INSERT statements are not supported. Instead, we’ll write our results to file, and then import them back into Infobright.

To write the results to file, execute the above query, but add INTO OUTFILE  at the end of the statement.

Now create a new table to house the data, using the [table definition given above] (#table-def).

Now we can load our saved data into our new table

We’re ready to connect our OLAP reporting tool of choice (e.g. Tableau, Qlikview) to our data set in Infobright!

Back to [top] (#top).

Connecting Tableau desktop edition to Infobright is a bit fiddly: in most cases, your data will be on a remote Infobright instance and you’ll want to establish an SSH connection between the PC running Tableau and that instance. Instructions on how to do this is beyond the scope of this tutorial, but a decent explanation of how to setup a tunnel using Putty can be found here.

Once you have setup your tunnel, launch Tableau and select connect to data. Select MySQL from the list of options:

In our case I’ve setup Putty to forward requests to localhost:1234 to my remote Infobright instance. The server name and port number you’ll need to enter will be determined by the nature of the tunnel you’ve established.

You’ll be given an option as to how much of the data you want imported into Tableau’s fast data engine. If your data set isn’t too large, you can try importing all of it. Otherwise, select “import some data” or “Connect live””. Your workbook should launch:

Tableau’s made some guesses as to which one of our columns are dimensions and which are metrics. If the field is numeric, Tableau assumes it is a metric. If it’s not, it assumes it’s a dimension. That works for most of our fields, but not for visit_id (which is a dimension). Simply drag visit_id from the Measures pane to the Dimensions pane:

Now we need to add our two missing metrics: uniques and visits. To add uniques, right click on user_id (in the dimensions pane) and select Create calculated field. Tableau gives us the opportunity to create a new, calcultaed field. Fill in the name and formula as below:

Click OK: Uniques will now be listed as a Measure. Now we need to create a new calculated field for Visits:

Now we’re all set! We can drag and drop our fields and metrics to our heart’s content, to explore our data set. To see a video demonstration of how to do this (including with a cube with more dimensions and metrics than in this tutorial) see our blog post on analysing Snowplow data with Tableau.

Back to [top] (#top).

TO WRITE