12 January 2017  •  Inside the Plow  •  Diogo Pacheco

Looking back at 2016


With the start of 2017, we have decided to look back at the 2016 posts on our blog and our community Discourse that generated the most engagement with our users.

More than ten thousand unique users spent a total of 548 hours reading our blog posts, whilst on Discourse (which we only launched this year) 8,700 unique users spent 424 hours reading and participating in the Snowplow community.

Let’s take a closer look at:

  1. Top 10 blog posts published in 2016
  2. Top 10 Discourse threads published in 2016

1. Top 10 blog posts published in 2016

Let’s start by looking into our top 10 blog posts by number of unique users.

| Rank | Blog post | Unique users | Time (min) |
|------|-----------|--------------|------------|
| 1 | An introduction to event data modeling | 1504 | 5220 |
| 2 | Introducing Snowplow Mini | 1156 | 3141 |
| 3 | Introducing Factotum data pipeline runner | 1072 | 1729 |
| 4 | We need to talk about bad data | 891 | 2330 |
| 5 | Ad impression and click tracking with Snowplow | 791 | 2013 |
| 6 | Introducing Sauna, a decisioning and response platform | 761 | 1930 |
| 7 | Snowplow JavaScript Tracker 2.6.0 released with Optimizely and Augur integration | 619 | 1485 |
| 8 | Building first and last touch attribution models in Redshift SQL | 511 | 1686 |
| 9 | Debugging bad data in Elasticsearch and Kibana - a guide | 460 | 776 |
| 10 | Web and mobile data only gets you to first base when building a single customer view | 341 | 921 |

While this ranking already gives us some insight into what type of content drove the most engagement, let’s plot the number of uniques against the average engagement time per unique user for each post, to compare posts not only by how many people each attracted but by how long each of those people spent reading the content (at least on average).
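To give a flavour of where these numbers come from, here is a minimal sketch of the kind of query involved, run against Snowplow data in Redshift. It assumes the standard atomic.events table, page pings configured to fire every 10 seconds (so each ping stands for roughly 10 seconds of engaged time) and blog posts living under a /blog/ path - assumptions for illustration, not our exact query.

```sql
-- Sketch: unique readers and average engaged time per blog post
SELECT
  page_urlpath,
  COUNT(DISTINCT domain_userid) AS unique_users,
  -- each page ping represents roughly 10 seconds of engagement
  SUM(CASE WHEN event = 'page_ping' THEN 10 ELSE 0 END) / 60.0 AS engaged_minutes,
  SUM(CASE WHEN event = 'page_ping' THEN 10 ELSE 0 END) / 60.0
    / COUNT(DISTINCT domain_userid) AS avg_minutes_per_user
FROM atomic.events
WHERE page_urlpath LIKE '/blog/%'
  AND derived_tstamp BETWEEN '2016-01-01' AND '2017-01-01'
GROUP BY 1
ORDER BY unique_users DESC
LIMIT 10;
```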

[Figure: Number of unique users per average time spent]

The blog post, An introduction to event data modeling, stands out as the post that not only attracted the largest number of readers but also kept them reading longer than any of the other posts in the top 10. Event data modeling is a hot topic: one we’ve done a lot of thinking about at Snowplow over the last 18 months. This was the first post where we started to sketch out an overall approach and highlight some of the key challenges of event data modeling, and it’s great to see that the community at large engaged with us. We’ve certainly had a lot of interesting conversations off the back of that blog post, and of the presentations and other posts and threads on this topic.

It’s therefore also great to see that the second post by average engaged time per user was another event data modeling post - this time on building first and last touch attribution models in Redshift SQL.

Snowplow Mini was a surprise hit for us in 2016. The initial version was prototyped at a company hackathon back in February. By the time we published Introducing Snowplow Mini we had already piloted its use with a number of our users and found that it was invaluable to them as they developed new event and entity (context) schemas: enabling them to test those instrumentation updates prior to rolling them out.

Introducing Factotum data pipeline runner was the third most popular blog post by number of users. This is very exciting: Factotum is something we developed at Snowplow to make the job of reliably instrumenting and running a huge number of data pipelines, each defined by a DAG, efficient and robust across hundreds of our users. The interest in Factotum shows that other people and companies are also interested in better managing the ongoing running of complicated, multi-step data pipelines.
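If you haven’t come across Factotum, jobs are defined in “factfiles”: self-describing JSON documents declaring each task, the command it runs and the tasks it depends on. The minimal sketch below follows the factfile format from our release post; treat it as illustrative and check the Factotum repository for the current schema.

```json
{
  "schema": "iglu:com.snowplowanalytics.factotum/factfile/jsonschema/1-0-0",
  "data": {
    "name": "Nightly load demo",
    "tasks": [
      {
        "name": "extract",
        "executor": "shell",
        "command": "echo",
        "arguments": [ "extracting..." ],
        "dependsOn": [],
        "onResult": { "terminateJobWithSuccess": [], "continueJob": [ 0 ] }
      },
      {
        "name": "load",
        "executor": "shell",
        "command": "echo",
        "arguments": [ "loading..." ],
        "dependsOn": [ "extract" ],
        "onResult": { "terminateJobWithSuccess": [], "continueJob": [ 0 ] }
      }
    ]
  }
}
```

Factotum then walks the DAG, running tasks in dependency order and in parallel wherever the graph allows.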

Drilling into the sources of traffic for the top 10 blog posts

To better understand the channels that drove users to our most read posts, we can split traffic by refr_medium. We have plotted the blog posts per referrer medium to understand the distribution of traffic between the posts.
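As a sketch, that split can be computed with a query like the one below, again assuming the standard atomic.events columns; refr_medium is populated by Snowplow’s referer parsing, and here we label visits with no referrer as direct.

```sql
-- Sketch: unique readers per post, broken out by referrer medium
SELECT
  page_urlpath,
  COALESCE(refr_medium, 'direct') AS medium,  -- label no-referrer visits as direct
  COUNT(DISTINCT domain_userid) AS unique_users
FROM atomic.events
WHERE event = 'page_view'
  AND page_urlpath LIKE '/blog/%'
GROUP BY 1, 2
ORDER BY page_urlpath, unique_users DESC;
```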

[Figure: Distribution of unique users per different sources of traffic, ranked by total unique users]

Search was a significant driver for many posts: after further investigation we discovered that, for example, the top post An introduction to event data modeling was ranking as the first result on Google for the search “event data modelling”. Direct traffic drove significantly more traffic to Introducing Factotum data pipeline runner and Introducing Sauna, a decisioning and response platform, while social had a significant impact on the top 6 posts.

Let’s now look at our top Discourse posts.

2. Top 10 Discourse threads published in 2016

| Rank | Discourse thread | Unique users | Time (min) |
|------|------------------|--------------|------------|
| 1 | Visualise Snowplow data using Airbnb Caravel & Redshift [tutorial] | 530 | 1121 |
| 2 | Identifying users (identity stitching) | 429 | 979 |
| 3 | Should I use views in Redshift? | 362 | 376 |
| 4 | Wagon alternative | 277 | 177 |
| 5 | How to setup a Lambda architecture for Snowplow | 251 | 806 |
| 6 | Debugging a Serializable isolation violation in Redshift (ERROR: 1023) [tutorial] | 208 | 496 |
| 7 | Debugging bad rows in Spark and Zeppelin [tutorial] | 201 | 296 |
| 8 | Comparing Snowplow with Google Analytics 360 BigQuery integration (WIP) | 184 | 480 |
| 9 | Basic SQL recipes for web data | 183 | 726 |
| 10 | Loading Snowplow events into Apache Spark and Zeppelin on EMR [tutorial] | 181 | 289 |

Now let’s plot the same visualisation as before:

[Figure: Number of unique users per average time spent]

The Discourse tutorial Visualise Snowplow data using Airbnb Caravel & Redshift was the thread that attracted the largest number of users: people are certainly interested in open source tools for visualizing data! It’s not a surprise, therefore, that the thread Wagon alternative also featured in the top 10.

Our Basic SQL recipes for web data ranked first for average engaged time per user: perhaps not surprising, as it’s likely readers will have worked through the different example queries whilst testing them on their own Snowplow data.

Event data modeling also featured in the top 10, with the thread Identifying users (identity stitching).

It’s also great to see the active interest in Spark by the Snowplow community - two of the top 10 posts are about analyzing Snowplow data with Spark.

What should we be writing about in 2017?

If you have any ideas then let us know. Please stay tuned to our Blog, Discourse and Twitter during 2017.

And sign up to our mailing list for a monthly digest of new content from the Snowplow Team and broader Snowplow Community.

Thoughts or questions? Come join us in our Discourse forum!
Diogo Pacheco
Diogo is a data analyst at Snowplow.