Data collection: the essential, but unloved, foundation of the data value chain

16 January 2017  •  Yali Sassoon
It is so obvious that no one bothers saying it: data collection is an essential part of any data strategy. After all, without data collection there is no data, and without data there is no data value chain. No reporting, no analysis, no data science, no data-driven decision making. It is not just that people in data don’t remark on the importance of data collection; they do not talk about data collection at all. To take just...

Looking back at 2016

12 January 2017  •  Diogo Pacheco
With the start of 2017, we decided to look back at the 2016 blog posts and community Discourse posts that generated the most engagement with our users. More than ten thousand users spent a total of 548 hours reading our blog posts, while on Discourse (which we only launched this year) 8,700 unique users spent 424 hours reading and participating in the Snowplow community. Let’s take a closer look at: Top 10 blog posts published...

Building robust data pipelines that cope with AWS outages and other major catastrophes

10 February 2016  •  Yali Sassoon
At Snowplow, we pride ourselves on building robust data pipelines. Recently that robustness has been severely tested, by two different outages in the AWS us-east-1 region (one S3 outage, and one DynamoDB outage that caused issues with very many other AWS APIs including EC2), and by an SSL certificate issue with one of our client’s collectors that meant that for five consecutive days no events were successfully recorded from their most important platform: iOS. In...

Web and mobile data only gets you to first base when building a single customer view

17 January 2016  •  Yali Sassoon
One of the main reasons that companies adopt Snowplow is to build a single customer view. For many of our users, Snowplow lets them for the first time join behavioral data gathered from their website and mobile apps with other customer data sets (e.g. CRM). This simple step drives an enormous amount of value. However, this is just the beginning. Most companies engage with users on a very large number of channels - not just...

We need to talk about bad data

07 January 2016  •  Yali Sassoon
Architecting data pipelines for data quality. No one in digital analytics talks about bad data. A lot about working with data is sexy, but managing bad data, i.e. working to improve data quality, is not. Not only is talking about bad data not sexy, it is really awkward, because it forces us to confront a hard truth: that our data is not perfect, and therefore the insight that we build on that data might not...

Anton Parkhomenko is a Snowplower!

25 December 2015  •  Alex Dean
Astute readers of this blog have probably noticed a regular new author - we are hugely excited to introduce Anton Parkhomenko to the Snowplow team! Anton joined us as a Data Engineering intern this summer to launch our new Schema Guru project. Anton was already an experienced software engineer; for him the Snowplow internship was about getting his first professional experience in Scala and Functional Programming, plus gaining exposure to Big Data technologies and open...

Looking back on 2015: Most read blogposts

24 December 2015  •  Christophe Bogaert
2015 is drawing to a close, so we decided to crunch our own numbers in Redshift and share which blogposts were read the most. The Snowplow team published 82 new posts in 2015 and more than 2953 hours were spent reading content on our blog (a metric which we calculated using page pings). Apache Spark and AWS Lambda were the topics that resonated most with our readers. We will continue to write about both topics,...

Orchestrating batch processing pipelines with cron and make

13 October 2015  •  Alex Dean
At Snowplow we are often asked how best to orchestrate multi-stage ETL pipelines, where these pipelines typically include Snowplow and our SQL Runner, sometimes Huskimo and often third-party apps and scripts. There is a wide array of tools available for this kind of orchestration, including AWS Data Pipeline, Luigi, Chronos, Jenkins and Airflow. These tools tend to have the following two capabilities: a job-scheduler, which determines when each batch processing job will run, and a DAG-runner,...
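To make the DAG-runner idea concrete, here is a minimal Python sketch, not taken from the post: it walks a hypothetical dependency graph of pipeline steps and runs each step only after its upstream steps have succeeded. The step names and commands are invented purely for illustration; cron would supply the job-scheduling half.

```python
import subprocess

# Hypothetical pipeline DAG: each step lists the steps it depends on.
PIPELINE = {
    "enrich":        [],
    "load_redshift": ["enrich"],
    "sql_runner":    ["load_redshift"],
}

# Invented placeholder commands, for illustration only.
COMMANDS = {
    "enrich":        ["echo", "running enrichment"],
    "load_redshift": ["echo", "loading Redshift"],
    "sql_runner":    ["echo", "running SQL data models"],
}

def run(step, done=None):
    """Run a step after recursively running its dependencies, each at most once."""
    done = set() if done is None else done
    if step in done:
        return
    for dependency in PIPELINE[step]:
        run(dependency, done)
    subprocess.run(COMMANDS[step], check=True)  # fail fast so downstream steps never run
    done.add(step)

if __name__ == "__main__":
    run("sql_runner")
```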

Christophe Bogaert is a Snowplower!

20 April 2015  •  Yali Sassoon
Snowplow clients who have been working with us on analytics projects, and anyone who’s been keeping up with our releases, will have noticed a new face on the Snowplow team. It is with great pleasure that we introduce Christophe Bogaert to the Snowplow community. Christophe joined us as our first Data Scientist in February. He designed, tested and delivered the data models that are at the heart of last week’s Snowplow v.64 Palila release -...

Joshua Beemster is a Snowplower!

19 February 2015  •  Alex Dean
You have probably started seeing a new name behind software releases and blog posts recently: we are hugely excited to belatedly introduce Joshua Beemster to the Snowplow team! Josh joined us as a Data Engineer last fall. He is our first remote hire and is currently based in Dijon, France. Josh hails from Australia and is taking his Bachelor of Computer Science at Charles Sturt University, Sydney, via distance learning. Since starting at Snowplow,...

Fred Blundun is a Snowplower!

02 July 2014  •  Alex Dean
You have probably seen a new name behind blog posts, new software releases and email threads recently: we are hugely excited to introduce (somewhat belatedly!) Fred Blundun to the team! Fred joined us as a Data Engineer this spring. Fred is a Mathematics graduate from Cambridge University; data engineering at Snowplow is his first full-time role in software. Fred hit the ground running at Snowplow with some great new tracker releases, including: The Snowplow Python Tracker...

Making Snowplow schemas flexible - our technical approach

06 June 2014  •  Yali Sassoon
In the last couple of months we’ve been doing an enormous amount of work to make the core Snowplow schema flexible. This is an essential step to making Snowplow an event analytics platform that can be used to store event data from: Any kind of application. The event dictionary, and therefore schema, for a massively multiplayer online game will look totally different to that of a newspaper site, which will look different to that of a banking application. Any...
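For a flavour of what a flexible, self-describing event might look like in practice, here is a hedged Python sketch; the iglu-style schema URI, vendor name and fields are illustrative assumptions rather than anything specified in the post.

```python
import json

# Illustrative self-describing event: the payload names the schema that describes it,
# so the pipeline can validate and store it without one hard-coded event dictionary.
# The schema URI, vendor and field names below are invented for this example.
ad_click_event = {
    "schema": "iglu:com.acme/ad_click/jsonschema/1-0-0",
    "data": {
        "bannerId": "4acd518feb82",
        "campaignId": "summer-2014",
        "clickTimestamp": "2014-06-06T12:34:56Z",
    },
}

print(json.dumps(ad_click_event, indent=2))
```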

Amazon Kinesis tutorial - a getting started guide

15 January 2014  •  Yali Sassoon
Of all the developments on the Snowplow roadmap, the one that we are most excited about is porting the Snowplow data pipeline to Amazon Kinesis to deliver real-time data processing. We will publish a separate post outlining why we are so excited about this. (Hint: it’s about a lot more than simply real-time analytics on Snowplow data.) This blog post is intended to provide a starting point for developers who are interested in diving into...
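For readers who want to experiment with Kinesis from Python before diving into the full tutorial, a minimal sketch along the following lines should work; the stream name, region and event payload are assumptions, not values from the post.

```python
import json
import boto3

# Assumed region and stream name; create the stream beforehand (e.g. in the AWS console).
kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "snowplow-raw-events"

# Write a single record to the stream.
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps({"event": "page_view", "page": "/hello"}).encode("utf-8"),
    PartitionKey="user-123",  # records sharing a partition key land on the same shard
)

# Read records back from the first shard, starting at the oldest available record.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
    print(record["Data"])
```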

Loading JSON data into Redshift - the challenges of querying JSON data, and how Snowplow can be used to meet those challenges

20 November 2013  •  Yali Sassoon
Many of our Professional Services projects involve forking the Snowplow codebase so that specific clients can use it to load their event data, stored as JSONs, into Amazon Redshift, where they can use BI tools to create dashboards and mine that data. We have been surprised by quite how many companies have gone down the road of using JSONs to store their event data. In this blog post, we look at: Why logging event data...

Snowplow passes 500 stars on GitHub

01 October 2013  •  Alex Dean
As of yesterday, the Snowplow repository on GitHub now has over 500 stars! We’re hugely excited to reach this milestone, having picked up 300 new watchers since our last milestone in January. Many thanks to everyone in the Snowplow community and on GitHub for their continuing support and interest! Here’s a quick round-up of the most popular open source analytics projects on GitHub: Hummingbird (real-time web analytics) - 2,299 stars Piwik (web analytics) - 1,290...

How much does Snowplow cost to run?

27 September 2013  •  Yali Sassoon
We are very pleased to announce the release of the Snowplow Total Cost of Ownership Model. This is a model we started developing back in July, to enable: Snowplow users and prospective users to better forecast their Snowplow costs on Amazon Web Services going forwards, and the Snowplow Development Team to monitor how the cost of running Snowplow evolves as we build out the platform. Modelling the costs associated with running Snowplow has not been straightforward:...

Unpicking the Snowplow data pipeline and how it drives AWS costs

09 July 2013  •  Yali Sassoon
Back in March, Robert Kingston suggested that we develop a Total Cost of Ownership model for Snowplow: something that would enable a user or prospective user to accurately estimate their Amazon Web Services monthly charges going forwards, and see how those costs vary with different traffic levels. We thought this was an excellent idea. Since Rob’s suggestion, we’ve made a number of important changes to the Snowplow platform that have changed the way Snowplow costs...

Dealing with Hadoop's small files problem

30 May 2013  •  Alex Dean
Hadoop has a serious Small File Problem. It’s widely known that Hadoop struggles to run MapReduce jobs that involve thousands of small files: Hadoop much prefers to crunch through tens or hundreds of files sized at or around the magic 128 megabytes. The technical reasons for this are well explained in this Cloudera blog post - what is less well understood is how badly small files can slow down your Hadoop job, and what to...
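One simple mitigation, sketched below in Python, is to compact many small S3 objects into a handful of larger ones before kicking off the Hadoop job; this is an illustrative approach rather than necessarily the one the post goes on to recommend, and the bucket and prefixes are invented.

```python
import boto3

# Invented bucket and prefixes, for illustration only.
BUCKET = "my-raw-logs"
SOURCE_PREFIX = "small-files/2013-05-30/"
TARGET_KEY = "compacted/2013-05-30/part-00000"

s3 = boto3.client("s3")

# Gather the contents of every small object under the source prefix.
chunks = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=SOURCE_PREFIX):
    for obj in page.get("Contents", []):
        chunks.append(s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read())

# Write one large object in place of many tiny ones. In practice you would roll
# output objects at roughly the ~128 MB mark rather than concatenating everything.
s3.put_object(Bucket=BUCKET, Key=TARGET_KEY, Body=b"".join(chunks))
```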

Towards high-fidelity web analytics - introducing Snowplow's innovative new event validation capabilities

10 April 2013  •  Alex Dean
A key goal of the Snowplow project is enabling high-fidelity analytics for businesses running Snowplow. What do we mean by high-fidelity analytics? Simply put, high-fidelity analytics means Snowplow faithfully recording all customer events in a rich, granular, non-lossy and unopinionated way. This data is incredibly valuable: it enables companies to better understand their customers and develop and tailor products and services to them. Ensuring that the data is high fidelity is essential to ensuring that...

Inside the Plow - Rob Slifka's Elasticity

20 March 2013  •  Alex Dean
The Snowplow platform is built standing on the shoulders of a whole host of different open source frameworks, libraries and tools. Without the amazing ongoing work by these individuals, companies and not-for-profits, the Snowplow project literally could not exist. As part of our “Inside the Plow” series, we will also be showcasing some of these core components of the Snowplow stack, and talking to their creators. To kick us off, we are delighted to have...

Bulk loading data from Amazon S3 into Redshift at the command line

20 February 2013  •  Yali Sassoon
On Friday Amazon launched Redshift, a fully managed, petabyte-scale data warehouse service. We’ve been busy since then building out Snowplow support for Redshift, so that Snowplow users can use Redshift to store their granular, customer-level and event-level data for OLAP analysis. In the course of building out Snowplow support for Redshift, we needed to bulk load data stored in S3 into Redshift programmatically. Unfortunately, the Redshift Java SDK is very slow at inserts, so not suitable...
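As a hedged illustration of the bulk-load approach, Redshift’s COPY command can be driven from Python with a standard PostgreSQL driver such as psycopg2; the cluster endpoint, table, bucket and credentials below are placeholders, not values from the post.

```python
import psycopg2

# Placeholder connection details, for illustration only.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="snowplow",
    user="admin",
    password="REPLACE_ME",
)

# COPY pulls the files from S3 into Redshift in parallel; table, bucket and
# credentials are placeholders.
copy_sql = """
    COPY atomic.events
    FROM 's3://my-snowplow-bucket/enriched/good/'
    CREDENTIALS 'aws_access_key_id=REPLACE_ME;aws_secret_access_key=REPLACE_ME'
    DELIMITER '\\t'
    MAXERROR 1;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)

conn.close()
```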

Writing Hive UDFs - a tutorial

08 February 2013  •  Alex Dean
Snowplow’s own Alexander Dean was recently asked to write an article for the Software Developer’s Journal edition on Hadoop. The kind folks at the Software Developer’s Journal have allowed us to reprint his article in full below. Alex started writing Hive UDFs as part of the process of writing the Snowplow log deserializer - the custom SerDe used to parse Snowplow logs generated by the CloudFront and Clojure collectors so they can be processed in...

Help us build out the Snowplow Event Model

04 February 2013  •  Yali Sassoon
At its beating heart, Snowplow is a platform for capturing, storing and analysing event data, with a real focus on web event data. Working out how best to structure the Snowplow event data is key to making Snowplow a success. One of the things that has surprised us since we started working on Snowplow is the extent to which our view of the best way to structure that data has changed over time. In this blog...

The Snowplow development roadmap for the ETL step - from ETL to enrichment

09 January 2013
In this blog post, we outline our plans to develop the ETL (“extract, transform and load”) part of the Snowplow stack. Although in many respects the least sexy element of the stack, it is critical to Snowplow, and we intend to re-architect the ETL step in quite significant ways. In this post, we discuss our plans and the rationale behind them, in the hope of getting: Feedback from the community on them, and ideas for alternative...

Understanding the thinking behind the Clojure Collector, and mapping out its development going forwards

07 January 2013
Last week we released Snowplow 0.7.0, which included a new Clojure Collector with some significant new functionality for content networks and ad networks in particular. In this post we explain a lot of the thinking behind the Clojure Collector architecture, before taking a look ahead at the short and long-term development roadmap for the collector. This is the first in a series of posts in which we describe in some detail the thinking behind the...