Improving Snowplow's understanding of time

15 September 2015  •  Alex Dean
As we evolve the Snowplow platform, one area we keep coming back to is our understanding and handling of time. The time at which an event took place is a crucial fact about every event - but it’s surprisingly challenging to determine accurately. Our approach to date has been to capture as many clues to an event’s “true timestamp” as we can, and to record these faithfully for further analysis. The steady expansion...
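As a rough sketch of the kind of clue-combining the post explores (field names and logic here are illustrative, not Snowplow’s actual algorithm): a device’s own clock may be wrong, but the gap it reports between creating and sending an event can be subtracted from the trusted collector timestamp.

```scala
import java.time.{Duration, Instant}

// Hypothetical timestamps for a single event, named for illustration only
final case class EventTimestamps(
  dvceCreatedTstamp: Instant, // device clock when the event was created
  dvceSentTstamp: Instant,    // device clock when the event was sent
  collectorTstamp: Instant    // server clock when the collector received it
)

object TrueTimestamp {
  /** Estimate when the event really happened: the device clock may be skewed,
    * but the gap between "created" and "sent" on that same clock is consistent,
    * so subtract it from the (trusted) collector timestamp. */
  def estimate(ts: EventTimestamps): Instant = {
    val timeOnDevice = Duration.between(ts.dvceCreatedTstamp, ts.dvceSentTstamp)
    ts.collectorTstamp.minus(timeOnDevice)
  }
}
```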

First experiments with Apache Spark at Snowplow

21 May 2015  •  Justine Courty
As we talked about in our May post on the Spark Example Project release, at Snowplow we are very interested in Apache Spark for three things: data modeling, i.e. applying business rules to aggregate up event-level data into a format suitable for ingesting into a business intelligence / reporting / OLAP tool; real-time aggregation of data for real-time dashboards; and running machine-learning algorithms on event-level data. We’re just at the beginning of our journey getting familiar...
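A minimal sketch of the first of those use cases, using Spark’s Scala RDD API (the bucket paths and field positions below are hypothetical, not Snowplow’s enriched event layout):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DailyEventCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("daily-event-counts"))

    // Hypothetical layout: tab-separated events with the collector timestamp
    // ("YYYY-MM-DD hh:mm:ss") in column 2 and the event type in column 9
    val events = sc.textFile("s3://my-bucket/enriched/events/*")

    val counts = events
      .map(_.split("\t", -1))
      .map(fields => ((fields(2).take(10), fields(9)), 1L)) // (date, event type)
      .reduceByKey(_ + _)                                   // count per day and type

    counts.saveAsTextFile("s3://my-bucket/aggregates/daily-event-counts")
    sc.stop()
  }
}
```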

Uploading Snowplow events to Google BigQuery

08 February 2015  •  Andrew Curtis
As part of my winternship here at Snowplow Analytics in London, I’ve been experimenting with using Scala to upload Snowplow’s enriched events to Google’s BigQuery database. The ultimate goal is to add BigQuery support to both Snowplow pipelines, including being able to stream data in near-realtime from an Amazon Kinesis stream to BigQuery. This blog post will cover: getting started with BigQuery; downloading some enriched events; installing the BigQuery Loader CLI; and analyzing the event stream in...
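The post walks through the BigQuery Loader CLI itself; purely as a rough sketch of the near-realtime path it mentions, a streaming insert with today’s google-cloud-bigquery Java client (not the tooling used in the post) might look like this from Scala, with the dataset, table and column layout invented for illustration:

```scala
import scala.jdk.CollectionConverters._
import com.google.cloud.bigquery.{BigQueryOptions, InsertAllRequest, TableId}

object StreamToBigQuery {
  // Hypothetical dataset and table names
  private val table    = TableId.of("snowplow", "enriched_events")
  private val bigquery = BigQueryOptions.getDefaultInstance.getService

  /** Stream a single enriched event (already parsed into column -> value pairs)
    * into BigQuery via the streaming-insert API. */
  def insert(event: Map[String, AnyRef]): Unit = {
    val request = InsertAllRequest.newBuilder(table)
      .addRow(event.asJava)
      .build()
    val response = bigquery.insertAll(request)
    if (response.hasErrors)
      println(s"Insert errors: ${response.getInsertErrors}")
  }
}
```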

Modeling events through entity snapshotting

18 January 2015  •  Alex Dean
At Snowplow we spend a lot of time thinking about how to model events. As businesses re-orient themselves around event streams under the Unified Log model, it becomes ever more important to properly model those event streams. After all: “garbage in” means “garbage out”: deriving business value from events is hugely dependent on modeling those events correctly in the first place. Our focus at Snowplow has been on defining a semantic model for events: one...

Introducing self-describing Thrift

16 December 2014  •  Fred Blundun
At Snowplow we have been thinking about how to version Thrift schemas. This was prompted by the realization that we need to update the SnowplowRawEvent schema, which we use to serialize the Snowplow events received by the Scala Stream Collector. We want to update this in a way that supports further schema evolution in the future. The rest of this post will discuss our proposed solution to this problem: the problem; the un-versioned approach; adding...

Introducing self-describing JSONs

15 May 2014  •  Alex Dean
Initial self-describing JSON draft. Date: 14 May 2014. Draft authors: Alexander Dean, Frederick Blundun. Updated 10 June 2014. Changed iglu:// references to iglu: as these resource identifiers do not point to specific hosts. At Snowplow we have been thinking a lot about how to add schemas to our data models, in place of the implicit data models and wiki-based tracker protocols that we have today. Crucially, whatever we come up with must also work for...
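The core idea is a simple envelope: the JSON carries a reference to the schema it conforms to alongside its data. A sketch of the shape, here held in a Scala string (the schema URI and fields are illustrative):

```scala
// A self-describing JSON wraps its body in an envelope that names the schema
// it conforms to via an iglu: URI. Schema name and fields are examples only.
val adClick: String =
  """{
    |  "schema": "iglu:com.snowplowanalytics.snowplow/ad_click/jsonschema/1-0-0",
    |  "data": {
    |    "bannerId": "4732ce23d345"
    |  }
    |}""".stripMargin
```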

Introducing SchemaVer for semantic versioning of schemas

13 May 2014  •  Alex Dean
Initial SchemaVer draft. Date: 13 March 2014. Draft authors: Alexander Dean, Frederick Blundun. As we start to re-structure Snowplow away from implicit data models and wiki-based tracker protocols towards formal schemas (initially Thrift and JSON Schema, later Apache Avro), we have started to think about schema versioning. "There are only two types of developer: the developer who versions his code, and developer_new_newer_newest_v2" Proper versioning of software is taken for granted these days - there are...
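SchemaVer expresses a schema version as MODEL-REVISION-ADDITION, e.g. 1-0-2. As a small sketch only (not Snowplow’s implementation), parsing and comparing such versions might look like this:

```scala
// SchemaVer labels a schema version as MODEL-REVISION-ADDITION, e.g. "1-0-2"
final case class SchemaVer(model: Int, revision: Int, addition: Int) {
  override def toString = s"$model-$revision-$addition"
}

object SchemaVer {
  private val Pattern = """(\d+)-(\d+)-(\d+)""".r

  def parse(s: String): Option[SchemaVer] = s match {
    case Pattern(m, r, a) => Some(SchemaVer(m.toInt, r.toInt, a.toInt))
    case _                => None
  }

  // Data written against one MODEL cannot be assumed to validate against another
  def sameModel(a: SchemaVer, b: SchemaVer): Boolean = a.model == b.model
}
```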

Building an event grammar - understanding context

11 March 2014  •  Alex Dean
Here at Snowplow we recently added a new feature called “custom contexts” to our JavaScript Tracker (although not yet into our Enrichment process or Storage targets). To accompany the feature release we published a User Guide for Custom Contexts - a practical, hands-on guide to populating custom contexts from JavaScript. We now want to follow this up with a post on the underlying theory of event context: what it is, how it is generated and...

The three eras of business data processing

20 January 2014  •  Alex Dean
Every so often, a work emerges that captures and disseminates the bleeding edge so effectively as to define a new norm. For those of us working in event stream analytics, that moment came late in 2013 with the publication of Jay Kreps’ monograph The Log: What every software engineer should know about real-time data’s unifying abstraction. Anyone involved in the operation or analysis of a digital business ought to read Jay’s piece in its entirety. His...

Scripting Hadoop, Part One - Adventures with Scala, Rhino and JavaScript

21 October 2013  •  Alex Dean
As we have got to know the Snowplow community better, it has become clear that many members have very specific event processing requirements, including: custom trackers and collector logging formats; custom event models; and custom business logic that impacts on the way their event data is processed. To date, we have relied on three main techniques to help Snowplow users meet these requirements: adding additional configuration options into the core Enrichment process (e.g. IP address anonymization,...
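Since the post goes on to explore embedding JavaScript in the Scala pipeline via Rhino, here is a minimal, hypothetical sketch of that general technique (the function name and wiring are invented, not the Snowplow implementation):

```scala
import org.mozilla.javascript.{Context, Function => JsFunction}

/** Apply a user-supplied JavaScript "process" function to an event via Rhino. */
object JsEnrichment {
  def process(js: String, eventJson: String): AnyRef = {
    val cx = Context.enter()
    try {
      val scope = cx.initStandardObjects()
      // Evaluate the user's script, which is expected to define process(event)
      cx.evaluateString(scope, js, "user-enrichment", 1, null)
      val fn = scope.get("process", scope).asInstanceOf[JsFunction]
      fn.call(cx, scope, scope, Array[AnyRef](eventJson))
    } finally Context.exit()
  }
}
```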

Towards universal event analytics - building an event grammar

12 August 2013  •  Alex Dean
As we outgrow our “fat table” structure for Snowplow events in Redshift, we have been spending more time thinking about how we can model digital events in Snowplow in the most universal, flexible and future-proof way possible. When we blogged about building out the Snowplow event model earlier this year, a comment left on that post by Loic Dias Da Silva made us realize that we were missing an even more fundamental point: defining a...