Why is Snowplow so unusual in aiming for high-fidelity analytics? Most often, analytics vendors sacrifice the goal of high-fidelity data at the altar of these three compromises:
1. Premature aggregation - when the data store gets too large, or the reports take too long to generate, it’s tempting to perform the aggregation and roll-up of the raw event data earlier, sometimes even at the point of collection. Of course this offers a huge potential performance boost to the tool, but at a heavy cost to customer data fidelity.
2. Ignoring bad news - the nature of event data means that incomplete, corrupted or plain wrong data is often sent in to the analytics tool by the event trackers. Handling bad event data is complicated (let’s go shopping!). Instead of dealing with the complexity, most analytics packages just throw the bad data away silently; this is why tag audit companies like ObservePoint exist.
3. Being over-opinionated - customer analytics is full of challenging questions which need answering before you can analyse the data: do I track users by their first-party cookie, third-party cookie, business ID and/or IP address? Do I use the server clock, or the user’s clock to log the event time? When does a user session start and end? Because these questions can be difficult to answer, most analytics tools don’t ask them: instead they take an opinionated view of the “right answer” and silently enforce that view through their event collection, storage and analysis. By the time users realize that the enforced logic does not work for their business, they are already tied to that vendor and to the imperfect data set they have built up with that vendor to date.
To deliver on the goal of high-fidelity analytics, then, we’re trying to steer Snowplow around these three common pitfalls as best we can.
We have talked in detail on our website and wiki about avoiding pitfall #1, Premature aggregation. In short: we do no aggregation - Snowplow users have access to granular, event level data, so that they can work out how best they should aggregate it for each type of analysis they wish to perform.
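To make the point about granular data concrete, here is a minimal sketch in Python. The event records and their field names (user_id, page) are invented for illustration and are not Snowplow's actual event schema. It shows the same event-level rows being rolled up in two different ways after the fact, which would not be possible if only one pre-aggregated view had been stored:

```python
from collections import Counter

# Hypothetical event-level rows; the field names are illustrative only.
events = [
    {"user_id": "u1", "page": "/home"},
    {"user_id": "u1", "page": "/cart"},
    {"user_id": "u1", "page": "/home"},
    {"user_id": "u2", "page": "/home"},
]

# Roll-up 1: total page views per page.
views_per_page = Counter(e["page"] for e in events)

# Roll-up 2: distinct users per page, computed from the very same rows.
users_per_page = {}
for e in events:
    users_per_page.setdefault(e["page"], set()).add(e["user_id"])
unique_users_per_page = {page: len(users) for page, users in users_per_page.items()}

print(views_per_page)
print(unique_users_per_page)
```

Had only the page view totals been kept, the distinct user counts could not be recovered later.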
We will blog more about our ideas to combat #3, Being over-opinionated, in the future.
For the rest of this blog post, though, we will look at our solution to pitfall #2, Ignoring bad news: namely, event validation.
Our new Scalding-based event enrichment process (introduced in our last blog post) brings with it the concept of event validation.
Instead of “ignoring bad news”, the Snowplow enrichment engine now validates that every logged event matches the format that we expect for Snowplow events - be they page views, ecommerce transactions, custom structured events or some other type of event. Events which do not match this format are stored in a new “Bad Rows” bucket in Amazon S3, along with the specific data validations which the event failed.
By way of example, here are a couple of custom structured events generated by an ecommerce site running Snowplow; both of these events failed the new validation step in our Scalding ETL process. You will note that the bad rows are logged to the S3 bucket in JSON format - we have “pretty printed” the rows to make them easier to read:
These validation errors occurred because the ecommerce site incorrectly tried to log customer address information in the value field of a custom structured event; the value field only supports numeric values (and is stored in Redshift in a float field). When we saw these validation errors, we notified the site and they corrected their Google Tag Manager implementation.
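The general idea can be sketched in a few lines of Python. This is not the Scalding implementation: the function names, the event_type field, the error message wording and the "line"/"errors" layout of a bad row are all assumptions made for illustration. What it does show is the principle described above: an event whose value field is not numeric is not dropped, but set aside together with the reasons it failed.

```python
import json

def validate_event(event):
    """Return a list of validation errors (empty if the event is valid).

    The rules and field names here are hypothetical examples.
    """
    errors = []
    if not event.get("event_type"):
        errors.append("Field [event_type]: missing or empty")
    value = event.get("value")
    if value is not None:
        try:
            float(value)  # the value field must be numeric
        except (TypeError, ValueError):
            errors.append(f"Field [value]: cannot convert [{value}] to a number")
    return errors

def partition_events(events):
    """Split events into good rows and bad rows; nothing is silently discarded."""
    good, bad = [], []
    for event in events:
        errors = validate_event(event)
        if errors:
            # Assumed bad row layout: the original event plus the failed validations.
            bad.append({"line": json.dumps(event, sort_keys=True), "errors": errors})
        else:
            good.append(event)
    return good, bad

events = [
    {"event_type": "struct", "category": "checkout", "value": "19.99"},
    {"event_type": "struct", "category": "checkout", "value": "1 Main Street"},
]
good_rows, bad_rows = partition_events(events)
print(json.dumps(bad_rows, indent=2))
```

The first event passes through to the good rows; the second, which carries address text in the value field, ends up in the bad rows with an error message attached.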
Currently these bad rows are simply stored for inspection in the Bad Rows bucket in S3, while Snowplow carries on with the raw event processing. This lets the Snowplow user tackle the tagging/data quality issues offline, without disrupting the loading of all their high-fidelity, now-validated event data into Redshift. It leaves open the possibility that the user can fix and reprocess the bad rows.
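Because the bad rows are stored as JSON, inspecting them offline can be a small script. The sketch below assumes, purely for illustration, that each bad row is one JSON object per line with an "errors" array; the sample rows are invented, and in practice the lines would come from files downloaded from the Bad Rows bucket. It tallies how often each validation error occurs, which is usually enough to spot a systematic tagging mistake:

```python
import json
from collections import Counter

def summarize_bad_rows(lines):
    """Count occurrences of each validation error across JSON-lines bad rows."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        counts.update(row.get("errors", []))  # "errors" is an assumed field name
    return counts

# Invented sample rows standing in for the contents of a bad rows file.
sample = [
    '{"line": "raw event 1", "errors": ["Field [value]: not numeric"]}',
    '{"line": "raw event 2", "errors": ["Field [value]: not numeric"]}',
    '',
]

summary = summarize_bad_rows(sample)
for message, count in summary.most_common():
    print(count, message)
```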
In the future we could look into ways of sending alerts when bad rows are generated, or even look into ways of automatically fixing bad rows and submitting them for re-processing.
This is straightforward stuff - but compare it with the approach taken by other web analytics vendors. If a Google Analytics user sends incorrectly configured data into GA, for example, one of two things happens:
GA silently ignores the data
GA accommodates the data, so that it corrupts reports produced in GA
For the GA user, spotting the error is impossible in either case. Not only has a data point been lost, but potentially an erroneous data point has been introduced, one that will be very hard to debug given that users can never inspect the underlying data.
This becomes more of a problem as we move to a Universal Analytics world: one in which companies feed GA with all their customer event data from a variety of systems. Ensuring that the system is fed with perfect data will only get harder, whilst dealing with situations where erroneous data has been pushed in will remain impossible.
That completes our brief look at event validation. We hope it is clear why this is such an important topic. For us at Snowplow, event validation is a key part of our quest for high-fidelity event analytics - so expect to hear more from us on this topic soon!