We need to talk about bad data. We can’t manage an issue unless we identify and explore it. The good news is that if we take a systematic approach, we can manage bad data very effectively. In this post, we’ll confront bad data head on, and discuss how we’ve architected the Snowplow pipeline to manage and mitigate it.
Bad data is a problem because it erodes our confidence in our data, which limits our ability to build insight on that data. To take a very simple example, the graph below shows some metric (it could be any metric) over time, with a temporary dip in that metric.
What caused the dip? There are two possibilities: either (1) something really happened, and the metric genuinely dropped, or (2) there is a problem with the underlying data. We need to rule out (2) if we are to conclude (1).
The impact of bad data is more corrosive than simply making data difficult to interpret. Once question marks have been raised in an organization about the reliability of a data source, it becomes very hard to win back confidence in that source, and without that confidence it is impossible to socialize insight derived from it.
If we are going to manage bad data, the first thing we need to do is identify as many potential sources of bad data as possible. We can never rule out the possibility that there are as yet unidentified sources of bad data - but we can do our best by keeping our list of known sources of bad data as exhaustive as possible.
To help us with this exercise, I’ve included a schematic of the Snowplow data pipeline below. Whilst this is a Snowplow-specific diagram, the general data pipeline stages are broadly in common with many commercial data pipeline products and home-brewed event data pipelines. We can use this diagram as a jumping-off point to identify potential sources of bad data in any data pipeline.
There are two root causes of data quality issues:
2.1 Missing data
2.2 Inaccurate data
We’ll start by identifying the sources of missing data. If we’ve missed any, then do please share them in the Comments section below!
Another class of data quality issue is caused by inaccurate data. Again let us try and identify the different sources of inaccurate data:
The first step in managing bad data is to ensure that we can spot data quality issues, i.e. identify either that data is missing or that it is inaccurate. Making sure that your data pipeline is auditable, so that you can conduct those checks, is essential.
There are several features of the Snowplow pipeline that make it especially easy to audit data quality:
3.1 Auditing can be done at the event-level
3.2 Auditing can be done at each stage in the data pipeline
3.3 Data is never dropped from the Snowplow pipeline
Let’s discuss each of these in turn.
When you audit data processed through the Snowplow data pipeline, you can drill down to the event level. That’s very important, because aggregate-level checks can mask issues that only become visible when you can inspect the individual events behind the numbers.
Most data pipelines are auditable at the event level. One notable exception is Google Analytics (the free version; the Premium version makes ‘hit’-level data available in BigQuery). This seriously limits the ability of Google Analytics users to identify and manage data quality issues - something many have become aware of since migrating from GA Classic to Universal Analytics, with perceived discrepancies in KPIs post-migration being hard to explore, let alone explain.
Event data pipelines are complicated things: a lot of computation is done on the data as it moves through the pipeline. This is a good thing, because each step adds value to the data, but each step also creates a new opportunity for something to go wrong, resulting either in data loss (missing data) or data corruption (inaccurate data).
Snowplow is architected so that the data can be diligenced at every stage of the data pipeline.
One of the lovely things about diligencing event-level data is that it is easy to reason about: if you record 10 events with a tracker, you should be able to identify those same 10 events in the raw collector logs, in the processed data in S3, and in each of your storage targets (e.g. Redshift or Elasticsearch).
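As an illustrative sketch of that reconciliation logic (the stage names and event IDs below are made up, and this is not Snowplow's actual tooling):

```python
# Sketch of event-level reconciliation: compare the set of event IDs seen
# at each downstream pipeline stage against the first (upstream) stage.
# Stage names and event IDs are hypothetical.

def reconcile(stage_event_ids):
    """Return, per downstream stage, the event IDs missing vs. the first stage."""
    stages = list(stage_event_ids)
    baseline = stage_event_ids[stages[0]]
    return {stage: baseline - stage_event_ids[stage] for stage in stages[1:]}

tracked = {f"evt-{i}" for i in range(10)}   # 10 events recorded by the tracker
gaps = reconcile({
    "tracker": tracked,
    "collector_logs": tracked,              # all 10 reached the collector
    "enriched_s3": tracked - {"evt-7"},     # one event went missing in enrichment
})
# gaps["collector_logs"] is empty; gaps["enriched_s3"] pinpoints the lost event
```

Because the check runs on event IDs rather than aggregate counts, it tells you not just *that* an event went missing, but *which* one, and at which stage.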
What if something goes wrong with a processing step for one of the events?
Every other analytics platform and data pipeline that we are aware of simply ‘drops’ problematic data, resulting in missing data. This is deeply problematic, because missing data is very hard to spot: if you rely on your analytics system to tell you what has happened, and it fails to tell you about something, you have to be fortunate enough to know independently that the thing occurred in order to notice that it is missing.
So with Snowplow, rather than drop events that fail to process, we load them into a separate “bad rows” folder in S3. This data set includes a record not only of all the events that failed to be processed successfully, but also an array of all the error messages generated when that data was processed. We then load the bad data into Elasticsearch, to make it easy for our users to monitor, query, and diagnose it.
A guide to using Kibana / Elasticsearch to monitor bad data with Snowplow is in the works, but in the meantime, to give you a flavor, the screenshot below shows the number of events that failed to process successfully for a Snowplow trial user. At the top of the UI you can view the number of failed events per run: note that this is alarmingly high. Below, we can inspect a sample of the failed events. We can quickly identify that all the events in this batch that failed processing had the same root cause: a non-numeric value inserted into a numeric column.
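To give a sense of how that error metadata can be put to work, here is a minimal sketch that tallies the most common error messages in a batch of bad rows. It assumes each bad row is a JSON record with an `errors` array, which is the general shape of Snowplow's bad rows output, though the exact format varies by version; the field values below are made up.

```python
import json
from collections import Counter

def top_errors(bad_row_lines, n=3):
    """Tally the most common error messages across a batch of bad rows."""
    counts = Counter()
    for line in bad_row_lines:
        record = json.loads(line)
        for err in record.get("errors", []):
            # errors may be plain strings or objects with a "message" field
            counts[err["message"] if isinstance(err, dict) else err] += 1
    return counts.most_common(n)

# Two hypothetical bad rows sharing a single root cause
sample = [
    '{"line": "ev=1&total=free", "errors": ["Field tr_total: not numeric"]}',
    '{"line": "ev=2&total=n/a",  "errors": ["Field tr_total: not numeric"]}',
]
# top_errors(sample) -> [("Field tr_total: not numeric", 2)]
```

Grouping failures by message like this is exactly what surfaces a pattern such as “every failure in this run is the same non-numeric value in a numeric column”.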
A basic data science principle is that you should be able to recompute your analysis from scratch. This is important because if you spot an error in one of the processing steps, you can fix it and then run the analysis again.
We’ve taken exactly the same approach with the Snowplow pipeline: you can always reprocess your data from scratch, starting with the raw collector logs. That means that if you spot a bug in one of the processing steps, you can fix it and then rerun the pipeline to regenerate a correct data set.
When we diligence our data quality, we need more than just the data itself: we need metadata that describes where the data came from and how it has been processed, i.e. its lineage.
This metadata is essential to help us identify and address issues - to take a very simple example, if we can see that an issue is restricted to a particular application, time period, or enrichment version, we can take steps to resolve that problem. Performing that diagnosis requires that metadata.
The Snowplow pipeline generates a large amount of metadata on data lineage, and stores that metadata with the data itself. For example, each event includes fields identifying the application and tracker that generated it, and the collector and enrichment versions that processed it.
All of the above metadata is invaluable when diligencing data for quality issues, and then diagnosing the source of any issues that are spotted.
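To make that concrete, here is a simplified sketch of events carrying lineage metadata, and of how that metadata localizes an issue. Field names loosely follow Snowplow's canonical event model; the values and the helper function are made up.

```python
# Simplified enriched events carrying lineage metadata
# (field names loosely follow Snowplow's canonical event model).
events = [
    {"event_id": "evt-1", "app_id": "checkout-web", "v_tracker": "js-2.5.0",
     "v_collector": "clj-1.1.0", "v_etl": "hadoop-1.7.0"},
    {"event_id": "evt-2", "app_id": "checkout-web", "v_tracker": "js-2.5.0",
     "v_collector": "clj-1.1.0", "v_etl": "hadoop-1.8.0"},
]

def processed_by(events, field, version):
    """Isolate events handled by a suspect component, e.g. one enrichment version."""
    return [e["event_id"] for e in events if e[field] == version]

# If an issue only appears in data processed by one enrichment version, the
# lineage metadata lets us scope the diagnosis (and any reprocessing) to
# exactly those events.
suspect = processed_by(events, "v_etl", "hadoop-1.8.0")
```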
We’ll take a closer look at the value of attaching the data schema for each event with the event data itself in the next section.
When we build data processing steps, we make assumptions about how data is structured. If data is malformed, so that e.g. a field that should be numeric is not, any data processing step that treats the field as a number is liable to break.
Data that is sent into Snowplow is ‘self-describing’: it includes not just the data itself, but the schema, i.e. the structure, of the data. This means that users can evolve their data schemas over time. More importantly when we think about bad data, it also means that Snowplow can validate that the data conforms to the schema right at the beginning of the pipeline, upstream of any data processing steps that would otherwise fail because the data is malformed. If the data fails validation, Snowplow can push it into the bad rows bucket with a very specific error message describing exactly what is wrong with which field, so that the issue can be addressed and the data reprocessed.
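As a toy illustration of the principle (Snowplow actually validates against JSON Schemas resolved via its Iglu registry; this hand-rolled checker just shows the shape of the idea):

```python
def validate(event):
    """Check a self-describing event's data against the schema it carries."""
    errors = []
    for field, expected in event["schema"]["fields"].items():
        if field not in event["data"]:
            errors.append(f"missing field '{field}'")
        elif not isinstance(event["data"][field], expected):
            errors.append(f"field '{field}': expected {expected.__name__}, "
                          f"got {type(event['data'][field]).__name__}")
    return errors  # an empty list means the event passes validation

good = {"schema": {"fields": {"total": float}}, "data": {"total": 19.99}}
bad  = {"schema": {"fields": {"total": float}}, "data": {"total": "19.99 GBP"}}
# validate(good) -> []
# validate(bad)  -> ["field 'total': expected float, got str"]
```

Because the check runs before any downstream processing, the malformed event is diverted with a precise error message rather than breaking a later step.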
Most of our competitors claim to support ‘schemaless’ event analytics, but as Martin Fowler made clear in his excellent presentation on Schemaless Data Structures:
schemaless structures still have an implicit schema … Any code that manipulates the data needs to make some assumptions about its structure, such as the name of fields. Any data that doesn’t fit this implicit schema will not be manipulated properly, leading to errors.
If you return to the initial list of sources of missing data, you will note in particular that limitations in processing power can lead to data loss.
Note that I have focused on issues at the beginning of the pipeline. As I explained earlier, once our data has been logged to S3, we have the ability to reprocess it, so any difficulties should not result in data loss at this stage.
The Snowplow pipeline makes extensive use of queues and autoscaling to minimize the chance of data loss due to the above issues:
7.1 Trackers queue data locally
7.2 Each collector instance uses a local queue before pushing data to the global queue
7.3 Collectors are distributed applications that autoscale to handle traffic spikes
This is really important. It means, for example, that if a user leaves a website before all the different asynchronous tags have fired, those tags will have the opportunity to fire when the user returns to the website at a later date. It means that if the collector is temporarily unavailable, the data will be cached locally and resent once the collector is back up. It means that in the event of a network connectivity issue between the tracker and the collector, no data is lost.
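A minimal sketch of that tracker-side behavior (the queue and the `send` callback here are illustrative stand-ins, not the actual tracker implementation):

```python
from collections import deque

class TrackerQueue:
    """Events stay queued locally until the collector acknowledges them."""
    def __init__(self, send):
        self.send = send          # stand-in for an HTTP POST to the collector
        self.pending = deque()

    def track(self, event):
        self.pending.append(event)
        self.flush()

    def flush(self):
        while self.pending:
            if not self.send(self.pending[0]):
                return              # collector unreachable: keep the event queued
            self.pending.popleft()  # drop an event only after a successful send

# Simulate a collector outage: sends fail, so events accumulate locally...
collector_up = False
q = TrackerQueue(send=lambda event: collector_up)
q.track("page_view")
q.track("add_to_cart")
backlog = len(q.pending)   # both events held locally during the outage

# ...then the collector recovers and the backlog drains on the next flush.
collector_up = True
q.flush()
```

The key design choice is that removal from the queue happens only after a successful send, so a crash, outage, or network drop at any point leaves the events queued for retry rather than lost.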
The Snowplow batch-based pipeline uses Amazon S3 as a processing queue. The Snowplow real-time pipeline uses Amazon Kinesis as a processing queue.
Prior to writing the data to these queues, the event data is persisted briefly on individual collector instances. In the case of the Clojure collector (used in the batch pipeline), the data is stored locally and flushed to S3 each hour; during that hour, it is persisted to disk (Amazon Elastic Block Store). For the Scala Stream Collector, the data is held in an in-memory queue, which is flushed to Kinesis at a configurable frequency.
In both cases, the data remains in the queue until it is successfully flushed. This means, for example, that Snowplow users in us-east-1 did not need to worry when, a few months ago, there was a brief Amazon S3 outage: during the outage the flush to S3 failed, but the data was persisted locally and pushed to S3 once the outage was resolved.
Rather than rely on individual boxes, both the Clojure Collector and Scala Stream Collector are distributed apps that run on multiple instances in an autoscaling group behind an Elastic Load Balancer.
Managing bad data is an essential, ongoing workstream for any data team that wants to build confidence in the insight developed on their data sets. Unfortunately, most solutions simply do not equip their users with the tools to identify data quality issues, let alone diagnose and resolve them. Bluntly, most commercial analytics and data pipeline solutions are black boxes: certainly, you can send in a handful of test data points and see that they come out the other side, and you can send data into two different pipelines and compare the output. But beyond that, you can only really choose to trust each solution, or not.
At Snowplow, we’ve taken a radically different approach: rather than burying bad data, we seek to surface it, so that it can be correctly handled.
Then get in touch!