Data quality

Say goodbye to contradictory reports. Trust your data.

Snowplow is architected from the ground up for data quality, so you don’t have to worry about bad or missing data.

Formal validation

All collected data is validated against its associated schema. You can make schemas as strict as you like, which makes it easy to proactively identify data quality issues, e.g. a value being stored in the wrong field.
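As a rough illustration of what schema validation with configurable strictness can look like, here is a minimal sketch in plain Python. The `SCHEMA` mapping, the field names, and the `validate` helper are all hypothetical and chosen for this example; they are not Snowplow's schema format or API.

```python
# Hypothetical schema for illustration: field name -> expected Python type.
# This is NOT Snowplow's schema format, just a stand-in for the idea.
SCHEMA = {"product_id": str, "quantity": int, "price": float}


def validate(event, schema, strict=True):
    """Return a list of readable problems; an empty list means the event is valid."""
    errors = []
    for field, expected in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif type(event[field]) is not expected:
            # Catches a value stored in the wrong field, e.g. a number
            # where an identifier string belongs.
            errors.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(event[field]).__name__}"
            )
    if strict:
        # A strict schema also rejects fields it does not know about.
        for field in sorted(event.keys() - schema.keys()):
            errors.append(f"unexpected field: {field}")
    return errors
```

With this sketch, an event whose `product_id` is the integer `42` produces the single problem `product_id: expected str, got int`, while passing `strict=False` tolerates extra fields that the schema does not mention.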

Lossless data pipeline

No events sent to the Snowplow pipeline are silently “dropped”. If there is an issue processing the data, we surface it with the associated error messages, so you can easily monitor and proactively identify data quality issues as they emerge, rather than after the fact.
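The idea of never silently dropping an event can be sketched as follows. This is an assumption-laden illustration, not the Snowplow implementation: the `process` function, the `transform` argument, and the shape of the failure records are hypothetical.

```python
def process(events, transform):
    """Route every event to exactly one of two outputs: good or bad.

    `transform` is any callable that processes a single event and raises
    an exception on failure (a hypothetical stand-in for a pipeline step).
    """
    good, bad = [], []
    for event in events:
        try:
            good.append(transform(event))
        except Exception as exc:
            # Keep the original payload together with the error message,
            # so the failure can be monitored and inspected later.
            bad.append(
                {"payload": event, "errors": [f"{type(exc).__name__}: {exc}"]}
            )
    return good, bad
```

Because each input lands in either `good` or `bad`, the two lists together always account for every event that was sent, which is the property that makes monitoring meaningful.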

Fully auditable

You have direct access to the data at every stage in the Snowplow data pipeline, enabling you to audit the data quality at each stage and validate that no data has been lost or incorrectly transformed.

Recover and reprocess bad data

You can recover and reprocess bad data, so data tracking issues need not result in gaps in your data collection.

From the Snowplow blog

We need to talk about bad data

No one in digital analytics talks about bad data. A lot about working with data is sexy, but managing bad data, i.e. working to improve data quality, is not. Not only is talking about bad data not sexy, it is really awkward, because it forces us to confront a hard truth.

Debugging bad data in GCP with BigQuery

One of the key features of the Snowplow pipeline is that it’s architected to ensure data quality up front: rather than spending a lot of time cleaning and making sense of the data before using it, schemas are defined in advance and used to validate data as it comes through the pipeline.

Debugging bad data in Elasticsearch and Kibana

One of the features that makes Snowplow unique is that we actually report bad data: any data that hits the Snowplow pipeline and fails to be processed successfully. This is incredibly valuable, because it means you can quickly spot data tracking issues as they emerge and address them at source.