Many customers and community members come to Snowplow specifically because they want accurate and complete data collection.
One of our core features is a lossless data pipeline: all data can be accounted for, even if it doesn’t make it all the way through to a data storage environment. More specifically, events coming into a Snowplow pipeline are validated so that only those matching the pre-defined expectation make it into the data warehouse. Any event that hits a processing issue on its way through the pipeline is likewise separated out, so that it cannot corrupt the data set in the downstream storage target.
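To make the idea concrete, here is a minimal Python sketch of validate-and-separate routing. It is an illustration only, not Snowplow's implementation: the `PAGE_VIEW_SCHEMA` expectation, the field names, and the `validate` and `route` helpers are all hypothetical, and a real pipeline would validate against full schema definitions rather than a simple type map.

```python
from typing import Any

# Hypothetical expectation for a "page view" event: field name -> required type.
PAGE_VIEW_SCHEMA: dict[str, type] = {
    "page_url": str,
    "user_id": str,
    "duration_ms": int,
}


def validate(event: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the event is valid."""
    problems = []
    for field, expected in schema.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            problems.append(f"wrong type for {field}: expected {expected.__name__}")
    return problems


def route(events: list[dict[str, Any]], schema: dict[str, type]):
    """Split events into (good, failed). Every input event lands in exactly one list."""
    good, failed = [], []
    for event in events:
        problems = validate(event, schema)
        if problems:
            # Keep the original event next to its errors so nothing is lost.
            failed.append({"event": event, "errors": problems})
        else:
            good.append(event)
    return good, failed
```

The point of the sketch is the invariant: the number of good events plus the number of failed events always equals the number of events received, so every event can be accounted for.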
Yet even with these features, ensuring complete and accurate web and mobile data is hard. Data quality issues often emerge when:
- Tracking is instrumented incorrectly, with key fields set wrong or missed altogether
- Tracking is accidentally broken because of a new release rollout or a change in tag management configuration
For many data teams the resulting data quality issues are often only identified once a graph or chart is computed on the data and shows something unusual. The source of the issue takes time and effort to diagnose. When it is finally identified, the issue is usually only fixed for newly incoming data. This means a business faces inaccurate or incomplete data for the entire time period the issue went undetected.
However, diagnosing and fixing these issues is critical. As businesses use web and mobile data to do more, such as power real-time applications, improve the user experience and inform critical product decisions, undetected data quality issues can lead to bad decisions. Unresolved data quality issues could lead to a loss of confidence in the data. And once a business loses trust in a data set, it is very hard to win it back.
Data Quality UI/API and notifications
At Snowplow, we have launched an improved toolset to make it easier for users to proactively monitor data quality and surface errors as soon as they happen. We actively validate every event processed by Snowplow against the schema definitions for the event and any entities attached to it, and customers can:
- Identify any failures to process data via the user interface
- Connect programmatically to fetch failed event information on a regular basis
- Subscribe to email notifications to be alerted of new failures not seen over the past 7 days
- Review diagnostic information to surface where the data was generated, when it started, and the nature of the error, making it easier to quickly diagnose and repair the issue
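The "new failures not seen over the past 7 days" rule in the list above can be sketched in a few lines of Python. This is an assumption about how such a rule could work, not a description of the actual notification service: the `new_failures` function, the `history` mapping, and the idea of keying a failure by a (schema, error) pair are hypothetical.

```python
from datetime import datetime, timedelta

# Lookback window taken from the "past 7 days" rule described above.
LOOKBACK = timedelta(days=7)


def new_failures(history: dict, current: list, now: datetime) -> list:
    """Return the failure keys in `current` that were not seen within LOOKBACK.

    `history` maps a failure key to the time it was last seen and is
    updated in place, so the next check treats these keys as known.
    """
    cutoff = now - LOOKBACK
    fresh = []
    for key in current:
        last_seen = history.get(key)
        if last_seen is None or last_seen < cutoff:
            fresh.append(key)
        history[key] = now
    return fresh
```

Because `history` is updated as the keys are processed, a failure that repeats within the same check, or again the next day, is reported only once until it has been absent for a full window.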
This new functionality enables Snowplow Insights customers to:
- Easily monitor for new data quality issues
- Diagnose the source of the issue and decide on severity to prioritize a fix accordingly
- Manage expectations of data users downstream
As a result, users can expect an increase in the overall quality of their data, allowing data teams to build higher levels of assurance in data accuracy and completeness, and enabling the broader business to use web and mobile data across more applications with greater confidence.
Under the hood
The new UI is powered by a complete refactoring of our core pipeline technology. Any data processing issue now produces a highly structured error, which lets us easily distinguish failures of “real data” from noise, such as requests generated by bots and spiders on the web or other requests hitting the Snowplow collector that do not represent real data.
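As a rough illustration of why structured errors help, the Python sketch below separates likely noise from genuine tracking failures using a type field on each failure record. The record shape, the `summarize` helper, and the category names in `NOISE_TYPES` are hypothetical and chosen for readability; they are not the pipeline's actual error format.

```python
from collections import Counter

# Hypothetical failure categories that usually indicate noise (bots, stray
# requests) rather than a tracking problem. A real pipeline defines its own.
NOISE_TYPES = {"unrecognized_request", "malformed_payload"}


def summarize(failures: list[dict]) -> tuple[Counter, Counter]:
    """Count failures by type, split into (real, noise)."""
    real, noise = Counter(), Counter()
    for failure in failures:
        bucket = noise if failure["type"] in NOISE_TYPES else real
        bucket[failure["type"]] += 1
    return real, noise
```

With free-text error messages this split would need fragile string matching; with a typed field it is a set lookup.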
- Visit the Snowplow Insights console to enable failed event monitoring if you haven’t already done so
- Learn more about accessing failed events through the UI or API
Not a Snowplow Insights customer yet? Get in touch with us here to learn more.
How does data quality impact your product and organization? We want to hear your story and feature it in our next blog post! Reach out to firstname.lastname@example.org if you’d like to share your experience.