Click 'Create table' and you're good to go.
Our dashboard will query the external table whenever we refresh. Since BigQuery charges per amount of data scanned, the next steps are to make sure we're limiting our data appropriately and creating a custom query. In this section we'll first briefly cover how that's done (again, you may have read this section of the last blog, in which case you can skip to 2.2), then we'll create our data source using a custom query which appropriately limits the amount of BigQuery data scanned per refresh.
The bad rows data in Cloud Storage is partitioned with a folder and filename structure which corresponds to the date and time at which the data was loaded to Cloud Storage. If we use the special
_FILE_NAME pseudo-column in a
WHERE clause, BigQuery will only scan those files in Cloud Storage which correspond to our limiting clause.
This can be done with a regular expression which extracts the timestamp from the filename and parses it to a date, along the lines of:

DATE(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%S',
  LTRIM(REGEXP_EXTRACT(_FILE_NAME,
    'output-[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+'), 'output-')))
Resulting in the query:
SELECT
  *
FROM
  bad_rows.bad_rows_external,
  UNNEST(errors) e
WHERE
  DATE(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%S',
    LTRIM(REGEXP_EXTRACT(_FILE_NAME,
      'output-[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+'), 'output-')))
    >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
This will only scan the last week's data, and the query results will contain all bad row data which was loaded to Cloud Storage in the last week. (Note that this won't exactly correspond to the failure_tstamp, but it's a good enough proxy for any practical use case - unless there's been a failure in loading the bad rows data at some point.)
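Before building a dashboard on top of this table, it's worth sense-checking how much bad row data actually arrives per day, since that drives both scan costs and how wide a date window you can afford. A rough per-day row count can be sketched as follows - this reuses the table and filename pattern assumed above, and is a sketch rather than a query from the original post:

```sql
-- Hedged sketch: approximate bad rows per load date, parsed from _FILE_NAME,
-- to sense-check daily volumes (and therefore scan costs).
SELECT
  DATE(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%S',
    LTRIM(REGEXP_EXTRACT(_FILE_NAME,
      'output-[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+'), 'output-'))) AS load_date,
  COUNT(*) AS bad_rows
FROM
  bad_rows.bad_rows_external
WHERE
  DATE(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%S',
    LTRIM(REGEXP_EXTRACT(_FILE_NAME,
      'output-[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+'), 'output-')))
    >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
GROUP BY 1
ORDER BY 1
```

If the daily counts run into the millions, consider narrowing the date window in your custom query.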
Navigate to Data Studio, go to the data sources tab and click the + button to add a BigQuery connector for the data warehouse. From here, choose the custom query tab on the left, find the bad_rows BigQuery dataset, and enter the custom query.
The below query's results will be limited to the last 14 days' bad rows loaded into Cloud Storage. For most use cases, daily volumes will be low enough that this won't make a material impact on your bill - but be aware that if a lot of data is failing validation, this may not be the case. It's best to either run queries in BigQuery to check how many bad rows are coming in per day, or look at the file sizes in the bad rows bucket as a manual sense-check (anything above a few MB per file is starting to look abnormally large, assuming files load every 5 minutes).
SELECT
  e.message,
  failure_tstamp
FROM
  bad_rows.bad_rows_external,
  UNNEST(errors) e
WHERE
  DATE(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%S',
    LTRIM(REGEXP_EXTRACT(_FILE_NAME,
      'output-[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+'), 'output-')))
    >= DATE_SUB(CURRENT_DATE, INTERVAL 14 DAY)
This query also unnests the errors array and extracts the error message. We're ignoring the line field here, since decoding it in SQL is laborious and time consuming, and for monitoring purposes all we really need is a count of rows per error over time.
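As a sketch of the aggregation the dashboard will effectively perform, the same custom query can be rolled up to daily counts per error message - this is an illustrative variant, not a query from the original post:

```sql
-- Hedged sketch: daily bad-row counts per error message,
-- using the same external table and _FILE_NAME date filter as above.
SELECT
  DATE(failure_tstamp) AS failure_date,
  e.message,
  COUNT(*) AS bad_row_count
FROM
  bad_rows.bad_rows_external,
  UNNEST(errors) e
WHERE
  DATE(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%S',
    LTRIM(REGEXP_EXTRACT(_FILE_NAME,
      'output-[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+'), 'output-')))
    >= DATE_SUB(CURRENT_DATE, INTERVAL 14 DAY)
GROUP BY 1, 2
ORDER BY 1, 3 DESC
```

Letting Data Studio do this aggregation from the raw rows keeps the data source flexible, but running the rolled-up version in BigQuery first is a quick way to preview what the charts will show.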
Once you’ve clicked the connect button, your data source should now be available in the Data Sources tab in Data Studio. Click on it and choose ‘Generate Report’ to get started on visualizing your bad rows. You’ll be prompted to add the new data source to your report - accept that and you’re ready to build your dashboard.
Data Studio is an amazing visualization tool - just choose the chart you're looking to build, drag and drop it in the size you want onto the canvas, and choose the measures and dimensions you're interested in.
Bad row counts per error

First, let's create a table which displays error messages and a count of rows. Choose the table option from the toolbar, drag it to the size you want, and choose errors.message as a dimension and Record Count as a metric. You'll notice that there are a lot of errors we can safely disregard in there (as explained above) - scroll to the bottom of the data tab to add a filter which excludes message values starting with the offending errors.
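If you'd rather exclude those rows before they ever reach Data Studio, the same filter can be sketched in the custom query itself - note the error prefix below is a hypothetical placeholder, to be replaced with whichever messages you're disregarding:

```sql
-- Hedged sketch: filter ignorable errors in the custom query instead of in
-- Data Studio. 'some ignorable error prefix' is a hypothetical placeholder.
SELECT
  e.message,
  failure_tstamp
FROM
  bad_rows.bad_rows_external,
  UNNEST(errors) e
WHERE
  DATE(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%S',
    LTRIM(REGEXP_EXTRACT(_FILE_NAME,
      'output-[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+'), 'output-')))
    >= DATE_SUB(CURRENT_DATE, INTERVAL 14 DAY)
  AND NOT STARTS_WITH(e.message, 'some ignorable error prefix')
```

The trade-off is that a Data Studio filter can be toggled per chart, while a query-level filter reduces the data every chart sees.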
Bad rows over time

For this visualization I chose a bar chart, sorted by failure_tstamp, and again applied the filter we created in the last step. Now we can easily see if there's a spike in bad rows over time - which would suggest that some tracking or schema change has caused issues. By choosing message as a dimension and selecting Stacked Bar chart in the Style tab, we can also easily see how big a problem each error is.
Bad rows per error over time

Choosing a line chart with the same dimensions and metrics as above, and adding errors.message as a breakdown dimension, allows us to see bad rows per error message over time. A spike in all errors at once suggests a significant uptick in traffic (unless someone has made a global change to all tracking or all schemas); a spike in just one suggests an issue with a particular event; a new line appearing suggests an issue with a newly set up event.
The end result is a Data Studio report which can be used to identify problems with validation in real-time. Every time you refresh the report, Data Studio will query the external table, and update the visualizations with the latest data.