The two applications communicate through the typesTopic. The Loader writes to that topic all of the types that it has encountered; the Mutator then reads from that topic to perform mutation of the events table as necessary. The Mutator should be constantly running and consuming Pub/Sub messages.

Along with the typesTopic, the Loader makes use of two other Pub/Sub topics:

- badRows - rows that for some reason couldn't be transformed into BigQuery format. These could be caused by an Iglu registry outage, or by an unexpected schema patch or overwrite. This data closely resembles the “shredded bad” data generated by our RDB Shredder for Redshift, and contains the reason for the failure along with the raw enriched JSON
- failedInserts - rows that passed transformation, but for some reason failed during the actual insertion stage. Unlike badRows data, these records unfortunately do not contain the reason for the failure - they are in ready-to-be-inserted BigQuery row format. The main source of failed inserts is the short period of time between the Loader processing the first event with a new schema and the Mutator performing the necessary mutation
“Bad rows” and “failed inserts” thus have different formats, causes and recovery strategies.
Bad rows should be extremely rare. In order to recover them, one needs to sink the data to Cloud Storage (we recommend using our snowplow-google-cloud-storage-loader) and apply an appropriate recovery strategy depending on the root cause. Stay tuned for the release of snowplow-event-recovery, designed to do just this.
Failed inserts, in turn, can usually just be forwarded to BigQuery using the auxiliary BigQuery Forwarder job, sketched below. If they were caused simply by the Mutator's delay, BigQuery will accept these rows the second time around.
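As a rough sketch, the Forwarder is submitted to Dataflow much like the Loader itself; the flag names below are illustrative assumptions rather than the definitive interface - the wiki documents the exact options:

```bash
# Illustrative only: re-insert failed inserts as a Dataflow job.
# $CONFIG and $RESOLVER are the same values used for the Loader;
# the --failedInsertsSub flag name is an assumption - check the wiki.
./snowplow-bigquery-forwarder \
  --config=$CONFIG \
  --resolver=$RESOLVER \
  --runner=DataflowRunner \
  --failedInsertsSub=$FAILED_INSERTS_SUB
```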
Note that Pub/Sub has a retention time of 7 days. After this time, messages will be silently dropped. Therefore, we recommend sinking these topics to Cloud Storage to prevent data loss.
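For example, here is a hedged sketch of sinking the badRows topic with snowplow-google-cloud-storage-loader; the subscription, bucket and flag values are placeholders, so defer to that project's documentation for the exact interface:

```bash
# Illustrative sketch: continuously sink a Pub/Sub subscription to GCS.
# All names below are placeholders; windowDuration here assumes minutes.
./bin/snowplow-google-cloud-storage-loader \
  --runner=DataflowRunner \
  --project=$PROJECT \
  --inputSubscription=projects/$PROJECT/subscriptions/bq-bad-rows-sub \
  --outputDirectory=gs://$BUCKET/bad-rows/ \
  --windowDuration=5
```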
Setup of the Snowplow BigQuery Loader is relatively straightforward: create the required Pub/Sub topics, write a configuration file, initialize the Mutator, and submit the Loader job to Cloud Dataflow.
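For instance, the topics and subscriptions can be created with gcloud (the names are placeholders - use whatever matches your configuration file):

```bash
# Placeholder names - create the three topics the Loader relies on,
# plus a subscription for the Mutator. The Loader also needs a
# subscription on your enriched events topic.
gcloud pubsub topics create bq-types bq-bad-rows bq-failed-inserts
gcloud pubsub subscriptions create bq-types-sub --topic=bq-types
```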
Both Mutator and Loader use the same self-describing JSON configuration file with this schema:
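```
iglu:com.snowplowanalytics.snowplow.storage/bigquery_config/jsonschema/1-0-0
```

(Should the schema evolve in future releases, the wiki linked below remains the canonical reference.)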
Here is an example of what such a configuration can look like (all names, IDs and topic values below are placeholders to adapt to your own setup):
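```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/bigquery_config/jsonschema/1-0-0",
  "data": {
    "name": "Acme BigQuery pipeline",
    "id": "31b1559d-d319-4023-aaae-97698238d808",
    "projectId": "acme-snowplow",
    "datasetId": "snowplow",
    "tableId": "events",
    "input": "enriched-good-sub",
    "typesTopic": "bq-types",
    "typesSubscription": "bq-types-sub",
    "badRows": "bq-bad-rows",
    "failedInserts": "bq-failed-inserts",
    "load": {
      "mode": "STREAMING_INSERTS",
      "retry": false
    },
    "purpose": "ENRICHED_EVENTS"
  }
}
```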
For more information on these configuration properties, check out the Loader’s wiki.
You can initialize the Mutator like this (a sketch - consult the wiki for the exact flags; $CONFIG and $RESOLVER stand for your configuration and Iglu resolver):
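```bash
# One-off: create the empty events table in BigQuery.
./snowplow-bigquery-mutator create --config $CONFIG --resolver $RESOLVER

# Long-running: listen to the typesTopic and alter the table as needed.
./snowplow-bigquery-mutator listen --config $CONFIG --resolver $RESOLVER
```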
Then you can submit the Loader itself to Cloud Dataflow like so (again a sketch - any standard Dataflow pipeline options can be appended):
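```bash
# Illustrative submission; --runner, --project and --tempLocation are
# standard Dataflow pipeline options, the rest mirrors the Mutator.
./snowplow-bigquery-loader \
  --config=$CONFIG \
  --resolver=$RESOLVER \
  --runner=DataflowRunner \
  --project=$PROJECT \
  --tempLocation=gs://$BUCKET/tmp
```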
This is the first public release of BigQuery Loader, and it can be considered stable and reliable enough for most production use cases.
It has performed well in our internal testing program, but many things are still subject to change. Upcoming changes will most likely be focused on the following aspects:
- Schema versioning - currently, each full schema version gets its own dedicated column, e.g. contexts_com_acme_product_context_1_0_0. We think that this model is enough for many use cases, but not optimal for data models which make heavy use of self-describing data with regularly evolving schemas (see the query sketch after this list). We're thus considering MODEL-based versioning (e.g. contexts_com_acme_product_context_1) using table-sharding for the next major version of the Loader - but are still open to suggestions
- Observability - currently, the only insight into the Loader's progress comes from the typesTopic introduced above. This makes it very hard to reason about how BigQuery loading is proceeding, so we are looking for more sophisticated solutions going forwards
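To illustrate the versioning point with a hypothetical query (the product context and its sku field are invented for this example): under full-version columns, reading one logical field across two schema versions means coalescing across two REPEATED RECORD columns, whereas MODEL-based versioning would collapse them into a single contexts_com_acme_product_context_1 column:

```sql
-- Hypothetical schema evolution from 1-0-0 to 1-0-1: each version
-- currently lands in its own column, so queries must span both.
SELECT
  event_id,
  COALESCE(
    contexts_com_acme_product_context_1_0_0[SAFE_OFFSET(0)].sku,
    contexts_com_acme_product_context_1_0_1[SAFE_OFFSET(0)].sku
  ) AS sku
FROM snowplow.events
```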
You’ll find documentation for the BigQuery Loader on the project’s wiki.
For more details on this release, as always do check out the release notes on GitHub.
And if you have any questions or run into any problems, please visit our Discourse forum.