We are thrilled to announce version 0.13.0 of Relational Database Loader, our Snowplow component that lets you load your data into relational databases such as Redshift and PostgreSQL.
This release marks the migration of our RDB Loader and RDB Shredder apps from part of the
snowplow/snowplow “mono-repo” into an independent project with its own release cadence.
In this post, we will cover:
1. Dedicated repository
Historically, the main snowplow/snowplow repository contained all components required to load Snowplow enriched events into different databases.
This approach worked relatively well while Amazon Redshift was our primary storage target, but we’re steadily moving towards supporting multiple databases as first-class citizens in the Snowplow pipeline.
As a result, we are starting to publish new standalone Snowplow loaders in their own repositories: the first was Elasticsearch Loader, we are adding RDB Loader, and there are more to come!
The dedicated repositories have a few advantages:
- De-couple loader releases from the “full-fat” Snowplow release
- Emphasize the loosely-coupled architecture of Snowplow, encouraging a clear separation of concerns
- Faster to review and accept community contributions
- Easier to track the Git history of the component
Note that, alongside the RDB Loader itself, the RDB Shredder (written as a Spark job), now also resides in the new repository. The two apps were always tightly coupled, sharing a lot of logic.
2. Single folder load
It’s sometimes useful for a Snowplow pipeline operator to load a single archived directory of enriched events into a Redshift database – for example as part of a pipeline recovery process.
Before 0.13.0, the only way to do this was to copy or move the directory from the archive to
shredded.good, and then launch the pipeline without the Spark Enrich and RDB Shredder jobs, pretending that they’re already completed. But file moves like these are slow and error-prone, so we are now introducing a new
--folder option, which allows you to load exactly one directory using only RDB Loader.
This is not yet supported from within EmrEtlRunner, so you have to either run it locally using a helper script, or write a Dataflow Runner playbook with Base64-encoded config files instead of file paths.
Here is an invocation example:
You can find a full wrapper Bash script in our recent discourse post.
Note that RDB Loader uses the AWS Credential Provider Chain, which means RDB Loader will use credentials provided at user-level (such as
~/.aws/credentials file of environment variables) credentials, not the ones in
3. Dry run
Another new feature, especially useful for pipeline recovery, is the new
--dry-run option. This will provide you with a full list of all the SQL statements that would be executed to perform the load, were dry run mode not enabled.
Statements will be printed to standard output along with other important debug information.
You can use this to inspect these statements, or potentially to tweak them and execute them manually, if you need to work around a load issue.
4. Other changes
As part of a regular run, RDB Loader performs data discovery at least twice to make sure that S3 provides consistent results, and no “ghost files” are lingering which will break the
If you don’t need this (for example when performing a single folder load), you can skip this now, along with other steps, by adding the
--skip consistency_check option:
Finally, one important bug was fixed. This flaw in the way that JSON Paths files were cached lead to excessive S3 requests, which could significantly slow down the shredded-type discovery process.
The primary way to run RDB Loader is still via Snowplow’s own EmrEtlRunner, Release 90 and above. You will need to update your
If you have any questions or run into any problems,
please visit our Discourse forum.