RDB Loader 0.13.0 released

06 September 2017  •  Anton Parkhomenko

We are thrilled to announce version 0.13.0 of Relational Database Loader, our Snowplow component that lets you load your data into relational databases such as Redshift and PostgreSQL.

This release marks the migration of our RDB Loader and RDB Shredder apps from part of the snowplow/snowplow “mono-repo” into an independent project with its own release cadence.

In this post, we will cover:

  1. Dedicated repository
  2. Single folder load
  3. Dry run
  4. Other changes
  5. Upgrading
  6. Contributing

1. Dedicated repository

Historically, the main snowplow/snowplow repository contained all components required to load Snowplow enriched events into different databases.

This approach worked relatively well while Amazon Redshift was our primary storage target, but we’re steadily moving towards supporting multiple databases as first-class citizens in the Snowplow pipeline.

As a result, we are starting to publish new standalone Snowplow loaders in their own repositories: the first was Elasticsearch Loader, we are adding RDB Loader, and there are more to come!

The dedicated repositories have a few advantages:

  1. De-couple loader releases from the “full-fat” Snowplow release
  2. Emphasize the loosely-coupled architecture of Snowplow, encouraging a clear separation of concerns
  3. Faster to review and accept community contributions
  4. Easier to track the Git history of the component

Note that, alongside the RDB Loader itself, the RDB Shredder (written as a Spark job), now also resides in the new repository. The two apps were always tightly coupled, sharing a lot of logic.

2. Single folder load

It’s sometimes useful for a Snowplow pipeline operator to load a single archived directory of enriched events into a Redshift database - for example as part of a pipeline recovery process.

Before 0.13.0, the only way to do this was to copy or move the directory from the archive to shredded.good, and then launch the pipeline without the Spark Enrich and RDB Shredder jobs, pretending that they’re already completed. But file moves like these are slow and error-prone, so we are now introducing a new --folder option, which allows you to load exactly one directory using only RDB Loader.

This is not yet supported from within EmrEtlRunner, so you have to either run it locally using a helper script, or write a Dataflow Runner playbook with Base64-encoded config files instead of file paths.

Here is an invocation example:

$ java -jar $JARFILE \
  --config $BASE64_CONFIG \
  --target $BASE64_TARGET \
  --resolver $BASE64_RESOLVER \
  --folder s3://com-acme-snowplow/archive/shredded/run=2017-09-05-13-30-22 \
  --logkey s3//com-acme-snowplow/log/rdb-loader/$(uuid) 

You can find a full wrapper Bash script in our recent discourse post.

Note that RDB Loader uses the AWS Credential Provider Chain, which means RDB Loader will use credentials provided at user-level (such as ~/.aws/credentials file of environment variables) credentials, not the ones in config.yml.

3. Dry run

Another new feature, especially useful for pipeline recovery, is the new --dry-run option. This will provide you with a full list of all the SQL statements that would be executed to perform the load, were dry run mode not enabled.

Statements will be printed to standard output along with other important debug information.

You can use this to inspect these statements, or potentially to tweak them and execute them manually, if you need to work around a load issue.

4. Other changes

As part of a regular run, RDB Loader performs data discovery at least twice to make sure that S3 provides consistent results, and no “ghost files” are lingering which will break the COPY statements.

If you don’t need this (for example when performing a single folder load), you can skip this now, along with other steps, by adding the --skip consistency_check option:

$ java -jar $JARFILE \
  --skip consistency_check \
  --config $BASE64_CONFIG \
  --target $BASE64_TARGET \
  --resolver $BASE64_RESOLVER \
  --folder s3://com-acme-snowplow/archive/shredded/run=2017-09-05-13-30-22 \
  --logkey s3//com-acme-snowplow/log/rdb-loader/$(uuid) 

Finally, one important bug was fixed. This flaw in the way that JSON Paths files were cached lead to excessive S3 requests, which could significantly slow down the shredded-type discovery process.

5. Upgrading

The primary way to run RDB Loader is still via Snowplow’s own EmrEtlRunner, Release 90 and above. You will need to update your config.yml:

storage:
  versions:
    rdb_loader: 0.13.0        # WAS 0.12.0

6. Contributing

You can check out the repository and the open issues if you’d like to get involved!

If you have any questions or run into any problems, please visit our Discourse forum.