Bulk loading data from Amazon S3 into Redshift at the command line

On Friday Amazon launched Redshift, a fully managed, petabyte-scale data warehouse service. Since then, we’ve been busy building out Snowplow support for Redshift, so that Snowplow users can use it to store their granular, customer-level and event-level data for OLAP analysis.

In the course of building out Snowplow support for Redshift, we needed to bulk load data stored in S3 into Redshift programmatically. Unfortunately, the Redshift Java SDK is very slow at executing inserts, so it is not suitable for bulk loading. We found a simple workaround that might be helpful for anyone who wishes to bulk load data into Redshift from S3, and have documented it below.

An overview of the workaround

Amazon enables users to bulk load data from S3 into Redshift by executing queries with the following form:

copy events from 's3://$MY_BUCKET/PATH/TO/FILES/FOR/UPLOAD' credentials 'aws_access_key_id=$ACCESS_KEY;aws_secret_access_key=$SECRET_ACCESS_KEY' delimiter '\t';

However, these queries can only be executed by a SQL client using a JDBC or ODBC driver compatible with Redshift. (Links to those drivers can be found here.)

To orchestrate bulk loading programmatically, we used JiSQL, a Java-based command-line tool for executing SQL queries via a JDBC driver. JiSQL lets us specify the Redshift-compatible JDBC driver to use when establishing the connection. This meant we could upgrade our Ruby StorageLoader to execute the relevant command-line syntax to initiate regular data loads of Snowplow data from S3 into Redshift.

Using JiSQL to bulk load data from S3 to Redshift at the command line: a step-by-step guide

1. Download and install JiSQL
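For example, a minimal sketch — the download URL below is a placeholder, so substitute the actual link from the JiSQL project page:

$ wget http://path/to/jisql-2.0.11.zip    # placeholder URL - use the real JiSQL download link
$ unzip jisql-2.0.11.zip -d jisql
$ ls jisql/lib/                           # should include jisql-2.0.11.jar and jopt-simple-3.2.jar, used in the classpath below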

2. Download the Redshift-compatible JDBC driver
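Redshift is compatible with PostgreSQL JDBC drivers; the examples in this post use the PostgreSQL 8.4 JDBC4 driver. A quick sketch, assuming the usual jdbc.postgresql.org download path (verify the URL before running):

$ wget http://jdbc.postgresql.org/download/postgresql-8.4-703.jdbc4.jar -P lib/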

3. Identify the JDBC URL for your Redshift cluster

In the AWS Console, go to the Redshift section and select the cluster you want to load data into. A window will appear with details about the cluster, including the JDBC URL.

[Screenshot: cluster details in the AWS Console, showing the JDBC URL]
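The JDBC URL takes the form below; the cluster endpoint shown is the example cluster used throughout this post:

jdbc:postgresql://<cluster-endpoint>:<port>/<database-name>
jdbc:postgresql://snowplow.cjbccnwghslt.us-east-1.redshift.amazonaws.com:5439/snplow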

4. Initiate your bulk load of data at the command line

$ java -cp lib/jisql-2.0.11.jar:lib/jopt-simple-3.2.jar:lib/postgresql-8.4-703.jdbc4.jar com.xigole.util.sql.Jisql -driver postgresql -cstring jdbc:postgresql://snowplow.cjbccnwghslt.us-east-1.redshift.amazonaws.com:5439/snplow -user $USERNAME -password $PASSWORD -c \; -query "copy events from 's3://$MY_BUCKET/PATH/TO/FILES/FOR/UPLOAD' credentials 'aws_access_key_id=$ACCESS_KEY;aws_secret_access_key=$SECRET_ACCESS_KEY' delimiter '\t';"

Some notes about the above query:

- $USERNAME and $PASSWORD are the master username and password set when the Redshift cluster was launched.
- The -c \; argument tells JiSQL to treat an escaped semicolon as the command terminator (JiSQL's default terminator is "go").
- delimiter '\t' indicates that the files being loaded are tab-delimited, which is the format Snowplow event files are stored in.
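Once the load has completed, you can check the results with the same tool. A quick sketch, reusing the connection details from the copy command above:

$ java -cp lib/jisql-2.0.11.jar:lib/jopt-simple-3.2.jar:lib/postgresql-8.4-703.jdbc4.jar com.xigole.util.sql.Jisql -driver postgresql -cstring jdbc:postgresql://snowplow.cjbccnwghslt.us-east-1.redshift.amazonaws.com:5439/snplow -user $USERNAME -password $PASSWORD -c \; -query "select count(*) from events;"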

A note about bulk loading data from S3 into Redshift: the copy command loads every file found under the S3 path you specify, and does so in parallel across the nodes in your cluster. Splitting your data across multiple files will therefore load considerably faster than a single large file.

Happy bulk loading from the command line!
