Snowplow 0.8.8 released with Postgres and Hive support

05 August 2013  •  Alex Dean

We are pleased to announce the immediate release of Snowplow 0.8.8. This is a big release for us: it adds the ability to store your Snowplow events in the popular PostgreSQL open-source database. This has been the most requested Snowplow feature all summer, so we are delighted to finally release it.

And if you are already happily using Snowplow with Redshift, there are two other new features to check out:

  1. We have added support for multiple storage targets to Snowplow’s StorageLoader. This means that you can configure StorageLoader to load into three different Redshift databases, one PostgreSQL database and one Redshift - whatever
  2. We have brought back the ability to query your Snowplow events using HiveQL. Regardless of which storage target(s) you are using, you can now also run HiveQL queries against your Snowplow events stored in Amazon S3

As well as these new features, we have made a large number of improvements across Snowplow:

  • We have made some improvements to the Hadoop-based Enrichment process (bumped to version 0.3.3)
  • We have simplified EmrEtlRunner and its configuration file format
  • We have improved the performance of the Redshift loading code
  • We have added a configuration option for setting MAXERROR when loading into Redshift (see the Redshift COPY documentation for details)
  • We have moved the Snowplow JavaScript Tracker into its own repository
  • We have removed the deprecated Hive ETL and Infobright folders from the repository

After the fold, we will cover the options for upgrading and using the new functionality:

  1. Upgrading
  2. Loading events into Postgres
  3. Querying events with HiveQL
  4. Getting help

1. Upgrading

There are three components to upgrade in this release:

  • The Hadoop ETL, to version 0.3.3
  • EmrEtlRunner, to version 0.4.0, and its configuration file
  • StorageLoader, to version 0.1.0, and its configuration file

Let’s take these in turn:

Hadoop ETL

If you are using EmrEtlRunner, you need to update your configuration file, config.yml, to use the latest version of the Hadoop ETL:

:snowplow:

  :hadoop_etl_version: 0.3.3 # Version of the Hadoop ETL

Read on for the rest of the changes you will need to make to config.yml.

EmrEtlRunner

You need to upgrade your EmrEtlRunner installation to the latest code (0.8.8 release) on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ git checkout 0.8.8
$ cd snowplow/3-enrich/emr-etl-runner
$ bundle install --deployment

Next, you need to update the format of your config.yml - the format has been simplified significantly. The new format (as found on GitHub) looks like this:

:aws:

  :access_key_id: ADD HERE

  :secret_access_key: ADD HERE

:s3:

  :region: ADD HERE

  :buckets:

    :assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket

    :log: ADD HERE

    :in: ADD HERE

    :processing: ADD HERE

    :out: ADD HERE WITH SUB-FOLDER # e.g. s3://my-out-bucket/events

    :out_bad_rows: ADD HERE        # e.g. s3://my-out-bucket/bad-rows

    :out_errors: ADD HERE # Leave blank unless :continue_on_unexpected_error: set to true below

    :archive: ADD HERE

:emr:

  # Can bump the below as EMR upgrades Hadoop

  :hadoop_version: 1.0.3

  :placement: ADD HERE

  :ec2_key_name: ADD HERE

  # Adjust your Hadoop cluster below

  :jobflow:

    :master_instance_type: m1.small

    :core_instance_count: 2

    :core_instance_type: m1.small

    :task_instance_count: 0 # Increase to use spot instances

    :task_instance_type: m1.small

    :task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances

:etl:

  :job_name: SnowPlow ETL # Give your job a name

  :hadoop_etl_version: 0.3.3 # Version of the Hadoop ETL

  :collector_format: cloudfront # Or 'clj-tomcat' for the Clojure Collector

  :continue_on_unexpected_error: false # You can switch to 'true' (and set :out_errors: above) if you really don't want the ETL throwing exceptions

Note that the :snowplow: section has been mostly removed, with :hadoop_etl_version: moving into the :etl: section.

StorageLoader

You need to upgrade your StorageLoader installation to the latest code (0.8.8 release) on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ git checkout 0.8.8
$ cd snowplow/4-storage/storage-loader
$ bundle install --deployment

Next, you need to update the format of your config.yml - the format has been updated to support multiple storage targets. The new format (as found on GitHub) looks like this:

:aws:

  :access_key_id: ADD HERE

  :secret_access_key: ADD HERE

:s3:

  :region: ADD HERE # S3 bucket region must be the same as Redshift cluster region

  :buckets:

    :in: ADD HERE # Must be s3:// not s3n:// for Redshift

    :archive: ADD HERE

:download:

  :folder: # Not required for Redshift

:targets:

  - :name: "My Redshift database"

    :type: redshift

    :host: ADD HERE # The endpoint as shown in the Redshift console

    :database: ADD HERE # Name of database

    :port: 5439 # Default Redshift port

    :table: events

    :username: ADD HERE

    :password: ADD HERE

    :maxerror: 1 # Stop loading on first error, or increase to permit more load errors

Note the new :maxerror: setting - see the Redshift COPY documentation for more on this.

To add another Redshift storage target, just add another configuration block under :targets:, starting with - :name:.

To add a new Postgres storage target, read on…

2. Loading events into Postgres

Loading events into Postgres is quite straightforward:

  1. Upgrade to the latest version of Snowplow as described above
  2. If you don’t have one already, setup a Postgres database server
  3. Create a snowplow database within Postgres
  4. Deploy the Snowplow schema and table into snowplow
  5. Configure StorageLoader to load into Postgres

For help on steps 2-4, please see our new guide, Setting up PostgreSQL. You can find the new PostgreSQL script on GitHub, here.

For step 5, you should create a StorageLoader configuration file which looks like this (as found on GitHub):

:aws:

  :access_key_id: ADD HERE

  :secret_access_key: ADD HERE

:s3:

  :region: ADD HERE # S3 bucket region

  :buckets:

    :in: ADD HERE

    :archive: ADD HERE

:download:

  :folder: ADD HERE # Postgres-only config option. Where to store the downloaded files

:targets:

  - :name: "My PostgreSQL database"

    :type: postgres

    :host: ADD HERE # Hostname of database server

    :database: ADD HERE # Name of database

    :port: 5432 # Default Postgres port

    :table: atomic.events

    :username: ADD HERE

    :password: ADD HERE

    :maxerror: # Not required for Postgres

Make sure to set :folder: to a local directory where you can download the Snowplow event files to, ready for loading into your local Postgres database server.

3. Querying events with HiveQL

The new release makes it possible again to query your Snowplow events directly on Amazon S3 using HiveQL.

Steps are as follows:

  1. Upgrade to the latest version of Snowplow as described above
  2. Wait for Snowplow to run at least once following the upgrade
  3. Follow the instructions in our updated guide, Running Hive using the command line tools

If you want to run HiveQL queries across your historical event data (i.e. from before the upgrade), this is possible too. You will need to rename the timestamped folders in your event archive from the old format to the new format, by prepending run=. So for example, change:

s3://my-snowplow-archive/events/2013-07-01-04-00-03

To:

s3://my-snowplow-archive/events/run=2013-07-01-04-00-03

One this is done for all folders, all of your historic event files should be correctly partitioned ready for Hive to query.

4. Getting help

As always, if you do run into any issues or don’t understand any of the above changes, please raise an issue or get in touch with us via the usual channels.

For more details on this release, please check out the 0.8.8 Release Notes on GitHub.