Snowplow 0.7.6 released with Redshift data warehouse support

We’re excited to announce the immediate release of Snowplow version 0.7.6 with support for storing your Snowplow events in Amazon Redshift.

We were very excited when Amazon announced Redshift back in late 2012, and we have been working to integrate Snowplow data since Redshift became generally available two weeks ago. Our tests with Redshift since launch have not disappointed – and we can’t wait to see what the Snowplow community do with the new platform!

In this post we will cover:

  1. Why Redshift is a great fit for Snowplow data
  2. Changes in this version
  3. Setting up Snowplow for Redshift
  4. Upgrading for Infobright/Hive users
  5. Roadmap and next steps
  6. Getting help

Read on below the fold to find out more.

1. Why Amazon Redshift is a great fit for Snowplow data

Snowplow datasets get big very quickly: we store at least one line of data for every single event that occurs on your website or application. Our largest users are recording 100M+ events every day, which works out to over 36 billion rows a year.

Whereas traditional web analytics packages deal with this volume by aggregating data to feed pre-cut reports, we built Snowplow specifically to maintain event-level granularity, because that granularity is what makes deep, flexible analysis possible.

As a result, we built Snowplow on technologies like Hadoop and Hive from the get-go to enable Snowplow users to record and analyse massive volumes of event data.

The trouble with Hadoop and Hive is that they are not great tools for OLAP analysis. As a result, we added support for Infobright Community Edition: an open-source columnar database that you deploy yourself and that scales to terabytes.

With Amazon Redshift, we now support a columnar database that scales to petabytes. Not only that, but unlike Infobright, Redshift is a fully managed service, so there is no database server for you to deploy or maintain yourself.

Please read on to find out how to get started with Snowplow and Redshift.

2. Changes in this version

This version makes a set of changes to Snowplow to add support for Redshift; it is important to understand these changes even if you are using our Hive or Infobright storage options and are not interested in using Redshift.

The main change is the addition of Redshift as a new storage format alongside our existing Hive and Infobright options; the configuration updates this requires are detailed in the setup and upgrade sections below.

Additionally we have fixed two bugs in this version:

  1. Our Bash files for automating EmrEtlRunner and/or StorageLoader had a bug where the BUNDLE_GEMFILE configuration lines did not end in /Gemfile. This has now been fixed (see the sketch below the list). Many thanks to Eric Zimmerman for reporting this!
  2. We have widened the field storing the raw useragent string in Infobright (and Redshift) to 1000 characters, because 500 characters wasn't enough.
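
To illustrate the first fix, here is a hypothetical sketch; the real path and runner arguments depend on your installation, but it shows the shape of the corrected line:

```bash
# Hypothetical paths, for illustration only.
# Before the fix, BUNDLE_GEMFILE pointed at the directory rather than the Gemfile:
#   BUNDLE_GEMFILE=/path/to/emr-etl-runner
# After the fix, the variable ends in /Gemfile, as Bundler expects:
BUNDLE_GEMFILE=/path/to/emr-etl-runner/Gemfile bundle exec bin/snowplow-emr-etl-runner --config config.yml
```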

If you are a Snowplow Hive/Infobright user with no interest in Redshift, please jump to Upgrading for Infobright/Hive users for information on how to upgrade.

3. Setting up Snowplow for Redshift

This is a relatively simple process, which is fully documented on our wiki:

  1. Set up a Redshift cluster
  2. Configure EmrEtlRunner to output Snowplow events in Redshift format
  3. Configure StorageLoader to load Snowplow events into Redshift
  4. (Optional) Connect Chartio to Snowplow data in Redshift

Once you have completed these steps, you should have a Snowplow event-stream data warehouse set up in Redshift!
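
For orientation, here is a sketch of what the Redshift section of the StorageLoader config.yml might look like. This is an assumption modelled on the Infobright example in section 4.3 below rather than a verbatim copy of the documented format, so treat the wiki pages linked above as authoritative; port 5439 is simply Redshift's default:

```yaml
:storage:
  :type: redshift        # Assumed value, by analogy with 'infobright' in section 4.3
  :host: ADD IN HERE     # Your Redshift cluster endpoint
  :database: ADD IN HERE
  :port: 5439            # Redshift's default port
  :table: events
  :username: ADD IN HERE
  :password: ADD IN HERE
```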

4. Upgrading for Infobright/Hive users

These are the steps to upgrade Snowplow to version 0.7.6 if you are using the Hive or Infobright output formats:

4.1 EmrEtlRunner

If you are using EmrEtlRunner, you need to update your configuration file, config.yml, to use the latest versions of the Hive serde and HiveQL scripts:

```yaml
:snowplow:
  :serde_version: 0.5.5
  :hive_hiveql_version: 0.5.7
  :mysql_infobright_hiveql_version: 0.0.8
  :redshift_hiveql_version: 0.0.1
```

If you are outputting Snowplow events in Infobright format, you need to update this line too:

```yaml
:etl:
  ...
  :storage_format: mysql-infobright # Used to be 'non-hive'
```

4.2 Infobright table definition

If you are using Infobright Community Edition for analysis, you will need to update your table definition, because we have widened the useragent field.

To make this easier for you, we have created a script:

```bash
4-storage/infobright-storage/migrate_to_008.sh
```

Running this script will create a new table, events_008 (version 0.0.8 of the Infobright table definition), in your snowplow database, copying across all your data from your existing events table, which will not be modified in any way.
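
For example, you might invoke it like this. This is a sketch under the assumption that the script can reach your local Infobright instance; check the script itself for any connection settings or credentials it expects:

```bash
# Run from the root of your Snowplow checkout.
# Assumes migrate_to_008.sh can connect to your local Infobright
# instance - check the script for any required connection settings.
cd 4-storage/infobright-storage
./migrate_to_008.sh
```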

4.3 StorageLoader

If you are using StorageLoader, you need to update your configuration file, config.yml, to the new format:

```yaml
:storage:
  :type: infobright
  :host: # Not used by Infobright
  :database: ADD IN HERE
  :port: # Not used by Infobright
  :table: events_008 # NOT "events_007" any more
  :username: ADD IN HERE
  :password: ADD IN HERE
```

Note that the table field now points to the new events_008 table created in section 4.2 above.

Done!

5. Roadmap and next steps

We’re really excited about the opportunities for building web-scale, low-cost data warehouses for marketing and product analytics with Amazon Redshift, and we’re super-excited about all of the potential uses of Snowplow event data within these data warehouses. If you’re excited too, do get in touch!

Separately, this is the last planned release in the 0.7.x series. We’re already hard at work on the next release, which will see us swap out the current Hive-based ETL process for a more robust, performant and extensible Hadoop (Cascading/Scalding) ETL process.

To keep track of this new release, please sign up for our mailing list and check out our Roadmap.

6. Getting help

As always, if you do run into any issues or don’t understand any of the above changes, please raise an issue or get in touch with us via the usual channels.
