StorageLoader is a Ruby application that downloads Snowplow event files from S3 and loads them into an alternative database. We built it to make keeping an up-to-date copy of your Snowplow data in other databases as easy as possible. Currently, it only supports loading data into Infobright Community Edition (ICE), a high-performance columnar database based on MySQL. However, we plan to extend it over the next few months to support a range of other databases including:
There are significant advantages to storing data in Infobright instead of (or as well as) S3:
As our roadmap suggests, there are also clear advantages to using some of the other databases we plan to support. Going forward, we expect that many companies using Snowplow will store their Snowplow data in more than one store, to enable a broad range of analytics across different types of tools.
You configure StorageLoader with the details of the Infobright table into which your Snowplow events should be loaded, and then schedule StorageLoader (e.g. as a cron job) to regularly download your Snowplow events and load them into Infobright. StorageLoader can run as soon as EmrEtlRunner has completed its job (we include a script to run both in one go).
With this setup, you will have your Snowplow events easily accessible and queryable in a local Infobright instance - but you can still fall back to querying the data in Hive if you wish.
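To illustrate the "run EmrEtlRunner, then StorageLoader" sequence described above, here is a minimal Ruby sketch of a wrapper that runs the two jobs in order and stops if the first one fails. It is not the script we ship: the `run_pipeline` helper and the command paths in `STEPS` are hypothetical placeholders, so substitute the invocations from the setup guides.

```ruby
# Run each step in order and stop at the first failure, so that
# StorageLoader never runs against a failed or partial ETL job.
# The runner is injectable so the logic can be exercised without launching
# real processes; by default it shells out with Kernel#system, which
# returns true on success and false or nil otherwise.
def run_pipeline(steps, runner: ->(cmd) { system(*cmd) })
  steps.each do |name, cmd|
    ok = runner.call(cmd)
    return { ok: false, failed_step: name } unless ok
  end
  { ok: true, failed_step: nil }
end

# Hypothetical command lines: these paths are placeholders, not the
# documented way to invoke either application.
STEPS = [
  ["EmrEtlRunner",  ["/path/to/run-emr-etl-runner"]],
  ["StorageLoader", ["/path/to/run-storage-loader"]]
].freeze
```

A scheduled job could call `run_pipeline(STEPS)` and exit with a non-zero status when `:ok` is false, so that a failed ETL run shows up in your cron mail instead of silently loading nothing.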
The following setup guides should help you set up StorageLoader:
If you want to take a look at the code, you can find it in the main repository here: 4-storage/storage-loader/
If you have any problems getting StorageLoader working, please raise an issue or get in touch with us via the usual channels.
We have made a number of other fixes across Snowplow to prepare the ground for StorageLoader:
EmrEtlRunner has been bumped to 0.0.5, including upgrading it to Sluice 0.0.4 (which has some bug fixes around S3 path handling).
The Hive deserializer has been bumped to 0.5.1, and now outputs booleans such as br_cookies as 0 or 1 (instead of true or false) for the non-Hive output.
The non-Hive format HiveQL script has been bumped to 0.0.2 and now uses the new 0 or 1 approach to booleans. This is necessary so that true/false values can be successfully loaded into Infobright.
The setup_infobright.sql script has been bumped to 0.0.2 - we have changed the columns defined as booleans to be tinyint(1)s. This is just a formality, because Infobright creates ‘boolean’ columns as tinyint(1)s anyway.
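To make the boolean change above concrete, here is a small Ruby sketch of the same idea: rewriting true/false fields in a tab-separated row as 1/0 so that they fit a tinyint(1) column. This is purely illustrative and is not the deserializer's code; the `booleans_to_ints` helper, the tab-separated layout, and the column positions passed to it are all assumptions made for the example.

```ruby
# Map textual booleans to the 0/1 form that a tinyint(1) column accepts.
BOOLEAN_MAP = { "true" => "1", "false" => "0" }.freeze

# Convert the given (hypothetical) column positions of one tab-separated row.
# Values other than "true"/"false", such as empty fields, are left untouched.
def booleans_to_ints(line, boolean_columns)
  fields = line.chomp.split("\t", -1) # limit of -1 keeps trailing empty fields
  boolean_columns.each do |i|
    next unless i < fields.length
    fields[i] = BOOLEAN_MAP.fetch(fields[i], fields[i])
  end
  fields.join("\t")
end
```

Only the listed positions are touched, so a text field that happens to contain the word "true" elsewhere in the row is preserved as-is.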
We will keep you posted as we roll out support for additional database options in StorageLoader! (And we welcome suggestions for other databases we should build support for.)