We’re pleased to start the week with the release of a new Ruby gem, our Infobright Ruby Loader (IRL).
At Snowplow we’re committed to supporting multiple storage and analytics options for Snowplow events, alongside our current Hive-based approach. One of the alternative data stores we are working with is Infobright, a columnar database available in both open source and commercial versions.
For all but the largest Snowplow users, columnar databases such as Infobright should be an attractive alternative to doing all of your analysis in Hive. The main advantages of columnar databases are as follows:
- Scale to terabytes (although not petabytes, unlike Hive)
- Fixed cost (dedicated RAM-heavy analytics server), versus pay-as-you-go querying on Amazon EMR
- Significantly faster query times – typically seconds, not minutes
- Plug in to many analytics front-ends e.g. Tableau, Qlikview, R
Open source columnar databases like Infobright Community Edition (ICE) are therefore a good fit for Snowplow analytics. Unfortunately, when we started to load Snowplow event logs into ICE, we realised that there wasn’t a good data-loading solution for Infobright in Ruby, our ETL language of choice. So we built one 🙂
Our freshly minted Infobright Ruby Loader (IRL) can be used in two different ways:
- As a command-line tool – for manually loading data into Infobright from the command line. No Ruby expertise is required
- As part of another application – because it is a Ruby gem with a Ruby API, IRL can be integrated into larger Ruby ETL processes
We will be using IRL at Snowplow as part of our larger ETL process to load Snowplow events into ICE for analysis – we hope to roll this out within the next few weeks.
In the meantime, we hope that IRL is useful to people in the Infobright community who need to run data loads from the command line. IRL was inspired by ParaFlex, an excellent Bash script from the Infobright team that loads data into Infobright in parallel, and it can be used as a direct alternative to ParaFlex.
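To give a feel for what parallel loading involves, here is a minimal Ruby sketch. It is not IRL's actual API: `build_load_statement` and `parallel_load` are hypothetical names invented for this illustration, and the sketch assumes one thread per target table, pipe-delimited files, and a caller-supplied block that runs each statement against the database. The only real syntax it relies on is the MySQL-style `LOAD DATA INFILE` statement.

```ruby
require 'thread'

# Hypothetical helper (not part of IRL): builds a MySQL-style
# LOAD DATA INFILE statement for one data file and one table.
def build_load_statement(table, file, separator = '|')
  "LOAD DATA INFILE '#{file}' INTO TABLE #{table} " \
    "FIELDS TERMINATED BY '#{separator}'"
end

# Hypothetical helper (not part of IRL): starts one thread per
# table => file pair, hands each generated statement to the
# caller's block, and returns the table names once every thread
# has finished.
def parallel_load(load_map, &executor)
  threads = load_map.map do |table, file|
    Thread.new do
      executor.call(build_load_statement(table, file))
      table
    end
  end
  threads.map(&:value) # waits for each thread, in start order
end
```

In a real load the block would send each statement to Infobright over a database connection. For a dry run it can simply print them, for example `parallel_load('events' => '/tmp/events.psv') { |sql| puts sql }`, where the table name and path are placeholders.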