Although this is version 0.2.0 of Huskimo, it is the first publicized release, so we will take some time in this blog post to explain the rationale for Huskimo as an all-new open-source project.
At Snowplow we strongly believe that events are the most effective currency for capturing digital activity of any form. To enable this, we provide a variety of language- and platform-specific trackers, plus growing support for various third-party SaaS platforms via webhooks.
However, not all third-party SaaS platforms are willing or able to expose their internal stateful data as a stream of immutable events; some of these platforms are the very ones that Snowplow users are most excited about querying in Redshift alongside their Snowplow event data.
To bridge the gap, we are now open sourcing the Huskimo project. Huskimo has a simple goal: to make essential datasets currently locked away inside various SaaS platforms available for analysis inside Redshift.
At launch, we are supporting just one SaaS platform: Singular, a tool for managing marketing spend that is aimed at mobile app and games companies.
Huskimo supports two API resources made available by Singular:
stats: all the campaign statistics for your account
creative_stats: all the creative statistics for your account
For each resource type, Huskimo will retrieve all records from the Singular API, convert them into a simple TSV file format, and load them into Redshift.
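The conversion step can be illustrated with a minimal Python sketch. This is not Huskimo's own code: the function name records_to_tsv, the column names, and the sample records are all hypothetical, chosen only to show how API records might be flattened into TSV rows before a Redshift load.

```python
import csv
import io


def records_to_tsv(records, columns):
    """Serialise a list of dict records into TSV text, one row per record."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for record in records:
        # Fields missing from a record are written as empty strings
        writer.writerow([record.get(col, "") for col in columns])
    return buffer.getvalue()


# Hypothetical records, standing in for what an API client might return
records = [
    {"date": "2015-06-21", "campaign": "summer_promo", "spend": 120.5},
    {"date": "2015-06-21", "campaign": "retargeting", "spend": 80.0},
]

tsv = records_to_tsv(records, ["date", "campaign", "spend"])
print(tsv)
```

Fixing the column order explicitly, rather than relying on whatever order the records arrive in, keeps the file aligned with the column order of the target table.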
The most complex aspect of Huskimo is dealing with Singular marketing data becoming “golden” - Huskimo’s approach to this is covered in the next section.
Marketing data is notoriously difficult to finalize - it takes days (sometimes weeks) for advertising companies to determine which clicks on ads were real, and which ones were fraudulent. This means that it takes days or weeks for marketing spend data to be finalized (sometimes referred to as “becoming golden”).
As a result, we can retrieve Sunday’s marketing spend data from Singular on Monday, but if we fetch Sunday’s spend data again on Tuesday, the numbers for Sunday will very likely have been updated in the meantime.
Huskimo gets around this by:
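Re-fetching the data for every day within a configurable “lookback” period on each run
Attaching a when_retrieved timestamp to each row of data retrieved from Singular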
In other words, if Huskimo runs daily with its “lookback” set to 30 days, then the marketing spend data for, say, Sunday 21 June 2015 is fetched and stored in Redshift each day for 30 days. When joining your Snowplow event data to your Huskimo marketing spend data in Redshift, it’s then simply a matter of using MAX(retrieved_date) to reference the most recent (and thus most accurate) report of a given day’s marketing spend.
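To make the "most recent report wins" idea concrete, here is a small Python sketch of the same selection done in memory. The field names report_date, when_retrieved, and spend are assumptions for illustration only, and the figures are made up; in Redshift the equivalent would be an aggregate over the retrieval column, as described above.

```python
from datetime import date, datetime


def latest_spend_per_day(rows):
    """Keep, for each report date, the row with the most recent retrieval time."""
    latest = {}
    for row in rows:
        current = latest.get(row["report_date"])
        if current is None or row["when_retrieved"] > current["when_retrieved"]:
            latest[row["report_date"]] = row
    return latest


# Hypothetical rows: Sunday's spend was fetched on Monday and again on Tuesday
rows = [
    {"report_date": date(2015, 6, 21), "when_retrieved": datetime(2015, 6, 22, 5, 0), "spend": 100.0},
    {"report_date": date(2015, 6, 21), "when_retrieved": datetime(2015, 6, 23, 5, 0), "spend": 92.5},
    {"report_date": date(2015, 6, 22), "when_retrieved": datetime(2015, 6, 23, 5, 0), "spend": 40.0},
]

latest = latest_spend_per_day(rows)
for report_date, row in sorted(latest.items()):
    print(report_date, row["spend"])
```

Because earlier retrievals are kept rather than overwritten, it also remains possible to see how a given day's numbers changed as they were finalized.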
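Running Huskimo consists of four steps: installing it, writing its configuration file, deploying the Singular tables into Redshift, and scheduling it to run.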
We’ll cover each of these steps briefly in the next section.
Huskimo is made available as an executable “fatjar” runnable on any Linux system. It is hosted on Bintray; download it like so:
Once downloaded, unzip it:
Assuming you have a recent Java runtime (7 or 8) on your system, running Huskimo is as simple as:
Huskimo is configured using a YAML-format file which looks like this:
Key things to note:
The lookback period determines how many days back in time to retrieve spend data for
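As a rough illustration of what a lookback period implies, the Python sketch below lists the report dates that a single run would cover. The function name lookback_dates is hypothetical, and the sketch assumes the window ends on the day before the run; whether Huskimo also includes the run date itself is not stated here.

```python
from datetime import date, timedelta


def lookback_dates(run_date, lookback_days):
    """Return the report dates to fetch, oldest first, ending the day before run_date."""
    # Assumption: the window excludes the run date itself
    return [run_date - timedelta(days=offset) for offset in range(lookback_days, 0, -1)]


# A run on Monday 22 June 2015 with a 3-day lookback
dates = lookback_dates(date(2015, 6, 22), 3)
for d in dates:
    print(d.isoformat())
```

With a daily schedule, each report date is therefore fetched once per run until it falls out of the window, which is what gives the spend data time to be finalized.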
Before starting Huskimo you must remember to deploy the two Singular tables into Redshift. You can find the table definitions in the file:
Make sure to deploy this file against each Redshift database you want to load Singular data into.
You are now ready to schedule Huskimo to run daily.
We typically run Huskimo in the early morning so that the data for yesterday is already available (even if rather incomplete). A cron entry for Huskimo might look something like this:
For more details on this release, please check out the Huskimo 0.2.0 release on GitHub.
We will be building a dedicated wiki for Huskimo to support its usage; in the meantime, if you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.
We will be adding support for further SaaS platforms to Huskimo on a case-by-case basis. The next release (0.3.0) of Huskimo will extract the major resource types from Twilio, the popular Telephony-as-a-Service provider.
We are also particularly interested in adding support for more marketing channels, such as Google AdWords or Facebook. Having these datasets available in Redshift alongside your event data should enable some very powerful marketing attribution and return-on-spend analytics.
If you are interested in sponsoring a new integration for Huskimo, do get in touch!