Snowplow 83 Bald Eagle released with SQL Query Enrichment

06 September 2016  •  Anton Parkhomenko

We are pleased to announce the release of Snowplow 83 Bald Eagle. This release introduces our powerful new SQL Query Enrichment, long-awaited support for the EU Frankfurt AWS region, plus POST support for our Iglu webhook adapter.

  1. SQL Query Enrichment
  2. Support for eu-central-1 (Frankfurt)
  3. POST support for the Iglu webhook adapter
  4. Other improvements
  5. Upgrading
  6. Roadmap
  7. Getting help

1. SQL Query Enrichment

The SQL Query Enrichment lets us perform dimension widening on an incoming Snowplow event using any JDBC-compatible relational database such as MySQL or Postgres. We are super-excited about this capability - a first for any event analytics platform. Alongside our API Request Enrichment and JavaScript Enrichment, this enrichment is a step on our way to a fully customizable enrichment process for Snowplow.

The SQL Query Enrichment lets you effectively join arbitrary entities to your events during the enrichment process, as opposed to attaching the data in your tracker or in your event data warehouse. This is very powerful, not least for the real-time use case where performing a relational database join post-enrichment is impractical.

The query is plain SQL: it can span multiple tables, alias returned columns and apply arbitrary WHERE clauses driven by data extracted from any field found in the Snowplow enriched event, or indeed any JSON property found within the unstruct_event, contexts or derived_contexts fields. The enrichment will retrieve one or more rows from your targeted database as one or more self-describing JSONs, ready for adding back into the derived_contexts field.
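
To make the mechanics concrete, here is a minimal Python sketch of the idea. It is an illustration only, not the enrichment's actual configuration or implementation. It uses an in-memory SQLite database in place of MySQL or Postgres, and the `users` table, the `user_id` event field, the `enrich` helper and the `iglu:com.acme/user/jsonschema/1-0-0` schema URI are all hypothetical names chosen for the example.

```python
import sqlite3

# Hypothetical schema URI for the rows we attach; not defined by Snowplow.
USER_SCHEMA = "iglu:com.acme/user/jsonschema/1-0-0"


def enrich(event, conn):
    """Look up rows keyed on an event field and wrap each as a self-describing JSON."""
    conn.row_factory = sqlite3.Row
    # The placeholder value is extracted from the event; binding it as a
    # parameter keeps the query safe from SQL injection.
    rows = conn.execute(
        "SELECT name AS full_name, plan FROM users WHERE id = ?",
        (event["user_id"],),
    ).fetchall()
    contexts = [{"schema": USER_SCHEMA, "data": dict(row)} for row in rows]
    # Return a copy of the event with the new contexts appended to derived_contexts.
    enriched = dict(event)
    enriched["derived_contexts"] = list(event.get("derived_contexts", [])) + contexts
    return enriched


# Stand-in for the relational database the enrichment would query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT, plan TEXT)")
conn.execute("INSERT INTO users VALUES ('u-42', 'Ada Lovelace', 'enterprise')")

event = {"event_id": "e-1", "user_id": "u-42"}
enriched = enrich(event, conn)
```

In this sketch the aliased column (`name AS full_name`) becomes a property name in the attached JSON, and an event whose `user_id` matches no row simply gets no extra context.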

For a detailed walk-through of the SQL Query Enrichment, check out our new tutorial, How to enrich events with MySQL data using the SQL Query Enrichment.

You can also find out more on the SQL Query Enrichment page on the Snowplow wiki.

2. Support for eu-central-1 (Frankfurt)

We are delighted to finally add support for the EU Frankfurt (eu-central-1) AWS region in this release; it has been one of the features most requested by the Snowplow community for some time now.

To implement this we made various changes to our EmrEtlRunner and StorageLoader applications, as well as to our central hosting of code artifacts for Elastic MapReduce and Redshift loading.

AWS has a healthy roadmap of new data center regions opening over the coming months; we are committed to Snowplow supporting these new regions as they become available.

3. POST support for the Iglu webhook adapter

Our Iglu webhook adapter is one of our most powerful webhooks. It lets you track events sent into Snowplow via a GET request: the name-value pairs on the request are composed into a self-describing JSON, and an Iglu-compatible schema parameter describes that JSON.

Previously this adapter only supported GET requests; as of this release the adapter also supports POST requests. You can send in your data in the POST request body, either formatted as a JSON or as a form body; the schema parameter should be part of the request body.
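
As a rough illustration of the form-body variant, the Python sketch below builds (but does not send) such a POST request. The collector host `collector.acme.com`, the `com.acme/campaign` schema and the `build_form_post` helper are hypothetical, and the adapter path used here is an assumption that should be checked against the setup guide. The JSON variant is not shown.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumptions for illustration: a hypothetical collector host and schema,
# and an adapter path that should be confirmed in the setup guide.
COLLECTOR_URL = "http://collector.acme.com/com.snowplowanalytics.iglu/v1"
SCHEMA = "iglu:com.acme/campaign/jsonschema/1-0-0"


def build_form_post(url, schema, fields):
    """Build a form-encoded POST that carries the schema parameter in the body."""
    body = urlencode({"schema": schema, **fields}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_form_post(
    COLLECTOR_URL, SCHEMA, {"name": "download", "user_name": "alice"}
)
```

Passing `request` to `urllib.request.urlopen` would send it to the collector.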

Many thanks to community member Mike Robins at Snowplow partner Snowflake Analytics for contributing this feature!

For information on the new POST-based capability, please check out the setup guide for the Iglu webhook adapter.

4. Other improvements

This release also contains further improvements to EmrEtlRunner and StorageLoader:

  • In EmrEtlRunner, we now pass the GZIP compression argument to S3DistCp as “gz” not “gzip” (#2679). This makes it easier to query enriched events from Apache Spark, which does not recognize .gzip as a file extension for GZIP-compressed files.
  • Also in EmrEtlRunner, we fixed a bug where files were being double-compressed as the output of the Hadoop Shred step if the Hadoop Enrich step was skipped (#2586).
  • In StorageLoader, we opted to use the Northern Virginia endpoint instead of the global endpoint for us-east-1 (#2748). This may have some benefits in terms of improved eventual consistency behavior (still under observation).
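
The file extension change in the first item matters to any reader that chooses a decompressor from the extension alone. The Python sketch below is an analogy for that behavior, not Spark's or Hadoop's code; the `read_bytes` helper and the file names are hypothetical.

```python
import gzip
import os
import tempfile


def read_bytes(path):
    """Choose a decoder purely from the file extension, as extension-based readers do."""
    if path.endswith(".gz"):
        with gzip.open(path, "rb") as f:
            return f.read()
    # Unrecognized extension: the file is treated as uncompressed.
    with open(path, "rb") as f:
        return f.read()


payload = b"enriched\tevent\trow\n"
compressed = gzip.compress(payload)

# Write the same compressed bytes under both extensions and read them back.
tmp = tempfile.mkdtemp()
results = {}
for name in ("part-00000.gz", "part-00000.gzip"):
    path = os.path.join(tmp, name)
    with open(path, "wb") as f:
        f.write(compressed)
    results[name] = read_bytes(path)
```

Only the `.gz` file comes back decoded; the `.gzip` file is returned as raw compressed bytes.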

5. Upgrading

Upgrading is simple - update the hadoop_enrich job version in your configuration YAML like so:

versions:
  hadoop_enrich: 1.8.0        # WAS 1.7.0
  hadoop_shred: 0.9.0         # UNCHANGED
  hadoop_elasticsearch: 0.1.0 # UNCHANGED

For a complete example, see our sample config.yml template.

6. Roadmap

We have renamed the upcoming milestones for Snowplow to give ourselves more flexibility over the eventual sequencing of releases. Upcoming Snowplow releases, in no particular order, include:

  • R8x [HAD] 4 webhooks, which will add support for 4 new webhooks (Mailgun, Olark, Unbounce, StatusGator)
  • R8x [RT] ES 2.x support, which will add support for Elasticsearch 2.x to our real-time pipeline, and also add the SQL Query Enrichment to this pipeline
  • R8x [HAD] Spark data modeling, which will allow arbitrary Spark jobs to be added to the EMR jobflow to perform data modeling prior to (or instead of) Redshift
  • R8x [HAD] Synthetic dedupe, which will deduplicate event_ids with different event_fingerprints (synthetic duplicates) in Hadoop Shred

Note that these releases are always subject to change between now and the actual release date.

7. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.