We are pleased to release Snowplow 114 Polonnaruwa, named after the ancient city of Polonnaruwa in Sri Lanka. This Snowplow release includes a number of new features and updates, most of which live in Scala Common Enrich:
- New enrichment: YAUAA (Yet Another UserAgent Analyzer)
- New feature: remote HTTP adapter
- New tutorial: add an enrichment to the pipeline
- Other improvements
- Updates for EmrEtlRunner
- Upgrading
- Roadmap
- Getting help
1. New enrichment: YAUAA (Yet Another UserAgent Analyzer)
Understanding what device a website visitor is using, and what browser and operating system they are running, is incredibly valuable. This data can, for example, be used to:
- Understand how user engagement varies by device: Are patterns of engagement different for users on the go (on their mobiles), vs tablets and desktop? If so - how does that engagement vary?
- Identify issues with the user experience on particular devices, operating systems or browsers
Device detection on the web is typically done using the useragent string. Prior to this release, Snowplow supported two different useragent enrichments, each of which used a different library to derive additional data points about the device on which events occur. The User Agent Utils enrichment used the user-agent-utils library to infer the following data points from the useragent string:
That library was deprecated, so we recommended that users employ a second enrichment, ua-parser, which used the Browserscope user agent parser to infer the following fields, all located in the
However, a number of users spotted issues with the detection of particular devices, so we have released another useragent enrichment, this time based on the YAUAA ("Yet Another UserAgent Analyzer") library. The YAUAA enrichment can easily be enabled by adding the following config file to your enrichments:
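The configuration follows the standard Snowplow enrichment JSON shape; the sketch below shows the general form (check the wiki page for the authoritative schema URI and fields):

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/yauaa_enrichment_config/jsonschema/1-0-0",
  "data": {
    "enabled": true,
    "vendor": "com.snowplowanalytics.snowplow.enrichments",
    "name": "yauaa_enrichment_config"
  }
}
```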
It populates a new YAUAA context containing a raft of new fields:
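By way of illustration (the field names below follow the YAUAA library's camelCase convention; the full, authoritative list is on the wiki page), a desktop Chrome pageview might yield a context along the lines of:

```json
{
  "deviceClass": "Desktop",
  "deviceBrand": "Apple",
  "deviceName": "Apple Macintosh",
  "operatingSystemClass": "Desktop",
  "operatingSystemName": "Mac OS X",
  "agentClass": "Browser",
  "agentName": "Chrome",
  "agentVersion": "72.0.3626.121"
}
```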
More information about this enrichment can be found on the wiki page.
Because device detection is so important, we are additionally looking to add a WURFL enrichment in a forthcoming release. We welcome any feedback from users on which fields would be most useful to fetch as part of that enrichment, given the enormous number supported by the WURFL team.
2. New feature: remote HTTP adapter
The HTTP adapter provides Snowplow users with the opportunity to extend Snowplow to ingest data from a range of sources and processes without having to tamper with the Snowplow source code itself.
Snowplow has for some time supported ingesting data from specific sources via adapters. For example, Snowplow users can ingest data from SendGrid via our SendGrid adapter: SendGrid is configured to stream data via a webhook pointing to the collector's /com.sendgrid/v3 path. Snowplow uses the fact that the data has landed on that path to identify that it needs to be processed by the SendGrid adapter prior to being validated and enriched. Adapters provide an opportunity to convert the data from the format used by the third-party webhook into one matching a Snowplow-authored event, so that it can subsequently be processed like any other Snowplow event.
With the HTTP adapter, it is possible to configure Snowplow to stream data landing on particular collector paths to an external HTTP endpoint where users can configure their own applications for converting that data into a format suitable for Snowplow to continue to process. (This transformed data is returned in the HTTP response.) This means that any Snowplow user can write their own Snowplow adapter for any source of data they wish. Some example use cases:
- A company might want to ingest data from their own application which exposes it in a particular format, and not have the opportunity to either update that application to emit the data using a Snowplow Tracker, or update the shape of the data currently emitted into one suitable for ingestion via our standard Iglu Webhook. This might be the case for a legacy application which is no longer being developed, for example. In this case, a standalone adapter could be written to perform the relevant transformation.
- A company might wish to write an adapter for a third party provider but not wish to do so in Scala. In this case, the adapter could be written in any language that suited the author.
- A company might wish to ingest data into Snowplow generated by data science models that are typically written in R or Python.
The HTTP adapter is enabled via a configuration like the following to the stream-enrich configuration file:
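The shape of that configuration is along the following lines (the vendor and version values here are placeholders, and the key names should be checked against the stream-enrich documentation):

```hocon
remoteAdapters = [
  {
    vendor: "com.example"
    version: "v1"
    url: "http://remote-adapter.com:9090"
    connectionTimeout: 1000
    readTimeout: 5000
  }
]
```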
In the above example, Snowplow has been configured to forward any payloads that land on the path
The HTTP request sent to the remote adapter at http://remote-adapter.com:9090 will contain the following parameters:
Snowplow expects the body of the HTTP response to be a JSON object with a field events, which is a list of Map[String, String]; each map is placed in the parameters of a raw event (a collector payload can contain several raw events). If the remote adapter was not able to process the payload successfully, Snowplow expects the response to contain a string field called error with an error message.
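To make the request/response contract concrete, here is a minimal sketch of the transformation a remote adapter must perform. This is not the Globe and Mail reference implementation, and the input field names ("user", "action") are invented purely for illustration:

```python
import json

def handle_payload(body: str) -> str:
    """Build the response body the remote HTTP adapter contract expects:
    on success, a JSON object with an "events" field holding a list of
    string-to-string maps (one per raw event); on failure, a JSON object
    with a string "error" field."""
    try:
        payload = json.loads(body)
        # Map the incoming fields onto parameters of a single raw event.
        # These source field names are purely illustrative.
        event = {
            "user_id": str(payload["user"]),
            "action": str(payload["action"]),
        }
        return json.dumps({"events": [event]})
    except (json.JSONDecodeError, KeyError) as exc:
        return json.dumps({"error": f"could not process payload: {exc}"})
```

A function like this would sit behind whatever HTTP server the author prefers; the adapter can be written in any language, since Snowplow only sees the HTTP response body.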
The feature has been added to Scala Common Enrich but can be used only in stream-enrich for now. We plan to add it to beam-enrich shortly, so that GCP users can benefit from it.
This incredibly powerful feature was contributed by Donald Matthews and Saeed Zareian at The Globe and Mail, who have also very kindly provided example code for an HTTP remote adapter here. Many thanks, Donald and Saeed!
3. New tutorial: add an enrichment to the pipeline
We’ve had a number of users express an interest in contributing new enrichments to Snowplow, so we have written a tutorial on how to do so. It can be found here.
4. Other improvements
4.1. More relaxed URL parsing
A number of Snowplow users employ Snowplow tracking on websites that they do not directly control. (For example, this is the case for companies that provide widgets, or analytics for marketing effectiveness.)
For these users, the relatively strict URL parsing employed previously by Snowplow was problematic, because it meant events that occurred on URLs that were strictly speaking invalid (but worked on the web) would fail validation.
In this version of Snowplow, that URL parsing has been relaxed. For example, it now supports URLs containing macros like
4.2. IP address now deduced from `X-Forwarded-For` where this conflicts with `Forwarded: for=`
In this version of Snowplow, if both X-Forwarded-For and Forwarded: for= are set in the headers, X-Forwarded-For now takes priority.
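The precedence rule can be sketched as follows (this is a simplified illustration with naive header parsing, not the actual Scala Common Enrich code):

```python
def derive_client_ip(headers: dict):
    """When both X-Forwarded-For and Forwarded: for= are present,
    X-Forwarded-For wins; only if it is absent do we fall back to the
    Forwarded header. Returns None if neither header is set."""
    xff = headers.get("X-Forwarded-For")
    if xff:
        # The first entry is the original client; later entries are proxies.
        return xff.split(",")[0].strip()
    forwarded = headers.get("Forwarded")
    if forwarded:
        # Look at the first hop only, e.g. 'for=203.0.113.7;proto=https'.
        for directive in forwarded.split(",")[0].split(";"):
            key, _, value = directive.strip().partition("=")
            if key.lower() == "for":
                return value.strip('"')
    return None
```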
4.3. IAB Bots and Spiders enrichment skipped for IPv6 addresses
The library used by Snowplow to interface with the IAB Bots and Spiders enrichment does not support IPv6 addresses. As a result, events recorded against these IP addresses failed validation in previous versions of Snowplow.
With this version, any events recorded against an IPv6 address are not processed using the IAB enrichment, so that they can be successfully processed by Snowplow (though they will lack the additional data points generated by the IAB Bots and Spiders enrichment).
We plan to update this behaviour once support for IPv6 is rolled out in the IAB Bots and Spiders library.
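The new behaviour amounts to a guard like the following (a sketch, not the actual Scala implementation):

```python
import ipaddress

def should_run_iab_enrichment(ip: str) -> bool:
    """Run the IAB Bots and Spiders enrichment only for IPv4 addresses.
    IPv6 addresses (unsupported by the underlying library) and unparseable
    values are skipped instead of failing validation."""
    try:
        return isinstance(ipaddress.ip_address(ip), ipaddress.IPv4Address)
    except ValueError:
        return False
```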
4.4. SendGrid integration update
We have updated our SendGrid integration so that the optional marketing_campaign_* fields are now captured by the pipeline. (These fields were added to the SendGrid webhook payloads after we rolled out our initial SendGrid integration.)
More information about these fields can be found on this page.
4.5. IP lookup enrichment
The IP lookup enrichment now supports IPv4 addresses that contain a port.
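In other words, inputs like "93.184.216.34:8080" no longer trip up the enrichment. Stripping the port before the lookup amounts to something like this (a sketch, not the actual implementation):

```python
def strip_port(ip: str) -> str:
    """Drop a trailing ':port' from an IPv4 'address:port' string so the
    bare address can be passed to the IP lookup. IPv6 addresses (which
    themselves contain colons) would need separate handling."""
    host, sep, port = ip.partition(":")
    if sep and port.isdigit():
        return host
    return ip
```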
5. Updates for EmrEtlRunner
We are continuing the effort started in R113 to decrease the number of connection issues.
The backoff periods for retries have been increased, so that it’s less likely to hit EMR rate limits with multiple pipelines running concurrently.
The calls made to the EMR API to monitor jobs have also been updated, so that there are no longer any redundant calls.
6. Upgrading
6.1. Upgrading your enrichment platform
If you are a GCP pipeline user, a new Beam Enrich can be found on Bintray:
If you are a Kinesis or Kafka pipeline user, a new Stream Enrich can be found on Bintray.
Finally, if you are a batch pipeline user, a new Spark Enrich can be used by setting the new version in your EmrEtlRunner configuration:
or directly make use of the new Spark Enrich available at:
For the batch pipeline, we’ve also extended the timeout recovery introduced in R112. A new version of EmrEtlRunner incorporating these improvements, as well as the changes aimed at decreasing the number of connection issues, is available on our Bintray.
6.2. Using YAUAA enrichment
The YAUAA enrichment requires an additional 400 MB of memory to run, so be careful when sizing clusters or individual machines.
To use the new YAUAA enrichment, add a yauaa_enrichment_config.json file to the folder containing your enrichment configuration files, with the following content:
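As above, the file follows the standard enrichment JSON shape; a sketch (the authoritative schema URI is on the wiki page):

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/yauaa_enrichment_config/jsonschema/1-0-0",
  "data": {
    "enabled": true,
    "vendor": "com.snowplowanalytics.snowplow.enrichments",
    "name": "yauaa_enrichment_config"
  }
}
```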
7. Roadmap
Upcoming Snowplow releases include:
- R115 New bad row format, a release which will incorporate the new bad row format discussed in the dedicated RFC.
Stay tuned for announcements of more upcoming Snowplow releases soon!
8. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problem, please visit our Discourse forum.