Snowplow R103 Paestum released with IP Lookups Enrichment upgrade

17 April 2018  •  Ben Fradet

We are proud to announce the release of Snowplow R103 Paestum. This release centers on upgrading the IP Lookups Enrichment for both the batch and streaming pipelines, given the impending end of life of MaxMind’s legacy databases.

It also ships with a security improvement for cross-domain policy management on the Clojure Collector.

Read on for more information on R103 Paestum, named after the ancient city in Italy:

  1. Upgrading the IP Lookups Enrichment
  2. Cross-domain policy management for the Clojure collector
  3. PII enrichment for the batch pipeline
  4. Community contributions
  5. Upgrading
  6. Roadmap
  7. Getting help

1. Upgrading the IP Lookups Enrichment

As described in our Discourse post, MaxMind will no longer provide monthly updates to its now-legacy databases from April 2nd onwards.

To tackle this issue and keep the IP Lookups Enrichment as accurate as possible, we are releasing a new version of the enrichment, for both the batch and streaming pipelines, which works with GeoIP2 databases, MaxMind’s new format.

A special thanks to Tiago Macedo and Andrew Korzhuev, who worked on the scala-maxmind-iplookups library upgrade, without which this enrichment upgrade wouldn’t have been possible.

2. Cross-domain policy management for the Clojure collector

On the security side of things, we have made the cross-domain policy of the Clojure Collector configurable; this change is in line with the updates made to the Scala Stream Collector back in Release 98 Argentomagus.

First, what is a Flash cross-domain policy? Quoting the Adobe website:

A cross-domain policy file is an XML document that grants a web client, such as Adobe Flash Player or Adobe Acrobat (though not necessarily limited to these), permission to handle data across domains. When clients request content hosted on a particular source domain and that content make requests directed towards a domain other than its own, the remote domain needs to host a cross-domain policy file that grants access to the source domain, allowing the client to continue the transaction.

To allow a Flash media player hosted on another web server to access content from the Adobe Media Server web server, we require a crossdomain.xml file. A typical use case will be HTTP streaming (VOD or Live) to a Flash Player. The crossdomain.xml file grants a web client the required permission to handle data across multiple domains.

For example, a cross-domain policy file grants the necessary permissions when a Flash game makes a request to a Snowplow collector running on a different host.

Until now, the Clojure Collector embedded a very permissive cross-domain policy file, which granted access to any domain and did not enforce HTTPS:

<?xml version="1.0"?>
<cross-domain-policy>
  <allow-access-from domain="*" secure="false" />
</cross-domain-policy>

With this release, we’re removing the /crossdomain.xml route by default. Should you need it, you can manually re-enable it by adding the following two environment properties to your Elastic Beanstalk application:

  • SP_CDP_DOMAIN: the domain that is granted access; for example, *.acme.com will match both http://acme.com and http://sub.acme.com.
  • SP_CDP_SECURE: a boolean indicating whether to grant access only to HTTPS sources or to both HTTPS and HTTP sources.
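
To make the effect of these two properties concrete, here is a minimal Python sketch of how a policy file could be derived from them. It is purely illustrative and not the Clojure Collector's actual code: the function name render_cross_domain_policy is hypothetical, and the sketch assumes that the route stays disabled unless both properties are set and that SP_CDP_SECURE maps directly onto the policy's secure attribute.

```python
from xml.sax.saxutils import quoteattr


def render_cross_domain_policy(env):
    """Return a crossdomain.xml body, or None when the route should stay disabled.

    Assumption: both SP_CDP_DOMAIN and SP_CDP_SECURE must be present to
    re-enable the route; the real collector may behave differently.
    """
    domain = env.get("SP_CDP_DOMAIN")
    secure = env.get("SP_CDP_SECURE")
    if domain is None or secure is None:
        return None
    # Assumption: "true" restricts access to HTTPS sources, anything else does not.
    secure_flag = "true" if secure.strip().lower() == "true" else "false"
    return (
        '<?xml version="1.0"?>\n'
        "<cross-domain-policy>\n"
        # quoteattr escapes the value and wraps it in quotes for us.
        f'  <allow-access-from domain={quoteattr(domain)} secure="{secure_flag}" />\n'
        "</cross-domain-policy>"
    )


if __name__ == "__main__":
    print(render_cross_domain_policy(
        {"SP_CDP_DOMAIN": "*.acme.com", "SP_CDP_SECURE": "true"}
    ))
```

Compared with the previous embedded policy, a configuration like the one above would restrict access to acme.com and its subdomains over HTTPS, instead of allowing any domain.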

3. PII enrichment for the batch pipeline

This release also marks the availability of the PII enrichment for the batch pipeline. Check out the dedicated blog post to learn more.

4. Community contributions

This release contains quite a few community contributions which we’d like to highlight. Huge thanks to everyone involved!

4.1 Improvement to the IP address extractor

Thanks to Mike Robins from Snowflake Analytics, extracting IP addresses from collector payloads originating from the Scala Stream Collector has gotten better.

Snowplow now successfully extracts IPv6 addresses from these Scala Stream Collector payloads, and inspects the Forwarded header in addition to the historically supported X-Forwarded-For header.
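
As an illustration of what this kind of extraction involves, here is a minimal Python sketch. It is not the code contributed to Snowplow (which is written in Scala): the names extract_client_ip and _clean are hypothetical, and checking Forwarded before X-Forwarded-For is an assumption made for the example.

```python
import ipaddress
import re

# Matches for=<value> pairs in a Forwarded header (RFC 7239); the value may be quoted.
_FOR_PATTERN = re.compile(r'for=("[^"]*"|[^;,\s]+)', re.IGNORECASE)


def _clean(candidate):
    """Strip quotes, brackets and an optional port; return a valid IP string or None."""
    value = candidate.strip().strip('"')
    if value.startswith("["):
        # Bracketed IPv6 address, optionally followed by :port
        value = value[1:].split("]", 1)[0]
    elif value.count(":") == 1:
        # IPv4 address followed by :port
        value = value.split(":", 1)[0]
    try:
        return str(ipaddress.ip_address(value))
    except ValueError:
        return None


def extract_client_ip(headers):
    """Return the first valid IP found in Forwarded, then in X-Forwarded-For."""
    lowered = {name.lower(): value for name, value in headers.items()}
    # Assumption: Forwarded takes precedence over X-Forwarded-For.
    for match in _FOR_PATTERN.finditer(lowered.get("forwarded", "")):
        ip = _clean(match.group(1))
        if ip:
            return ip
    for part in lowered.get("x-forwarded-for", "").split(","):
        ip = _clean(part)
        if ip:
            return ip
    return None


if __name__ == "__main__":
    print(extract_client_ip({"Forwarded": 'for="[2001:db8:cafe::17]:4711";proto=https'}))
    print(extract_client_ip({"X-Forwarded-For": "203.0.113.7, 10.0.0.1"}))
```

Validating each candidate with the standard ipaddress module means IPv4 and IPv6 addresses go through the same code path, and obfuscated identifiers such as "unknown" are skipped.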

4.2 Improvements to the Mandrill integration

An unexpected subaccount property in the Mandrill events format has meant that many Mandrill events have been failing enrichment.

To resolve this, community member Adam Gray has authored new 1-0-1 schemas for our Mandrill events, and updated the adapter to emit these new versions.

4.3 Documentation improvements

Finally, thanks to Kristoffer Snabb and Thales Mello for improving the repo-embedded documentation, as follows:

  • Redirecting our users to Discourse for support requests in our CONTRIBUTING.md
  • Renaming Caravel to Superset in our README.md

5. Upgrading

5.1 Upgrading the IP Lookups Enrichment

Whether you are using the batch or streaming pipeline, it is important to perform this upgrade if you make use of the MaxMind IP Lookups Enrichment.

To make use of the new enrichment, you will need to update your ip_lookups.json so that it conforms to the new 2-0-0 schema.

An example is provided in the GitHub repository.
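
Before deploying, it can be handy to check which schema version an existing ip_lookups.json declares. The Python sketch below assumes the configuration is a self-describing JSON whose top-level schema field ends in a MODEL-REVISION-ADDITION version; the helper names schema_version and needs_upgrade are hypothetical, and the example in the GitHub repository remains the reference for the fields the 2-0-0 schema expects.

```python
import json


def schema_version(config_text):
    """Return the (model, revision, addition) version declared by a config.

    Assumption: the config has a top-level "schema" field whose last path
    segment is a version such as "2-0-0".
    """
    schema_uri = json.loads(config_text)["schema"]
    version = schema_uri.rsplit("/", 1)[-1]
    model, revision, addition = (int(part) for part in version.split("-"))
    return model, revision, addition


def needs_upgrade(config_text):
    """True when the config still targets a schema older than 2-0-0."""
    return schema_version(config_text) < (2, 0, 0)


if __name__ == "__main__":
    # Illustrative config; the "data" payload is deliberately left empty here.
    old_config = json.dumps({
        "schema": "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/1-0-0",
        "data": {},
    })
    print(needs_upgrade(old_config))
```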

5.1.1 Stream Enrich

If you are a streaming pipeline user, a version of Stream Enrich incorporating the upgraded IP Lookups Enrichment can be found on our Bintray here.

5.1.2 Spark Enrich

If you are a batch pipeline user, you’ll need to either update your EmrEtlRunner configuration as follows:

enrich:
  version:
    spark_enrich: 1.13.0 # WAS 1.12.0

or directly make use of the new Spark Enrich available at:

s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.13.0.jar

5.2 Upgrading the Clojure Collector

The new Clojure Collector is available in S3 at:

s3://snowplow-hosted-assets/2-collectors/clojure-collector/clojure-collector-2.0.0-standalone.war

To re-enable the /crossdomain.xml path, make sure to specify the SP_CDP_DOMAIN and SP_CDP_SECURE environment properties as described above.

6. Roadmap

We have a packed schedule of new and improved features coming for Snowplow. Upcoming Snowplow releases will include:

7. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please visit our Discourse forum.