Snowplow R109 Lambaesis real-time pipeline upgrade

21 August 2018  •  Ben Fradet

We are pleased to announce the release of Snowplow R109 Lambaesis, named after the archeological site in north-eastern Algeria. This release focuses on upgrading the AWS real-time pipeline components, although it also updates EmrEtlRunner and Spark Enrich for batch pipeline users.

This release is one of the most community-driven releases in the history of Snowplow Analytics. As such, we would like to give a huge shout-out to each of the contributors who made it possible:

Please read on after the fold for:

  1. Enrichment process updates
  2. Scala Stream Collector updates
  3. EmrEtlRunner bugfix
  4. Supporting community contributions
  5. Upgrading
  6. Roadmap
  7. Help

Lambese - M. Gasmi / CC-BY 2.5

1. Enrichment process updates

1.1 Externalizing the file used for the user agent parser enrichment

Up until this release, the User Agent Parser Enrichment relied on a “database” of user agent regexes that was embedded alongside the code. With this release, we have externalized this file, decoupling updates to the file from updates to the library and giving us a lot more flexibility.

This User Agent Parser Enrichment update is available for both batch and real-time users, and we will be doing the same for the Referer Parser Enrichment.

Huge thanks to Kevin Irwin for contributing this change!

1.2 More flexible Iglu webhook

Up to this release, if you were to POST a JSON array to the Iglu webhook, such as:

curl -X POST \
  -H 'Content-Type: application/json' \
  -d '[
    {"name": "name1"},
    {"name": "name2"}
  ]' \
  'http://collector/com.snowplowanalytics.iglu/v1?schema=iglu%3Acom.acme%2Fschema%2Fjsonschema%2F1-0-0'

the Iglu webhook would assume you were sending a single event with an array of objects at its root; the schema would look like the following:

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for acme",
  "self": {
    "vendor": "com.acme",
    "name": "schema",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": { "type": "string" }
    }
  }
}

We have now changed this behavior to instead treat an incoming array as multiple events which, in our case, would each have the following schema:

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for acme",
  "self": {
    "vendor": "com.acme",
    "name": "schema",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "name": { "type": "string" }
  }
}

This should make it easier to work with event sources which need to POST events to Snowplow in bulk.
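The new behavior can be sketched in Python (the helper name is illustrative, not Snowplow's actual implementation):

```python
import json

def split_iglu_payload(body: str) -> list:
    """Split an Iglu webhook POST body into individual events.

    A JSON array becomes one event per element; any other JSON
    value is treated as a single event, as before.
    """
    parsed = json.loads(body)
    if isinstance(parsed, list):
        return parsed   # one event per array element
    return [parsed]     # single event

# The curl example above posts two objects, yielding two events:
events = split_iglu_payload('[{"name": "name1"}, {"name": "name2"}]')
print(len(events))  # 2
```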

1.3 Handle a comma-separated list of IP addresses

We have seen Snowplow users and customers encounter X-Forwarded-For headers containing a comma-separated list of IP addresses, which occurs when a request passes through multiple load balancers or proxies. The header in the raw event payload will indeed accumulate the different IP addresses, for example:

X-Forwarded-For: 132.130.245.228, 14.189.65.12, 132.71.227.98

By convention, the first address is the original client IP address, whereas the following ones correspond to the successive proxies.

Based on this, we have chosen to keep only the first IP address when a comma-separated list is encountered.
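This extraction can be sketched in Python (the function name is illustrative, not Snowplow's actual implementation):

```python
def client_ip(x_forwarded_for: str) -> str:
    """Return the original client IP from an X-Forwarded-For value.

    The first address is the client; subsequent addresses are the
    proxies the request passed through.
    """
    return x_forwarded_for.split(",")[0].strip()

print(client_ip("132.130.245.228, 14.189.65.12, 132.71.227.98"))
# 132.130.245.228
```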

1.4 Stream Enrich updates

This section is for updates that apply to the real-time pipeline only.

Before this release, the Kinesis endpoint for Stream Enrich was determined by the AWS region that you wanted to run in. Unfortunately, this didn’t allow for use of projects like localstack which let you mimic AWS services locally.

Thanks to Arihant Surana, it is now possible to optionally specify a custom endpoint directly through the customEndpoint configuration.

Note that this feature is also available for the Scala Stream Collector.
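As a sketch, pointing Stream Enrich at a local Kinesis stand-in might look like the following (the surrounding keys are abbreviated and the localstack port is an assumption — check the example configuration file in the repository for the exact structure):

```
enrich {
  streams {
    sourceSink {
      # When set, customEndpoint overrides the endpoint derived from
      # the AWS region; the port below assumes localstack's default
      # Kinesis port -- verify it for your own setup
      customEndpoint = "http://localhost:4568"
    }
  }
}
```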

1.5 Spark Enrich updates

This section is for updates that apply to the batch pipeline only.

For Snowplow users processing CloudFront access logs, this release introduces support for the 26-field CloudFront access log format released in January.

You can find more information in the AWS documentation; thanks to Moshe Demri for flagging the issue.

We have also taken advantage of our work on CloudFront to leverage the x-forwarded-for field to populate the user’s IP address. Thanks a lot to Dani Solà for contributing this change!

1.6 Miscellaneous updates

Thanks a lot to Saeed Zareian for a flurry of build dependency updates and Robert Kingston for example updates.

2. Scala Stream Collector updates

2.1 Reject requests with "do not track" cookies

The Scala Stream Collector can now reject requests which contain a cookie with a specified name and value. If the request is rejected based on this cookie, no tracking will happen: no events will be sent downstream and no cookies will be sent back.

The configuration takes the following form:

doNotTrackCookie {
  enabled = false
  name = do-not-track
  value = yes-do-not-track
}

You will have to set this cookie yourself, on a domain which the Scala Stream Collector can read.
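The rejection logic can be sketched in Python (a simplified, hypothetical re-implementation assuming an exact value match — not the collector's actual Scala code):

```python
def should_reject(cookies: dict, cookie_name: str, cookie_value: str) -> bool:
    """Reject the request when the configured do-not-track cookie
    is present with the configured value."""
    return cookies.get(cookie_name) == cookie_value

# With the configuration above (name = do-not-track, value = yes-do-not-track):
print(should_reject({"do-not-track": "yes-do-not-track"},
                    "do-not-track", "yes-do-not-track"))  # True
print(should_reject({"session": "abc123"},
                    "do-not-track", "yes-do-not-track"))  # False
```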

2.2 Customize the response from the root route

It is now possible to customize what is sent back when hitting the / route of the Scala Stream Collector. Whereas the collector previously always responded with a 404, you can now customize this response through the following configuration:

rootResponse {
  enabled = false
  statusCode = 302
  # Optional, defaults to empty map
  headers = {
    Location = "https://127.0.0.1/"
    X-Custom = "something"
  }
  # Optional, defaults to empty string
  body = "302, redirecting"
}

This neat feature lets you provide an information page about your event collection and processing on the collector’s root URL, ready for site visitors to review.

2.3 Support for HEAD requests

The Scala Stream Collector now supports HEAD requests wherever GET requests were supported previously.

2.4 Allow for multiple domains in crossdomain.xml

You can now specify an array of domains when specifying your /crossdomain.xml route:

crossDomain {
  enabled = false
  domains = [ "*.acme.com", "*.acme.org" ]
  secure = true
}

3. EmrEtlRunner bugfix

In R108 we started leveraging the official AWS Ruby SDK in EmrEtlRunner and replaced our deprecated Sluice library.

Unfortunately, the functions we wrote to run the different empty file checks were recursive and could blow up the stack if you had a large number of empty files in S3 (more than 5,000 files in our tests).

This issue can prevent the Elastic MapReduce job from being launched.

We’ve now fixed this by making those functions iterative.
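The shape of the fix can be illustrated with a simplified Python sketch (not EmrEtlRunner's actual Ruby code, and the `_$folder$` suffix is just the marker S3DistCp-style empty files commonly carry): a recursive walk adds a stack frame per key, while an iterative loop runs in constant stack depth.

```python
def count_empty_recursive(keys):
    # Original shape: one stack frame per key, so tens of
    # thousands of keys can overflow the stack.
    if not keys:
        return 0
    head, *tail = keys
    return (1 if head.endswith("_$folder$") else 0) + count_empty_recursive(tail)

def count_empty_iterative(keys):
    # Fixed shape: a plain loop, constant stack depth regardless
    # of how many keys are listed.
    count = 0
    for key in keys:
        if key.endswith("_$folder$"):
            count += 1
    return count
```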

On a side note: we now encourage everyone to use s3a when referencing buckets in the EmrEtlRunner configuration because, when using s3a, those problematic empty files are simply not generated.

4. Supporting community contributions

We have taken advantage of this release to improve how we support our community of open source developers and other contributors. This initiative translates into:

  • A new Gitter room for Snowplow, where you can chat with the Snowplow engineers and share ideas on contributions you would like to make to the project
  • A new contributing guide
  • New issue and pull request templates to give better guidance if you are looking to contribute

5. Upgrading

5.1 Upgrading Stream Enrich

A new version of Stream Enrich incorporating the changes discussed above can be found on our Bintray here.

5.2 Upgrading Spark Enrich

If you are a batch pipeline user, you’ll need to either update your EmrEtlRunner configuration to the following:

enrich:
  version:
    spark_enrich: 1.16.0 # WAS 1.15.0

or directly make use of the new Spark Enrich available at:

s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.16.0.jar

5.3 Upgrading the User Agent Parser Enrichment

To make use of an external user agent database, you can update your enrichment file to the following:

{
  "schema": "iglu:com.snowplowanalytics.snowplow/ua_parser_config/jsonschema/1-0-1",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "ua_parser_config",
    "enabled": true,
    "parameters": {
      "database": "regexes-latest.yaml",
      "uri": "s3://snowplow-hosted-assets/third-party/ua-parser/"
    }
  }
}

Note the bump to the version 1-0-1 as well as the specification of the location of the user agent database. The database is the one maintained in the uap-core repository.

An example can be found in our repository.

We will be keeping the external user agent database that we host in Amazon S3 up-to-date as the upstream project releases new versions of it.

5.4 Upgrading the Scala Stream Collector

A new version of the Scala Stream Collector incorporating the changes discussed above can be found on our Bintray here.

To make use of this new version, you’ll need to amend your configuration in the following ways:

  • Add a doNotTrackCookie section:
doNotTrackCookie {
  enabled = false
  name = cookie-name
  value = cookie-value
}
  • Add a rootResponse section:
rootResponse {
  enabled = false
  statusCode = 200
  body = "ok"
}
  • Turn crossDomain.domain into crossDomain.domains:
crossDomain {
  enabled = false
  domains = [ "*.acme.com", "*.acme.org" ]
  secure = true
}

A full configuration can be found in the repository.

5.5 Upgrading EmrEtlRunner

The latest version of EmrEtlRunner is available from our Bintray here.

We also encourage you to switch all of your bucket paths to s3a, which will prevent the pipeline’s S3DistCp steps from creating empty files, like so:

aws:
  s3:
    bucket:
      raw:
        in:
          - "s3a://bucket/in"
        processing: "s3a://bucket/processing"
        archive: "s3a://bucket/archive/raw"
      enriched:
        good: "s3a://bucket/enriched/good"
        bad: "s3a://bucket/enriched/bad"
        errors: "s3a://bucket/enriched/errors"
        archive: "s3a://bucket/archive/enriched"
      shredded:
        good: "s3a://bucket/shredded/good"
        bad: "s3a://bucket/shredded/bad"
        errors: "s3a://bucket/shredded/errors"
        archive: "s3a://bucket/archive/shredded"
...

6. Roadmap

Upcoming Snowplow releases include:

7. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problem, please visit our Discourse forum.