Snowplow R113 Filitosa real-time pipeline improvements

06 March 2019  •  Ben Fradet

Snowplow R113 Filitosa, named after the megalithic site in southern Corsica, is a release focused on improvements to the Scala Stream Collector as well as new features for Scala Common Enrich, the library powering all the different enrichment platforms.

This release is made up almost entirely of community contributions, so a big shoutout to all the contributors. Thanks a lot to everyone involved!

Please read on after the fold for:

  1. Scala Stream Collector improvements
  2. Scala Common Enrich improvements
  3. Upgrading
  4. Roadmap
  5. Getting help

Image: Filitosa. Photo by Jean-Pol Grandmont, CC BY-SA 3.0.

1. Scala Stream Collector improvements

1.1 Prometheus metrics support

Thanks to LiveIntent, the Scala Stream Collector now publishes Prometheus metrics to the /metrics endpoint. You’ll find the following metrics published at this endpoint:

  • http_requests_total: the total count of requests
  • http_request_duration_seconds: the time spent handling requests

You will be able to slice and dice the metrics by endpoint, method and/or response code.

Additional information will also be available, such as the Java and Scala versions as well as the version of the Scala Stream Collector artifact.
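As a rough illustration of slicing these metrics, the sketch below parses a scrape in the Prometheus text format and sums http_requests_total by one label. The label names (endpoint, method, code) and all sample values are assumptions made for this example, not the collector's documented output, and the helper sum_by is hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical scrape of the collector's /metrics endpoint. The label names
# (endpoint, method, code) and the values are assumptions for illustration.
SAMPLE = '''\
# TYPE http_requests_total counter
http_requests_total{endpoint="/com.snowplowanalytics.snowplow/tp2",method="POST",code="200"} 120
http_requests_total{endpoint="/com.snowplowanalytics.snowplow/tp2",method="OPTIONS",code="200"} 30
http_requests_total{endpoint="/i",method="GET",code="200"} 45
http_requests_total{endpoint="/i",method="GET",code="500"} 5
'''

# One sample line looks like: name{label="value",...} number
SAMPLE_LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def sum_by(text, metric, label):
    """Sum all samples of `metric`, grouped by the value of `label`."""
    totals = defaultdict(float)
    for line in text.splitlines():
        if line.startswith('#'):
            continue  # skip HELP/TYPE comment lines
        match = SAMPLE_LINE.match(line)
        if not match or match.group(1) != metric:
            continue
        labels = dict(LABEL.findall(match.group(2)))
        totals[labels.get(label, '')] += float(match.group(3))
    return dict(totals)


if __name__ == "__main__":
    print(sum_by(SAMPLE, "http_requests_total", "method"))
    print(sum_by(SAMPLE, "http_requests_total", "code"))
```

In practice you would let Prometheus do this aggregation at query time; the sketch only shows what grouping by a label means.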

1.2 Improved Kafka support

It is now possible to specify arbitrary Kafka producer configurations for the collector through the collector.streams.sink.producerConf configuration setting. Additionally, the Kafka library has been upgraded to the latest version to leverage the latest features.

Note that these changes also apply to Stream Enrich for Kafka through the enrich.streams.sourceSink.{producerConf, consumerConf} configuration settings.

Thanks a lot to Sven Pfenning and Mirko Prescha for those two awesome features!

1.3 Other improvements

For people using the do not track cookie feature of the Scala Stream Collector, LiveIntent has improved the feature by letting you specify a regex for the cookie value.
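To make the regex behaviour concrete, here is a minimal sketch of the check in Python. It reuses the pattern and cookie name from the upgrade example later in this post. Whether the collector requires the whole value to match (as assumed here) or only part of it is an assumption, and should_track is a hypothetical helper, not the collector's code.

```python
import re

# Pattern taken from the upgrade example in this post. Requiring a full
# match of the cookie value is an assumption made for this sketch.
DO_NOT_TRACK_VALUE = re.compile(r".+cookie-value.+")


def should_track(cookies, name="cookie-name"):
    """Return False when the named cookie is present and its value matches."""
    value = cookies.get(name)
    return value is None or DO_NOT_TRACK_VALUE.fullmatch(value) is None


if __name__ == "__main__":
    print(should_track({}))                                     # no cookie
    print(should_track({"cookie-name": "xx-cookie-value-yy"}))  # value matches
    print(should_track({"cookie-name": "cookie-value"}))        # .+ needs surrounding text
```

Note that with this pattern a bare "cookie-value" does not match, because each ".+" needs at least one character on either side.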

Mike from Poplin Data has introduced a configurable Access-Control-Max-Age header, which lets clients cache the results of OPTIONS requests. This results in fewer requests and faster POST requests, since there is no need to make a preflight request if the result is already cached.
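The effect of the header can be sketched with a toy model of a browser's preflight cache. This is a deliberate simplification: real browsers key the cache on origin, URL, method and headers and cap the maximum age, and the class and method names below are hypothetical.

```python
class PreflightCache:
    """Toy model of a browser's CORS preflight cache for a single URL."""

    def __init__(self):
        self.expires_at = None  # time until which a cached preflight is valid

    def requests_for_post(self, now, max_age):
        """Return the HTTP requests a cross-origin POST triggers at time `now`.

        `max_age` is the Access-Control-Max-Age value in seconds; a value of
        zero or less means the preflight result is not cached.
        """
        if self.expires_at is not None and now < self.expires_at:
            return ["POST"]  # preflight still cached, skip the OPTIONS request
        if max_age > 0:
            self.expires_at = now + max_age
        return ["OPTIONS", "POST"]


def total_requests(times, max_age):
    """Count the requests sent for POSTs issued at the given times."""
    cache = PreflightCache()
    return sum(len(cache.requests_for_post(t, max_age)) for t in times)


if __name__ == "__main__":
    print(total_requests([0, 2, 4, 6], max_age=5))   # caching enabled
    print(total_requests([0, 2, 4, 6], max_age=-1))  # caching disabled
```

With four POSTs at 0, 2, 4 and 6 seconds, a max age of 5 seconds gives six requests in this model, against eight when caching is disabled.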

2. Scala Common Enrich improvements

2.1 HubSpot webhook integration

Peter Zhu from Poplin Data built the HubSpot webhook integration from scratch for this release. Huge props to Peter!

You’ll now be able to track the following HubSpot events in your Snowplow pipeline:

  • Deal creation
  • Deal change
  • Deal deletion
  • Contact creation
  • Contact change
  • Contact deletion
  • Company creation
  • Company change
  • Company deletion

Peter has also made small improvements to the Marketo and CallRail integrations.

2.2 POST support in the API request enrichment

It is now possible to use POST requests to interact with the API leveraged in the API request enrichment. Thanks to LiveIntent for this feature.

This is useful if you have to leverage an API which isn’t necessarily RESTful.

3. Upgrading

3.1 Upgrading the Scala Stream Collector

A new version of the Scala Stream Collector incorporating the changes discussed above can be found on our Bintray.

To make use of this new version, you’ll need to amend your configuration in the following ways:

  • Add a collector.cors section to specify the Access-Control-Max-Age duration:

    cors {
      accessControlMaxAge = 5 seconds # -1 seconds disables the cache
    }

  • Add a collector.prometheusMetrics section:

    prometheusMetrics {
      enabled = false
      durationBucketsInSeconds = [0.1, 3, 10] # optional buckets by which to group the `http_request_duration_seconds` metric
    }

  • Modify the collector.doNotTrackCookie section if you want to make use of a regex:

    doNotTrackCookie {
      enabled = true
      name = cookie-name
      value = ".+cookie-value.+"
    }

  • Add the optional collector.streams.sink.producerConf if you want to specify additional Kafka producer configuration:

    producerConf {
      acks = all
    }

The same applies to Stream Enrich through enrich.streams.sourceSink.{producerConf, consumerConf}.

A full example configuration can be found in the repository.
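To give a feel for what durationBucketsInSeconds controls, the sketch below applies standard Prometheus histogram semantics to the example bounds [0.1, 3, 10]: each bucket counts every observation at or below its upper bound, and an implicit +Inf bucket counts them all. The request durations are made up for illustration and bucket_counts is a hypothetical helper, not part of the collector.

```python
import math


def bucket_counts(durations, bounds):
    """Cumulative count of observations at or below each upper bound."""
    uppers = sorted(bounds) + [math.inf]  # Prometheus always adds +Inf
    return {le: sum(1 for d in durations if d <= le) for le in uppers}


if __name__ == "__main__":
    # Made-up request durations in seconds.
    durations = [0.05, 0.2, 1.5, 4.0, 12.0]
    for upper, count in bucket_counts(durations, [0.1, 3, 10]).items():
        print(f"le={upper}: {count}")
```

Choosing bounds close to the latencies you care about (for example, your timeout or SLA thresholds) makes the resulting histogram more useful.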

3.2 Upgrading your enrichment platform

If you are a GCP pipeline user, a new Beam Enrich can be found on Bintray.

If you are a Kinesis or Kafka pipeline user, a new Stream Enrich can be found on Bintray.

Finally, if you are a batch pipeline user, a new Spark Enrich can be used by setting the new version in your EmrEtlRunner configuration:

enrich:
  version:
    spark_enrich: 1.17.0 # WAS 1.16.0

or directly make use of the new Spark Enrich available at:

s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.17.0.jar

For the batch pipeline, we’ve also extended the timeout recovery introduced in R112. A new version of EmrEtlRunner incorporating those improvements is available from our Bintray.

4. Roadmap

Stay tuned for announcements of upcoming Snowplow releases soon!

5. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please visit our Discourse forum.