Snowplow R93 Virunum released


We are tremendously excited to announce the release of Snowplow R93 Virunum.

This release focuses on a much needed refresh of the real-time pipeline components: the Scala Stream Collector as well as Stream Enrich. It also fixes some long-standing annoyances regarding the Scala Stream Collector.

If you’d like to know more about R93 Virunum, named after [the ancient Roman city in Austria][virunum], please read on after the fold:

  1. Scala Stream Collector: detecting blocked third-party cookies through cookie bounce
  2. Scala Stream Collector: relaxing the parsing of HTTP request components
  3. Stream Enrich: enriching events from a certain point in time with AT_TIMESTAMP
  4. Stream Enrich: forcing the download of the MaxMind IP lookups database
  5. Stream Enrich: partitioning the output stream according to event properties
  6. Kafka improvements
  7. Under the hood
  8. Upgrading
  9. Roadmap
  10. Getting help

1. Scala Stream Collector: detecting blocked third-party cookies through cookie bounce

Following on from Christoph Buente’s RFC, the Scala Stream Collector now provides a mechanism to test if third-party cookies are blocked and reacts appropriately. Huge thanks to Christoph and the team at LiveIntent for contributing this sophisticated new feature.

Simply put, the new “cookie bounce” mechanism:

- redirects any request that arrives without the collector’s cookie back to the collector, appending a marker parameter (configurable via cookieBounce.name) to signal that a bounce is in progress;
- checks whether the cookie is present on the redirected request;
- if it still isn’t, concludes that third-party cookies are blocked and records the event with the configured cookieBounce.fallbackNetworkUserId instead of a freshly generated network user id.

To enable this feature, set the cookieBounce.enabled configuration setting to true.

Be careful though: the redirects mentioned above can significantly increase the number of requests that your collectors have to handle.
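
As a minimal sketch, the relevant section of the collector configuration would look as follows (the cookie name and fallback UUID below are simply the template defaults from the upgrade section):

```
collector {
  cookieBounce {
    enabled = true   # turn on the cookie bounce mechanism
    name = "n3pc"    # name of the parameter marking an in-progress bounce
    fallbackNetworkUserId = "00000000-0000-4000-A000-000000000000"  # assigned when third-party cookies are blocked
  }
}
```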

2. Scala Stream Collector: relaxing the parsing of HTTP request components

The Scala Stream Collector was previously too restrictive when parsing the components of an HTTP request, and would reject certain events despite their intrinsic correctness.

Those shortcomings have been fixed in the new version of the collector as part of our ongoing focus on removing any possible data loss scenarios across the pipeline.

Additionally, Stream Enrich no longer rejects events whose page_url contains more than one # character (#2893).

3. Stream Enrich: enriching events from a certain point in time with `AT_TIMESTAMP`

If you are using Kinesis with Stream Enrich, you previously had two choices when it came to enriching your raw event stream:

- TRIM_HORIZON: start from the oldest records available in the stream;
- LATEST: start from the tip of the stream, i.e. only process events arriving after Stream Enrich has been launched.

With R93, you are now able to consume your raw event stream from an arbitrary point in time by specifying AT_TIMESTAMP as the streams.kinesis.initialPosition configuration setting. Additionally, you’ll need to specify an actual timestamp in streams.kinesis.initialTimestamp.
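
As a sketch, replaying the raw stream from a given instant would look like this (the timestamp below is the illustrative value from the upgrade section):

```
enrich {
  streams {
    kinesis {
      initialPosition = AT_TIMESTAMP             # consume from a point in time
      initialTimestamp = "2017-05-17T10:00:00Z"  # only read events from this instant onwards
    }
  }
}
```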

4. Stream Enrich: forcing the download of the MaxMind IP lookups database

Before R93, when you launched Stream Enrich with the IP lookups enrichment enabled, the MaxMind IP lookups database was downloaded locally; any subsequent launch would then reuse this local copy of the database, even if a newer version had since been published.

R93 introduces a --force-ip-lookups-download command-line flag which forces a fresh download of the IP lookups database every time Stream Enrich is launched.

Issue #3407 tracks plans to introduce a time-to-live for this database so that it can be re-downloaded while Stream Enrich is running.

5. Stream Enrich: partitioning the output stream according to event properties

Before R93, it was only possible to use user_ipaddress as a partition key for the enriched event stream emitted by Stream Enrich. This release extends the realm of possibilities by introducing the streams.out.partitionKey configuration setting, which lets you choose the event property used to partition the output stream of Stream Enrich.

The available properties have been selected based on their fitness as a partition key (i.e. good distribution and usefulness):

- event_id
- event_fingerprint
- domain_userid
- network_userid
- user_ipaddress
- domain_sessionid
- user_fingerprint

If none of these are used, a random UUID will be generated for each event as partition key.

As a reminder, in Kinesis and Kafka, two events having the same partition key are guaranteed to end up in the same shard or partition respectively.
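
For example, to group events by user rather than by IP address, you could partition on domain_userid (an illustrative choice; any of the properties listed above can be used the same way):

```
enrich {
  streams {
    out {
      partitionKey = domain_userid  # events sharing a domain_userid land in the same shard/partition
    }
  }
}
```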

6. Kafka improvements

Improvements have also been made regarding how both the Scala Stream Collector and Stream Enrich interact with Kafka. In particular:

- a new retries setting in the kafka section of both applications’ configurations controls how many times a failed write to Kafka is retried before giving up (it defaults to 0, preserving the old behavior), as sketched below;
- the Kafka source and sink settings have been folded into the reorganized streams section of each configuration, mirroring their Kinesis counterparts.
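
A minimal, hedged example of bumping the retry count (the broker address is illustrative):

```
streams {
  kafka {
    brokers = "localhost:9092"  # illustrative broker list
    retries = 3                 # retry failed Kafka writes up to 3 times (default: 0)
  }
}
```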

7. Under the hood

This release also includes a big set of other updates as part of the modernization effort around the real-time pipeline, most notably:

- the Scala Stream Collector has been migrated from Spray to akka-http, which is why the spray.can.server configuration section has been replaced by akka.http.server (see the upgrade section below);
- both the Scala Stream Collector and Stream Enrich are now distributed as regular, non-executable JAR files, which changes how they are launched (again, see below).

8. Upgrading

8.1 Scala Stream Collector

The latest version of the Scala Stream Collector is available from our Bintray.

8.1.1 Updating the configuration

```
collector {
  cookieBounce {                      # NEW
    enabled = false
    name = "n3pc"
    fallbackNetworkUserId = "00000000-0000-4000-A000-000000000000"
  }

  sink = kinesis                      # WAS sink.enabled

  streams {                           # REORGANIZED
    good = good-stream
    bad = bad-stream

    kinesis {
      // ...
    }

    kafka {
      // ...
      retries = 0                     # NEW
    }
  }
}

akka {
  http.server {                       # WAS spray.can.server
    // ...
  }
}
```

For a complete example, see our sample config.hocon template.

8.1.2 Launching

The Scala Stream Collector is no longer an executable JAR file. As a result, it has to be launched as:

```
java -jar snowplow-stream-collector-0.10.0.jar --config config.hocon
```

8.2 Stream Enrich

The latest version of Stream Enrich is available from our Bintray.

8.2.1 Updating the configuration

```
enrich {
  // ...

  streams {
    // ...

    out {
      // ...
      partitionKey = user_ipaddress               # NEW
    }

    kinesis {                                     # REORGANIZED
      // ...
      initialTimestamp = "2017-05-17T10:00:00Z"   # NEW but optional
      backoffPolicy {                             # MOVED
        // ...
      }
    }

    kafka {
      // ...
      retries = 0                                 # NEW
    }
  }
}
```

For a complete example, see our sample config.hocon template.

8.2.2 Launching

Stream Enrich is no longer an executable JAR file. As a result, it will have to be launched as:

```
java -jar snowplow-stream-enrich-0.11.0.jar --config config.hocon --resolver file:resolver.json
```

Additionally, a new --force-ip-lookups-download flag has been introduced as mentioned above.
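
For example, to force a fresh database download on startup, append the flag to the launch command shown above:

```
java -jar snowplow-stream-enrich-0.11.0.jar --config config.hocon --resolver file:resolver.json --force-ip-lookups-download
```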

9. Roadmap

Upcoming Snowplow releases will include:

10. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please visit our Discourse forum.
