Snowplow 93 Virunum released

03 October 2017  •  Ben Fradet

We are tremendously excited to announce the release of Snowplow 93 Virunum.

This release focuses on a much needed refresh of the real-time pipeline components: the Scala Stream Collector as well as Stream Enrich. It also fixes some long-standing annoyances regarding the Scala Stream Collector.

If you’d like to know more about R93 Virunum, named after the ancient Roman city in Austria, please read on after the fold:

  1. Scala Stream Collector: detecting blocked third-party cookies through cookie bounce
  2. Scala Stream Collector: relaxing the parsing of HTTP request components
  3. Stream Enrich: enriching events from a certain point in time with AT_TIMESTAMP
  4. Stream Enrich: forcing the download of the ip lookup database
  5. Stream Enrich: partitionining the output stream according to event properties
  6. Kafka improvements
  7. Under the hood
  8. Upgrading
  9. Roadmap
  10. Help

virunum

Following on from Christoph Buente’s RFC, the Scala Stream Collector now provides a mechanism to test if third-party cookies are blocked and reacts appropriately. Huge thanks to Christoph and the team at LiveIntent for contributing this sophisticated new feature.

Simply put, the new “cookie bounce” mechanism:

  • Checks if the cookie named according to the value of the cookie.name configuration is present
    • if it is present, uses it and processes the request
    • if not, issues a redirect to itself with a Set-Cookie header
      • if the cookie is still missing, then we infer that third-party cookies are not allowed, and the Scala Stream Collector processes the request with the placeholder network user id defined in cookieBounce.fallbackNetworkUserId
      • if the cookie is now present, third-party cookies are allowed, and the Scala Stream Collector processes the request with the cookie value

To enable this feature, you can change the cookieBounce.enabled configuration to true.

Be careful though: the redirects mentioned above can significantly increase the number of requests that your collectors have to handle.

2. Scala Stream Collector: relaxing the parsing of HTTP request components

The Scala Stream Collector was previously too restrictive when it came to parsing elements of an HTTP request, and would reject certain events despite their intrinsic correctness, most notably due to:

  • Query string parameters with non-url-encoded reserved characters in their values (#3272)
  • Useragents not conforming to RFC 2616 (#2970)

Those shortcomings have been fixed in the new version of the collector as part of our ongoing focus on removing any possible data loss scenarios across the pipeline.

Additionally, the enrich stream processing application won’t reject events for which page_url contains more than one # characters (#2893).

3. Stream Enrich: enriching events from a certain point in time with `AT_TIMESTAMP`

If you are using Kinesis with Stream Enrich, you previously had two choices when it came to enriching your raw event stream:

  • Starting from the beginning through TRIM_HORIZON
  • Starting from the latest message with LATEST

With R93, you are now able to consume your raw event stream from an arbitrary point in time by specifying AT_TIMESTAMP as the streams.kinesis.initialPosition configuration setting. Additionally, you’ll need to specify an actual timestamp in streams.kinesis.initialTimestamp.

4. Stream Enrich: forcing the download of the MaxMind IP lookups database

Before R93, when you launched Stream Enrich with the IP lookups enrichment, the MaxMind IP lookups database was downloaded locally and, if you were to launch it later it would reuse this local cache of the database.

R93 introduces a command line argument --force-ip-lookups-download to download a new version of the ip lookup database every time that Stream Enrich is launched.

There are plans to introduce a time-to-live for this database and re-download it while Stream Enrich is running in issue #3407.

5. Stream Enrich: partitioning the output stream according to event properties

Before R93, it was only possible to use user_ipaddress as a partition key for the enriched event stream emitted by Stream Enrich. This release extends the realm of possibilities by introducing thestreams.out.partitionKey configuration setting, which lets you specify which event property to use to partition the output stream of Stream Enrich.

The available properties have been selected based on their fitness as a partition key (i.e. good distribution and usefulness):

  • domain_userid
  • network_userid
  • domain_sessionid
  • user_ipaddress
  • event_id
  • event_fingerprint
  • user_fingerprint

If none of these are used, a random UUID will be generated for each event as partition key.

As a reminder, in Kinesis and Kafka, two events having the same partition key are guaranteed to end up in the same shard or partition respectively.

6. Kafka improvements

Improvements have also been made regarding how both the Scala Stream Collector and Stream Enrich interact with Kafka. In particular:

  • This release exposes the streams.kafka.retries configuration for both the Scala Stream Collector and Stream Enrich, allowing the Kafka producer to resend any record which failed being sent the specified number of times
  • Prior to this release, the streams.buffer.byteLimit setting was used as the size of the batch being sent to Kafka, which didn’t make a lot of sense. It now corresponds to the quantity of memory the Kafka producer can use to buffer records before sending them
  • Finally, this release makes use of the callback-based API to notify of errors when producing messages to a Kafka topic

7. Under the hood

This release also includes a big set of other updates which are part of the modernization effort around the realtime pipeline, most notably:

8. Upgrading

8.1 Scala Stream Collector

The latest version of the Scala Stream Collector is available from our Bintray here.

8.1.1 Updating the configuration

collector {
  cookieBounce {                                                   # NEW
    enabled = false
    name = "n3pc"
    fallbackNetworkUserId = "00000000-0000-4000-A000-000000000000"
  }

  sink = kinesis                                                   # WAS sink.enabled

  streams {                                                        # REORGANIZED
    good = good-stream
    bad = bad-stream

    kinesis {
      // ...
    }

    kafka {
      // ...
      retries = 0                                                  # NEW
    }
  }
}

akka {
  http.server {                                                    # WAS spray.can.server
    // ...
  }
}

For a complete example, see our sample config.hocon template.

8.1.2 Launching

The Scala Stream Collector is no longer an executable JAR file. As a result, it has to be launched as:

java -jar snowplow-stream-collector-0.10.0.jar --config config.hocon

8.2 Stream Enrich

The latest version of Stream Enrich is available from our Bintray here.

8.2.1 Updating the configuration

enrich {
  // ...
  streams {
    // ...
    out {
      // ...
      partitionKey = user_ipaddress             # NEW
    }

    kinesis {                                   # REORGANIZED
      // ...
      initialTimestamp = "2017-05-17T10:00:00Z" # NEW but optional
      backoffPolicy {                           # MOVED
        // ...
      }
    }

    kafka {
      // ...
      retries = 0                               # NEW
    }
  }
}

For a complete example, see our sample config.hocon template.

8.2.2 Launching

Stream Enrich is no longer an executable JAR file. As a result, it will have to be launched as:

java -jar snowplow-stream-enrich-0.11.0.jar --config config.hocon --resolver file:resolver.json

Additionally, a new --force-ip-lookups-download flag has been introduced as mentioned above.

9. Roadmap

Upcoming Snowplow releases will include:

  • R94 [BAT] Ellora, enhancing our Redshift event storage with ZSTD encoding, plus various bug fixes for the batch pipeline
  • R95 [STR] Zeugma, which will add support for NSQ to the stream processing pipeline, ready for adoption in Snowplow Mini
  • R9x [STR] Priority fixes, removing the potential for data loss in the stream processing pipeline
  • R9x [BAT] 4 webhooks, which will add support for 4 new webhooks (Mailgun, Olark, Unbounce, StatusGator)

10. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problem, please visit our Discourse forum.