Snowplow R101 Neapolis released with initial GCP support


We are tremendously excited to announce the release of Snowplow R101 Neapolis. This realtime release marks the first step in our journey towards making Snowplow Google Cloud Platform-native, evolving it into a truly multi-cloud platform.

Read on for more information on R101 Neapolis, named after the archeological site in Sicily, Italy where the Greek theatre of Syracuse is located:

  1. Why bring Google Cloud Platform support to Snowplow?
  2. Adding GCP support to Stream Collector and Stream Enrich
  3. Dedicated artifacts for each platform
  4. Miscellaneous changes
  5. Upgrading
  6. Roadmap
  7. Getting help

[Image: Neapolis (Michele Ponzio, CC BY-SA 2.0)]

1. Why bring Google Cloud Platform support to Snowplow?

Historically, the Snowplow platform has been closely tied to Amazon Web Services and to a lesser extent on-premise (largely through Apache Kafka support). In order to make the platform accessible to all, it is important to make it as cloud-agnostic as possible.

This process begins with porting the realtime pipeline to Google Cloud Platform, the hugely popular public cloud offering – and this release is the first step on this journey.

The next releases on our journey to GCP will focus on porting the streaming enrichment process to Google Cloud Dataflow and making loading Snowplow events into BigQuery a reality.

For more information regarding our overall plans for Google Cloud Platform, please check out our RFC on the subject.

2. Adding GCP support to Stream Collector and Stream Enrich

To those familiar with the current Snowplow streaming pipeline architecture, this release will look straightforward: we simply take our existing components and add support for publishing and subscribing to Google Cloud PubSub topics.

Specifically, we took our Scala Stream Collector and added support for publishing raw Snowplow events to a Google Cloud PubSub topic. Similarly, we updated our Stream Enrich component to read those raw events off Google Cloud PubSub, enrich them and publish them back to another PubSub topic.

We have written dedicated guides to setting up those micro-services in the following wiki articles:

Huge thanks to our former intern, Guilherme Pires, for laying the foundations for this release.

3. Dedicated artifacts for each platform

As we go multi-cloud and support a growing number of different platforms, it is becoming increasingly important to split the different artifacts according to their targeted streaming technology, in order to:

  1. Keep the JAR sizes from getting out of hand
  2. Prevent a combinatorial explosion of different source and sink technologies requiring testing (e.g. Amazon Kinesis source to Google Cloud PubSub sink)
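To make point 2 concrete: with the five streaming technologies supported in this release, a single universal artifact would in principle face every source/sink pairing, whereas per-platform artifacts each need only their own pairing tested. A throwaway illustration of the arithmetic:

```python
from itertools import product

# The five streaming technologies targeted by this release
technologies = ["googlepubsub", "kinesis", "kafka", "nsq", "stdout"]

# A single "universal" artifact would have to be tested against every
# source/sink combination, e.g. Kinesis source to Google Cloud PubSub sink...
combinations = list(product(technologies, repeat=2))
print(len(combinations))   # 25 pairings to test

# ...whereas one artifact per platform keeps the test matrix linear.
print(len(technologies))   # 5 artifacts, 5 pairings
```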

Therefore, from this release onwards, there will be five different artifacts for the Scala Stream Collector, and five for Stream Enrich.

For the Scala Stream Collector:

JAR Targeted platform
snowplow-stream-collector-google-pubsub-version.jar Google Cloud PubSub
snowplow-stream-collector-kinesis-version.jar Amazon Kinesis
snowplow-stream-collector-kafka-version.jar Apache Kafka
snowplow-stream-collector-nsq-version.jar NSQ
snowplow-stream-collector-stdout-version.jar stdout

For Stream Enrich:

JAR Targeted platform
snowplow-stream-enrich-google-pubsub-version.jar Google Cloud PubSub
snowplow-stream-enrich-kinesis-version.jar Amazon Kinesis
snowplow-stream-enrich-kafka-version.jar Apache Kafka
snowplow-stream-enrich-nsq-version.jar NSQ
snowplow-stream-enrich-stdin-version.jar stdin/stdout

This approach reduces artifact size and simplifies testing, at the cost of some flexibility for Stream Enrich. If you were previously running a “hybrid-cloud” Stream Enrich (reading and writing to different streaming technologies), then we suggest setting up a dedicated app downstream of Stream Enrich to bridge the enriched events to the other stream system.
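The bridging pattern suggested above is essentially a forwarding loop between two stream systems. The sketch below is purely illustrative, using in-memory queues as stand-ins for the two streaming technologies; a real bridge would use the respective client SDKs (for example, a Google Cloud PubSub subscriber on one side and a Kinesis producer on the other):

```python
import queue

def bridge(source: "queue.Queue", sink: "queue.Queue") -> None:
    """Forward enriched events from one stream system to another.

    In a real deployment, `source` would be e.g. a Google Cloud PubSub
    subscription and `sink` a Kinesis stream; plain queues stand in here
    so the forwarding pattern is visible on its own.
    """
    while True:
        try:
            event = source.get(timeout=1)
        except queue.Empty:
            break  # no more events; a real bridge would keep polling
        sink.put(event)

# Demo: move three enriched events across the "bridge"
enriched_out, other_cloud = queue.Queue(), queue.Queue()
for e in ("ev1", "ev2", "ev3"):
    enriched_out.put(e)
bridge(enriched_out, other_cloud)
print(other_cloud.qsize())  # 3
```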

4. Miscellaneous changes

4.1 Exposing the number of requests made to the collector through JMX

Thanks to GitHub user jspc, the Scala Stream Collector now exposes some valuable metrics through JMX via the new MBean com.snowplowanalytics.snowplow:type=StreamCollector, containing the following attributes:

You can turn on JMX by launching the collector in the following manner:

java \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.local.only=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -jar snowplow-stream-collector-google-pubsub-0.13.0.jar --config config.hocon

For more information on setting JMX up, refer to this guide.

4.2 Upgrading to Kafka 1.0.1

We’ve taken advantage of this release to upgrade the Kafka artifacts to Kafka 1.0.1.

5. Upgrading

5.1 Scala Stream Collector

The latest version of the Scala Stream Collector is available from our Bintray here.

A complete setup guide for running the Scala Stream Collector on GCP can be found in the following guides:

5.1.1 Updating the configuration

For non-Google Cloud PubSub users, only one minor change was made, in the collector.crossDomain section: it is now non-optional and gains an enabled flag:

crossDomain {
  enabled = false
  domain = "acme.com"
  secure = true
}

However, if you want to leverage Google Cloud PubSub, you’ll need to change the collector.streams.sink section to something akin to the following:

sink {
  enabled = googlepubsub
  googleProjectId = ID
  # values are in milliseconds
  backoffPolicy {
    minBackoff = 50
    maxBackoff = 1000
    totalBackoff = 10000 # must be >= 10000
    multiplier = 2
  }
}

For a complete example, see our sample config.hocon template.

If you’re running the collector from a GCP instance in the same project, authentication will be transparently taken care of for you.

If not, you’ll need to run the following to authenticate using GCP’s CLI gcloud:

gcloud auth login
gcloud auth application-default login

Regarding backoffPolicy, if sinking a raw event to PubSub fails, the first retry will happen after minBackoff milliseconds. For the following failures, this backoff will be multiplied by multiplier each time until it reaches maxBackoff milliseconds, its cap. If the sum of the time spent backing off exceeds totalBackoff milliseconds, the application will shut down.
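As an illustration of this policy, the retry schedule implied by the sample settings (minBackoff = 50, multiplier = 2, maxBackoff = 1000, totalBackoff = 10000) can be simulated. This is a sketch of the behaviour described above, not the actual collector code, and the shutdown condition reflects one plausible reading of "exceeds totalBackoff":

```python
def backoff_schedule(min_backoff=50, max_backoff=1000,
                     total_backoff=10000, multiplier=2):
    """Return the successive retry delays (ms) until the total budget runs out."""
    delays, delay, spent = [], min_backoff, 0
    while spent + delay <= total_backoff:
        delays.append(delay)
        spent += delay
        # Each delay grows by `multiplier`, capped at max_backoff
        delay = min(delay * multiplier, max_backoff)
    return delays

print(backoff_schedule())
```

With the defaults, the delays double from 50 ms up to the 1000 ms cap, then repeat at 1000 ms until the cumulative budget of 10000 ms is spent.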

5.1.2 Launching

As explained in section 3, there is now one JAR per platform; as such, you'll need to use one of the following commands to launch the collector:

java -jar snowplow-stream-collector-google-pubsub-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-kinesis-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-kafka-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-nsq-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-stdout-0.13.0.jar --config config.hocon

5.2 Stream Enrich

The latest version of Stream Enrich is available from our Bintray here.

A complete setup guide for running Stream Enrich on GCP can be found in the following guides:

5.2.1 Updating the configuration

Configuration for Stream Enrich has been remodeled so that the source and sink must use the same platform: it is no longer possible, for example, to read events from a Kafka topic and write enriched events out to a Kinesis stream.

As such, the configuration, if you’re using Kinesis, now looks like:

enrich {
  streams {
    in { ... }          # UNCHANGED
    out { ... }         # UNCHANGED
    sourceSink {        # NEW SECTION
      enabled = kinesis
      region = eu-west-1
      aws {
        accessKey = iam
        secretKey = iam
      }
      maxRecords = 10000
      initialPosition = TRIM_HORIZON
      backoffPolicy {
        minBackoff = 50
        maxBackoff = 1000
      }
    }
    buffer { ... }      # UNCHANGED
    appName = ""        # UNCHANGED
  }
  monitoring { ... }    # UNCHANGED
}

If you want to leverage Google Cloud PubSub, it should look like the following:

enrich {
  streams {
    in { ... }          # UNCHANGED
    out { ... }         # UNCHANGED
    sourceSink {        # NEW SECTION
      enabled = googlepubsub
      googleProjectId = id
      threadPoolSize = 4
      backoffPolicy {
        minBackoff = 50
        maxBackoff = 1000
        totalBackoff = 10000 # must be >= 10000
        multiplier = 2
      }
    }
    buffer { ... }      # UNCHANGED
    appName = ""        # UNCHANGED
  }
  monitoring { ... }    # UNCHANGED
}

For a complete example, see our sample config.hocon template.

If you’re running Stream Enrich from a GCP instance in the same project, authentication will be transparently taken care of for you.

If not, you’ll need to run the following to authenticate using GCP’s CLI gcloud:

gcloud auth login
gcloud auth application-default login

Regarding backoffPolicy, if sinking an enriched event to PubSub fails, the first retry will happen after minBackoff milliseconds. For the following failures, this backoff will be multiplied by multiplier each time until it reaches maxBackoff milliseconds, its cap. If the sum of the time spent backing off exceeds totalBackoff milliseconds, the application will shut down.

threadPoolSize refers to the number of threads available to the PubSub Subscriber.

5.2.2 Launching

As with the collector, there is now one JAR per targeted platform:

java -jar snowplow-stream-enrich-google-pubsub-0.15.0.jar --config config.hocon --resolver file:iglu.json
java -jar snowplow-stream-enrich-kinesis-0.15.0.jar --config config.hocon --resolver file:iglu.json
java -jar snowplow-stream-enrich-kafka-0.15.0.jar --config config.hocon --resolver file:iglu.json
java -jar snowplow-stream-enrich-nsq-0.15.0.jar --config config.hocon --resolver file:iglu.json
java -jar snowplow-stream-enrich-stdin-0.15.0.jar --config config.hocon --resolver file:iglu.json

6. Roadmap

Upcoming Snowplow releases will include:

Furthermore, this release is only the beginning for Google Cloud Platform support in Snowplow!

As discussed in our RFC, we plan on porting our streaming enrichment process to Google Cloud Dataflow, leveraging the Apache Beam APIs (see this milestone for details). In parallel, we are also busy designing our new Snowplow event loader for BigQuery.

We look forward to your feedback as we continue to roll out and extend our GCP capabilities!

7. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please visit our Discourse forum.

