Snowplow R101 Neapolis released with initial GCP support

20 March 2018  •  Ben Fradet

We are tremendously excited to announce the release of Snowplow R101 Neapolis. This realtime release marks the first step in our journey towards making Snowplow Google Cloud Platform-native, evolving it into a truly multi-cloud platform.

Read on for more information on R101 Neapolis, named after the archeological site in Sicily, Italy where the Greek theatre of Syracuse is located:

  1. Why bring Google Cloud Platform support to Snowplow?
  2. Adding GCP support to Stream Collector and Stream Enrich
  3. Dedicated artifacts for each platform
  4. Miscellaneous changes
  5. Upgrading
  6. Roadmap
  7. Getting help


1. Why bring Google Cloud Platform support to Snowplow?

Historically, the Snowplow platform has been closely tied to Amazon Web Services and, to a lesser extent, on-premise deployments (largely through Apache Kafka support). To make the platform accessible to all, it is important to make it as cloud-agnostic as possible.

This process begins with porting the realtime pipeline to Google Cloud Platform, the hugely popular public cloud offering; this release is the first step on that journey.

The next releases on our journey to GCP will focus on porting the streaming enrichment process to Google Cloud Dataflow and making loading Snowplow events into BigQuery a reality.

For more information regarding our overall plans for Google Cloud Platform, please check out our RFC on the subject.

2. Adding GCP support to Stream Collector and Stream Enrich

To those familiar with the current Snowplow streaming pipeline architecture, this release will look straightforward: we simply take our existing components and add support for publishing and subscribing to Google Cloud PubSub topics.

Specifically, we took our Scala Stream Collector and added support for publishing raw Snowplow events to a Google Cloud PubSub topic. Similarly, we updated our Stream Enrich component to read those raw events off Google Cloud PubSub, enrich them and publish them back to another PubSub topic.
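Conceptually, the two components now form a publish/subscribe chain over PubSub topics. Here is a toy, in-memory sketch of that flow, with plain Python lists standing in for the actual PubSub topics and client libraries (the function names are illustrative, not the real APIs):

```python
# Toy in-memory stand-ins for the raw and enriched PubSub topics.
raw_topic: list[bytes] = []
enriched_topic: list[bytes] = []

def collect(event: bytes) -> None:
    # The collector publishes raw Snowplow events to the raw topic.
    raw_topic.append(event)

def enrich_loop() -> None:
    # Stream Enrich reads raw events off the raw topic, enriches them,
    # and publishes them back to the enriched topic.
    while raw_topic:
        raw = raw_topic.pop(0)
        enriched_topic.append(raw + b" [enriched]")
```

In the real deployment, each arrow in this chain is a PubSub topic that either component can be scaled against independently.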

We have written dedicated guides to setting up those micro-services in the following wiki articles:

Huge thanks to our former intern, Guilherme Pires, for laying the foundations for this release.

3. Dedicated artifacts for each platform

As we go multi-cloud and support a growing number of different platforms, it is becoming increasingly important to split the different artifacts according to their targeted streaming technology, in order to:

  1. Keep the JAR sizes from getting out of hand
  2. Prevent a combinatorial explosion of different source and sink technologies requiring testing (e.g. Amazon Kinesis source to Google Cloud PubSub sink)

Therefore, from this release onwards, there will be five different artifacts for the Scala Stream Collector, and five for Stream Enrich.

For the Scala Stream Collector:

JAR                                                  Targeted platform
snowplow-stream-collector-google-pubsub-version.jar  Google Cloud PubSub
snowplow-stream-collector-kinesis-version.jar        Amazon Kinesis
snowplow-stream-collector-kafka-version.jar          Apache Kafka
snowplow-stream-collector-nsq-version.jar            NSQ
snowplow-stream-collector-stdout-version.jar         stdout

For Stream Enrich:

JAR                                               Targeted platform
snowplow-stream-enrich-google-pubsub-version.jar  Google Cloud PubSub
snowplow-stream-enrich-kinesis-version.jar        Amazon Kinesis
snowplow-stream-enrich-kafka-version.jar          Apache Kafka
snowplow-stream-enrich-nsq-version.jar            NSQ
snowplow-stream-enrich-stdin-version.jar          stdin/stdout

This approach reduces artifact size and simplifies testing, at the cost of some flexibility for Stream Enrich. If you were previously running a “hybrid-cloud” Stream Enrich (reading and writing to different streaming technologies), then we suggest setting up a dedicated app downstream of Stream Enrich to bridge the enriched events to the other stream system.
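Such a bridge boils down to a simple read/forward loop. A minimal sketch, where read_batch and write_batch are hypothetical stand-ins for the platform-specific clients (e.g. a Kinesis consumer and a PubSub publisher):

```python
from typing import Callable, List

def bridge(read_batch: Callable[[], List[bytes]],
           write_batch: Callable[[List[bytes]], None]) -> None:
    """Forward enriched events from one stream technology to another.

    read_batch and write_batch are hypothetical callables wrapping the
    platform-specific clients; in this sketch an empty batch signals
    the end of input.
    """
    while True:
        events = read_batch()
        if not events:
            break
        write_batch(events)
```

A production bridge would additionally need checkpointing and retry handling appropriate to the source and destination stream technologies.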

4. Miscellaneous changes

4.1 Exposing the number of requests made to the collector through JMX

Thanks to GitHub user jspc, the Scala Stream Collector now exposes some valuable metrics through JMX via the new MBean com.snowplowanalytics.snowplow:type=StreamCollector, containing the following attributes:

  • Requests: total number of requests
  • SuccessfulRequests: total number of successful requests
  • FailedRequests: total number of failed requests

You can turn on JMX by launching the collector in the following manner:

java \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.local.only=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -jar snowplow-stream-collector-google-pubsub-0.13.0.jar --config config.hocon

For more information on setting JMX up, refer to this guide.

4.2 Upgrading to Kafka 1.0.1

We’ve taken advantage of this release to upgrade the Kafka artifacts to Kafka 1.0.1.

5. Upgrading

5.1 Scala Stream Collector

The latest version of the Scala Stream Collector is available from our Bintray here.

A complete setup guide for running the Scala Stream Collector on GCP can be found in the following guides:

5.1.1 Updating the configuration

For non-Google Cloud PubSub users, there is only one minor change, in the collector.crossDomain section: it is now non-optional, but gains an enabled flag:

crossDomain {
  enabled = false
  domain = "acme.com"
  secure = true
}

However, if you want to leverage Google Cloud PubSub, you’ll need to change the collector.streams.sink section to something akin to the following:

sink {
  enabled = googlepubsub
  googleProjectId = ID
  # values are in milliseconds
  backoffPolicy {
    minBackoff = 50
    maxBackoff = 1000
    totalBackoff = 10000 # must be >= 10000
    multiplier = 2
  }
}

For a complete example, see our sample config.hocon template.

If you’re running the collector from a GCP instance in the same project, authentication will be transparently taken care of for you.

If not, you’ll need to run the following to authenticate using GCP’s CLI gcloud:

gcloud auth login
gcloud auth application-default login

Regarding backoffPolicy, if sinking a raw event to PubSub fails, the first retry will happen after minBackoff milliseconds. For the following failures, this backoff will be multiplied by multiplier each time until it reaches maxBackoff milliseconds, its cap. If the sum of the time spent backing off exceeds totalBackoff milliseconds, the application will shut down.
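Under the sample settings above (minBackoff = 50, maxBackoff = 1000, totalBackoff = 10000, multiplier = 2), the implied retry schedule can be sketched as follows. This is an illustration of the stated policy, not the collector's actual code:

```python
def backoff_schedule(min_backoff: int = 50, max_backoff: int = 1000,
                     total_backoff: int = 10000, multiplier: int = 2) -> list[int]:
    """Return the retry delays (in ms) until the totalBackoff budget is spent."""
    delays = []
    delay = min_backoff
    spent = 0
    while spent + delay <= total_backoff:
        delays.append(delay)
        spent += delay
        # Each subsequent delay is multiplied, capped at maxBackoff.
        delay = min(delay * multiplier, max_backoff)
    return delays
```

With the defaults above this yields delays of 50, 100, 200, 400 and 800 ms, then 1000 ms retries until roughly 10 seconds have been spent backing off, after which the application shuts down.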

5.1.2 Launching

As explained in section 3, there is now one JAR per platform, so you'll need to use one of the following commands to launch the collector:

java -jar snowplow-stream-collector-google-pubsub-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-kinesis-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-kafka-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-nsq-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-stdout-0.13.0.jar --config config.hocon

5.2 Stream Enrich

The latest version of Stream Enrich is available from our Bintray here.

A complete setup guide for running Stream Enrich on GCP can be found in the following guides:

5.2.1 Updating the configuration

Configuration for Stream Enrich has been remodeled in order to only allow a source and a sink on the same platform and, for example, disallow reading events from a Kafka topic and writing out enriched events to a Kinesis stream.

As such, the configuration, if you’re using Kinesis, now looks like:

enrich {
  streams {
    in { ... }                         # UNCHANGED
    out { ... }                        # UNCHANGED
    sourceSink {                       # NEW SECTION
      enabled = kinesis
      region = eu-west-1
      aws {
        accessKey = iam
        secretKey = iam
      }
      maxRecords = 10000
      initialPosition = TRIM_HORIZON
      backoffPolicy {
        minBackoff = 50
        maxBackoff = 1000
      }
    }
    buffer { ... }                     # UNCHANGED
    appName = ""                       # UNCHANGED
  }
  monitoring { ... }                   # UNCHANGED
}

If you want to leverage Google Cloud PubSub, it should look like the following:

enrich {
  streams {
    in { ... }                         # UNCHANGED
    out { ... }                        # UNCHANGED
    sourceSink {                       # NEW SECTION
      enabled = googlepubsub
      googleProjectId = id
      threadPoolSize = 4
      backoffPolicy {
        minBackoff = 50
        maxBackoff = 1000
        totalBackoff = 10000 # must be >= 10000
        multiplier = 2
      }
    }
    buffer { ... }                     # UNCHANGED
    appName = ""                       # UNCHANGED
  }
  monitoring { ... }                   # UNCHANGED
}

For a complete example, see our sample config.hocon template.

If you’re running Stream Enrich from a GCP instance in the same project, authentication will be transparently taken care of for you.

If not, you’ll need to run the following to authenticate using GCP’s CLI gcloud:

gcloud auth login
gcloud auth application-default login

Regarding backoffPolicy, if sinking an enriched event to PubSub fails, the first retry will happen after minBackoff milliseconds. For the following failures, this backoff will be multiplied by multiplier each time until it reaches maxBackoff milliseconds, its cap. If the sum of the time spent backing off exceeds totalBackoff milliseconds, the application will shut down.

threadPoolSize refers to the number of threads available to the PubSub Subscriber.

5.2.2 Launching

Same as for the collector, there is now one JAR per targeted platform:

java -jar snowplow-stream-enrich-google-pubsub-0.15.0.jar --config config.hocon --resolver file:iglu.json
java -jar snowplow-stream-enrich-kinesis-0.15.0.jar --config config.hocon --resolver file:iglu.json
java -jar snowplow-stream-enrich-kafka-0.15.0.jar --config config.hocon --resolver file:iglu.json
java -jar snowplow-stream-enrich-nsq-0.15.0.jar --config config.hocon --resolver file:iglu.json
java -jar snowplow-stream-enrich-stdin-0.15.0.jar --config config.hocon --resolver file:iglu.json

6. Roadmap

Upcoming Snowplow releases will include:

Furthermore, this release is only the beginning for Google Cloud Platform support in Snowplow!

As discussed in our RFC, we plan on porting our streaming enrichment process to Google Cloud Dataflow, leveraging the Apache Beam APIs (see this milestone for details). In parallel, we are also busy designing our new Snowplow event loader for BigQuery.

We look forward to your feedback as we continue to roll out and extend our GCP capabilities!

7. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please visit our Discourse forum.