Snowplow Google Cloud Storage Loader 0.1.0 released

03 December 2018  •  Ben Fradet

We are pleased to release the first version of the Snowplow Google Cloud Storage Loader. This application reads data from a Google Pub/Sub topic and writes it to a Google Cloud Storage bucket. It is an essential component of the Snowplow for GCP stack we are launching: it lets users sink any bad data from Pub/Sub to Cloud Storage, from where it can be reprocessed, and also sink the raw or enriched data to Cloud Storage as a permanent backup.

Please read on after the fold for:

  1. An overview of the Snowplow Google Cloud Storage Loader
  2. Running the Snowplow Google Cloud Storage Loader
  3. The GCP roadmap
  4. Getting help

1. An overview of the Snowplow Google Cloud Storage Loader

The Snowplow Google Cloud Storage Loader is a Cloud Dataflow job which:

  • Consumes the contents of a Pub/Sub topic through an input subscription
  • Groups the records by a configurable time window
  • Writes the records into a Cloud Storage bucket

As part of a Snowplow installation on GCP, this loader is particularly useful to archive data from the raw, enriched, or bad streams to long-term storage.

It can additionally partition the output bucket by date (down to the hour), making it faster and less expensive to query the data over particular time periods. The following is an example layout of the output bucket:

  • gs://output-bucket/
    • 2018/
      • 11/
        • 01/
          • output-2018-11-01T15:25:00.000Z-2018-11-01T15:30:00.000Z-pane-0-last-00000-of-00001.txt
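As a sketch of how this layout comes about, the date pattern in the configured output path (e.g. YYYY/MM/dd) resolves to the window's date components. The following is illustrative only (the bucket name is assumed, and GNU date is used for parsing):

```shell
# Illustrative only: how a date pattern such as YYYY/MM/dd resolves to a
# concrete bucket path for a given window (bucket name assumed).
bucket="output-bucket"
window_start="2018-11-01T15:25:00Z"

# Extract the date components of the window start (GNU date).
day_path=$(date -u -d "${window_start}" +%Y/%m/%d)

echo "gs://${bucket}/${day_path}/"
# → gs://output-bucket/2018/11/01/
```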

Note that every part of the filename is configurable:

  • output corresponds to the filename prefix
  • 2018-11-01T15:25:00.000Z-2018-11-01T15:30:00.000Z-pane-0-last-00000-of-00001 is the shard template and can be further broken down as:
    • 2018-11-01T15:25:00.000Z-2018-11-01T15:30:00.000Z, the time window
    • pane-0-last, the pane label, where panes refer to the data actually emitted after aggregation in a time window
    • 00000-of-00001, the shard index and total number of shards respectively, where shards refer to the number of files produced per window which is also configurable
  • .txt is the filename suffix
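To make the expansion concrete, here is a minimal sketch of how the pieces above combine into the example filename. This is plain string substitution, not the actual Beam implementation:

```shell
# Sketch of how prefix + shard template + suffix assemble into a filename.
# The default shard template is -W-P-SSSSS-of-NNNNN, where W is the time
# window, P the pane label, SSSSS the shard index, and NNNNN the shard count.
prefix="output"
window="2018-11-01T15:25:00.000Z-2018-11-01T15:30:00.000Z"  # W
pane="pane-0-last"                                          # P
shard="00000"                                               # SSSSS
num_shards="00001"                                          # NNNNN
suffix=".txt"

filename="${prefix}-${window}-${pane}-${shard}-of-${num_shards}${suffix}"
echo "${filename}"
# → output-2018-11-01T15:25:00.000Z-2018-11-01T15:30:00.000Z-pane-0-last-00000-of-00001.txt
```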

If the notions of windows or panes are still relatively new to you, we recommend reading Tyler Akidau's article series on streaming, which details Cloud Dataflow's capabilities with regard to streaming and windowing.

Finally, the loader can optionally compress its output using gzip or bz2. Note that bz2-compressed data cannot be loaded directly into BigQuery.

2. Running the Snowplow Google Cloud Storage Loader

The Google Cloud Storage Loader is available as a ZIP archive, a Docker image, and a Cloud Dataflow template; choose whichever best fits your use case.

2.1 Running through the template

You can run Dataflow templates using a variety of means:

  • Using the GCP console
  • Using gcloud at the command line
  • Using the REST API

Refer to the documentation on executing templates to learn more.

Here, we provide an example using gcloud:

gcloud dataflow jobs run ${JOB_NAME} \
  --gcs-location gs://snowplow-hosted-assets/4-storage/snowplow-google-cloud-storage-loader/0.1.0/SnowplowGoogleCloudStorageLoaderTemplate-0.1.0 \
  --parameters \
inputSubscription=projects/${PROJECT}/subscriptions/${SUBSCRIPTION},\
outputDirectory=gs://${BUCKET}/YYYY/MM/dd/HH/,\
outputFilenamePrefix=output,\
shardTemplate=-W-P-SSSSS-of-NNNNN,\
outputFilenameSuffix=.txt,\
windowDuration=5,\
compression=none,\
numShards=1

The date pattern in outputDirectory (YYYY/MM/dd/HH) partitions the output by date. outputFilenamePrefix, shardTemplate, outputFilenameSuffix, windowDuration (in minutes), compression (none, gzip, or bz2), and numShards are all optional. Note that the --parameters continuation lines must not be indented, so that the comma-separated values join into a single argument.

Make sure to set all the ${} environment variables included above.
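For instance, the variables could be set as follows; all the values below are hypothetical placeholders, so substitute your own project, subscription, and bucket names:

```shell
# Hypothetical placeholder values for the ${} variables used above.
PROJECT="my-gcp-project"
SUBSCRIPTION="bad-rows-sub"
BUCKET="my-snowplow-archive"
JOB_NAME="cloud-storage-loader-1"

# The subscription must be passed in its fully-qualified form:
echo "projects/${PROJECT}/subscriptions/${SUBSCRIPTION}"
# → projects/my-gcp-project/subscriptions/bad-rows-sub
```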

2.2 Running through the zip archive

You can find the archive hosted on our Bintray.

Once unzipped, the artifact can be run as follows:

./bin/snowplow-google-cloud-storage-loader \
  --runner=DataflowRunner \
  --project=${PROJECT} \
  --streaming=true \
  --zone=europe-west2-a \
  --inputSubscription=projects/${PROJECT}/subscriptions/${SUBSCRIPTION} \
  --outputDirectory=gs://${BUCKET}/YYYY/MM/dd/HH/ \
  --outputFilenamePrefix=output \
  --shardTemplate=-W-P-SSSSS-of-NNNNN \
  --outputFilenameSuffix=.txt \
  --windowDuration=5 \
  --compression=none \
  --numShards=1

The date pattern in --outputDirectory partitions the output by date; --outputFilenamePrefix, --shardTemplate, --outputFilenameSuffix, --windowDuration (in minutes), --compression (none, gzip, or bz2), and --numShards are all optional.

Make sure to set all the ${} environment variables included above.

To display the help message:

./bin/snowplow-google-cloud-storage-loader --help

To display documentation about Cloud Storage Loader-specific options:

./bin/snowplow-google-cloud-storage-loader --help=com.snowplowanalytics.storage.googlecloudstorage.loader.Options

2.3 Running through the Docker image

You can also find the Docker image on our Bintray.

A container can be run as follows. If running outside GCP, mount a directory containing a service account key into the container and point GOOGLE_APPLICATION_CREDENTIALS at it; inside GCP, both the volume mount and the environment variable can be omitted:

docker run \
  -v ${PWD}/config:/snowplow/config \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/credentials.json \
  snowplow-docker-registry.bintray.io/snowplow/snowplow-google-cloud-storage-loader:0.1.0 \
  --runner=DataflowRunner \
  --job-name=${JOB_NAME} \
  --project=${PROJECT} \
  --streaming=true \
  --zone=${ZONE} \
  --inputSubscription=projects/${PROJECT}/subscriptions/${SUBSCRIPTION} \
  --outputDirectory=gs://${BUCKET}/YYYY/MM/dd/HH/ \
  --outputFilenamePrefix=output \
  --shardTemplate=-W-P-SSSSS-of-NNNNN \
  --outputFilenameSuffix=.txt \
  --windowDuration=5 \
  --compression=none \
  --numShards=1

As before, the date pattern in --outputDirectory partitions the output by date, and the flags from --outputFilenamePrefix through --numShards are optional.

Make sure to set all the ${} environment variables included above.

To display the help message:

docker run snowplow-docker-registry.bintray.io/snowplow/snowplow-google-cloud-storage-loader:0.1.0 \
  --help

To display documentation about Cloud Storage Loader-specific options:

docker run snowplow-docker-registry.bintray.io/snowplow/snowplow-google-cloud-storage-loader:0.1.0 \
  --help=com.snowplowanalytics.storage.googlecloudstorage.loader.Options

2.4 Additional options

A full list of all the Beam CLI options can be found in the documentation for Google Cloud Dataflow.

3. GCP roadmap

For those of you following our Google Cloud Platform progress, make sure you read our official GCP launch announcement and our BigQuery Loader release post.

We have further GCP releases planned for the near future, so keep an eye on this blog for announcements.

4. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problem, please visit our Discourse forum.