Snowplow Google Cloud Storage Loader 0.1.0 released

We are pleased to release the first version of the Snowplow Google Cloud Storage Loader. This application reads data from a Google Pub/Sub topic and writes it to a Google Cloud Storage bucket. It is an essential component of the Snowplow for GCP stack we are launching: it enables users to sink any bad data from Pub/Sub to Cloud Storage, from where it can be reprocessed, as well as to sink the raw or enriched data to Cloud Storage as a permanent backup.

Please read on after the fold for:

  1. An overview of the Snowplow Google Cloud Storage Loader
  2. Running the Snowplow Google Cloud Storage Loader
  3. The GCP roadmap
  4. Getting help

1. An overview of the Snowplow Google Cloud Storage Loader

The Snowplow Google Cloud Storage Loader is a Cloud Dataflow job which:

  - reads data from a Google Pub/Sub subscription
  - writes it to a Google Cloud Storage bucket

As part of a Snowplow installation on GCP, this loader is particularly useful to archive data from the raw, enriched, or bad streams to long-term storage.

It can additionally partition the output bucket by date (down to the hour), making it faster and less expensive to query the data over particular time periods. The following is an example layout of the output bucket:
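
This sketch assumes the default prefix (output), shard template, and .txt suffix used in the examples below, with a single shard per window; the bucket name, dates, and the <window> and <pane> markers are placeholders rather than literal output:

gs://${BUCKET}/
    2018/
        11/
            01/
                18/
                    output-<window>-<pane>-00000-of-00001.txt
                19/
                    output-<window>-<pane>-00000-of-00001.txt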

Note that every part of the filename is configurable: the prefix (outputFilenamePrefix), the shard template (shardTemplate), and the suffix (outputFilenameSuffix).

If the notions of windows or panes are still relatively new to you, we recommend Tyler Akidau's article series The world beyond batch (Streaming 101 and Streaming 102), which details Cloud Dataflow's capabilities with regard to streaming and windowing.

Finally, the loader can optionally compress its output using gzip or bz2. Note that bz2-compressed data cannot be loaded directly into BigQuery.

2. Running the Snowplow Google Cloud Storage Loader

The Google Cloud Storage Loader comes as a ZIP archive, a Docker image, or a Cloud Dataflow template; feel free to choose the one that best fits your use case.

2.1 Running through the template

You can run Dataflow templates in several ways: from the Google Cloud Console, with the gcloud command-line tool, or through the REST API.

Refer to the documentation on executing templates to learn more.

Here, we provide an example using gcloud:

gcloud dataflow jobs run ${JOB_NAME} \
  --gcs-location gs://snowplow-hosted-assets/4-storage/snowplow-google-cloud-storage-loader/0.1.0/SnowplowGoogleCloudStorageLoaderTemplate-0.1.0 \
  --parameters \
    inputSubscription=projects/${PROJECT}/subscriptions/${SUBSCRIPTION},\
    outputDirectory=gs://${BUCKET}/YYYY/MM/dd/HH/,\     # partitions by date
    outputFilenamePrefix=output,\                       # optional
    shardTemplate=-W-P-SSSSS-of-NNNNN,\                 # optional
    outputFilenameSuffix=.txt,\                         # optional
    windowDuration=5,\                                  # optional, in minutes
    compression=none,\                                  # optional, gzip, bz2 or none
    numShards=1                                         # optional

Make sure to set all the ${} environment variables included above.
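
As a minimal example, you might export these variables before launching the job; the values below are placeholders for your own project, subscription, and bucket names, not defaults shipped with the loader:

export JOB_NAME=snowplow-gcs-loader      # any Dataflow job name you like
export PROJECT=my-gcp-project            # your GCP project ID
export SUBSCRIPTION=bad-1                # the Pub/Sub subscription to read from
export BUCKET=my-archive-bucket          # the Cloud Storage bucket to write to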

2.2 Running through the zip archive

You can find the archive hosted on our Bintray.

Once unzipped, the artifact can be run as follows:

./bin/snowplow-google-cloud-storage-loader \
  --runner=DataFlowRunner \
  --project=${PROJECT} \
  --streaming=true \
  --zone=europe-west2-a \
  --inputSubscription=projects/${PROJECT}/subscriptions/${SUBSCRIPTION} \
  --outputDirectory=gs://${BUCKET}/YYYY/MM/dd/HH/ \    # partitions by date
  --outputFilenamePrefix=output \                      # optional
  --shardTemplate=-W-P-SSSSS-of-NNNNN \                # optional
  --outputFilenameSuffix=.txt \                        # optional
  --windowDuration=5 \                                 # optional, in minutes
  --compression=none \                                 # optional, gzip, bz2 or none
  --numShards=1                                        # optional

Make sure to set all the ${} environment variables included above.

To display the help message:

./bin/snowplow-google-cloud-storage-loader --help

To display documentation about Cloud Storage Loader-specific options:

./bin/snowplow-google-cloud-storage-loader --help=com.snowplowanalytics.storage.googlecloudstorage.loader.Options

2.3 Running through the Docker image

You can also find the Docker image on our Bintray.

A container can be run as follows:

docker run \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/credentials.json \    # if running outside GCP
  snowplow-docker-registry.bintray.io/snowplow/snowplow-google-cloud-storage-loader:0.1.0 \
  --runner=DataFlowRunner \
  --job-name=${JOB_NAME} \
  --project=${PROJECT} \
  --streaming=true \
  --zone=${ZONE} \
  --inputSubscription=projects/${PROJECT}/subscriptions/${SUBSCRIPTION} \
  --outputDirectory=gs://${BUCKET}/YYYY/MM/dd/HH/ \    # partitions by date
  --outputFilenamePrefix=output \                      # optional
  --shardTemplate=-W-P-SSSSS-of-NNNNN \                # optional
  --outputFilenameSuffix=.txt \                        # optional
  --windowDuration=5 \                                 # optional, in minutes
  --compression=none \                                 # optional, gzip, bz2 or none
  --numShards=1                                        # optional

Make sure to set all the ${} environment variables included above.
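
Note that when running outside GCP, the credentials file referenced by GOOGLE_APPLICATION_CREDENTIALS must also exist inside the container. One way to achieve this is to mount it as a volume, as in this sketch (the local path is only an example):

# mount the service account key at the path referenced by GOOGLE_APPLICATION_CREDENTIALS
docker run \
  -v /path/to/credentials.json:/snowplow/config/credentials.json \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/credentials.json \
  snowplow-docker-registry.bintray.io/snowplow/snowplow-google-cloud-storage-loader:0.1.0 \
  --help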

To display the help message:

docker run snowplow-docker-registry.bintray.io/snowplow/snowplow-google-cloud-storage-loader:0.1.0 --help

To display documentation about Cloud Storage Loader-specific options:

docker run snowplow-docker-registry.bintray.io/snowplow/snowplow-google-cloud-storage-loader:0.1.0 --help=com.snowplowanalytics.storage.googlecloudstorage.loader.Options

2.4 Additional options

A full list of all the Beam CLI options can be found in the documentation for Google Cloud Dataflow.
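
For instance, the zip archive invocation from section 2.2 could be extended with some of the standard Dataflow worker options; workerMachineType, maxNumWorkers, and diskSizeGb are generic Beam/Dataflow options rather than loader-specific ones, and the values here are purely illustrative:

# same required options as in section 2.2, plus Dataflow worker tuning options
./bin/snowplow-google-cloud-storage-loader \
  --runner=DataFlowRunner \
  --project=${PROJECT} \
  --streaming=true \
  --zone=europe-west2-a \
  --inputSubscription=projects/${PROJECT}/subscriptions/${SUBSCRIPTION} \
  --outputDirectory=gs://${BUCKET}/YYYY/MM/dd/HH/ \
  --workerMachineType=n1-standard-2 \
  --maxNumWorkers=4 \
  --diskSizeGb=30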

3. GCP roadmap

For those of you following our Google Cloud Platform progress, make sure you read our official GCP launch announcement and our BigQuery Loader release post.

We plan to release further components of the GCP stack shortly.

4. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problem, please visit our Discourse forum.
