Snowplow Python Analytics SDK 0.2.0 released

We are pleased to announce the 0.2.0 release of the Snowplow Python Analytics SDK, a library providing tools to process and analyze Snowplow enriched events in Python-compatible data processing frameworks such as Apache Spark and AWS Lambda.

This release adds new run manifest functionality, along with a few internal changes.

In the rest of this post we will cover:

  1. Run manifests
  2. Using the run manifest
  3. Documentation
  4. Other changes
  5. Upgrading
  6. Getting help

1. Run manifests

This release provides tooling for maintaining a Snowplow run manifest. A Snowplow run manifest is a simple and robust way to track your job’s progress in processing the enriched events generated by multiple Snowplow pipeline runs.

Historically, Snowplow’s EmrEtlRunner and StorageLoader apps have moved whole folders of data around different locations in Amazon S3 in order to track progress through a pipeline run, and to avoid accidentally reprocessing that data. But file moves are quite problematic:

  1. They are time-consuming
  2. They are network-intensive
  3. They are error-prone - a failure to move a file will cause the job to fail and require manual intervention
  4. They only support one use-case at a time - you can’t have two distinct jobs moving the same files at the same time

Although Snowplow continues to use file moves (for now), it is better to use a run manifest for your own data processing jobs on Snowplow data. The idea of a manifest comes from the old naval term:

"a list of the cargo carried by a ship, made for the use of various agents and officials at the ports of destination"

In this case, we store our manifest in an AWS DynamoDB table, and we use it to keep track of which Snowplow runs our job has already processed.
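To make the idea concrete, here is a minimal in-memory sketch. It is not the SDK's implementation: a plain Python set stands in for the DynamoDB table, and the run IDs and the `pending_runs` helper are hypothetical. It shows how two jobs, each with its own manifest, can work through the same archive without moving any files.

```python
# Illustrative only: a Python set stands in for the DynamoDB-backed manifest.
# The run IDs and the helper name below are made up for this sketch.

def pending_runs(all_runs, manifest):
    """Return the runs not yet recorded in this job's manifest, in order."""
    return [run_id for run_id in all_runs if run_id not in manifest]


all_runs = [
    "enriched-archive/run=2017-04-10-01-00-00/",
    "enriched-archive/run=2017-04-11-01-00-00/",
    "enriched-archive/run=2017-04-12-01-00-00/",
]

# Two independent jobs, each with its own manifest, read the same data in place.
reporting_manifest = {"enriched-archive/run=2017-04-10-01-00-00/"}
ml_manifest = set()

reporting_todo = pending_runs(all_runs, reporting_manifest)
ml_todo = pending_runs(all_runs, ml_manifest)
```

Because each job only consults its own manifest, neither job's progress affects the other, which is the property that file moves cannot offer.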

2. Using the run manifest

The run manifest functionality resides in the new snowplow_analytics_sdk.run_manifests module.

Here’s a short usage example:

from boto3 import client
from snowplow_analytics_sdk.run_manifests import RunManifests, list_runids

# AWS clients: S3 to list Snowplow runs, DynamoDB to store the manifest
s3 = client('s3')
dynamodb = client('dynamodb')

dynamodb_run_manifests_table = 'snowplow-run-manifests'
enriched_events_archive = 's3://acme-snowplow-data/storage/enriched-archive/'
run_manifests = RunManifests(dynamodb, dynamodb_run_manifests_table)

# Creates the DynamoDB table; call this only once, not on every job run
run_manifests.create()

# process is your own function that handles a single Snowplow run
for run_id in list_runids(s3, enriched_events_archive):
    if not run_manifests.contains(run_id):
        process(run_id)
        # Record the run only after it has been processed
        run_manifests.add(run_id)

In the example above, we create two AWS service clients: one for S3 (to list job runs) and one for DynamoDB (to access our manifest). These clients are provided by boto3, the AWS SDK for Python, and can be initialized with static credentials or with system-provided credentials.

We then list all Snowplow runs in a particular S3 path and process (with the user-provided process function) only those runs that have not been processed previously. Note that run_id is just a simple string containing the S3 key of a particular job run.
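For readers curious about what listing runs involves, the following is a rough sketch of one plausible approach, not the SDK's list_runids source. The function names `split_s3_path` and `list_run_folders` are hypothetical, and the assumption that run folders are named `run=...` is ours; the exact form of the IDs the SDK returns may differ. The only boto3 call used is the S3 client's `list_objects_v2` with a `Delimiter`, which groups keys into folder-like common prefixes.

```python
# Sketch only: one plausible way to discover run folders under an S3 path.
# `split_s3_path` and `list_run_folders` are hypothetical names, not SDK functions.

def split_s3_path(path):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    for scheme in ("s3://", "s3n://", "s3a://"):
        if path.startswith(scheme):
            path = path[len(scheme):]
            break
    bucket, _, prefix = path.partition("/")
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return bucket, prefix


def list_run_folders(s3_client, path):
    """Return the keys of the 'run=...' folders directly under `path`."""
    bucket, prefix = split_s3_path(path)
    run_ids = []
    kwargs = {"Bucket": bucket, "Prefix": prefix, "Delimiter": "/"}
    while True:
        response = s3_client.list_objects_v2(**kwargs)
        for common_prefix in response.get("CommonPrefixes", []):
            key = common_prefix["Prefix"]
            # Keep only folders whose last path segment looks like a run folder
            # (assumed naming convention: 'run=<timestamp>').
            if key.rstrip("/").split("/")[-1].startswith("run="):
                run_ids.append(key)
        if not response.get("IsTruncated"):
            return run_ids
        # More results are available: continue from where this page ended.
        kwargs["ContinuationToken"] = response["NextContinuationToken"]
```

Listing with a delimiter keeps the call cheap, because S3 returns one entry per folder rather than one per enriched event file.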

The RunManifests class, then, is a simple API wrapper around DynamoDB, which lets you:

  • create a DynamoDB table for manifests
  • add a Snowplow run to the table
  • check whether the table contains a given run ID
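The three operations above map naturally onto three DynamoDB calls. The sketch below is an illustration of that mapping, not the SDK's RunManifests source: the class name `SketchRunManifests`, the `RunId` attribute name, and the throughput values are all assumptions. The boto3 client methods it relies on (`create_table`, `get_waiter`, `put_item`, `get_item`) are standard DynamoDB client calls.

```python
# Sketch only: NOT the SDK's RunManifests source. The class name, the 'RunId'
# attribute, and the throughput values are assumptions made for illustration.

class SketchRunManifests:
    def __init__(self, dynamodb_client, table_name):
        self.dynamodb = dynamodb_client
        self.table_name = table_name

    def create(self):
        """Create the manifest table, keyed on the run ID string."""
        self.dynamodb.create_table(
            TableName=self.table_name,
            AttributeDefinitions=[
                {"AttributeName": "RunId", "AttributeType": "S"},
            ],
            KeySchema=[{"AttributeName": "RunId", "KeyType": "HASH"}],
            ProvisionedThroughput={
                "ReadCapacityUnits": 5,
                "WriteCapacityUnits": 5,
            },
        )
        # Table creation is asynchronous, so wait until the table is usable.
        self.dynamodb.get_waiter("table_exists").wait(TableName=self.table_name)

    def add(self, run_id):
        """Record a run as processed."""
        self.dynamodb.put_item(
            TableName=self.table_name,
            Item={"RunId": {"S": run_id}},
        )

    def contains(self, run_id):
        """Return True if the run has already been recorded."""
        response = self.dynamodb.get_item(
            TableName=self.table_name,
            Key={"RunId": {"S": run_id}},
            ConsistentRead=True,  # avoid missing a run that was just added
        )
        # get_item omits the 'Item' key when no matching record exists.
        return "Item" in response
```

Using the run ID as the table's hash key means the membership check is a single key lookup, however many runs have accumulated in the manifest.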

3. Documentation

As an SDK becomes more featureful it becomes harder to keep all the required documentation in the project’s README. In this release we have split out the README into several wiki pages, each dedicated to a particular feature.

Check out the Python Analytics SDK in the main Snowplow wiki.

4. Other changes

Version 0.2.0 also includes a few internal changes and minor enhancements, including:

  • A Vagrant environment (issue #5)
  • Support for multiple versions of Python (issue #16)
  • Strict PEP8 linting (issue #4) for the CI tests

5. Upgrading

As before, the Snowplow Python Analytics SDK is available on PyPI:

pip install -U snowplow_analytics_sdk==0.2.0

6. Getting help

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.

And if there’s another Snowplow Analytics SDK you’d like us to prioritize creating, please let us know on the forums!


Anton Parkhomenko

Anton is a data engineer at Snowplow. You can find him on GitHub, Twitter and on his personal blog.