We are pleased to announce the 0.2.0 release of the Snowplow Python Analytics SDK, a library providing tools to process and analyze events in the Snowplow enriched event format in Python-compatible data processing frameworks such as Apache Spark and AWS Lambda.
This release adds new run manifest functionality, along with many internal changes.
In the rest of this post we will cover:
1. Run manifests
This release provides tooling for maintaining a Snowplow run manifest. A Snowplow run manifest is a simple and robust way to track your job’s progress in processing the enriched events generated by multiple Snowplow pipeline runs.
Historically, Snowplow’s EmrEtlRunner and StorageLoader apps have moved whole folders of data around different locations in Amazon S3 in order to track progress through a pipeline run, and to avoid accidentally reprocessing that data. But file moves are quite problematic:
- They are time-consuming
- They are network-intensive
- They are error-prone – a failure to move a file will cause the job to fail and require manual intervention
- They only support one use-case at a time – you can’t have two distinct jobs moving the same files at the same time
Although Snowplow continues to use file moves (for now), it is better to use a run manifest for your own data processing jobs on Snowplow data. The idea of a manifest comes from the old naval term:
a list of the cargo carried by a ship, made for the use of various agents and officials at the ports of destination
In this case, we store our manifest in an AWS DynamoDB table, and we use it to keep track of which Snowplow runs our job has already processed.
2. Using the run manifest
The run manifest functionality resides in the new
Here’s a short usage example:
In the above example, we create two AWS service clients, one for S3 (to list job runs) and one for DynamoDB (to access our manifest). These clients are provided via the boto3 Python AWS SDK and can be initialized with static credentials or with system-provided credentials.
Then we list all Snowplow runs in a particular S3 path, and process (with the user-provided process function) only those Snowplow runs that have not been processed previously. Note that run_id is just a simple string containing the S3 key of a particular job run.
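The "process each run only once" pattern described above can be sketched without any AWS dependency. This is a minimal illustration, not the SDK's code: InMemoryRunManifest is a hypothetical stand-in that keeps run IDs in a Python set instead of DynamoDB, process_new_runs is a helper name invented here, and the run IDs are made-up S3 keys.

```python
class InMemoryRunManifest:
    """Hypothetical stand-in for a manifest; stores run IDs in a set, not DynamoDB."""

    def __init__(self):
        self._processed = set()

    def contains(self, run_id):
        return run_id in self._processed

    def add(self, run_id):
        self._processed.add(run_id)


def process_new_runs(run_ids, manifest, process):
    """Call process() only for runs missing from the manifest; return the IDs handled."""
    handled = []
    for run_id in run_ids:
        if manifest.contains(run_id):
            continue  # an earlier execution already processed this run
        process(run_id)
        manifest.add(run_id)  # recorded only after process() returns without raising
        handled.append(run_id)
    return handled


# Illustrative usage with made-up run IDs (S3 keys of job runs).
manifest = InMemoryRunManifest()
runs = ["enriched/archive/run=2017-01-01-00-00-00/",
        "enriched/archive/run=2017-01-02-00-00-00/"]
process_new_runs(runs, manifest, process=lambda run_id: None)
```

Adding the run to the manifest only after processing succeeds means a failed run is retried on the next execution rather than silently skipped.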
The RunManifests class, then, is a simple API wrapper to DynamoDB, which lets you:
- create a DynamoDB table for manifests
- add a Snowplow run to the table
- check if the table contains a given run ID
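To make the three operations above concrete, here is a rough sketch of how such a wrapper could be built on a boto3 DynamoDB client. It is not the SDK's actual implementation: the class name SketchRunManifest, the RunId attribute name and the throughput values are assumptions made for illustration. Only the boto3 client calls create_table, put_item and get_item are real.

```python
class SketchRunManifest:
    """Illustrative DynamoDB-backed manifest; NOT the SDK's RunManifests class."""

    KEY = "RunId"  # assumed attribute name for the table's hash key

    def __init__(self, dynamodb_client, table_name):
        # dynamodb_client is expected to behave like boto3.client('dynamodb')
        self.client = dynamodb_client
        self.table_name = table_name

    def create(self):
        """Create the manifest table with a single string hash key."""
        self.client.create_table(
            TableName=self.table_name,
            AttributeDefinitions=[{"AttributeName": self.KEY, "AttributeType": "S"}],
            KeySchema=[{"AttributeName": self.KEY, "KeyType": "HASH"}],
            # Throughput values are arbitrary placeholders
            ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        )

    def add(self, run_id):
        """Record a run ID as processed."""
        self.client.put_item(
            TableName=self.table_name,
            Item={self.KEY: {"S": run_id}},
        )

    def contains(self, run_id):
        """Return True if the run ID is already in the table."""
        response = self.client.get_item(
            TableName=self.table_name,
            Key={self.KEY: {"S": run_id}},
        )
        # get_item omits the 'Item' key when no matching record exists
        return "Item" in response
```

Passing the client in, rather than constructing it inside the class, leaves the choice of credentials to the caller and makes it possible to substitute a stub client in tests. Against real DynamoDB, table creation is asynchronous, so a caller would likely need to wait for the table to become active before adding runs.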
3. Documentation
As an SDK becomes more featureful, it becomes harder to keep all the required documentation in the project’s README. In this release we have split the README out into several wiki pages, each dedicated to a particular feature.
Check out the Python Analytics SDK in the main Snowplow wiki.
4. Other changes
Version 0.2.0 also includes a few internal changes and minor enhancements, including:
- Adding a Vagrant environment (issue #5)
- Support for multiple versions of Python (issue #16)
- Strict PEP8 linting (issue #4) for the CI tests
As before, the Snowplow Python Analytics SDK is available on PyPI:
6. Getting help
And if there’s another Snowplow Analytics SDK you’d like us to prioritize creating, please let us know on the forums!