Piinguin, Snowplow PII usage management service, released

Snowplow Team

This is a detailed technical walkthrough of Piinguin; to learn more about what Piinguin does and why we built it, see the Piinguin introduction post.

We are pleased to announce the first release of Piinguin and the associated Snowplow Piinguin Relay. This initial release introduces basic capabilities for managing the usage of personally identifiable information (PII) from Snowplow.

Read on for more information on Piinguin and the Snowplow Piinguin Relay:

  1. Overview
  2. Piinguin
  3. Snowplow Piinguin Relay
  4. Deploying
  5. Getting help

1. Overview

Following the release of Snowplow R106 Acropolis, which added the capability to emit a stream of PII transformation events, we have continued to develop tools to support the responsible management of personally identifiable information.

If you want to learn more about PII and how it is managed by the Snowplow PII enrichment, you can read more in the release posts for Snowplow R100 Epidaurus and R106 Acropolis.

Piinguin aims to round out our approach to PII management, by providing a service which stores PII and helps control access by requiring that anyone who reads PII data provides a justification based on the lawful basis for processing PII specified under GDPR.
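To make the idea concrete, here is a minimal sketch of a read path that refuses to return a value unless the caller states a lawful basis. The names used (LawfulBasis, PiiStore, read) are illustrative assumptions and not Piinguin's actual API; the six bases are those listed in GDPR Article 6(1).

```scala
// Illustrative sketch only: these names are hypothetical, not Piinguin's API.
object LawfulBasisSketch {
  // The six lawful bases for processing listed in GDPR Article 6(1).
  sealed trait LawfulBasis
  case object Consent extends LawfulBasis
  case object Contract extends LawfulBasis
  case object LegalObligation extends LawfulBasis
  case object VitalInterest extends LawfulBasis
  case object PublicTask extends LawfulBasis
  case object LegitimateInterest extends LawfulBasis

  // A toy in-memory store mapping a pseudonymized value to its original.
  final class PiiStore(records: Map[String, String]) {
    // A read succeeds only when the caller states a lawful basis.
    def read(modifiedValue: String, basis: Option[LawfulBasis]): Either[String, String] =
      basis match {
        case None    => Left("a lawful basis for processing is required")
        case Some(_) => records.get(modifiedValue).toRight(s"no record for $modifiedValue")
      }
  }
}
```

A real service would presumably also record who asked and on which basis; that is left out here to keep the sketch short.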

Piinguin consists of several elements that sit alongside Snowplow to store and serve PII data. Here is an overview of the architecture:


The first component that receives data out of Snowplow’s stream of PII transformation events is the Snowplow Piinguin Relay, an AWS Lambda function which uses the piinguin-client artifact to send data to Piinguin. You can read more details about this relay below, and detailed instructions on how to install and run it in the deploying section.

The second component is the piinguin-server itself, which has to be in the same secure VPC as the Lambda function. In addition, it needs access to an AWS DynamoDB table to store the data. You can read more details about Piinguin below, along with detailed instructions on how to install and run it under deploying.

The final component is the aforementioned piinguin-client, potentially running embedded in your own code to manage your interactions with the PII stored in Piinguin. This client library is discussed in more detail in the upcoming Piinguin section.

2. Piinguin

The Piinguin project consists of three parts: the protocol, the piinguin-server, and the piinguin-client.

Piinguin is based on gRPC, an RPC framework built on Protocol Buffers. The protocol in the Piinguin project specifies the interface between the client and the server: a .proto file describes the interactions for reading, writing and deleting PII records. That file is used with the excellent ScalaPB plug-in to generate Scala code stubs for both the server and the client. These can then be used to implement any behavior based on that interface.

The piinguin-server implements the behavior of the server according to the interface, which in this case means writing to and reading from DynamoDB using another excellent library, scanamo. In the highly unlikely event (as unlikely as a hash collision) that a hash coincides for two values, the last seen original value will be kept. (There are thoughts of keeping all values in that case, although their utility is dubious – feel free to discuss in the relevant issue on GitHub.)
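The "last seen value wins" behavior can be illustrated with a small sketch. It uses an in-memory Map as a stand-in for the DynamoDB table, and the PiiRecord case class and put function are hypothetical names chosen for this example rather than the server's actual code.

```scala
// Illustrative sketch only: an in-memory stand-in for the DynamoDB table.
object LastWriteWinsSketch {
  // Records are keyed by the modified (hashed) value.
  final case class PiiRecord(modifiedValue: String, originalValue: String)

  // Writing a record whose key already exists replaces the earlier original value.
  def put(table: Map[String, PiiRecord], record: PiiRecord): Map[String, PiiRecord] =
    table.updated(record.modifiedValue, record)

  // Two different originals that (hypothetically) hash to the same key "h1".
  val afterFirst: Map[String, PiiRecord]  = put(Map.empty, PiiRecord("h1", "first"))
  val afterSecond: Map[String, PiiRecord] = put(afterFirst, PiiRecord("h1", "second"))
}
```

After both writes the table still holds a single entry for the key, and it carries the second original value.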

Finally, the piinguin-client artifact provides a client API for use from Scala. There are three ways to use the client API: with plain Scala Futures, FS2 IO, and FS2 Streaming. Please note that the FS2 Streaming implementation remains highly experimental and its use is currently discouraged as it is likely to change significantly; any and all comments and PRs are of course welcome.
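As a rough illustration of the Futures style, the sketch below consumes a Future of Either without blocking. FailureMessage and SuccessMessage here are local stand-ins that only mirror the shape of the result type shown in the console session later in this post; the describe helper is a hypothetical name and not part of the client.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

// Illustrative sketch only: local stand-ins, not the piinguin-client classes.
object FutureClientSketch {
  final case class FailureMessage(message: String)
  final case class SuccessMessage(message: String)

  // Turn the outcome of a client call into a log-friendly string.
  def describe(result: Future[Either[FailureMessage, SuccessMessage]])(
      implicit ec: ExecutionContext): Future[String] =
    result
      .map {
        case Right(SuccessMessage(m)) => s"created: $m"  // the server accepted the request
        case Left(FailureMessage(m))  => s"rejected: $m" // the server answered with a failure
      }
      .recover {
        // The call itself failed, for example because of a timeout.
        case NonFatal(e) => s"transport error: ${e.getMessage}"
      }
}
```

Keeping the application-level failure (Left) separate from a failed Future makes it possible to tell a server-side rejection apart from a connectivity problem.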

3. Snowplow Piinguin Relay

The Snowplow Piinguin Relay uses the aforementioned piinguin-client in an AWS Lambda function to forward all PII transformation events to the piinguin-server.

The relay uses the Snowplow Analytics SDK to read the enriched PII transformation events contained in the Kinesis stream, extracts the relevant fields (currently, the modified and original values only), and performs a createRecord operation against piinguin-server.
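The forwarding step can be sketched as follows. This is not the relay's code: PiiPair, RecordWriter and forward are hypothetical names, the Analytics SDK parsing is omitted entirely, and the writer is made synchronous for brevity.

```scala
// Illustrative sketch only: these types stand in for the SDK output and the client.
object RelaySketch {
  // The two fields currently extracted from each PII transformation event.
  final case class PiiPair(modifiedValue: String, originalValue: String)

  // Hypothetical, synchronous stand-in for the client's create operation.
  trait RecordWriter {
    def createRecord(modifiedValue: String, originalValue: String): Either[String, Unit]
  }

  // Forward every extracted pair and collect the failure messages for logging.
  def forward(pairs: Seq[PiiPair], writer: RecordWriter): Seq[String] =
    pairs
      .map(p => writer.createRecord(p.modifiedValue, p.originalValue))
      .collect { case Left(error) => error }
}
```

Returning the failures instead of throwing on the first one lets a batch of Kinesis records be processed in full before deciding how to report errors.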

4. Deploying

Both the Piinguin Server and the Piinguin Relay currently support AWS only, and they should be deployed to the same VPC.

4.1 Configuring the Snowplow Piinguin Relay

You can obtain the relay artifact from the S3 public assets bucket appropriate for your region.

In order for you to create an AWS Lambda function, please follow the detailed developer guide. When creating the Lambda, make sure to:

The PIINGUIN_TIMEOUT_SEC value should be lower than the AWS Lambda timeout in order to get a meaningful error message if the client times out while communicating with the server. Here is an example of that configuration:

PIINGUIN_HOST = ec2-1-2-3-4.eu-west-1.compute.amazonaws.com
PIINGUIN_PORT = 8080
PIINGUIN_TIMEOUT_SEC = 10
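The relationship between the two timeouts can be checked before deploying. The sketch below is a hypothetical helper, not part of the relay: RelayConfig, parse and check are names invented for this example, and the Lambda timeout is passed in by hand.

```scala
// Illustrative sketch only: a hypothetical pre-deployment check, not relay code.
object RelayConfigSketch {
  final case class RelayConfig(host: String, port: Int, timeoutSec: Int)

  // Read an integer setting, reporting a missing or malformed value.
  private def intSetting(env: Map[String, String], key: String): Either[String, Int] =
    env.get(key).toRight(s"$key is not set").flatMap { raw =>
      scala.util.Try(raw.trim.toInt).toOption.toRight(s"$key is not an integer: $raw")
    }

  // Parse the three settings from an environment-style map.
  def parse(env: Map[String, String]): Either[String, RelayConfig] =
    for {
      host    <- env.get("PIINGUIN_HOST").toRight("PIINGUIN_HOST is not set")
      port    <- intSetting(env, "PIINGUIN_PORT")
      timeout <- intSetting(env, "PIINGUIN_TIMEOUT_SEC")
    } yield RelayConfig(host, port, timeout)

  // The client timeout must be lower than the Lambda timeout.
  def check(config: RelayConfig, lambdaTimeoutSec: Int): Either[String, RelayConfig] =
    if (config.timeoutSec < lambdaTimeoutSec) Right(config)
    else Left(s"PIINGUIN_TIMEOUT_SEC (${config.timeoutSec}) must be lower than the Lambda timeout ($lambdaTimeoutSec)")
}
```

With the example values above and a Lambda timeout of 30 seconds the check passes; with a Lambda timeout of 10 seconds it is rejected, because the client would never get the chance to report its own timeout.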

4.2 Setting up relay permissions to the VPC

As stated before, both the relay and the Piinguin Server need to reside in the same VPC. In addition, the Lambda function needs to have sufficient access from IAM to run. You should create a service role and attach policies that will permit it to run following this guide. Like many Lambda functions, this one also needs permission to send its output to CloudWatch Logs; this IAM policy should cover that:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "logs:CreateLogGroup",
      "Resource": "arn:aws:logs:<region>:<account-id>:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:<region>:<account-id>:log-group:/aws/lambda/piinguin-relay:*"
      ]
    }
  ]
}

As the Lambda will be reading its PII transformation events from Kinesis, it will also need to have permissions to do that, with a policy document such as:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "kinesis:*",
      "Resource": [
        "arn:aws:kinesis:<region>:<account-id>:stream/<pii-events-stream-name>"
      ]
    }
  ]
}

4.3 Deploying the Piinguin Server

The simplest way to deploy Piinguin Server is to obtain the Docker image by running the following on your Docker host:

$ docker run snowplow-docker-registry.bintray.io/snowplow/piinguin-server:0.1.1

This will run the server on the default port 8080 and will use the default DynamoDB table piinguin. Both are configurable to other values using PIINGUIN_PORT and PIINGUIN_DYNAMODB_TABLE, if needed. See the relevant readme for more information.

4.4 Setting up server permissions to the VPC

As stated before, both the Relay and the Server need to reside in the same VPC. In addition, the Docker host needs to have sufficient access from IAM to run. You should create a service role and attach policies that will permit it to run following this guide.

As the server writes its data to DynamoDB it will need to have access to it with a policy document such as:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:DeleteItem",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:Scan",
        "dynamodb:UpdateItem"
      ],
      "Resource": "arn:aws:dynamodb:<region>:<account-id>:table/<table-name>"
    }
  ]
}

4.5 Setting up the DynamoDB table

You will need to create the appropriate DynamoDB table in order to use Piinguin.

To create a DynamoDB table, log in as normal to your AWS console, type DynamoDB into the services field, and select DynamoDB from the list:

[Image: List of services]

From the DynamoDB page, click Create table:

[Image: Create table]

Finally, specify the desired table name, set the primary key to modifiedValue and its type to String, then click Create.

[Image: Create table details]

If you are comfortable with the CLI, you can also create the DynamoDB table using the following command:

$ aws dynamodb create-table \
    --table-name piinguin-prod \
    --attribute-definitions AttributeName=modifiedValue,AttributeType=S \
    --key-schema AttributeName=modifiedValue,KeyType=HASH \
    --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1

With the DynamoDB table created, setup is now complete and you can use Piinguin.

4.6 Testing that Piinguin is functioning

One way to verify that your setup works is to check out the Piinguin project from GitHub and try to write and then read back a record:

$ sbt "client/console"
scala> import scala.concurrent.{ExecutionContext, Await}
scala> import scala.concurrent.duration._
scala> import com.snowplowanalytics.piinguin.client.PiinguinClient
scala> implicit val ec = ExecutionContext.global
scala> val c = new PiinguinClient("localhost", 8080)
scala> val createResult = Await.result(c.createPiiRecord("123", "456"), 10 seconds)
createResult: Either[com.snowplowanalytics.piinguin.client.FailureMessage,com.snowplowanalytics.piinguin.client.SuccessMessage] = Right(SuccessMessage(OK))
scala> import com.snowplowanalytics.piinguin.server.generated.protocols.piinguin.ReadPiiRecordRequest.LawfulBasisForProcessing
scala> val readResult = Await.result(c.readPiiRecord("123", LawfulBasisForProcessing.CONSENT), 10 seconds)
readResult: Either[com.snowplowanalytics.piinguin.client.FailureMessage,com.snowplowanalytics.piinguin.client.PiinguinClient.PiiRecord] = Right(PiiRecord(123,456))

You can also verify that the record is in DynamoDB by clicking on items in the console:

[Image: DynamoDB Items]

Your item should appear in the list.

5. Getting help

For more details on working with Piinguin and the Snowplow Piinguin Relay, please check out the documentation here:

If you have any questions or run into any problems, please visit our Discourse forum.
