Snowplow Snowflake DB Loader released

We are tremendously excited to announce the first public release of the Snowplow Snowflake Loader.

Snowflake is a cloud-native data warehouse that has been rapidly growing in popularity. It competes with Amazon’s own Redshift and Google’s BigQuery, as well as other analytical database technologies like Teradata, Vertica and Exasol.

Collectively, these database technologies have made working with the high volumes of rich, granular data that Snowplow generates significantly easier and more manageable: they make it cost effective to query that data in flexible ways at scale, and to plug a wide variety of business intelligence, data visualization, data analytics and data science toolsets directly on top of very large data sets.

We are delighted to be able to give Snowplow users the option to easily load their Snowplow data into Snowflake DB and use this as their primary data warehousing technology. Read on to learn about the many benefits of this exciting new data warehouse technology.

  1. More Snowplow Loaders
  2. Introducing Snowflake DB
  3. Introducing the Snowflake Loader
  4. Installing the Snowflake Loader
  5. Roadmap
  6. Getting Help

1. More Snowplow Loaders

Where your data lives has a significant impact on what you can do with your data. With Snowplow, we want to make it easy for companies to deliver their data into the different places that let them drive real insight from that data. To date we’ve supported loading the data into:

Over the last few years Amazon Redshift has become the de facto solution across most of the Snowplow userbase. The simplicity of setting up Redshift, along with its ability to scale to handle large data volumes and its widely supported PostgreSQL interface, have made it the number one data warehousing location for Snowplow data.

While Redshift is a time-tested platform with many advantages, we are committed to giving our users more opportunities to choose the best solution for their own, unique requirements. That means building out support for many more SQL data warehouses, as well as NoSQL datastores.

As an important step in this direction, we recently refactored our RDB Loader – the application responsible for preparing and ingesting enriched data into Amazon Redshift. This move laid the foundation for us to release Snowplow Loaders for many more data stores.

Today we’re proud to announce the latest one: Snowflake Loader.

2. Introducing Snowflake DB

Snowflake is a relatively new data warehousing technology that is quickly gaining popularity. Snowflake has four features that distinguish it from Redshift in particular:

2.1 Scale compute and storage independently

With Snowflake you can scale the computational capability of your data warehouse independently of the storage component. In practice, that means you can:

2.2 Faster scaling

Snowflake DB scales fast. In contrast, resizing a Redshift cluster takes hours, sometimes days – periods during which you can no longer load data into the data warehouse. Whilst this is fine for many users, for high volume users it can create a backlog of data that needs to be loaded.

2.3 Excellent support for nested data types including JSON and Avro

Querying JSON data in Snowflake is fast and effective. That means that when we load Snowplow data into Snowflake, we’re able to load it all into a single table, with a single column per event and context type, and the individual event-specific and context-specific fields available as nested types in those columns. This is an attractive data structure because all your data is in a single, easy-to-reason-about table.

Redshift, by contrast, has much more limited JSON parsing support: JSON parsing is very brittle (a single malformed JSON will break a whole query) and querying JSONs in Redshift is not performant. As a result, when we load Snowplow data into Redshift we do not use JSON: we shred out each event and context type into a dedicated table, so that they can be queried in a performant way.

This works well, but it often means that expensive join and union operations are required across tables in Redshift that are simply not needed in Snowflake, where everything lives in a single table.

2.4 Easier management

Snowflake is a managed service: you don’t need to set up indexes or calculate capacity – in general you just choose a size of “virtual warehouse”, the entity responsible for loading and querying your data, and you pay only for the time your warehouse is in the resumed state.

With Redshift, by contrast, you have to actively manage the structure of your data tables and ensure the health of your data warehouse by regularly running vacuum and analyze jobs on your data.

3. Introducing the Snowflake Loader

The Snowplow Snowflake Loader, very much like the RDB Loader, consists of two parts, both found in the same GitHub repo:

  1. the Snowflake Transformer – a Spark job that prepares the enriched data, and
  2. the Snowflake Loader itself – a CLI application that loads the prepared data into Snowflake DB

However, this is where the similarities end.

The Snowflake Loader takes a very different approach to maintaining the state of loaded data and structuring data inside a database.

Instead of the slow file-moving approach taken by RDB Loader, the Snowflake Loader uses a run manifest on top of AWS DynamoDB which closely resembles the one we introduced in Snowplow Scala Analytics SDK 0.2.0.

Each time the Transformer job is launched, it checks the location in S3 where the enriched data lives to identify entries that haven’t been transformed and loaded into Snowflake DB. If a new folder of data is found, it:

  1. adds it to the manifest, and
  2. processes the folder, changing its state in the manifest to “processed”, then
  3. logs all the new event and context types found inside it

This last point is important: if the Transformer discovers a new event or context type, this information is used to update the atomic.events table definition in Snowflake to accommodate the new type, so that any new types can be queried immediately.

With contexts and unstructured events stored as the VARIANT data type, you can query any nested fields directly, as if they were native columns.
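For illustration only – the column names below correspond to a hypothetical com.acme button_click event and user context, not to anything that ships with Snowplow – the nested columns on a single event row might carry values along these lines:

{
  "unstruct_event_com_acme_button_click_1": {
    "buttonId": "checkout",
    "buttonText": "Buy now"
  },
  "contexts_com_acme_user_context_1": [
    {
      "userTier": "premium",
      "signupDate": "2017-11-01"
    }
  ]
}

Because these columns are VARIANTs, Snowflake lets you address individual fields with its path syntax (for example unstruct_event_com_acme_button_click_1:buttonId) straight from atomic.events, with no shredding step in between.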

You can find out more on how data is structured inside Snowflake in the dedicated Discourse post.

4. Installing the Snowflake Loader

As stated above, Snowflake DB is very easy to operate. The Snowflake Loader doesn’t add any complexity either – it is very easy to set up.

At Snowplow, we have been running multiple pipelines loading both Redshift and Snowflake. If you wish to load both databases alongside each other, we recommend running two separate DAGs: the standard Snowplow one (via EmrEtlRunner) to load Redshift, and an additional DAG that picks up the enriched data in the enriched archive, processes it and loads it into Snowflake. The two DAGs can be run on different (even overlapping) schedules without problems: whenever the “main” pipeline leaves an enriched folder in the archive, the next Snowflake run will load it. Otherwise, it will spot that no new data is available and won’t do anything.

To get started loading your Snowplow data into Snowflake, you first need to create the following config files:

  1. a Dataflow Runner cluster configuration
  2. a Snowflake Loader configuration (a self-describing JSON)
  3. an Iglu Resolver configuration
  4. a Dataflow Runner playbook

Here’s an example Dataflow Runner cluster configuration:

{ "schema":"iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0", "data":{ "name": "Snowflake Pipeline", "logUri": "s3://<<UPDATE ME>>", "region": "us-east-1", "credentials":{ "accessKeyId": "env", "secretAccessKey": "env" }, "roles":{ "jobflow":"EMR_EC2_DefaultRole", "service":"EMR_DefaultRole" }, "ec2":{ "amiVersion": "5.9.0", "keyName": "<<UPDATE ME>>", "location":{ "vpc":{ "subnetId": "<<UPDATE ME>>" } }, "instances":{ "master":{ "type":"m2.xlarge" }, "core":{ "type":"m2.xlarge", "count":1 }, "task":{ "type":"m1.medium", "count":0, "bid":"0.015" } } }, "tags":[ ], "bootstrapActionConfigs":[ ], "configurations":[ { "classification":"core-site", "properties":{
 "Io.file.buffer.size":"65536" } }, { "classification":"mapred-site", "properties":{ "Mapreduce.user.classpath.first":"true" } }, { "classification":"yarn-site", "properties":{ "yarn.resourcemanager.am.max-attempts":"1" } }, { "classification":"spark", "properties":{ "maximizeResourceAllocation":"true" } } ], "applications":[ "Hadoop", "Spark" ] } }

You will need to update the fields marked <<UPDATE ME>>: the logUri (an S3 location for EMR logs), the ec2.keyName (your EC2 key pair) and the ec2.location.vpc.subnetId (the VPC subnet to launch the cluster in).

You also might want to change the instances configuration – though note that in general Snowflake Transformer requires significantly smaller clusters compared to RDB Shredder due to the reduced impact on the cluster filesystem.
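For example – the instance types and counts here are purely illustrative, not a sizing recommendation – a pared-down instances block along these lines is often enough for the Transformer:

      "instances": {
        "master": { "type": "m4.large" },
        "core": { "type": "m4.large", "count": 1 },
        "task": { "type": "m1.medium", "count": 0, "bid": "0.015" }
      }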

Next, here’s an example self-describing JSON for configuring the Snowflake Loader:

{ "schema": "iglu:com.snowplowanalytics.snowplow.storage/snowflake_config/jsonschema/1-0-0", "data": { "name": "Acme Snowflake Storage Target", "auth": { "accessKeyId": "<<UPDATE ME>>", "secretAccessKey": "<<UPDATE ME>>" } "awsRegion": "us-east-1" "manifest": "acme-snowflake-run-manifest", "snowflakeRegion": "us-east-1", "database": "snowflake-database", "input": "s3://com-acme-snowplow/archive/enriched/", "stage": "arbitraryStageName", "stageUrl": "s3://com-acme-snowplow/archive/snowflake/", "warehouse": "snowplow_wh", "schema": "atomic", "account": "acme", "username": "snowflake-loader", "password": "secret", "purpose": "ENRICHED_EVENTS" } }

You’ll probably want to change all fields here, apart from purpose and schema.

Note that storing your password and credentials in plain text in the configuration is not secure. We therefore strongly advise that you follow the AWS Role approach for storing credentials and use the EC2 Parameter Store for storing the password. A full guide on how to use these approaches, as well as a full description of all the important fields, is available in the Snowflake Loader documentation.
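As a rough sketch only – the exact field names below are an assumption on our part, so check them against the Snowflake Loader documentation before use – the more secure variant replaces the plain-text credentials and password with an IAM role and an EC2 Parameter Store reference, roughly like so:

    "auth": {
      "roleArn": "arn:aws:iam::<<UPDATE ME>>:role/SnowflakeLoadRole",
      "sessionDuration": 900
    },
    "password": {
      "ec2ParameterStore": {
        "parameterName": "snowplow.snowflake.password"
      }
    }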

Another configuration file you’ll need is a standard Iglu Resolver config. So far it is used only to validate the Loader configuration itself, so feel free to use one containing only Iglu Central.
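For reference, a minimal resolver.json that points only at Iglu Central, using the standard Iglu resolver-config format, looks like this:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      }
    ]
  }
}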

And the last file you’ll need is a Dataflow Runner playbook responsible for running Transformer and Loader. Here’s an example:

{ "schema":"iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1", "data":{ "region":"us-east-1", "credentials":{ "accessKeyId":"env", "secretAccessKey":"env" }, "steps":[ { "type":"CUSTOM_JAR", "name":"Snowflake Transformer", "actionOnFailure":"CANCEL_AND_WAIT", "jar":"command-runner.jar", "arguments":[ "spark-submit", "--deploy-mode", "cluster", "--class", "com.snowplowanalytics.snowflake.transformer.Main", "s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-transformer-0.3.0.jar", "--config", "{{base64File "./config.json"}}", "--resolver", "{{base64File "./resolver.json"}}" ] }, { "type":"CUSTOM_JAR", "name":"Snowflake Loader", "actionOnFailure":"CANCEL_AND_WAIT", "jar":"s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.3.0.jar", "arguments":[ "load", "--base64", "--config", "{{base64File "./config.json"}}", "--resolver", "{{base64File "./resolver.json"}}" ] } ], "tags":[ ] } }

You can leave the job steps as they are, or optionally you could run the Loader step from a local machine – it’s up to you.

5. Roadmap

Right now the Snowplow Snowflake Loader is at version 0.3.0, which means it has been battle-tested by our team internally at Snowplow for some time. We’re enormously excited to make it public! There is still a lot we want to add to it, however:

6. Getting Help

For more details on this release, as always do check out the release notes on GitHub.

If you have any questions or run into any problems, please visit our Discourse forum.
