Building data models with Snowplow and dbt

This is a four-part series

In our previous chapter, we looked at how to design data models to fit your business. This time, we’ll explore how to get started with building, testing and deploying data models with Snowplow and dbt.

Snowplow and dbt: A powerful combination to capture and model your data

Before we dive in, let’s explore why Snowplow and dbt make for a great combination when it comes to collecting and modeling behavioral data at scale. It comes down to two key reasons:

  1. Snowplow enables you to capture the best possible behavioral data to feed into your data models, your data warehouse and your applications further downstream;
  2. dbt is the most versatile, flexible tooling for data transformation, with several useful features right out of the box. 

Without going into the intricacies of why Snowplow data is so powerful (if you’re interested, you can read more about that here), behavioral data captured from Snowplow is rich, well-structured and perfectly suited to your data modeling use cases. You can also explore Snowplow data for yourself to see what it looks like. Other reasons you should consider Snowplow data for your data transformation might include:

Working with Snowplow data within dbt is a perfect setup for building data models that match your use cases and business objectives. Let’s look at a few examples of what makes dbt ideal for analytics engineers:

These are just a few highlights of what makes Snowplow and dbt a powerful combination when it comes to delivering rich insights from your data models. Let’s dive into how to get started. 

What do I need to get started?  

Aside from dbt and Snowplow, there are a few other things to consider before you start transforming.

With these in place, including an analytics engineer or analyst, you’re in a great position to start building data models for your team. 

Best practices for building your first data model 

Data modeling, like other aspects of engineering, should be carefully governed from a centralized set of best practices, documented and shared throughout the data team. This will prevent a data modeling free-for-all, with code inconsistencies, duplication of work and overall confusion. It’s worth considering that the planning and design phase of your data modeling project is as important as, if not more important than, the process of writing SQL itself.

With that understood, there are some best practices you can take away when it comes to building your first data model. 

  1. Build and adhere to a SQL style guide 

When building data models, especially as the team grows, you’ll need a style guide to ensure the team writes SQL effectively and consistently. While some variation is expected, the way your SQL is written is crucial when it comes to iterating on your data model. If your code is impossible to read, or even elegantly written but unclear to other analysts, fixing bugs in the SQL or evolving the data model for new requirements becomes a huge challenge.

In your style guide, you should set out how your team names, formats and structures its SQL, as sketched in the example below.
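For example (conventions will differ from team to team, and the model and column names below are purely illustrative), many style guides standardize on lowercase keywords, snake_case names and a CTE structure that separates imported models from transformation logic. A minimal sketch of a dbt model written to that kind of convention:

```sql
-- models/page_views_per_session.sql
-- Illustrative only: assumes an upstream model named snowplow_web_page_views exists in your project.

with page_views as (

    -- import CTE: reference upstream models with ref(), never hard-coded table names
    select * from {{ ref('snowplow_web_page_views') }}

),

sessions as (

    -- logic CTE: one clearly named transformation step
    select
        domain_sessionid,
        count(*) as page_view_count,
        min(derived_tstamp) as session_start_tstamp

    from page_views
    group by 1

)

select * from sessions
```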

  2. Make important decisions early in the data modeling lifecycle

Building robust data models is about preparing for the future. Planning and designing data models effectively means you’ll significantly reduce the ‘clean up’ required further downstream. It’s also a critical phase to get right, because the success of your data models depends on communication with internal customers such as the product and marketing teams.

You don’t want to be in a position where you’re making key decisions about the purpose and design of your data models when you’re about to start writing SQL. By then, it’s already too late. Spend sufficient time architecting your data models, and the implementation will be faster and easier. 

  3. Build a data dictionary

Within dbt, it’s possible to describe your properties, such as unique identifiers, timestamps and so on, to make it clear what variables you’re referring to in your data model, and to easily access definitions later via the built-in documentation functionality.

You’ll find an example of how we’ve broken down each variable in our Snowplow dbt package. It takes additional effort to create these descriptions, but the payoff in the long run is well worth it. When each property is explicitly described, it’s easy for others in the data team and the wider business to understand the data they’re working with. This is especially useful as the business grows, when people leave or for new analysts getting up to speed with your data modeling process.
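As a rough sketch of what this looks like in practice (the model and column names here are illustrative rather than taken from the package), descriptions live in YAML files alongside your models, and dbt’s documentation functionality turns them into a browsable data dictionary:

```yaml
# models/schema.yml
# Illustrative example: adjust the model and column names to your own project.

version: 2

models:
  - name: page_views_per_session
    description: One row per session, aggregated from Snowplow page view events.
    columns:
      - name: domain_sessionid
        description: Unique identifier for the session, set by the Snowplow tracker.
      - name: page_view_count
        description: Number of page views recorded within the session.
      - name: session_start_tstamp
        description: Timestamp of the earliest page view in the session.
```

Running dbt docs generate (and dbt docs serve) then publishes these descriptions as part of your project documentation, where anyone in the team can browse them.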

  4. Decide on a cost-effective approach that fits your use case

In our last chapter, we mentioned how the design and planning phase of data modeling is critical to the data models running efficiently and cost effectively. For instance, the way you structure your tables will have a big impact on the costs of running queries in your warehouse. Large, flat tables can use large amounts of storage and lead to huge warehousing costs, for example. 

Behavioral data volumes tend to be large by nature, so understanding your warehouse and database solution is key to keeping running costs efficient. Following best practices, such as always placing limits on partition columns, selecting only the required columns and avoiding unnecessary use of expensive SQL functions like window functions, can significantly reduce costs. It’s worth building these considerations into your design process to avoid unexpectedly high warehousing costs.
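As a small sketch of what this can look like in a dbt model, the snippet below assumes a large upstream events model (called snowplow_events here, a hypothetical name) with a partition or clustering key on derived_tstamp; adapt the model, column and partition names to your own warehouse:

```sql
-- Illustrative incremental model: only scan recent data and only the columns you actually need.
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    domain_sessionid,
    derived_tstamp,
    page_urlpath

from {{ ref('snowplow_events') }}  -- hypothetical upstream events model

{% if is_incremental() %}
-- A limit on the partition column lets the warehouse prune partitions instead of scanning the full table
where derived_tstamp > (select max(derived_tstamp) from {{ this }})
{% endif %}
```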

  5. Build a process for robust testing

Testing is important for building confidence in your data models before you deploy to production. There are various tests you can run within dbt; let’s explore a few of them:
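dbt’s built-in generic tests include unique, not_null, accepted_values and relationships, and they are declared in the same YAML files as your documentation. A minimal sketch, using illustrative model and column names:

```yaml
# models/schema.yml
# Illustrative example: generic tests declared alongside your column descriptions.

version: 2

models:
  - name: page_views_per_session
    columns:
      - name: domain_sessionid
        tests:
          - unique       # every session should appear exactly once
          - not_null     # no session should be missing its identifier
      - name: page_view_count
        tests:
          - not_null
```

Running dbt test executes each of these assertions against your warehouse and reports any failing rows, so problems surface before the model is deployed.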

Before you deploy 

You’ve tested your data model and you’re comfortable with the results. There are a few steps to check off before you submit that pull request.

In our next chapter, we’ll look at how data modeling has evolved as a practice, where it stands today, and what the future holds for how data modeling technology and processes will develop. 

