How to build an open source data stack


There’s a problem with your analytics. They aren’t your analytics.

When you choose a proprietary packaged option for your analytics stack, you might, without realizing it, be choosing convenience over clarity. Perhaps, when you start out, convenience is what’s needed. You don’t have the resources to manage anything but a plug-and-play solution. You aren’t thinking about the optimal data stack; you just need a stack.

But as you scale, using a black-box solution to power your analytics means you’ll lose more and more insight into your customers and product, as well as losing more and more control over your core business asset—your data. If you don’t run the infrastructure, you don’t control the data. If you want a precise understanding of how your users are using your product—not a sample of usage or a summary of data defined by a vendor’s rules—you have to own the infrastructure.

An open source analytics stack allows you to do just that. It can give you more flexibility and more control. You have to put in more work initially to set it up, but the payoff is worth it in the long term as you learn more and iterate faster.

What are open source analytics tools?

Imagine if you could take your favorite analytics tool and look at all the code underneath to understand exactly how it works. Not just that, but you could then change any code you wanted so that it works better for your use cases, and host it yourself without having to pay. That is what open source analytics allows you to do.

Open source analytics tools have underlying code that’s open and available to everyone to not just view but also copy, modify, redistribute, and use. There is no proprietary code and no “trade secret” way that things work. All functionality is out in the open. Though you can pay for hosted solutions, you can take the code and run open source analytics tools on your own infrastructure. They are free to explore and free to use.

Here’s an example. Redash is an open source data-visualization and query-editor tool. It is an open source alternative to Looker. You can sign up for a hosted version of Redash, or you can follow this guide to set it up on your own servers.

An example dashboard in Redash

Because Redash is open source, unlike Looker, you can get the entire codebase for Redash on GitHub. You can examine the code and see exactly how Redash works. For instance, you can immediately see that Redash is a Python back end with a front-end JavaScript client. If you want to dive deeper, you can check out how they make their Sankey diagrams. Turns out they use D3.

A pie chart in Redash made from a simple query

It’s fun to go through and read code to see how a tool works. But the real strength of open source shows when a tool doesn’t work. Open source leads to robustness and flexibility in a codebase. If your favorite SaaS analytics tool shows an error, all you can do is open a support ticket. If it doesn’t have the integration you need, all you can do is nudge the company on their forum.

With open source, you can go in, fix the issue, and build the integration yourself, or talk with other like-minded people to find a solution, since open source tech often hosts a community of users in the same boat as you.


Snowplow is another example of this. There have been over 6,000 commits to Snowplow repos to date. Most of these are from the core team, but a significant number are from Snowplow users scratching their own itch. Users add functionality to the codebase to satisfy a need of their own that many other users share. For example, a data engineer at the Toronto Globe and Mail added the optional endpoint to DynamoDB to make it work with LocalStack.

This is a small change (31 lines added/6 lines deleted), but one that aids the entire community.

Community is an important watchword in open source. The strength of an open source tool lies in the people who use it. When assessing open source analytics tools, you need to look at not just whether the tool serves your needs but also whether there is a thriving community that will help keep the tool working and growing.

If a project has no open issues, no PRs, or only commits from the core team, it might not have the level of community engagement it needs for long-term viability.
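The checks above can be sketched as a small script. This is only an illustration under stated assumptions: the `RepoActivity` fields, the `warning_signs` helper, and the sample counts are all hypothetical, and in practice you would fill in the numbers from the project's repository page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RepoActivity:
    """Hypothetical snapshot of a project's recent activity (made-up fields)."""
    open_issues: int
    open_pull_requests: int
    core_team_commits: int
    community_commits: int


def warning_signs(activity: RepoActivity) -> List[str]:
    """Return which of the three warning signs apply to this snapshot."""
    signs = []
    if activity.open_issues == 0:
        signs.append("no open issues")
    if activity.open_pull_requests == 0:
        signs.append("no open pull requests")
    # Commits exist, but none of them come from outside the core team.
    if activity.core_team_commits > 0 and activity.community_commits == 0:
        signs.append("commits only from the core team")
    return signs


# Illustrative numbers only, not measurements of any real project.
quiet = RepoActivity(open_issues=0, open_pull_requests=0,
                     core_team_commits=40, community_commits=0)
active = RepoActivity(open_issues=25, open_pull_requests=6,
                      core_team_commits=40, community_commits=12)

print(warning_signs(quiet))
print(warning_signs(active))
```

A project that triggers none of these signs is not automatically healthy; the list is a starting point for a closer look, not a score.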

What you want from an open source analytics stack

Choosing open source analytics over a proprietary stack comes with more work up front. With a SaaS package, often all you need is the right JavaScript snippet in the right place and you can immediately ingest data. It’s also easier to get buy-in for a tool that is well known, requires minimal setup, and works immediately.

But the initial heavy lift of open source analytics is outweighed by the long-term benefits of having total control and an understanding of how the stack (and, by proxy, your raw data) works.

Here are the main benefits of open source analytics:

More control over your data and data infrastructure

This is the core reason to choose these tools. With open source analytics, you own your data and, just as importantly, you own how it is processed.

No vendor lock-in

Your data is an asset for you; it shouldn’t be an asset for your analytics company. Yet that is exactly what vendor lock-in entails: you can’t leave because they effectively own your data.

With open source analytics, you own the infrastructure and the data.

Flexibility for specific use cases

Your tech should reflect your use cases. You shouldn’t have to crowbar your specific needs into a generic system.

When using proprietary stacks, you’re limited to the use cases they are built for. You are also limited to the integrations they have built.

Open source tools allow you to build around the specific use cases you have.

More cost-effective

You can feasibly build an entire analytics stack with open source products for close to free.

Though the software itself is freely available, you will still have to find a data warehouse, a hosting solution, and engineering resources. Each of these will add to the cost of running an open source analytics stack.

However, at low volumes these costs are still likely to be much lower than those of, for example, a GA360 solution (where you’ll still have BigQuery and engineering team costs).

Greater control over security and privacy

When you have control over the information you capture about your users, it makes it easier to understand your data responsibilities.

For example, when you control your data, you have more insight into how your data systems are designed (which is what SOC 2 Type I covers), how they operate over time (SOC 2 Type II), and how your system handles the five SOC 2 trust principles of security, availability, processing integrity, confidentiality, and privacy.

How to open source your data stack

Open sourcing your stack can start with you finding a cool new tool on GitHub and forking it. But the better way to start is from first principles—your use case. Then, move on to the cool tools!

First, decide your use cases. Your use case is the foundation of your stack. It decides what data sources you need, what schema works best, how you’ll enrich your data, and what analysis/visualization/storage fits best.

Building a content recommendation system is different from building marketing attribution. You need different data for each, and you are going to perform different analysis on that data. For a content recommendation system you’ll need granular data on how a user interacts with pages, such as time on page, whereas just logging the page visit might suffice for marketing attribution.

Conversely, for marketing attribution you’ll need Redash or another data visualization tool with Sankey diagrams, so you can visualize the journey that users take through your content. For a content recommender you might not need any end visualization tool at all; instead, you feed the data into an algorithm and use the output directly on your site.
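To make the difference in data granularity concrete, here is a minimal sketch of how the two use cases read the same raw events. Everything in it is an assumption for illustration: the tuple layout, the `heartbeat` event type, and the sample events are hypothetical, and the time-on-page calculation assumes each user opens a given page only once.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, page, event_type, seconds_since_start).
# "page_view" fires once on load; "heartbeat" is an assumed periodic event
# sent while the page stays open.
EVENTS = [
    ("u1", "/pricing", "page_view", 0),
    ("u1", "/pricing", "heartbeat", 10),
    ("u1", "/pricing", "heartbeat", 20),
    ("u1", "/blog", "page_view", 25),
    ("u2", "/blog", "page_view", 0),
    ("u2", "/blog", "heartbeat", 10),
]


def page_visits(events):
    """Attribution-style view: the pages each user visited, in order."""
    visits = defaultdict(list)
    for user, page, kind, _ in events:
        if kind == "page_view":
            visits[user].append(page)
    return dict(visits)


def time_on_page(events):
    """Recommendation-style view: seconds between the first and last
    event seen for each (user, page) pair."""
    first, last = {}, {}
    for user, page, _, ts in events:
        key = (user, page)
        first[key] = min(first.get(key, ts), ts)
        last[key] = max(last.get(key, ts), ts)
    return {key: last[key] - first[key] for key in first}


print(page_visits(EVENTS))
print(time_on_page(EVENTS))
```

The attribution view only needs the page views, while the recommendation view depends on the finer-grained events in between, which is why the use case has to be settled before the tracking is designed.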

Once you have the use case, you can think about the right tools for your pipeline. A modern data stack looks like this:

Outline of the modern data stack (source: dbt)

These are just some of the tools available. Open source analytics has been growing for the last decade, and the market is exploding with open source technologies. As data engineers and data engineering become more sophisticated and nuanced, more and more tools are being built to service the exact needs of data teams.

Opening your stack

Open sourcing your analytics isn’t without headaches. The code is free, but the infrastructure and the resources required to maintain and manage it are not. But even if you choose a hosted solution to take some of those headaches away, the important part of open source analytics remains: the control.

When you own and run the infrastructure you use to capture all your events, and when the data is collected, processed, stored, and analyzed according to your rules and business needs only, then you end up with an incredibly valuable asset: your data.


Of course, with Snowplow BDP, you have the option to host your infrastructure and manage it within your cloud account. You can think of it as having the best of the open source flexibility, without the hard work of setting it up. To discover more, check out Snowplow BDP for yourself.

