The top 14 open source analytics tools in 2021

It’s fairly easy to find the top SaaS analytics tools. Just search Google for “analytics tools,” and you’ll be presented with some well-placed, high-cost ads pointing you in the right direction.

Finding open source analytics tools isn’t quite as easy. You can go to GitHub and look for repos tagged as “analytics,” but this can be a flawed strategy because:

To help, we’ve rounded up the top open source analytics tools. We will break down what they do well and also highlight any weak spots they might have.

Open source analytics tools to try in 2021 - graphic representation

The rankings here are partly due to overall popularity—the most starred and forked open source analytics tools on GitHub. But we’ve also included some that are of real interest but aren’t yet in widespread use. These are the tools we think will become ever more important in the open source analytics ecosystem in 2021.

What’s not covered here are the tools that compete specifically with Google Analytics: Matomo, Plausible, Koko, and Offen. They are great for their specific use cases, but if you are just starting to use open source analytics tools, it makes more sense to do so completely, breaking free of all your packaged tools.

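We’ve split the software into nine separate categories of open source analytics tools, each serving a different need: product analytics, A/B testing, CDPs and reverse ETL, data validation, analytics engineering, anomaly detection, databases, data visualization, and behavioral data platforms.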

Let’s look at a few instances of each that will help you open source your analytics.


Open source product analytics tools

These are entire platforms that can supersede your packaged SaaS tools and give you end-to-end control and insight into your data. The overall pluses of these types of tools are control and customization. You have complete access to your data and can decide exactly how the data is analyzed. The downside is that they can be resource-intensive to set up and run.

1. Countly for easy mobile analytics

Countly / Countly GitHub / AGPL v3 license / 4.6k stars

The strength of Countly is easy access to your data with read and write API access and analytics for mobile, web, and desktop. It features a number of open source plugins to help you collect and understand your data better.

Image source: Countly

The downside of the open source version is that it doesn’t include all the features of the Enterprise paid version. With open source, you miss out on real-time data, user profiles, and the ability to design funnels. The open source version also “stores data (only) in an aggregated format,” so you can’t export the data and perform more granular analysis elsewhere (though this does make reporting faster).

2. PostHog for quick setup of self-hosted analytics

PostHog / PostHog GitHub / MIT license / 3.4k stars

PostHog is a self-hosted, open source analytics platform that allows for extremely easy deployment. You can deploy the tool directly to Heroku in one click. This sets it apart from a lot of other open source analytics tools that have a more involved setup process and require more knowledge to get up and running. PostHog works well for teams new to the open source world.

PostHog overview dashboard
Image source: PostHog GitHub

A weakness of PostHog is that you might be limited if you are building out marketing attribution with open source analytics. PostHog doesn’t currently have email link tracking or ad campaign tracking, so you will be missing a subset of your data when trying to understand your marketing campaigns better.

A note on ‘enterprise scale’ web analytics tools:

If you are looking for open source analytics at enterprise scale, you might want to consider a mesh of tools that deliver analytics data into a data warehouse. This is exactly what Snowplow, number 14 on this list, was made for.

This approach lets you build a competitive advantage from high-quality data amassed at scale, and activate that data in tools built specifically for real-time marketing automation, customer engagement, and business intelligence.


Open source A/B testing tools

3. Wasabi for real-time, enterprise-grade A/B testing

Wasabi / Wasabi GitHub / Apache-2.0 license / 973 stars

Wasabi is a real-time, open-source, 100% API-driven A/B testing platform by Intuit. It allows users to own their data and experiment across web, mobile, and desktop. Teams choose Wasabi because it’s fast, scalable, and easy to use for organizations of all sizes.

Wasabi AB test results example
Image source: Wasabi GitHub

Developers lean toward Wasabi for A/B testing because it is 100% API-driven, so it can be integrated from any programming language and environment. The software has been tested for years with products like TurboTax and QuickBooks.
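To make the idea of API-driven experiment assignment more concrete, here is a minimal Python sketch of deterministic bucket assignment. It is not Wasabi's API or its actual algorithm: the function name `assign_bucket`, the experiment name, and the weights are all hypothetical, and hashing is just one common way to give a user the same variant on every request.

```python
import hashlib
from typing import Dict


def assign_bucket(experiment: str, user_id: str, weights: Dict[str, float]) -> str:
    """Deterministically map a user to a bucket, proportionally to the weights.

    Hypothetical helper for illustration only; not part of Wasabi.
    """
    if not weights:
        raise ValueError("at least one bucket is required")
    # Hash experiment + user so the same user always lands in the same bucket,
    # while different experiments split users independently of each other.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Scale the first 8 hex digits to a float in [0, 1).
    point = int(digest[:8], 16) / 0x100000000
    total = sum(weights.values())
    cumulative = 0.0
    for bucket, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return bucket
    # Guard against floating-point rounding at the final boundary.
    return list(weights)[-1]


# Example: a 50/50 split for a hypothetical "new_checkout" experiment.
variant = assign_bucket("new_checkout", "user-123", {"control": 1, "treatment": 1})
```

Because the assignment depends only on the experiment name and the user ID, any client that can compute or request it (web, mobile, or desktop) sees a consistent variant without shared session state.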

While Wasabi is a proven open-source platform that can run on your servers or in the cloud, it is no longer under active development or supported by Intuit, as of August 28, 2019. 


Open source CDPs / Reverse ETL tools

4. Grouparoo for integrating customer data with cloud-based tools

Grouparoo / Grouparoo GitHub / Mozilla Public License 2.0 / 428 stars

Grouparoo is an open-source Reverse ETL solution that makes it easy to send data from your data warehouse to cloud-based marketing, sales, and customer platforms like Mailchimp, Salesforce, and Zendesk. Grouparoo integrates with any tech stack; you can configure your setup locally, commit changes, and deploy with git, much as you would deploy dbt projects. There’s also a web-based user interface to support complex configurations.

Image source: Grouparoo

Grouparoo is a very new solution and therefore doesn’t offer as many integrations as its proprietary counterparts in the reverse ETL category. That said, it’s a hugely promising platform, with advantages in privacy and in the way it fits into your existing engineering workflow. Grouparoo also has great segmentation capabilities, including a group-building tool that can be used by engineers as well as by less technical teams such as marketers. Groups determine which profiles get synced to which tools, and they also create tags or lists in the destination systems.
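The group-based syncing described above can be sketched in a few lines of Python. This is an illustration of the general reverse ETL pattern only, not Grouparoo's configuration format or API; the `Group` class, the rule functions, and the profile fields are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Profile = Dict[str, Any]


@dataclass
class Group:
    """A named segment defined by a rule over profile properties (hypothetical)."""
    name: str
    rule: Callable[[Profile], bool]


def tags_for(profile: Profile, groups: List[Group]) -> List[str]:
    """Return the names of every group the profile belongs to."""
    return [group.name for group in groups if group.rule(profile)]


def build_sync_payload(profiles: List[Profile], groups: List[Group]) -> List[Dict[str, Any]]:
    """Build the records to send to a destination: only profiles that belong
    to at least one group are included, and group names become tags."""
    payload = []
    for profile in profiles:
        tags = tags_for(profile, groups)
        if tags:
            payload.append({"email": profile["email"], "tags": tags})
    return payload


# Example with made-up warehouse rows and two made-up groups.
profiles = [
    {"email": "ada@example.com", "plan": "pro", "orders": 3},
    {"email": "bob@example.com", "plan": "free", "orders": 0},
]
groups = [
    Group("paying", lambda p: p["plan"] != "free"),
    Group("repeat_buyers", lambda p: p["orders"] >= 2),
]
payload = build_sync_payload(profiles, groups)
```

In this sketch only the first profile is synced, carrying both group names as tags for the destination tool.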

5. Pimcore for managing digital data

Pimcore / Pimcore GitHub / GPLv3 license / 2k stars

Pimcore was introduced to the open-source world in 2010. The platform helps organizations manage digital data and customer experience. Pimcore is 100% API-driven, allowing integration into any tech stack. Eighty-two thousand customers across 56 countries use Pimcore to manage their data, including Sony and Pepsi.

Pimcore architecture outline
Image source: Pimcore GitHub

Pimcore stores data independently and can provide the managed data to any channel, such as B2B websites, ecommerce systems, and mobile applications. 

It is important to know that Pimcore is not an “out of the box” software product and, therefore, is meant for people with software development experience.


Open source data validation tools

These tools have a specific use within your data pipeline. You can add them in as a step within an open source data platform to perform a single function. The plus of these tools is that they perform important operations that you are unlikely to get in packaged SaaS tools. The downside is that they are built specifically for certain purposes—you need multiple tools like these to answer every use case you have.

6. Great Expectations for data validation

Great Expectations / Great Expectations GitHub / Apache-2.0 license / 3.2k stars

The strength of Great Expectations (apart from its amazing name!) is that it allows you to set and assert specific validation rules for your data and be alerted when your data is straying from those rules. You can also automatically create documentation directly from these assertions:

Great Expectations data validation example assertions
Image source: Great Expectations

A caveat is that Great Expectations is very new. It has a lot of promise, but key features, such as autogenerated documentation from tests and data profiling, are still experimental.
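To show what "setting and asserting validation rules" means in practice, here is a minimal plain-Python sketch. It deliberately does not use the Great Expectations library or mirror its API; the helpers `expect_not_null`, `expect_between`, and `validate` are hypothetical names for illustration. Each rule carries a human-readable description, which hints at how documentation can be generated from the same assertions.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
# An expectation is a description plus a per-row check (hypothetical structure).
Expectation = Tuple[str, Callable[[Row], bool]]


def expect_not_null(column: str) -> Expectation:
    """Rule: the column must be present and not None."""
    return (f"{column} is not null", lambda row: row.get(column) is not None)


def expect_between(column: str, low: float, high: float) -> Expectation:
    """Rule: the column must be a value within [low, high]."""
    return (
        f"{column} is between {low} and {high}",
        lambda row: row.get(column) is not None and low <= row[column] <= high,
    )


def validate(rows: List[Row], expectations: List[Expectation]) -> Dict[str, List[int]]:
    """Return a mapping from each violated rule's description to the
    indexes of the rows that failed it. An empty dict means all rules passed."""
    failures: Dict[str, List[int]] = {}
    for description, check in expectations:
        bad = [index for index, row in enumerate(rows) if not check(row)]
        if bad:
            failures[description] = bad
    return failures


rows = [
    {"user_id": 1, "age": 30},
    {"user_id": None, "age": 200},
]
rules = [expect_not_null("user_id"), expect_between("age", 0, 120)]
failures = validate(rows, rules)
```

Here the second row breaks both rules, so a pipeline step could raise an alert or stop the load instead of passing bad data downstream.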


Open source analytics engineering tools

7. dbt for improved analytics workflow

dbt / dbt GitHub / Apache-2.0 license / 2.2k stars

dbt’s strength is that it allows you to bring general engineering principles, such as version control, testing, and sandboxing, into your data pipeline. You can perform data transformation and business logic without impacting users in separate, collaborative environments.

dbt data pipeline example
Image source: dbt

The limitation of dbt is that it is purely a transformation tool. It expects that extraction and loading will be done by another tool. This is fine, as there are plenty of other tools that can do these jobs in the pipeline, but it’s important to realize this is just one step in a larger process.
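The "transform only" idea can be illustrated with a small, self-contained sketch. dbt models themselves are written as SQL; Python's built-in `sqlite3` module is used here only to keep the example runnable, and the table, view, and helper names (`raw_orders`, `customer_revenue`, `failing_rows`) are hypothetical. The raw table stands in for data that another tool has already extracted and loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumption: an extract-and-load tool has already put raw data in the warehouse.
conn.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount_cents INTEGER);
    INSERT INTO raw_orders VALUES (1, 'ada', 1250), (2, 'ada', 800), (3, 'bob', 4300);
""")

# Transformation step: a model defined purely as a SELECT over loaded data.
conn.execute("""
    CREATE VIEW customer_revenue AS
    SELECT customer, SUM(amount_cents) / 100.0 AS revenue
    FROM raw_orders
    GROUP BY customer
""")


def failing_rows(query: str) -> list:
    """Run a test query; the test passes when it returns no rows."""
    return conn.execute(query).fetchall()


# Data tests expressed as queries that select offending rows.
not_null_failures = failing_rows(
    "SELECT * FROM customer_revenue WHERE customer IS NULL"
)
unique_failures = failing_rows(
    "SELECT customer FROM customer_revenue GROUP BY customer HAVING COUNT(*) > 1"
)

revenue = dict(conn.execute("SELECT customer, revenue FROM customer_revenue"))
```

Keeping the transformation as a version-controlled SELECT with tests alongside it is what lets the business logic be reviewed and sandboxed like any other code.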


Open source anomaly detection tools

8. Hastic for data anomaly detection

Hastic / Hastic GitHub / Apache-2.0 license / 269 stars

The strength of Hastic is its ability to find anomalies in your data and alert you immediately. You set up predefined parameters for possible anomalies in your data, and Hastic will find them if they reoccur:

Hastic anomaly recurrence detection animation
Image source: Hastic

The limitation here is that Hastic only works with the open source monitoring platform Grafana, so you can’t see these plots in Superset or Metabase. Hastic is also lightly documented at present, so setup and maintenance might be a challenge.
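For readers new to anomaly detection, the following Python sketch shows one simple approach: flagging points that deviate sharply from a trailing window. This is not how Hastic is implemented, and it does not use Hastic or Grafana; the function name `find_anomalies` and its `window` and `threshold` parameters are hypothetical choices for illustration.

```python
from statistics import mean, stdev
from typing import List


def find_anomalies(values: List[float], window: int = 5, threshold: float = 3.0) -> List[int]:
    """Return the indexes of points that lie more than `threshold` standard
    deviations away from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat window: treat any change as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up metric with a single spike at index 6.
series = [10, 11, 10, 9, 10, 11, 50, 10, 11]
spikes = find_anomalies(series)
```

In a monitoring setup, the returned indexes would map back to timestamps and trigger an alert; the window size and threshold play the role of the predefined parameters you tune for your own data.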


Open source databases

Open source databases allow you to store your data outside of the larger proprietary warehouses. A lot of databases, such as MySQL, PostgreSQL, CockroachDB, MongoDB, and SQLite, are open source, but the two highlighted here are different in that they are engineered to deal with specific types of data and analysis.

9. Apache Druid for real-time DB querying

Druid / Druid GitHub / Apache-2.0 license / 10.3k stars

The strength of Druid is in real-time analytics, where a user is performing multiple queries in rapid succession and needs sub-second answers. If you are working on a product that requires you to analyze data on the fly, then Druid is the right database to choose.

Apache Druid gif ordering and sorting example
Image source: Druid GitHub

Druid’s lack of fault tolerance has been cited as a weakness, specifically if you are susceptible to network failures.

10. Timescale for time-series querying

Timescale / Timescale GitHub / Apache-2.0 license / 9.8k stars

Timescale’s strength is that it is optimized for time-series data. If you are working with time-series data, such as ongoing product usage, Timescale allows you to perform complex queries on the data.

Example Timescale query
Image source: Timescale

A weakness of Timescale is that, although the relational database model is versatile, the tool has a steep learning curve and can be difficult to get started with.
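A typical time-series question, such as "how many product events happened in each 15-minute interval?", comes down to bucketing timestamps and aggregating per bucket. The Python sketch below illustrates that idea only; it does not use Timescale or its SQL functions, and the helper names `floor_to_bucket` and `events_per_bucket` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def floor_to_bucket(ts: datetime, width: timedelta) -> datetime:
    """Floor a timezone-aware timestamp to the start of its fixed-width bucket."""
    whole_buckets = (ts - EPOCH) // width  # timedelta // timedelta gives an int
    return EPOCH + whole_buckets * width


def events_per_bucket(events: Iterable[datetime], width: timedelta) -> Dict[datetime, int]:
    """Count events per bucket, returned in chronological order."""
    counts: Dict[datetime, int] = defaultdict(int)
    for ts in events:
        counts[floor_to_bucket(ts, width)] += 1
    return dict(sorted(counts.items()))


# Made-up product usage events.
events = [
    datetime(2021, 3, 1, 10, 1, tzinfo=timezone.utc),
    datetime(2021, 3, 1, 10, 14, tzinfo=timezone.utc),
    datetime(2021, 3, 1, 10, 16, tzinfo=timezone.utc),
]
usage = events_per_bucket(events, timedelta(minutes=15))
```

A time-series database performs this kind of grouping inside the query engine and at far larger scale, which is where the optimization for time-ordered data pays off.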


Open source data visualization tools

For any data analysis, you want the ability to query and visualize the data. Proprietary dashboards and business intelligence tools such as Looker, Tableau, or Chartio are extremely popular, but so are some of the open source visualization tools available. These are some of the most starred and forked open source analytics tools out there.

11. Superset for visualizing data in any DB

Superset / Superset GitHub / Apache-2.0 license / 31.4k stars

The main strength of Superset is that it integrates with dozens of modern databases, so wherever your data currently lives, Superset can interface, allowing you to visualize your data. You can also visualize and analyze data from different sources simultaneously.

Superset pull request chart
Image source: Superset

Superset is not necessarily an “enterprise-ready” tool. There is a challenging setup process, and some cite potential security risks of giving a Docker image access to your data. But it is an extremely powerful tool if you take the time to learn all that Superset has to offer.

12. Metabase for quick visualization

Metabase / Metabase GitHub / AGPL license / 22.9k stars

The strength of Metabase is its simplicity, both in setup (boasting a five-minute setup process) and in the analysis, where anyone on your team can use Metabase to query your data and get answers.

Metabase company wide KPI dashboard
Image source: Metabase

Its strength is also its weakness, in that the simplicity can make complex querying of your data more difficult. There is an SQL mode, but it isn’t the main feature of the tool, as it is in other business intelligence tools.

13. Redash for different dashboards for different teams

Redash / Redash GitHub / BSD-2-Clause license / 17.7k stars

Like Metabase, the strength of Redash is in its ease of use. Though you do need some SQL experience to get the most out of the tool, you can easily create visualizations based on your data, and you can create different dashboards for different teams.

Redash browser granular device usage data example
Image source: Redash

The downside of Redash is that its visualizations and dashboards aren’t quite as pretty or sophisticated as those you can produce with Metabase, and it doesn’t have quite the power of Superset. It was also recently acquired by Databricks, so its future is uncertain.


Open source behavioral data platforms

14. Snowplow 

Snowplow / Snowplow GitHub / Apache-2.0 license / 5.8k stars

The strength of Snowplow is complete ownership of your data and data infrastructure. You have direct access to your granular data and can collect, process, analyze, and store it exactly as you need. Snowplow has trackers and webhooks to pull in multiple data sources and integrates with the main data warehouses.

Snowplow data management pipeline graphic
Image source: Snowplow

Snowplow’s extensive toolset means it can be daunting for engineers to set up and run. It takes time to build a tracking strategy and implement Snowplow in an effective way for your team. Thankfully, it’s possible to gain support via Snowplow Insights, a managed, private SaaS version of the product.

With over 600,000 mobile apps and websites using Snowplow, there is a vibrant community of users on hand to help you answer questions while setting up Snowplow for your organization.

Take control of your data with open source tools

There is a wealth of opportunities for data teams looking to leverage open source analytics tools. With some, you can get an entire pipeline, from collection to transformation and visualization, up and running in an hour. Others will take your entire data team weeks to configure.

Whatever your use case, it makes sense to explore the flexibility of open source tools. In particular, it’s worth taking advantage of thriving open source analytics communities, discourse forums, Slack environments, and Twitter chats to find the best tools for your chosen use case.

Snowplow users frequently integrate many of the above tools to build an open source data stack. Snowplow’s modular technology can slot into your existing processes, giving you the flexibility to leverage Snowplow for multiple use cases.

If you’d like to learn more about how Snowplow’s open core infrastructure can empower you on your data journey, why not try Snowplow yourself?

Learn more about our open source technology with a Snowplow demo or try us out!
