Future-proofing your organization with a universal data language

In our content series on Treating Data as a Product, we explored the major challenges organizations face when it comes to keeping productivity high. 

In our conversations with data leaders, we learned that communication – particularly how people interact with each other around data (especially behavioral data, which is rich, voluminous and complex by nature) – is integral to the success of the data function, but notoriously tricky to get right. There are many factors that hamper communication. It can come down to internal teams misunderstanding their data, a lack of good storytelling from data professionals, or mistrust in the data’s integrity from data consumers.

It is therefore vital that organizations build solid foundations when it comes to communicating around their data. Enforcing a data language, a blueprint for how behavioral data should look and feel as it moves throughout the organization, can help us do that.

What is a universal language, and why is it important?

A universal data language is a language that both humans and machines can read. When enforced, it is a robust solution for internal communication that minimizes communication failures. Here, a universal language around data acts as a framework to define and determine data structures across the organization.

The key is that the language is both human and machine readable. Most efforts to standardize the data language are only human readable, and most often result in an event dictionary. Event dictionaries can be helpful, but they are onerous to maintain and inherently flawed as a means to build a data language.

Event dictionaries can be costly and inefficient 

In many cases, event dictionaries are the result of resetting the data setup. This often happens with the help of an external consultant, who audits an existing data function and recommends that the company reinstrument its tracking. Event dictionaries are thus created as a deliverable of these projects, which can be long and costly.

To do that, the consultant works with front-end developers to develop an event dictionary. This most often takes the form of a spreadsheet running to hundreds of rows – given the high-volume nature of behavioral data, these are often extensive, sprawling documents.

However, once the consultant leaves after completing the project, there is a sudden lack of accountability, with no one left to maintain this huge, multi-sheet spreadsheet. The document rapidly becomes outdated, because as the company grows and adds new features and trackers, or as the existing tracking gets updated, none of this is recorded or reflected in the event dictionary. 

This approach does sometimes work, particularly when the dictionary owner is invested in its long-term success, perhaps as one of the data consumers. However, the dictionary is often created as a one-off project by a specialist consultant, and ongoing ownership is unclear.

If we consider two main stakeholders, front-end developers and data consumers from front-line teams, both groups face challenges from a sprawling event dictionary:

  1. Developers can’t interpret the event dictionary, and their goals and incentives often don’t line up with ensuring tracking matches intent exactly. Instead, they are focused on getting “good enough” live on time.
  2. Data consumers either can’t interpret the event dictionary or aren’t sure whether the values loading in the database match the data dictionary’s intent.

At this point, the data team heads to instant messaging platforms like Slack to communicate changes to tracking, but natural silos appear and certain groups hold onto ‘tribal knowledge’ about what certain events mean. 

Soon we have a perfect storm of sprawling Slack channels, company-wide confusion and ineffective data dictionaries. In other words, the data dictionary cannot be ‘enforced’ and the data language cannot be standardized across the organization.

Updates to the way data is collected only serve to make matters worse. 

The need for data governance in a changing world

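The typical spreadsheet-based ‘data dictionaries’ soon become obsolete, which is a challenge in itself. But this is only exacerbated as the product, the questions internal customers ask of the data, and the privacy landscape all continue to change.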

All three of these changing environments influence the data collection process. When new product features are added, the product team must instrument new tracking and data models must be updated in tandem. New events need to be set up to meet the needs of more nuanced questions from internal customers; and as regulatory changes like GDPR or privacy features like Safari’s ITP expand, they directly impact our ability to capture behavioral data.

Data governance is the necessary component that helps you stay on top of your data structure and your evolving data language. In the best case scenario, it takes the form of a centralized, accountable framework that governs how your data structures will evolve over time, designed in a manner that allows flexibility. 

For example, if your users can only log in with a username and password at present, but you plan to let them use Single Sign On (SSO) via Google or Apple in the future, your tracking should be designed so that this information can easily be captured when the time comes. Plan this into your tracking structures so that they adapt as your business evolves.
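One way to leave room for that change is to record the login method as its own property from day one. The following is a minimal sketch in Python; the property name `method` and the values `password`, `sso_google` and `sso_apple` are assumptions made for illustration, not definitions from Snowplow.

```python
# Hypothetical login event definition. The "method" property and its values
# are illustrative assumptions, not fields defined in this article.
LOGIN_METHODS_TODAY = ["password"]


def build_login_schema(methods):
    """Return a JSON Schema-style definition for a login event."""
    return {
        "type": "object",
        "properties": {
            "method": {
                "enum": list(methods),
                "description": "How the user authenticated",
            },
        },
        "required": ["method"],
        "additionalProperties": False,
    }


# Today only username/password logins exist.
schema_v1 = build_login_schema(LOGIN_METHODS_TODAY)

# When SSO ships, the same property absorbs the new values, so queries that
# already group by "method" keep working.
schema_v2 = build_login_schema(LOGIN_METHODS_TODAY + ["sso_google", "sso_apple"])
```

Because the property exists before SSO does, adding the new login methods later only extends a list of allowed values rather than restructuring the event.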

You also need to factor in not just how you define and name your events, but also the framework in which you structure your data in general. For instance, will you be using camel case or snake case, and how are you going to standardize that choice across the business? These are a few examples of the building blocks of your data language.

Defining your events, now and in the future

For example, let’s say you want to track a new event called “Button Click”. How do you spell out “Button Click”? You will need to agree on whether you’re using American spelling (which might affect other event definitions), and whether you’ll be using “button_click” (referred to as snake case) or “buttonClick” (referred to as camel case).

These are the decisions that need to be centrally governed, giving the organization a reference point from which they can confidently instrument tracking or define new events in a consistent manner. 
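A naming decision like this can be checked mechanically once it is made. As a rough sketch, assuming the organization settles on lowercase snake case, two small Python helpers (both hypothetical names) could test and normalize event names:

```python
import re

# Assumed convention for this sketch: lowercase snake_case names.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def is_snake_case(name: str) -> bool:
    """Return True if the name already follows the assumed convention."""
    return bool(SNAKE_CASE.fullmatch(name))


def to_snake_case(name: str) -> str:
    """Convert names such as 'Button Click' or 'buttonClick' to 'button_click'."""
    # Insert an underscore at each lower-to-upper boundary (camelCase).
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse runs of spaces and hyphens into single underscores.
    return re.sub(r"[\s\-]+", "_", spaced).lower()
```

A check like `is_snake_case` could run in code review or CI, so the convention does not depend on everyone remembering it.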

The power of standardizing here is that, as an analyst looking to understand data in the warehouse (say, exploring causes behind a drop-off in the signup funnel), you may want to look at the login event and what the data looks like across both web and mobile. Standardization between teams, platforms and products means that data exploration like this is far less painful. Since the data flows in a uniform format, it should be straightforward to join data sets together and compare the customer journey across multiple instances – often referred to as the single customer view.
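To make the benefit concrete, here is a small Python sketch with made-up rows standing in for warehouse records. The column names `user_id`, `platform` and `event_name` are assumptions; the point is that one function can serve every platform when the event is named and shaped the same way everywhere.

```python
from collections import defaultdict

# Made-up rows standing in for warehouse records; column names are assumptions.
events = [
    {"user_id": "joe", "platform": "web", "event_name": "login"},
    {"user_id": "joe", "platform": "ios", "event_name": "login"},
    {"user_id": "ann", "platform": "web", "event_name": "login"},
    {"user_id": "ann", "platform": "web", "event_name": "page_view"},
]


def logins_by_platform(rows):
    """Count login events per user and platform."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        if row["event_name"] == "login":
            counts[row["user_id"]][row["platform"]] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {user: dict(platforms) for user, platforms in counts.items()}


summary = logins_by_platform(events)
```

If web sent `login` and mobile sent `userLoggedIn`, the filter would need a mapping table per platform, which is exactly the maintenance burden standardization avoids.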

How do I build a process of standardization for my data language? 

There are a few different approaches you can take to standardize your data language. 

How to build a single source of truth with Snowplow

One approach is to take advantage of Snowplow’s dedicated schema technology. Snowplow’s schema registry (Iglu) is available for anyone to use, either as a modular component or within the Snowplow Behavioral Data Platform.

Iglu enables data teams to leverage self-describing JSON schemas to enforce a data language that can be universally interpreted by humans and machines. It acts as a ruleset for what data is allowed to load into the warehouse – removing the need for sprawling data dictionaries. 
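A self-describing JSON pairs a payload with a reference to the schema that describes it. The Python sketch below parses a schema reference in the `iglu:vendor/name/format/version` form and wraps a payload with it. The vendor `com.example`, the schema name `click`, and the helper names are hypothetical, and the pattern is a simplification rather than Iglu’s own parser.

```python
import re
from typing import NamedTuple


class SchemaKey(NamedTuple):
    vendor: str
    name: str
    format: str
    version: str


# Simplified pattern: iglu:<vendor>/<name>/<format>/<MODEL>-<REVISION>-<ADDITION>
IGLU_URI = re.compile(
    r"^iglu:([A-Za-z0-9_.\-]+)/([A-Za-z0-9_\-]+)/([A-Za-z0-9_\-]+)/(\d+-\d+-\d+)$"
)


def parse_iglu_uri(uri: str) -> SchemaKey:
    """Split a schema reference into its four parts, or raise ValueError."""
    match = IGLU_URI.match(uri)
    if match is None:
        raise ValueError(f"not a valid Iglu schema URI: {uri!r}")
    return SchemaKey(*match.groups())


def wrap_event(uri: str, data: dict) -> dict:
    """Build a self-describing JSON: the payload plus a pointer to its schema."""
    parse_iglu_uri(uri)  # fail early on a malformed reference
    return {"schema": uri, "data": data}


# Hypothetical vendor and schema name, used only for illustration.
key = parse_iglu_uri("iglu:com.example/click/jsonschema/1-0-0")
event = wrap_event("iglu:com.example/click/jsonschema/1-0-0", {"element_name": "share"})
```

Because every event carries its own schema reference, both a person and a pipeline can look up what each field is supposed to contain.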

Going back to our two main stakeholders:

  1. A developer must set up tracking in a way that conforms to the data language ‘ruleset’. If they don’t, their data fails validation, which can be picked up in failed event logs or, ideally, during the testing phase before tracking has entered production.
  2. Data consumers are empowered to collaborate to create their own ‘rules’ in the universal data language (e.g. JSON schema). They can control the structure of the data in the warehouse (and other targets) and therefore have confidence in what the input of the data product will look like. Furthermore, all new joiners to the data team know exactly what each field means.

In this example, no one has to communicate design intent with their own tracking conventions – and no one is left wondering how to interpret intent. As a result, no two people need to communicate directly for successful tracking to take place – relieving bottlenecks and reducing the likelihood of miscommunication.

What this looks like in practice

Within Snowplow, it’s possible to design your events in advance, shaping what the events coming to your data warehouse look like before they are even sent.

You can do this by writing a set of ‘rules’ that dictate the structure of your behavioral data.

For example, the ruleset for a click event could look like this:

{
    "element_name": {
        "enum": [
            "share",
            "like",
            "submit_email",
            "rate",
            ...
            "close_popup"
        ],
        "description": "The name of the element that is clicked"
    },
    "value": {
        "type": ["string", "null"],
        "description": "What is the value associated with the click"
    },
    "element_location": {
        "type": ["string", "null"],
        "description": "Where on the screen is the button shown eg. Top, left"
    },
    "click_error_reason": {
        "type": ["string", "null"],
        "description": "If the click resulted in an error, what was the reason eg. Invalid character in text field"
    }
}

With Snowplow, each event is ‘checked’ against these rulesets to see that it adheres to your structures, before it lands in the data warehouse. You can think of it as a machine doing the data governance for you, ticking off items from a checklist to ensure data quality standards are met. 
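As a rough illustration of that check, the Python sketch below validates an event against the click ruleset above. It is a hand-rolled toy that understands only the `enum` and `type` keywords, not Snowplow’s validator or a full JSON Schema implementation. The original enum list is elided, so only the values it names are included.

```python
# Properties copied from the click ruleset in this article. The article elides
# part of the enum list, so only the values it names appear here.
CLICK_RULES = {
    "element_name": {"enum": ["share", "like", "submit_email", "rate", "close_popup"]},
    "value": {"type": ["string", "null"]},
    "element_location": {"type": ["string", "null"]},
    "click_error_reason": {"type": ["string", "null"]},
}

# Maps JSON Schema type names to Python checks (only the two types used above).
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "null": lambda v: v is None,
}


def validate(event: dict, rules: dict) -> list:
    """Return a list of violations; an empty list means the event passes."""
    errors = []
    for field in event:
        if field not in rules:
            errors.append(f"unexpected property: {field}")
    for field, rule in rules.items():
        if field not in event:
            continue  # this sketch treats every property as optional
        value = event[field]
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{field}: {value!r} is not an allowed value")
        if "type" in rule and not any(TYPE_CHECKS[t](value) for t in rule["type"]):
            errors.append(f"{field}: expected one of {rule['type']}")
    return errors
```

For example, `validate({"element_name": "buttonClick"}, CLICK_RULES)` returns one violation, because `buttonClick` is not among the allowed element names.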

This method means the structure of data in the warehouse is controlled strictly by those consuming it.

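The first four columns are a subset of the properties sent automatically for every event; the last four are custom properties of the click event.

| User ID | Platform | Timestamp | Event Name | Element_name | Value | Element_location | Click_error_reason |
|---|---|---|---|---|---|---|---|
| Joe | Web | 2019-10-01 12:33:21 | Page_view | | | | |
| Joe | Web | 2019-10-01 12:33:29 | Click | submit_email | joe@email.com | homepage_footer | |
| Joe | iOS | 2019-10-01 23:31:03 | Click | rate | | | no_rating_selected |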

Snowplow does this by validating behavioral data at the Enrichment stage of the data collection process. At this stage, nothing passes through to the data warehouse unless it meets the requirements of the schema. If it fails, it isn’t deleted, but sent to another repository for ‘Failed Events’, where the data can be rectified and restored. 
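The routing idea can be sketched in a few lines of Python. This is a simplified illustration, not Snowplow’s Enrichment code: the `route` function, the `reason` text and the toy validity rule are all assumptions.

```python
def route(events, is_valid):
    """Split events into those that load and those set aside for repair."""
    good, failed = [], []
    for event in events:
        if is_valid(event):
            good.append(event)
        else:
            # Keep the original payload and a reason so it can be fixed and replayed.
            failed.append({"payload": event, "reason": "failed schema validation"})
    return good, failed


# Toy rule standing in for real schema validation: element_name must be a string.
def has_element_name(event):
    return isinstance(event.get("element_name"), str)


good, failed = route(
    [{"element_name": "share"}, {"element": "share"}],
    has_element_name,
)
```

The important design choice is that the failed branch keeps the full payload: nothing is dropped, so a tracking mistake can be corrected and the events recovered.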

With this simple change of introducing an enforced ruleset, your front-end developers can QA your analytics in the same way they would QA the rest of any build, by adding checks to their integration test suite with something like the open source tool Snowplow Micro.

Snowplow tracking can also be versioned – definitions can be updated according to semantic versioning with all changes automatically manifesting in the warehouse table structure.
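Iglu schema versions are written as three hyphen-separated numbers, MODEL-REVISION-ADDITION, a scheme Snowplow calls SchemaVer. Assuming that format, a version bump could be sketched in Python as follows; the `bump` helper and its `change` labels are hypothetical.

```python
def bump(version: str, change: str) -> str:
    """Bump a MODEL-REVISION-ADDITION version string.

    change: "model" for breaking changes, "revision" for changes that may not
    fit some historical data, "addition" for backward-compatible changes.
    """
    model, revision, addition = (int(part) for part in version.split("-"))
    if change == "model":
        return f"{model + 1}-0-0"
    if change == "revision":
        return f"{model}-{revision + 1}-0"
    if change == "addition":
        return f"{model}-{revision}-{addition + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

Adding an optional property such as a new nullable field would typically be an addition, so `bump("1-0-0", "addition")` gives `"1-0-1"`.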

In a typical tracking workflow, a data practitioner would: 

  1. Collaborate in a tracking design workshop;
  2. Upload the rules (event and entity definitions) to the pipeline;
  3. Test tracking against these rules in a sandbox environment (to check the data is firing as expected, before pushing to production);
  4. Set up integrated tests to ensure each code push takes analytics into account;
  5. Set up alerting for any spike in events failing validation.
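The alerting in the final step can start very simply. The Python sketch below flags a spike when the share of failed events rises well above a usual baseline; the default `baseline` and `factor` values are arbitrary assumptions that would need tuning against real traffic.

```python
def failure_rate(good: int, failed: int) -> float:
    """Share of events that failed validation in a time window."""
    total = good + failed
    return failed / total if total else 0.0


def should_alert(good: int, failed: int, baseline: float = 0.01, factor: float = 5.0) -> bool:
    """Alert when the failure rate exceeds `factor` times the usual baseline."""
    return failure_rate(good, failed) > baseline * factor
```

With these defaults, a window of 990 good and 10 failed events stays quiet, while 900 good and 100 failed triggers an alert.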

This enables you to build a high level of assurance in the data landing in your warehouse, and empowers the data team to act on failed data rather than ignoring it.

Evolving your data workflow 

Your data workflow, from instrumenting tracking to dashboarding and analytics, will largely depend on the data maturity of your organization.

A company that’s just getting started with its data setup might rely on a single data practitioner, responsible for building reports, implementing tracking and laying the foundations for data governance. At this stage and scale, the challenges around the data language are minimal. 

However, as the organization grows in size and data requirements, with data-hungry teams making constant tracking requests for multiple aspects of the product, the complexity of data governance increases sharply. At a certain point it makes sense to devolve responsibility for data governance to those teams with domain expertise. By empowering them to define their own event structures, you can avoid the data team becoming a bottleneck.

It makes sense for the data team to retain responsibility for central data governance and the blueprint for how new events are created and implemented. But with the right standardization in place, product teams can instrument tracking themselves, making for a far faster, leaner process. Those product teams become partly autonomous, empowered to create insights without the need for a data engineer.

This centralized framework gets you closer to the ‘nirvana’ of self-serve data orchestration, something that only a few organizations have achieved so far. Strava is a great example of a company that has managed this, enabling analysts across the business to fetch data for themselves.

Well-defined events help you gain deeper insights about your product

Defining events used in your tracking might not seem the most important aspect of your data workflow, but it can pay huge dividends downstream. 

To take one example, let’s say you’re trying to better understand how your search function works within your product or website. 

The search function is a surprisingly complex feature that can generate nuanced questions from the product team and lead to intricate event definitions. Questions might include 

To answer these questions, you would need to look again at the schema for the search event that the developer or data producer first defined, to understand how the search trigger was implemented. This is where a well-defined central data language comes in. Analysts with access to a centralized schema registry will be able to explore exactly what’s happening when the event fires, and understand on a granular level how their search function can be improved. 

Search functions are a good example because, when it comes to behavioral data capture, they’re often poorly understood. But this is just one example where, if events are defined and planned before tracking is implemented, the data team and data consumers alike have access to a much clearer picture of user interactions. What’s more, they’re looking at the same picture, from a single source of truth, with no ambiguity or misinterpretation. 

A universal language has the potential to unite data teams and data consumers, while creating room for data structures to grow and adapt as the organization evolves.
