Squads are self-contained units, popularized by companies like Spotify, comprising developers, engineers, analysts, data scientists, and people from other disciplines so that the squad can operate independently. Squad-based organizations have demonstrated the effectiveness of this style of product development: a key strength of the squad model is that each team can maintain a singular focus and, because it is cross-functional, has the capability to execute on that goal.
Effective squads are data consumers, with each squad using data to inform the product development process. At the same time, each squad is a data producer, responsible for any data that describes user interactions with the particular product or feature they own. In a company with multiple squads, this presents a data workflow challenge: to be effective, each squad needs to be autonomous enough to make any required changes to the data they produce, while simultaneously relying on data produced by other squads (who have the same freedom to change their data) to understand user behavior and how it’s changing application-wide. Efficiently managing these complicated data workflows, along with the underlying data, presents a massive challenge.
In order to be effective, squads need to understand all aspects of the problem they’re trying to solve. This might involve behaviors that occur elsewhere in the user journey, or experiences created by other squads. Each squad might be organized around a business goal, like “improve customer acquisition,” or around part of the product, like “improve the recommendation engine.” As data consumers, squads want access to as much data as possible so they are well informed and best able to achieve their goals. However, each squad is only directly responsible for a subset of the data (that which is generated by the technology they’ve built) and consequently needs to be able to consume data that is “controlled” by other squads. Therefore data must be shareable and, more importantly, usable among all squads. And of course, each squad will want the flexibility to evolve the structure of their data as their needs change, and the freedom to do so without liaising with every other squad that might be using that data.
But users don’t use the features of a product in isolation; the changes that squads make, however well-intentioned or positive for their domain, fundamentally alter what happens to users throughout the product and affect the data and experiences of other squads. A squad’s data is part of a much larger set that describes a user’s behavior.
The unified log paradigm, around which Snowplow is architected, is a way of structuring your data stack so that all of your data, regardless of source, is combined into a single log, upstream from your data warehouse. Under this model, each squad can be responsible for their own part of the user journey and the data that describes that part of the journey. Squads can track any and all of the data points relevant to their goals, like pageviews per session or new customers per month.
All data from the different parts of the user journey is then combined in the unified log. Because the log is upstream from the data warehouse, all of the tools that consume data from the warehouse report off the same source of truth, giving all teams full visibility. This addresses the two primary data concerns squads have: can they be responsible for generating their own data, and can they consume the full data set?
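To make the idea concrete, here is a minimal sketch of how events emitted by different squads might be merged into a single, time-ordered log before reaching the warehouse. The squad names and event fields are hypothetical, and a real unified log would be a streaming system rather than in-memory lists:

```python
from heapq import merge

# Hypothetical event streams produced by two different squads.
# Each stream is already ordered by timestamp, as an append-only log would be.
search_squad_events = [
    {"ts": 1, "squad": "search", "event": "search_performed"},
    {"ts": 4, "squad": "search", "event": "result_clicked"},
]
checkout_squad_events = [
    {"ts": 2, "squad": "checkout", "event": "cart_viewed"},
    {"ts": 5, "squad": "checkout", "event": "order_placed"},
]

# The unified log combines every squad's stream into one ordered sequence,
# so downstream consumers see the whole user journey from one source of truth.
unified_log = list(merge(search_squad_events, checkout_squad_events,
                         key=lambda e: e["ts"]))

for e in unified_log:
    print(e["ts"], e["squad"], e["event"])
```

Each squad only ever writes to its own stream, yet every consumer reads the combined journey.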
Though the unified log addresses the data workflow management problem, this only gets us halfway to a full solution. The unified log provides a way of making data produced by any given squad accessible to the rest. However, it doesn’t solve the second problem: each individual squad wants the freedom (autonomy) to change the structure of the data they produce to reflect any changes they’re making to the product or user experience in pursuit of their goal. With a potentially large number of other squads relying on that data, changing the data structure is going to break any downstream analyses those squads are using.
Schema versioning and schema registries are the key innovation here. By providing a formal framework for data-producing squads to alter the structure of their data over time by publishing new schemas that are accessible to anyone who uses that data, each squad can make any changes they need while data consumers track those changes and use the schemas to correctly read the data. In the Snowplow ecosystem, schema versioning is done according to SchemaVer, and we use Iglu to provide each Snowplow user with their own schema registry.
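SchemaVer versions take the form MODEL-REVISION-ADDITION (e.g. "1-0-0"): an ADDITION bump is a change compatible with all existing data, a REVISION bump may be incompatible with some data, and a MODEL bump is a breaking structural change. A rough sketch of how a consumer might reason about this convention (the compatibility rule here is a simplification of the full SchemaVer semantics):

```python
def parse_schemaver(version: str) -> tuple[int, int, int]:
    """Parse a SchemaVer string like '2-1-0' into (MODEL, REVISION, ADDITION)."""
    model, revision, addition = (int(part) for part in version.split("-"))
    return model, revision, addition

def same_model(a: str, b: str) -> bool:
    """Versions sharing a MODEL number describe compatible generations of
    a schema; a MODEL bump signals a breaking change that consumers must
    handle explicitly."""
    return parse_schemaver(a)[0] == parse_schemaver(b)[0]

print(same_model("1-0-0", "1-2-1"))  # True: additions/revisions within one model
print(same_model("1-2-1", "2-0-0"))  # False: a MODEL bump is a breaking change
```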
A schema is a definition for a particular type of data that specifies how that data is structured. In a unified log setup, schemas function as contracts between the squads that produce the data and those that consume it; as long as the consuming squads read the data with reference to the schema, the data will be structured the way the consumer expects and can therefore be successfully processed. This makes it possible to build data-consuming processes that correctly read the data, regardless of any changes made by other squads. Versioning schemas allows squads to change the way they produce data, down to its fundamental structure, over time as their needs evolve. If other teams need to work with that data, they can use the schemas to understand how the data is structured and how it has changed, so they can easily work with it to build insight.
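The contract idea can be sketched in a few lines. This is plain Python rather than a full JSON Schema validator, and the schema and events below are hypothetical examples, not Snowplow-defined structures:

```python
# A minimal schema: each required field with its expected type.
schema = {
    "required": {"user_id": str, "product_id": str, "quantity": int},
}

def conforms(event: dict, schema: dict) -> bool:
    """Return True if the event carries every required field with the
    expected type -- i.e. the producer honoured the contract."""
    return all(
        field in event and isinstance(event[field], expected_type)
        for field, expected_type in schema["required"].items()
    )

good = {"user_id": "u42", "product_id": "p7", "quantity": 2}
bad = {"user_id": "u42", "quantity": "two"}  # missing field, wrong type

print(conforms(good, schema))  # True
print(conforms(bad, schema))   # False
```

A consumer that checks data against the published schema before processing it is insulated from silent structural drift.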
Using schemas provides a way to give data producers the flexibility to keep rapidly evolving their data structures, ensuring their autonomy, while simultaneously empowering all other squads with the information required to use that data. Schemas are an essential element of making a unified log architecture function optimally, but they’re not the only component.
In a world with multiple different events and entities being tracked and consumed by different squads, and in which the structure of those different entities can change over time, it’s important that it’s clear what each event and entity actually is in order for it to be processed using the correct schema definition. With Snowplow, all data is “self-describing,” meaning the correct schema, including which version, is attached to the data. Being self-describing makes it straightforward to have multiple versions of your events and entities tracked and processed at the same time, giving every squad more flexibility and autonomy.
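Self-describing data wraps each payload with the Iglu URI of the schema (and version) it conforms to, so a consumer can route each event to version-appropriate logic. A sketch of that dispatch, in which the vendor, event name, and handlers are all hypothetical:

```python
# Self-describing payloads carry their schema URI alongside the data.
events = [
    {"schema": "iglu:com.acme/checkout/jsonschema/1-0-0",
     "data": {"total": 20}},          # v1 records whole currency units
    {"schema": "iglu:com.acme/checkout/jsonschema/2-0-0",
     "data": {"total_cents": 2000}},  # v2 switched to cents
]

# Handlers normalise each schema version to a common shape (cents),
# so two versions can be tracked and processed side by side.
handlers = {
    "iglu:com.acme/checkout/jsonschema/1-0-0": lambda d: d["total"] * 100,
    "iglu:com.acme/checkout/jsonschema/2-0-0": lambda d: d["total_cents"],
}

totals = [handlers[e["schema"]](e["data"]) for e in events]
print(totals)  # [2000, 2000]
```

Because the version travels with the data, old and new event shapes can coexist in the same log without ambiguity.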
The rise of the squad model has driven a big shift in the way product teams are organized and operated, particularly within larger organizations.
Organizing people into autonomous, self-directing teams is not enough, however. It is important that the technical architecture, including the data architecture, mirrors and facilitates how squads are organized and the way they work. Snowplow is architected to do exactly that: the powerful combination of the unified log, schema versioning, and self-describing data makes it possible for individual squads to move fast and change the data they produce, all the while enabling other squads to access and successfully use it so they can be data-driven in their decision making.