Loading JSON data into Redshift - the challenges of quering JSON data, and how Snowplow can be used to meet those challenges
Very many of our Professional Services projects involve forking the Snowplow codebase so that specific clients can use it to load their event data, stored as JSONs, into Amazon Redshift, so that they can use BI tools to create dashboards and mine that data.
We’ve been surprised quite how many companies have gone down the road of using JSONs to store their event data. In this blog post, we look at:
- Why logging event data as JSONs has become so popular
- The limitations of this approach
- Using the Snowplow tech stack to load JSON data into Redshift
There are four reasons logging event data as JSONs has become as popular as it is:
1. Easy to implement
Representing an event as a JSON is an extremely simple, easy-to-understand approach. Let’s look at an example, a video play event represented as a JSON:
The JSON is very easy to read and comprehend. Crucially for application developers, it is straightforward to compose.
Note that there are many ways we could choose to represent a video play event with a JSON. We might want e.g. to use nesting, to capture a richer data set both about the video itself, and about the user who watched it:
In the above example, we’ve used nesting to group related fields (fields related to the viewer, fields related to the video, and fields related to the producer of the video). Again, even with all the additional data, the JSON is easy for the human eye to parse, and the nesting provides a tidy way of structuring our data.
JSONs are flexible. If one day, an application developer decides she wants to add a new field to the “play_video” JSON, there’s nothing stopping her!
Typically, application developers can create new JSONs to represent new event types over time, and update the structure of JSONs for existing event types over time (adding or dropping fields), as they see fit. If the analytics system is simply logging the JSONs, there’s no need to update any downstream analytics infrastructure in light of changes to the JSON schema.
Lots of analytics applications store event data as JSONs, including Kissmetrics, Mixpanel, KeenIO and Swrve. At Snowplow, we’re in the process of building out support for unstructured events, where events are represented as JSONs.
4. Seemingly easy to analyze using Hive and JSON Serde
JSONs, stored as flat files in S3 or HDFS should be easy to analyze using Hive and the JSON serde. (We’ve blogged an example of this here.) You ‘simply’ specify the schema as part of your Hive table definitions, implicit in the JSON. for each event type, and then you can query the data using SQL statements, as if it was a relational database. Easy peasy!
Unfortunately, querying the JSON data is not as easy as it first appears:
- It is not always obvious, to the analyst creating the Hive table definitions, what the schema / structure for each event type should be. The analyst can visually inspect some of the data to get some idea, but it is impossible to tell:
- Whether they have spotted all the possible fields that might appear. (E.g. there may be some that are sparsely populated, but very important when they are filled in.)
- Which fields are compulsory, and which are optional?
- Remember that application developers have been free to keep modifying and updating event schemas over time, with no requirement to document or sense-check any of these updates. The analyst suffers, as a consequence, as she has to work out how that schema has evolved, in order to interrogate the data.
- When exploding nested data into separate tables, it is can be hard to identify on what key that data should be joined to the parent table. A nice feature of JSONs is their ability to nest data, but unless we can explode that nested data out, querying it is not going to be easy.
- It can be hard (if not impossible) for analysts and application developers to spot “errors” in the JSON at the point the event data is generated and captured. It is very easy for mistakes to creep in: JSONs are very fragile (a missing comma or inverted comma will break a JSON.) There’s also no way to check the type of individual field.
When Amazon Redshift was launched earlier this year, many companies wanted to load their event data into Redshift, to enable faster querying than was possible in Apache Hive, and also so that they could use BI and analytics tools to create dashboards, visaulize and mine the data. Unfortunately, loading JSON data into Redshift is even harder:
- Redshift tables have traditional schemas where each field has a fixed type. To make loading data into Redshift reliable, you really want to enforce the strong types on variables all the way through the data pipeline, from data collection. However, JSONs do not support strong typing or schemas.
- Any input line that does not conform to the Redshift schema fails to load. Many companies are then stuck between two unappealing approaches: junk data that doesn’t fit the schema (which may be a significant subset of the data), or only load a very small subset of the fields that have been reliably collected across the different event types. (Thereby relaxing the requirements on input data to successfully load, but again, effectively loosing a lot of the richness in the raw JSON data set.)
The Snowplow stack can be forked so that Snowplow transforms JSON data and loads it into Redshift. We’ve found this is a much better approach then building an ETL pipeline using e.g. Apache Hive and the JSON serde, because Snowplow has a lot of validation capabilities. We’ll discuss this in a second - first, let’s overview the process for adapting Snowplow to load your custom JSONs into Redshift:
- We develop an event dictionary for the client, which catalogs all the different events in their application and the fields that are captured with each of those events.
- We use that event dictionary to define table definitions in Redshift where the data will be loaded. (So we have a mapping of the event dictionary to the output tables.)
- We work with the client to map the contents of their event JSONs to the dictionary. (So we have a mapping of the input JSONs to the event dictionary.)
- We then modify the Snowplow stack to unpick the JSONs (as per the JSON -> dictionary mapping) and write the data back to S3 in a format suitable for loading directly into Redshift (as per the dictionary -> Redshift table definitions mapping)
As mentioned above, the key to making this work is to use Snowplow’s rich validation capabilities. We use these to:
- Check that the input data conforms to the schemas specified
- Output any data that does not conform to the schema to a “bad buckeet”. This means that the “good data” will successfully load into Redshift, while we don’t lose any “bad data”. We can now easily spot errors as they arise (by ensuring that the ETL process is run every few hours) and deal with them immediately. It also means that the “bad” data can be inspected, updated, reprocessed, and then loaded into Redshift, which is much preferable to simpy dropping it.
Going forwards, we plan to build out the validation capability, so that as well as simply checking if incoming JSONs adhere to the schema, Snowplow will also spot “orphaned data” (i.e. name / value pairs in the JSON that are not accommodated in the schema) so that the schema can be updated to incorporate this data, and we save it from being lost.
Interested in talking to the Snowplow team about loading your JSON data into Redshift?
We’re not the only people who think storing event data as JSONs is not a great. For more opinions:
- Nathan Marz wrote a blog post back in 2010, highlighting their weaknesses, as part of presenting an alternative, graph-based model.
- The Unified Logging Infrastructure for Data Analytics at Twitter gives a very clear description of how this approach breaks down at scale. (See particularly section 3.1.)