16 January 2017  •  Yali Sassoon

Data collection: the essential, but unloved, foundation of the data value chain

It is so obvious no one bothers saying it.

Data collection is an essential part of any data strategy

After all: without data collection, there is no data. Without data there is no data value chain. No reporting, no analysis, no data science, no data-driven decision making.

It is not just that people in data don’t remark on the importance of data collection. They do not talk about data collection at all. To take just one example, let’s review Firstmark’s Big Data Landscape:

[Image: Firstmark Big Data Landscape 2016]

Roughly 15% of the landscape is given over to the ‘Data Sources and API’ providers. However, none of the providers listed, either in that section or in the rest of the map, specialize in enabling companies to collect their own data. The Big Data Landscape, then, is full of vendors that will help you do things with your data, and provide you with their own data. But all of those providers assume you already have your own data to work with, and so have data collection sorted.

The awkward truth is that although most companies do have some of their own data, it is often not good data because it is not being collected properly. And most choose to invest in the rest of their data/analytics stack, without putting in place proper processes and systems to collect and store the good data in the first place. They might as well build houses without foundations. In this post, I’m going to explore:

  1. What makes good data?
  2. Strategies and techniques to systematically generate and collect good data
  3. The strong commercial imperative to collect data properly
  4. The strong moral imperative to collect data properly

12 January 2017  •  Diogo Pacheco

Looking back at 2016

With the start of 2017, we have decided to look back at the 2016 blog posts and community Discourse threads that generated the most engagement with our users.

More than ten thousand users spent a total of 548 hours reading our blog posts, whilst on Discourse (which we only launched this year) 8,700 unique users spent 424 hours reading and participating in the Snowplow community.

Let’s take a closer look at:

  1. Top 10 blog posts published in 2016
  2. Top 10 Discourse threads published in 2016

09 January 2017  •  Yali Sassoon

Snowplow JavaScript Tracker 2.7.0 released


22 December 2016  •  Joshua Beemster

Factotum 0.4.0 released with support for constraints

We’re pleased to announce the 0.4.0 release of Snowplow’s DAG running tool Factotum! This release centers around making DAGs safer to run on distributed clusters by constraining the run to a specific host.

In the rest of this post we will cover:

  1. Constraining job runs
  2. Downloading and running Factotum
  3. Roadmap
  4. Contributing

20 December 2016  •  Anton Parkhomenko

Snowplow 86 Petra released

We are pleased to announce the release of Snowplow 86 Petra. This release introduces additional event de-duplication functionality for our Redshift load process, plus a brand new data model that makes it easier to get started with web data. This release also adds support for AWS’s newest regions: Ohio, Montreal and London.

Having exhausted the bird population, we needed a new set of names for our Snowplow releases. We have decided to name this release series after archaeological sites, starting with Petra in Jordan.

Read on after the fold for:

  1. Synthetic deduplication
  2. New data model for web data
  3. Support for new regions
  4. Upgrading
  5. Roadmap
  6. Getting help

12 December 2016  •  Joshua Beemster

SQL Runner 0.5.0 released

We are pleased to announce version 0.5.0 of SQL Runner. This release adds some powerful new features, including local and Consul-based remote locking to ensure that SQL Runner runs your playbooks as singletons.

  1. Locking your run
  2. Checking and deleting locks
  3. Running a single query
  4. Other changes
  5. Upgrading
  6. Getting help

15 November 2016  •  Alex Dean

Snowplow 85 Metamorphosis released with beta Apache Kafka support

We are pleased to announce the release of Snowplow 85 Metamorphosis. This release brings initial beta support for using Apache Kafka with the Snowplow real-time pipeline, as an alternative to Amazon Kinesis.

Metamorphosis is one of Franz Kafka’s most famous books, and an apt codename for this release, as our first step towards an implementation of the full Snowplow platform that can be run off the Amazon cloud, on-premise. (We’ll come up with a new non-ornithological codename series for R86 onwards.)

  1. Supporting Apache Kafka
  2. Scala Stream Collector and Kafka
  3. Stream Enrich and Kafka
  4. Kafka documentation
  5. Other changes
  6. Upgrading
  7. Roadmap
  8. Behind the scenes
  9. Getting help

07 November 2016  •  Ed Lewis

Factotum 0.3.0 released with webhooks

We’re pleased to announce the 0.3.0 release of Snowplow’s DAG running tool Factotum! This release centers around making DAGs easier to create, monitor and reason about, including adding outbound webhooks to Factotum.

In the rest of this post we will cover:

  1. Improving the workflow when creating DAGs
  2. Improving job monitoring using webhooks
  3. Behaviors on task failure
  4. Extras
  5. Downloading and running Factotum
  6. Roadmap
  7. Contributing

03 November 2016  •  Idan Ben-Yaacov

3rd Snowplow Meetup Berlin in less than two weeks!

On the 16th of November at 19:00 we are having another exciting Berlin meetup @ Betahaus. You’ll get a chance to hear all about Sauna, our new open-source product, and listen to what our clients are building in the audience segmentation space with Snowplow data. And the cherry on top: the whole Snowplow team will be there.

30 October 2016  •  Alex Dean

Asynchronous micro-services and Crunch Budapest 2016

At Snowplow we have been firm supporters of the Hungarian data and BI scene for several years, and so it was great to be invited to speak at the Crunch conference in Budapest earlier this month.

I gave a talk at Crunch on asynchronous micro-services and the unified log - a new twist on a theme that I have been developing in my book Unified Log Processing.

This blog post will briefly cover:

  1. Asynchronous micro-services and the unified log
  2. My Crunch conference highlights
  3. Some closing thoughts

More recent posts

27 October 2016  •  Yali Sassoon

The Snowplow Meetup New York Number 2 - a recap

23 October 2016  •  Alex Dean

Schema registries and Strata + Hadoop World NYC 2016

17 October 2016  •  Yali Sassoon

How Viewbix uses Snowplow to enable their customers to make data-driven decisions

12 October 2016  •  Yali Sassoon

Snowplow Python Tracker 0.8.0 released

08 October 2016  •  Joshua Beemster

Snowplow 84 Steller's Sea Eagle released with Elasticsearch 2.x support

07 October 2016  •  Yali Sassoon

Iglu 6 Ceres released with significant updates to Igluctl

03 October 2016  •  Ed Lewis

Kinesis Tee 0.1.0 released for Kinesis stream filtering and transformation

23 September 2016  •  Idan Ben-Yaacov

The third Snowplow Meetup London was all about Real-Time!

22 September 2016  •  Alex Dean

Introducing Sauna, a decisioning and response platform

15 September 2016  •  Yali Sassoon

Snowplow at Measurecamp London September 2016 - a recap

07 September 2016  •  Idan Ben-Yaacov

Second Snowplow Meetup NYC scheduled for September

06 September 2016  •  Anton Parkhomenko

Snowplow 83 Bald Eagle released with SQL Query Enrichment

02 September 2016  •  Idan Ben-Yaacov

Third Snowplow Meetup London scheduled for September

29 August 2016  •  Joshua Beemster

Snowplow Android Tracker 0.6.0 released with automatic crash tracking

17 August 2016  •  Ed Lewis

Snowplow Ruby Tracker 0.6.0 released

08 August 2016  •  Joshua Beemster

Snowplow 82 Tawny Eagle released with Kinesis Elasticsearch Service support

05 August 2016  •  Yali Sassoon

A roundup of recent Snowplow Meetups in Amsterdam, Berlin, London and Tel-Aviv

04 August 2016  •  Ronny Yabar

Snowplow Tracking CLI 0.1.0 released

31 July 2016  •  Anton Parkhomenko

Iglu Schema Registry 5 Scinde Dawk released

25 July 2016  •  Yali Sassoon

How we're reinventing digital analytics at Snowplow: presentation to the DA Hub Europe