You have probably seen some new names and faces around the Snowplow blog and GitHub repos recently – we are hugely excited to extend a warm (if somewhat belated) welcome to our three Snowplow summer interns! In this blog post we’ll introduce all three interns to the Snowplow community, and give a little more background on the projects they are working on.
This is the fourth instalment of our internship program for open source hackers and scientists – you can read more about our previous winter program, and about last year’s summer and winter programs, in our earlier posts.
Find out more about Anton, Justine and Vincent after the jump.
Anton Parkhomenko: Schema Guru, JSON Schema and Iglu
Anton is part way through a three-month remote Data Engineering internship at Snowplow. Anton divides his time between Krasnoyarsk in Siberia and Moscow.
Anton is an experienced software engineer and a Functional Programming enthusiast; for him the Snowplow internship is about getting his first professional experience in Scala, plus gaining exposure to Big Data technologies and open source project practices.
You have probably seen Anton’s work already, with his Schema Guru 0.1.0 and 0.2.0 releases. Schema Guru is a tool (CLI and web) allowing you to derive JSON Schemas from a set of JSON instances; it is already seeing heavy internal use at Snowplow to build Snowplow event dictionaries for customers.
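To make the idea of deriving a schema from instances concrete, here is a minimal Python sketch. It is not Schema Guru's actual algorithm (Schema Guru is written in Scala and does far more); the function names `json_type` and `derive_schema` are hypothetical, and the sketch assumes flat JSON objects only. It records every type seen for each key and marks a key as required only if it appears in every instance.

```python
def json_type(value):
    """Map a Python value (as produced by json.loads) to a JSON Schema type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {value!r}")


def derive_schema(instances):
    """Derive a simple schema from a list of flat JSON objects (hypothetical helper)."""
    seen_types = {}  # key -> set of type names observed for that key
    counts = {}      # key -> number of instances containing that key
    for instance in instances:
        for key, value in instance.items():
            seen_types.setdefault(key, set()).add(json_type(value))
            counts[key] = counts.get(key, 0) + 1

    properties = {}
    for key in sorted(seen_types):
        types = sorted(seen_types[key])
        # A single observed type is written as a string, several as a list.
        properties[key] = {"type": types[0] if len(types) == 1 else types}

    # A key is required only if every instance contained it.
    required = sorted(k for k, c in counts.items() if c == len(instances))
    return {"type": "object", "properties": properties, "required": required}
```

Given two instances such as `{"id": 1, "name": "a"}` and `{"id": 2, "name": None, "score": 1.5}`, the sketch would report `id` and `name` as required, allow `name` to be either a string or null, and treat `score` as an optional number.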
Anton is working on his next Schema Guru release, which will auto-generate Snowplow-compatible Redshift table DDL and JSON Paths files from a set of JSON Schemas.
Justine Courty: Apache Spark, d3.js and marketing attribution
Justine joins us in the Snowplow office in London as a Data Science intern this summer. During her internship, Justine has been experimenting with extending the Snowplow data pipeline to:
- Process enriched events in Spark, with a particular focus on aggregating user journeys based on the sequence of specific events in those journeys (a class of analysis that is particularly badly suited to SQL)
- Load the aggregates into a DynamoDB ‘serving layer’
- Visualize the data in innovative ways using D3.js
In particular, Justine has prototyped the above pipeline for marketing attribution pathways. You can see and interact with Justine’s visualization in her excellent blog post analyzing marketing attribution data with d3.js.
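As a rough illustration of the journey aggregation step, the sketch below groups events by user, orders each user's events by time, and counts how often each ordered sequence of marketing channels occurs. It is plain Python rather than Spark, and it is not Justine's code: the `(user_id, timestamp, channel)` event shape and the `aggregate_journeys` name are assumptions made for this example.

```python
from collections import Counter, defaultdict


def aggregate_journeys(events):
    """Count ordered channel paths across users.

    `events` is an iterable of (user_id, timestamp, channel) tuples;
    this shape is a simplifying assumption, not the Snowplow enriched event format.
    """
    touches_by_user = defaultdict(list)
    for user_id, timestamp, channel in events:
        touches_by_user[user_id].append((timestamp, channel))

    path_counts = Counter()
    for touches in touches_by_user.values():
        touches.sort()  # order each user's touches by timestamp
        path = " > ".join(channel for _, channel in touches)
        path_counts[path] += 1
    return path_counts


events = [
    ("u1", 2, "email"),
    ("u1", 1, "search"),
    ("u2", 1, "search"),
    ("u2", 3, "email"),
    ("u3", 1, "social"),
]
journeys = aggregate_journeys(events)
```

The order-dependence is what makes this awkward in SQL: the result depends on the sequence of events within each user's journey, not just on which events occurred.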
Justine brought a wealth of data science and engineering experience with her to the Snowplow team. She completed a data analysis internship at SoundCloud earlier this year, and her BSc in Biotechnology at Imperial College London last year. Her final year project was “Computational 3D image analysis: software development towards understanding the molecular basis of torque generation by the bacterial flagellar motor”.
Vincent Ohprecio: analytics on write with Spark Streaming and AWS Lambda
Vincent is our third intern for the summer – he is part way through a four-month remote Data Engineering internship, based out of Vancouver, Canada.
Vincent has had a long and rewarding first career in InfoSec (check out his excellent blog to read more); he joins us this summer to get hands-on experience developing stream processing applications in Scala.
Vincent is now working on R&D for our new open-source analytics-on-write project, Icebucket. Stay tuned for upcoming posts explaining the concepts behind Icebucket!