Since AWS added ZSTD support for Redshift earlier this year, we have been very interested in applying it to our atomic.events table for the potential reductions in used disk space (see issue #3435). Our tests have been successful: we’ve found the performance tradeoff to be negligible across a variety of query types, and we’ve found that applying ZSTD to atomic.events typically leads to a ~60% reduction in size on disk.
Huge thanks to Mike Robins, who led the charge on ZSTD support with his excellent RFC and the accompanying pull requests.
To help with the migration, we have written a migration script, along with the new table definition.
For more information on ZSTD compression, please check out the relevant AWS documentation.
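If you would like to see how ZSTD would fare against your own data before migrating, Redshift's built-in compression analysis is a convenient sanity check (note that it samples the table, so it can take a while on a large atomic.events):

```sql
-- Ask Redshift to estimate the most effective encoding, and the estimated
-- size reduction, for each column of the existing events table
ANALYZE COMPRESSION atomic.events;
```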
Some websites with highly active bots have faced issues with the domain_sessionidx column in Redshift, because values higher than 32767 exceed the upper bound of the SMALLINT Redshift column. Using SMALLINT here is technically a bug, because the underlying field in Snowplow is in fact a Java Integer, with a range of -2147483648 to 2147483647.

To resolve this, we have updated the domain_sessionidx column in Redshift to be a Redshift INTEGER (see issue #1788). A Redshift INTEGER supports the same value range as a Java Integer.
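For illustration only (this is a hypothetical standalone demo, not part of the migration script), the difference is simply the range of the column type:

```sql
-- SMALLINT range: -32768 to 32767 (old column type)
-- INTEGER range:  -2147483648 to 2147483647 (new column type, matches a Java Integer)
CREATE TEMP TABLE session_idx_demo (domain_sessionidx INTEGER);
INSERT INTO session_idx_demo VALUES (40000); -- would overflow a SMALLINT column
```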
Late last year Amazon added a 24th field to the CloudFront log file format: cs-protocol-version. As a result, rows found in the access logs of CloudFront distributions would fail enrichment as being unrecognized. This has been fixed in this release.
Some events were failing validation because their URLs contained more than one # character; we have now relaxed the parsing of those URLs (see issue #2893). This was rolled out in the real-time pipeline in R93, and is now coming to Spark Enrich in this release.
In our ongoing effort to benefit from the latest performance improvements in Spark, we have updated our Enrich and Shred jobs to run on Spark 2.2.0.
Support for Spark 2.2.0 was only introduced in EMR AMI 5.9.0, so you will need to update the AMI version used in EmrEtlRunner, as explained in the upgrade guide below.
Since the Enrich and Shred jobs are idempotent, we are now allowing overwrites of existing data for a particular run. This is especially useful during a transient failure so that YARN can retry a job multiple times.
This makes the yarn.resourcemanager.am.max-attempts: "1" configuration setting, which we previously recommended, optional from now on.
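For reference, this is the setting in question as it would appear in an EmrEtlRunner config.yml that uses the emr: configuration: section (a sketch based on our previous recommendation; you can now omit it or leave YARN's default retry behaviour in place):

```yaml
aws:
  emr:
    configuration:
      yarn-site:
        # Previously recommended to stop YARN retrying failed applications;
        # optional from this release, since Enrich and Shred are idempotent
        yarn.resourcemanager.am.max-attempts: "1"
```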
There is a good discussion on the subject on our Discourse forum.
Finally, note that this release moves the web model to its own repository, snowplow/web-data-model.
This should allow us to evolve the web data model independently of Snowplow itself, accelerating the release cadence here.
Because it is not possible to modify the compression encoding of existing table columns in Redshift, a deep copy is required in order to migrate an already-existing atomic.events table to ZSTD.
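In essence, the deep copy means recreating the table from the new ZSTD-encoded definition and copying the data across. A minimal sketch, using a drastically shortened column list purely for illustration (in practice the migration script and the full v0.9.0 DDL take care of this):

```sql
-- Create the replacement table from the v0.9.0 (ZSTD-encoded) definition;
-- only a handful of columns are shown here for brevity
CREATE TABLE atomic.events_new (
  event_id          CHAR(36)      ENCODE ZSTD,
  domain_sessionidx INTEGER       ENCODE ZSTD,
  page_url          VARCHAR(4096) ENCODE ZSTD
);

-- Deep copy the existing data into the new table
INSERT INTO atomic.events_new
  SELECT event_id, domain_sessionidx, page_url FROM atomic.events;

-- Swap the tables over, keeping the old one around until verified
ALTER TABLE atomic.events     RENAME TO events_legacy;
ALTER TABLE atomic.events_new RENAME TO events;

-- Reclaim the disk space once you are happy with the new table
-- DROP TABLE atomic.events_legacy;
```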
We recommend that you have at least 50% of your Redshift storage space free prior to upgrading your atomic.events table. You may need to temporarily resize your cluster and/or pause your pipeline in order to make the switch.
The resources are as follows:

- Migration script to upgrade an existing atomic.events table to v0.9.0
- The new atomic.events table definition (v0.9.0)
The latest version of EmrEtlRunner is available from our Bintray here.
To use the latest job versions, make the following changes to your EmrEtlRunner configuration:
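A sketch of the relevant changes, assuming a standard config.yml layout (the spark_enrich and rdb_shredder version numbers below are illustrative placeholders; use the versions listed in the release notes):

```yaml
aws:
  emr:
    ami_version: 5.9.0        # Spark 2.2.0 requires EMR AMI 5.9.0 or later
enrich:
  versions:
    spark_enrich: 1.10.0      # placeholder; use this release's Spark Enrich version
storage:
  versions:
    rdb_shredder: 0.13.0      # placeholder; use this release's RDB Shredder version
```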
For a complete example, see our sample config.yml template.
We are now operating a mirror of Iglu Central on Google Cloud Platform, to maintain high availability in the case of a chronic AWS outage. To make use of this mirror, add the following registry to your Iglu resolver JSON file:
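A sketch of the repository entry to add to the "repositories" array of your resolver configuration, assuming the mirror is served from mirror01.iglucentral.com (double-check the exact URI in the release notes):

```json
{
  "name": "Iglu Central - GCP Mirror",
  "priority": 1,
  "vendorPrefixes": [ "com.snowplowanalytics" ],
  "connection": {
    "http": {
      "uri": "http://mirror01.iglucentral.com"
    }
  }
}
```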
Upcoming Snowplow releases will include:
For more details on this release, as always do check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.