Google and Waze share their best practices for canary deployment using Spinnaker

On Monday, Eran Davidovich, a System Operations Engineer at Waze, and Théo Chamley, a Solutions Architect at Google Cloud, shared their experience of using Spinnaker for canary deployments. Waze estimated that canary deployment helped them prevent a quarter of all incidents on their services.

What is Spinnaker?

Developed at Netflix, Spinnaker is an open source, multi-cloud continuous delivery platform that helps developers manage app deployments on different computing platforms, including Google App Engine, Google Kubernetes Engine, AWS, Azure, and more.

This platform also enables you to implement advanced deployment methods such as canary deployment. In this type of deployment, developers roll out changes to a subset of users to analyze whether the code release provides the desired outcome. If the new code poses any risks, you can mitigate them before releasing the update to all users.

In April 2018, Google and Netflix introduced Kayenta, a new feature for Spinnaker that lets you create an automated canary analysis for your project. Though you can build your own canary deployment or other advanced deployment patterns, Spinnaker and Kayenta together aim to make this much easier and more reliable. The tasks that Kayenta automates include fetching user-configured metrics from their sources, running statistical tests, and providing an aggregate score for the canary. On the basis of this aggregate score and the limits set for success, Kayenta automatically promotes or fails the canary, or triggers a human approval path.
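The decision step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Kayenta's actual implementation: the function names, the simple "percentage of passing metrics" score, and the threshold values of 95 and 75 are all assumptions made for the example.

```python
from enum import Enum
from typing import Mapping


class Verdict(Enum):
    PROMOTE = "promote"
    MANUAL_REVIEW = "manual_review"
    FAIL = "fail"


def aggregate_score(metric_passed: Mapping[str, bool]) -> float:
    """Return the percentage of metrics whose canary data passed.

    Assumption: every metric counts equally. A real tool may weight
    metrics or groups of metrics differently.
    """
    if not metric_passed:
        raise ValueError("at least one metric result is required")
    passed = sum(1 for ok in metric_passed.values() if ok)
    return 100.0 * passed / len(metric_passed)


def decide(score: float,
           pass_threshold: float = 95.0,
           marginal_threshold: float = 75.0) -> Verdict:
    """Map an aggregate score to one of the three outcomes.

    The two thresholds are hypothetical defaults chosen for illustration.
    """
    if score >= pass_threshold:
        return Verdict.PROMOTE
    if score >= marginal_threshold:
        return Verdict.MANUAL_REVIEW
    return Verdict.FAIL


if __name__ == "__main__":
    results = {"latency": True, "errors": True, "saturation": False}
    score = aggregate_score(results)
    print(f"score={score:.1f} verdict={decide(score).value}")
```

Using two thresholds rather than one is what creates the middle band in which a person, not the pipeline, makes the final call.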

Canary best practices

Check out the following best practices to ensure that your canary analyses are reliable and relevant:

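Instead of comparing the canary against production, compare it against a baseline, that is, a fresh deployment running the same version and configuration as production. Many differences between a new canary and long-running production instances can skew the results of the analysis, such as cache warmup time, heap size, and load-balancing algorithms.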

Run the canary long enough to collect at least 50 time-series data points per metric, so that the statistical analysis is relevant.

Choose metrics that represent different aspects of your application's health. Three of the four golden signals defined in the SRE book are especially critical: latency, errors, and saturation.

Put a standard set of reusable canary configs in place. Anyone on your team can use them as a starting point, and they keep the canary configurations maintainable.
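As a rough illustration of how these practices fit together, the Python sketch below defines a shared starting-point config and derives a per-service config from it. The dictionary layout, the metric query names, and both helper functions are hypothetical and do not reflect Kayenta's real config schema. Only the 50-point minimum and the three signals come from the practices above.

```python
from copy import deepcopy
from typing import Any, Dict, Optional

# Hypothetical team-wide defaults; field and metric names are illustrative.
BASE_CANARY_CONFIG: Dict[str, Any] = {
    "compare_against": "baseline",   # not the long-running production fleet
    "min_points_per_metric": 50,     # minimum time-series points per metric
    "metrics": {
        "latency": "request_latency_p99",
        "errors": "http_5xx_rate",
        "saturation": "cpu_utilization",
    },
}


def make_canary_config(service: str,
                       overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Copy the shared defaults and apply service-specific overrides."""
    config = deepcopy(BASE_CANARY_CONFIG)
    config["service"] = service
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(config.get(key), dict):
            config[key].update(value)  # merge nested sections such as metrics
        else:
            config[key] = value
    return config


def min_duration_seconds(config: Dict[str, Any], step_seconds: int) -> int:
    """Shortest run that yields the required points at one sample per step."""
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    return config["min_points_per_metric"] * step_seconds


if __name__ == "__main__":
    cfg = make_canary_config("checkout", {"metrics": {"errors": "checkout_5xx_rate"}})
    print(cfg["metrics"])
    print(min_duration_seconds(cfg, step_seconds=60), "seconds minimum")
```

With one sample per minute, 50 data points per metric means the canary has to run for at least 50 minutes. Copying the defaults rather than editing them in place keeps the shared starting point unchanged for the next team that uses it.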