September 8, 2017

1000 words · 5 min read

Continuous Delivery for DC/OS with Spinnaker

Last fall, our team (Mike Tweten, Trevin Teacutter, and Zameer Sura) started working on the problem of automating DC/OS deployments in a way that wouldn’t require multiple teams to duplicate effort or tie themselves directly to DC/OS APIs. While DC/OS certainly makes the act of deploying an application much easier than anything we’ve used previously, there are still many different ways you could choose to layer on deployment strategies and integrate with continuous delivery systems. Additionally, there may be lots of teams with very similar needs in this space, and there are certainly more efficient uses of their time than making them all solve the same problems. Furthermore, while DC/OS is a good choice today, we need to make sure that we don’t become locked in and can change our minds in the future without a big impact on those teams.

Being an engineering team, our first approach was naturally to write a service to manage a simple set of deployment strategies behind an API that we could reimplement over other resource schedulers (like Kubernetes) in the future. However, after producing a prototype we really started to grasp that we were taking on a much bigger task than we should. Our API could handle some basic pre-built deployment workflows, but if it wasn’t flexible enough for teams we’d need to keep adapting it. In addition, we would still have to work out how it would integrate with CI/CD tools like Jenkins, and we’d have to repeat much of that work each time we wanted to support another resource scheduler as a deployment target.

At this point we decided to take a deeper look at Spinnaker, a continuous delivery platform that was open-sourced by Netflix. It had originally been developed to orchestrate AWS deployments and had since expanded to other providers such as Azure and OpenStack; more recently, Kubernetes had been added as the first container-based deployment target. Spinnaker allows for building deployment pipelines that can be triggered by things like GitHub commits or Jenkins builds and can then run stages to do things like start other Jenkins jobs, deploy new versions of an application, scale applications up or down, and disable old application versions. With these capabilities we would be able to use Spinnaker to provide a consistent deployment abstraction while still retaining the flexibility to handle deployment workflows that we didn’t build ourselves.
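To make the pipeline model above more concrete, here is a minimal sketch in Python of what such a pipeline definition might look like: a Jenkins build triggers the pipeline, which deploys a new version and then disables the old one. The application and job names are hypothetical, and the field names are only loosely modeled on Spinnaker's pipeline JSON; treat this as an illustration of the shape of a pipeline, not an exact schema.

```python
import json

# Illustrative, simplified pipeline definition: a Jenkins trigger
# plus two ordered stages (deploy the new version, then disable the
# old one). Names like "demo-app" and "ci-jenkins" are placeholders.
pipeline = {
    "application": "demo-app",
    "name": "deploy-to-dcos",
    "triggers": [
        {
            "type": "jenkins",
            "master": "ci-jenkins",      # hypothetical Jenkins master
            "job": "demo-app-build",     # hypothetical build job
            "enabled": True,
        }
    ],
    "stages": [
        {"refId": "1", "type": "deploy",
         "name": "Deploy new version"},
        {"refId": "2", "type": "disableCluster",
         "name": "Disable old version",
         # Runs only after the deploy stage completes.
         "requisiteStageRefIds": ["1"]},
    ],
}

# Pipelines are stored and exchanged as JSON.
pipeline_json = json.dumps(pipeline, indent=2)
```

The key idea is that the trigger and the stage graph live in one declarative document, so the same abstraction can sit in front of different deployment targets.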

The only minor problem was that it didn’t support DC/OS, our preferred deployment target. After taking some time to dive in and get familiar with the details, it quickly became apparent that adding DC/OS support would be much easier than continuing to build our own system, especially since we had already become familiar with the DC/OS APIs while developing our prototype. Not only would we gain the flexibility to allow more deployment strategies than we originally anticipated, but it would also buy us integration with Jenkins and the ability to convert pipelines over to Kubernetes without much effort. We checked with the Spinnaker community to make sure that no one was already working on DC/OS support, and with Mesosphere (the company behind DC/OS) to make sure they weren’t planning to do it. When both came back negative, we began working on the project in earnest, with the goal of building something that would benefit not only Cerner but also the entire Spinnaker and DC/OS communities.

This was the first time any of us had tried to contribute significant changes to an established open-source project, so there was some adjustment and learning along the way. We initially created forks of the necessary repositories in our internal GitHub, with the plan to finish the project before making anything visible publicly. Instead, we found that we wanted an easier way to share things with Mesosphere, since they had agreed to give us any guidance we might need. It also became apparent that it was too easy to commit things like config that referenced internal resources, or comments that referred to our own JIRA issues. If we kept going that way, we were only going to make more work for ourselves cleaning those things up when it came time to submit our changes. By moving our development to repos in Cerner’s GitHub organization, we were able to solve both of these problems at once.

In spite of moving our code into public GitHub, we still maintained an insular development approach, determined to get everything just right before submitting a perfectly polished gem to the upstream project. In hindsight, we should have been less concerned with getting everything right all at once and submitted more incremental changes. We thought it would be easier for everyone not to have to deal with our messy work in progress, but after our first pull request for one of the services ended up with over 130 files and a history of 100+ commits, we realized it was too much to expect the maintainers to review at once, especially when they might not be that familiar with DC/OS. Ultimately, they wanted us to carve up that massive patch anyway, so it would have been less work for us to do it from the start.

After completing this effort, we’ve found that we get even more benefit from Spinnaker than we initially expected. Spinnaker also helps us manage deployments to multiple DC/OS clusters, use Chaos Monkey to test the resiliency of our applications, and even deploy DC/OS itself on AWS. We’re pleased to announce that our contribution has been accepted by the Spinnaker maintainers and is now available for anyone to use. The Spinnaker community has been great to work with, answering our questions and inviting us to participate in discussions with other companies that have contributed major functionality to Spinnaker. This is an exciting opportunity to contribute something back to the open-source community, and we can’t wait to see where it goes from here.