From building a Docker image on your laptop to running a containerised workload in production – what are the key challenges and considerations?
Four years on from our first experiments messing around with Docker images, we now run thousands of containers in our production data centres and have just started to scale out to Google Cloud. This felt like a good opportunity to pause and reflect on what has been an interesting, challenging but most importantly successful journey into the world of containers.
What are containers?
I won’t go into too much depth here, but if you are unfamiliar with the concept of containers, I recommend heading over to the Google Cloud site where they have some really useful information. I thought about writing up my own short definition, but that would probably end up being a few paragraphs of spiel and Google’s is far more concise than I could ever manage:
Containers are lightweight packages of your application code together with dependencies such as specific versions of programming language runtimes and libraries required to run your software services.
In other words, everything you need to run your application is part of a single unit (a container!). It has its own isolated execution environment, which gives you the ability to deploy it in multiple different places with the exact same result, e.g. your local laptop, a virtual machine in a data centre, or even a cluster of hundreds of machines managed by a container orchestration system.
Sounds great? I know! Containers are pretty cool technology and the concept has actually been around since the early noughties. However, it’s only really since Docker crashed into the industry in 2013 (providing users with a friendly way to build, manage and run containers) that we have seen widespread adoption of container-based applications by tech companies.
Playing around with Docker containers on your laptop may seem trivial (I’m sure lots of you have done this!), but when it comes to building, running, testing and deploying containers into production to serve real traffic, things become a little more complicated. How do we monitor the health of our workloads and alert when there are issues? How do we stop hackers taking advantage of security holes in this new way of running applications? How do we migrate from our already existing processes that are designed around virtual machines and up-skill our engineers? These are some of the many questions that we were asking ourselves back in 2018 when we set off on a fresh endeavour at Rightmove – ‘The Docker Project’.
Scoping the initial project
It’s probably worth giving a quick overview of the state of our technology estate in 2018 when ‘The Docker Project’ was first kicked off. We were deep into the process of migrating our old monolithic applications into fresh new microservices. These microservices were predominantly Java Spring Boot applications and they were being run on virtual machines in our own data centres. There were some major pain points that we were facing with this:
- The virtual machines used to run microservices needed to be provisioned and managed by our central operations team. Requests for new services went into a large backlog of “new microservice creation” Jira tickets, and in some cases teams waited months for new virtual machines to be created.
- We were a Java house, and we were good at running Java applications, but what about other languages and technologies? We weren’t really in a position to support this on our virtual machines and so requests to use anything other than Java were usually declined.
- The sheer number of VMs across our data centres was becoming unwieldy. Management overhead was high as all these machines needed patching, which slowed down the time taken to reach production.
- Application deployments were handled by complicated scripts run via Jenkins. They deployed in series (one at a time), which, as you can imagine, was painfully slow (most services had 9+ instances). Issues with deployments consumed lots of engineer time to debug and often required intervention from operations engineers with sysadmin skills.
- Applications needed manual intervention at the virtual machine level when something went wrong. The Java app needs restarting across the estate? Someone would have to log on to each virtual machine instance of the microservice one by one and restart the Java service. Not a fun job at 4am!
These were some of the key issues that we wanted to address with this new project, and so we decided on a mission statement:
Move to a containerised system whereby microservices can be created and deployed by their respective team.
Seems kind of simple right? As with all things, this is easier said than done, and we were about to embark on a somewhat turbulent journey. Probably the first question we needed answered, though, was how on earth we were going to deploy and manage containers in our data centres. Luckily, at least some of this complexity was taken care of for us…
An orchestra needs an orchestrator
Container orchestration automates the management of containerised workloads. There are many tools that allow you to do this, some of which you will definitely have heard of before (e.g. Docker Swarm, Kubernetes, Nomad). These tools exist to reduce the operational effort of running containers by automating the provisioning, scaling, load balancing and deployment of services. An orchestrator can be installed on a group of machines and they can be connected together to form a cluster. The orchestrator provides an API to interact with the cluster to perform actions such as deploying new workloads, adding extra machines to the cluster and monitoring the state of the cluster. Services that are deployed to the cluster are automatically spread across the available worker nodes for resiliency. Most container orchestrators allow you to deploy services in a declarative way. In essence this involves telling the orchestrator the state that you want, and it’s the orchestrator’s job to maintain that state.
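To make the declarative idea concrete, here is a minimal sketch of a Docker Swarm stack file in the familiar docker-compose format (the service name, image and numbers are hypothetical, not taken from our real configuration):

```yaml
# stack.yml – a hypothetical Swarm stack file; deployed with:
#   docker stack deploy -c stack.yml mystack
version: "3.8"
services:
  web:
    image: registry.example.com/web:1.4.2
    ports:
      - "8080:8080"
    deploy:
      replicas: 4            # desired state: Swarm keeps 4 tasks running
      update_config:
        parallelism: 2       # roll out two containers at a time
      restart_policy:
        condition: on-failure
```

You declare that four replicas should exist; if a container dies, the orchestrator notices the divergence from the declared state and schedules a replacement automatically.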
For us, using a container orchestrator was key to achieving our mission statement, and so we needed to decide which one to use. The front runners here were Docker Swarm and Kubernetes. At the time, Docker Swarm was a lightweight container orchestrator which used familiar docker-compose files to configure services. It was also owned by Docker (the same provider as the underlying technology we used to build images), and they offered commercial support for their Swarm solution. We were aware Kubernetes would do everything we needed (and more), but the effort to run and manage Kubernetes clusters in our own data centres was much larger. Plus, for a company that was completely new to the world of containers, it made more sense to opt for the slightly less feature-rich Docker Swarm as a first pass. The end goal was to move to Kubernetes at some point in the future once we had more experience running containers in production.
Building the MVP
So it was all hands on deck to get Docker Swarm up and running in all our environments and deploy our first workload. For the MVP, we chose to migrate one of our most lightweight services – “static-map-generator.” This service provides (you guessed it) static maps for use on our search results page.
To avoid boring you with all the details, I thought it would be useful to list some of the biggest challenges we faced:
- We have a resilient ‘active-active-active’ three data centre architecture at Rightmove and this added some complication. We treat each data centre as its own failure domain and independent from one another so instead of creating one production Docker Swarm cluster, we had to create three – one per site! Swarm doesn’t offer mesh networking between separate clusters, therefore we continued to use our F5 load balancers for service-to-service communication and load balancing. This did have some benefits though, especially when it came to migrating percentages of traffic from old virtual machines to containers, as this could be done at the external load balancer level.
- Provisioning secrets was tricky, especially as we wanted application teams to have the ability to create and manage their own secrets. Swarm secrets didn’t meet our self-service requirements, so we opted to run HashiCorp Vault instead. This was an extra piece of software to provision, update and manage, and the design and implementation of this ended up pushing back our release date.
- Logging and monitoring was very different! We were used to writing logs to files on a virtual machine and then using a Logstash process to read these files and publish them to our logging pipeline (logs eventually get indexed into Elasticsearch after running through a queue). This was clearly going to be different with containers as they are designed to be treated as short lived and ephemeral. We had to familiarise ourselves with some new tools for collecting container logs (e.g. Filebeat) and plumb these into our already existing pipeline. This included making application-level changes to write logs to standard out instead of a file, and updating some of our transformations and index templates to include extra information (e.g. container ID and Docker worker host).
- YAML, YAML, and more YAML!! We quickly realised that there was a hell of a lot of YAML configuration involved in deploying a single service. How were we going to make it easy for teams to configure their own services? We decided to design a repo to store configuration which allowed teams to use pre-existing docker-compose templates and just fill in some of the important configuration items that varied from service to service and between environments. To support this, we also wrote a custom deployment script which rendered these templates as part of an application deployment job in Jenkins and posted them to the Docker Swarm API.
- This was a massive cultural change. There were going to be big differences for teams when they started to deploy their services to our Docker Swarm clusters. Old habits for debugging production issues like ssh-ing into virtual machines would be long gone, and we would need to educate our engineers, giving them a new toolbox to work with that leveraged the benefits of a containerised platform. This is often the hardest part of any change in a company!
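As a rough sketch of the templating approach described above (the file layout and `${...}` placeholder syntax are hypothetical, not our actual format), a pre-existing docker-compose template might leave only the per-service values for a team to fill in:

```yaml
# docker-compose.tmpl.yml – hypothetical template; the deployment
# script substitutes the ${...} placeholders per service and per
# environment, then posts the rendered file to the Docker Swarm API.
version: "3.8"
services:
  ${SERVICE_NAME}:
    image: registry.example.com/${SERVICE_NAME}:${VERSION}
    deploy:
      replicas: ${REPLICAS}
    environment:
      - SPRING_PROFILES_ACTIVE=${ENVIRONMENT}
    logging:
      driver: json-file   # stdout logs picked up by the log shipper
```

Teams then only supply a handful of values (name, version, replica count, environment) rather than writing the full stack definition from scratch.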
With the MVP up and running, we started to pick out low risk services to start migrating to the new platform. One at a time we began to containerise services and deploy them to our swarm clusters, migrating production traffic and improving the platform and deployment processes as we went. This came with some extra challenges, usually because of specific application use cases which hadn’t been scoped in the initial MVP. A good example of this was our first application that required local storage. For this we needed to design and document the ability to attach distributed storage volumes to containers.
Two years on, and we have 97 services, consisting of 1177 containers, running in each swarm cluster (3500+ containers in total across all sites) in production. These are deployed across 9 worker nodes and 3 manager nodes per data centre. That’s roughly 36 machines to manage instead of the equivalent few thousand virtual machines that we would have been burdened with if we hadn’t containerised our apps. In production, we serve more than half a billion requests a day from containers. Deployment times have drastically reduced, especially if a team decides to deploy services in parallel (2 or more containers at a time). Below is some data gathered from Jenkins for a recent service migration from virtual machine to Docker Swarm:
| | Number of instances to deploy | Deploy job duration |
| --- | --- | --- |
| Virtual Machine | 9 | 16 mins |
| Docker Swarm | 18 | 10 mins |
So we’re looking at a 37.5% reduction in deployment time (16 minutes down to 10) to deploy twice as many instances. This has vastly increased our productivity and decreased time to production.
A real success story of the project in general was when the COVID pandemic hit. Our container platform allowed us to quickly create new services to support projects such as our “online viewing” flow which helped customers to carry out viewings via video instead of in person. Without the ability to provision new services quickly in our swarm cluster, we may not have been able to react as quickly as we did!
So I guess the real question is, did we achieve our mission statement to “move to a containerised system whereby microservices can be created and deployed by their respective team”?
Well, the short answer is “not quite”. Although we’ve removed some of the overhead needed for operations to provision the infrastructure for a microservice, we haven’t removed it completely, meaning the concept of a “new microservice creation” Jira ticket still exists in the backlog for operations. Instead of provisioning a virtual machine and everything that goes with it, the ticket requires making load balancer changes, creating Vault access secrets for Docker Swarm services and setting up alerting, all of which can still only be done by an operations engineer.
Over the last two years using Docker Swarm we’ve also run into a few of the limitations of the orchestrator:
- There is no autoscaling for containers, so if we need to update the number of replicas, it’s a manual change to the configuration.
- The operational effort to maintain and especially upgrade the clusters ourselves is high. As we reach capacity, new worker nodes need to be built by an operations engineer and added to the clusters.
- There are a lot of features of other container orchestration systems that we aren’t able to leverage with Docker Swarm. A good example is using a service mesh such as Istio which could allow us to remove some of the complexity from our application code by pulling out circuit breaking and auth token security into the service mesh layer.
As a company at this stage we are already deep into our container journey, and most of our engineers are familiar with the concepts that surround containerisation. Luckily, containers are cloud native (a benefit I didn’t mention at the beginning) and so at the end of 2019, a new project was on the horizon…
A new way of thinking
New projects call for new goals. In fact, we had quite a few motivations for our move to a hybrid cloud model, but here is the one which most relates to running containers:
Improve our agility by making it quicker and easier to provision temporary or permanent production environments for product experimentation, new products and services, 3rd party integrations, or to quickly counter competition.
In essence, this was going to involve moving to a managed container orchestration platform, and after a lengthy process choosing a cloud vendor, it was agreed that this would be Google’s “Kubernetes Engine”. Building the foundations for a hybrid cloud has to be the most challenging project of my career so far. It wasn’t just a case of creating an account and spinning up a Kubernetes cluster by clicking around in the user interface. We needed to consider how we were going to manage users, set up networking (including resilient connectivity between our data centres and Google’s), provision all our infrastructure via predictable infrastructure-as-code pipelines, alert the relevant engineers when there are issues on the platform, and much more. All while getting to grips with brand new cloud computing concepts and technologies, and ensuring the platform is secure and following cloud best practices.
Our original mission statement for our move to containers was still valid, and this subsequent move to cloud was finally going to give us the power to achieve that goal. Here are some of the additional features we’ve managed to leverage by using Kubernetes in the cloud:
- We can deploy a single Kubernetes cluster across multiple zones (regional) in Google Cloud, removing the concept of a three site architecture. We now have just one production cluster instead of three.
- The management plane is abstracted away from us and automatically upgraded by Google with zero downtime. Upgrades now take a few hours instead of a few weeks.
- More than one container can be deployed within the same “pod” wrapper, giving us the ability to use tools such as Istio (Anthos Service Mesh in Google Cloud). This is a service mesh which gives a bunch of features out of the box:
- Visibility of latency, error rates and inter application dependencies
- Ability for application teams to configure their own load balancer settings via YAML configuration
- Circuit breakers configurable via YAML
- Our pods automatically scale based on demand, and so do the nodes in our clusters, allowing us to reduce our costs in quiet periods.
- We can use a templating engine for deployment configuration, such as Helm, instead of writing and maintaining our own custom deployment scripts.
- We can experiment! It takes roughly 10 minutes to spin up a new fully functional Kubernetes Engine cluster in the cloud, something that had never been viable for us in the past with Docker Swarm.
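To give a flavour of the YAML-configurable circuit breaking mentioned above, here is a minimal Istio `DestinationRule` sketch (the service name and thresholds are hypothetical, chosen purely for illustration):

```yaml
# Hypothetical Istio circuit-breaker config: limit queued requests
# and eject pods that repeatedly return server errors.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service
spec:
  host: my-service.prod.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # queue limit before rejecting
    outlierDetection:                  # the circuit-breaker part
      consecutive5xxErrors: 5          # trip after 5 consecutive 5xxs
      interval: 30s                    # how often hosts are analysed
      baseEjectionTime: 60s            # how long a tripped pod is ejected
```

The key benefit is that this logic lives in the mesh layer rather than in each application's code, which is exactly the sort of complexity we wanted to pull out of our services.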
After close to a year building out and testing the new hybrid cloud platform, we have just gone live with our first two applications serving production traffic. If you’ve ever used the overseas page on rightmove.co.uk, the data for the typeahead functionality in the search bar is now being served from containers in Google Cloud, which amounts to roughly 700,000 requests a day!
All in all it’s been an interesting journey, and I’m sure this is just the beginning of the story for Rightmove and containers. For now though, it’s probably good we all grab a beer and celebrate!