Kubernetes Rolling Updates

In this post, I am going to focus on how we achieved zero-downtime deployments on Kubernetes using rolling updates.

At Track.Health, we run our platform on AWS EKS, and all our components run as Kubernetes deployments. When we started out, the deployment process was as follows:

  1. Tag the image and push it to AWS ECR (we were always using the tag name latest-<environment_name>)
  2. Delete the existing deployment on Kubernetes (kubectl delete deployment <deployment name>)
  3. Create the new deployment (kubectl apply -f <deployment yaml>)

Yes, I know what you are thinking: why would you ever tag all your releases with the same tag? That was partly a miss on our side, and running production workloads on Kubernetes has been a learning process for us.

As we have a multi-tenant platform, downtime is not something we could afford. Let’s be honest, in this day and age, sending out notifications of downtime to your customers doesn’t really reflect well on the technical prowess of your company.

We explored a few standard options such as blue/green (red/black) deployments. As a start-up that needs to keep costs to a minimum, we did not find blue/green deployments feasible, because they would entail setting up a second cluster to switch over to.


Kubernetes rolling updates to the rescue!

With Kubernetes rolling updates, we were able to seamlessly achieve zero downtime with minimal changes to our current build pipeline.

So how did we go about doing it?

First off, we needed to fix the way we tagged the Docker images we push to AWS ECR. We now take the release version from our Maven POM and use it as the tag, because we increment that version on every release. The tag ends up looking like latest-release-1.0.0.
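As a rough sketch of that tagging step, something along these lines would read the version from the POM and build the tag. The function names pom_version and image_tag are made up for illustration, ECR_REPO_URL is a placeholder, and the snippet assumes maven-help-plugin 3.1.0 or later (needed for -DforceStdout); our actual pipeline differs in the details.

```shell
# Hypothetical helpers, not our exact pipeline code.

# Print the project version from the pom.xml in the current directory.
# Assumes maven-help-plugin 3.1.0+ so that -DforceStdout is available.
pom_version() {
  mvn -q help:evaluate -Dexpression=project.version -DforceStdout
}

# Build the image tag from a version string.
image_tag() {
  printf 'latest-release-%s\n' "$1"
}

# Usage (needs Maven and Docker; ECR_REPO_URL is a placeholder):
#   TAG=$(image_tag "$(pom_version)")
#   docker build -t "$ECR_REPO_URL:$TAG" .
#   docker push "$ECR_REPO_URL:$TAG"
```

Keeping the tag construction in one small function means the build step and the deploy step cannot drift apart on the naming scheme.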

With that out of the way, the next step was to remove the delete-deployment step from our pipeline. With rolling updates there is nothing to delete by hand: Kubernetes terminates the old pods automatically once the new ones are up and ready to serve requests.

All our Kubernetes deployment YAML files are templated with Helm, and the tag we want to deploy is parameterised as follows:

      containers:
        - name: ms-account-management
          image: {{ .Values.image_repo_path }}:latest-{{ .Values.branch_name }}
          imagePullPolicy: Always

The image_repo_path is the AWS ECR repository path, and branch_name carries the version from the Maven POM of the module being released.

Finally, with the YAML file passing in the correct image, running kubectl apply -f <yaml_file> triggers the rolling update automatically, because every release deploys a new Docker tag. Kubernetes takes the existing pods down only after the new ones are up and running.
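In a pipeline it helps to wait for the rollout to finish before reporting success. The sketch below is one way to do that with kubectl rollout status; the function name deploy_and_wait, the KUBECTL override and the 300-second timeout are illustrative assumptions rather than what we actually run.

```shell
# Hypothetical wrapper; set KUBECTL=echo to print the commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"

# Apply a rendered manifest, then block until the rollout completes.
# Returns non-zero if the apply fails or the rollout does not finish in time.
# Usage: deploy_and_wait <manifest.yaml> <deployment> <namespace>
deploy_and_wait() {
  manifest="$1"; deployment="$2"; namespace="$3"
  "$KUBECTL" apply -f "$manifest" -n "$namespace" &&
    "$KUBECTL" rollout status "deployment/$deployment" -n "$namespace" --timeout=300s
}
```

If the new pods never become ready, rollout status exits with an error after the timeout, which gives the build a clear point at which to fail.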

There are some best practices around this that we needed to follow as well. For instance, you do not want your new pods to receive traffic before they are ready to serve requests. To handle this, we made use of readinessProbe and livenessProbe. We went with the simplest probes, which have worked well for us so far:

          readinessProbe:
            tcpSocket:
              port: 8443
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            tcpSocket:
              port: 8443
            initialDelaySeconds: 15
            failureThreshold: 5
            periodSeconds: 20

With this in place, we make sure that our old pods do not go down until the new ones are ready to serve requests.
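How many old pods may go away at once is controlled by the deployment's rolling update strategy, which defaults to 25% for both maxSurge and maxUnavailable. The following is only a sketch of how one could tighten it so capacity never drops during a rollout; the function name and the values 1 and 0 are assumptions for illustration, not settings taken from our charts.

```shell
# Hypothetical helper; set KUBECTL=echo to print the command instead of running it.
KUBECTL="${KUBECTL:-kubectl}"

# Never drop below the desired replica count during a rollout:
# start one extra pod (maxSurge: 1) and remove an old pod only after
# its replacement passes the readiness probe (maxUnavailable: 0).
# Usage: set_rolling_strategy <deployment> <namespace>
set_rolling_strategy() {
  deployment="$1"; namespace="$2"
  patch='{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxSurge":1,"maxUnavailable":0}}}}'
  "$KUBECTL" patch deployment "$deployment" -n "$namespace" --type merge -p "$patch"
}
```

The same values can of course live in the Helm template under spec.strategy instead of being patched in afterwards.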

Another best practice we put in place concerns image retention: now that every release gets its own tag on AWS ECR, we did not want to keep all the old tags forever, only a certain number of releases in case we needed to roll back.


AWS ECR lifecycle policies to the rescue

We used the following lifecycle policy on all our ECR repos so that old images are expired automatically and we do not keep too many tags of our releases.

{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep only five tagged images, expire all others",
      "selection": {
        "tagStatus": "tagged",
        "countType": "imageCountMoreThan",
        "tagPrefixList": ["latest"],
        "countNumber": 5
      },
      "action": {
        "type": "expire"
      }
    },
    {
      "rulePriority": 2,
      "description": "Keep only two untagged images, expire all others",
      "selection": {
        "tagStatus": "untagged",
        "countType": "imageCountMoreThan",
        "countNumber": 2
      },
      "action": {
        "type": "expire"
      }
    }
  ]
}

Then we run the following AWS CLI command to apply the lifecycle policy to our repositories:

aws ecr put-lifecycle-policy \
    --registry-id "$REPOSITORY_ID" \
    --repository-name "$REPO_NAME" \
    --region ap-southeast-2 \
    --lifecycle-policy-text "file://$LIFE_CYCLE_POLICY"
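Since the same policy goes onto every repository, the command lends itself to a small loop. This is a minimal sketch, assuming the policy is saved in a local JSON file; the function name, the AWS override and the repository names in the usage line are hypothetical.

```shell
# Hypothetical helper; set AWS=echo to print the commands instead of running them.
AWS="${AWS:-aws}"

# Apply one lifecycle policy file to every repository name given.
# Stops at the first repository for which the call fails.
# Usage: apply_lifecycle_policy <policy.json> <repo> [<repo> ...]
apply_lifecycle_policy() {
  policy_file="$1"; shift
  for repo in "$@"; do
    "$AWS" ecr put-lifecycle-policy \
      --repository-name "$repo" \
      --region ap-southeast-2 \
      --lifecycle-policy-text "file://$policy_file" || return 1
  done
}

# Usage (repository names are placeholders):
#   apply_lifecycle_policy policy.json repo-a repo-b
```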

Rollback with ease

One of the most useful things about rolling updates is that they make rollbacks easy. If we find issues with the current release, rolling back is just a matter of getting the AWS ECR tag prior to the latest one and running kubectl apply -f with that tag; everything else works as before.

To get the tag prior to the latest one on AWS ECR, we used the following command:

LATEST_TAG=$(aws ecr describe-images --repository-name "$ECR_REPO_NAME" --query 'sort_by(imageDetails,& imagePushedAt)[-2].imageTags[0]' --region ap-southeast-2 --output text)

The [-2] in that command selects the release pushed just before the latest one; [-1] would select the latest release. Despite its name, LATEST_TAG therefore holds the previous release's tag.
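One thing to watch for is a repository with fewer than two images, or a previous image without a tag: the query then yields null, which the CLI prints as None in text output. A guarded variant of the lookup could look like the sketch below; the function name previous_tag and the AWS override are illustrative assumptions.

```shell
# Hypothetical helper; AWS can be pointed at a stub for testing.
AWS="${AWS:-aws}"

# Print the first tag of the second most recently pushed image.
# Fails if the lookup errors or if there is no such tagged image
# (the CLI prints "None" for a null result with --output text).
# Usage: previous_tag <repository-name>
previous_tag() {
  repo="$1"
  tag=$("$AWS" ecr describe-images --repository-name "$repo" \
    --query 'sort_by(imageDetails,& imagePushedAt)[-2].imageTags[0]' \
    --region ap-southeast-2 --output text) || return 1
  if [ -z "$tag" ] || [ "$tag" = "None" ]; then
    echo "no previous tagged image found in $repo" >&2
    return 1
  fi
  printf '%s\n' "$tag"
}
```

Failing loudly here keeps a rollback from trying to deploy an image tagged None.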

Finally, to deploy it to Kubernetes as a rolling update, we execute the following command:

kubectl set image deployment/<deployment_name> <container_name>=<ECR_REPO_URL>:$LATEST_TAG -n $NAMESPACE
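Kubernetes also keeps a revision history for each deployment, so an alternative to looking up the tag in ECR is kubectl rollout undo, which reverts to the previous revision. We have not described using it above, so treat the following purely as a sketch; the function name and the KUBECTL override are hypothetical.

```shell
# Hypothetical helper; set KUBECTL=echo to print the commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"

# Revert a deployment to its previous revision and wait for the
# resulting rolling update to complete.
# Usage: undo_rollout <deployment> <namespace>
undo_rollout() {
  deployment="$1"; namespace="$2"
  "$KUBECTL" rollout undo "deployment/$deployment" -n "$namespace" &&
    "$KUBECTL" rollout status "deployment/$deployment" -n "$namespace"
}
```

The undo is itself performed as a rolling update, so the readiness probes protect it in the same way as a normal release.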

It has been an amazing learning experience and this is just the beginning. I am sure there are more improvements we can do to make the CI/CD process much more streamlined and with time we will get to where we want to be and evolve continuously.

Thank you for reading. If you have any comments or recommendations, do not hesitate to leave a comment; they are always appreciated.
