Implementing Horizontal Pod Autoscaling in Kubernetes [Tutorial]

When we use Kubernetes deployments to deploy our pod workloads, it is simple to scale the number of replicas used by our applications up and down using the kubectl scale command. However, if we want our applications to automatically respond to changes in their workloads and scale to meet demand, then Kubernetes provides us with Horizontal Pod Autoscaling.

This article is an excerpt taken from the book Kubernetes on AWS written by Ed Robinson. In this book, you will start by learning about Kubernetes’ powerful abstractions – Pods and Services – that make managing container deployments easy.

Horizontal Pod Autoscaling allows us to define rules that will scale the number of replicas in our deployments up or down based on CPU utilization and, optionally, other custom metrics. Before we are able to use Horizontal Pod Autoscaling in our cluster, we need to deploy the Kubernetes metrics server; this server provides endpoints that are used to discover CPU utilization and other metrics generated by our applications.

In this article, you will learn how to use the horizontal pod autoscaling method to automatically scale your applications and to automatically provision and terminate EC2 instances.

Deploying the metrics server

Before we can make use of Horizontal Pod Autoscaling, we need to deploy the Kubernetes metrics server to our cluster. This is because the Horizontal Pod Autoscaling controller makes use of the metrics provided by the metrics.k8s.io API, which is provided by the metrics server.

While some installations of Kubernetes may install this add-on by default, in our EKS cluster we will need to deploy it ourselves.

There are a number of ways to deploy add-on components to your cluster:

If you are using Helm to manage applications on your cluster, you could use the stable/metrics-server chart.

For simplicity, we are just going to deploy the metrics server manifests using kubectl.

I like to integrate deploying add-ons such as the metrics server and kube2iam with the process that provisions the cluster, as I see them as integral parts of the cluster infrastructure. But if you are going to use a tool like Helm to manage deploying applications to your cluster, then you might prefer to manage everything running on your cluster with the same tool. The decision you take really depends on the processes you and your team adopt for managing your cluster and the applications that run on it.

The metrics server is developed in its own GitHub repository. You will find the manifests required to deploy it in the deploy directory of that repository.

Start by cloning the configuration from GitHub. The metrics server began supporting the authentication methods provided by EKS in version 0.0.3, so make sure the manifests you use are at least that version.

You will find a number of manifests in the deploy/1.8+ directory. The auth-reader.yaml and auth-delegator.yaml files configure the integration of the metrics server with the Kubernetes authorization infrastructure. The resource-reader.yaml file configures a role to give the metrics server the permissions to read resources from the API server, in order to discover the nodes that pods are running on. The metrics-server-deployment.yaml and metrics-server-service.yaml files define the deployment used to run the server itself and a service through which it can be accessed. Finally, the metrics-apiservice.yaml file defines an APIService resource that registers the metrics.k8s.io API group with the Kubernetes API server aggregation layer; this means that requests to the API server for the metrics.k8s.io group will be proxied to the metrics server service.

Deploying these manifests is simple; just submit all of them to the cluster with kubectl apply:

$ kubectl apply -f deploy/1.8+

You should see a message about each of the resources being created on the cluster.

If you are using a tool like Terraform to provision your cluster, you might use it to submit the manifests for the metrics server when you create your cluster.

Verifying the metrics server and troubleshooting

Before we continue, we should take a moment to check that our cluster and the metrics server are correctly configured to work together.

After the metrics server is running on your cluster and has had a chance to collect metrics from the cluster (give it a minute or so), you should be able to use the kubectl top command to see the resource usage of the pods and nodes in your cluster.

Start by running kubectl top nodes. If you see output like this, then the metrics server is configured correctly and is collecting metrics from your nodes:

$ kubectl top nodes
NAME             CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%
ip-10-3-29-209   20m          1%        717Mi           19%
ip-10-3-61-119   24m          1%        1011Mi          28%

If you see an error message, then there are a number of troubleshooting steps you can follow.

You should start by describing the metrics server deployment and checking that one replica is available:

kubectl -n kube-system describe deployment metrics-server

If it is not, you should debug the created pod by running kubectl -n kube-system describe pod. Look at the events to see why the server is not available. Make sure that you are running at least version 0.0.3 of the metrics server.

If the metrics server is running correctly and you still see errors when running kubectl top, the APIService registered with the aggregation layer is probably not configured correctly. Check the events at the bottom of the output of kubectl describe apiservice v1beta1.metrics.k8s.io.

One common issue is that the EKS control plane cannot connect to the metrics server service on port 443.

Autoscaling pods based on CPU usage

Once the metrics server has been installed into our cluster, we will be able to use the metrics API to retrieve information about CPU and memory usage of the pods and nodes in our cluster. Using the kubectl top command is a simple example of this.

The Horizontal Pod Autoscaler can also use this same metrics API to gather information about the current resource usage of the pods that make up a deployment.

Let’s look at an example of this; we are going to deploy a sample application that uses a lot of CPU under load, then configure a Horizontal Pod Autoscaler to scale up extra replicas of this pod to provide extra capacity when CPU utilization exceeds a target level.

The application we will be deploying as an example is a simple Ruby web application that can calculate the nth number in the Fibonacci sequence. It uses a simple recursive algorithm and is not very efficient (perfect for us to experiment with autoscaling). The deployment for this application is very simple. It is important to set resource limits for CPU: the Horizontal Pod Autoscaler expresses CPU utilization as a percentage of each pod's CPU request, and when only a limit is specified, as it is here, the request defaults to that limit:

deployment.yaml 
apiVersion: apps/v1 
kind: Deployment 
metadata: 
  name: fib 
  labels: 
    app: fib 
spec: 
  selector: 
    matchLabels: 
      app: fib 
  template: 
    metadata: 
      labels: 
        app: fib 
    spec: 
      containers: 
      - name: fib 
        image: errm/fib 
        ports: 
        - containerPort: 9292 
        resources: 
          limits: 
            cpu: 250m 
            memory: 32Mi
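
To make the relationship between CPU usage and the 250m figure in this manifest concrete, here is a minimal Python sketch of how a utilization percentage could be derived from a pod's CPU usage and its CPU request (which defaults to the limit when only a limit is given). The function names are hypothetical and the whole-number rounding is an assumption made for illustration; this is not the autoscaler's own code.

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "250m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    # Plain numbers are whole cores, so "0.5" means 500 millicores.
    return round(float(quantity) * 1000)


def cpu_utilization_percent(usage: str, request: str) -> int:
    """Express CPU usage as a whole-number percentage of the CPU request."""
    return parse_cpu_millicores(usage) * 100 // parse_cpu_millicores(request)


if __name__ == "__main__":
    # A pod using 150m against a 250m request sits exactly at 60%.
    print(cpu_utilization_percent("150m", "250m"))  # 60
    # Usage at or just above the request reports roughly 100%.
    print(cpu_utilization_percent("251m", "250m"))  # 100
```

With a 250m request, a 60% utilization target therefore corresponds to an average of about 150m of CPU per pod.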

We are not specifying a number of replicas in the deployment spec; when we first submit this deployment to the cluster, the number of replicas will therefore default to 1. This is good practice when creating a deployment where we intend the replicas to be adjusted by a Horizontal Pod Autoscaler, because it means that if we use kubectl apply to update the deployment later, we won’t override the replica value the Horizontal Pod Autoscaler has set (inadvertently scaling the deployment down or up).

Let’s deploy this application to the cluster:

kubectl apply -f deployment.yaml

You could run kubectl get pods -l app=fib to check that the application started up correctly.

We will create a service so that we are able to access the pods in our deployment. Requests will be proxied to each of the replicas, spreading the load:

service.yaml 
kind: Service 
apiVersion: v1 
metadata: 
  name: fib 
spec: 
  selector: 
    app: fib 
  ports: 
  - protocol: TCP 
    port: 80 
    targetPort: 9292

Submit the service manifest to the cluster with kubectl:

kubectl apply -f service.yaml

We are going to configure a Horizontal Pod Autoscaler to control the number of replicas in our deployment. The spec defines how we want the autoscaler to behave; here we specify that the autoscaler should maintain between 1 and 10 replicas of our application and aim for a target average CPU utilization of 60% across those replicas.

When CPU utilization falls below 60%, then the autoscaler will adjust the replica count of the targeted deployment down; when it goes above 60%, replicas will be added:

hpa.yaml 
kind: HorizontalPodAutoscaler 
apiVersion: autoscaling/v2beta1 
metadata: 
  name: fib 
spec: 
  maxReplicas: 10 
  minReplicas: 1 
  scaleTargetRef: 
    apiVersion: apps/v1
    kind: Deployment 
    name: fib 
  metrics: 
  - type: Resource 
    resource: 
      name: cpu 
      targetAverageUtilization: 60

Create the autoscaler with kubectl:

kubectl apply -f hpa.yaml

The kubectl autoscale command is a shortcut to create a HorizontalPodAutoscaler. Running kubectl autoscale deployment fib --min=1 --max=10 --cpu-percent=60 would create an equivalent autoscaler.

Once you have created the Horizontal Pod Autoscaler, you can see a lot of interesting information about its current state with kubectl describe:

$ kubectl describe hpa fib    
Name:              fib
Namespace:         default
CreationTimestamp: Sat, 15 Sep 2018 14:32:46 +0100
Reference:         Deployment/fib
Metrics:           ( current / target )
  resource cpu:    0% (1m) / 60%
Min replicas:      1
Max replicas:      10
Deployment pods:   1 current / 1 desired

Now that we have set up our Horizontal Pod Autoscaler, we should generate some load on the pods in our deployment to illustrate how it works. In this case, we are going to use the ab (Apache benchmark) tool to repeatedly ask our application to compute the thirtieth Fibonacci number:

load.yaml
apiVersion: batch/v1 
kind: Job 
metadata: 
  name: fib-load 
  labels: 
    app: fib 
    component: load 
spec: 
  template: 
    spec: 
      containers: 
      - name: fib-load 
        image: errm/ab 
        args: ["-n1000", "-c4", "fib/30"] 
      restartPolicy: OnFailure

This job uses ab to make 1,000 requests to the endpoint (with a concurrency of 4). Submit the job to the cluster, then observe the state of the Horizontal Pod Autoscaler:

kubectl apply -f load.yaml   
watch kubectl describe hpa fib

Once the load job has started to make requests, the autoscaler will scale up the deployment in order to handle the load:

Name:              fib
Namespace:         default
CreationTimestamp: Sat, 15 Sep 2018 14:32:46 +0100
Reference:         Deployment/fib
Metrics:           ( current / target )
  resource cpu:    100% (251m) / 60%
Min replicas:      1
Max replicas:      10
Deployment pods:   2 current / 2 desired
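
The jump from one replica to two in this output is consistent with the scaling rule described in the Kubernetes documentation, where the desired replica count is the current count multiplied by the ratio of the current metric value to the target value, rounded up. The following Python sketch reproduces that arithmetic. The function name is hypothetical, the 0.1 tolerance is the documented default, and the real controller applies further rules (for example, around pods that are not yet ready) that are not modelled here.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Sketch of the replica calculation: ceil(current * currentMetric / target)."""
    ratio = current_utilization / target_utilization
    # Small deviations from the target are ignored to avoid constant rescaling.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # The result is always kept within the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(desired_replicas(1, 100, 60))   # 2: one pod at 100% against a 60% target
    print(desired_replicas(2, 63, 60))    # 2: within tolerance, no change
    print(desired_replicas(4, 30, 60))    # 2: load has dropped, scale down
    print(desired_replicas(10, 200, 60))  # 10: capped at maxReplicas
```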

Autoscaling pods based on other metrics

The metrics server provides APIs that the Horizontal Pod Autoscaler can use to gain information about the CPU and memory utilization of pods in the cluster.

It is possible to target a utilization percentage like we did for the CPU metric, or to target the absolute value as we have here for the memory metric:

hpa.yaml 
kind: HorizontalPodAutoscaler 
apiVersion: autoscaling/v2beta1 
metadata: 
  name: fib 
spec: 
  maxReplicas: 10 
  minReplicas: 1 
  scaleTargetRef: 
    apiVersion: apps/v1
    kind: Deployment 
    name: fib 
  metrics: 
  - type: Resource 
    resource: 
      name: memory 
      targetAverageValue: 20M
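
Note that the 20M target and the 32Mi limit in our deployment use different suffixes: in Kubernetes quantities, M is a decimal multiplier (10^6) while Mi is a binary one (2^20). The following Python sketch, using a hypothetical helper that handles only a handful of suffixes, shows how the two values compare in bytes.

```python
# Binary and decimal suffixes used by Kubernetes resource quantities
# (only a subset is handled in this sketch).
SUFFIXES = {
    "Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3,
    "k": 1000, "M": 1000 ** 2, "G": 1000 ** 3,
}


def parse_memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "20M" or "32Mi" to bytes."""
    for suffix, factor in SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    # No suffix: the value is already in bytes.
    return int(quantity)


target = parse_memory_bytes("20M")   # the autoscaler target
limit = parse_memory_bytes("32Mi")   # the container memory limit

print(target)                          # 20000000
print(limit)                           # 33554432
print(round(100 * target / limit, 1))  # 59.6
```

In other words, this autoscaler would aim to keep average memory usage per pod at a little under 60% of the container's limit.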

The Horizontal Pod Autoscaler also allows us to scale on other metrics provided by more comprehensive metrics systems. Kubernetes allows for metrics APIs to be aggregated for custom and external metrics.

Custom metrics are metrics other than CPU and memory that are associated with a pod. You might for example use an adapter that allows you to use metrics that a system like Prometheus has collected from your pods.

This can be very beneficial if you have more detailed metrics available about the utilization of your application, for example, a forking web server that exposes a count of busy worker processes, or a queue processing application that exposes metrics about the number of items currently enqueued.

External metrics adapters provide information about resources that are not associated with any object within Kubernetes, for example, if you were using an external queuing system, such as the AWS SQS service.

On the whole, it is simpler if your applications can expose metrics about the resources that they depend on than to use an external metrics adapter. It can be hard to limit access to particular external metrics, whereas custom metrics are tied to a particular Pod, so Kubernetes can limit access to only those users and processes that need to use them.

Autoscaling the cluster

The capabilities of Kubernetes Horizontal Pod Autoscaler allow us to add and remove pod replicas from our applications as their resource usage changes over time. However, this makes no difference to the capacity of our cluster. If our pod autoscaler is adding pods to handle an increase in load, then eventually we might run out of space in our cluster, and additional pods would fail to be scheduled. If there is a decrease in the load on our application and the pod autoscaler removes pods, then we are paying AWS for EC2 instances that will sit idle.

When we created our cluster in Chapter 7, A Production-Ready Cluster, we deployed the cluster nodes using an autoscaling group, so we should be able to use this to grow and shrink the cluster as the needs of the applications deployed to it change over time.

Autoscaling groups have built-in support for scaling the size of the cluster, based on the average CPU utilization of the instances. This, however, is not really suitable when dealing with a Kubernetes cluster because the workloads running on each node of our cluster might be quite different, so the average CPU utilization is not really a very good proxy for the free capacity of the cluster.

Thankfully, in order to schedule pods to nodes effectively, Kubernetes keeps track of the capacity of each node and the resources requested by each pod. By utilizing this information, we can automate scaling the cluster to match the size of the workload.
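
The difference between average CPU utilization and schedulable capacity can be shown with a short Python sketch. The node names and figures below are made up for illustration, and the check considers CPU requests only, which is a simplification of what the scheduler does.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    allocatable_cpu_m: int  # CPU available to pods on the node, in millicores
    requested_cpu_m: int    # sum of CPU requests of pods already on the node
    used_cpu_m: int         # CPU actually being consumed right now


def average_cpu_utilization(nodes):
    """Average CPU utilization across the nodes, as a percentage."""
    used = sum(n.used_cpu_m for n in nodes)
    total = sum(n.allocatable_cpu_m for n in nodes)
    return 100 * used / total


def nodes_with_room(nodes, pod_request_m):
    """Names of nodes whose unreserved CPU can hold a pod with the given request."""
    return [n.name for n in nodes
            if n.allocatable_cpu_m - n.requested_cpu_m >= pod_request_m]


# Hypothetical cluster: mostly idle, but almost all CPU is already reserved.
nodes = [
    Node("node-a", allocatable_cpu_m=2000, requested_cpu_m=1900, used_cpu_m=200),
    Node("node-b", allocatable_cpu_m=2000, requested_cpu_m=1850, used_cpu_m=300),
]

print(average_cpu_utilization(nodes))  # 12.5
print(nodes_with_room(nodes, 250))     # []
```

Here the instances look almost idle, yet a new pod requesting 250m of CPU cannot be scheduled anywhere. A rule based on average CPU utilization would not add a node in this situation, while one based on requests and capacity would.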

The Kubernetes autoscaler project provides a cluster autoscaler component for some of the main cloud providers, including AWS. The cluster autoscaler can be deployed to our cluster quite simply. As well as being able to add instances to our cluster, the cluster autoscaler is also able to drain the pods from and then terminate instances when the capacity of the cluster can be reduced.

Deploying the cluster autoscaler

Deploying the cluster autoscaler to our cluster is straightforward: it runs as a single pod, so all we need is a Kubernetes deployment.

In order for the cluster autoscaler to update the desired capacity of our autoscaling group, we need to give it permissions via an IAM role. If you are using kube2iam, we will be able to specify this role for the cluster autoscaler pod via an appropriate annotation:

cluster_autoscaler.tf
data "aws_iam_policy_document" "eks_node_assume_role_policy" { 
  statement { 
    actions = ["sts:AssumeRole"] 
    principals { 
      type = "AWS" 
      identifiers = ["${aws_iam_role.node.arn}"] 
    } 
  } 
} 
 
resource "aws_iam_role" "cluster_autoscaler" {
  name = "EKSClusterAutoscaler" 
  assume_role_policy = "${data.aws_iam_policy_document.eks_node_assume_role_policy.json}" 
} 
 
 
data "aws_iam_policy_document" "autoscaler" { 
  statement { 
    actions = [ 
      "autoscaling:DescribeAutoScalingGroups", 
      "autoscaling:DescribeAutoScalingInstances", 
      "autoscaling:DescribeTags", 
      "autoscaling:SetDesiredCapacity", 
      "autoscaling:TerminateInstanceInAutoScalingGroup" 
    ] 
    resources = ["*"] 
  } 
} 
 
resource "aws_iam_role_policy" "cluster_autoscaler" { 
  name = "cluster-autoscaler" 
  role = "${aws_iam_role.cluster_autoscaler.id}" 
  policy = "${data.aws_iam_policy_document.autoscaler.json}" 
}

In order to deploy the cluster autoscaler to our cluster, we will submit a deployment manifest using kubectl. We will use Terraform’s templating system to produce the manifest.

We create a service account that is used by the autoscaler to connect to the Kubernetes API:

cluster_autoscaler.tpl
--- 
apiVersion: v1 
kind: ServiceAccount 
metadata: 
  labels: 
    k8s-addon: cluster-autoscaler.addons.k8s.io 
    k8s-app: cluster-autoscaler 
  name: cluster-autoscaler 
  namespace: kube-system

The cluster autoscaler needs to read information about the current resource usage of the cluster, and needs to be able to evict pods from nodes that need to be removed from the cluster and terminated. The cluster-autoscaler ClusterRole provides the required permissions for these actions. The following is the code continuation for cluster_autoscaler.tpl:

--- 
apiVersion: rbac.authorization.k8s.io/v1beta1 
kind: ClusterRole 
metadata: 
  name: cluster-autoscaler 
  labels: 
    k8s-addon: cluster-autoscaler.addons.k8s.io 
    k8s-app: cluster-autoscaler 
rules: 
- apiGroups: [""] 
  resources: ["events","endpoints"] 
  verbs: ["create", "patch"] 
- apiGroups: [""] 
  resources: ["pods/eviction"] 
  verbs: ["create"] 
- apiGroups: [""] 
  resources: ["pods/status"] 
  verbs: ["update"] 
- apiGroups: [""] 
  resources: ["endpoints"] 
  resourceNames: ["cluster-autoscaler"] 
  verbs: ["get","update"] 
- apiGroups: [""] 
  resources: ["nodes"] 
  verbs: ["watch","list","get","update"] 
- apiGroups: [""] 
  resources: ["pods","services","replicationcontrollers","persistentvolumeclaims","persistentvolumes"] 
  verbs: ["watch","list","get"] 
- apiGroups: ["extensions"] 
  resources: ["replicasets","daemonsets"] 
  verbs: ["watch","list","get"] 
- apiGroups: ["policy"] 
  resources: ["poddisruptionbudgets"] 
  verbs: ["watch","list"] 
- apiGroups: ["apps"] 
  resources: ["statefulsets"] 
  verbs: ["watch","list","get"] 
- apiGroups: ["storage.k8s.io"] 
  resources: ["storageclasses"] 
  verbs: ["watch","list","get"] 
--- 
apiVersion: rbac.authorization.k8s.io/v1beta1 
kind: ClusterRoleBinding 
metadata: 
  name: cluster-autoscaler 
  labels: 
    k8s-addon: cluster-autoscaler.addons.k8s.io 
    k8s-app: cluster-autoscaler 
roleRef: 
  apiGroup: rbac.authorization.k8s.io 
  kind: ClusterRole 
  name: cluster-autoscaler 
subjects: 
  - kind: ServiceAccount 
    name: cluster-autoscaler 
    namespace: kube-system

Note that the cluster autoscaler stores state information in a config map, so it needs permission to read from and write to it. This role allows that. The following is the code continuation for cluster_autoscaler.tpl:

--- 
apiVersion: rbac.authorization.k8s.io/v1beta1 
kind: Role 
metadata: 
  name: cluster-autoscaler 
  namespace: kube-system 
  labels: 
    k8s-addon: cluster-autoscaler.addons.k8s.io 
    k8s-app: cluster-autoscaler 
rules: 
- apiGroups: [""] 
  resources: ["configmaps"] 
  verbs: ["create"] 
- apiGroups: [""] 
  resources: ["configmaps"] 
  resourceNames: ["cluster-autoscaler-status"] 
  verbs: ["delete","get","update"] 
--- 
apiVersion: rbac.authorization.k8s.io/v1beta1 
kind: RoleBinding 
metadata: 
  name: cluster-autoscaler 
  namespace: kube-system 
  labels: 
    k8s-addon: cluster-autoscaler.addons.k8s.io 
    k8s-app: cluster-autoscaler 
roleRef: 
  apiGroup: rbac.authorization.k8s.io 
  kind: Role 
  name: cluster-autoscaler 
subjects: 
  - kind: ServiceAccount 
    name: cluster-autoscaler 
    namespace: kube-system

Finally, let’s consider the manifest for the cluster autoscaler deployment itself. The cluster autoscaler pod contains a single container running the cluster autoscaler control loop. You will notice that we are passing some configuration to the cluster autoscaler as command-line arguments. Most importantly, the --node-group-auto-discovery flag allows the autoscaler to operate on autoscaling groups with the kubernetes.io/cluster/<cluster_name> tag. This is convenient because we don’t have to explicitly configure the autoscaler with our cluster autoscaling group.

If your Kubernetes cluster has nodes in more than one availability zone and you are running pods that rely on being scheduled to a particular zone (for example, pods that are making use of EBS volumes), it is recommended to create an autoscaling group for each availability zone that you plan to use. If you use one autoscaling group that spans several zones, then the cluster autoscaler will be unable to specify the availability zone of the instances that it launches.

Here is the code continuation for cluster_autoscaler.tpl:

--- 
apiVersion: apps/v1
kind: Deployment 
metadata: 
  name: cluster-autoscaler 
  namespace: kube-system 
  labels: 
    app: cluster-autoscaler 
spec: 
  replicas: 1 
  selector: 
    matchLabels: 
      app: cluster-autoscaler 
  template: 
    metadata: 
      annotations: 
        iam.amazonaws.com/role: ${iam_role} 
      labels: 
        app: cluster-autoscaler 
    spec: 
      serviceAccountName: cluster-autoscaler 
      containers: 
        - image: k8s.gcr.io/cluster-autoscaler:v1.3.3 
          name: cluster-autoscaler 
          resources: 
            limits: 
              cpu: 100m 
              memory: 300Mi 
            requests: 
              cpu: 100m 
              memory: 300Mi 
          command: 
            - ./cluster-autoscaler 
            - --v=4 
            - --stderrthreshold=info 
            - --cloud-provider=aws 
            - --skip-nodes-with-local-storage=false 
            - --expander=least-waste 
            - --node-group-auto-discovery=asg:tag=kubernetes.io/cluster/${cluster_name} 
          env: 
            - name: AWS_REGION 
              value: ${aws_region} 
          volumeMounts: 
            - name: ssl-certs 
              mountPath: /etc/ssl/certs/ca-certificates.crt 
              readOnly: true 
          imagePullPolicy: "Always" 
      volumes: 
        - name: ssl-certs 
          hostPath: 
            path: "/etc/ssl/certs/ca-certificates.crt"

Finally, we render the templated manifest by passing in the variables for the AWS region, cluster name and IAM role, and submitting the file to Kubernetes using kubectl:

Here is the code continuation for cluster_autoscaler.tf:

data "aws_region" "current" {}

# Render the manifest template with the region, cluster name, and IAM role
data "template_file" "cluster_autoscaler" {
  template = "${file("${path.module}/cluster_autoscaler.tpl")}"

  vars {
    aws_region = "${data.aws_region.current.name}"
    cluster_name = "${aws_eks_cluster.control_plane.name}"
    iam_role = "${aws_iam_role.cluster_autoscaler.name}"
  }
}

# Re-apply the manifest whenever its rendered content changes
resource "null_resource" "cluster_autoscaler" {
  triggers = {
    manifest_sha1 = "${sha1("${data.template_file.cluster_autoscaler.rendered}")}"
  }

  provisioner "local-exec" {
    command = "kubectl --kubeconfig=${local_file.kubeconfig.filename} apply -f -<<EOF\n${data.template_file.cluster_autoscaler.rendered}\nEOF"
  }
}
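
To illustrate what rendering the template does, the following Python sketch substitutes values into a fragment of the manifest using string.Template, which happens to use the same ${name} placeholder syntax. This is only an analogy for Terraform's template rendering, and the region and cluster name are placeholder values chosen for the example.

```python
from string import Template

# A fragment of cluster_autoscaler.tpl containing two placeholders.
fragment = Template(
    "- --node-group-auto-discovery=asg:tag=kubernetes.io/cluster/${cluster_name}\n"
    "env:\n"
    "  - name: AWS_REGION\n"
    "    value: ${aws_region}\n"
)

# Placeholder values; in the article these come from Terraform data sources.
rendered = fragment.substitute(cluster_name="example-cluster",
                               aws_region="eu-west-1")
print(rendered)
```

The rendered text contains no placeholders, so it can be piped straight to kubectl apply, which is what the local-exec provisioner does with the full manifest.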

Thus, by understanding how Kubernetes assigns Quality of Service classes to your pods based on the resource requests and limits that you assign them, you can precisely control how your pods are managed. By ensuring that your critical applications, such as web servers and databases, run with the Guaranteed class, you can ensure that they will perform consistently and suffer minimal disruption when pods need to be rescheduled.

If you have enjoyed reading this post, head over to our book, Kubernetes on AWS, for tips on deploying and managing applications, keeping your cluster and applications secure, and ensuring that your whole system is reliable and resilient to failure.