Full MERN Stack App: 0 to deployment on Kubernetes — part 6

In the sixth part, I will talk about configuring health and readiness checks and auto-scaling for our pods.

Kavindu Chamiran
8 min read · Sep 14, 2019
Image credits: DigitalOcean

Welcome back to the sixth part of the series. Today we will talk in detail about setting up health checks, readiness checks, and auto-scaling. We are going to configure our NodeJS server to respond to health and readiness probes, and our deployment to auto-scale when facing high load.

If you haven’t read my fifth part yet, please follow the link below.

Why do we need checks?

Kubernetes can and will replace your pods if they crash. In our deployment’s definition, we specified a number of replicas, and Kubernetes is responsible for always keeping that many working pods running. But what if the pod does not crash but our app does? For example, our NodeJS server might crash because a request could not reach our MongoDB pod. In that case, the pod would still be running but our app would not. As far as Kubernetes is concerned, the pod is healthy, when in reality it is not.

What if both the pod and the app are running, but the app is receiving too much traffic at the moment and cannot handle another request? Then Kubernetes should redirect the request to another pod that is not under heavy load. What if there are no free pods available? Then we have to replicate our pod to create more pods and forward the new requests to these newly created pods. This is where the auto-scaling feature comes in. Even then, we can’t forward a request to a new pod right after it’s created; the pod might not have finished initializing yet, i.e. the container might not have started. We need to check the pod for readiness before forwarding a request to it.

So the need for health and readiness checks is clear. We do not need to perform these checks manually. Instead, we let Kubernetes periodically ask our pods whether they are running and able to serve requests, and we configure our pods to respond with an answer.

Setting up health checks

Let’s set up a health check for our back-end first. Open the server deployment YAML file and add the following block under our container definition.

livenessProbe:
  httpGet:
    port: nodejs-port
    path: /health
  initialDelaySeconds: 15
  timeoutSeconds: 30

The livenessProbe parameter specifies a path and a port on our NodeJS server that Kubernetes queries periodically to check whether the server is running. initialDelaySeconds sets a minimum time to wait before the first check, since the pod needs some time to pull the image and start the container. timeoutSeconds sets the maximum time to wait for a response; if the server does not respond within that time, Kubernetes determines that the app is unhealthy. Now we configure our ExpressJS server to respond to this request.
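Note that the probe refers to the container port by name rather than by number, so this assumes the port in our container definition was given the name nodejs-port, roughly like this (a minimal sketch; the image and port number are placeholders, keep whatever you used in the earlier parts):

containers:
  - name: cloudl-server
    image: gcr.io/<your-project>/cloudl-server:latest
    ports:
      - name: nodejs-port        # the name referenced by the probes
        containerPort: 3000      # assuming the Express server listens on 3000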

app.get('/health', (req, res) => {
  // your health check logic goes here
  res.status(200).send();
});

We can run a full check-up on our NodeJS server inside the callback and then respond to the health request. For now, let’s just respond right away with status code 200 (the HTTP status code for OK). Even this minimal check works: if our ExpressJS server has crashed, there will be no response at all, so Kubernetes will know something has gone wrong with the app.
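If you want the probe to reflect more than “the process is up”, you could also verify the database connection inside the handler. A minimal sketch, assuming the back-end connects to MongoDB through Mongoose (adjust to whichever client you used in the earlier parts):

const mongoose = require('mongoose');

app.get('/health', (req, res) => {
  // readyState === 1 means Mongoose currently holds an open connection
  const dbConnected = mongoose.connection.readyState === 1;

  if (dbConnected) {
    return res.status(200).send('OK');
  }
  // A non-2xx response tells the liveness probe the app is unhealthy,
  // so Kubernetes will eventually restart the container
  return res.status(503).send('Database unreachable');
});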

Setting up readiness checks

A livenessProbe verifies whether the app is running, but that does not necessarily mean the app is ready to serve requests. The app might already be overwhelmed by the number of requests it is handling. In such a case, we need to stop Kubernetes from forwarding any more requests to that pod. This is achieved with a readinessProbe, which indicates whether the container is ready to serve requests. If the readiness check fails, Kubernetes will not restart the pod; instead, it removes the pod from the service’s endpoints so that no further requests are directed to it.

readinessProbe:
  httpGet:
    port: nodejs-port
    path: /ready
  initialDelaySeconds: 15
  timeoutSeconds: 30

In our server.js file, add the corresponding endpoint:

app.get('/ready', (req, res) => {
  // your readiness check logic goes here
  res.status(200).send();
});

Your final cloudl-server-deployment.yml file would look like this.

cloudl-server-deployment.yml
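A minimal sketch of what the final file might look like with both probes added (the image, labels, and port number here are assumptions; keep whatever you used in the earlier parts):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudl-server-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cloudl-server
  template:
    metadata:
      labels:
        app: cloudl-server
    spec:
      containers:
        - name: cloudl-server
          image: gcr.io/<your-project>/cloudl-server:latest
          ports:
            - name: nodejs-port
              containerPort: 3000
          livenessProbe:
            httpGet:
              port: nodejs-port
              path: /health
            initialDelaySeconds: 15
            timeoutSeconds: 30
          readinessProbe:
            httpGet:
              port: nodejs-port
              path: /ready
            initialDelaySeconds: 15
            timeoutSeconds: 30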

Deploying the changes

Now we can deploy the changes. Since our deployment configuration changed, we cannot use our CI/CD pipeline to roll out this version; remember that the pipeline only rebuilds and redeploys when our source code changes, not when the Kubernetes definition files do. This scenario is similar to when we made the initial deployment of the app.

If you have our initial deployment still running on the cluster, delete it by running these commands.

kubectl delete pod/mongodb
kubectl delete svc/mongodb-service
kubectl delete deploy/cloudl-server-deployment
kubectl delete svc/cloudl-server-service

Then re-deploy our app using the kubectl create command.
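A sketch of the commands, assuming the MongoDB and service definition files keep the names used earlier in the series (swap in your own file names if they differ):

kubectl create -f mongodb-pod.yml
kubectl create -f mongodb-service.yml
kubectl create -f cloudl-server-deployment.yml
kubectl create -f cloudl-server-service.yml

Then watch the pods come up: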

watch -n1 kubectl get pods

Using the watch command, we execute kubectl get pods every second and observe the changes in its output. Notice that the mongodb pod’s READY value changes to 1/1 (1 container ready out of 1) immediately after its STATUS changes to Running, but for the pod running the NodeJS app, the READY value changes only several seconds after its STATUS has changed to Running. This is because of the liveness and readiness checks we configured. Even after the pod’s STATUS is Running (the container has started), it is not considered ready to serve requests until the readiness check passes, which is delayed by the initialDelaySeconds we configured. In my case, around the 36-second mark, the back-end pod’s READY value changed to 1/1 (readiness check passed).

Pod auto-scaling

Think about a scenario where our back-end pod is healthy but not ready because it is overwhelmed by the current number of requests. The only solution is to create more back-end pods to serve the additional requests. It is true that we could run more than one replica of our back-end pod from the beginning, but we cannot know in advance how many pods will be enough to serve all our future traffic. We also do not want to run too many pods, as that would be expensive and a waste of resources when there is not much traffic to our app. The cost-effective approach is to let Kubernetes adapt to the moment: create more replicas when needed and remove unnecessary replicas when there is no high load.

Lucky for us, Kubernetes comes with built-in auto-scaling features. There are two ways to auto-scale pods:

  • Vertical auto-scaling: give the currently running pods more computing power by increasing their resources.
  • Horizontal auto-scaling: launch more pod replicas while keeping each pod’s computing power the same.

Today we will talk about how to set up horizontal auto-scaling. Kubernetes can automatically scale a Deployment, a ReplicationController, or a ReplicaSet. It decides when to scale based on metrics such as CPU usage and available memory. We specify how much of these resources our app needs to run normally using the resources parameter in the deployment definition, and Kubernetes adds or removes replicas to keep the utilization of those requested resources at the target we set.

Horizontal pod auto-scaling

To demonstrate auto-scaling in action, I am going to use another app, because my back-end NodeJS app does not do anything CPU-intensive enough to trigger the auto-scaling. The process is the same, so there is nothing to worry about.

This is the Dockerfile and the source code of the app I am going to use.

Dockerfile
index.php

You do not need to build and push a Docker image because this image is readily available on Google Container Registry. I added the files here so you have an idea of what is happening in the app.
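This looks like the standard hpa-example app from the Kubernetes documentation: an Apache/PHP image whose single page burns CPU by computing square roots in a loop on every request. A rough sketch of the two files, to give an idea of their contents (treat this as an approximation rather than the exact source):

# Dockerfile
FROM php:5-apache
COPY index.php /var/www/html/index.php
RUN chmod a+rx index.php

<?php
  // index.php: deliberately CPU-heavy so each request generates measurable load
  $x = 0.0001;
  for ($i = 0; $i <= 1000000; $i++) {
    $x += sqrt($x);
  }
  echo "OK!";
?>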

Now let’s create a deployment definition for this app.

hpa-demo-deployment.yml

Notice that we are running 3 replicas of the app right from the start. The new resources parameter in the containers array tells Kubernetes that our app requests at least 200m of CPU (20% of one CPU core of the node) to run.
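A minimal sketch of what hpa-demo-deployment.yml might contain, based on the description above (the labels and image are assumptions; use the public hpa-example image or your own):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hpa-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hpa-demo
  template:
    metadata:
      labels:
        app: hpa-demo
    spec:
      containers:
        - name: hpa-demo
          image: k8s.gcr.io/hpa-example
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 200m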

We also need a service to expose our deployment. Rather than creating a separate file, we can add the service’s definition in the same deployment definition file. Paste the following service definition at the bottom of the “hpa-demo-deployment.yml” file.

---
apiVersion: v1
kind: Service
metadata:
  name: hpa-demo
spec:
  ports:
    - port: 80
      protocol: TCP
  selector:
    app: hpa-demo
  type: NodePort

Now create another definition for our horizontal pod auto-scaler.

hpa.yml

We specify that we want to keep the average CPU utilization of our pods at 50% of what they requested. Kubernetes will launch more pods, or stop running ones, between a minimum of 1 and a maximum of 10 replicas of the “hpa-demo” deployment (the same name we used in the deployment definition) to maintain that target.
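A minimal sketch of what hpa.yml might look like, using the autoscaling/v1 API; the values follow the description above, so treat the rest as an approximation:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hpa-demo
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50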

We will use a new cluster for this, as our previous cluster only had a single node, which won’t be sufficient to run 10 pods. Let’s create a new cluster with 6 nodes.

This cluster is going to be expensive, so make sure you delete it once you are done.

gcloud container clusters create hpa-demo \
--scopes "cloud-platform" \
--num-nodes 6 \
--enable-basic-auth \
--issue-client-certificate \
--enable-ip-alias \
--zone northamerica-northeast1-a

Let’s deploy these definitions and then apply some load to our deployment to see if auto-scaling gets triggered.

kubectl create -f hpa-demo-deployment.yml
kubectl create -f hpa.yml

Now that our deployment is running peacefully, let’s generate some load by continuously sending requests to our app. To do that, we launch a busybox container in our cluster. busybox is a widely used image when you want terminal access inside the cluster. From busybox’s terminal, we can make GET requests to our app.

kubectl run -it load-generator --image=busybox /bin/sh

# inside the busybox terminal
while true; do wget -q -O- http://hpa-demo.default.svc.cluster.local:80; done

This will keep sending GET requests to our app until we stop it, generating CPU load inside our pods. Then we watch the kubectl get pods command in another terminal to see our pods getting replicated under the high load.

watch -n1 kubectl get pods
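Optionally, you can also watch the autoscaler object itself; it shows the measured CPU utilization against the 50% target and the current replica count:

watch -n1 kubectl get hpa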

When there are 10 replicas running, stop the wget loop in the busybox terminal. You will notice that the 10 replicas keep running even after the high load has stopped. That is because Kubernetes does not scale the pods down right away when the load drops. It waits a few more minutes to see whether another surge arrives. If there is no high load within this period, it starts scaling the pods down until the minimum number of replicas is reached.

Conclusion

In the next article, the seventh part of the series, I am going to talk about deploying stateful apps on Kubernetes with the help of Google Cloud Storage, and about setting up a domain name for our web app. I hope this article was interesting and that you will read the seventh part too.

PS: Claps are appreciated!
