5.6 Observability & Maintenance

5.6.1 Health Checks

Applications that are in service need to be healthy at all times so that they are ready to receive traffic. K8s uses health checks to test whether an application is alive. If the application is unhealthy, K8s restarts the container. Yet, checking only whether a process is up and running might not be sufficient. What if, for example, an application needs to connect to a database and the connection cannot be established, even though the app itself is up and running? To catch more specific issues like this, health checks such as a liveness probe or a readiness probe can be used. If none have been specified, K8s falls back to its default check, which only tests whether the container's process is running.

There are three primary types of health checks: TCP, exec, and HTTP. TCP health checks try to open a TCP connection to a specified port on the container's IP and pass if the socket is open and accepts the connection. Exec health checks run a custom script or command inside the container, specified using the exec field in the YAML definition file; the check passes if the command exits with status code 0. HTTP health checks send an HTTP GET request to a specific endpoint of the container, which makes them particularly suited for applications with HTTP interfaces. The choice of health check type depends on the nature of the application and the specific aspects to monitor. For the sake of this tutorial, only the last type, HTTP, will be covered.
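To make the three types concrete, the following sketch shows how each one might be declared as a livenessProbe under a container entry. The three fragments are alternatives (a probe takes exactly one handler), so they are written as separate YAML documents. The port numbers, the /healthz path, and the /tmp/healthy file are placeholder assumptions and are not taken from the application used in this tutorial.

```yaml
# Alternative 1: TCP check, passes if a connection to the port can be opened
livenessProbe:
  tcpSocket:
    port: 5000          # assumed port
---
# Alternative 2: exec check, passes if the command exits with status code 0
livenessProbe:
  exec:
    command:
      - cat
      - /tmp/healthy    # hypothetical file that the application maintains
---
# Alternative 3: HTTP check, passes on a status code >= 200 and < 400
livenessProbe:
  httpGet:
    path: /healthz      # hypothetical endpoint
    port: 5000          # assumed port
```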

5.6.1.1 Liveness Probe

The kubelet of a node uses Liveness Probes to check whether an application runs fine or whether it is unable to make progress and is stuck in a broken state. For example, a Liveness Probe could catch a deadlock or a database connection failure. If the probe fails, the kubelet restarts the container. To use an HTTP Liveness Probe, an endpoint needs to be specified. The benefit of this is that it is simple to define what it means for an application to be healthy just by defining a path.

5.6.1.2 Readiness Probe

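Similar to a Liveness Probe, the Readiness Probe is used by the kubelet to check when the container is ready to start accepting traffic. A Pod is considered ready when all of its containers are ready. Unlike a failed Liveness Probe, a failed Readiness Probe does not restart the container; the Pod is instead removed from the endpoints of its Services until the probe succeeds again. A Readiness Probe is also configured by specifying a path that defines what it means for the application to be ready. Many frameworks, such as Spring Boot, provide such a path out of the box.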
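As an illustration for Spring Boot, its Actuator module exposes dedicated probe endpoints that can be referenced directly from a container's probe definitions. The fragment below is a sketch that assumes Actuator is on the classpath with its health probes enabled and that the application listens on port 8080; the port and the timing values are assumptions to adapt to the actual service.

```yaml
# Fragment of a container spec for a Spring Boot application (sketch)
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080           # assumed application port
  periodSeconds: 5       # assumed interval
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080           # assumed application port
  periodSeconds: 5       # assumed interval
```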

The healthchecks.yaml configuration below shows a Deployment that includes a Liveness and a Readiness Probe. The image of the Deployment is set up so that its process is killed after a given number of seconds. This number is passed in using an environment variable, as seen in the manifest. Both the Liveness and the Readiness Probe take the same set of parameters in the given example.

initialDelaySeconds: The probe will not be called until x seconds after the container has started.
timeoutSeconds: The probe must respond within an x-second timeout, and the HTTP status code must be greater than or equal to 200 and less than 400.
periodSeconds: The probe is called every x seconds by K8s.
failureThreshold: After x consecutive failed probes, the check counts as failed. For a Liveness Probe the container is restarted; for a Readiness Probe the Pod is marked as not ready.
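
For orientation, the fragment below writes out these four parameters with the values K8s applies when a field is omitted. It is only a sketch: the path and port are placeholders, and successThreshold is an additional field not discussed in this section, shown for completeness.

```yaml
# Probe with all timing fields set to their Kubernetes defaults
livenessProbe:
  httpGet:
    path: /liveness        # placeholder endpoint
    port: 5000             # placeholder port
  initialDelaySeconds: 0   # start probing right after the container starts
  timeoutSeconds: 1        # each probe must answer within 1 second
  periodSeconds: 10        # probe every 10 seconds
  failureThreshold: 3      # act after 3 consecutive failures
  successThreshold: 1      # one success marks the probe as passing again
```

With these defaults, a container that stops responding would be restarted after roughly failureThreshold x periodSeconds = 30 seconds.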

# healthchecks.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backendflask-healthcheck
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backendflask-healthcheck
  template:
    metadata:
      labels:
        app: backendflask-healthcheck
        environment: test
        tier: backend
        department: engineering
    spec:
      containers:
        - name: backendflask-healthcheck
          image: "seblum/mlops-public:backend-flask"
          imagePullPolicy: "Always"
          resources:
            limits:
              memory: "128Mi"
              cpu: "500m"
          # Specification of the Liveness Probe
          livenessProbe:
            httpGet:
              # path of the endpoint the probe calls
              path: /liveness
              port: 5000
            # seconds to wait after the container has started before the first probe
            initialDelaySeconds: 5
            timeoutSeconds: 1
            failureThreshold: 3
            # interval in seconds between two consecutive probes
            periodSeconds: 5
          # Specification of the Readiness Probe
          readinessProbe:
            httpGet:
              path: /readiness
              port: 5000
            initialDelaySeconds: 10
            # change to 1 second to see the Pod fail to become ready
            timeoutSeconds: 3
            failureThreshold: 1
            periodSeconds: 5
          env:
            # variable for the container to be killed after 60 seconds
            - name: "KILL_IN_SECONDS"
              value: "60"
          ports:
            - containerPort: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: backendflask-healthcheck
spec:
  type: NodePort
  selector:
    app: backendflask-healthcheck
  ports:
    - port: 80
      # must match the containerPort of the Deployment (5000)
      targetPort: 5000
      nodePort: 30000
---
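apiVersion: v1
kind: Service
metadata:
  # renamed so it does not collide with the NodePort Service above
  name: backendflask-healthcheck-clusterip
spec:
  type: ClusterIP
  selector:
    app: backendflask-healthcheck
  ports:
    - port: 80
      # must match the containerPort of the Deployment (5000)
      targetPort: 5000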