5.6 Observability & Maintenance
5.6.1 Health Checks
Applications running in a cluster need to be healthy so that they are ready to receive traffic. K8s uses health checks to test whether an application is alive. If an application is found to be unhealthy, K8s restarts its container. However, checking only whether a process is up and running might not be sufficient. What if, for example, the application needs a database connection and that connection cannot be established, even though the process itself is running? To catch more specific issues like this, health checks such as a Liveness Probe or a Readiness Probe can be used. If none are specified, K8s falls back to its default check, which only tests whether the container's process is running.
There are three primary types of health checks: TCP, exec, and HTTP. A TCP health check tries to open a connection to a port on the container's IP and succeeds if the socket accepts the connection. An exec health check runs a custom script or command inside the container, specified using the exec field in the YAML definition file, and succeeds if the command exits with status code 0. An HTTP health check sends a GET request to a specific endpoint of the container, which makes it particularly suited for applications with HTTP interfaces. The choice of health check type depends on the nature of the application and the specific aspects to monitor. For the sake of this tutorial, only the last type, HTTP, is covered.
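To make the difference between the three check types more tangible, the following sketch mimics each of them from the outside in plain Python. It is not how the kubelet is implemented; the function names `tcp_check`, `exec_check`, and `http_check` as well as the default timeouts are assumptions made for illustration only.

```python
import socket
import subprocess
import urllib.error
import urllib.request


def tcp_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Pass if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def exec_check(command: list[str], timeout: float = 1.0) -> bool:
    """Pass if the command exits with status code 0 in time."""
    try:
        result = subprocess.run(command, timeout=timeout, capture_output=True)
        return result.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False


def http_check(url: str, timeout: float = 1.0) -> bool:
    """Pass if a GET request returns a status code from 200 to 399."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and 4xx/5xx responses.
        return False
```

Under these assumptions, a call such as `http_check("http://localhost:5000/liveness")` would roughly correspond to what an HTTP probe against that path and port tests.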
5.6.1.1 Liveness Probe
The kubelet of a node uses Liveness Probes to check whether an application is running fine or whether it is unable to make progress and is stuck in a broken state. For example, a Liveness Probe could catch a deadlock or a database connection failure. If the probe fails, the kubelet restarts the container. To use an HTTP Liveness Probe, an endpoint needs to be specified. The benefit is that defining what it means for an application to be healthy is as simple as defining a path.
5.6.1.2 Readiness Probe
Similar to a Liveness Probe, the Readiness Probe is used by the kubelet to check whether a container is ready to start accepting traffic. A Pod is considered ready when all of its containers are ready, and a Pod that is not ready does not receive traffic through a Service. The probe is also configured by specifying a path to an endpoint that reports whether the application is ready. Many frameworks, such as Spring Boot, provide such an endpoint out of the box.
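What the application side of such endpoints might look like is sketched below. The source code of the image used in this section is not shown here, so this is not its actual implementation: the names `HealthHandler`, `make_server`, and the `ready` flag are hypothetical, and Python's standard `http.server` module is used instead of Flask to keep the sketch self-contained. Only the paths `/liveness` and `/readiness` and the port 5000 are taken from this section.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical flag: set to True once startup work has finished.
    ready = False

    def do_GET(self):
        if self.path == "/liveness":
            # The process can answer requests, so it counts as alive.
            self._reply(200, "alive")
        elif self.path == "/readiness":
            if HealthHandler.ready:
                self._reply(200, "ready")
            else:
                # 503 is outside the 200-399 range, so the probe fails.
                self._reply(503, "not ready")
        else:
            self._reply(404, "not found")

    def _reply(self, status: int, text: str) -> None:
        body = text.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet; probes arrive every few seconds


def make_server(port: int = 5000) -> HTTPServer:
    """Create the server; call serve_forever() on the result to run it."""
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

Separating the two paths lets the application stay alive, and therefore avoid a restart, while it still reports that it is not ready to serve traffic.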
The following configuration shows a Deployment that includes a Liveness and a Readiness Probe. The image of the Deployment is set up so that its process is killed after a given number of seconds. This number is passed using an environment variable, as seen in the manifest. Both the Liveness and the Readiness Probe use the same set of parameters in the given example.
- initialDelaySeconds: The probe is not called until x seconds after the container has started.
- timeoutSeconds: The probe must respond within x seconds, otherwise it counts as failed. For an HTTP probe, the status code of the response must additionally be equal to or greater than 200 and less than 400.
- periodSeconds: The probe is called every x seconds by K8s.
- failureThreshold: After x consecutive failed probes, the check counts as failed. A failed Liveness Probe restarts the container, while a failed Readiness Probe marks the Pod as not ready.
# healthchecks.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backendflask-healthcheck
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backendflask-healthcheck
  template:
    metadata:
      labels:
        app: backendflask-healthcheck
        environment: test
        tier: backend
        department: engineering
    spec:
      containers:
        - name: backendflask-healthcheck
          # check whether I have to change the backend app to do this.
          image: "seblum/mlops-public:backend-flask"
          imagePullPolicy: "Always"
          resources:
            limits:
              memory: "128Mi"
              cpu: "500m"
          # Specification of the Liveness Probe
          livenessProbe:
            httpGet:
              # path of the URL to probe
              path: /liveness
              port: 5000
            # seconds to wait after the container has started before the first probe
            initialDelaySeconds: 5
            timeoutSeconds: 1
            failureThreshold: 3
            # interval in seconds between two probes
            periodSeconds: 5
          # Specification of the Readiness Probe
          readinessProbe:
            httpGet:
              path: /readiness
              port: 5000
            initialDelaySeconds: 10
            # change to 1 second and observe that the Pod does not become ready
            timeoutSeconds: 3
            failureThreshold: 1
            periodSeconds: 5
          env:
            # variable for the container to be killed after 60 seconds
            - name: "KILL_IN_SECONDS"
              value: "60"
          ports:
            - containerPort: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: backendflask-healthcheck
spec:
  type: NodePort
  selector:
    app: backendflask-healthcheck
  ports:
    - port: 80
      # must match the containerPort of the Deployment (5000)
      targetPort: 5000
      nodePort: 30000
---
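apiVersion: v1
kind: Service
metadata:
  # Note: a Service is identified by its name within a namespace, so this
  # definition cannot coexist with the NodePort Service of the same name
  # above. Keep the one you need or rename one of them.
  name: backendflask-healthcheck
spec:
  type: ClusterIP
  selector:
    app: backendflask-healthcheck
  ports:
    - port: 80
      # must match the containerPort of the Deployment (5000)
      targetPort: 5000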
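How failureThreshold interacts with the success rule for HTTP status codes can be illustrated with a small simulation. The sketch below is a simplified model, not kubelet code: the function names `is_success` and `first_failed_check` and the sequence of observed status codes are made up for illustration, and timeouts are left out.

```python
from typing import Iterable, Optional


def is_success(status_code: int) -> bool:
    """HTTP probe rule: codes from 200 up to, but excluding, 400 pass."""
    return 200 <= status_code < 400


def first_failed_check(
    status_codes: Iterable[int], failure_threshold: int
) -> Optional[int]:
    """Return the index of the probe at which the check counts as failed.

    The check fails once `failure_threshold` probes in a row were
    unsuccessful. A single successful probe resets the counter.
    Returns None if the threshold is never reached.
    """
    consecutive_failures = 0
    for index, code in enumerate(status_codes):
        if is_success(code):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return index
    return None


# Hypothetical responses of one endpoint, one entry per probe period.
observed = [200, 503, 503, 200, 500, 500, 500]

# failureThreshold: 3, as for the Liveness Probe in the manifest
print(first_failed_check(observed, failure_threshold=3))  # 6
# failureThreshold: 1, as for the Readiness Probe in the manifest
print(first_failed_check(observed, failure_threshold=1))  # 1
```

With a threshold of 3, the two failures at positions 1 and 2 are forgiven because the success at position 3 resets the counter. With a threshold of 1, the very first failure already counts, which is why a Readiness Probe configured this way reacts to a single slow or failed response.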