
Author: Yinghao Wang (Contributor of Chaos Mesh)

Editors: Ran Huang, Tom Dewan

Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates chaos in Kubernetes environments. Among its various tools, Chaos Mesh provides StressChaos, which allows you to inject CPU and memory stress into your Pods. This tool can be useful when you test or benchmark a CPU-sensitive or memory-sensitive program and want to know its behavior under pressure.

However, as we tested and used StressChaos, we found some usability and performance issues. For example, StressChaos consumed far less memory than we had configured.

To correct these issues, we developed a new set of tests. In this article, I’ll describe how we troubleshot and fixed these issues. This information can help you get the most out of StressChaos.

Injecting stress into a target Pod

Before you continue, you need to install Chaos Mesh in your cluster.

To begin with, I’ll walk you through how to inject StressChaos into a target Pod. For demonstration purposes, I’ll use hello-kubernetes, a demo app managed by Helm charts. The first step is to clone the hello-kubernetes repo and modify the chart to give it a resource limit.

git clone https://github.com/paulbouwer/hello-kubernetes.git
code deploy/helm/hello-kubernetes/values.yaml # or whatever editor you prefer

Find the resources configuration, and modify it as follows:

resources:
  requests:
    memory: "200Mi"
  limits:
    memory: "500Mi"

Before we inject StressChaos, let’s see how much memory the target Pod is currently consuming. Go into the Pod and start a shell. Enter the following command with the name of your Pod:

kubectl exec -it -n hello-kubernetes hello-kubernetes-hello-world-b55bfcf68-8mln6 -- /bin/sh

Display a summary of memory usage by entering:

/usr/src/app $ free -m
/usr/src/app $ top

As you can see from the output below, the Pod is consuming 4,269 MB of memory:

/usr/src/app $ free -m
       used
Mem:   4269
Swap:     0
/usr/src/app $ top
Mem: 12742432K used
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    1     0 node     S     285m   2%   0   0% npm start
   18     1 node     S     284m   2%   3   0% node server.js
   29     0 node     S     1636   0%   2   0% /bin/sh
   36    29 node     R     1568   0%   3   0% top

The top and free commands give similar answers, but those numbers don’t meet our expectations. We’ve limited the Pod’s memory usage to 500 MiB, but now it seems to be using several GiB.

To figure out the cause, we can run a StressChaos experiment on the Pod and see what happens. Here’s the YAML file we use:

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: mem-stress
  namespace: chaos-testing
spec:
  mode: all
  selector:
    namespaces:
      - hello-kubernetes
  stressors:
    memory:
      workers: 4
      size: 50MiB
      options: [""]
  duration: "1h"

Save the above file to memory.yaml. Apply the chaos experiment by running:

~ kubectl apply -f memory.yaml
stresschaos.chaos-mesh.org/mem-stress created
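If you want to confirm that the experiment is actually running before going back into the Pod, you can describe the StressChaos object we just created (a quick optional check, not part of the original walkthrough):

kubectl describe stresschaos mem-stress -n chaos-testing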

Let’s check the memory usage again:

       used
Mem:   4332
Swap:     0
Mem: 12805568K used
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
   54    50 root     R    53252   0%   1  24% {stress-ng-vm} stress-ng --vm 4 --vm-keep --vm-bytes 50000000
   57    52 root     R    53252   0%   0  22% {stress-ng-vm} stress-ng --vm 4 --vm-keep --vm-bytes 50000000
   55    53 root     R    53252   0%   2  21% {stress-ng-vm} stress-ng --vm 4 --vm-keep --vm-bytes 50000000
   56    51 root     R    53252   0%   3  21% {stress-ng-vm} stress-ng --vm 4 --vm-keep --vm-bytes 50000000
   18     1 node     S     289m   2%   2   0% node server.js
    1     0 node     S     285m   2%   0   0% npm start
   51    49 root     S    41048   0%   0   0% {stress-ng-vm} stress-ng --vm 4 --vm-keep --vm-bytes 50000000
   50    49 root     S    41048   0%   2   0% {stress-ng-vm} stress-ng --vm 4 --vm-keep --vm-bytes 50000000
   52    49 root     S    41048   0%   0   0% {stress-ng-vm} stress-ng --vm 4 --vm-keep --vm-bytes 50000000
   53    49 root     S    41048   0%   3   0% {stress-ng-vm} stress-ng --vm 4 --vm-keep --vm-bytes 50000000
   49     0 root     S    41044   0%   0   0% stress-ng --vm 4 --vm-keep --vm-bytes 50000000
   29     0 node     S     1636   0%   3   0% /bin/sh
   48    29 node     R     1568   0%   1   0% top

You can see that stress-ng instances are injected into the Pod. There is a 60 MiB rise in the Pod’s memory usage, which we didn’t expect. The stress-ng documentation indicates that the increase should be 200 MiB (4 * 50 MiB).

Let’s increase the stress by changing the memory stress from 50 MiB to 3,000 MiB. This should break the Pod’s memory limit. I’ll delete the chaos experiment, modify the memory size, and reapply it.
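If you are following along, the steps look roughly like this (a minimal sketch using the memory.yaml file we saved earlier):

kubectl delete -f memory.yaml
# edit memory.yaml and change `size: 50MiB` to `size: 3000MiB`
kubectl apply -f memory.yaml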

And then, boom! The shell exits with code 137. A moment later, I reconnect to the container, and the memory usage returns to normal. No stress-ng instances are found! What happened?

Why does StressChaos disappear?

In the above chaos experiment, we saw two abnormal behaviors: the memory usage doesn’t add up, and the shell exits. In this section, we are going to find out the causes of both.

Kubernetes limits the container’s memory usage through the cgroup mechanism. To see the 500 MiB limit in our Pod, go into the container and enter:

/usr/src/app $ cat /sys/fs/cgroup/memory/memory.limit_in_bytes
524288000

The output is displayed in bytes and translates to 500 * 1024 * 1024 (500 MiB).

Requests, in contrast, are used only for scheduling, that is, for deciding where to place the Pod. The Pod itself does not have a memory limit or request; its effective limit can be seen as the sum of the limits of all its containers.

So, we’ve been making mistakes since the very beginning. free and top are not “cgroup-aware.” They rely on /proc/meminfo (procfs) for data. Unfortunately, /proc/meminfo is old, so old that it predates cgroup. It provides you with host memory information instead of the container’s memory info.

Bearing that in mind, let’s start all over again and see what memory usage we get this time.

To get the “cgrouped” memory usage, enter:

/usr/src/app $ cat /sys/fs/cgroup/memory/memory.usage_in_bytes
39821312

Apply the 50 MiB StressChaos again, which yields the following result:

/usr/src/app $ cat /sys/fs/cgroup/memory/memory.usage_in_bytes
93577216

That is about 51 MiB more memory usage than without StressChaos.

Next question: why did our shell exit? Exit code 137 indicates “failure as container received SIGKILL” (137 = 128 + 9, and 9 is the signal number of SIGKILL). That leads us to check the Pod. Pay attention to the Pod’s state and events:

~ kubectl describe pods -n hello-kubernetes
......
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
......
Events:
  Type     Reason     Age                From     Message
  ----     ------     ----               ----     -------
......
  Warning  Unhealthy  10m (x4 over 16m)  kubelet  Readiness probe failed: Get "http://10.244.1.19:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Killing    10m (x2 over 16m)  kubelet  Container hello-kubernetes failed liveness probe, will be restarted
......

The events tell us why the shell crashed. hello-kubernetes has a liveness probe. When the container memory approaches the limit, the application starts to fail, and Kubernetes decides to terminate and restart the container. When the container restarts, StressChaos stops. By now, you can say that the StressChaos experiment works fine: it found a vulnerability in your Pod. You could now fix it and reapply the chaos.

Everything seems perfect now, except for one thing: why do four 50 MiB vm workers result in only about 51 MiB of extra usage in total? The answer will not reveal itself until we look into the stress-ng source code:

vm_bytes /= args->num_instances;

Oops! So the documentation is wrong. The vm workers take up the total size specified between them, rather than mmap-ing that much memory per worker. Finally, we have an answer for everything. In the following sections, we’ll discuss some other situations involving memory stress.
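Based on what we just observed, the size field behaves as a total across all vm workers. So if you really want the workers to consume about 200 MiB, a sketch of the adjusted stressors section would look like this (the rest of the manifest stays the same as before):

stressors:
  memory:
    workers: 4
    size: 200MiB  # treated as the total across all 4 vm workers, roughly 50 MiB each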

What if there was no liveness probe?

To figure out what happens if there is no liveness probe, let’s delete the probes and try again.

Find the following lines in deploy/helm/hello-kubernetes/templates/deployment.yaml and delete them.

livenessProbe:
  httpGet:
    path: /
    port: http
readinessProbe:
  httpGet:
    path: /
    port: http

After that, upgrade the deployment.
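One way to roll out the change is with helm upgrade; the release name and namespace below are assumptions based on the earlier commands, so adjust them to match your installation:

helm upgrade hello-kubernetes ./deploy/helm/hello-kubernetes -n hello-kubernetes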

Interestingly enough, in this scenario, the memory usage goes up continuously and then drops sharply, back and forth. What is happening now? Let’s check the kernel log. Pay attention to the last two lines.

/usr/src/app $ dmesg
...
[189937.362908] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[189937.363092] [441060]  1000 441060    63955     3791      80     3030           988 node
[189937.363145] [443265]     0 443265   193367    84215     197     9179          1000 stress-ng-vm
...
[189937.363148] Memory cgroup out of memory: Kill process 443160 (stress-ng-vm) score 1272 or sacrifice child
[189937.363186] Killed process 443160 (stress-ng-vm), UID 0, total-vm:773468kB, anon-rss:152704kB, file-rss:164kB, shmem-rss:0kB

It’s clear from the output that the stress-ng-vm process is being killed because of out-of-memory (OOM) errors.

If processes can’t get the memory they want, they are very likely to fail. Rather than waiting for processes to crash, the kernel kills some of them to release more memory. The OOM killer stops processes in a specific order and tries to recover the most memory while causing the least trouble. For detailed information on this mechanism, see this introduction to the OOM killer.

Looking at the output above, you can see that node, our application process that should never be terminated, has an oom_score_adj of 988. That is quite dangerous, because such a high score makes it one of the prime candidates for the OOM killer.
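If you want to check these values yourself, they are exposed through standard procfs files. Here is a quick sketch from inside the container (PID 1 is the npm process from the earlier top output):

/usr/src/app $ cat /proc/1/oom_score_adj    # the adjustment set for this container's main process
/usr/src/app $ cat /proc/1/oom_score        # the effective score the OOM killer compares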

To stop the OOM killer from killing a specific process, you can try a simple trick. When you create a Pod, assign the Pod a Quality of Service (QoS) class. Generally, if every container in a Pod specifies resource requests that exactly match its limits, the Pod is classified as a Guaranteed Pod. The OOM killer does not kill containers in a Guaranteed Pod if there are other candidates to kill, such as containers in non-Guaranteed Pods and the stress-ng workers. A Pod with no resource requests is marked as BestEffort and is likely to be killed by the OOM killer first.
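As an illustration, here is a minimal sketch of the resources block that would make a single-container Pod Guaranteed; the values are placeholders, and requests must equal limits for every container in the Pod:

resources:
  requests:
    memory: "500Mi"
    cpu: "500m"
  limits:
    memory: "500Mi"
    cpu: "500m"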

That’s all for the tour. When you inject StressChaos into your Pods, we have two suggestions:

  • Do not use free and top to assess memory in containers.
  • Be careful when you assign resource limits to your Pod, and select the right QoS.

In the future, we will create a more detailed StressChaos document.

Dive deeper into Kubernetes memory management

Kubernetes tries to evict Pods that use too much memory (but not more memory than their limits). Kubernetes gets your Pod’s memory usage from /sys/fs/cgroup/memory/memory.usage_in_bytes and then subtracts the total_inactive_file value in memory.stat from it.

Keep in mind that Kubernetes does not support swap. Even if you have a node with swap enabled, Kubernetes creates containers with swappiness=0, which means swap is actually disabled. That is mainly for performance concerns.

memory.usage_in_bytes equals the resident set plus the cache, and total_inactive_file is the part of the cache that the OS can reclaim when memory is running out. memory.usage_in_bytes - total_inactive_file is called the working_set. You can get this working_set value with kubectl top pod --containers. Kubernetes uses this value to decide whether or not to evict your Pods.
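If you want to double-check the number that kubectl top reports, you can compute the same working_set by hand from inside the container. A rough sketch using the cgroup v1 paths from earlier in this article:

/usr/src/app $ USAGE=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
/usr/src/app $ INACTIVE=$(grep total_inactive_file /sys/fs/cgroup/memory/memory.stat | awk '{print $2}')
/usr/src/app $ echo $((USAGE - INACTIVE))   # working_set in bytes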

Kubernetes checks memory usage periodically. If a container’s memory usage increases too quickly, or the container cannot be evicted, the OOM killer is invoked. Kubernetes has its own way of protecting its processes, so the OOM killer always picks the container. When a container is killed, it may or may not be restarted, depending on your restart policy. If it is killed, when you execute kubectl describe pod, you will see it was restarted and the reason is OOMKilled.

Another thing worth mentioning is kernel memory. Since v1.9, Kubernetes has enabled kernel memory support by default. It is also a feature of the cgroup memory subsystem: you can limit a container’s kernel memory usage. Unfortunately, this causes a cgroup leak on kernel versions up to v4.2. You can either upgrade your kernel to v4.3 or later, or disable the feature.

How we implement StressChaos

StressChaos is a simple way to test your container’s behavior when it is low on memory. StressChaos utilizes a powerful tool named stress-ng to allocate memory and continuously write into the allocated memory. Because containers have memory limits, and container limits are bound to a cgroup, we must find a way to run stress-ng in a specific cgroup. Luckily, this part is easy: with enough privileges, we can assign any process to any cgroup by writing to files under /sys/fs/cgroup/.
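For illustration only, here is a rough sketch of that mechanism on cgroup v1; the cgroup path and the stress-ng arguments are placeholders, and Chaos Mesh’s actual implementation handles these details for you. Writing a PID into a cgroup’s cgroup.procs file moves that process into the cgroup:

# the target container's memory cgroup (example placeholder path)
CGROUP=/sys/fs/cgroup/memory/kubepods/pod<pod-uid>/<container-id>

# start stress-ng, then move it into the container's cgroup
stress-ng --vm 4 --vm-keep --vm-bytes 50000000 &
echo $! > "$CGROUP/cgroup.procs"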

If you are interested in Chaos Mesh and would like to help us improve it, you’re welcome to join our Slack channel (#project-chaos-mesh)! Or submit your pull requests or issues to our GitHub repository.

Keep reading:
Develop a Daily Reporting System for Chaos Mesh to Improve System Resilience
Implementing Chaos Engineering in K8s: Chaos Mesh Principle Analysis and Control Plane Development

