Bringing Back Kubernetes
By fawzi
The problem
Shortly after my last day of work some issues came up and the kubernetes cluster went down. I began looking into it (a bit half-heartedly) and noticed some strange, seemingly random issues. Monitoring the kube-system namespace I saw that flannel problems cropped up quickly. There are several open issues about flannel failing (#963, #1076, …), and initially the real culprit wasn't clear.
Understanding the problem was made more difficult by the fact that our cluster is a bit of a mess. Keeping machines clean is not as easy as in a cloud environment, and the nodes were administered by at least four people (MPCDF, Harsha, Alfonso, and me; some machines by even more), working one after the other, independently, with little communication, and mostly having already left. This even led to a heterogeneous cluster, where not all machines have the same software or even the same kernel version. This should hopefully be solved by the clean setup of the machines used in the new cluster (something I had discussed and considered important before leaving).
Anyway, after several reinstalls of kubernetes, I finally managed to track the problem down to bugs in the kmem implementation of the CentOS 7.x kernel. CentOS declares kmem accounting as available, but its implementation is buggy. This is a serious problem, and several people have been bitten by it; one can find plenty of issues that are likely related to it (but often without the solution).
The flannel deployment sets kernel memory limits, and kubernetes uses docker to enforce them. Docker sees that the kernel claims to support kmem accounting and tries to set the limit. Unfortunately, bugs in the kernel mean that tracking these resources leads to leaks and other kinds of instability in the kernel. Corruption of the kernel affects the whole system and can lead to all sorts of strange behavior, even to a failure to reboot (we lost one node this way).
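One can see what docker actually applied by inspecting a running flannel container; this is just a sketch, and the name filter is illustrative:
# KernelMemory > 0 means docker set a kmem limit on this container
CID=$(docker ps --filter name=flannel --format '{{.ID}}' | head -n1)
docker inspect --format 'mem: {{.HostConfig.Memory}} kmem: {{.HostConfig.KernelMemory}}' "$CID"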
This is a kernel issue, but one that cannot be ignored given its severity, and workarounds for it appeared at various levels. On the opencontainers side, issue 1725 led to a flag for compiling the container runtime without kmem support in pull 1921, with remaining static uses fixed up in pull 1938. These fixes require the packager to set an environment variable when compiling (or to compile it oneself); CentOS hasn't done this yet.
Docker itself, starting with 18.09.1, tries to detect broken kernels and avoids using kmem on them.
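A quick way to check a node (assuming the standard cgroup v1 layout of CentOS 7):
# if this file exists, the kernel advertises kmem accounting
uname -r
ls /sys/fs/cgroup/memory/memory.kmem.limit_in_bytes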
Docker
I did try to update to docker 18.09.4 (which had just become available) from the recently installed 18.09.0, while avoiding the just-released kubernetes 1.14.0:
yum install docker-ce-cli docker-ce kubelet-1.13.4 kubeadm-1.13.4 kubectl-1.13.4 kubernetes-cni-0.6.0 --disableexcludes=kubernetes
But I encountered another flannel issue, maybe related to how the cluster was upgraded; to be safe I decided to switch back from the newer community edition (docker-ce) to the distribution's docker package.
# tear down kubernetes and stop the services
kubeadm reset
systemctl stop kubelet
systemctl disable kubelet
systemctl stop docker
systemctl disable docker
# remove the community-edition packages and the kubernetes tools
yum remove docker-ce docker-ce-cli kubelet kubeadm kubectl --disableexcludes=kubernetes
# flush the iptables rules left behind by kubernetes and flannel
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
# reinstall the distribution docker and a pinned kubernetes version
yum install docker kubelet-1.13.4 kubeadm-1.13.4 kubectl-1.13.4 kubernetes-cni-0.6.0 --disableexcludes=kubernetes
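Afterwards it is worth confirming that the pinned versions actually got installed:
# list the installed package versions and the docker client version
rpm -q docker kubelet kubeadm kubectl kubernetes-cni
docker --version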
Originally I had switched from docker to docker-ce because the latest kubernetes required a newer docker, and the needed patch for the older docker had not yet made it into the released package. By now docker has been patched, and the excellent Mesos issue 2018-0006 declares it tested and free of the problem. Mesos is an alternative to Kubernetes, but its description of the issue (which I found after having basically solved it) is by far the clearest and most comprehensive.
The current docker package uses overlay2 as its default storage driver. overlay2 is easier to set up and uses less memory when running the same image several times (file reuse is transparent to the OS). Still, just as in the old development install, I preferred the production-tested devicemapper driver with a dedicated lvm volume.
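Which driver is currently active can be checked with:
# prints e.g. overlay2 or devicemapper
docker info --format '{{.Driver}}'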
This means either editing
/etc/sysconfig/docker-storage
/etc/sysconfig/docker-storage-setup
(taking care that they are not in use, otherwise changes will be lost), or editing docker.service:
emacs -nw /usr/lib/systemd/system/docker.service
and using the original /etc/docker/daemon.json:
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.fs=xfs",
    "dm.thinpooldev=/dev/mapper/system-docker",
    "dm.use_deferred_removal=true",
    "dm.use_deferred_deletion=true",
    "dm.basesize=15G"
  ],
  "group": "dockerroot"
}
(The last line, "group": "dockerroot", was added later due to the ownership problem described further below; JSON does not allow comments, so it cannot be annotated in the file itself.)
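After editing the configuration, docker needs a restart; a quick sanity check (a sketch):
systemctl daemon-reload
systemctl restart docker
# confirm that devicemapper and the thin pool are in use
docker info | grep -i -A3 'storage driver'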
On one node the lvm volume had been disabled due to errors, and I restored it:
# remove the broken volume and recreate the data and metadata volumes
lvremove system/docker
lvcreate --wipesignatures y -n docker system -l 95%VG
lvcreate --wipesignatures y -n dockermeta system -l 1%VG
# combine them into the thin pool that docker's devicemapper driver uses
lvconvert -y --zero n -c 512K --thinpool system/docker --poolmetadata system/dockermeta
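The result can be verified with lvs:
# check that the thin pool exists and how full it is
lvs -o lv_name,lv_attr,lv_size,data_percent system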
Miscellaneous tweaks
It is worth noting some tweaks that were already in place on the nodes.
I had disabled swap:
# disable swap:
swapoff -a
# comment out swap lines (the command should do it, but I prefer manual edit)
# sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
vi /etc/fstab
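A quick check that swap is really off (kubelet refuses to start with active swap by default):
# no entries from swapon -s means no active swap
swapon -s
free -h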
There are entries in the crontab to get rid of unused docker images:
# docker remove unused containers, images, volumes
0 2 * * * /usr/bin/docker system prune -f
# docker prune every unused image
1 2 * * 6 /usr/bin/docker image prune -af
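To see how much space these prunes can reclaim, docker can report its disk usage:
# show disk usage of images, containers and volumes
docker system df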
A logrotate entry avoids filling up /var with container logs, /etc/logrotate.d/docker-container:
/var/lib/docker/containers/*/*.log {
rotate 7
daily
compress
size=1M
missingok
delaycompress
copytruncate
}
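The entry can be checked with a logrotate dry run:
# -d prints what would be done without rotating anything
logrotate -d /etc/logrotate.d/docker-container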
Finally, Harsha had added /etc/sysctl.d/100-nomad-flink.conf:
# Disable response to broadcasts.
# You don't want yourself becoming a Smurf amplifier.
net.ipv4.icmp_echo_ignore_broadcasts = 1
# enable route verification on all interfaces
net.ipv4.conf.all.rp_filter = 1
# enable ipV6 forwarding
#net.ipv6.conf.all.forwarding = 1
# increase the number of possible inotify(7) watches
fs.inotify.max_user_watches = 65536
# avoid deleting secondary IPs on deleting the primary IP
net.ipv4.conf.default.promote_secondaries = 1
net.ipv4.conf.all.promote_secondaries = 1
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 16384 67108864
kernel.core_pattern = core.%h.%p
kernel.panic_on_oops = 1
kernel.keys.root_maxkeys = 512
kernel.keys.maxkeys = 512
#
kernel.sysrq = 1
kernel.dmesg_restrict = 1
vm.overcommit_memory = 1
fs.protected_hardlinks = 1
fs.protected_symlinks = 1
#
net.core.somaxconn = 256
#
vm.max_map_count=262144
This changes some of the kernel limits (for example, those needed to run elasticsearch).
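The settings can be applied without a reboot and spot-checked:
# reload all files under /etc/sysctl.d and verify one value
sysctl --system
sysctl vm.max_map_count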
Kubernetes
With docker working again, kubernetes needed to be reinstalled:
reboot
# and then after it comes back up
systemctl enable docker && systemctl start docker
systemctl enable kubelet && systemctl start kubelet
kubeadm reset
After that, on the master node:
kubeadm init --pod-network-cidr=10.244.0.0/16
and on each node I ran the join command printed by the init command.
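If the join command is lost, it can be regenerated on the master:
kubeadm token create --print-join-command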
Finally (with coredns up and running), the flannel network was installed:
# get & install flannel
curl https://raw.githubusercontent.com/coreos/flannel/bc79dd1505b0c8681ece4de4c0d86c5cd2643275/Documentation/kube-flannel.yml > kube-flannel.yml
kubectl apply -f kube-flannel.yml
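Whether everything came up can be checked with:
# all nodes should be Ready, all kube-system pods Running
kubectl get nodes
kubectl get pods -n kube-system -o wide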
With that, kubernetes was finally back up and running, and it seems to be stable.
NOMAD deploy
With kubernetes working stably, the NOMAD services could then be installed. To simplify this, I changed the deploy scripts so that they no longer rely on a self-hosted Sonatype Nexus Docker registry, but instead on the MPCDF Gitlab (which now supports a Docker registry).
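The new flow looks roughly like this (a sketch; the registry hostname and image name are illustrative):
# authenticate against the gitlab registry, then build and push an image
docker login gitlab-registry.mpcdf.mpg.de
docker build -t gitlab-registry.mpcdf.mpg.de/nomad/some-service:latest .
docker push gitlab-registry.mpcdf.mpg.de/nomad/some-service:latest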