Kubernetes in Rootless Podman
Akihiro Suda, NTT
Podman Special Event 〜 OpenShift Lounge+ TALKs 〜 (Nov 16, 2023)
Rootless containers
• A technique for running container runtimes as a non-root user
• Available for LXC, Docker, Podman, containerd, etc.
• Mitigates potential vulnerabilities of container runtimes
  – Even if the runtime is compromised, it cannot affect files and processes owned by other user IDs
  – Less chance of stealth malware, as the kernel, firmware, etc. are protected
  – No ARP spoofing or DNS spoofing on the physical network
    https://blog.aquasec.com/dns-spoofing-kubernetes-clusters
Rootless containers
• Implemented by using User Namespaces
  – A feature of the Linux kernel
  – Maps the root in the UserNS to a non-root user outside the UserNS
  – dnf, apt-get, etc. just work, because they think they are running as the root
[Figure: UID 0 (root) inside the UserNS corresponds to UID 1000 outside the UserNS]
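The UID mapping described above is exposed by the kernel in /proc/self/uid_map, one range per line in the form `<inside-start> <outside-start> <length>`. The following is a minimal sketch, not taken from the talk: `map_uid` is a hypothetical helper, and the sample map (UID 1000 plus a subordinate range starting at 100000) is an assumed, typical rootless setup.

```shell
# Hypothetical helper (not from the talk): translate a UID inside a user
# namespace to the UID outside it. Reads uid_map-formatted lines on stdin:
#   <inside-start> <outside-start> <length>
map_uid() {
    awk -v uid="$1" '
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { if (!found) exit 1 }   # UID is not mapped in this namespace
    '
}

# Assumed sample map: root in the UserNS is UID 1000 outside, and UIDs
# 1..65536 come from a subordinate range starting at 100000.
sample_map='0 1000 1
1 100000 65536'

printf '%s\n' "$sample_map" | map_uid 0     # root inside the UserNS
printf '%s\n' "$sample_map" | map_uid 33    # an unprivileged UID inside
```

On a real system, `podman unshare cat /proc/self/uid_map` prints the live map in the same format, so its output can be piped into the helper.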
Rootless Kubernetes
• Began in 2018 https://twitter.com/_AkihiroSuda_/status/1019570064385642498
  – As old as Rootless Docker (pre-release at that time) and Rootless Podman
• The changes to Kubernetes were merged in Kubernetes v1.22 (Aug 2021)
  – Feature gate: “KubeletInUserNamespace” (Alpha)
KubeletInUserNamespace feature gate
• Somewhat of a misnomer: it refers to running all the node components (kubelet, kube-proxy, CRI, CNI, OCI) in a UserNS, not just the kubelet
• Root-in-UserNS is similar to the root, but has no permission for:
  – some sysctls
  – dmesg
• The feature gate allows ignoring these permission errors
  https://github.com/search?q=repo%3Akubernetes%2Fkubernetes%20KubeletInUserNamespace&type=code
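Feature gates such as this one are switched on through the kubelet configuration. A minimal sketch, assuming the kubelet is started with `--config` pointing at the generated file; `kubelet_userns_config` is a hypothetical helper name, and the other settings a real node needs are omitted.

```shell
# Hypothetical helper: print a minimal KubeletConfiguration that enables
# the KubeletInUserNamespace feature gate. Other kubelet settings that a
# real node needs are intentionally omitted.
kubelet_userns_config() {
    cat <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletInUserNamespace: true
EOF
}

kubelet_userns_config
```

The tools listed in this talk (kind, minikube, Usernetes) take care of this setting themselves, so the sketch is mainly useful for understanding what they configure.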
How to run Rootless Kubernetes
The easiest way to run Rootless Kubernetes today is to wrap a Kubernetes node in a Rootless container (such as Rootless Podman)
• kind
• minikube
• Usernetes (Gen 2)
kind (Kubernetes in Docker)
• https://kind.sigs.k8s.io/
• The most typical way to run Kubernetes in Docker (and in Podman)
• Supports multi-node, but only on a single host
  – 1 kind container = 1 Kubernetes node
• Not intended to be used for production environments
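Because one kind container is one Kubernetes node, a multi-node cluster on a single host is described by listing the nodes in a kind config file. A minimal sketch: `kind_config` is a hypothetical helper that prints such a config with one control-plane node and a chosen number of workers.

```shell
# Hypothetical helper: print a kind cluster config with one control-plane
# node and "$1" worker nodes (default: 2). Each node becomes one container.
kind_config() {
    workers="${1:-2}"
    printf '%s\n' 'kind: Cluster' 'apiVersion: kind.x-k8s.io/v1alpha4' 'nodes:'
    printf '%s\n' '- role: control-plane'
    i=0
    while [ "$i" -lt "$workers" ]; do
        printf '%s\n' '- role: worker'
        i=$((i + 1))
    done
}

kind_config 2
```

Saving the output to a file and passing it as `kind create cluster --config <file>` should produce three node containers on the same host.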
kind (Kubernetes in Docker): Usage
• A few steps need to be executed by root
  – These steps are needed for minikube, Usernetes, etc. too
• Needs cgroup v2 (RHEL >= 9, etc.)
# Allow limiting CPU, memory, etc. via cgroups
sudo mkdir -p /etc/systemd/system/user@.service.d
cat <<EOF | sudo tee /etc/systemd/system/user@.service.d/delegate.conf
[Service]
Delegate=cpu cpuset io memory pids
EOF
sudo systemctl daemon-reload
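Whether the delegation took effect can be checked by reading the `cgroup.controllers` file of the user's systemd instance. A minimal sketch: `missing_controllers` is a hypothetical helper, and the path below is the usual location on a systemd host with cgroup v2, which is an assumption about the host layout.

```shell
# Hypothetical helper: print each controller required by the Delegate=
# line above that is missing from a cgroup.controllers file.
# $1: path to a cgroup.controllers file (space-separated controller names).
missing_controllers() {
    delegated=" $(cat "$1") "
    for c in cpu cpuset io memory pids; do
        case "$delegated" in
            *" $c "*) ;;              # controller is delegated
            *) printf '%s\n' "$c" ;;  # controller is missing
        esac
    done
}

# Usual location for the current user's systemd instance (cgroup v2).
f="/sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers"
if [ -r "$f" ]; then
    missing_controllers "$f"
else
    echo "cgroup.controllers not found: cgroup v2 or a systemd user session may be unavailable"
fi
```

No output means all five controllers are delegated. A new login session may be needed before the change becomes visible.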
kind (Kubernetes in Docker): Usage
• A few steps need to be executed by root
  – These steps are needed for minikube, Usernetes, etc. too
# Load extra kernel modules
cat <<EOF | sudo tee /etc/modules-load.d/iptables.conf
ip6_tables
ip6table_nat
ip_tables
iptable_nat
EOF
sudo systemctl restart systemd-modules-load.service
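After the restart, the module list can be compared against what the kernel reports. A minimal sketch: `unloaded_modules` is a hypothetical helper that compares a modules-load.d file with a /proc/modules-style listing. It cannot see modules compiled into the kernel, so a reported name is a hint rather than proof of a problem.

```shell
# Hypothetical helper: print each module named in a modules-load.d file
# ($1) that does not appear in a /proc/modules-style listing ($2).
# Caveat: modules built into the kernel never appear in /proc/modules.
unloaded_modules() {
    while read -r mod; do
        case "$mod" in ''|'#'*) continue ;; esac     # skip blanks and comments
        awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$2" \
            || printf '%s\n' "$mod"
    done < "$1"
}

if [ -r /etc/modules-load.d/iptables.conf ] && [ -r /proc/modules ]; then
    unloaded_modules /etc/modules-load.d/iptables.conf /proc/modules
fi
```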
kind (Kubernetes in Docker): Usage
• https://kind.sigs.k8s.io/docs/user/rootless/
export KIND_EXPERIMENTAL_PROVIDER=podman
kind create cluster
kubectl get pods -A
minikube
• https://minikube.sigs.k8s.io/
• Originally designed for running Kubernetes in a VM
• Supports a kind-like mode too
minikube: Usage
• https://minikube.sigs.k8s.io/docs/drivers/podman/
• Make sure to set the “rootless” property; otherwise minikube executes podman with sudo
minikube config set rootless true
minikube start --driver=podman --container-runtime=crio
kubectl get pods -A
Usernetes
• https://github.com/rootless-containers/usernetes
• Rootless Kubernetes, since 2018
  – Gen 1 (2018-2023): “The hard way”
  – Gen 2 (2023-): depends on Rootless (Docker|Podman|nerdctl) for simplicity
• Supports real multi-node clusters with VXLAN
Usernetes: Gen 1 vs Gen 2
                                 | Gen 1 (2018-2023)                          | Gen 2 (2023-)
Host dependency                  | RootlessKit                                | Rootless Docker, Rootless Podman, or Rootless nerdctl (contaiNERD CTL)
Supports kubeadm                 | No                                         | Yes
Supports multi-node (multi-host) | Yes, but practically No, due to complexity | Yes
Supports hostPath volumes        | Yes                                        | Yes, for most paths, but needs an extra config
Gen 1: “The hard way”
Gen 2: Similar to `kind` and minikube, but supports real multi-node
Usernetes (Gen 2): How it works
[Figure: two hosts on the physical network 192.168.123.0/24, each running a Kubernetes node in a Podman container as a non-root user; the nodes share the Flannel network 10.244.0.0/16]
• Host 192.168.123.1: Podman container 10.100.45.3, Kubernetes (control plane); ports 6443/tcp (kube-apiserver), 10250/tcp (kubelet), 8472/udp (flannel)
• Host 192.168.123.2: Podman container 10.100.56.3, Kubernetes (worker); ports 10250/tcp (kubelet), 8472/udp (flannel)
Usernetes (Gen 2): How it works
[Figure: the same two hosts on the physical network 192.168.123.0/24 (Podman containers 10.100.45.3 and 10.100.56.3, Flannel 10.244.0.0/16, same ports as on the previous slide), annotated with a workaround for each node]
• Control plane node (host 192.168.123.1):
# Dirty workaround
ip addr add 192.168.123.1 dev eth0
• Worker node (host 192.168.123.2):
# Dirty workaround
ip addr add 192.168.123.2 dev eth0
Usernetes (Gen 2): Usage
# Bootstrap the first node
make up
make kubeadm-init
make install-flannel
# Enable kubectl
make kubeconfig
export KUBECONFIG=$(pwd)/kubeconfig
kubectl get pods -A
# Multi-node
make join-command
scp join-command another-host:~/usernetes
ssh another-host make -C ~/usernetes up kubeadm-join
Set `CONTAINER_ENGINE=podman` if multiple container engines are installed on the host
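When several engines are installed, the choice can be made explicit before running the `make` targets. A minimal sketch: `pick_engine` is a hypothetical helper and is not Usernetes' own detection logic; the preference order is an assumption.

```shell
# Hypothetical helper, not Usernetes' own detection logic: print the first
# candidate command that exists on PATH, or fail if none does.
pick_engine() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Assumed preference order; adjust as needed.
CONTAINER_ENGINE="$(pick_engine podman docker nerdctl)" \
    || echo "no container engine found" >&2
export CONTAINER_ENGINE
```

With the variable exported, the `make` commands above should pick it up from the environment without further changes.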
Future work
Multi-tenancy using multiple user IDs and multiple TCP ports
• A single host will be able to join multiple clusters
[Figure: one host (192.168.123.1) running two Podman containers under different user IDs, each a Kubernetes control plane with 6443/tcp (kube-apiserver), 10250/tcp (kubelet), and 8472/udp (flannel)]
• UID 1000: Podman container 10.100.45.3; host ports 10001/tcp, 10002/tcp, 10003/udp
• UID 2000: Podman container 10.200.45.3; host ports 20001/tcp, 20002/tcp, 20003/udp
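The port numbers in the figure suggest a per-user port range derived from the UID. A minimal sketch of such a scheme, which is an inference from the figure rather than an implemented feature: `tenant_port_flags` is a hypothetical helper, and the pairing of each host port with a container port is an assumption.

```shell
# Hypothetical scheme inferred from the figure, not an implemented feature:
# derive per-user host ports from the UID (UID 1000 -> 10001, 10002, 10003)
# and print them as `podman run`-style -p flags.
# The pairing of host ports to container ports is an assumption.
tenant_port_flags() {
    base=$(( $1 * 10 ))
    printf '%s\n' \
        "-p $((base + 1)):6443/tcp" \
        "-p $((base + 2)):10250/tcp" \
        "-p $((base + 3)):8472/udp"
}

tenant_port_flags 1000
tenant_port_flags 2000
```

This simple formula only yields valid port numbers for UIDs up to 6553; a real design would need a different allocation for larger UIDs.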
Future work
Promote the “KubeletInUserNamespace” gate from alpha to beta (and then GA)
• The blocker was how to test the gate in the upstream CI
• WIP: https://github.com/kubernetes/test-infra/pull/31085
  – Spawns rootless `kind` machines using Google Compute Engine
Future work
Eliminate the overhead of user-mode TCP/IP (slirp4netns, RootlessKit, and pasta)
• POC: https://github.com/rootless-containers/bypass4netns
• Captures socket-related syscalls in containers using seccomp_unotify(2), and replaces the socket FDs with ones that are created in the host network namespace
• Unsolved question: how to support VXLAN?
  VXLAN is implemented in the kernel, so VXLAN calls cannot be captured with seccomp_unotify(2)
Future work
Support running OKD (OpenShift) in Rootless Podman
• Help wanted from the OpenShift community
