2.3.3. Run container using Kubernetes¶
This section describes how to install a minimal Kubernetes cluster (one pod, running on the same node as the controller) while providing SR-IOV interfaces to the container, and illustrates the deployment of Virtual Service Router within this cluster. It has been tested on Ubuntu 20.04.
If you are already familiar with Kubernetes or if you already have a Kubernetes cluster deployed, you may want to skip the Kubernetes installation procedure and focus on Deploy Virtual Service Router into the cluster.
Note
To simplify the documentation, we assume that all commands are run by the root user.
Kubernetes installation¶
Memory configuration¶
To run Kubernetes, you need to disable swap:
# swapoff -a
Note
The kubelet agent running on each node fails to start if swap is enabled. It currently cannot guarantee that a pod requesting a given amount of memory will never swap during its lifecycle, and therefore cannot enforce memory limits, since swap is not accounted for. For now, the kubelet agent avoids this issue by requiring that swap be disabled.
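The swapoff -a command only disables swap until the next reboot. To make the change persistent, you may also comment out the swap entries in /etc/fstab; one possible way, assuming swap is configured through /etc/fstab, is:
# sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab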
The Virtual Service Router pod requires hugepages to run (the recommended value is 8GB per pod). If you need to spawn multiple Virtual Service Router pods on the same node, you may allocate more, for example 16GB per NUMA node:
# echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
# echo 16 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
# umount -f /dev/hugepages
# mkdir -p /dev/hugepages
# mount -t hugetlbfs none /dev/hugepages
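You can check that the hugepages were actually allocated by reading /proc/meminfo:
# grep Huge /proc/meminfo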
Install docker¶
Install the package and configure the daemon to use systemd as cgroup manager:
# apt update
# apt install -y docker.io
# systemctl enable docker
# cat > /etc/docker/daemon.json <<EOF
{
"exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
# systemctl restart docker
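You can verify that docker now uses systemd as cgroup manager:
# docker info | grep -i cgroup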
Install Kubernetes packages¶
First install the apt-transport-https package, which allows apt to retrieve packages from repositories over HTTPS:
# apt install -y apt-transport-https curl
Next, add the Kubernetes signing key:
# curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
Next, add the Kubernetes package repository. At the time of this writing, Ubuntu 16.04 Xenial is the latest Kubernetes repository available:
# echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list
# apt update
Now install Kubernetes:
# apt install -y kubeadm kubectl kubelet
Note
At the time of this writing, the Kubernetes version is 1.24.0.
Once the packages are installed, put them on hold, as upgrading Kubernetes requires more care than simply updating the packages provided by the distribution:
# apt-mark hold kubeadm kubectl kubelet
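You can check the installed version and confirm that the packages are held:
# kubeadm version
# apt-mark showhold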
Configure containerd as Kubernetes Container Runtime¶
Load required modules:
# modprobe overlay
# modprobe br_netfilter
# cat > /etc/modules-load.d/containerd.conf <<EOF
overlay
br_netfilter
EOF
Enable required sysctl:
# cat <<EOF | tee /etc/sysctl.d/99-kubernetes-cri.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
# sysctl --system
Extend the locked memory and open file limits in pods:
# mkdir -p /etc/systemd/system/containerd.service.d
# cat <<EOF | tee /etc/systemd/system/containerd.service.d/override.conf
[Service]
LimitMEMLOCK=4194304
LimitNOFILE=1048576
EOF
# avoid "too many open files" error in pods
sysctl fs.inotify.max_user_instances=2048
sysctl fs.inotify.max_user_watches=1048576
# systemctl daemon-reload
# systemctl restart containerd
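You can check that containerd took the new limits into account:
# systemctl show containerd -p LimitMEMLOCK -p LimitNOFILE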
Alter the default kubelet configuration in /etc/systemd/system/kubelet.service.d/10-kubeadm.conf to allow the modification of network-related sysctls, to use systemd as cgroup manager, and to enable the CPU manager and the Topology manager:
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml --allowed-unsafe-sysctls=net.* --cgroup-driver=systemd --reserved-cpus=0-3 --cpu-manager-policy=static --topology-manager-policy=best-effort --feature-gates=IPv6DualStack=true"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
Then reload the configuration and restart the daemon:
# systemctl daemon-reload
# systemctl restart kubelet
Note
The static CPU Manager policy enables the dedication of cores to a pod. The best-effort Topology Manager policy will try to reconcile the network devices and the dedicated cores that might be allocated to a pod, so that these resources are allocated from the same NUMA node. We also reserve a few cores for Kubernetes housekeeping daemons and processes with the --reserved-cpus argument.
See also
the CPU management policies and the Topology manager pages from Kubernetes documentation.
Cluster initialization¶
Initialize the cluster while providing the desired subnet for internal communication:
# kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket unix:///run/containerd/containerd.sock
# mkdir /root/.kube
# cp /etc/kubernetes/admin.conf /root/.kube/config
# chown $(id -u):$(id -g) /root/.kube/config
By default, your cluster will not schedule Pods on the controller node for security reasons. In our example of a single-machine Kubernetes cluster, we allow scheduling pods on the controller node:
# kubectl taint nodes --all node-role.kubernetes.io/master-
If using Kubernetes version >= 1.24, you also need the following command to remove the control-plane taint:
# kubectl taint nodes --all node-role.kubernetes.io/control-plane-
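At this point, you can check that the node is registered and that the control plane pods are starting (the node only becomes Ready once a network plugin is deployed, see below):
# kubectl get nodes
# kubectl get pods --all-namespaces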
Network configuration¶
In this subsection we will install several network plugins for Kubernetes. These are plugins that provide networking to the pods deployed in the cluster.
We will use:
flannel: a basic networking plugin leveraging veth interfaces to provide the default pod connectivity;
sriov-cni: to actually plumb host VFs into pods;
sriov-network-device-plugin: to allocate host resources to the pods;
multus: a CNI meta-plugin enabling the multiplexing of several other plugins required to make the previous three plugins work together.
Install flannel plugin¶
Flannel is responsible for allocating a subnet lease to each host out of a larger, preconfigured address space. In Kubernetes, each pod has a unique, routable IP inside the cluster. The transport between pods on different nodes is provided by a VXLAN overlay.
Download and install the YAML file describing this network provider:
# wget https://raw.githubusercontent.com/flannel-io/flannel/v0.13.1-rc1/Documentation/kube-flannel.yml
# kubectl apply -f kube-flannel.yml
This deploys the configuration and the daemonset that runs the flanneld binary on each node.
Note
This documentation has been tested with flannel v0.13.1-rc1.
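You can check that the flannel pod is running on the node before going further:
# kubectl get pods -n kube-system | grep flannel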
Install multus plugin¶
Multus is a meta-plugin that can leverage several other CNI plugins simultaneously. We use it here to provide a sriov-type configuration to an SR-IOV device allocated by the sriov-network-device-plugin. It can also provide additional network connectivity through other CNI plugins. When installed, it automatically includes the existing flannel plugin in its configuration, so that it provides this default connectivity to all pods in addition to explicitly defined network interfaces. This default connectivity is required by the cluster so that a pod can reach the external world (other pods on the same or on another node, internet resources, …).
Build the CNI:
# git clone https://github.com/intel/multus-cni.git
# cd multus-cni/
# git reset --hard v3.8
# ./hack/build-go.sh
Install the plugin binary:
# cp bin/multus /opt/cni/bin
Install the daemonset:
# kubectl create -f images/multus-daemonset.yml
Note
This documentation has been tested with multus v3.8.
Install sriov plugin¶
The purpose of the sriov-cni plugin is to configure the VF allocated to the containers.
# git clone https://github.com/intel/sriov-cni.git
# cd sriov-cni
# git reset --hard v2.6.1
# make build
# cp build/sriov /opt/cni/bin
Note
This documentation has been tested with sriov-cni v2.6.1.
Configure the NICs¶
In this example, we want to pass to the pod the two Intel NICs ens787f0 and ens804f0, which have the PCI addresses 0000:81:00.0 and 0000:83:00.0 respectively. We will create a Virtual Function for each interface (PCI addresses 0000:81:10.0 and 0000:83:10.0) and bind them to the vfio-pci driver.
Set the PF devices up and create the desired number of VFs for each NIC:
Note
This subsection applies to Intel network devices (ex: Niantic, Fortville). Other devices like Nvidia Mellanox NICs require different operations, which are not detailed in this document.
# ip link set ens787f0 up
# echo 1 > /sys/class/net/ens787f0/device/sriov_numvfs
# ip link set ens804f0 up
# echo 1 > /sys/class/net/ens804f0/device/sriov_numvfs
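The PCI address assigned to each new VF can be read from the virtfn* symbolic links of the PF, for example:
# ls -l /sys/class/net/ens787f0/device/virtfn*
# ls -l /sys/class/net/ens804f0/device/virtfn*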
Load the vfio-pci driver, then bind the VF devices to it:
# modprobe vfio-pci
# echo 0000:81:10.0 > /sys/bus/pci/devices/0000\:81\:10.0/driver/unbind
# echo vfio-pci > /sys/bus/pci/devices/0000\:81\:10.0/driver_override
# echo 0000:81:10.0 > /sys/bus/pci/drivers_probe
# echo 0000:83:10.0 > /sys/bus/pci/devices/0000\:83\:10.0/driver/unbind
# echo vfio-pci > /sys/bus/pci/devices/0000\:83\:10.0/driver_override
# echo 0000:83:10.0 > /sys/bus/pci/drivers_probe
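You can confirm that each VF is now bound to the vfio-pci driver with lspci, for example:
# lspci -nnk -s 81:10.0
# lspci -nnk -s 83:10.0
The "Kernel driver in use" line should report vfio-pci.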
Load the vhost-net driver, to enable the vhost networking backend, required for the exception path:
# modprobe vhost-net
See also
Depending on your system, additional configuration may be required. See Providing physical devices or virtual functions to the container paragraph.
Create Kubernetes networking resources¶
To pass the VF interfaces to the pod, we need to declare networking resources via the sriov-network-device-plugin. To do so, we create the following config-map-dpdk.yaml file:
apiVersion: v1
kind: ConfigMap
metadata:
name: sriovdp-config
namespace: kube-system
data:
config.json: |
{
"resourceList": [
{
"resourceName": "intel_sriov_nic_vsr1",
"resourcePrefix": "intel.com",
"selectors": {
"vendors": ["8086"],
"devices": ["10ed"],
"drivers": ["vfio-pci"],
"pfNames": ["ens787f0"],
"needVhostNet": true
}
},
{
"resourceName": "intel_sriov_nic_vsr2",
"resourcePrefix": "intel.com",
"selectors": {
"vendors": ["8086"],
"devices": ["10ed"],
"drivers": ["vfio-pci"],
"pfNames": ["ens804f0"],
"needVhostNet": true
}
}
]
}
This configuration file declares two different resource types, one for VF 0 of the Intel NIC ens787f0 and one for VF 0 of the Intel NIC ens804f0.
The selector keywords vendors and devices match the hexadecimal IDs that can be found in /sys/bus/pci/devices/<PCI_ID>/ of the VF devices.
The "needVhostNet": true
directive instructs the sriov-network-device-plugin
to mount the /dev/vhost-net
device alongside the VF devices into the pods. This
is required for all interfaces that will be handled by the fast path.
See also
The detailed explanation of the syntax can be found on the plugin website.
Now deploy this ConfigMap into the cluster:
# kubectl apply -f config-map-dpdk.yaml
Finally, we deploy the YAML file describing the daemonset enabling the SR-IOV network device plugin on all worker nodes. It leverages the ConfigMap that we just installed and the SR-IOV device plugin image reachable by all nodes. This file, named sriovdp-daemonset.yaml, has the following content:
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: sriov-device-plugin
namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: kube-sriov-device-plugin-amd64
namespace: kube-system
labels:
tier: node
app: sriovdp
spec:
selector:
matchLabels:
name: sriov-device-plugin
template:
metadata:
labels:
name: sriov-device-plugin
tier: node
app: sriovdp
spec:
hostNetwork: true
hostPID: true
nodeSelector:
beta.kubernetes.io/arch: amd64
tolerations:
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
serviceAccountName: sriov-device-plugin
containers:
- name: kube-sriovdp
image: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.3.2
imagePullPolicy: Always
args:
- --log-dir=sriovdp
- --log-level=10
securityContext:
privileged: true
resources:
requests:
cpu: "250m"
memory: "40Mi"
limits:
cpu: 1
memory: "200Mi"
volumeMounts:
- name: devicesock
mountPath: /var/lib/kubelet/
readOnly: false
- name: log
mountPath: /var/log
- name: config-volume
mountPath: /etc/pcidp
- name: device-info
mountPath: /var/run/k8s.cni.cncf.io/devinfo/dp
volumes:
- name: devicesock
hostPath:
path: /var/lib/kubelet/
- name: log
hostPath:
path: /var/log
- name: device-info
hostPath:
path: /var/run/k8s.cni.cncf.io/devinfo/dp
type: DirectoryOrCreate
- name: config-volume
configMap:
name: sriovdp-config
items:
- key: config.json
path: config.json
# kubectl apply -f sriovdp-daemonset.yaml
Note
We use the v3.3.2 stable version of sriov-network-device-plugin.
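Once the device plugin pod is running, the SR-IOV resources should be advertised by the node. You can check the allocatable resources (replace <node-name> with the name reported by kubectl get nodes):
# kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
The output should list intel.com/intel_sriov_nic_vsr1 and intel.com/intel_sriov_nic_vsr2 with a value of 1.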
Create the following multus-sriov-dpdk.yaml file to integrate the SR-IOV plugins into the multus environment, and then deploy it:
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
name: multus-intel-sriov-nic-vsr1
annotations:
k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_nic_vsr1
spec:
config: '{
"type": "sriov",
"cniVersion": "0.3.1",
"name": "sriov-intel-nic-vsr1",
"trust": "on",
"spoofchk": "off"
}'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
name: multus-intel-sriov-nic-vsr2
annotations:
k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_nic_vsr2
spec:
config: '{
"type": "sriov",
"cniVersion": "0.3.1",
"name": "sriov-intel-nic-vsr2",
"trust": "on",
"spoofchk": "off"
}'
# kubectl apply -f multus-sriov-dpdk.yaml
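You can list the network attachments known to the cluster:
# kubectl get network-attachment-definitions
Both multus-intel-sriov-nic-vsr1 and multus-intel-sriov-nic-vsr2 should appear in the output.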
Deploy Virtual Service Router into the cluster¶
Authenticate to 6WIND container registry¶
The Docker image is available at https://download.6wind.com/vsr/<arch>-ce/<version>.
First, create a Kubernetes secret to authenticate to the 6WIND registry, using the credentials provided by 6WIND support:
# kubectl create secret docker-registry regcred \
--docker-server=download.6wind.com \
--docker-username=<login> --docker-password=<password>
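You can check that the secret was created:
# kubectl get secret regcred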
Create the Virtual Service Router pod template¶
The pod is declared in a YAML file describing its properties and the way it is deployed.
For this example, we use the following file named vsr.yaml:
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: 6windgate-sysctl
spec:
seLinux:
rule: RunAsAny
runAsUser:
rule: RunAsAny
fsGroup:
rule: RunAsAny
supplementalGroups:
rule: RunAsAny
volumes:
- '*'
allowedUnsafeSysctls:
- net.*
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: 6windgate-role
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames: ['6windgate-sysctl']
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: 6windgate-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: 6windgate-role
subjects:
- kind: ServiceAccount
name: 6windgate-user
namespace: default
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: 6windgate-user
namespace: default
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: vsr
name: vsr
spec:
replicas: 1
selector:
matchLabels:
app: fastpath
template:
metadata:
labels:
app: fastpath
annotations:
k8s.v1.cni.cncf.io/networks: multus-intel-sriov-nic-vsr1,multus-intel-sriov-nic-vsr2
spec:
restartPolicy: Always
serviceAccountName: 6windgate-user
securityContext:
sysctls:
- name: net.ipv4.conf.default.disable_policy
value: "1"
- name: net.ipv4.ip_local_port_range
value: "30000 40000"
- name: net.ipv4.ip_forward
value: "1"
- name: net.ipv6.conf.all.forwarding
value: "1"
containers:
- image: download.6wind.com/vsr/x86_64-ce/3.5:3.5.1
imagePullPolicy: IfNotPresent
name: vsr
resources:
limits:
cpu: "4"
memory: "2Gi"
hugepages-1Gi: 8Gi
intel.com/intel_sriov_nic_vsr1: '1'
intel.com/intel_sriov_nic_vsr2: '1'
requests:
cpu: "4"
memory: "2Gi"
hugepages-1Gi: 8Gi
intel.com/intel_sriov_nic_vsr1: '1'
intel.com/intel_sriov_nic_vsr2: '1'
env:
securityContext:
capabilities:
add: ["NET_ADMIN", "SYS_ADMIN", "SYS_NICE", "IPC_LOCK", "NET_BROADCAST", "AUDIT_WRITE"]
volumeMounts:
- mountPath: /dev/hugepages
name: hugepage
- mountPath: /dev/shm
name: shm
- mountPath: /dev/net/tun
name: net
- mountPath: /dev/ppp
name: ppp
- mountPath: /sys/fs/cgroup
name: cgroup
- mountPath: /tmp
name: tmp
- mountPath: /run
name: run
- mountPath: /run/lock
name: run-lock
stdin: true
tty: true
imagePullSecrets:
- name: regcred
volumes:
- emptyDir:
medium: HugePages
sizeLimit: 8Gi
name: hugepage
- name: shm
emptyDir:
sizeLimit: "512Mi"
medium: "Memory"
- hostPath:
path: /dev/net/tun
type: ""
name: net
- hostPath:
path: /dev/ppp
type: ""
name: ppp
- hostPath:
path: /sys/fs/cgroup
type: ""
name: cgroup
- emptyDir:
sizeLimit: "200Mi"
medium: "Memory"
name: tmp
- emptyDir:
sizeLimit: "200Mi"
medium: "Memory"
name: run
- emptyDir:
sizeLimit: "200Mi"
medium: "Memory"
name: run-lock
Note
If the IOMMU of your server is disabled (which is not advised), the CAP_SYS_RAWIO capability must also be enabled when using a VF or PF interface in the container. See Kernel configuration paragraph.
This file contains one pod named vsr that contains the Virtual Service Router container. The pod is started with reduced capabilities and has both SR-IOV and flannel network interfaces attached.
This file also defines the following Kubernetes resources:
a PodSecurityPolicy named 6windgate-sysctl allowing the modification of the net.* sysctls;
a ServiceAccount user 6windgate-user in the default namespace;
a ClusterRole named 6windgate-role granted with the usage of the items listed in the PodSecurityPolicy;
a ClusterRoleBinding, giving the ServiceAccount user the ClusterRole previously defined.
The pod template declares, as an annotation, the multus network attachments that it requests. You can add more than one attachment if you need several interfaces.
The ServiceAccount 6windgate-user is then used to instantiate our pods so that they can set the net.* sysctls required by the Virtual Service Router in their SecurityContext.
The following sysctls are set:
| Name | Role |
|---|---|
| net.ipv4.conf.default.disable_policy | Allow the delivery of IPSec packets aiming at a local IP address |
| net.ipv4.ip_local_port_range | Define the local port ranges used for TCP and UDP |
| net.ipv4.ip_forward | Allow packet forwarding for IPv4 |
| net.ipv6.conf.all.forwarding | Allow packet forwarding for IPv6 |
Next, the container itself is described. The pod has only one container, which runs the Virtual Service Router image fetched from the 6WIND Docker repository.
Then the container declares the resources that it requests. The Kubernetes cluster will then find a node satisfying these requirements to schedule the pod. By setting the limits equal to the requests, we ask for this exact amount of resources. For example, for the vsr pod:
cpu: number of vCPU resources to be allocated. This reservation does not necessarily imply a CPU pinning. The cluster provides an amount of CPU cycles from all the host's CPUs that is equivalent to 4 logical CPUs;
memory: the pod requests 2GB of RAM;
hugepages-1Gi: the pod requests 8GB of memory from 1GB-hugepages. You can use hugepages of 2MB instead, provided that you allocate them at boot time and use the keyword hugepages-2Mi instead.
Note
Setting the CPU and memory limits respectively equal to the CPU and memory requests makes the pod qualify for the Guaranteed QoS class. See the configuration of the QoS for pods page of the Kubernetes documentation.
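Once the pod is deployed (see Deploy the pod below), you can verify its QoS class, for example:
# kubectl get pod <pod name> -o jsonpath='{.status.qosClass}'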
Warning
Make sure to request enough CPU resources to cover the CPU cores that you plan to configure in your fast-path.env and the processes of the control plane. Otherwise, your fast path cores will be throttled, resulting in severe performance degradation.
The following capabilities are granted to the container:
| Capability | Role |
|---|---|
| CAP_SYS_ADMIN | eBPF exception path, VRF, tcpdump |
| CAP_NET_ADMIN | General Linux networking |
| CAP_IPC_LOCK | Memory allocation for DPDK |
| CAP_SYS_NICE | Get NUMA information from memory |
| CAP_NET_BROADCAST | VRF support (notifications) |
| CAP_AUDIT_WRITE | Write records to kernel auditing log |
Finally we make sure to mount the following paths from the host into the container:
| Path | Role |
|---|---|
| /dev/hugepages, /dev/shm | The fast path requires access to hugepages for its shared memory |
| /dev/net/tun | Required for FPVI interfaces (exception path) |
| /dev/ppp | Required for PPP configuration |
| /sys/fs/cgroup | Required by systemd, can also be set only in the docker image |
| /tmp | Mount /tmp as tmpfs, may be required by some applications using the O_TMPFILE open flag |
| /run | Required by systemd |
| /run/lock | Required by systemd |
The /dev/ppp device only exists on the node if the following kernel module is loaded:
# modprobe ppp_generic
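To load this module automatically at boot, you may also declare it in /etc/modules-load.d/, for example:
# echo ppp_generic > /etc/modules-load.d/ppp.conf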
Deploy the pod¶
# kubectl apply -f vsr.yaml
You can then see your pod running:
# kubectl get pods
NAME READY STATUS RESTARTS AGE
vsr-55d6f69dcc-2tl76 1/1 Running 0 6m17s
You can get information about the running pod with:
$ kubectl describe pod vsr-55d6f69dcc-2tl76
Note
The pod can be deleted with kubectl delete -f vsr.yaml.
Connect to the pod¶
You can connect to the pod command line interface with the command kubectl exec -it <pod name> -- login. For example:
$ kubectl exec vsr-55d6f69dcc-2tl76 -it -- login