2.3.3. Run container using Kubernetes

This section describes how to install a minimal Kubernetes cluster (one pod, running on the same node as the controller), while providing SR-IOV interfaces to the container. It then illustrates the deployment of Virtual Service Router within this cluster. It has been tested on Ubuntu 20.04.

If you are already familiar with Kubernetes or if you already have a Kubernetes cluster deployed, you may want to skip the Kubernetes installation procedure and focus on Deploy Virtual Service Router into the cluster.

Note

To simplify the documentation, we assume that all commands are run by the root user.

Kubernetes installation

Memory configuration

To run Kubernetes, you need to disable swap:

# swapoff -a

Note

The kubelet agent running on each node fails to start if swap is enabled. It currently cannot guarantee that a pod requesting a given amount of memory will never swap during its lifecycle, and therefore cannot reliably enforce memory limits, since swap usage is not accounted for. For now, the kubelet agent avoids this issue by requiring that swap be disabled.
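You can check that no swap space remains active; both commands should report an empty result or a swap size of 0:

# swapon --show
# free -h | grep -i swap

Note that swapoff -a only lasts until the next reboot; to make the change persistent, also disable any swap entries in /etc/fstab (adapt this to your system).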

The Virtual Service Router pod requires hugepages to run (the recommended value is 8GB per pod). If you need to spawn multiple Virtual Service Router pods on the same node, you may allocate more, for example 16GB per NUMA node as shown below:

# echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
# echo 16 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
# umount -f /dev/hugepages
# mkdir -p /dev/hugepages
# mount -t hugetlbfs none /dev/hugepages
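You can confirm that the hugepages were allocated and that the hugetlbfs mount is in place:

# grep -i hugepages /proc/meminfo
# mount | grep hugetlbfs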

Install docker

Install the package and configure the daemon to use systemd as cgroup manager:

# apt update
# apt install -y docker.io
# systemctl enable docker
# cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
# systemctl restart docker
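You can verify that Docker now uses systemd as its cgroup manager; the output should report Cgroup Driver: systemd:

# docker info | grep -i "cgroup driver"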

Install Kubernetes packages

First install the apt-transport-https package, which allows APT to fetch packages over HTTPS, together with curl:

# apt install -y apt-transport-https curl

Next, add the Kubernetes signing key:

# curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -

Next, add the Kubernetes package repository. At the time of this writing, the repository for Ubuntu 16.04 Xenial is the most recent Kubernetes repository available:

# echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list
# apt update

Now install Kubernetes:

# apt install -y kubeadm kubectl kubelet

Note

At the time of this writing, the Kubernetes version is 1.24.0.

Once the packages are installed, put them on hold, as upgrading Kubernetes is more involved than a simple update of the packages provided by the distribution:

# apt-mark hold kubeadm kubectl kubelet
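You can check the installed version and confirm that the packages are held:

# kubeadm version
# kubectl version --client
# apt-mark showhold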

Configure containerd as Kubernetes Container Runtime

Load required modules:

# modprobe overlay
# modprobe br_netfilter

# cat > /etc/modules-load.d/containerd.conf <<EOF
overlay
br_netfilter
EOF

Enable required sysctl:

# cat <<EOF | tee /etc/sysctl.d/99-kubernetes-cri.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
# sysctl --system
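You can confirm that the new values are in effect:

# sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward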

Extend lock memory limit in pods:

# mkdir -p /etc/systemd/system/containerd.service.d
# cat <<EOF | tee /etc/systemd/system/containerd.service.d/override.conf
[Service]
LimitMEMLOCK=4194304
LimitNOFILE=1048576
EOF

# avoid "too many open files" error in pods
sysctl fs.inotify.max_user_instances=2048
sysctl fs.inotify.max_user_watches=1048576

# systemctl daemon-reload
# systemctl restart containerd
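You can check that containerd picked up the new limits; the output should show LimitMEMLOCK=4194304 and LimitNOFILE=1048576:

# systemctl show containerd -p LimitMEMLOCK -p LimitNOFILE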

Alter the default kubelet configuration in /etc/systemd/system/kubelet.service.d/10-kubeadm.conf to allow the modification of network-related sysctls, to use systemd as cgroup manager, and to enable the CPU Manager and the Topology Manager:

# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml --allowed-unsafe-sysctls=net.* --cgroup-driver=systemd --reserved-cpus=0-3 --cpu-manager-policy=static --topology-manager-policy=best-effort --feature-gates=IPv6DualStack=true"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS

Then reload the configuration and restart the daemon:

# systemctl daemon-reload
# systemctl restart kubelet
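To confirm that the modified drop-in is the one loaded by systemd, display the effective unit configuration; the output should include the --cpu-manager-policy and --topology-manager-policy arguments:

# systemctl cat kubelet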

Note

The static CPU Manager policy enables the dedication of cores to a pod. The best-effort Topology Manager policy will try to reconcile the network devices and the dedicated cores that might be allocated to a pod, so that these resources are allocated from the same NUMA node. We also reserve a few cores for Kubernetes housekeeping daemons and processes with the --reserved-cpus argument.

See also

the CPU management policies and the Topology manager pages from Kubernetes documentation.

Cluster initialization

Initialize the cluster while providing the desired subnet for internal communication:

# kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket unix:///run/containerd/containerd.sock
# mkdir /root/.kube
# cp /etc/kubernetes/admin.conf /root/.kube/config
# chown $(id -u):$(id -g) /root/.kube/config
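At this point you can already list the node; it will typically remain NotReady until a pod network plugin (flannel, installed below) is deployed:

# kubectl get nodes -o wide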

By default, your cluster will not schedule Pods on the controller node for security reasons. In our example of a single-machine Kubernetes cluster, we allow scheduling pods on the controller node:

# kubectl taint nodes --all node-role.kubernetes.io/master-

If using Kubernetes version >= 1.24, you also need the following command to remove the control-plane taint:

# kubectl taint nodes --all node-role.kubernetes.io/control-plane-
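You can verify that the taints were removed; the command should report Taints: <none> for the node:

# kubectl describe nodes | grep -i taints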

Network configuration

In this subsection we will install several network plugins for Kubernetes. These are plugins that provide networking to the pods deployed in the cluster.

We will use:

  • flannel: a basic networking plugin leveraging veth interfaces to provide the default pod connectivity;

  • sriov-cni: to actually plumb host VFs into pods;

  • sriov-network-device-plugin: to allocate host resources to the pods;

  • multus: a CNI meta-plugin that multiplexes several other CNI plugins; it is required to make the previous three plugins work together.

Install golang

The golang package is required to build multus and sriov-cni.

# apt install golang

Install flannel plugin

Flannel is responsible for allocating a subnet lease to each host out of a larger, preconfigured address space. In Kubernetes, each pod has a unique IP address that is routable inside the cluster. Transport between pods on different nodes is provided by a VXLAN overlay.

Download and install the YAML file describing this network provider:

# wget https://raw.githubusercontent.com/flannel-io/flannel/v0.13.1-rc1/Documentation/kube-flannel.yml
# kubectl apply -f kube-flannel.yml

This deploys the configuration and the daemonset that runs the flanneld binary on each node.

Note

This documentation has been tested with flannel v0.13.1-rc1.
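You can check that the flannel pod (deployed in the kube-system namespace with this version) reaches the Running state and that the node becomes Ready:

# kubectl get pods --all-namespaces | grep flannel
# kubectl get nodes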

Install multus plugin

Multus is a meta-plugin that can leverage several other CNI plugins simultaneously. We use it here to provide a sriov-type configuration to a SR-IOV device allocated by the sriov-network-device-plugin. It can also provide additional network connectivity through other CNI plugins. When installed, it automatically includes the existing flannel plugin in its configuration, so that this default connectivity is provided to all pods in addition to explicitly defined network interfaces. This default connectivity is required by the cluster so that a pod can reach the external world (other pods on the same or on another node, internet resources …).

Build the CNI:

# git clone https://github.com/intel/multus-cni.git
# cd multus-cni/
# git reset --hard v3.8
# ./hack/build-go.sh

Install the plugin binary:

# cp bin/multus /opt/cni/bin

Install the daemonset:

# kubectl create -f images/multus-daemonset.yml

Note

This documentation has been tested with multus v3.8.
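You can check that the multus daemonset is running and that it generated its CNI configuration on the node (the exact file name may vary with the multus version):

# kubectl get pods -n kube-system | grep multus
# ls /etc/cni/net.d/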

Install sriov plugin

The purpose of the sriov-cni plugin is to configure the VF allocated to the containers.

# git clone https://github.com/intel/sriov-cni.git
# cd sriov-cni
# git reset --hard v2.6.1
# make build
# cp build/sriov /opt/cni/bin

Note

This documentation has been tested with sriov-cni v2.6.1.

Configure the NICs

In this example, we want to pass the two Intel NICs ens787f0 and ens804f0, which have the PCI addresses 0000:81:00.0 and 0000:83:00.0 respectively, to the pod. We will create one Virtual Function per interface (PCI addresses 0000:81:10.0 and 0000:83:10.0) and bind them to the vfio-pci driver.

Set the PF devices up and create the desired number of VFs for each NIC:

Note

This subsection applies to Intel network devices (e.g. Niantic, Fortville). Other devices, like Nvidia Mellanox NICs, require different operations, which are not detailed in this document.

# ip link set ens787f0 up
# echo 1 > /sys/class/net/ens787f0/device/sriov_numvfs

# ip link set ens804f0 up
# echo 1 > /sys/class/net/ens804f0/device/sriov_numvfs
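You can verify that one VF was created on each PF:

# cat /sys/class/net/ens787f0/device/sriov_numvfs
# cat /sys/class/net/ens804f0/device/sriov_numvfs
# lspci | grep -i "virtual function"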

Bind the VFs devices to the vfio-pci driver:

# echo 0000:81:10.0 > /sys/bus/pci/devices/0000\:81\:10.0/driver/unbind
# echo vfio-pci > /sys/bus/pci/devices/0000\:81\:10.0/driver_override
# echo 0000:81:10.0 > /sys/bus/pci/drivers_probe

# echo 0000:83:10.0 > /sys/bus/pci/devices/0000\:83\:10.0/driver/unbind
# echo vfio-pci > /sys/bus/pci/devices/0000\:83\:10.0/driver_override
# echo 0000:83:10.0 > /sys/bus/pci/drivers_probe

# modprobe vfio-pci

Load the vhost-net driver, to enable the vhost networking backend, required for the exception path:

# modprobe vhost-net
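You can confirm that both VFs are now bound to the vfio-pci driver and that the vhost-net device node exists:

# lspci -nnk -s 0000:81:10.0 | grep -i "kernel driver"
# lspci -nnk -s 0000:83:10.0 | grep -i "kernel driver"
# ls -l /dev/vhost-net /dev/vfio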

See also

Depending on your system, additional configuration may be required. See Providing physical devices or virtual functions to the container paragraph.

Create Kubernetes networking resources

To pass the VF interfaces to the pod, we need to declare networking resources via the sriov-network-device-plugin.

To do so, we create the following config-map-dpdk.yaml file:

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
        "resourceList": [
            {
                "resourceName": "intel_sriov_nic_vsr1",
                "resourcePrefix": "intel.com",
                "selectors": {
                    "vendors": ["8086"],
                    "devices": ["10ed"],
                    "drivers": ["vfio-pci"],
                    "pfNames": ["ens787f0"],
                    "needVhostNet": true
                }
            },
            {
                "resourceName": "intel_sriov_nic_vsr2",
                "resourcePrefix": "intel.com",
                "selectors": {
                    "vendors": ["8086"],
                    "devices": ["10ed"],
                    "drivers": ["vfio-pci"],
                    "pfNames": ["ens804f0"],
                    "needVhostNet": true
                }
            }
        ]
    }

This configuration file declares two different resource types, one for VF 0 of the Intel NIC ens787f0 and one for VF 0 of the Intel NIC ens804f0.

The selector keywords vendors and devices match the hexadecimal IDs of the VF devices, which can be found in /sys/bus/pci/devices/<PCI_ID>/.
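For example, with the PCI addresses used in this document, the IDs can be read as follows; they should match the values used in the selectors (0x8086 and 0x10ed here):

# cat /sys/bus/pci/devices/0000:81:10.0/vendor
# cat /sys/bus/pci/devices/0000:81:10.0/device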

The "needVhostNet": true directive instructs the sriov-network-device-plugin to mount the /dev/vhost-net device alongside the VF devices into the pods. This is required for all interfaces that will be handled by the fast path.

See also

A detailed explanation of the syntax can be found on the plugin website.

Now deploy this ConfigMap into the cluster:

# kubectl apply -f config-map-dpdk.yaml

Finally, we deploy the YAML file describing the daemonset that enables the SR-IOV network device plugin on all worker nodes. It leverages the ConfigMap that we just installed and an SR-IOV device plugin image reachable by all nodes. This file, named sriovdp-daemonset.yaml, has the following content:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sriov-device-plugin
  namespace: kube-system

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-sriov-device-plugin-amd64
  namespace: kube-system
  labels:
    tier: node
    app: sriovdp
spec:
  selector:
    matchLabels:
      name: sriov-device-plugin
  template:
    metadata:
      labels:
        name: sriov-device-plugin
        tier: node
        app: sriovdp
    spec:
      hostNetwork: true
      hostPID: true
      nodeSelector:
        beta.kubernetes.io/arch: amd64
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      serviceAccountName: sriov-device-plugin
      containers:
      - name: kube-sriovdp
        image: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.3.2
        imagePullPolicy: Always
        args:
        - --log-dir=sriovdp
        - --log-level=10
        securityContext:
          privileged: true
        resources:
          requests:
            cpu: "250m"
            memory: "40Mi"
          limits:
            cpu: 1
            memory: "200Mi"
        volumeMounts:
        - name: devicesock
          mountPath: /var/lib/kubelet/
          readOnly: false
        - name: log
          mountPath: /var/log
        - name: config-volume
          mountPath: /etc/pcidp
        - name: device-info
          mountPath: /var/run/k8s.cni.cncf.io/devinfo/dp
      volumes:
        - name: devicesock
          hostPath:
            path: /var/lib/kubelet/
        - name: log
          hostPath:
            path: /var/log
        - name: device-info
          hostPath:
            path: /var/run/k8s.cni.cncf.io/devinfo/dp
            type: DirectoryOrCreate
        - name: config-volume
          configMap:
            name: sriovdp-config
            items:
            - key: config.json
              path: config.json

Deploy the daemonset:

# kubectl apply -f sriovdp-daemonset.yaml

Note

We use the v3.3.2 stable version of sriov-network-device-plugin.
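Once the device plugin pod is running, the VF resources should be advertised by the node. You can check this, replacing <node-name> with the name reported by kubectl get nodes; the output should contain intel.com/intel_sriov_nic_vsr1 and intel.com/intel_sriov_nic_vsr2 with a value of 1:

# kubectl get node <node-name> -o jsonpath='{.status.allocatable}'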

Create the following multus-sriov-dpdk.yaml file to integrate the SR-IOV plugins into the multus environment and then deploy it:

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: multus-intel-sriov-nic-vsr1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_nic_vsr1
spec:
  config: '{
  "type": "sriov",
  "cniVersion": "0.3.1",
  "name": "sriov-intel-nic-vsr1",
  "trust": "on",
  "spoofchk": "off"
}'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: multus-intel-sriov-nic-vsr2
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_nic_vsr2
spec:
  config: '{
  "type": "sriov",
  "cniVersion": "0.3.1",
  "name": "sriov-intel-nic-vsr2",
  "trust": "on",
  "spoofchk": "off"
}'

Then deploy it:

# kubectl apply -f multus-sriov-dpdk.yaml
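You can list the resulting NetworkAttachmentDefinition objects:

# kubectl get network-attachment-definitions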

Deploy Virtual Service Router into the cluster

Authenticate to 6WIND container registry

The Docker image is available at: https://download.6wind.com/vsr/<arch>-ce/<version>.

First, create a Kubernetes secret to authenticate to the 6WIND registry, using the credentials provided by 6WIND support:

# kubectl create secret docker-registry regcred \
    --docker-server=download.6wind.com \
    --docker-username=<login> --docker-password=<password>
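You can check that the secret was created in the default namespace:

# kubectl get secret regcred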

Create the Virtual Service Router pod template

The pod is declared in a YAML file describing its properties and the way it is deployed.

For this example, we use the following file named vsr.yaml:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: 6windgate-sysctl
spec:
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
  - '*'
  allowedUnsafeSysctls:
  - net.*
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: 6windgate-role
rules:
  - apiGroups: ['policy']
    resources: ['podsecuritypolicies']
    verbs: ['use']
    resourceNames: ['6windgate-sysctl']
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: 6windgate-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: 6windgate-role
subjects:
- kind: ServiceAccount
  name: 6windgate-user
  namespace: default
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: 6windgate-user
  namespace: default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: vsr
  name: vsr
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fastpath
  template:
    metadata:
      labels:
        app: fastpath
      annotations:
         k8s.v1.cni.cncf.io/networks: multus-intel-sriov-nic-vsr1,multus-intel-sriov-nic-vsr2
    spec:
      restartPolicy: Always
      serviceAccountName: 6windgate-user
      securityContext:
        sysctls:
        - name: net.ipv4.conf.default.disable_policy
          value: "1"
        - name: net.ipv4.ip_local_port_range
          value: "30000 40000"
        - name: net.ipv4.ip_forward
          value: "1"
        - name: net.ipv6.conf.all.forwarding
          value: "1"
      containers:
      - image: download.6wind.com/vsr/x86_64-ce/3.5:3.5.1
        imagePullPolicy: IfNotPresent
        name: vsr
        resources:
          limits:
            cpu: "4"
            memory: "2Gi"
            hugepages-1Gi: 8Gi
            intel.com/intel_sriov_nic_vsr1: '1'
            intel.com/intel_sriov_nic_vsr2: '1'
          requests:
            cpu: "4"
            memory: "2Gi"
            hugepages-1Gi: 8Gi
            intel.com/intel_sriov_nic_vsr1: '1'
            intel.com/intel_sriov_nic_vsr2: '1'
        env:
        securityContext:
          capabilities:
            add: ["NET_ADMIN", "SYS_ADMIN", "SYS_NICE", "IPC_LOCK", "NET_BROADCAST", "AUDIT_WRITE"]
        volumeMounts:
        - mountPath: /dev/hugepages
          name: hugepage
        - mountPath: /dev/shm
          name: shm
        - mountPath: /dev/net/tun
          name: net
        - mountPath: /dev/ppp
          name: ppp
        - mountPath: /sys/fs/cgroup
          name: cgroup
        - mountPath: /tmp
          name: tmp
        - mountPath: /run
          name: run
        - mountPath: /run/lock
          name: run-lock
        stdin: true
        tty: true
      imagePullSecrets:
      - name: regcred
      volumes:
      - emptyDir:
          medium: HugePages
          sizeLimit: 8Gi
        name: hugepage
      - name: shm
        emptyDir:
          sizeLimit: "512Mi"
          medium: "Memory"
      - hostPath:
          path: /dev/net/tun
          type: ""
        name: net
      - hostPath:
          path: /dev/ppp
          type: ""
        name: ppp
      - hostPath:
          path: /sys/fs/cgroup
          type: ""
        name: cgroup
      - emptyDir:
          sizeLimit: "200Mi"
          medium: "Memory"
        name: tmp
      - emptyDir:
          sizeLimit: "200Mi"
          medium: "Memory"
        name: run
      - emptyDir:
          sizeLimit: "200Mi"
          medium: "Memory"
        name: run-lock

Note

If the IOMMU of your server is disabled (which is not advised), the CAP_SYS_RAWIO capability must also be enabled when using a VF or PF interface in the container. See Kernel configuration paragraph.

This file contains one pod named vsr that contains the Virtual Service Router container. The pod is started with reduced capabilities and has both SR-IOV and flannel network interfaces attached.

This file also defines the following Kubernetes resources:

  • a PodSecurityPolicy named 6windgate-sysctl allowing the modification of the net.* sysctls

  • a ServiceAccount user 6windgate-user in the default namespace

  • a ClusterRole named 6windgate-role granted with the usage of the items listed in the PodSecurityPolicy

  • a ClusterRoleBinding, giving the ServiceAccount user the ClusterRole previously defined

The pod template declares, as an annotation, the multus network attachments that it requests. You can add more than one attachment if you need several interfaces.

The ServiceAccount 6windgate-user is then used to instantiate our pods so that they can set the net.* sysctls required by the Virtual Service Router in their SecurityContext.

The following sysctls are set:

Sysctls for VSR:

  • net.ipv4.conf.default.disable_policy: allow the delivery of IPsec packets aimed at a local IP address

  • net.ipv4.ip_local_port_range: define the local port range used for TCP and UDP

  • net.ipv4.ip_forward: allow packet forwarding for IPv4

  • net.ipv6.conf.all.forwarding: allow packet forwarding for IPv6

Next, the container itself is described. The pod has only one container, which runs the Virtual Service Router image fetched from the 6WIND container registry.

Then the container declares the resources that it requests; the Kubernetes cluster will find a node satisfying these requirements to schedule the pod. By setting the limits equal to the requests, we ask for this exact amount of resources. For the vsr pod:

  • cpu: the number of vCPU resources to allocate. This reservation does not necessarily imply CPU pinning: the cluster provides an amount of CPU cycles, taken from all the host's CPUs, that is equivalent to 4 logical CPUs.

  • memory: the pod requests 2GB of RAM.

  • hugepages-1Gi: the pod requests 8GB of memory from 1GB hugepages. You can use 2MB hugepages instead, provided that you allocate them at boot time and use the keyword hugepages-2Mi.

Note

Setting the CPU and memory limits respectively equal to the CPU and memory requests makes the pod qualify for the Guaranteed QoS class. See the configuration of the QoS for pods page of Kubernetes documentation.

Warning

Make sure to request enough CPU resources to cover both the CPU cores that you plan to configure in your fast-path.env and the processes of the control plane. Otherwise your fast path cores will be throttled, resulting in severe performance degradation.

The following capabilities are granted to the container:

Capabilities for VSR:

  • CAP_SYS_ADMIN: eBPF exception path, VRF, tcpdump

  • CAP_NET_ADMIN: general Linux networking

  • CAP_IPC_LOCK: memory allocation for DPDK

  • CAP_SYS_NICE: get NUMA information from memory

  • CAP_NET_BROADCAST: VRF support (notifications)

  • CAP_AUDIT_WRITE: write records to the kernel audit log

Finally we make sure to mount the following paths from the host into the container:

Host resources for VSR:

  • /dev/hugepages, /dev/shm: the fast path requires access to hugepages for its shared memory

  • /dev/net/tun: required for FPVI interfaces (exception path)

  • /dev/ppp: required for PPP configuration

  • /sys/fs/cgroup: required by systemd; can also be set only in the docker image

  • /tmp: mounted as tmpfs; may be required by some applications using the O_TMPFILE open flag

  • /run: required by systemd

  • /run/lock: required by systemd

The /dev/ppp device only exists on the node if the following kernel module is loaded:

# modprobe ppp_generic
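To load this module automatically at boot, you can register it the same way as the containerd modules above (the file name ppp.conf is arbitrary), then check that the device node exists:

# echo ppp_generic > /etc/modules-load.d/ppp.conf
# ls -l /dev/ppp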

Deploy the pod

# kubectl apply -f vsr.yaml

You can then see your pod running:

# kubectl get pods
NAME                   READY   STATUS    RESTARTS   AGE
vsr-55d6f69dcc-2tl76   1/1     Running   0          6m17s

You can get information about the running pod with:

$ kubectl describe pod vsr-55d6f69dcc-2tl76

Note

The pod can be deleted with kubectl delete -f vsr.yaml.

Connect to the pod

You can connect to the pod command line interface with the command kubectl exec -it <pod name> -- login. For example:

$ kubectl exec vsr-55d6f69dcc-2tl76 -it -- login
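You can also run one-off commands inside the pod instead of opening a login shell, for example to check that the SR-IOV interfaces, the hugepages and the vhost-net device are visible from the container (assuming the usual Linux tools are available in the image):

$ kubectl exec vsr-55d6f69dcc-2tl76 -- ip link
$ kubectl exec vsr-55d6f69dcc-2tl76 -- grep -i hugepages /proc/meminfo
$ kubectl exec vsr-55d6f69dcc-2tl76 -- ls -l /dev/vhost-net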