2.3.3. Run container using Kubernetes

This section describes how to install a minimal Kubernetes cluster (a single pod, running on the same node as the controller) while providing SR-IOV interfaces to the container, and illustrates the deployment of Virtual Service Router within this cluster. It has been tested on Ubuntu 20.04 and Ubuntu 22.04.

If you are already familiar with Kubernetes or if you already have a Kubernetes cluster deployed, you may want to skip the Kubernetes installation procedure and focus on Install smarter-device-manager plugin.

Note

To simplify the documentation, we assume that all commands are run by the root user.

Kubernetes installation

Load kernel modules

Load the required kernel modules on the host node, as listed in Kernel modules.

Memory configuration

To run Kubernetes, you need to disable swap:

# swapoff -a
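Note that swapoff only disables swap until the next reboot. To make the change persistent, you can also comment out the swap entries in /etc/fstab (a minimal sketch, assuming swap is configured there and not, for example, via a systemd swap unit):

# sed -i '/\sswap\s/ s/^/#/' /etc/fstab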

Note

The kubelet agent running on each node fails to start if swap is enabled. It currently cannot guarantee that a pod requesting a given amount of memory will never swap during its lifecycle, and therefore cannot enforce memory limits, since swap usage is not accounted for. For now, the kubelet avoids this issue by requiring that swap be disabled.

The Virtual Service Router pod requires hugepages to run (the recommended value is 8GB per pod). If you need to spawn multiple Virtual Service Router pods on the same node, you may allocate more, for example 16GB per NUMA node:

# echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
# echo 16 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
# umount -f /dev/hugepages
# mkdir -p /dev/hugepages
# mount -t hugetlbfs -o pagesize=1G none /dev/hugepages
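You can verify that the pages were actually reserved on each NUMA node (the allocation may be smaller than requested if memory is too fragmented):

# cat /sys/devices/system/node/node*/hugepages/hugepages-1048576kB/nr_hugepages
# grep Huge /proc/meminfo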

Install containerd

At the time of writing, the latest Kubernetes stable version is 1.29.2.

Kubernetes version >= 1.26 needs containerd >= 1.6. If such a version is not available in your distribution repository, you may install it from the Docker repository using the following procedure:

# curl -s https://download.docker.com/linux/ubuntu/gpg | apt-key add -
# echo "deb [arch=$(dpkg --print-architecture)] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" \
  | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# apt -qy update
# apt -qy install containerd.io apparmor apparmor-utils

Install Kubernetes packages

First install the apt-transport-https package (which allows apt to retrieve packages over HTTPS from repositories) and curl:

# apt install -y apt-transport-https curl

Next, add the Kubernetes signing key:

# K8S_VERSION=1.29
# mkdir -p /etc/apt/keyrings
# curl -fsSL "https://pkgs.k8s.io/core:/stable:/v${K8S_VERSION}/deb/Release.key" | \
  gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg

Next, add the Kubernetes package repository:

# echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v${K8S_VERSION}/deb/ /" | \
  tee /etc/apt/sources.list.d/kubernetes.list
# apt update

Now install Kubernetes packages:

# apt install -y kubeadm kubectl kubelet

Once the packages are installed, put them on hold: upgrading Kubernetes is more involved than a simple update of the packages provided by the distribution:

# apt-mark hold kubeadm kubectl kubelet
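You can list the held packages with apt-mark showhold. When you later perform a controlled cluster upgrade, remove the hold before updating the packages:

# apt-mark showhold
# apt-mark unhold kubeadm kubectl kubelet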

Configure containerd as Kubernetes Container Runtime

Enable required sysctls:

# cat <<EOF | tee /etc/sysctl.d/99-kubernetes-cri.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
fs.inotify.max_user_instances = 2048
fs.inotify.max_user_watches = 1048576
EOF
# sysctl -p /etc/sysctl.d/99-kubernetes-cri.conf

Extend lock memory limit in pods:

# mkdir -p /etc/systemd/system/containerd.service.d
# cat <<EOF | tee /etc/systemd/system/containerd.service.d/override.conf
[Service]
LimitMEMLOCK=4194304
LimitNOFILE=1048576
EOF

Create the containerd configuration, and restart it:

# mkdir -p /etc/containerd
# containerd config default > /etc/containerd/config.toml
# sed -i 's/SystemdCgroup \= false/SystemdCgroup \= true/g' /etc/containerd/config.toml

# systemctl daemon-reload
# systemctl restart containerd
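You can check that containerd restarted correctly and that the systemd cgroup driver is enabled in the effective configuration:

# systemctl is-active containerd
# containerd config dump | grep SystemdCgroup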

Cluster initialization

Create a kubelet-config.yaml with the following content:

apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  criSocket: "unix:///run/containerd/containerd.sock"
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  podSubnet: 10.229.0.0/16
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
reservedSystemCPUs: 0-3
allowedUnsafeSysctls:
- net.*
cpuManagerPolicy: static
topologyManagerPolicy: best-effort

The static CPU Manager policy enables the dedication of cores to a pod. The best-effort Topology Manager policy tries to reconcile the network devices and the dedicated cores that might be allocated to a pod, so that these resources are allocated from the same NUMA node. We also reserve a few cores for Kubernetes housekeeping daemons and processes with the reservedSystemCPUs setting.

See also

the CPU management policies and the Topology manager pages from Kubernetes documentation.

The desired subnet for internal communication is also provided in this file and can be modified.

# kubeadm config images pull
# kubeadm init --config kubelet-config.yaml
# mkdir /root/.kube
# cp /etc/kubernetes/admin.conf /root/.kube/config
# chown $(id -u):$(id -g) /root/.kube/config

By default, your cluster will not schedule Pods on the controller node for security reasons. In our example of a single-machine Kubernetes cluster, we allow scheduling pods on the controller node:

# kubectl taint nodes --all node-role.kubernetes.io/control-plane-
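You can check that the taint was removed with kubectl describe. At this stage, the node may still report NotReady: it becomes Ready once a pod network plugin (flannel, installed below) is deployed.

# kubectl get nodes
# kubectl describe nodes | grep -i taints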

Network configuration

In this subsection we will install several network plugins for Kubernetes. These are plugins that provide networking to the pods deployed in the cluster.

We will use:

  • flannel: a basic networking plugin leveraging veth interfaces to provide the default pod connectivity;

  • sriov-cni: to actually plumb host VFs into pods;

  • sriov-network-device-plugin: to allocate host resources to the pods;

  • multus: a CNI meta-plugin that multiplexes several other CNI plugins; it is required to make the previous three plugins work together.

Install golang

Golang is required to build multus and sriov-cni.

# cd /root
# wget https://golang.org/dl/go1.18.5.linux-amd64.tar.gz
# tar -C /usr/local -xf go1.18.5.linux-amd64.tar.gz
# export PATH=$PATH:/usr/local/go/bin
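You may want to make the PATH change persistent for future shells, and check the installed version (a sketch, assuming a root bash shell):

# echo 'export PATH=$PATH:/usr/local/go/bin' >> /root/.bashrc
# go version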

Install flannel plugin

Flannel is responsible for allocating a subnet lease to each host out of a larger, preconfigured address space. In Kubernetes, each pod has a unique IP address that is routable inside the cluster. Transport between pods on different nodes is ensured by a VXLAN overlay.

Create a kube-flannel.yaml file with the following content:

---
kind: Namespace
apiVersion: v1
metadata:
  name: kube-flannel
  labels:
    pod-security.kubernetes.io/enforce: privileged
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: flannel
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: flannel
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: flannel
subjects:
- kind: ServiceAccount
  name: flannel
  namespace: kube-flannel
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: flannel
  namespace: kube-flannel
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-flannel
  labels:
    tier: node
    app: flannel
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "flannel",
          "delegate": {
            "hairpinMode": true,
            "isDefaultGateway": true
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "10.229.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-flannel-ds
  namespace: kube-flannel
  labels:
    tier: node
    app: flannel
spec:
  selector:
    matchLabels:
      app: flannel
  template:
    metadata:
      labels:
        tier: node
        app: flannel
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
      hostNetwork: true
      priorityClassName: system-node-critical
      tolerations:
      - operator: Exists
        effect: NoSchedule
      serviceAccountName: flannel
      initContainers:
      - name: install-cni-plugin
       #image: flannelcni/flannel-cni-plugin:v1.1.0 for ppc64le and mips64le (dockerhub limitations may apply)
        image: docker.io/rancher/mirrored-flannelcni-flannel-cni-plugin:v1.1.0
        command:
        - cp
        args:
        - -f
        - /flannel
        - /opt/cni/bin/flannel
        volumeMounts:
        - name: cni-plugin
          mountPath: /opt/cni/bin
      - name: install-cni
       #image: flannelcni/flannel:v0.19.2 for ppc64le and mips64le (dockerhub limitations may apply)
        image: docker.io/rancher/mirrored-flannelcni-flannel:v0.19.2
        command:
        - cp
        args:
        - -f
        - /etc/kube-flannel/cni-conf.json
        - /etc/cni/net.d/10-flannel.conflist
        volumeMounts:
        - name: cni
          mountPath: /etc/cni/net.d
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
      containers:
      - name: kube-flannel
       #image: flannelcni/flannel:v0.19.2 for ppc64le and mips64le (dockerhub limitations may apply)
        image: docker.io/rancher/mirrored-flannelcni-flannel:v0.19.2
        command:
        - /opt/bin/flanneld
        args:
        - --ip-masq
        - --kube-subnet-mgr
        resources:
          requests:
            cpu: "100m"
            memory: "50Mi"
          limits:
            cpu: "100m"
            memory: "50Mi"
        securityContext:
          privileged: false
          capabilities:
            add: ["NET_ADMIN", "NET_RAW"]
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: EVENT_QUEUE_DEPTH
          value: "5000"
        volumeMounts:
        - name: run
          mountPath: /run/flannel
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
        - name: xtables-lock
          mountPath: /run/xtables.lock
      volumes:
      - name: run
        hostPath:
          path: /run/flannel
      - name: cni-plugin
        hostPath:
          path: /opt/cni/bin
      - name: cni
        hostPath:
          path: /etc/cni/net.d
      - name: flannel-cfg
        configMap:
          name: kube-flannel-cfg
      - name: xtables-lock
        hostPath:
          path: /run/xtables.lock
          type: FileOrCreate

Install the YAML file describing this network provider:

# kubectl apply -f kube-flannel.yaml

This deploys the configuration and the daemonset that runs the flanneld binary on each node.
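You can check that the flannel daemonset pod is running in the kube-flannel namespace and that the node becomes Ready:

# kubectl get pods -n kube-flannel
# kubectl get nodes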

Note

This documentation has been tested with flannel v0.19.2.

Install multus plugin

Multus is a meta-plugin that can leverage several other CNI plugins simultaneously. We use it here to provide a sriov-type configuration to an SR-IOV device allocated by the sriov-network-device-plugin. It can also provide additional network connectivity through other CNI plugins. When installed, it automatically includes the existing flannel plugin in its configuration, so that this default connectivity is provided to all pods in addition to the explicitly defined network interfaces. This default connectivity is required by the cluster so that a pod can reach the external world (other pods on the same or on another node, internet resources, …).

Build the CNI:

# TAG=v3.9.2
# cd /root
# git clone https://github.com/intel/multus-cni.git
# cd multus-cni/
# git checkout $TAG
# ./hack/build-go.sh

Install the plugin binary:

# cp bin/multus /opt/cni/bin

Install the daemonset:

# sed -i 's,\(image: ghcr\.io/k8snetworkplumbingwg/multus-cni\):.*,\1:'$TAG',' deployments/multus-daemonset.yml
# kubectl create -f deployments/multus-daemonset.yml

Note

This documentation has been tested with multus v3.9.2.
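You can check that the multus daemonset pod is running in the kube-system namespace and that a multus CNI configuration was generated next to the flannel one (the daemonset labels and the generated file name, typically 00-multus.conf, depend on the multus release):

# kubectl get pods -n kube-system -l app=multus
# ls /etc/cni/net.d/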

Install sriov plugin

The purpose of the sriov-cni plugin is to configure the VF allocated to the containers.

# cd /root
# git clone https://github.com/intel/sriov-cni.git
# cd sriov-cni
# git checkout v2.6.3
# make build
# cp build/sriov /opt/cni/bin

Note

This documentation has been tested with sriov-cni v2.6.3.

Configure the NICs

In this example, we want to pass two Intel NICs, ens787f0 and ens804f0, to the pod; they have the PCI addresses 0000:81:00.0 and 0000:83:00.0 respectively. We will create a Virtual Function on each interface (PCI addresses 0000:81:10.0 and 0000:83:10.0) and bind them to the vfio-pci driver.

Set the PF devices up and create the desired number of VFs for each NIC:

Note

This subsection applies to Intel network devices (ex: Niantic, Fortville). Other devices like Nvidia Mellanox NICs require different operations, which are not detailed in this document.

# ip link set ens787f0 up
# echo 1 > /sys/class/net/ens787f0/device/sriov_numvfs

# ip link set ens804f0 up
# echo 1 > /sys/class/net/ens804f0/device/sriov_numvfs
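The PCI addresses of the newly created VFs (0000:81:10.0 and 0000:83:10.0 in this example) can be retrieved from the virtfn symlinks of each PF:

# basename $(readlink /sys/class/net/ens787f0/device/virtfn0)
# basename $(readlink /sys/class/net/ens804f0/device/virtfn0)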

Source the following helper shell functions:

# Bind a device to a driver
# $1: pci bus address (ex: 0000:04:00.0)
# $2: driver
bind_device () {
        echo "Binding $1 to $2"
        sysfs_dev=/sys/bus/pci/devices/$1
        if [ -e ${sysfs_dev}/driver ]; then
                sudo sh -c "echo $1 > ${sysfs_dev}/driver/unbind"
        fi
        sudo sh -c "echo $2 > ${sysfs_dev}/driver_override"
        sudo sh -c "echo $1 > /sys/bus/pci/drivers/$2/bind"
        if [ ! -e ${sysfs_dev}/driver ]; then
                echo "Failed to bind device $1 to driver $2" >&2
                return 1
        fi
}

# Bind a device and devices in the same iommu group to a driver
# $1: pci bus address (ex: 0000:04:00.0)
# $2: driver
bind_device_and_siblings () {
        bind_device $1 $2
        # take devices in the same iommu group
        for dir in $sysfs_dev/iommu_group/devices/*; do
                [ -e "$dir" ] || continue
                sibling=$(basename $(readlink -e "$dir"))
                # we can skip ourself
                [ "$sibling" = "$1" ] && continue
                bind_device $sibling $2
        done
}

# get the iommu group of a device
# $1: pci bus address (ex: 0000:04:00.0)
get_iommu_group () {
        iommu_is_enabled || echo -n "noiommu-"
        echo $(basename $(readlink -f /sys/bus/pci/devices/$1/iommu_group))
}

# return 0 (success) if there is at least one file in /sys/class/iommu
iommu_is_enabled() {
        for f in /sys/class/iommu/*; do
                if [ -e "$f" ]; then
                        return 0
                fi
        done
        return 1
}

# get arguments to be passed to docker cli
# $*: list of pci devices
get_vfio_device_args () {
        iommu_is_enabled || echo -n "--cap-add=SYS_RAWIO "
        echo "--device /dev/vfio/vfio "
        for d in $*; do
                echo -n "--device /dev/vfio/$(get_iommu_group $d) "
        done
        echo
}

These helpers can be downloaded from there.

The following command enables the unsafe no-IOMMU mode of vfio in case the IOMMU is not available.

$ if ! iommu_is_enabled; then \
    sudo sh -c "echo Y > /sys/module/vfio/parameters/enable_unsafe_noiommu_mode"; \
  fi

Bind the VFs devices to the vfio-pci driver:

# bind_device_and_siblings 0000:81:10.0 vfio-pci
# bind_device_and_siblings 0000:83:10.0 vfio-pci
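You can verify that the VFs are now bound to vfio-pci and that the corresponding vfio group devices were created:

# readlink /sys/bus/pci/devices/0000:81:10.0/driver
# readlink /sys/bus/pci/devices/0000:83:10.0/driver
# ls /dev/vfio/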

See also

Depending on your system, additional configuration may be required. See Providing physical devices or virtual functions to the container paragraph.

Create Kubernetes networking resources

To pass the VF interfaces to the pod, we need to declare networking resources via the sriov-network-device-plugin.

To do so, create the following config-map-dpdk.yaml file with this content:

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
        "resourceList": [
            {
                "resourceName": "intel_sriov_nic_vsr1",
                "resourcePrefix": "intel.com",
                "selectors": {
                    "vendors": ["8086"],
                    "devices": ["10ed"],
                    "drivers": ["vfio-pci"],
                    "pfNames": ["ens787f0"],
                    "needVhostNet": true
                }
            },
            {
                "resourceName": "intel_sriov_nic_vsr2",
                "resourcePrefix": "intel.com",
                "selectors": {
                    "vendors": ["8086"],
                    "devices": ["10ed"],
                    "drivers": ["vfio-pci"],
                    "pfNames": ["ens804f0"],
                    "needVhostNet": true
                }
            }
        ]
    }

This configuration file declares two different resource types, one for VF 0 of the Intel NIC ens787f0 and one for VF 0 of the Intel NIC ens804f0.

The selector keywords vendors and devices match the hexadecimal vendor and device IDs of the VF devices, which can be found under /sys/bus/pci/devices/<PCI_ID>/.

The "needVhostNet": true directive instructs the sriov-network-device-plugin to mount the /dev/vhost-net device alongside the VF devices into the pods. This is required for all interfaces that will be handled by the fast path.

See also

The detailed explanation of the syntax can be found on the plugin website

Now deploy this ConfigMap into the cluster:

# kubectl apply -f config-map-dpdk.yaml

Finally, we deploy the YAML file describing the daemonset enabling the SR-IOV network device plugin on all worker nodes. It leverages the ConfigMap that we just installed and the SR-IOV device plugin image reachable by all nodes. This file named sriovdp-daemonset.yaml has the following content:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sriov-device-plugin
  namespace: kube-system

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-sriov-device-plugin-amd64
  namespace: kube-system
  labels:
    tier: node
    app: sriovdp
spec:
  selector:
    matchLabels:
      name: sriov-device-plugin
  template:
    metadata:
      labels:
        name: sriov-device-plugin
        tier: node
        app: sriovdp
    spec:
      hostNetwork: true
      hostPID: true
      nodeSelector:
        kubernetes.io/arch: amd64
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      serviceAccountName: sriov-device-plugin
      containers:
      - name: kube-sriovdp
        image: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.5.1
        imagePullPolicy: Always
        args:
        - --log-dir=sriovdp
        - --log-level=10
        securityContext:
          privileged: true
        resources:
          requests:
            cpu: "250m"
            memory: "40Mi"
          limits:
            cpu: 1
            memory: "200Mi"
        volumeMounts:
        - name: devicesock
          mountPath: /var/lib/kubelet/
          readOnly: false
        - name: log
          mountPath: /var/log
        - name: config-volume
          mountPath: /etc/pcidp
        - name: device-info
          mountPath: /var/run/k8s.cni.cncf.io/devinfo/dp
      volumes:
        - name: devicesock
          hostPath:
            path: /var/lib/kubelet/
        - name: log
          hostPath:
            path: /var/log
        - name: device-info
          hostPath:
            path: /var/run/k8s.cni.cncf.io/devinfo/dp
            type: DirectoryOrCreate
        - name: config-volume
          configMap:
            name: sriovdp-config
            items:
            - key: config.json
              path: config.json

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-sriov-device-plugin-arm64
  namespace: kube-system
  labels:
    tier: node
    app: sriovdp
spec:
  selector:
    matchLabels:
      name: sriov-device-plugin
  template:
    metadata:
      labels:
        name: sriov-device-plugin
        tier: node
        app: sriovdp
    spec:
      hostNetwork: true
      nodeSelector:
        kubernetes.io/arch: arm64
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      serviceAccountName: sriov-device-plugin
      containers:
      - name: kube-sriovdp
        image: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:latest-arm64
        imagePullPolicy: Always
        args:
        - --log-dir=sriovdp
        - --log-level=10
        securityContext:
          privileged: true
        resources:
          requests:
            cpu: "250m"
            memory: "40Mi"
          limits:
            cpu: 1
            memory: "200Mi"
        volumeMounts:
        - name: devicesock
          mountPath: /var/lib/kubelet/
          readOnly: false
        - name: log
          mountPath: /var/log
        - name: config-volume
          mountPath: /etc/pcidp
        - name: device-info
          mountPath: /var/run/k8s.cni.cncf.io/devinfo/dp
      volumes:
        - name: devicesock
          hostPath:
            path: /var/lib/kubelet/
        - name: log
          hostPath:
            path: /var/log
        - name: device-info
          hostPath:
            path: /var/run/k8s.cni.cncf.io/devinfo/dp
            type: DirectoryOrCreate
        - name: config-volume
          configMap:
            name: sriovdp-config
            items:
            - key: config.json
              path: config.json

Deploy the daemonset:

# kubectl apply -f sriovdp-daemonset.yaml

Note

We use the v3.5.1 stable version of sriov-network-device-plugin.
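Once the daemonset is running, the device plugin advertises the resources declared in the ConfigMap on each matching node. You can check that the plugin pod is up and that the intel.com resources appear on the node (this may take a few seconds):

# kubectl -n kube-system get pods -l name=sriov-device-plugin
# kubectl describe nodes | grep intel.com/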

Create a multus-sriov-dpdk.yaml file, with the following content, to integrate the SR-IOV plugins into the multus environment:

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: multus-intel-sriov-nic-vsr1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_nic_vsr1
spec:
  config: '{
  "type": "sriov",
  "cniVersion": "0.3.1",
  "name": "sriov-intel-nic-vsr1",
  "trust": "on",
  "spoofchk": "off"
}'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: multus-intel-sriov-nic-vsr2
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_nic_vsr2
spec:
  config: '{
  "type": "sriov",
  "cniVersion": "0.3.1",
  "name": "sriov-intel-nic-vsr2",
  "trust": "on",
  "spoofchk": "off"
}'

Then deploy it:

# kubectl apply -f multus-sriov-dpdk.yaml
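The two network attachments should now be visible as NetworkAttachmentDefinition objects:

# kubectl get network-attachment-definitions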

Install smarter-device-manager plugin

If you plan to use PPP inside the container, it is required to pass the /dev/ppp device to the container. This can be done with the smarter-device-manager plugin.

First, deploy the YAML file describing the daemonset enabling the smarter-device-manager plugin on all worker nodes. This file named smarter-device-manager.yaml has the following content:

# derived from https://gitlab.com/arm-research/smarter/smarter-device-manager/-/blob/master/smarter-device-manager-ds.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: smarter-device-manager
  labels:
    name: smarter-device-manager
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: smarter-device-manager
  namespace: smarter-device-manager
  labels:
    name: smarter-device-manager
    role: agent
spec:
  selector:
    matchLabels:
      name: smarter-device-manager
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: smarter-device-manager
      annotations:
        node.kubernetes.io/bootstrap-checkpoint: "true"
    spec:
      priorityClassName: "system-node-critical"
      hostname: smarter-device-management
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: smarter-device-manager
        image: registry.gitlab.com/arm-research/smarter/smarter-device-manager:v1.20.11
        imagePullPolicy: IfNotPresent
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        resources:
          limits:
            cpu: 100m
            memory: 15Mi
          requests:
            cpu: 10m
            memory: 15Mi
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
          - name: dev-dir
            mountPath: /dev
          - name: sys-dir
            mountPath: /sys
          - name: config
            mountPath: /root/config
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: dev-dir
          hostPath:
            path: /dev
        - name: sys-dir
          hostPath:
            path: /sys
        - name: config
          configMap:
            name: smarter-device-manager
      terminationGracePeriodSeconds: 30

Then deploy it:

# kubectl apply -f smarter-device-manager.yaml

Note

We use the v1.20.11 stable version of smarter-device-manager plugin.

Then create a smarter-device-manager-config.yaml file with the following content:

apiVersion: v1
kind: ConfigMap
metadata:
  name: smarter-device-manager
  namespace: smarter-device-manager
data:
  conf.yaml: |
    - devicematch: ^ppp$
      nummaxdevices: 100

Apply this configuration:

# kubectl apply -f smarter-device-manager-config.yaml

The associated resource will be requested in the Virtual Service Router deployment file, as described below.
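You can check that the smarter-device-manager pod is running and that the ppp device resource is advertised on the node (this assumes /dev/ppp exists on the host, i.e. the relevant kernel module is loaded):

# kubectl -n smarter-device-manager get pods
# kubectl describe nodes | grep smarter-devices/ppp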

Deploy Virtual Service Router into the cluster

Authenticate to 6WIND container registry

The Docker image is available at: download.6wind.com/vsr/x86_64-ce/3.8:3.8.0.ga

First, create a Kubernetes secret to authenticate to the 6WIND registry, using the credentials provided by 6WIND support:

# kubectl create secret docker-registry regcred \
    --docker-server=download.6wind.com \
    --docker-username=$LOGIN --docker-password=$PASSWORD

Warning

Replace $LOGIN and $PASSWORD with the credentials provided by 6WIND support.

Create the Virtual Service Router pod template

The pod is declared in a YAML file describing its properties and the way it is deployed.

For this example, we use a file named vsr.yaml (available for download here):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vsr
spec:
  replicas: 1
  selector:
    matchLabels:
      role: vsr
  template:
    metadata:
      labels:
        role: vsr
      annotations:
         k8s.v1.cni.cncf.io/networks: multus-intel-sriov-nic-vsr1,multus-intel-sriov-nic-vsr2
         container.apparmor.security.beta.kubernetes.io/vsr: unconfined
    spec:
      restartPolicy: Always
      securityContext:
        sysctls:
        - name: net.ipv4.conf.default.disable_policy
          value: "1"
        - name: net.ipv4.ip_local_port_range
          value: "30000 40000"
        - name: net.ipv4.ip_forward
          value: "1"
        - name: net.ipv6.conf.all.forwarding
          value: "1"
      containers:
      - image: download.6wind.com/vsr/x86_64-ce/3.8:3.8.0.ga
        imagePullPolicy: IfNotPresent
        name: vsr
        resources:
          limits:
            cpu: "4"
            memory: "2Gi"
            hugepages-1Gi: 8Gi
            intel.com/intel_sriov_nic_vsr1: '1'
            intel.com/intel_sriov_nic_vsr2: '1'
            smarter-devices/ppp: 1
          requests:
            cpu: "4"
            memory: "2Gi"
            hugepages-1Gi: 8Gi
            intel.com/intel_sriov_nic_vsr1: '1'
            intel.com/intel_sriov_nic_vsr2: '1'
            smarter-devices/ppp: 1
        env:
        securityContext:
          capabilities:
            add: ["NET_ADMIN", "NET_RAW", "SYS_ADMIN", "SYS_NICE", "IPC_LOCK", "NET_BROADCAST", "SYSLOG"]
        volumeMounts:
        - mountPath: /dev/hugepages
          name: hugepage
        - mountPath: /dev/shm
          name: shm
        - mountPath: /dev/net/tun
          name: net
        - mountPath: /tmp
          name: tmp
        - mountPath: /run
          name: run
        - mountPath: /run/lock
          name: run-lock
        stdin: true
        tty: true
      imagePullSecrets:
      - name: regcred
      volumes:
      - emptyDir:
          medium: HugePages
          sizeLimit: 8Gi
        name: hugepage
      - name: shm
        emptyDir:
          sizeLimit: "512Mi"
          medium: "Memory"
      - hostPath:
          path: /dev/net/tun
          type: ""
        name: net
      - emptyDir:
          sizeLimit: "200Mi"
          medium: "Memory"
        name: tmp
      - emptyDir:
          sizeLimit: "200Mi"
          medium: "Memory"
        name: run
      - emptyDir:
          sizeLimit: "200Mi"
          medium: "Memory"
        name: run-lock

Note

If the IOMMU of your server is disabled (which is not advised), the CAP_SYS_RAWIO capability must also be enabled when using a VF or PF interface in the container. See Required capabilities paragraph.

This file defines a deployment of one pod running the Virtual Service Router container, named vsr. The pod is started with reduced capabilities and has both SR-IOV and flannel network interfaces attached.

The pod template declares as an annotation the network resource attachments from multus that it requests. You can add more than one attachment if you need several interfaces.

Next, the container itself is described. The pod has only one container, which runs the Virtual Service Router image fetched from the 6WIND docker repository.

Then the container declares the resources that it requests. The Kubernetes cluster will then find a node satisfying these requirements to schedule the pod. By setting the limits equal to the requests, we ask for this exact amount of resources. For example for the vsr pod:

  • cpu: the number of vCPU resources to be allocated. This reservation does not necessarily imply CPU pinning: the cluster provides an amount of CPU cycles from all the host’s CPUs that is equivalent to 4 logical CPUs.

  • memory: the pod requests 2GB of RAM.

  • hugepages-1Gi: the pod requests 8GB of memory from 1GB hugepages. You can use 2MB hugepages instead, provided that you allocate them at boot time and use the hugepages-2Mi keyword.

Note

Setting the CPU and memory limits respectively equal to the CPU and memory requests makes the pod qualify for the Guaranteed QoS class. See the configuration of the QoS for pods page of Kubernetes documentation.

Warning

Make sure to request enough CPU resources to cover the CPU cores that you plan to configure in your fast-path.env and the processes of the control plane. Otherwise, your fast path cores will be throttled, resulting in severe performance degradation.

The following capabilities are granted to the container:

Capabilities for VSR

Capability          Role
------------------  ----------------------------------------
CAP_SYS_ADMIN       eBPF exception path, VRF, tcpdump
CAP_NET_ADMIN       General Linux networking
CAP_NET_RAW         Support of filtering, tcpdump, …
CAP_IPC_LOCK        Memory allocation for DPDK
CAP_SYS_NICE        Get NUMA information from memory
CAP_NET_BROADCAST   VRF support (notifications)
CAP_SYSLOG          Use syslog from pod

Finally we make sure to mount the following paths from the host into the container:

Host resources for VSR

Path                      Role
------------------------  ----------------------------------------------------
/dev/hugepages, /dev/shm  The fast path requires access to hugepages for its
                          shared memory
/dev/net/tun              Required for FPVI interfaces (exception path)
/dev/ppp                  Required for PPP configuration
/tmp                      Mounted as tmpfs; may be required by some
                          applications using the O_TMPFILE open flag
/run                      Required by systemd
/run/lock                 Required by systemd

Deploy the pod

# kubectl apply -f vsr.yaml

You can then see your pod running:

# kubectl get pods
NAME                   READY   STATUS    RESTARTS   AGE
vsr-55d6f69dcc-2tl76   1/1     Running   0          6m17s

You can get information about the running pod with:

$ kubectl describe pod vsr-55d6f69dcc-2tl76
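You can also confirm that the pod was granted the Guaranteed QoS class, and see which VF PCI devices were attached (the SR-IOV device plugin typically exposes them as PCIDEVICE_* environment variables in the container):

$ kubectl get pod vsr-55d6f69dcc-2tl76 -o jsonpath='{.status.qosClass}'
$ kubectl exec vsr-55d6f69dcc-2tl76 -- env | grep PCIDEVICE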

Note

The pod can be deleted with kubectl delete -f vsr.yaml.

Connect to the pod

You can connect to the pod command line interface with the command kubectl exec -it <pod name> -- login. For example:

$ kubectl exec vsr-55d6f69dcc-2tl76 -it -- login