2.3.1. Node requirements and configuration

This section lists the requirements and recommendations that apply to the node (the host that runs the containers).

Linux Distribution

The supported distributions are Ubuntu 20.04, Ubuntu 22.04 and Red Hat Enterprise Linux 8.

CPU

To run Virtual Service Router, at least 2 CPUs are required: 1 for the fast path and 1 for the control plane. Packet processing performance scales with the number of fast path CPUs.

Note

The fast path CPUs must be dedicated: no other application (on the node or inside a container) should use them. Container engines and orchestrators such as Docker or Kubernetes provide features for that, as shown below.
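For instance, with Docker, the --cpuset-cpus option restricts a container to a given set of CPUs, and with Kubernetes the static CPU manager policy can grant exclusive CPUs to a pod. The command below is only a sketch: the CPU numbers and the image name are placeholders to adapt to your node.

# # restrict the container to CPUs 2 and 3 (make sure no other workload uses them)
# docker run --cpuset-cpus=2,3 --name vsr vsr-image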

Memory

The host has to be configured to provide hugepages and POSIX shared memory to the container; otherwise, the Virtual Service Router fast path will fail to start.

Note

A hugepage is a page that addresses more memory than the usual 4KB. Accessing a hugepage is more efficient than accessing a regular memory page. Its default size is 2MB.
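As an illustration, hugepages can be reserved at runtime on the node through sysfs. The amount below (1024 hugepages of 2MB, i.e. 2GB) is only an example to adapt to your deployment, and how the hugepages are then exposed to the container depends on your container engine.

# # reserve 1024 hugepages of 2MB on the node
# echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# # mount the hugetlbfs filesystem if the distribution has not already done it
# mount -t hugetlbfs none /dev/hugepages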

Minimum memory to run Virtual Service Router

  • 1GB of standard memory

  • 1GB of hugepages

  • 64MB of POSIX shared memory

Using the fast path with this minimal amount of memory requires customizing its configuration.

Kernel configuration

The following kernel options are required to run Virtual Service Router:

Option                                Role

CONFIG_CGROUP_BPF                     L3VRF support
CONFIG_XFRM_INTERFACE                 IPsec virtual interface support
CONFIG_NF_TABLES                      IPsec Linux synchronization support
CONFIG_NETFILTER_XT_TARGET_NOTRACK    NOTRACK Netfilter target support
CONFIG_NET_ACT_BPF                    Exception support
CONFIG_MPLS_ROUTING                   MPLS forwarding support
CONFIG_MPLS_IPTUNNEL                  IP in MPLS encapsulation support
CONFIG_PPPOE                          PPPoE (PPP over Ethernet) support
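To check that an option is enabled on the node, the kernel configuration file shipped by the distribution can usually be inspected (the location may vary; /boot/config-$(uname -r) is a common one). For example:

# grep -E 'CONFIG_NF_TABLES|CONFIG_MPLS_ROUTING|CONFIG_PPPOE' /boot/config-$(uname -r)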

Kernel modules

The following kernel modules must be loaded on the hypervisor node:

# KMODS="
br_netfilter
ebtables
ifb
ip6_tables
ip_tables
mpls_iptunnel
mpls_router
nf_conntrack
overlay
ppp_generic
vfio-pci
vhost-net
vrf
"
# for kmod in $KMODS; do modprobe $kmod; done

To load the modules at the next reboot, list them in a file in the /etc/modules-load.d directory:

# for kmod in $KMODS; do echo $kmod; done > /etc/modules-load.d/vsr-hypervisor.conf
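To check that a given module is loaded, lsmod can be used on the node (module names appear with underscores, e.g. vfio-pci is listed as vfio_pci):

# lsmod | grep -E '^(vfio_pci|vhost_net|mpls_router)'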

Required capabilities

Linux divides privileges into distinct units, known as capabilities, which can be independently enabled and disabled.

The following capabilities are required to run Virtual Service Router:

Capability            Role

CAP_SYS_ADMIN         Dataplane processing (exception path and VRF)
CAP_NET_ADMIN         General Linux networking
CAP_NET_RAW           Support of filtering, tcpdump
CAP_IPC_LOCK          Memory allocation for dataplane
CAP_SYS_NICE          NUMA information retrieval
CAP_NET_BROADCAST     VRF notifications support
CAP_SYSLOG            Log to syslog

The following capabilities may be required to run Virtual Service Router:

Capability          Role

CAP_AUDIT_WRITE     Required when using an SSH server
CAP_SYS_RAWIO       Required to access the vfio group when the IOMMU is disabled
CAP_SYS_PTRACE      Enhance accuracy of BPF filter synchronization in the fast path
CAP_SYS_TIME        Set the node clock from the container
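As an illustration, with Docker, capabilities are typically granted with --cap-add options. This is only a sketch: the image name is a placeholder, and the optional capabilities should be added depending on your use case.

# docker run --cap-add SYS_ADMIN --cap-add NET_ADMIN --cap-add NET_RAW \
      --cap-add IPC_LOCK --cap-add SYS_NICE --cap-add NET_BROADCAST \
      --cap-add SYSLOG vsr-image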

Providing physical devices or virtual functions to the container

A physical device or a virtual function can be provided to the container to be used for dataplane packet processing. In this situation, we recommend enabling the IOMMU:

  • in the BIOS: the feature is called VT-d on Intel servers.

  • in the Linux kernel: add intel_iommu=on to the kernel command line on Intel servers (see the example below).

Note

This security measure prevents a malicious or misconfigured NIC from accessing the whole physical memory of the machine.
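For instance, on a distribution using GRUB, the option can be added to the kernel command line as follows (a sketch, assuming Ubuntu; on Red Hat Enterprise Linux 8 the GRUB configuration is regenerated with grub2-mkconfig instead of update-grub):

# # add intel_iommu=on to the GRUB_CMDLINE_LINUX variable in /etc/default/grub, then:
# update-grub
# reboot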

If the IOMMU is not available, the vfio-pci module has to be configured in unsafe no-IOMMU mode:

# echo Y > /sys/module/vfio/parameters/enable_unsafe_noiommu_mode
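In both cases, the device dedicated to the container must be bound to the vfio-pci driver on the node. A minimal sketch, assuming a hypothetical PCI address 0000:81:00.0; your container engine or device plugin may handle this binding for you:

# modprobe vfio-pci
# echo vfio-pci > /sys/bus/pci/devices/0000:81:00.0/driver_override
# # unbind from the current driver (skip if the device is not bound)
# echo 0000:81:00.0 > /sys/bus/pci/devices/0000:81:00.0/driver/unbind
# echo 0000:81:00.0 > /sys/bus/pci/drivers_probe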

Configure cgroup v2 for net_prio,net_cls

Note

The following is needed only when using a distribution that uses a hybrid cgroup hierarchy (like Ubuntu 20.04 or Red Hat Enterprise Linux 8).

When using Kubernetes, the net_prio and net_cls controllers must be handled by cgroup version 2. For that, the following argument must be passed on the kernel command line:

cgroup_no_v1=net_prio,net_cls

If this argument is missing, it won’t be possible to configure L3VRF in Virtual Service Router.
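After a reboot, you can check on the node that the argument was taken into account:

# grep cgroup_no_v1 /proc/cmdline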

Setting maximum receive socket buffer size

The maximum receive socket buffer size is global to the node, and is not configurable from inside a container. We recommend increasing this value to 128MB on the node:

# echo 134217728 > /proc/sys/net/core/rmem_max

A lower value can slow down the Virtual Service Router management or the control plane traffic.
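Writing to /proc/sys is not persistent across reboots. A common way to make this kind of setting permanent is a file in /etc/sysctl.d (the file name below is an arbitrary example):

# echo "net.core.rmem_max = 134217728" > /etc/sysctl.d/90-vsr-node.conf
# sysctl -p /etc/sysctl.d/90-vsr-node.conf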

Setting maximum number of connection tracking objects

The maximum number of connection tracking objects is global to the node, and is not configurable from inside a container. The default value set by the Linux kernel depends on the amount of memory on the node. This value can be displayed from the container CLI:

vsr> show state / system network-stack conntrack max-entries

We recommend increasing this value if you plan to track a large number of connections. On the node, do:

# echo 10485760 > /proc/sys/net/netfilter/nf_conntrack_max
# # usually, 4 times lower than nf_conntrack_max is a good compromise
# echo 2621440 > /sys/module/nf_conntrack/parameters/hashsize

Too low a value will prevent connections from being properly tracked, impacting the firewall and NAT features.

Note

The nf_conntrack kernel module must be loaded to make the nf_conntrack_max sysctl available.

Setting maximum number of inotify watchers

When running several Virtual Service Router containers on the same node, the maximum number of inotify watchers can be reached, causing “too many open files” errors inside the containers. It should be increased with:

# sysctl fs.inotify.max_user_instances=2048
# sysctl fs.inotify.max_user_watches=1048576

See also

inotify(7)

Configuring filtering on bridges

The ability to apply filtering on bridges depends on a sysctl (enabled by default).

For kernel versions >= 5.3, this configuration is per container, so there is no limitation for this feature.

If the kernel version of your node is < 5.3, the configuration is global to the node and has to be customized there with:

# # disable filtering on bridge ports for IPv4 and IPv6
# echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables
# echo 0 > /proc/sys/net/bridge/bridge-nf-call-ip6tables

# # enable filtering on bridge ports for IPv4 and IPv6
# echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables
# echo 1 > /proc/sys/net/bridge/bridge-nf-call-ip6tables

The current value can be displayed from the container CLI:

vsr> show state / system network-stack bridge call-ipv4-filtering
vsr> show state / system network-stack bridge call-ipv6-filtering

Enabling log filter target in all containers

By default, the log filter target is disabled when used in a container. It can only be enabled on the node.

Use the following command on the node to enable the log filter target:

# echo 1 > /proc/sys/net/netfilter/nf_log_all_netns

The filter rules that use a log target in the container won’t generate any kernel log if nf_log_all_netns is set to 0.

Configuring ARP/NDP parameters

The default configuration of the ARP/NDP network stack is global to the node. The maximum number of ARP/NDP entries cannot be configured from a container, therefore the following configuration items will be ignored:

system network-stack neighbor ipv4-max-entries
system network-stack neighbor ipv6-max-entries
vrf main network-stack neighbor ipv4-max-entries
vrf main network-stack neighbor ipv6-max-entries

The default time during which a neighbor entry stays reachable is also not configurable from inside a container. The following configuration items won’t work properly and should not be used:

system network-stack neighbor ipv4-base-reachable-time
system network-stack neighbor ipv6-base-reachable-time

The default ARP/NDP stack configuration can be customized on the node, for instance with the following commands:

# echo 10 > /proc/sys/net/ipv4/neigh/default/base_reachable_time
# echo 10 > /proc/sys/net/ipv6/neigh/default/base_reachable_time
# # set thresholds for a maximum of 2048 neighbors
# echo 1024 > /proc/sys/net/ipv4/neigh/default/gc_thresh1
# echo 2048 > /proc/sys/net/ipv4/neigh/default/gc_thresh2
# echo 4096 > /proc/sys/net/ipv4/neigh/default/gc_thresh3
# echo 1024 > /proc/sys/net/ipv6/neigh/default/gc_thresh1
# echo 2048 > /proc/sys/net/ipv6/neigh/default/gc_thresh2
# echo 4096 > /proc/sys/net/ipv6/neigh/default/gc_thresh3

Configuring IPv6 max route cache

Before Linux version 5.16, the maximum size of the IPv6 route cache is global to the node. The default value is 4096, which is enough for most use cases.

If you plan to deal with a large number of IPv6 routes (for example a full IPv6 routing table), the value should be increased on the node:

# echo 16384 > /proc/sys/net/ipv6/route/max_size

A lower value can cause performance issues or packet drops when a large number of IPv6 routes are used.

Setting system clock

Setting the system clock from inside a container is not recommended.

For the specific use cases where it makes sense, the CAP_SYS_TIME capability is required. In this case, setting the system clock, either manually or through an NTP client configuration, impacts the whole node.

When the capability is disabled, setting the system clock will fail, and NTP configurations that set the system time will be ineffective. It is still possible to act as an NTP server.

Linux kernel security modules

Depending on the distribution running on the host, a security module like AppArmor or SELinux may be enabled, preventing access to files that are needed to run Virtual Service Router.

We recommend that you first validate without these modules, then build your own configuration.

For instance, on Ubuntu 22.04, AppArmor can be disabled with:

# aa-teardown
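Similarly, on Red Hat Enterprise Linux 8, SELinux can be switched to permissive mode for a first validation (to make the change persistent, set SELINUX=permissive in /etc/selinux/config):

# getenforce
# setenforce 0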

Using Network Virtual Functions

Some physical PCI Express network devices can be shared among containers through Virtual Functions.

Note

We recommend that you use the latest driver and the latest firmware available for your device.
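Virtual Functions are typically created on the node through the sriov_numvfs sysfs attribute of the physical function; PF_NAME and the number of VFs below are placeholders:

# # create 4 Virtual Functions on the physical function
# echo 4 > /sys/class/net/PF_NAME/device/sriov_numvfs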

VLAN strip

On some hardware (like the Intel 82599, X5xx or X7xx families), if a VLAN is configured on the VF from the host, the VLAN tags are visible in the container when receiving network packets. To disable this behavior, the fast path can be configured from the CLI:

vsr running config# system fast-path advanced vlan-strip true
vsr running config# commit

Using L2 features on VFs

Using L2 features (VLAN, bridge, LAG, …) on VFs may require additional VF configuration to enable the promiscuous mode or to change the MAC address from inside the container. On most hardware, doing this is not allowed if the VF is not trusted.

To enable trust mode, use the following command on the host:

# ip link set dev PF_NAME vf VF_NUM trust on

It may also be needed to disable spoof checking, if the container must be allowed to send packets with the same MAC address or the same VLAN as the PF or another VF:

# ip link set dev PF_NAME vf VF_NUM spoofchk off

On some NICs (Intel X540 and X550), the PF must also be in promiscuous mode to enable the promiscuous mode on the VF:

# ip link set dev PF_NAME promisc on

Note

Some features are not available on some devices. For instance it is not possible to enable a true promiscuous mode on a VF when using an Intel 82599 or Intel X520 NIC. Refer to the datasheet of your device.

Large MTU

The MTU of the VF cannot be higher than the MTU of the PF. To increase the MTU of the PF, use this command on the host:

# ip link set dev PF_NAME mtu 9000

Malicious Driver Detection not supported on Intel X550

The Intel X550 series NICs support a feature called MDD (Malicious Driver Detection), which checks the behavior of the VF driver. This feature is not supported by Virtual Service Router and must be disabled. This is a known issue of the DPDK driver.

To disable MDD, use the following command on the host, then reload the ixgbe driver:

# echo "options ixgbe MDD=0,0" > "/etc/modprobe.d/ixgbe-mdd.conf"