2.3.1. Node requirements and configuration¶
This section lists the requirements and recommendations that apply to the node (the host that runs the containers).
Linux Distribution¶
The supported distributions are Ubuntu 20.04, Ubuntu 22.04 and Red Hat Enterprise Linux 8.
CPU¶
To run Virtual Service Router, at least 2 CPUs are required: one for the fast path and one for the control plane. Packet processing performance scales with the number of fast path CPUs.
Note
The fast path CPUs must be dedicated: no other application (on the node or inside a container) should use them. Container engines such as Docker and orchestrators such as Kubernetes provide CPU pinning features for this purpose, as illustrated below.
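For instance, with Docker, the fast path CPUs can be dedicated to the container with the --cpuset-cpus option (a minimal sketch; the CPU list and the vsr-image name are illustrative):
# docker run -d --name vsr --cpuset-cpus 2,3 vsr-image
With Kubernetes, a similar result can be obtained with the static CPU Manager policy and a Guaranteed pod requesting whole CPUs.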
Memory¶
The host has to be configured to provide hugepages and POSIX shared memory to the container; otherwise, the Virtual Service Router fast path will fail to start.
Note
A hugepage is a page that addresses more memory than the usual 4KB. Accessing a hugepage is more efficient than accessing a regular memory page. Its default size is 2MB.
Recommended memory to run Virtual Service Router¶
2GB of standard memory
8GB of hugepages
512MB of POSIX shared memory
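As an illustration, the recommended amount of 2MB hugepages could be reserved on the node as follows (a sketch; the POSIX shared memory size is usually set through the container engine, for example with Docker's --shm-size option):
# # reserve 4096 x 2MB hugepages (8GB)
# echo 4096 > /proc/sys/vm/nr_hugepages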
Minimum memory to run Virtual Service Router¶
1GB of standard memory
1GB of hugepages
64MB of POSIX shared memory
Using the fast path with this amount of memory requires customizing its configuration.
See also
The advanced fast path configuration in the command reference for details.
Kernel configuration¶
The following kernel options are required to run Virtual Service Router:
| Option | Role |
|---|---|
| CONFIG_CGROUP_BPF | L3VRF support |
| CONFIG_XFRM_INTERFACE | IPsec virtual interface support |
| CONFIG_NF_TABLES | IPsec Linux synchronization support |
| CONFIG_NETFILTER_XT_TARGET_NOTRACK | NOTRACK Netfilter target support |
| CONFIG_NET_ACT_BPF | Exception support |
| CONFIG_MPLS_ROUTING | MPLS forwarding support |
| CONFIG_MPLS_IPTUNNEL | IP in MPLS encapsulation support |
| CONFIG_PPPOE | PPPoE (PPP over Ethernet) support |
Kernel modules¶
The following kernel modules must be loaded on the hypervisor node:
# KMODS="
br_netfilter
ebtables
ifb
ip6_tables
ip_tables
mpls_iptunnel
mpls_router
nf_conntrack
overlay
ppp_generic
vfio-pci
vhost-net
vrf
"
# for kmod in $KMODS; do modprobe $kmod; done
To load the modules automatically at the next reboot, add them to a file in the /etc/modules-load.d directory:
# for kmod in $KMODS; do echo $kmod; done > /etc/modules-load.d/vsr-hypervisor.conf
Required capabilities¶
Linux divides privileges into distinct units, known as capabilities, which can be independently enabled and disabled.
The following capabilities are required to run Virtual Service Router:
| Capability | Role |
|---|---|
| CAP_SYS_ADMIN | Dataplane processing (exception path and VRF) |
| CAP_NET_ADMIN | General Linux networking |
| CAP_NET_RAW | Support of filtering, tcpdump, … |
| CAP_IPC_LOCK | Memory allocation for dataplane |
| CAP_SYS_NICE | NUMA information retrieval |
| CAP_NET_BROADCAST | VRF notifications support |
| CAP_SYSLOG | Log to syslog |
The following capabilities may be required to run Virtual Service Router:
| Capability | Role |
|---|---|
| CAP_AUDIT_WRITE | Required when using an SSH server |
| CAP_SYS_RAWIO | Required to access the vfio group if the IOMMU is disabled |
| CAP_SYS_PTRACE | Enhances the accuracy of BPF filter synchronization in the fast path |
| CAP_SYS_TIME | Set the node clock from the container |
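For instance, with Docker, these capabilities can be granted with --cap-add options (a sketch; the vsr-image name is illustrative, and the optional capabilities should only be added when needed):
# docker run -d --name vsr \
      --cap-add SYS_ADMIN --cap-add NET_ADMIN --cap-add NET_RAW \
      --cap-add IPC_LOCK --cap-add SYS_NICE --cap-add NET_BROADCAST \
      --cap-add SYSLOG \
      vsr-image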
Providing physical devices or virtual functions to the container¶
A physical device or a virtual function can be provided to the container for dataplane packet processing. In this situation, we recommend enabling the IOMMU:
in the BIOS: the feature is called VT-d on Intel servers.
in the Linux kernel: add intel_iommu=on to the kernel command line on Intel servers.
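For example, on Ubuntu, the parameter can be added through GRUB (a sketch; adapt the procedure to your boot loader and distribution):
# # append intel_iommu=on to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
# update-grub
# reboot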
Note
This security measure prevents a malicious or misconfigured NIC from accessing the whole physical memory of the machine.
If the IOMMU is not available, the vfio-pci module has to be configured in unsafe mode:
# echo Y > /sys/module/vfio/parameters/enable_unsafe_noiommu_mode
Configure cgroup v2 for net_prio,net_cls¶
Note
The following is needed only when using a distribution that uses a hybrid cgroup hierarchy (like Ubuntu 20.04 or Red Hat Enterprise Linux 8).
When using Kubernetes, the net_prio and net_cls controllers must be configured with cgroup version 2. To do so, the following argument must be passed to the kernel command line:
cgroup_no_v1=net_prio,net_cls
If this argument is missing, it won’t be possible to configure L3VRF in Virtual Service Router.
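The kernel command line of the running node can be checked to confirm that the argument is present, for example:
# grep -o 'cgroup_no_v1=[^ ]*' /proc/cmdline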
Setting maximum receive socket buffer size¶
The maximum receive socket buffer size is global to the node, and is not configurable from inside a container. We recommend increasing this value to 128MB on the node:
# echo 134217728 > /proc/sys/net/core/rmem_max
A lower value can slow down Virtual Service Router management or control plane traffic.
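To make this setting persistent across reboots, a sysctl drop-in file can be used (a sketch; the file name is arbitrary):
# echo 'net.core.rmem_max = 134217728' > /etc/sysctl.d/99-vsr.conf
# sysctl -p /etc/sysctl.d/99-vsr.conf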
Setting maximum number of connection tracking objects¶
The maximum number of connection tracking objects is global to the node, and is not configurable from inside a container. The default value set by the Linux kernel depends on the amount of memory on the node. This value can be displayed from the container CLI:
vsr> show state / system network-stack conntrack max-entries
We recommend increasing this value if you plan to track a large number of connections. On the node, run:
# echo 10485760 > /proc/sys/net/netfilter/nf_conntrack_max
# # usually, 4 times lower than nf_conntrack_max is a good compromise
# echo 2621440 > /sys/module/nf_conntrack/parameters/hashsize
A value that is too low will prevent connections from being properly tracked, impacting the firewall and NAT features.
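These values can also be made persistent across reboots, for example with a sysctl drop-in file and a modprobe option file (a sketch; the file names are arbitrary):
# echo 'net.netfilter.nf_conntrack_max = 10485760' > /etc/sysctl.d/99-vsr-conntrack.conf
# echo 'options nf_conntrack hashsize=2621440' > /etc/modprobe.d/nf_conntrack.conf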
Note
The nf_conntrack kernel module must be loaded to make the nf_conntrack_max sysctl available.
See also
Connection tracking paragraph in IP packet filtering section.
Setting maximum number of inotify watchers¶
When running several Virtual Service Router containers on the same node, the maximum number of inotify watchers can be reached, causing a “too many open files” error inside the containers. It should be increased with:
# sysctl fs.inotify.max_user_instances=2048
# sysctl fs.inotify.max_user_watches=1048576
Configuring filtering on bridges¶
The ability to apply filtering on bridges depends on a sysctl (enabled by default).
For kernel versions >= 5.3, this configuration is per container; in this case, there is no limitation for this feature.
If the kernel version of your node is < 5.3, the configuration is global and has to be customized on the node with:
# # disable filtering on bridge ports for IPv4 and IPv6
# echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables
# echo 0 > /proc/sys/net/bridge/bridge-nf-call-ip6tables
# # enable filtering on bridge ports for IPv4 and IPv6
# echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables
# echo 1 > /proc/sys/net/bridge/bridge-nf-call-ip6tables
The current value can be displayed from the container CLI:
vsr> show state / system network-stack bridge call-ipv4-filtering
vsr> show state / system network-stack bridge call-ipv6-filtering
Enabling log filter target in all containers¶
By default, the log filter target is disabled inside containers. It can only be enabled from the node.
Use the following command on the node to enable the log filter target:
# echo 1 > /proc/sys/net/netfilter/nf_log_all_netns
The filter rules that use a log target in the container won’t generate any kernel log if nf_log_all_netns is set to 0.
Configuring ARP/NDP parameters¶
The default configuration of the ARP/NDP network stack is global to the node. The maximum number of ARP/NDP entries cannot be configured from a container; therefore, the following configuration items will be ignored:
system network-stack neighbor ipv4-max-entries
system network-stack neighbor ipv6-max-entries
vrf main network-stack neighbor ipv4-max-entries
vrf main network-stack neighbor ipv6-max-entries
The default time during which a neighbor entry stays reachable is also not configurable from inside a container. The following configuration items won’t work properly and should not be used:
system network-stack neighbor ipv4-base-reachable-time
system network-stack neighbor ipv6-base-reachable-time
The default ARP/NDP stack configuration can be customized on the node, for instance with the following commands:
# echo 10 > /proc/sys/net/ipv4/neigh/default/base_reachable_time
# echo 10 > /proc/sys/net/ipv6/neigh/default/base_reachable_time
# # set thresholds for a maximum of 2048 neighbors
# echo 1024 > /proc/sys/net/ipv4/neigh/default/gc_thresh1
# echo 2048 > /proc/sys/net/ipv4/neigh/default/gc_thresh2
# echo 4096 > /proc/sys/net/ipv4/neigh/default/gc_thresh3
# echo 1024 > /proc/sys/net/ipv6/neigh/default/gc_thresh1
# echo 2048 > /proc/sys/net/ipv6/neigh/default/gc_thresh2
# echo 4096 > /proc/sys/net/ipv6/neigh/default/gc_thresh3
Configuring IPv6 max route cache¶
Before Linux version 5.16, the maximum size of the IPv6 route cache is global to the node. The default value is 4096, which is enough for most use cases.
If you plan to deal with a large number of IPv6 routes (for example, a full IPv6 routing table), the value should be increased on the node:
# echo 16384 > /proc/sys/net/ipv6/route/max_size
A lower value can cause performance issues or packet drops when a large number of IPv6 routes are used.
Setting system clock¶
Setting the system clock from inside a container is not recommended.
For specific use cases where it makes sense, it requires the CAP_SYS_TIME capability. In this case, setting the system clock, either manually or through an NTP client configuration, impacts the whole node.
When the capability is disabled, setting the system clock will fail, and NTP configurations that set the system time will be ineffective. It is still possible to act as an NTP server.
Linux kernel security modules¶
Depending on the distribution running on the host, a security module like AppArmor or SELinux may be enabled, preventing access to files that are needed to run Virtual Service Router.
We recommend that you first validate without these modules, then build your own configuration.
For instance, on Ubuntu 22.04, AppArmor can be disabled with:
# aa-teardown
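Similarly, on Red Hat Enterprise Linux 8, SELinux can be switched to permissive mode for a first validation (a sketch; build a proper SELinux policy for production):
# setenforce 0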
Using Network Virtual Functions¶
Some physical PCI Express network devices can be shared among containers through Virtual Functions.
Note
We recommend that you use the latest driver and the latest firmware available for your device.
VLAN strip¶
On some hardware (like the Intel 82599, X5xx or X7xx families), if the VF is configured with a VLAN on the host, the VLAN tags are visible in the container when a network packet is received. To disable this behavior, the fast path can be configured from the CLI:
vsr running config# system fast-path advanced vlan-strip true
vsr running config# commit
Using L2 features on VFs¶
Using L2 features (VLAN, bridge, LAG, …) on VFs may require additional VF configuration to enable the promiscuous mode or to change the MAC address from inside the container. On most hardware, doing this is not allowed if the VF is not trusted.
To enable trust mode, use the following command on the host:
# ip link set dev PF_NAME vf VF_NUM trust on
It may also be necessary to disable spoof checking, in case the container is allowed to send packets with the same MAC address or the same VLAN as the PF or another VF:
# ip link set dev PF_NAME vf VF_NUM spoofchk off
On some NICs (Intel X540 and X550), the PF must also be in promiscuous mode to enable promiscuous mode on the VF:
# ip link set dev PF_NAME promisc on
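The resulting VF settings (trust, spoof checking, MAC address, VLAN) can be checked on the host with:
# ip link show dev PF_NAME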
Note
Some features are not available on some devices. For instance, it is not possible to enable a true promiscuous mode on a VF when using an Intel 82599 or Intel X520 NIC. Refer to the datasheet of your device.
Large MTU¶
The MTU of the VF cannot be higher than the MTU of the PF. To increase the MTU of the PF, use this command on the host:
# ip link set dev PF_NAME mtu 9000
Malicious Driver Detection not supported on Intel X550¶
The Intel X550 series NICs support a feature called MDD (Malicious Driver Detection), which checks the behavior of the VF driver. This feature is not supported by Virtual Service Router and must be disabled. This is a known issue of the DPDK driver.
To disable MDD, use the following command on the host, then reload the ixgbe driver:
# echo "options ixgbe MDD=0,0" > "/etc/modprobe.d/ixgbe-mdd.conf"