2.3.1. Node requirements and configuration¶
This section lists the requirements and recommendations that apply to the node (the host that runs the containers).
Linux Distribution¶
The supported distributions are Ubuntu 20.04, Ubuntu 22.04 and Red Hat Enterprise Linux 8.
CPU¶
To run Virtual Service Router, at least 2 CPUs are required: one for the fast path and one for the control plane. Packet processing performance scales with the number of fast path CPUs.
Note
The fast path CPUs must be dedicated: no other application (on the node or inside a container) should use them. Container engines such as Docker and orchestrators such as Kubernetes provide CPU pinning features for this purpose, as illustrated below.
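For instance, with Docker, the fast path CPUs can be dedicated to the container with the --cpuset-cpus option (a minimal sketch; the CPU list and the vsr-image name are illustrative):
# docker run -d --name vsr --cpuset-cpus 2,3 vsr-image
With Kubernetes, a similar result can be obtained with the static CPU Manager policy and a Guaranteed pod requesting whole CPUs.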
Memory¶
The host has to be configured to provide hugepages and POSIX shared memory to the container; otherwise, the Virtual Service Router fast path will fail to start.
Note
A hugepage is a page that addresses more memory than the usual 4KB. Accessing a hugepage is more efficient than accessing a regular memory page. Its default size is 2MB.
Recommended memory to run Virtual Service Router¶
2GB of standard memory
8GB of hugepages
512MB of POSIX shared memory
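As an illustration, the recommended amount of 2MB hugepages could be reserved on the node as follows (a sketch; the POSIX shared memory size is usually set through the container engine, for example with Docker's --shm-size option):
# # reserve 4096 x 2MB hugepages (8GB)
# echo 4096 > /proc/sys/vm/nr_hugepages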
Minimum memory to run Virtual Service Router¶
1GB of standard memory
1GB of hugepages
64MB of POSIX shared memory
Using the fast path with this amount of memory requires customizing its configuration.
See also
The advanced fast path configuration in the command reference for details.
Kernel configuration¶
The following kernel options are required to run Virtual Service Router:
| Option | Role |
|---|---|
| CONFIG_CGROUP_BPF | L3VRF support |
| CONFIG_XFRM_INTERFACE | IPsec virtual interface support |
| CONFIG_NF_TABLES | IPsec Linux synchronization support |
| CONFIG_NETFILTER_XT_TARGET_NOTRACK | NOTRACK Netfilter target support |
| CONFIG_NET_ACT_BPF | Exception support |
| CONFIG_MPLS_ROUTING | MPLS forwarding support |
| CONFIG_MPLS_IPTUNNEL | IP in MPLS encapsulation support |
| CONFIG_PPPOE | PPPoE (PPP over Ethernet) support |
Kernel modules¶
The following kernel modules must be loaded on the hypervisor node:
# KMODS="
br_netfilter
ebtables
ifb
ip6_tables
ip_tables
mpls_iptunnel
mpls_router
nf_conntrack
overlay
ppp_generic
vfio-pci
vhost-net
vrf
"
# for kmod in $KMODS; do modprobe $kmod; done
To load the modules automatically at the next reboot, add them to a file in the /etc/modules-load.d directory:
# for kmod in $KMODS; do echo $kmod; done > /etc/modules-load.d/vsr-hypervisor.conf
Required capabilities¶
Linux divides privileges into distinct units, known as capabilities, which can be independently enabled and disabled.
The following capabilities are required to run Virtual Service Router:
| Capability | Role |
|---|---|
| CAP_SYS_ADMIN | Dataplane processing (exception path and VRF) |
| CAP_NET_ADMIN | General Linux networking |
| CAP_NET_RAW | Support of filtering, tcpdump, … |
| CAP_IPC_LOCK | Memory allocation for dataplane |
| CAP_SYS_NICE | NUMA information retrieval |
| CAP_NET_BROADCAST | VRF notifications support |
| CAP_SYSLOG | Log to syslog |
The following capabilities may be required to run Virtual Service Router:
| Capability | Role |
|---|---|
| CAP_AUDIT_WRITE | Required when using an SSH server |
| CAP_SYS_RAWIO | Required to access the vfio group if the IOMMU is disabled |
| CAP_SYS_PTRACE | Enhances the accuracy of BPF filter synchronization in the fast path |
| CAP_SYS_TIME | Set the node clock from the container |
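For instance, with Docker, these capabilities can be granted with --cap-add options (a sketch; the vsr-image name is illustrative, and the optional capabilities should only be added when needed):
# docker run -d --name vsr \
      --cap-add SYS_ADMIN --cap-add NET_ADMIN --cap-add NET_RAW \
      --cap-add IPC_LOCK --cap-add SYS_NICE --cap-add NET_BROADCAST \
      --cap-add SYSLOG \
      vsr-image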
Providing physical devices or virtual functions to the container¶
A physical device or a virtual function can be provided to the container for dataplane packet processing. In this situation, we recommend enabling the IOMMU:
in the BIOS: the feature is called VT-d on Intel servers.
in the Linux kernel: add intel_iommu=on to the kernel command line on Intel servers.
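For example, on Ubuntu, the parameter can be added through GRUB (a sketch; adapt the procedure to your boot loader and distribution):
# # append intel_iommu=on to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
# update-grub
# reboot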
Note
This security measure prevents a malicious or misconfigured NIC from accessing the whole physical memory of the machine.
If the IOMMU is not available, the vfio-pci module has to be configured in unsafe mode:
# echo Y > /sys/module/vfio/parameters/enable_unsafe_noiommu_mode
Configure cgroup v2 for net_prio,net_cls¶
Note
The following is needed only when using a distribution that uses a hybrid cgroup hierarchy (like Ubuntu 20.04 or Red Hat Enterprise Linux 8).
When using Kubernetes, the net_prio and net_cls controllers must be configured with cgroup version 2. To do so, the following argument must be passed to the kernel command line:
cgroup_no_v1=net_prio,net_cls
If this argument is missing, it won’t be possible to configure L3VRF in Virtual Service Router.
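The kernel command line of the running node can be checked to confirm that the argument is present, for example:
# grep -o 'cgroup_no_v1=[^ ]*' /proc/cmdline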
Setting maximum receive socket buffer size¶
The maximum receive socket buffer size is global to the node, and is not configurable from inside a container. We recommend increasing this value to 128MB on the node:
# echo 134217728 > /proc/sys/net/core/rmem_max
A lower value can slow down Virtual Service Router management or control plane traffic.
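To make this setting persistent across reboots, a sysctl drop-in file can be used (a sketch; the file name is arbitrary):
# echo 'net.core.rmem_max = 134217728' > /etc/sysctl.d/99-vsr.conf
# sysctl -p /etc/sysctl.d/99-vsr.conf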
Setting maximum number of connection tracking objects¶
The maximum number of connection tracking objects is global to the node, and is not configurable from inside a container. The default value set by the Linux kernel depends on the amount of memory on the node. This value can be displayed from the container CLI:
vsr> show state / system network-stack conntrack max-entries
We recommend increasing this value if you plan to track a large number of connections. On the node, run:
# echo 10485760 > /proc/sys/net/netfilter/nf_conntrack_max
# # usually, 4 times lower than nf_conntrack_max is a good compromise
# echo 2621440 > /sys/module/nf_conntrack/parameters/hashsize
A value that is too low will prevent connections from being properly tracked, impacting the firewall and NAT features.
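These values can also be made persistent across reboots, for example with a sysctl drop-in file and a modprobe option file (a sketch; the file names are arbitrary):
# echo 'net.netfilter.nf_conntrack_max = 10485760' > /etc/sysctl.d/99-vsr-conntrack.conf
# echo 'options nf_conntrack hashsize=2621440' > /etc/modprobe.d/nf_conntrack.conf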
Note
The nf_conntrack kernel module must be loaded to make the nf_conntrack_max sysctl available.
See also
Connection tracking paragraph in IP packet filtering section.
Setting maximum number of inotify watchers¶
When running several Virtual Service Router containers on the same node, the maximum number of inotify watchers can be reached, causing a “too many open files” error inside the containers. It should be increased with:
# sysctl fs.inotify.max_user_instances=2048
# sysctl fs.inotify.max_user_watches=1048576
Configuring filtering on bridges¶
The ability to apply filtering on bridges depends on a sysctl (enabled by default).
For kernel versions >= 5.3, this configuration is per container; in this case, there is no limitation for this feature.
If the kernel version of your node is < 5.3, the configuration is global and has to be customized on the node with:
# # disable filtering on bridge ports for IPv4 and IPv6
# echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables
# echo 0 > /proc/sys/net/bridge/bridge-nf-call-ip6tables
# # enable filtering on bridge ports for IPv4 and IPv6
# echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables
# echo 1 > /proc/sys/net/bridge/bridge-nf-call-ip6tables
The current value can be displayed from the container CLI:
vsr> show state / system network-stack bridge call-ipv4-filtering
vsr> show state / system network-stack bridge call-ipv6-filtering
Enabling log filter target in all containers¶
By default, the log filter target is disabled inside containers. It can only be enabled from the node.
Use the following command on the node to enable the log filter target:
# echo 1 > /proc/sys/net/netfilter/nf_log_all_netns
The filter rules that use a log target in the container won’t generate any kernel log if nf_log_all_netns is set to 0.
Configuring ARP/NDP parameters¶
The default configuration of the ARP/NDP network stack is global to the node. The maximum number of ARP/NDP entries cannot be configured from a container; therefore, the following configuration items will be ignored:
system network-stack neighbor ipv4-max-entries
system network-stack neighbor ipv6-max-entries
vrf main network-stack neighbor ipv4-max-entries
vrf main network-stack neighbor ipv6-max-entries
The default time during which a neighbor entry stays reachable is also not configurable from inside a container. The following configuration items won’t work properly and should not be used:
system network-stack neighbor ipv4-base-reachable-time
system network-stack neighbor ipv6-base-reachable-time
The default ARP/NDP stack configuration can be customized on the node, for instance with the following commands:
# echo 10 > /proc/sys/net/ipv4/neigh/default/base_reachable_time
# echo 10 > /proc/sys/net/ipv6/neigh/default/base_reachable_time
# # set thresholds for a maximum of 2048 neighbors
# echo 1024 > /proc/sys/net/ipv4/neigh/default/gc_thresh1
# echo 2048 > /proc/sys/net/ipv4/neigh/default/gc_thresh2
# echo 4096 > /proc/sys/net/ipv4/neigh/default/gc_thresh3
# echo 1024 > /proc/sys/net/ipv6/neigh/default/gc_thresh1
# echo 2048 > /proc/sys/net/ipv6/neigh/default/gc_thresh2
# echo 4096 > /proc/sys/net/ipv6/neigh/default/gc_thresh3
Configuring IPv6 max route cache¶
Before Linux version 5.16, the maximum size of the IPv6 route cache is global to the node. The default value is 4096, which is enough for most use cases.
If you plan to deal with a large number of IPv6 routes (for example, a full IPv6 routing table), the value should be increased on the node:
# echo 16384 > /proc/sys/net/ipv6/route/max_size
A lower value can cause performance issues or packet drops when a large number of IPv6 routes are used.
Setting system clock¶
Setting the system clock from inside a container is not recommended.
For specific use cases where it makes sense, it requires the CAP_SYS_TIME capability. In this case, setting the system clock, either manually or through an NTP client configuration, impacts the whole node.
When the capability is disabled, setting the system clock will fail, and NTP configurations that set the system time will be ineffective. It is still possible to act as an NTP server.
Linux kernel security modules¶
Depending on the distribution running on the host, a security module like AppArmor or SELinux may be enabled, preventing access to files that are needed to run Virtual Service Router.
We recommend that you first validate without these modules, then build your own configuration.
For instance, on Ubuntu 22.04, AppArmor can be disabled with:
# aa-teardown
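Similarly, on Red Hat Enterprise Linux 8, SELinux can be switched to permissive mode for a first validation (a sketch; build a proper SELinux policy for production):
# setenforce 0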
Using Network Virtual Functions¶
Some physical PCI Express network devices can be shared among containers through Virtual Functions.
Note
We recommend that you use the latest driver and the latest firmware available for your device.
VLAN strip¶
On some hardware (like the Intel 82599, X5xx or X7xx families), if the VF is configured with a VLAN on the host, the VLAN tags are visible in the container when a network packet is received. To disable this behavior, the fast path can be configured from the CLI:
vsr running config# system fast-path advanced vlan-strip true
vsr running config# commit
Using L2 features on VFs¶
Using L2 features (VLAN, bridge, LAG, …) on VFs may require additional VF configuration to enable the promiscuous mode or to change the MAC address from inside the container. On most hardware, doing this is not allowed if the VF is not trusted.
To enable trust mode, use the following command on the host:
# ip link set dev PF_NAME vf VF_NUM trust on
It may also be necessary to disable spoof checking, in case the container is allowed to send packets with the same MAC address or the same VLAN as the PF or another VF:
# ip link set dev PF_NAME vf VF_NUM spoofchk off
On some NICs (Intel X540 and X550), the PF must also be in promiscuous mode to enable promiscuous mode on the VF:
# ip link set dev PF_NAME promisc on
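The resulting VF settings (trust, spoof checking, MAC address, VLAN) can be checked on the host with:
# ip link show dev PF_NAME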
Note
Some features are not available on some devices. For instance, it is not possible to enable a true promiscuous mode on a VF when using an Intel 82599 or Intel X520 NIC. Refer to the datasheet of your device.
Large MTU¶
The MTU of the VF cannot be higher than the MTU of the PF. To increase the MTU of the PF, use this command on the host:
# ip link set dev PF_NAME mtu 9000
Malicious Driver Detection not supported on Intel X550¶
The Intel X550 series NICs support a feature called MDD (Malicious Driver Detection), which checks the behavior of the VF driver. This feature is not supported by Virtual Service Router and must be disabled. This is a known issue of the DPDK driver.
To disable MDD, use the following command on the host, then reload the ixgbe driver:
# echo "options ixgbe MDD=0,0" > "/etc/modprobe.d/ixgbe-mdd.conf"