3.4. System Information

3.4.1. CPU Pinning for VMs

For each virtual CPU, QEMU uses one pthread for actually running the VM and pthreads for management. For best performance, you need to make sure cores used to run fast path dataplane are only used for that.

To get the threads associated with each virtual CPU, use info cpus in QEMU monitor console:

QEMU 2.3.0 monitor - type 'help' for more information
(qemu) info cpus
* CPU #0: pc=0xffffffff8104f596 (halted) thread_id=26773
  CPU #1: pc=0x00007faee19be9f9 thread_id=26774
  CPU #2: pc=0xffffffff8104f596 (halted) thread_id=26775
  CPU #3: pc=0x0000000000530233 thread_id=26776

To get all threads associated with your running VM (including management threads) and what CPU they are currently pinned on, call:

# taskset -ap <qemu_pid>
pid 26770's current affinity mask: f
pid 26771's current affinity mask: f
pid 26773's current affinity mask: f
pid 26774's current affinity mask: f
pid 26775's current affinity mask: f
pid 26776's current affinity mask: f
pid 27053's current affinity mask: f

By pinning our VM on a specific set of cores, we ensure less overload.

You may either run qemu with a specific set of cores when starting, using:

# taskset -c 0-1 <qemu command>

You may also pin a VM after it has been started, using the PID of its threads. For instance, to change the physical CPU on which to pin the virtual CPU #0, use:

# taskset -cp 0-1 26773
pid 26773's current affinity list: 0-3
pid 26773's new affinity list: 0,1

Note

Refer to the taskset manpage for specific options.

When using libvirt, you may use <cputune> with vcpupin to pin virtual CPUs to physical ones. e.g.:

<vcpu cpuset='7-8,27-28'>4</vcpu>
<cputune>
  <vcpupin vcpu="0" cpuset="7"/>
  <vcpupin vcpu="1" cpuset="8"/>
  <vcpupin vcpu="2" cpuset="27"/>
  <vcpupin vcpu="3" cpuset="28"/>
</cputune>

Note

Refer to the libvirt Domain XML format documentation for further details.

We can look at htop results (after filtering results for this qemu instance) to confirm what threads are actually used:

  PID USER       VIRT   RES   SHR S CPU% MEM%   TIME+  NLWP Command
26770 mazon     7032M 4067M  7092 S 200. 25.5  2h19:55    7 |- qemu-system-x86_64 -daemonize --enable-kvm -m 6G -cpu host -smp sockets=1,cores=4,threads=1 ...
27053 mazon     7032M 4067M  7092 S  0.0 25.5  0:01.13    7 |  |- qemu-system-x86_64 -daemonize --enable-kvm -m 6G -cpu host -smp sockets=1,cores=4,threads=1 ...
26776 mazon     7032M 4067M  7092 R 99.1 25.5  1h04:38    7 |  |- qemu-system-x86_64 -daemonize --enable-kvm -m 6G -cpu host -smp sockets=1,cores=4,threads=1 ...
26775 mazon     7032M 4067M  7092 S  0.9 25.5  2:48.21    7 |  |- qemu-system-x86_64 -daemonize --enable-kvm -m 6G -cpu host -smp sockets=1,cores=4,threads=1 ...
26774 mazon     7032M 4067M  7092 R 99.1 25.5  1h09:42    7 |  |- qemu-system-x86_64 -daemonize --enable-kvm -m 6G -cpu host -smp sockets=1,cores=4,threads=1 ...
26773 mazon     7032M 4067M  7092 S  0.0 25.5  2:23.03    7 |  |- qemu-system-x86_64 -daemonize --enable-kvm -m 6G -cpu host -smp sockets=1,cores=4,threads=1 ...
26771 mazon     7032M 4067M  7092 S  0.0 25.5  0:00.00    7 |  |- qemu-system-x86_64 -daemonize --enable-kvm -m 6G -cpu host -smp sockets=1,cores=4,threads=1 ...

You may even change CPU affinity by typing a when on a specific PID line in htop.

Similarly, you can get threads PID by looking in /proc/<pid>/task/, e.g.:

# ls /proc/26773/task
26770/  26771/  26773/  26774/  26775/  26776/  27053/

3.4.2. fp-cli dpdk-port-stats

The fp-cli dpdk-port-stats command is used to display and set options related to network drivers (for those that support it).

To display the statistics for a given port, use fp-cli dpdk-port-stats <port>:

# fp-cli dpdk-port-stats <port>

     rx_good_packets: 261064663
     tx_good_packets: 256512062
     rx_good_bytes: 15663879780
     tx_good_bytes: 15390725600
     rx_missed_errors: 0
     rx_errors: 36554346
     tx_errors: 0
     rx_mbuf_allocation_errors: 0
     rx_q0_packets: 32632951
     rx_q0_bytes: 1957977060
     rx_q0_errors: 0
     ...
     tx_q0_packets: 128251039
     tx_q0_bytes: 7695062334
     ...
     rx_total_packets: 297619011
     rx_total_bytes: 17857140760
     tx_total_packets: 256512062
     tx_size_64_packets: 256512046
     ...

The most important stats to look at are the {r,t}x_good_{packets,bytes} and errors.

They indicate globally how well the port is handling packets.

There is also per queue statistics that might be interesting in case of multiqueue. It’s better to have packets transmitted on as many different queues as possible, but it depends on various factors, such as the IP addresses and UDP / TCP ports.

The drop statistics provide useful information about why packets are dropped. For instance, the rx_missed_errors counter represents the number of packets dropped because the CPU was not fast enough to dequeue them. A non-zero value for rx_mbuf_allocation_errors shows that there is not enough mbuf structure configured in the fast path.

Note

Statistics field names may vary considering the driver.

fp-cli dpdk-port-offload can be used to check whether offload is enabled, using the following:

# fp-cli dpdk-port-offload <port>

     TX vlan insert off [fixed]
     TX IPv4 checksum off [fixed]
     TX TCP checksum off [fixed]
     TX UDP checksum off [fixed]
     TX SCTP checksum off [fixed]
     TX TSO off [fixed]
     TX UFO off [fixed]
     RX vlan strip off
     RX vlan filter off
     RX IPv4 checksum on
     RX TCP checksum on
     RX UDP checksum on
     RX MPLS IP off
     RX LRO off
     RX GRO off

If you want to enable TSO (which should provide you with better performance for TCP, as the hardware will handle the segmentation), use:

# fp-cli dpdk-port-offload-set eth1 tso on

Note

You can get various error messages when trying to change hardware parameters. For instance, Cannot change tcp-segmentation-offload may appear if the driver does not support to dynamically change TSO offload.

3.4.3. lspci

The lspci command is useful to display information about PCI buses. In most cases, we look for “Ethernet” devices.

Use lspci to get basic information on all connected devices:

# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)

To display the driver handling devices, use:

# lspci -k
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
        Subsystem: Red Hat, Inc Qemu virtual machine
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
        Subsystem: Red Hat, Inc Qemu virtual machine
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
        Subsystem: Red Hat, Inc Qemu virtual machine
        Kernel driver in use: ata_piix
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
        Subsystem: Red Hat, Inc Qemu virtual machine
00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
        Subsystem: Red Hat, Inc Device 1100
00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
        Subsystem: Red Hat, Inc QEMU Virtual Machine
        Kernel driver in use: igb_uio

Note

Refer to the lspci manpage for specific options.

3.4.4. lstopo

lstopo provides a global view of the system’s topology. It details what machines contain what nodes, containing sockets, containing cores, containing processor units.

The following command presents a graphical representation of a big machine’s topology:

# lstopo --of png > lstopo_output.png
../_images/lstopo.png

You can use the following command to get a textual representation:

# lstopo --of txt

3.4.5. meminfo

The file /proc/meminfo presents a memory status summary. You can also look at memory by node through /sys/devices/system/node/node<node_id>/meminfo.

On a VM with 1GB of RAM running redhat-7, we have this:

# cat /proc/meminfo
MemTotal:        1016548 kB
MemFree:          107716 kB
MemAvailable:     735736 kB
Buffers:           83244 kB
Cached:           626528 kB
SwapCached:            0 kB
Active:           400416 kB
Inactive:         352892 kB
Active(anon):      49808 kB
Inactive(anon):    13304 kB
Active(file):     350608 kB
Inactive(file):   339588 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         43652 kB
Mapped:             9500 kB
Shmem:             19572 kB
Slab:             130972 kB
SReclaimable:     100896 kB
SUnreclaim:        30076 kB
KernelStack:        1888 kB
PageTables:         2692 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      507248 kB
Committed_AS:     214004 kB
VmallocTotal:   34359738367 kB
VmallocUsed:        4412 kB
VmallocChunk:   34359730912 kB
HardwareCorrupted:     0 kB
AnonHugePages:      4096 kB
HugePages_Total:       1
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       79744 kB
DirectMap2M:      968704 kB

Note

The kernel documentation provides details regarding meminfo here. For details regarding the HugePages fields, look there.

The most interesting fields in our case are:

MemTotal

should be the same as the “total” memory displayed on top lines when running top

MemFree

should be the same as the “free” memory displayed on top lines when running top

MemAvailable

estimate of how much memory is available for starting new applications, without swapping

HugePages_Total

size of the pool of huge pages

HugePages_Free

number of huge pages in the pool that are not yet allocated

HugePages_Rsvd

number of huge pages for which a commitment to allocate from the pool has been made, but no allocation has yet been made

3.4.6. numastat

This tool shows per-NUMA-node memory statistics for processes and the operating system.

Without argument, it displays per-node NUMA hit and miss system statistics from the kernel memory allocator. A high value in other_node means that there are cross-NUMA memory transfers, which impacts performance. This information is dynamic and can be monitored with the watch command.

# numastat
                           node0           node1
numa_hit               589246433       556912817
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit             11616           17088
local_node             589229023       556900289
other_node                 17410           12528

When a PID or a pattern is passed, it shows per-node memory allocation information for the specified process (including all its pthreads). The hugepages correspond to the DPDK memory, and the private area mainly corresponds to the shared memories.

# numastat fp-rte
Per-node process memory usage (in MBs) for PID 2176 (fp-rte:8)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                       842.00          842.00         1684.00
Heap                         0.41            0.00            0.41
Stack                        0.03            0.00            0.03
Private                   1004.35           24.27         1028.62
----------------  --------------- --------------- ---------------
Total                     1846.79          866.27         2713.06

Note

Refer to the numa manpage for details.

3.4.7. Scheduler statistics

The fast path dataplane threads are designed to run on dedicated cores. An interruption of a fast path thread will prevent the Rx queues to be polled. If this interruption lasts too long, it can induce packet drops.

This condition is enforced by the cpuset.sh script which is started by the fast path service, which prevents all other tasks to run on cores reserved for the fast path. It is enabled by default, see /etc/cpuset.env for its configuration.

Some tasks may still run on the control plane cores, especially the kernel threads, which may not be an issue if the duration of the kernel task is not too high. The number of context switches can be monitored in /proc/<pid>/task/<tid>/status:

# grep 'Name\|ctxt\|Cpus' /proc/$(pidof fp-rte)/task/*/status
/proc/2176/task/2176/status:Name:       fp-rte:8
/proc/2176/task/2176/status:Cpus_allowed:       000100
/proc/2176/task/2176/status:Cpus_allowed_list:  8
/proc/2176/task/2176/status:voluntary_ctxt_switches:    65
/proc/2176/task/2176/status:nonvoluntary_ctxt_switches: 19883672
/proc/2176/task/2177/status:Name:       eal-intr-thread
/proc/2176/task/2177/status:Cpus_allowed:       fffcff
/proc/2176/task/2177/status:Cpus_allowed_list:  0-7,10-23
/proc/2176/task/2177/status:voluntary_ctxt_switches:    10
/proc/2176/task/2177/status:nonvoluntary_ctxt_switches: 0
/proc/2176/task/2178/status:Name:       fp-rte:9
/proc/2176/task/2178/status:Cpus_allowed:       000200
/proc/2176/task/2178/status:Cpus_allowed_list:  9
/proc/2176/task/2178/status:voluntary_ctxt_switches:    237
/proc/2176/task/2178/status:nonvoluntary_ctxt_switches: 19740497
/proc/2176/task/2179/status:Name:       dpdk-ctrl-port
/proc/2176/task/2179/status:Cpus_allowed:       fffcff
/proc/2176/task/2179/status:Cpus_allowed_list:  0-7,10-23
/proc/2176/task/2179/status:voluntary_ctxt_switches:    3514703
/proc/2176/task/2179/status:nonvoluntary_ctxt_switches: 208025
/proc/2176/task/2182/status:Name:       npf-mgmt
/proc/2176/task/2182/status:Cpus_allowed:       fffcff
/proc/2176/task/2182/status:Cpus_allowed_list:  0-7,10-23
/proc/2176/task/2182/status:voluntary_ctxt_switches:    9
/proc/2176/task/2182/status:nonvoluntary_ctxt_switches: 0

The number of context switches for the dataplane fast path threads should be as small as possible. The perf monitoring tool can give even more information about the cause and duration of these interruptions.

Example that monitors the CPU 8 (one of the fast path cores):

# perf sched record -C 8 -- sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.283 MB perf.data (954 samples) ]
# perf script --header
[...]
        fp-rte:8  2176 [008] 608356.309895: sched:sched_stat_runtime: comm=fp-rte:8 pid=2176 runtime=3999603 [ns] vruntime=5078834868400
        fp-rte:8  2176 [008] 608356.309898: sched:sched_stat_sleep: comm=kworker/8:2 pid=1373 delay=12000294 [ns]
        fp-rte:8  2176 [008] 608356.309899: sched:sched_wakeup: comm=kworker/8:2 pid=1373 prio=120 target_cpu=008
        fp-rte:8  2176 [008] 608356.309900: sched:sched_stat_sleep: comm=rcu_sched pid=8 delay=3949943 [ns]
        fp-rte:8  2176 [008] 608356.309900: sched:sched_wakeup: comm=rcu_sched pid=8 prio=120 target_cpu=004
        fp-rte:8  2176 [008] 608356.309902: sched:sched_stat_runtime: comm=fp-rte:8 pid=2176 runtime=4680 [ns] vruntime=507883486844717
        fp-rte:8  2176 [008] 608356.309902: sched:sched_stat_wait: comm=kworker/8:2 pid=1373 delay=0 [ns]
        fp-rte:8  2176 [008] 608356.309903: sched:sched_switch: prev_comm=fp-rte:8 prev_pid=2176 prev_prio=120 prev_state=R ==> next_com
     kworker/8:2  1373 [008] 608356.309904: sched:sched_stat_runtime: comm=kworker/8:2 pid=1373 runtime=6359 [ns] vruntime=1029668683487
     kworker/8:2  1373 [008] 608356.309905: sched:sched_stat_wait: comm=fp-rte:8 pid=2176 delay=6359 [ns]
     kworker/8:2  1373 [008] 608356.309905: sched:sched_switch: prev_comm=kworker/8:2 prev_pid=1373 prev_prio=120 prev_state=S ==> next_
        fp-rte:8  2176 [008] 608356.313893: sched:sched_stat_runtime: comm=fp-rte:8 pid=2176 runtime=3988661 [ns] vruntime=5078834908333

If this causes packet loss, it can be avoided by increasing the size of the Rx queues (example: FPNSDK_OPTIONS="--nb-rxd=4096" option in the fast path configuration file).

When running in a latency-sensitive environment, increasing the Rx queue size is not suitable, and kernel thread interruptions can be an issue. This can be tempered by:

  • Disabling the watchdog:

    # echo 0 > /proc/sys/kernel/watchdog
    
  • Booting the kernel with specific arguments to isolate fast path cores.

    Example with FP_MASK=8-9:

    isolcpus=8-9 nohz_full=8-9 rcu_nocbs=8-9
    

    isolcpus isolates the CPUs from the general scheduler. nohz_full will stop the scheduler tick whenever possible (CONFIG_NO_HZ_FULL=y must be set). rcu_nocbs prevents rcu callbacks to be scheduled on these CPUs (CONFIG_RCU_NOCB_CPU=y must be set).

Note

Refer to the Linux kernel documentation for details.