Verifying correctness of tuning configuration

Hello, and thanks for running a forum with such amazing resources! I am working on tuning Suricata towards multiple tens of Gbit/s and I’m learning about the different aspects involved. I have already found that CPU affinity and isolating CPUs have the biggest impact on performance, probably because context switches and unnecessary cache misses are avoided.

I’m working on NIC settings as well, and I think I have it right. However, I would like to verify that I do, and if not, any feedback is welcome!

I’m automating the installation and maintenance of sensors with Ansible so it can be done at scale, and sensors can have more than one monitored interface. While working with Intel NICs like the X710 and XL710 (both i40e), I now have the following script executed for the interfaces.

#!/bin/bash

# Set interface names from Ansible using template.
nics="eth0 eth1 ... ethN"
num_queues=number_of_threads_per_interface

for nic in $nics; do
        # Set the number of queues on each interface to the number of worker threads
        # available per interface. By configuration, the workers are divided evenly over the interfaces.
        ethtool -L $nic combined $num_queues

        # Set symmetric hashing to enabled and enable ntuple.
        ethtool -K $nic rxhash on
        ethtool -K $nic ntuple on

        # Set the symmetric hash key to use, and specify the number of queues at the end.
        ethtool -X $nic hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal $num_queues

        # Disable flow control (pause frames), fix interrupt coalescing at 125 usecs, and enlarge the RX ring.
        ethtool -A $nic rx off
        ethtool -C $nic adaptive-rx off adaptive-tx off rx-usecs 125
        ethtool -G $nic rx 1024

        # Use the Toeplitz hash function; combined with the low-entropy key above this gives symmetric hashing.
        ethtool -X $nic hfunc toeplitz

        # Hash on source/destination IP and ports (sdfn) so the NIC balances flows across all queues.
        for proto in tcp4 udp4 tcp6 udp6; do
                ethtool -N $nic rx-flow-hash $proto sdfn
        done
done
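
To verify that these settings actually stick, I also read them back afterwards with something like the following (read-only, it just prints the current state; the interface list is the same placeholder as above):

#!/bin/bash

# Read back the settings applied by the script above; nothing here changes state.
nics="eth0 eth1 ... ethN"

for nic in $nics; do
        echo "=== $nic ==="
        ethtool -l $nic                     # channel (queue) counts
        ethtool -x $nic                     # RSS indirection table and hash key
        ethtool -k $nic | grep -E 'receive-hashing|ntuple'
        ethtool -a $nic                     # pause frame (flow control) state
        ethtool -c $nic | grep -E 'Adaptive|rx-usecs:'
        ethtool -g $nic                     # ring sizes
        ethtool -n $nic rx-flow-hash tcp4   # hash fields (repeat for udp4/tcp6/udp6)
done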

The sensors use a 32-core AMD EPYC processor with SMT (AMD’s equivalent of Hyper-Threading) enabled, so I have 64 logical cores. I read that it is fine to leave SMT enabled, so I did. Of those 64 logical cores, I isolated cores 4-63 using isolcpus and left the first four (0-3) for system tasks, interrupts, and non-worker Suricata threads. That leaves 60 cores for workers, and:

  • I can create an af-packet section for every interface, each with a different cluster-id and with cluster_qm as the cluster-type, since cluster_qm maps each worker thread to an RSS queue on the NIC;
  • I will divide 60 by the number of monitored interfaces and set the result as the number of threads under each af-packet section;
  • Under CPU affinity, I will create “sections” of CPU cores instead of putting [ “4-63” ], since I have multiple af-packet sections. For example, for two interfaces, I would set [ “4-33”, “34-63” ].

With this configuration, Suricata should map the 30 workers of one interface to one set of 30 cores, and the 30 workers of the other interface to the remaining 30 cores. Is that correct?
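
To make it concrete, this is roughly the suricata.yaml I have in mind for the two-interface case (the interface names and cluster-id values are just placeholders):

af-packet:
  - interface: eth0
    threads: 30
    cluster-id: 97
    cluster-type: cluster_qm
    defrag: no
    use-mmap: yes
    tpacket-v3: yes
  - interface: eth1
    threads: 30
    cluster-id: 98
    cluster-type: cluster_qm
    defrag: no
    use-mmap: yes
    tpacket-v3: yes

threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ "0-3" ]
    - worker-cpu-set:
        cpu: [ "4-33", "34-63" ]
        mode: "exclusive"
        prio:
          default: "high"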

And regarding set_irq_affinity: with a modern AMD CPU I have only one NUMA node, so if my sensor has only one monitored interface, set_irq_affinity local eth0 should do, if it is necessary at all. However, if I have multiple interfaces, should I configure a CPU list per interface instead?

set_irq_affinity_cpulist 4-33 eth0
set_irq_affinity_cpulist 34-63 eth1

Is that correct?
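
For checking where the interrupts actually land afterwards, I plan to read the affinity back from /proc, roughly like this (eth0 is again a placeholder; the i40e driver names its vectors after the interface, so grepping for the interface name should catch them):

#!/bin/bash

# Show which CPUs each of the interface's IRQs is allowed to run on.
nic=eth0

for irq in $(grep "$nic" /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
        echo "IRQ $irq -> CPUs $(cat /proc/irq/$irq/smp_affinity_list)"
done

# Watch the per-CPU interrupt counters while traffic is flowing:
# watch -n1 "grep $nic /proc/interrupts"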

Then, if possible, I have a couple of side questions:

  • Are there any performance tuning guides available for Mellanox cards like the ConnectX-6? There is plenty of information available for Intel, but I can’t seem to find any extensive guides for Mellanox. For 100 Gbit/s I am deciding whether to go with Mellanox or to test the E810 instead.
  • Will the performance tuning guide for Intel also apply to the E810?

Thanks in advance!

-Gijs