Need help tuning Suricata to 10Gbps

alex_cs · November 16, 2021, 4:27am

Hi all,

I am following pevma/SEPTun-Mark-II (Suricata Extreme Performance Tuning guide - Mark II) on GitHub to build a Suricata NSM.
I expect it to handle 10Gbps, but my setup currently handles only about 6Gbps with 0% kernel drops. When I try to increase the traffic to 7Gbps, kernel drops start to climb, and I have seen them reach 50%. Any help or general optimization tips would be appreciated!

Here is my setup:

Suricata 6.0.2 built from source
OS: Debian 9.5
Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
Stepping:              7
CPU MHz:               2400.000
BogoMIPS:              4800.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              16896K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47

64GB RAM
HPE Ethernet 10Gb 2-port 557SFP+. The card is installed in NUMA node 1. I am only using one port for now. The NIC is configured as follows:

ifconfig enp175s0f1 down
ethtool -L enp175s0f1 combined 16
ethtool -K enp175s0f1 rxhash on
ethtool -K enp175s0f1 ntuple on
ifconfig enp175s0f1 up
./set_irq_affinity 17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47 enp175s0f1
ethtool -X enp175s0f1 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 16
ethtool -A enp175s0f1 rx off tx off
ethtool -C enp175s0f1 adaptive-rx off adaptive-tx off
ethtool -G enp175s0f1 rx 1024
for proto in tcp4 udp4 tcp6 udp6; do
echo "ethtool -N enp175s0f1 rx-flow-hash $proto sdfn"
ethtool -N enp175s0f1 rx-flow-hash $proto sdfn
done

I am using the DPDK-Pktgen tool to replay pcap files to the mirror port of the NSM.
AF-packet configuration in suricata.yaml:

af-packet:
  - interface: enp175s0f1
    threads: 16
    cluster-id: 99
    cluster-type: cluster_qm
    defrag: yes
    use-mmap: yes
    mmap-locked: yes
    tpacket-v3: yes
    ring-size: 600000
    block-size: 1048576

CPU affinity in suricata.yaml

cpu-affinity:
    - management-cpu-set:
        cpu: [ "1,3,5,7,9,11,13,15" ]  # include only these CPUs in affinity settings
    - receive-cpu-set:
        cpu: [ "0-10" ]  # include only these CPUs in affinity settings
    - worker-cpu-set:
        cpu: [ "17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47" ]
        mode: "exclusive"
        # Use explicitly 3 threads and don't compute number by using
        # detect-thread-ratio variable:
        #threads: 12
        prio:
          #low: [ 0 ]
          medium: [ "0-3" ]
          high: [ "17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47" ]
          default: "high"

suricata_tuning_1card.yaml (72.7 KB)

suricatalfon · November 16, 2021, 6:35am

Hi,

Let’s see if this documentation can help you:

https://suricata.readthedocs.io/en/latest/performance/high-performance-config.html

ulimit · November 16, 2021, 7:46am

Just wondering, which ruleset do you use?

During my tests I found that rule processing takes a lot of CPU time (in IDS mode at least).
For example, L3/L4 decode and stream event rules (the stream-event and decode-event keywords) added significant load on the test traffic I ran.
You can also check the number of alerts triggered; a high alert volume also hurts performance.

suricatalfon · November 16, 2021, 12:33pm

Hi,

Datasets, etc. also penalize performance a lot.

Jeff_Lucovsky · November 16, 2021, 1:33pm

Hi – thanks for posting!

I’d suggest a few things:

Enable threaded eve.json logging (eve-log.threaded = true).
Allocate fewer than 16 cores to the management-cpu-set.
Experiment with different worker CPU layouts where a hyperthread and its real core are not both handling worker loads.
Also, try balancing the worker threads across both NUMA nodes. This will introduce some (minor) extra latency, but it shouldn’t negatively impact a 10Gbps load.
Once a worker CPU layout shows good results, consider isolating the cores from the Linux scheduler.
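
For reference, the threaded setting mentioned above sits under the eve-log output in suricata.yaml. This is only a minimal sketch, assuming the default regular-file output and the default eve.json filename; the rest of the eve-log section stays as it is in your file.

```yaml
outputs:
  - eve-log:
      enabled: yes
      filetype: regular
      filename: eve.json
      # One output file per thread; file names gain an identifier,
      # e.g. eve.9.json, so downstream log shippers need to pick up
      # all of them.
      threaded: true
```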

Others have asked about rulesets and logging loads. If both are “high” you might want to consider the payload settings in types.alert and disable any that aren’t strictly needed.

Anomaly logging can be very useful for exposing issues, but if the number of records in eve.json with event_type == anomaly is excessive, consider disabling that setting.
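
As a rough illustration of the two logging suggestions, the fragment below turns off the optional alert payload settings and disables anomaly records. It is a sketch rather than a recommended final config: it assumes none of these extras are needed for your analysis, and a real types list would also contain the other event types you log.

```yaml
outputs:
  - eve-log:
      enabled: yes
      types:
        - alert:
            payload: no              # Base64 payload dump
            payload-printable: no    # printable (lossy) payload dump
            packet: no               # packet dump without stream segments
            http-body: no            # HTTP body in Base64
            http-body-printable: no  # HTTP body in printable format
        - anomaly:
            # Disable only if anomaly records turn out to be excessive
            enabled: no
```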

Finally, an htop listing showing the custom thread names would be helpful.

alex_cs · November 24, 2021, 5:40am

Hi all,

Thank you for the advice.

The rulesets I use are ET Pro and Secureworks, with 26535 rules in total.
I tried using two NICs installed in separate NUMA nodes. With that setup, performance reaches 10Gbps, with 5Gbps on each NIC, and the drop rate is <1%.
But I could not increase performance on a single NIC. After enabling threaded eve.json logging (eve-log.threaded = true), I found a problem: the number of packets entering thread #9 is much higher than for the other threads, which causes a high kernel_drops rate:

capture.kernel_packets                        | W#01-enp59s0f0            | 269238563
capture.kernel_packets                        | W#02-enp59s0f0            | 426517029
capture.kernel_packets                        | W#03-enp59s0f0            | 310534759
capture.kernel_packets                        | W#04-enp59s0f0            | 407302256
capture.kernel_packets                        | W#05-enp59s0f0            | 296126720
capture.kernel_packets                        | W#06-enp59s0f0            | 605689746
capture.kernel_packets                        | W#07-enp59s0f0            | 260600824
capture.kernel_packets                        | W#08-enp59s0f0            | 388083100
capture.kernel_packets                        | W#09-enp59s0f0            | 876617630
capture.kernel_drops                          | W#09-enp59s0f0            | 453771583
capture.kernel_packets                        | W#10-enp59s0f0            | 229944342
capture.kernel_packets                        | W#11-enp59s0f0            | 390744484
capture.kernel_packets                        | W#12-enp59s0f0            | 191769694
capture.kernel_packets                        | W#13-enp59s0f0            | 292353861
capture.kernel_packets                        | W#14-enp59s0f0            | 239483969
capture.kernel_packets                        | W#15-enp59s0f0            | 394888232
capture.kernel_packets                        | W#16-enp59s0f0            | 333460293

Does this problem come from my test data, or is there some other solution for it? Thank you!

Jeff_Lucovsky · November 27, 2021, 1:53pm

Thread 9 (W#09) is receiving a disproportionate share of the ingress traffic. The NIC hashes ingress network traffic to one of 16 queues, and each queue is then read and processed by the Suricata worker thread handling it. The imbalance could have several causes:

“Elephant flows” (long-lived flows carrying lots of traffic)
Lots of smaller, short-lived connections (e.g., DNS communication)

Have you tried any of the CPU core suggestions I gave earlier?

It’s also hard to follow the details of the system(s) on which you’re running Suricata. The first post showed enp175s0f1 as the network interface; the most recent showed enp59s0f0.

alex_cs · November 29, 2021, 3:35am

Hi Jeff,

Sorry for the confusion. enp175s0f1 belongs to the first NIC, which is installed in NUMA node 1. I then tried a second NIC, enp59s0f0, in NUMA node 0.

Tested individually, each NIC reaches 6Gbps with a 0% drop rate. But with the two NICs combined, I can’t get 12Gbps, only about 10Gbps, and the drop rate rises to 4-5%. What could be causing this?

As I understand it, the CPU cores for each NIC’s worker threads are separated by NUMA node. Please correct me if I am wrong.

Jeff_Lucovsky · November 29, 2021, 1:59pm

There are several things to consider:

Worker core layout – does each Suricata worker thread have its own core? Is the core shared with a hyperthread? Is the core isolated from Linux scheduling (see the isolcpus boot parameter)?
Worker thread count – for 20Gbps, you should have 20-30 worker threads. Of course, the exact number depends on the type of traffic, the network protocols used, and other deployment factors including positioning: is Suricata seeing North/South and/or East/West traffic? The first is traffic between the internet (“North”) and your intranet. The second is traffic within your intranet.
Linux IRQ handling of network interrupts – you’re using set_irq_affinity, so ideally the Suricata worker threads would be affined to those cores.
Flow distribution – there are often “elephant flows” within an organization’s intranet. These can be excluded from the traffic shown to Suricata; look at Suricata’s BPF capabilities.
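
To make the BPF point concrete, a capture filter can be set per interface in the af-packet section. The sketch below is illustrative only: the two addresses are hypothetical placeholders (documentation range) standing in for the endpoints of a known elephant flow, and the interface name is taken from the first post.

```yaml
af-packet:
  - interface: enp175s0f1
    # ...existing settings for this interface stay unchanged...
    # pcap filter syntax; traffic between the two hypothetical hosts
    # is dropped before it reaches the Suricata workers
    bpf-filter: "not (host 192.0.2.10 and host 192.0.2.20)"
```

Anything excluded this way is invisible to detection, so it is worth limiting the filter to flows you are comfortable not inspecting.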

Sean · November 29, 2021, 8:32pm

I’ll try to throw suggestions up as time permits.

If the NIC is on NUMA node 1, change the IRQ affinity to CPU3 only. Make sure CPU3 isn’t listed anywhere else in the config.
Change the worker CPU set to the NUMA node 1 CPUs that are:

Not the CPU the NIC is using to service interrupts - so NOT CPU3
Not the HT CPU of CPU3 - so NOT CPU 27
Keep the first CPU of the NUMA node free for housekeeping, and possibly its HT CPU as well - so NOT CPU1.

I would suggest for worker-CPU set: 5 7 9 11 13 15 17 19 21 23 29 31 33 35 37 39 41 43 45 47

Use the same list of CPUs for the prio field.

Change management to CPU 25 - IIRC you don’t need much horsepower for that.
Remove the CPUs assigned to the receive-cpu-set, as I don’t think it is active in workers mode. Those lines can be commented out.
Change threads to the number of CPUs in the worker CPU set - 20.
Change the AF-PACKET threads to match - again 20.
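
Put together, the suggestions above might look roughly like this in suricata.yaml. Treat it as a sketch: it assumes the NIC on NUMA node 1 (enp175s0f1) and the HT pairing implied above (CPU n with CPU n+24), lists the CPUs as individual entries, and leaves out every setting that was not discussed.

```yaml
threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 25 ]
    # receive-cpu-set removed; not expected to be used in workers mode
    - worker-cpu-set:
        # NUMA node 1 CPUs, minus 1 and 25 (housekeeping/management)
        # and 3 and 27 (NIC interrupts and its HT sibling)
        cpu: [ 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47 ]
        mode: "exclusive"
        threads: 20
        prio:
          high: [ 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47 ]
          default: "high"

af-packet:
  - interface: enp175s0f1
    threads: 20
    # ...remaining af-packet settings as in the first post...
```

One thing to double-check: the first post configured 16 combined queues with ethtool -L and uses cluster_qm, which ties each worker to an RSS queue, so the queue count would likely need to be revisited along with the thread count.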