Tuning Recommendations

Suricata version: 6.0.14
OS: Oracle Linux 8
Compiled and packaged into an RPM
Hardware:

  • Dell Poweredge R7615
  • AMD EPYC 9754 128-Core Processor
  • Mellanox/Nvidia ConnectX-6 Dx 2x 100G QSFP56 OCP3.0 SFF

Looking for tuning recommendations for the above specs. I'm having considerable trouble getting Suricata to stop dropping packets at the kernel. This is the first time I won't have a dual-CPU box, since this CPU is more than capable on its own. As such, I haven't set CPU affinity, since I only have one NUMA node (I'm assuming I don't need to configure additional NUMA nodes since I only have one CPU). I'm wondering if this is my issue, since I'm not familiar with single-CPU configurations.
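For reference, the single-node NUMA layout can be double-checked with the commands below; numactl may need to be installed separately, so treat this as a quick sanity-check sketch rather than anything authoritative.

# show socket count, NUMA node count and the CPU range in each node
lscpu | grep -E 'Socket|NUMA'
# per-node CPU lists and memory (requires the numactl package)
numactl --hardware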

I’ve set the maximum number of RSS queues this NIC supports (63). Some recommendations say to scale down to 1 RSS queue, but when I do that, the NIC drops packets. I’ve tried setting the thread count to match the RSS queues and also running fewer threads than RSS queues, but neither is a good fit. I’ve followed pretty much all of the recommendations outlined in 11.5. High Performance Configuration — Suricata 8.0.0-dev documentation
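As a reference for anyone following along, the RSS queue count is usually inspected and changed with ethtool; the interface name and queue count below are placeholders, not the values from this box.

# show the current and maximum number of combined RX/TX queues
ethtool -l <interface>
# scale the NIC to N combined queues (some drivers want the link down while changing this)
ethtool -L <interface> combined <N>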

af-packet:
  - threads: 48
    cluster-id: 99
    cluster-type: cluster_qm
    defrag: yes
    use-mmap: yes
    tpacket-v3: yes
    ring-size: 10000
    use-emergency-flush: yes
stream:
  memcap: 36gb
  #memcap-policy: ignore
  checksum-validation: no      # reject incorrect csums
  prealloc-sessions: 250000
  #midstream: false
  #midstream-policy: ignore
  inline: no                  # auto will use inline mode in IPS mode, yes or no set it statically
  bypass: yes
  reassembly:
    memcap: 16gb
    #memcap-policy: ignore
    depth: 2mb                  # reassemble 2mb into a stream
    toserver-chunk-size: 2560
    toclient-chunk-size: 2560
    randomize-chunk-size: yes
    #randomize-chunk-range: 10
    #raw: yes
    #segment-prealloc: 2048
    #check-overlap-different-data: true
detect:
  profile: high
  custom-values:
    toclient-groups: 3
    toserver-groups: 25
  sgh-mpm-context: auto
  inspection-recursion-limit: 3000
defrag:
  memcap: 8gb
  # memcap-policy: ignore
  hash-size: 65536
  trackers: 65535 # number of defragmented flows to follow
  max-frags: 65535 # number of fragments to keep (higher than trackers)
  prealloc: yes
  timeout: 60
flow:
  memcap: 36gb
  #memcap-policy: ignore
  hash-size: 65536
  prealloc: 25000
  emergency-recovery: 30
  #managers: 1 # default to one flow manager
  #recyclers: 1 # default to one flow recycler thread

Any recommendations are welcome. For context, I have Zeek listening on the same NIC and it isn’t seeing any packet loss, so I’m hoping I just have some config overlooked. Thanks!

Suricata 6.0.x is EOL, so please upgrade to Suricata 7.0.x first.

Can you also share the stats.log and suricata.log to check for potential issues?

Is Zeek running in parallel on the same box?

Zeek is running in parallel on the same box, yes. Here are the concerning parts from the stats.log:

Date: 11/12/2024 -- 14:01:44 (uptime: 0d, 00h 25m 47s)
------------------------------------------------------------------------------------
Counter                                       | TM Name                   | Value
------------------------------------------------------------------------------------
capture.kernel_packets                        | Total                     | 98728260
capture.kernel_drops                          | Total                     | 7511613
tcp.segment_memcap_drop                       | Total                     | 34010768
tcp.reassembly_memuse                         | Total                     | 7604002888

No errors in the suricata.log. I’ll work on getting Suricata upgraded and will report back. What is the recommended stable version of Suricata 7.0.x?

Does it also happen while Zeek is not running?
Does it capture from the same interface?

That tcp.segment_memcap_drop counter is already an indicator that you might have to increase the memcaps further.

How much traffic do you receive?

You could also do a test run without signatures to see if it already drops without those.
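One way to do such a run, assuming the rest of the config stays the same, should be to point -S at an empty rule file so that nothing else gets loaded (the interface name and config path here are just examples):

# -S loads only the given rule file(s); /dev/null is empty, so zero signatures are loaded
suricata -c /etc/suricata/suricata.yaml --af-packet=eno12399 -S /dev/null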

Latest stable release is 7.0.7 as of today.

I have not run it with Zeek stopped, and yes, it captures on the same interface. I have not tried with no rules loaded. This is test HTTP traffic sent from an IXIA at 10Gbps, but I’m hoping to eventually scale this box up to at least 25Gbps or more. I will try with no Zeek and no rules, in addition to upgrading.

I strongly suggest starting with

  • CPU affinity for the Suricata worker threads
  • Boot with isolcpus so the Linux scheduler won’t use those cores (a minimal example is sketched below).
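On Oracle Linux 8 the kernel argument can be added with grubby; the core range here is only an example and should match whatever is assigned to the worker-cpu-set.

# reserve an example core range for Suricata workers, then reboot
grubby --update-kernel=ALL --args="isolcpus=8-55"
# after the reboot, confirm which cores the scheduler leaves alone
cat /sys/devices/system/cpu/isolated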

@Andreas_Herz no drops when there are no rules loaded; having Zeek on or off doesn’t make any difference.

@Jeff_Lucovsky got it. I thought I wouldn’t need to do that since the traffic wouldn’t be traversing different NUMA nodes. Our hardware team thought it would be easier for us not to need that, but if the recommendation is CPU affinity for the workers, then I’ll set that again. Thanks.

isolcpus will keep non-data plane processes off of the critical cores; affinity can help with caching.

NUMA is more about keeping processes and their memory close.

Peter Manev will present the 3rd edition of SEPTun this week in Madrid, and it will be available on OISF’s YouTube channel in the coming weeks/months. Watch for that; it’ll be a treasure trove of goodness related to performance tuning.

@Jeff_Lucovsky sounds good, I’ll be on the lookout for that. Any recommendations on how to set the CPU affinity? Since I only have 1 physical CPU, there’s only 1 NUMA node. I’m not exactly sure what to set or how many cores to dedicate to the worker threads.

See the threading.cpu_affinity section of the Suricata configuration.

Dedicate as many cores as you can, and make sure those cores are isolated from the Linux scheduler (isolcpus). Monitor the drop and ingress network traffic rates to get a sense of the exact number. There’s no specific formula, because network processing time depends heavily on your hardware, the traffic, the ruleset and other factors.
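A simple way to keep an eye on the drop rate while tuning might be something like the following; the stats.log path is the default install location and the interface name is just the one from this setup, so adjust both as needed.

# tail the most recent kernel packet/drop counters from stats.log
watch -n 10 "grep -E 'capture.kernel_(packets|drops)' /var/log/suricata/stats.log | tail -4"
# compare against the NIC's own discard counters
ethtool -S eno12399 | grep -iE 'drop|discard'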

@Jeff_Lucovsky @Andreas_Herz I was able to upgrade my Suricata installation to 7.0.7 as recommended, and I set isolcpus="160-256" in my kernel parameters (though I don’t want to isolate this many CPUs, as I have other resource-intensive processes running on this box).

I set my CPU affinity to:

threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ "all" ]
        mode: "balanced"
    - worker-cpu-set:
        cpu: [ "160-223" ]
        mode: "exclusive"
        prio:
          default: "high"

My last test with 72,515 signatures showed 10% kernel drops. If I drop the number of signatures to around 7,000, my kernel drop rate is 2%. I don’t see a significant change using cluster_qm vs. cluster_flow; cluster_flow is a bit worse.

I’ve uploaded my suricata.log and stats.log. I had to remove a lot of lines because this is a test box and I don’t have variables configured, so it throws a lot of rule errors. I plan to test with fewer CPUs isolated as well as fewer RSS queues. I did have one question about RSS queues, as I’m not too familiar with them. I’ve followed the SEPTun guidance on symmetric hashing, but when I do:

for proto in tcp4 udp4 tcp6 udp6; do
    /opt/mellanox/ethtool/sbin/ethtool -N eno12399 rx-flow-hash $proto sdfn
done

ethtool -n eno12399
63 RX rings available
Total 0 rules

does that 0 rules mean the rx-flow-hash isn’t being applied?
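For what it’s worth, ethtool -n with no further arguments lists ntuple/flow-classification rules, which are a separate feature from the hash configuration, so "Total 0 rules" looks expected here; the hash fields themselves can be queried per protocol, for example:

# show which header fields feed the RSS hash for TCP over IPv4
/opt/mellanox/ethtool/sbin/ethtool -n eno12399 rx-flow-hash tcp4
# with sdfn applied, the output should list IP SA, IP DA and the L4 source/destination port bytes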

Any help is greatly appreciated.

suricata-filtered.log (7.0 KB)
stats-filtered.log (70.8 KB)

160-223: are these physical cores or a mix of physical and hyperthreaded cores?

I’ve found that performance improves if the entire core (the physical core and its hyperthread sibling, if HT is enabled) is dedicated to Suricata.

Additionally, note that the AMD processor layout includes “chiplets” or “core complexes” in groups of 8 cores (shared L3, private L1/L2).
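Those boundaries can be read from the cache topology; this assumes index3 corresponds to the L3 cache on this platform, and cpu8 is just an example.

# list the CPUs that share an L3 slice with cpu8 (i.e. one core complex)
cat /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list
# or dump core/node/cache topology for every CPU
lscpu --extended=CPU,CORE,SOCKET,NODE,CACHE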

160-223 is a mix of physical and HT cores, as the box has 128 physical cores with HT enabled. Currently there’s just NUMA node 0 with all 256 logical cores in it. Would it still improve performance if I set them in groups of 8 like you suggest, even though they’re all in one NUMA node?

I’d suggest either not using hyperthreads or reserving both the physical cores and their hyperthread siblings for Suricata.

Remember, NUMA awareness is mostly about memory.
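If the siblings are kept together, the pairing can be read straight from sysfs; cpu8 below is only an example.

# shows the logical CPUs sharing one physical core, e.g. "8,136" on a 128-core part with HT enabled
cat /sys/devices/system/cpu/cpu8/topology/thread_siblings_list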

Definitely made some progress; I ended up doing:

ifconfig eno12399 down
/opt/mellanox/ethtool/sbin/ethtool -L eno12399 combined 15
/opt/mellanox/ethtool/sbin/ethtool -K eno12399 rxhash on
/opt/mellanox/ethtool/sbin/ethtool -K eno12399 ntuple on
ifconfig eno12399 up
/sbin/set_irq_affinity_cpulist.sh 1-7,64-71 eno12399
/opt/mellanox/ethtool/sbin/ethtool -X eno12399 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 15
/opt/mellanox/ethtool/sbin/ethtool -C eno12399 adaptive-rx off adaptive-tx off rx-usecs 125
/opt/mellanox/ethtool/sbin/ethtool -G eno12399 rx 1024
for i in rx tx tso ufo gso gro lro tx-nocache-copy sg txvlan rxvlan; do
    # turn off NIC offloads so packets reach the capture un-coalesced
    /opt/mellanox/ethtool/sbin/ethtool -K eno12399 $i off >/dev/null 2>&1;
done
for proto in tcp4 udp4 tcp6 udp6; do
    /opt/mellanox/ethtool/sbin/ethtool -N eno12399 rx-flow-hash $proto sdfn
done

With this suricata.yaml configs:

af-packet:
  - interface: eno12399
    threads: 48
    cluster-id: 99
    cluster-type: cluster_flow
    defrag: yes
    use-mmap: yes
    mmap-locked: yes
    tpacket-v3: yes
    ring-size: 10000
    block-size: 1048576
threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ "120-127" ]
    - receive-cpu-set:
        cpu: [ 0 ]
    - worker-cpu-set:
        cpu: [ "8-55" ]
        mode: "exclusive"
        prio:
          high: [ "8-55" ]
          default: "high"

Mostly stolen from the AMD section of this 11.5. High Performance Configuration — Suricata 8.0.0-dev documentation.

This has resulted in around a 0.10% kernel drop rate (a huge improvement over the 50% I saw when I first started).
This is after about 10 minutes of 8.5Gbps traffic (a mix of mostly HTTP/SMB/FTP).

capture.kernel_packets                        | Total                     | 543695953
capture.kernel_drops                          | Total                     | 548207
flow.memcap                                   | Total                     | 0
tcp.ssn_memcap_drop                           | Total                     | 0
tcp.pkt_on_wrong_thread                       | Total                     | 152859
tcp.segment_memcap_drop                       | Total                     | 79071
memcap_pressure                               | Total                     | 99
memcap_pressure_max                           | Total                     | 99
tcp.memuse                                    | Total                     | 5328000128
tcp.reassembly_memuse                         | Total                     | 51539611520
http.memuse                                   | Total                     | 52375979554
http.memcap                                   | Total                     | 0
ftp.memuse                                    | Total                     | 8578906
ftp.memcap                                    | Total                     | 0
flow.memuse                                   | Total                     | 545449840

Also, it turns out I wasn’t loading tcmalloc anymore, so I have that loading again and I’m now under 0.02% drops across the board. Thanks for everyone’s assistance!
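In case it helps anyone else, one way to load tcmalloc for a systemd-managed RPM install is an LD_PRELOAD drop-in; the library path is the usual gperftools-libs location on EL8 and the unit name is assumed to be suricata, so both may differ.

# create a drop-in that preloads tcmalloc for the Suricata service
mkdir -p /etc/systemd/system/suricata.service.d
printf '[Service]\nEnvironment=LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so.4\n' \
    > /etc/systemd/system/suricata.service.d/tcmalloc.conf
systemctl daemon-reload && systemctl restart suricata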