What is the bottleneck while using workers mode with Hyperscan?

Dear Suricata Team Members,

I am testing the performance of Suricata with Hyperscan.

I am using Suricata 6.0.12 in workers mode, started with the command
suricata -c suricata.yaml -i ens801f1 -l ./log/log_hs_hs.
I use tcpreplay to test the maximum throughput of Suricata with different thread counts. tcpreplay replays a pcap file at 1000 Mbps to the NIC ens801f1.
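
For reference, a typical tcpreplay invocation for this kind of test might look like the following (a sketch; sample.pcap stands in for the actual capture file):

tcpreplay -i ens801f1 --mbps=1000 sample.pcap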

What I found is that there are many capture.kernel_drops in my stats.log, like this:

Date: 6/5/2023 -- 05:30:36 (uptime: 0d, 00h 01m 37s)
------------------------------------------------------------------------------------
Counter                                       | TM Name                   | Value
------------------------------------------------------------------------------------
capture.kernel_packets                        | Total                     | 8654107
capture.kernel_drops                          | Total                     | 4998424
decoder.pkts                                  | Total                     | 3658745
decoder.bytes                                 | Total                     | 3756217659
decoder.ipv4                                  | Total                     | 3650943
decoder.ipv6                                  | Total                     | 414
decoder.ethernet                              | Total                     | 3658745
decoder.tcp                                   | Total                     | 3571410
decoder.udp                                   | Total                     | 77992
decoder.icmpv4                                | Total                     | 1317
decoder.icmpv6                                | Total                     | 272
decoder.avg_pkt_size                          | Total                     | 1026
decoder.max_pkt_size                          | Total                     | 1392
flow.tcp                                      | Total                     | 4260
flow.udp                                      | Total                     | 1525
flow.icmpv4                                   | Total                     | 40
flow.icmpv6                                   | Total                     | 4
flow.tcp_reuse                                | Total                     | 891
flow.wrk.spare_sync_avg                       | Total                     | 100
flow.wrk.spare_sync                           | Total                     | 53
decoder.event.ipv4.opt_pad_required           | Total                     | 366
decoder.event.ipv6.zero_len_padn              | Total                     | 254
flow.wrk.flows_evicted                        | Total                     | 779
tcp.sessions                                  | Total                     | 3022
tcp.syn                                       | Total                     | 17689
tcp.synack                                    | Total                     | 16689
tcp.rst                                       | Total                     | 10702
tcp.pkt_on_wrong_thread                       | Total                     | 1074799
tcp.stream_depth_reached                      | Total                     | 47
tcp.reassembly_gap                            | Total                     | 10365
tcp.overlap                                   | Total                     | 38741
detect.alert                                  | Total                     | 28422
detect.alerts_suppressed                      | Total                     | 40250
app_layer.flow.http                           | Total                     | 687
app_layer.tx.http                             | Total                     | 1463
app_layer.flow.tls                            | Total                     | 725
app_layer.flow.mqtt                           | Total                     | 1
app_layer.tx.mqtt                             | Total                     | 5
app_layer.flow.failed_tcp                     | Total                     | 241
app_layer.flow.dns_udp                        | Total                     | 930
app_layer.tx.dns_udp                          | Total                     | 10636
app_layer.flow.failed_udp                     | Total                     | 595
flow.mgr.full_hash_pass                       | Total                     | 1
flow.spare                                    | Total                     | 9313
flow.mgr.rows_maxlen                          | Total                     | 3
flow.mgr.flows_checked                        | Total                     | 550
flow.mgr.flows_notimeout                      | Total                     | 550
flow.mgr.flows_evicted                        | Total                     | 13
tcp.memuse                                    | Total                     | 2437320
tcp.reassembly_memuse                         | Total                     | 75298712
http.memuse                                   | Total                     | 3810295
flow.memuse                                   | Total                     | 8866304
------------------------------------------------------------------------------------

And here is my cpu-affinity in my suricata.yaml:

threading:
  set-cpu-affinity: yes
  # Tune cpu affinity of threads. Each family of threads can be bound
  # to specific CPUs.
  #
  # These 2 apply to all runmodes:
  # management-cpu-set is used for flow timeout handling, counters
  # worker-cpu-set is used for 'worker' threads
  #
  # Additionally, for autofp these apply:
  # receive-cpu-set is used for capture threads
  # verdict-cpu-set is used for IPS verdict threads
  #
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 44-47 ]  # include only these CPUs in affinity settings
    - receive-cpu-set:
        cpu: [ 48-63 ]  # include only these CPUs in affinity settings
    - worker-cpu-set:
        cpu: [ 26-41 ]
        mode: "exclusive"
        # Use explicitly 4 threads and don't compute the number from the
        # detect-thread-ratio variable:
        threads: 4
        prio:
          # low: [ 0 ]
          # medium: [ "1-2" ]
          # high: [ 3 ]
          default: "high"

In my test, I used htop to watch CPU utilization, and none of cores 26-29 was fully used.
There should not be this many dropped packets. My NIC supports up to 40 Gbps, and there are still 2% drops when I decrease the traffic to 200 Mbps, which should be far below what 4 CPU cores can handle. I also tried autofp mode with AF_PACKET and got the same result: many dropped packets. I know packets will be dropped when the traffic is too heavy, but I am seeing a high drop rate with traffic clearly below what the cores should be able to handle.
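
(As a quick sanity check alongside htop, whether the NIC and the pinned worker cores share a NUMA node can be read from sysfs and compared against the NUMA node CPU lists from lscpu:)

cat /sys/class/net/ens801f1/device/numa_node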

I wonder why so many packets are dropped and where the bottleneck is. How can I configure Suricata to reduce the rate of capture.kernel_drops?

Does the decoder process packets before Hyperscan does? If so, was I effectively just testing the performance of the decoder?

Thanks!

Can you also post:

suricata --build-info

Also the suricata.log, and which NIC are you using?

Another idea would be to run perf top -p $(pidof suricata) while the traffic is being replayed, to see if there is another potential bottleneck.

It also depends on the traffic being replayed; besides the high drop rate, tcp.reassembly_gap and tcp.pkt_on_wrong_thread are quite high as well.

One recommendation would be to set threads in the worker-cpu-set to the same number as in your af-packet section (you configured 16 there).
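
A minimal sketch of that change, keeping the pinning from above and assuming 16 af-packet threads:

cpu-affinity:
  - worker-cpu-set:
      cpu: [ 26-41 ]
      mode: "exclusive"
      threads: 16  # match the af-packet thread count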

You're right to debug the issue at the lower traffic rate first; 2% drops shouldn't happen with enough cores. What CPU is used? A normal CPU should achieve at least 100 Mbit/s per core.

You can also do test runs without any rules.
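
One way to do that is to load an empty rule file exclusively with -S (a sketch):

suricata -c suricata.yaml -i ens801f1 -S /dev/null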

Thanks for your timely reply!

  1. Information of suricata --build-info:

    This is Suricata version 6.0.12 RELEASE
    Features: PCAP_SET_BUFF AF_PACKET HAVE_PACKET_FANOUT LIBCAP_NG LIBNET1.1 HAVE_HTP_URI_NORMALIZE_HOOK PCRE_JIT HAVE_LIBJANSSON TLS TLS_C11 MAGIC RUST 
    SIMD support: SSE_4_2 SSE_4_1 SSE_3 
    Atomic intrinsics: 1 2 4 8 16 byte(s)
    64-bits, Little-endian architecture
    GCC version 7.5.0, C version 201112
    compiled with _FORTIFY_SOURCE=2
    L1 cache line size (CLS)=64
    thread local storage method: _Thread_local
    compiled with LibHTP v0.5.43, linked against LibHTP v0.5.43
    
    Suricata Configuration:
      AF_PACKET support:                       yes
      eBPF support:                            no
      XDP support:                             no
      PF_RING support:                         no
      NFQueue support:                         no
      NFLOG support:                           no
      IPFW support:                            no
      Netmap support:                          no  using new api: no
      DAG enabled:                             no
      Napatech enabled:                        no
      WinDivert enabled:                       no
    
      Unix socket enabled:                     yes
      Detection enabled:                       yes
    
      Libmagic support:                        yes
      libnss support:                          no
      libnspr support:                         no
      libjansson support:                      yes
      hiredis support:                         no
      hiredis async with libevent:             no
      Prelude support:                         no
      PCRE jit:                                yes
      LUA support:                             no
      libluajit:                               no
      GeoIP2 support:                          no
      Non-bundled htp:                         no
      Hyperscan support:                       yes
      Libnet support:                          yes
      liblz4 support:                          no
      HTTP2 decompression:                     no
    
      Rust support:                            yes
      Rust strict mode:                        no
      Rust compiler path:                      /usr/bin/rustc
      Rust compiler version:                   rustc 1.65.0
      Cargo path:                              /usr/bin/cargo
      Cargo version:                           cargo 1.65.0
      Cargo vendor:                            yes
    
      Python support:                          yes
      Python path:                             /usr/bin/python3
      Install suricatactl:                     yes
      Install suricatasc:                      yes
      Install suricata-update:                 yes
    
      Profiling enabled:                       no
      Profiling locks enabled:                 no
    
      Plugin support (experimental):           yes
    
    Development settings:
      Coccinelle / spatch:                     no
      Unit tests enabled:                      no
      Debug output enabled:                    no
      Debug validation enabled:                no
    
    Generic build parameters:
      Installation prefix:                     /root/tnb/suricata/usr
      Configuration directory:                 /root/tnb/suricata/etc/suricata/
      Log directory:                           /root/tnb/suricata/var/log/suricata/
    
      --prefix                                 /root/tnb/suricata/usr
      --sysconfdir                             /root/tnb/suricata/etc
      --localstatedir                          /root/tnb/suricata/var
      --datarootdir                            /root/tnb/suricata/usr/share
    
      Host:                                    x86_64-pc-linux-gnu
      Compiler:                                gcc (exec name) / g++ (real)
      GCC Protect enabled:                     no
      GCC march native enabled:                yes
      GCC Profile enabled:                     no
      Position Independent Executable enabled: no
      CFLAGS                                   -g -O2 -std=c11 -march=native -I${srcdir}/../rust/gen -I${srcdir}/../rust/dist
      PCAP_CFLAGS                               -I/usr/include
      SECCFLAGS                                
    
  2. Here is the information of my NIC:

    ~/tnb/suricata/usr/bin > ethtool ens801f1                                                              
    Settings for ens801f1:
            Supported ports: [ FIBRE ]
            Supported link modes:   40000baseCR4/Full 
            Supported pause frame use: Symmetric Receive-only
            Supports auto-negotiation: Yes
            Supported FEC modes: Not reported
            Advertised link modes:  40000baseCR4/Full 
            Advertised pause frame use: No
            Advertised auto-negotiation: Yes
            Advertised FEC modes: Not reported
            Speed: 40000Mb/s
            Duplex: Full
            Port: Direct Attach Copper
            PHYAD: 0
            Transceiver: internal
            Auto-negotiation: off
            Supports Wake-on: d
            Wake-on: d
            Current message level: 0x00000007 (7)
                                   drv probe link
            Link detected: yes
    ~/tnb/suricata/usr/bin > ethtool -i ens801f1               
    driver: i40e
    version: 2.1.14-k
    firmware-version: 8.30 0x8000a4ae 1.2926.0
    expansion-rom-version: 
    bus-info: 0000:82:00.1
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: yes
    supports-register-dump: yes
    supports-priv-flags: yes
    
  3. Here is my CPU:

    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              88
    On-line CPU(s) list: 0-87
    Thread(s) per core:  2
    Core(s) per socket:  22
    Socket(s):           2
    NUMA node(s):        2
    Vendor ID:           GenuineIntel
    CPU family:          6
    Model:               79
    Model name:          Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
    Stepping:            1
    CPU MHz:             1508.485
    CPU max MHz:         2200.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            4389.86
    Virtualization:      VT-x
    L1d cache:           32K
    L1i cache:           32K
    L2 cache:            256K
    L3 cache:            56320K
    NUMA node0 CPU(s):   0-21,44-65
    NUMA node1 CPU(s):   22-43,66-87
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts md_clear flush_l1d
    
  4. Here is the suricata.log when using 4 cores (26-29) and 500 Mbps of traffic in workers mode. There are still about 30% drops.

6/6/2023 -- 03:24:08 - <Notice> - This is Suricata version 6.0.12 RELEASE running in SYSTEM mode
6/6/2023 -- 03:24:08 - <Info> - CPUs/cores online: 88
6/6/2023 -- 03:24:08 - <Info> - Setting engine mode to IDS mode by default
6/6/2023 -- 03:24:08 - <Info> - Found an MTU of 1500 for 'ens801f1'
6/6/2023 -- 03:24:08 - <Info> - Found an MTU of 1500 for 'ens801f1'
6/6/2023 -- 03:24:08 - <Info> - fast output device (regular) initialized: fast.log
6/6/2023 -- 03:24:08 - <Info> - eve-log output device (regular) initialized: eve.json
6/6/2023 -- 03:24:08 - <Info> - stats output device (regular) initialized: stats.log
6/6/2023 -- 03:24:08 - <Info> - Running in live mode, activating unix socket
6/6/2023 -- 03:24:18 - <Info> - 1 rule files processed. 33804 rules successfully loaded, 0 rules failed
6/6/2023 -- 03:24:18 - <Info> - Threshold config parsed: 0 rule(s) found
6/6/2023 -- 03:24:19 - <Info> - 33807 signatures processed. 1247 are IP-only rules, 5200 are inspecting packet payload, 27153 inspect application layer, 108 are decoder event only
6/6/2023 -- 03:24:32 - <Info> - Going to use 4 thread(s)
6/6/2023 -- 03:24:32 - <Info> - Running in live mode, activating unix socket
6/6/2023 -- 03:24:32 - <Info> - Using unix socket file '/root/tnb/suricata/var/run/suricata/suricata-command.socket'
6/6/2023 -- 03:24:32 - <Notice> - all 4 packet processing threads, 4 management threads initialized, engine started.
6/6/2023 -- 03:24:32 - <Info> - All AFP capture threads are running.
6/6/2023 -- 03:26:52 - <Notice> - Signal Received.  Stopping engine.
6/6/2023 -- 03:26:53 - <Info> - time elapsed 140.528s
6/6/2023 -- 03:26:54 - <Info> - Alerts: 35515
6/6/2023 -- 03:26:54 - <Info> - cleaning up signature grouping structure... complete
6/6/2023 -- 03:26:54 - <Notice> - Stats for 'ens801f1':  pkts: 6785095, drop: 2064384 (30.43%), invalid chksum: 0
  5. perf output of the above test: [attached in the original post, not reproduced here]

  6. I also tried autofp mode, setting threads in the worker-cpu-set to the same number as in my af-packet section (4), and tested with 500 Mbps of live traffic. I still got 4.34% drops.

  7. Besides, can you tell me when the decoder runs? Before Hyperscan or after it?

Thanks for your insightful comments!

Hyperscan is used as a multi-pattern matcher; it determines which rule(s) should be checked based on the packet contents.
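
For context, the multi-pattern matcher implementation is selected with mpm-algo in suricata.yaml; with Hyperscan support compiled in, it can be chosen explicitly (a sketch):

mpm-algo: hs  # multi-pattern matcher
spm-algo: hs  # single-pattern matcher (optional)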

Something to try: adjust max-pending-packets in suricata.yaml to something like 30k (30000) and ring-size (in the af-packet section) to 20000.
Also comment out buffer-size (in the af-packet section).
Then restart and see if there is any improvement.
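
A sketch of those changes in suricata.yaml (interface name taken from this thread; all other keys unchanged):

max-pending-packets: 30000

af-packet:
  - interface: ens801f1
    ring-size: 20000
    # buffer-size: 32768  # commented out, as suggested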

Thanks for your timely reply, which helps a lot.

I also still have some questions; I hope you can answer them.

  1. How many instances of the decoder are there? Is it the same as the number of threads I configured in the af-packet section?
  2. Is it proper to use the rate of decoder.bytes to evaluate the performance of Suricata with Hyperscan?
  3. How do you treat the capture.kernel_drops packets when benchmarking, and how do you factor them in?

Looking forward to your reply.

Thanks!

  1. There will be as many worker threads as the threads setting specifies; each retrieves a packet and processes it completely.
  2. Another way to measure Suricata performance is to take the rate at which packets arrive at the ingress NIC(s) and subtract the packets dropped. That gives the PPS; the bits/second can be derived from the same information (see the worked example after this list).
  3. I'm not sure what this question is asking.
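
As a worked example of point 2, using the stats.log excerpt above: capture.kernel_packets = 8654107 and capture.kernel_drops = 4998424 over 97 seconds of uptime give a drop rate of 4998424 / 8654107 ≈ 57.8% and a delivered rate of (8654107 − 4998424) / 97 ≈ 37,700 packets per second; with decoder.bytes = 3756217659, that is about 3756217659 × 8 / 97 ≈ 310 Mbit/s actually processed.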

Hi Jeff,

Thanks for your kind reply!

  1. So you mean the number of decoders is exactly the same as the number of detectors, and both the decoder and the detector run in the same worker thread?
    I saw that when no packets are dropped, the rate of decoder.bytes is twice the rate sent with tcpreplay, while it equals the sending rate when using real live traffic.
  2. I also wonder whether it is proper to use the rate of decoder.bytes to evaluate the performance of Suricata with Hyperscan. Does decoder.bytes cause any trouble when testing performance?
  3. Sorry, I didn't make it clear. I am wondering whether you take packet drops into account when doing benchmarks, and how you use the drop rate as an indicator.

Thanks!

Please see the diagrams in 9.1. Runmodes — Suricata 7.0.0-rc2-dev documentation

Note that the workers runmode is recommended for better performance since a single thread is responsible for packet acquisition and processes the packet completely, including sending an alert (if one or more rules match).
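
For completeness, the runmode can be set in suricata.yaml (runmode: workers) or on the command line:

suricata -c suricata.yaml -i ens801f1 --runmode=workers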

Are you calculating a rate by sampling decoder.bytes? Note that decoder.bytes represents the total number of bytes seen by Suricata; it is not a byte rate.
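
If the goal is a rate, a rough approach (a sketch, assuming the default 8-second stats.log interval) is to difference the last two decoder.bytes samples:

awk '/decoder\.bytes/ { prev = cur; cur = $NF }
     END { printf "approx %.1f Mbit/s\n", (cur - prev) * 8 / 8 / 1e6 }' stats.log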

How you measure performance is up to you. Generally, it includes the ingress bit and packet counts and the same values from Suricata. Usually, packets dropped by the NIC are factored in.
