No performance improvement from 4C4T to 8C8T with Hyperscan

Dear developers,

I am trying to test core scaling of Suricata 6.0.10 with Hyperscan using some pcap files. I measured the total seconds taken with different numbers of cores and threads. I ran this command with a specific configuration file:

./bench_install_root/usr/bin/suricata -c suricata_bench/suricata.yaml -r ./pcap_files -l ./log_std/log_hs_hs

I got output like this with 8C16T:

5/5/2023 -- 15:26:15 - <Notice> - This is Suricata version 6.0.10 RELEASE running in USER mode
5/5/2023 -- 15:26:16 - <Error> - [ERRCODE: SC_WARN_JA3_DISABLED(309)] - ja3 support is not enabled
5/5/2023 -- 15:26:16 - <Error> - [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - error parsing signature "alert tls $HOME_NET any -> $EXTERNAL_NET any (msg:"ET JA3 Hash - Suspected Cobalt Strike Malleable C2 M1 (set)"; flow:established,to_server; ja3.hash; content:"eb88d0b3e1961a0562f006e5ce2a0b87"; ja3.string; content:"771,49192-49191-49172-49171"; flowbits:set,ET.cobaltstrike.ja3; flowbits:noalert; classtype:command-and-control; sid:2028831; rev:1; metadata:affected_product Windows_XP_Vista_7_8_10_Server_32_64_Bit, attack_target Client_Endpoint, created_at 2019_10_15, deployment Perimeter, former_category JA3, malware_family Cobalt_Strike, signature_severity Major, updated_at 2019_10_15, mitre_tactic_id TA0011, mitre_tactic_name Command_And_Control, mitre_technique_id T1001, mitre_technique_name Data_Obfuscation;)" from file /home/xuhao/suricata_bench/bench_install_root/var/lib/suricata/rules/emerging-all.rules at line 27115
5/5/2023 -- 15:26:19 - <Error> - [ERRCODE: SC_WARN_JA3_DISABLED(309)] - ja3(s) support is not enabled
5/5/2023 -- 15:26:19 - <Error> - [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - error parsing signature "alert tls $EXTERNAL_NET any -> $HOME_NET any (msg:"ET JA3 HASH - Possible RustyBuer Server Response"; flowbits:isset,ET.rustybuer; ja3s.hash; content:"f6dfdd25d1522e4e1c7cd09bd37ce619"; reference:md5,ea98a9d6ca6f5b2a0820303a1d327593; classtype:bad-unknown; sid:2032960; rev:1; metadata:attack_target Client_Endpoint, created_at 2021_05_13, deployment Perimeter, former_category JA3, malware_family RustyBuer, performance_impact Low, signature_severity Major, updated_at 2021_05_13;)" from file /home/xuhao/suricata_bench/bench_install_root/var/lib/suricata/rules/emerging-all.rules at line 60175
5/5/2023 -- 15:26:28 - <Notice> - all 17 packet processing threads, 4 management threads initialized, engine started.
5/5/2023 -- 15:26:51 - <Notice> - Signal Received.  Stopping engine.
5/5/2023 -- 15:26:52 - <Notice> - Pcap-file module read 3 files, 8383530 packets, 4178235146 bytes

I used the time difference between line 8 and line 1 of this output as the performance metric for each test. The time decreased as I went from the 1C1T configuration up to 4C4T. However, when I allocate more cores and threads, the performance does not improve any further.
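
For reference, this is a minimal sketch of how the elapsed time can be computed from suricata.log (assuming the default file logging is enabled in the directory passed with -l, the timestamp format shown above, and that the run does not cross midnight):

#!/bin/sh
# Sketch: seconds between the first <Notice> line and the
# "Pcap-file module read" summary line.
log=./log_std/log_hs_hs/suricata.log

to_seconds() {
    # "5/5/2023 -- 15:26:15 - <Notice> - ..." -> seconds since midnight
    echo "$1" | awk '{ split($3, t, ":"); print t[1]*3600 + t[2]*60 + t[3] }'
}

start=$(to_seconds "$(grep -m1 'This is Suricata version' "$log")")
end=$(to_seconds "$(grep 'Pcap-file module read' "$log" | tail -n1)")
echo "elapsed: $((end - start)) s"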

I configured 4C4T in suricata.yaml:

threading:
  set-cpu-affinity: yes
  # Tune cpu affinity of threads. Each family of threads can be bound
  # to specific CPUs.
  #
  # These 2 apply to the all runmodes:
  # management-cpu-set is used for flow timeout handling, counters
  # worker-cpu-set is used for 'worker' threads
  #
  # Additionally, for autofp these apply:
  # receive-cpu-set is used for capture threads
  # verdict-cpu-set is used for IPS verdict threads
  #
  cpu-affinity:
    - management-cpu-set:
        cpu: ["44-55", "100-111"]
    - receive-cpu-set:
        cpu: ["36-43", "92-99"]
    - worker-cpu-set:
        cpu: ["28-31"]
        mode: "exclusive"
        # Use explicitly 3 threads and don't compute number by using
        # detect-thread-ratio variable:
        threads: 4
        prio:
          low: [ 0 ]
          medium: [ "1-2" ]
          high: ["28-35", "84-91"]
          default: "high"
    #- verdict-cpu-set:
    #    cpu: [ 0 ]
    #    prio:
    #      default: "high"

8C8T:

threading:
  set-cpu-affinity: yes
  # Tune cpu affinity of threads. Each family of threads can be bound
  # to specific CPUs.
  #
  # These 2 apply to the all runmodes:
  # management-cpu-set is used for flow timeout handling, counters
  # worker-cpu-set is used for 'worker' threads
  #
  # Additionally, for autofp these apply:
  # receive-cpu-set is used for capture threads
  # verdict-cpu-set is used for IPS verdict threads
  #
  cpu-affinity:
    - management-cpu-set:
        cpu: ["44-55", "100-111"]
    - receive-cpu-set:
        cpu: ["36-43", "92-99"]
    - worker-cpu-set:
        cpu: ["28-35"]
        mode: "exclusive"
        # Use explicitly 3 threads and don't compute number by using
        # detect-thread-ratio variable:
        threads: 8
        prio:
          low: [ 0 ]
          medium: [ "1-2" ]
          high: ["28-35", "84-91"]
          default: "high"
    #- verdict-cpu-set:
    #    cpu: [ 0 ]
    #    prio:
    #      default: "high"

The results showed that it took 31 s with 4C4T and 30 s with 8C8T. When I use more cores and threads, I get no performance increase.

Did I do something wrong when testing core scaling? If so, how should I test it correctly?

I also have another question: I configured 4C4T, but the output shows 5 packet processing threads.

5/5/2023 -- 16:20:38 - <Notice> - all 5 packet processing threads, 4 management threads initialized, engine started.

The output shows that there is always one more packet processing thread than I configured. Can you tell me why?

I would be very grateful if you could help me. Thanks!

suricata.yaml (73.3 KB)

The affinity settings are for the workers runmode, which is mostly used with AF_PACKET and other live capture modes. For the pcap-file runmode you only have the runmodes single and autofp: single is single-threaded and autofp is multi-threaded. This also explains the odd thread counts a bit. I think we could improve on that, either on the documentation side or in the actual code, especially since changing that value has an impact in autofp mode even though it is set for workers.
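
If you want to compare modes explicitly, the runmode can be forced on the command line (reusing the paths from your first post), or with the top-level runmode: setting in suricata.yaml:

./bench_install_root/usr/bin/suricata -c suricata_bench/suricata.yaml --runmode autofp -r ./pcap_files -l ./log_std/log_hs_hs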

To your initial question, I would argue that the pcap runs reach their peak rather soon. There is some overhead that is always present (in your 8C16T log above, roughly 13 of the 37 seconds pass between startup and the "engine started" notice, before any packets are processed), and a higher CPU count is mostly relevant for real live traffic capture.

Thanks, that helps a lot.

Now I’ve encountered a new problem. I am using pcap files to test the multi-threading performance of Suricata with Hyperscan.

I found that if I use runmode single, one core of my CPU can be fully utilized at 100%, but when I use autofp mode and configure multiple cores and threads like this:

threading:
  set-cpu-affinity: yes
  # Tune cpu affinity of threads. Each family of threads can be bound
  # to specific CPUs.
  #
  # These 2 apply to the all runmodes:
  # management-cpu-set is used for flow timeout handling, counters
  # worker-cpu-set is used for 'worker' threads
  #
  # Additionally, for autofp these apply:
  # receive-cpu-set is used for capture threads
  # verdict-cpu-set is used for IPS verdict threads
  #
  cpu-affinity:
    - management-cpu-set:
        cpu: [84-91]
    - receive-cpu-set:
        cpu: [92-99]
    - worker-cpu-set:
        cpu: [28-35]
        mode: "exclusive"
        # Use explicitly 3 threads and don't compute number by using
        # detect-thread-ratio variable:
        threads: 8
        prio:
          # low: [ 0 ]
          # medium: [ "1-2" ]
          # high: ["28-35", "84-91"]
          default: "high"
    #- verdict-cpu-set:
    #    cpu: [ 0 ]
    #    prio:
    #      default: "high"
  #

However, it turned out that these cores cannot be fully utilized. Why is this happening?
And if the receive thread is the bottleneck, how should I configure it?

Thanks!

You could run perf top -p $(pidof suricata) and see if there is something obvious that hints at what the limit is. But as mentioned above, this mode is not optimized for high performance when reading pcaps.
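
Since a pcap run finishes quickly, recording a short profile may be easier to inspect afterwards than watching perf top live; a minimal sketch, assuming perf is installed and Suricata is already running:

# record ~20 seconds of stack samples from the running Suricata process,
# then browse the report interactively
perf record -g -p $(pidof suricata) -- sleep 20
perf report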