I am testing the performance of Suricata with Hyperscan. I am using Suricata 6.0.12 in workers mode, started with the command suricata -c suricata.yaml -i ens801f1 -l ./log/log_hs_hs.
I use Tcpreplay to test the maximum throughput of Suricata with different numbers of threads. Tcpreplay replays a pcap file at 1000 Mbps to the NIC ens801f1.
What I found is that there are many capture.kernel_drops in my stats.log, like this:
Date: 6/5/2023 -- 05:30:36 (uptime: 0d, 00h 01m 37s)
------------------------------------------------------------------------------------
Counter | TM Name | Value
------------------------------------------------------------------------------------
capture.kernel_packets | Total | 8654107
capture.kernel_drops | Total | 4998424
decoder.pkts | Total | 3658745
decoder.bytes | Total | 3756217659
decoder.ipv4 | Total | 3650943
decoder.ipv6 | Total | 414
decoder.ethernet | Total | 3658745
decoder.tcp | Total | 3571410
decoder.udp | Total | 77992
decoder.icmpv4 | Total | 1317
decoder.icmpv6 | Total | 272
decoder.avg_pkt_size | Total | 1026
decoder.max_pkt_size | Total | 1392
flow.tcp | Total | 4260
flow.udp | Total | 1525
flow.icmpv4 | Total | 40
flow.icmpv6 | Total | 4
flow.tcp_reuse | Total | 891
flow.wrk.spare_sync_avg | Total | 100
flow.wrk.spare_sync | Total | 53
decoder.event.ipv4.opt_pad_required | Total | 366
decoder.event.ipv6.zero_len_padn | Total | 254
flow.wrk.flows_evicted | Total | 779
tcp.sessions | Total | 3022
tcp.syn | Total | 17689
tcp.synack | Total | 16689
tcp.rst | Total | 10702
tcp.pkt_on_wrong_thread | Total | 1074799
tcp.stream_depth_reached | Total | 47
tcp.reassembly_gap | Total | 10365
tcp.overlap | Total | 38741
detect.alert | Total | 28422
detect.alerts_suppressed | Total | 40250
app_layer.flow.http | Total | 687
app_layer.tx.http | Total | 1463
app_layer.flow.tls | Total | 725
app_layer.flow.mqtt | Total | 1
app_layer.tx.mqtt | Total | 5
app_layer.flow.failed_tcp | Total | 241
app_layer.flow.dns_udp | Total | 930
app_layer.tx.dns_udp | Total | 10636
app_layer.flow.failed_udp | Total | 595
flow.mgr.full_hash_pass | Total | 1
flow.spare | Total | 9313
flow.mgr.rows_maxlen | Total | 3
flow.mgr.flows_checked | Total | 550
flow.mgr.flows_notimeout | Total | 550
flow.mgr.flows_evicted | Total | 13
tcp.memuse | Total | 2437320
tcp.reassembly_memuse | Total | 75298712
http.memuse | Total | 3810295
flow.memuse | Total | 8866304
------------------------------------------------------------------------------------
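(From this snapshot, the drop rate works out to capture.kernel_drops / capture.kernel_packets = 4998424 / 8654107 ≈ 58%.)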
And here is the cpu-affinity section of my suricata.yaml:
threading:
  set-cpu-affinity: yes
  # Tune cpu affinity of threads. Each family of threads can be bound
  # to specific CPUs.
  #
  # These 2 apply to the all runmodes:
  # management-cpu-set is used for flow timeout handling, counters
  # worker-cpu-set is used for 'worker' threads
  #
  # Additionally, for autofp these apply:
  # receive-cpu-set is used for capture threads
  # verdict-cpu-set is used for IPS verdict threads
  #
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 44-47 ]  # include only these CPUs in affinity settings
    - receive-cpu-set:
        cpu: [ 48-63 ]  # include only these CPUs in affinity settings
    - worker-cpu-set:
        cpu: [ 26-41 ]
        mode: "exclusive"
        # Use explicitly 3 threads and don't compute number by using
        # detect-thread-ratio variable:
        threads: 4
        prio:
          # low: [ 0 ]
          # medium: [ "1-2" ]
          # high: [ 3 ]
          default: "high"
In my test I use htop to watch the utilization of the CPU cores, and none of cores 26-29 is fully used.
There should not be so many dropped packets. My NIC supports up to 40 Gbps, and there are still 2% drops when I decrease the traffic to 200 Mbps, which should be far below what 4 CPU cores can handle. I also tried autofp mode with AF_PACKET and got the same result: many dropped packets. I know packets will be dropped when the traffic is too heavy, but I am getting a high drop rate with traffic that is clearly below what the cores can handle.
I wonder why so many packets are dropped and where the bottleneck is. How can I configure Suricata to reduce the rate of capture.kernel_drops?
Does the decoder process packets before Hyperscan? If so, was I only testing the performance of the decoder?
Another idea would be to run perf top -p $(pidof suricata) while the traffic is being fed in, to see if there is another potential bottleneck.
Also, depending on the traffic being replayed: besides the high drop rate, tcp.reassembly_gap and tcp.pkt_on_wrong_thread are quite high as well.
One recommendation would be to set threads in worker-cpu-set to the same value as in your af-packet section (you configured 16).
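A rough sketch of that change, assuming your af-packet section really does use 16 threads (only the worker-cpu-set entry is shown):

  cpu-affinity:
    - worker-cpu-set:
        cpu: [ 26-41 ]
        mode: "exclusive"
        threads: 16  # match the 'threads' value in the af-packet section
        prio:
          default: "high"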
You're correct to first debug the issue at the lower traffic rate; 2% drops shouldn't happen with enough cores. What CPU is used? A normal CPU should achieve at least 100 Mbit/s per core.
I also tried autofp mode, setting threads in worker-cpu-set to the same value as in my af-packet section (4). I used 500 Mbps of live traffic for the test and still got 4.34% drops.
Besides, can you tell me when the decoder runs? Before Hyperscan or after it?
Something to try: I would suggest adjusting max-pending-packets in suricata.yaml to something like 30k (30000) and ring-size (in the af-packet section) to 20000.
Also comment out buffer-size (in the af-packet section).
Then restart and see if there is any improvement.
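A rough sketch of those adjustments, assuming an af-packet entry for ens801f1 with the 16 threads mentioned above (cluster-id, cluster-type and defrag are just the usual defaults):

  max-pending-packets: 30000

  af-packet:
    - interface: ens801f1
      threads: 16
      cluster-id: 99
      cluster-type: cluster_flow
      defrag: yes
      ring-size: 20000
      # buffer-size: 32768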
There will be as many worker threads as the threads setting, and each worker thread retrieves a packet and fully processes it.
Another way to measure Suricata performance is to use the rate at which packets arrive at the ingress NIC(s) and subtract the packets dropped. That’ll give the PPS; the bits/second can be derived from the same information.
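For example, using the counters from the stats.log snapshot above: (capture.kernel_packets - capture.kernel_drops) / uptime = (8654107 - 4998424) / 97 s ≈ 37,700 packets/s actually processed; with decoder.avg_pkt_size of 1026 bytes, that is roughly 37,700 × 1026 × 8 ≈ 309 Mbit/s.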
So you mean the number of decoders is exactly the same as the number of detectors, i.e. both the decoder and the detector run in the same worker thread?
I saw that when no packets are being dropped, the rate of decoder.bytes is twice the rate sent with Tcpreplay, while the rate of decoder.bytes matches the sending rate when I use real live traffic.
I also wonder whether it is proper to use the rate of decoder.bytes to evaluate the performance of Suricata with Hyperscan. Does decoder.bytes cause any trouble when testing performance?
Sorry, I didn't make it clear. I am wondering whether you take packet drops into account when doing your benchmarks, and how you use packet drops as an indicator.
Note that the workers runmode is recommended for better performance, since a single thread is responsible for packet acquisition and processes the packet completely, including sending an alert (if one or more rules match).
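If it is not already set, a minimal sketch of selecting that runmode in suricata.yaml (it can also be chosen on the command line):

  runmode: workers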
Are you calculating a rate by sampling decoder.bytes? Note that decoder.bytes represents the number of bytes seen by Suricata and is not a byte rate.
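If you want a rate from it, one way (a sketch, assuming you sample the counter at two consecutive stats.log dumps) is: bytes/s ≈ (decoder.bytes at t2 - decoder.bytes at t1) / (t2 - t1), then multiply by 8 for bits/s.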
How you measure performance is up to you. Generally, it includes the ingress traffic bit and packet counts and the same values from Suricata. Usually, packets dropped by the NIC are factored in.