Kernel Drops and Optimization Recommendations

Hi, any help or general optimization tips appreciated! I’m running Suricata 6.0.1, with the ET Pro ruleset, on a system with two NICs, each receiving about 250Mbps from a traffic load balancer. I am experiencing 2.6% kernel drops, but I expect this number to increase once more users return to the office. I have seen it as high as 10%. I have attached my suricata.yaml and the last stats run is below. The system is only running around a 2.0 load average, with plenty of CPU and memory to spare.

I let Suricata use the default AF_PACKET NIC settings. I invoke Suricata on the command line as follows:

suricata -i p1p1 -i p1p2 --user suricata --group suricata -F /etc/suricata/BPF.txt -D

32GB memory, NIC and CPU info:

2 X Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
2 X 14 core Xeon E5-2680 v4 @ 2.40GHz (total of 56 threads)

KiB Mem : 32545996 total, 22761596 free, 6744496 used, 3039904 buff/cache

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30367 suricata 20 0 9691876 2.3g 12092 S 156.4 7.5 806:34.10 Suricata-Main

1/2/2021 – 13:46:47 - - 40178 signatures processed. 1365 are IP-only rules, 8666 are inspecting packet payload, 30112 inspect application layer, 0 are decoder event only

1/2/2021 – 13:46:58 - - Going to use 56 thread(s)

1/2/2021 – 13:46:59 - - Going to use 56 thread(s)

1/2/2021 – 13:46:59 - - Running in live mode, activating unix socket

1/2/2021 – 13:46:59 - - Using unix socket file '/var/run/suricata/suricata-command.socket'

1/2/2021 – 13:46:59 - - all 112 packet processing threads, 4 management threads initialized, engine started.

Stats:

capture.kernel_packets | Total | 3598306967

capture.kernel_drops | Total | 96106690

decoder.pkts | Total | 3502295231

decoder.bytes | Total | 1981756209742

decoder.invalid | Total | 1076

decoder.ipv4 | Total | 3502299013

decoder.ipv6 | Total | 5643

decoder.ethernet | Total | 3502295231

decoder.tcp | Total | 2362354323

decoder.udp | Total | 1133809509

decoder.icmpv4 | Total | 6124635

decoder.icmpv6 | Total | 5643

decoder.gre | Total | 3828

decoder.vlan | Total | 3502295231

decoder.vxlan | Total | 1

decoder.avg_pkt_size | Total | 565

decoder.max_pkt_size | Total | 1524

flow.tcp | Total | 61446687

flow.udp | Total | 3020553

flow.icmpv4 | Total | 550676

flow.icmpv6 | Total | 1411

flow.tcp_reuse | Total | 16597

flow.get_used | Total | 112

flow.get_used_eval | Total | 136

flow.get_used_eval_reject | Total | 23

flow.wrk.spare_sync_avg | Total | 99

flow.wrk.spare_sync | Total | 613045

flow.wrk.spare_sync_incomplete | Total | 242

flow.wrk.spare_sync_empty | Total | 395

decoder.event.ipv4.trunc_pkt | Total | 1075

decoder.event.vxlan.unknown_payload_type | Total | 1

flow.wrk.flows_evicted_needs_work | Total | 1633589

flow.wrk.flows_evicted_pkt_inject | Total | 2424730

flow.wrk.flows_evicted | Total | 2206892

flow.wrk.flows_injected | Total | 1622563

tcp.sessions | Total | 42717064

tcp.ssn_memcap_drop | Total | 16216136

tcp.invalid_checksum | Total | 5

tcp.syn | Total | 63228791

tcp.synack | Total | 59142698

tcp.rst | Total | 53288440

tcp.pkt_on_wrong_thread | Total | 3

tcp.segment_memcap_drop | Total | 1166739

tcp.stream_depth_reached | Total | 80486

tcp.reassembly_gap | Total | 77411525

tcp.overlap | Total | 136420

tcp.insert_data_normal_fail | Total | 555164861

tcp.insert_data_overlap_fail | Total | 422

detect.alert | Total | 21

app_layer.flow.http | Total | 4375

app_layer.tx.http | Total | 13248

app_layer.flow.tls | Total | 86317

app_layer.flow.ssh | Total | 1

app_layer.flow.dns_tcp | Total | 52

app_layer.tx.dns_tcp | Total | 169

app_layer.flow.ntp | Total | 77119

app_layer.tx.ntp | Total | 97264

app_layer.flow.ikev2 | Total | 71

app_layer.tx.ikev2 | Total | 152

app_layer.flow.snmp | Total | 30336

app_layer.tx.snmp | Total | 67384

app_layer.flow.sip | Total | 144

app_layer.tx.sip | Total | 144

app_layer.flow.failed_tcp | Total | 1085

app_layer.flow.dcerpc_udp | Total | 3

app_layer.flow.dns_udp | Total | 2808739

app_layer.tx.dns_udp | Total | 5531676

app_layer.flow.failed_udp | Total | 104141

flow.mgr.full_hash_pass | Total | 3296

flow.spare | Total | 10789

flow.emerg_mode_entered | Total | 107

flow.emerg_mode_over | Total | 107

flow.mgr.rows_maxlen | Total | 19

flow.mgr.flows_checked | Total | 78588085

flow.mgr.flows_notimeout | Total | 57942295

flow.mgr.flows_timeout | Total | 20645790

flow.mgr.flows_evicted | Total | 62443666

flow.mgr.flows_evicted_needs_work | Total | 1622563

tcp.memuse | Total | 67108840

tcp.reassembly_memuse | Total | 268434068

http.memuse | Total | 56125

flow.memuse | Total | 127679104

suricata.yaml (70.9 KB)

You have TCP session and reassembly memcap drops (tcp.ssn_memcap_drop and tcp.segment_memcap_drop).
Try increasing the memcaps under the stream section of suricata.yaml until those counters stop growing in stats.log; a sketch is below.
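Something along these lines, for example (the values are only a starting point for a 32 GB box, not a recommendation; keep raising them until the drop counters stay flat):

stream:
  memcap: 2gb
  reassembly:
    memcap: 4gb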

I would uncomment the tpacket-v3: yes part of the af-packet config.
Try increasing the ring size and block size as well.
Try disabling checksum checking, either with suricata -k none on the command line or with checksum-checks: no per interface.
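Roughly what I have in mind for the af-packet section (interface names taken from your command line; the ring-size and block-size numbers are just values to experiment with, not tuned figures):

af-packet:
  - interface: p1p1
    cluster-id: 99
    cluster-type: cluster_flow
    defrag: yes
    use-mmap: yes
    tpacket-v3: yes
    ring-size: 200000
    block-size: 1048576
    checksum-checks: no
  - interface: p1p2
    cluster-id: 98
    cluster-type: cluster_flow
    defrag: yes
    use-mmap: yes
    tpacket-v3: yes
    ring-size: 200000
    block-size: 1048576
    checksum-checks: no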

Your box sounds like it should be overkill for 500Mbps. You might want to try taking some pcaps during high load to check for elephant flows.
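For the pcaps, a short rotating capture during a busy period is usually enough (the interface name and output path here are just examples):

tcpdump -i p1p1 -s 0 -G 60 -W 5 -w /tmp/p1p1-%Y%m%d-%H%M%S.pcap

Then sort by bytes per conversation, e.g. in Wireshark under Statistics > Conversations, and see whether a handful of flows dominate.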

Recommended reading:
https://suricata.readthedocs.io/en/suricata-6.0.0/performance/index.html

You might also want to search for the SEPTun (Suricata Extreme Performance Tuning) guides.

Thank you for the great advice! I have updated the size of the memcaps in the stream section, which eliminated the TCP session and reassembly drops. Now the kernel drops have been reduced to around 0.8%. I will continue with your suggestions and see if I can reduce the kernel drops even further.
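In case it helps anyone else, I'm keeping an eye on the relevant counters with a simple tail on stats.log (default log location assumed):

tail -f /var/log/suricata/stats.log | grep -E 'kernel_drops|memcap_drop'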

I looked into your config and recommend playing around with the AF_PACKET recommendations in the official docs that were already linked. By starting with just -i you are running with the defaults; define the interfaces in the af-packet section instead and start Suricata with --af-packet (example below).
The system should be fine with several Gbit/s of traffic, unless there is some problematic traffic (elephant flows, broken protocols, etc.).
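Assuming an af-packet section along the lines of the snippet earlier in the thread, the original command would become (same user, group and BPF file, just without the per-interface -i flags):

suricata --af-packet --user suricata --group suricata -F /etc/suricata/BPF.txt -D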

I looked into your stats.log again and tcp.insert_data_normal_fail is really high. It could be related to Bug #4502 (TCP reassembly memuse approaching memcap value results in TCP detection being stopped), so you could try 5.0.6 for comparison and report back to us.