Yesterday I changed my setting back from autofp to workers and let Suricata run for about 23 hours. What I noticed is a much higher drop rate of about 5%; at one point it was even around 15% after 4 hours.
It seems to me that autofp gives much better performance for my setup. Therefore, I might switch back to autofp and adjust my cluster-type and other settings to work better with it.
Here is the latest entry of my stats.log:
------------------------------------------------------------------------------------
Date: 8/5/2020 -- 08:40:38 (uptime: 0d, 23h 07m 23s)
------------------------------------------------------------------------------------
Counter | TM Name | Value
------------------------------------------------------------------------------------
capture.kernel_packets | Total | 1212461422
capture.kernel_drops | Total | 60955119
decoder.pkts | Total | 1151505376
decoder.bytes | Total | 1168493087321
decoder.ipv4 | Total | 1148304120
decoder.ipv6 | Total | 376699
decoder.ethernet | Total | 1151505376
decoder.tcp | Total | 1104398468
decoder.udp | Total | 42814320
decoder.icmpv4 | Total | 1169220
decoder.icmpv6 | Total | 4834
decoder.vlan | Total | 697926590
decoder.vxlan | Total | 2
decoder.avg_pkt_size | Total | 1014
decoder.max_pkt_size | Total | 1514
flow.tcp | Total | 2517348
flow.udp | Total | 3989883
flow.icmpv4 | Total | 2732
flow.icmpv6 | Total | 1306
defrag.ipv4.fragments | Total | 293834
defrag.ipv4.reassembled | Total | 145886
decoder.event.ipv4.opt_pad_required | Total | 141
decoder.event.ipv6.zero_len_padn | Total | 790
tcp.sessions | Total | 1605811
tcp.pseudo | Total | 1153004
tcp.syn | Total | 3070944
tcp.synack | Total | 1237993
tcp.rst | Total | 2163387
tcp.pkt_on_wrong_thread | Total | 8025758
tcp.stream_depth_reached | Total | 904
tcp.reassembly_gap | Total | 8176
tcp.overlap | Total | 13812
tcp.insert_list_fail | Total | 814
detect.alert | Total | 2306475
app_layer.flow.http | Total | 346641
app_layer.tx.http | Total | 398476
app_layer.flow.smtp | Total | 422
app_layer.tx.smtp | Total | 851
app_layer.flow.tls | Total | 185482
app_layer.flow.ssh | Total | 47
app_layer.flow.smb | Total | 8072
app_layer.tx.smb | Total | 292243
app_layer.flow.dcerpc_tcp | Total | 11207
app_layer.flow.dns_tcp | Total | 10806
app_layer.tx.dns_tcp | Total | 41675
app_layer.flow.nfs_tcp | Total | 1
app_layer.tx.nfs_tcp | Total | 422
app_layer.flow.ntp | Total | 20748
app_layer.tx.ntp | Total | 35747
app_layer.flow.tftp | Total | 2
app_layer.tx.tftp | Total | 4
app_layer.flow.krb5_tcp | Total | 21123
app_layer.tx.krb5_tcp | Total | 21094
app_layer.flow.dhcp | Total | 529
app_layer.tx.dhcp | Total | 4901
app_layer.flow.snmp | Total | 231646
app_layer.tx.snmp | Total | 4952265
app_layer.flow.failed_tcp | Total | 56782
app_layer.flow.dcerpc_udp | Total | 23
app_layer.flow.dns_udp | Total | 2695795
app_layer.tx.dns_udp | Total | 6106081
app_layer.flow.krb5_udp | Total | 593
app_layer.tx.krb5_udp | Total | 475
app_layer.flow.failed_udp | Total | 1040547
flow_mgr.closed_pruned | Total | 651371
flow_mgr.new_pruned | Total | 3255558
flow_mgr.est_pruned | Total | 2554899
flow.spare | Total | 1048658
flow.tcp_reuse | Total | 31381
flow_mgr.flows_checked | Total | 381
flow_mgr.flows_notimeout | Total | 301
flow_mgr.flows_timeout | Total | 80
flow_mgr.flows_timeout_inuse | Total | 1
flow_mgr.flows_removed | Total | 79
flow_mgr.rows_checked | Total | 1048576
flow_mgr.rows_skipped | Total | 1048154
flow_mgr.rows_empty | Total | 55
flow_mgr.rows_maxlen | Total | 3
tcp.memuse | Total | 5734400
tcp.reassembly_memuse | Total | 983040
flow.memuse | Total | 427304032
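For reference, the ~5% drop rate mentioned above follows directly from the two capture counters at the top of that stats.log output (this is just arithmetic on the values shown):

```shell
# Drop rate = capture.kernel_drops / capture.kernel_packets, from the stats.log above
awk 'BEGIN { printf "%.2f%%\n", 100 * 60955119 / 1212461422 }'
# prints 5.03%
```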
I only see an average of about 380 MBit/s; this goes up to multiple GBit/s when our backup is running.
The one I used with the workers runmode: suricata_worker.yaml (68.9 KB)
And the one I am using currently with the autofp runmode: suricata_autofp.yaml (68.9 KB)
I also noticed that I am dropping some packets at my network interface; this could be the problem, if I am not mistaken. Imagine a flow has started and one packet of that flow is dropped at my interface: then the kernel would drop all packets of this flow that arrive before and after the one dropped at the interface. Correct me if I am wrong, though.
Can you narrow it down to whether the drops increase in the timeframes where the backups are happening? Elephant flows are a typical case where drops occur.
You could try cluster_flow in the workers mode.
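For reference, a minimal sketch of what that would look like in suricata.yaml (the interface name is taken from the commands later in this thread, and the cluster-id is an example; your values may differ):

```yaml
runmode: workers

af-packet:
  - interface: enp5s0f1
    cluster-id: 99
    # cluster_flow hashes each flow onto one thread, so a flow's packets stay together
    cluster-type: cluster_flow
```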
And those NIC drops can also be caused by elephant flows. Your NIC uses the ixgbe driver, right?
If you want to use cluster_qm, you should also ensure that you enable symmetric hashing and some other optimizations. If you have a fairly recent ethtool version and drivers, try this:
ethtool -L enp5s0f1 combined 10
ethtool -X enp5s0f1 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 10
ethtool -K enp5s0f1 rxhash on
ethtool -K enp5s0f1 ntuple on
for i in rx tx tso gso gro lro tx-nocache-copy sg txvlan rxvlan; do ethtool -K enp5s0f1 $i off; done
for proto in tcp4 udp4 tcp6 udp6; do ethtool -N enp5s0f1 rx-flow-hash $proto sdfn; done
ethtool -C enp5s0f1 adaptive-rx off rx-usecs 62
ethtool -G enp5s0f1 rx 1024
/usr/local/bin/set_irq_affinity 4-13 enp5s0f1
Some suggest using sd instead of sdfn, and other values are also worth experimenting with.
Ok, I changed back to the workers runmode and set the cluster-type to cluster_flow.
In addition, today I will monitor the drops and try to figure out whether they mainly happen when backups are running.
Moreover, I will look into the network card configuration and see if I can tweak some options there.
Ok. Yesterday I did not have that many dropped packets, so I am still below one percent of dropped packets.
Although I had some drops during backup times, I dropped the most around noon.
If my drops do not go through the roof and stay somewhere below 1%, it might be OK, I think.
Can I change the ring-size parameter in the suricata.yaml file to reduce drops? As I understand it, decreasing the value might reduce them. Right now it is set to 300000, and I would try changing it to 150000 or 200000 and see the effects.
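For context, the setting in question lives in the af-packet section of suricata.yaml; a sketch of the change being considered (interface name and values taken from this thread):

```yaml
af-packet:
  - interface: enp5s0f1
    # number of slots in the per-thread capture ring buffer;
    # currently 300000 here, trying a smaller value to see the effect on drops
    ring-size: 200000
```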
I would not recommend increasing the ring-size; I think at this point the problem might be elsewhere.
I looked at the provided workers yaml. It does not seem you are using AF_PACKET v3 (I might have missed that in the conversation; sorry if I did).
I suggest you try un-commenting the following in the af-packet section of the yaml and give it a run:
mmap-locked: yes
# Use tpacket_v3 capture mode, only active if use-mmap is true
# Don't use it in IPS or TAP mode as it causes severe latency
tpacket-v3: yes
I would rather decrease than increase the ring-size. If I am not mistaken, the original value was somewhere around 2000, and I changed it to 300000 for whatever reason.
Nevertheless, I will try your suggestion.
Regarding AFPv3, do you mean af_packet version 3? I am new to this kind of stuff, so I do not really know…
So it seems like that was the trick. The drops are much lower now. Thanks for the tip!
Now I have the problem that I can’t start Suricata with systemd…
In order to work with the changed config, I had to change the amount of memory a process can lock, because by default it is limited to 64 kB, as described in this post.
I changed the limits for the root user and the suricata user to unlimited. However, when I try to start Suricata with systemd, it fails with the following error: [ERRCODE: SC_ERR_MEM_ALLOC(1)] - Unable to mmap, error Resource temporarily unavailable
I can start Suricata as a daemon from the CLI with the -D switch, so it is not that urgent, but if you have had a similar issue at some point or have an idea how to fix it, I would be glad to hear it^^
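A quick way to check the effective locked-memory limit is ulimit; checking it for an already-running process is a separate step, shown as a comment since it assumes pidof finds exactly one Suricata PID:

```shell
# Print the soft limit on locked memory for the current shell, in kB ("unlimited" if unrestricted)
ulimit -l

# For a running Suricata process (assumes exactly one PID):
# grep 'Max locked memory' /proc/"$(pidof suricata)"/limits
```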
As mentioned in my previous post, I already did that.
After two days I am still at roughly 0.0% dropped packets, meaning what you proposed had a huge positive effect.
Now the only problem is that after uncommenting the two options (tpacket-v3: yes and mmap-locked: yes), I cannot start Suricata with this command: systemctl start suricata.service.
Here is the full error from the log file:
Aug 19 08:24:36 ebjen-ids suricata[379107]: 19/8/2020 -- 08:24:36 - <Notice> - all 10 packet processing threads, 4 management threads initialized, engine started.
Aug 19 08:24:36 ebjen-ids suricata[379107]: 19/8/2020 -- 08:24:36 - <Error> - [ERRCODE: SC_ERR_MEM_ALLOC(1)] - Unable to mmap, error Resource temporarily unavailable
Aug 19 08:24:36 ebjen-ids suricata[379107]: 19/8/2020 -- 08:24:36 - <Error> - [ERRCODE: SC_ERR_AFP_CREATE(190)] - Couldn't init AF_PACKET socket, fatal error
Aug 19 08:24:36 ebjen-ids suricata[379107]: 19/8/2020 -- 08:24:36 - <Error> - [ERRCODE: SC_ERR_FATAL(171)] - thread W#01-enp5s0f1 failed
Aug 19 08:24:36 ebjen-ids systemd[1]: suricata.service: Main process exited, code=exited, status=1/FAILURE
Aug 19 08:24:36 ebjen-ids systemd[1]: suricata.service: Failed with result 'exit-code'.
To me it looks like Suricata is unable to lock the memory map when started as a system service.
Nevertheless, it is not that big of a problem right now, because, as pointed out in the previous post, I am able to run Suricata in daemon mode if I enter the command manually; it is just not as nice as running it as a service.
@pevma After almost 2 days of running Suricata with tpacket-v3 enabled and mmap-locked disabled, I can say that the packet drop rate is not as high as it was at the beginning of this thread, but also not as low as when mmap-locked was enabled.
@Andreas_Herz Currently I use this Kernel: 4.18.0-193.6.3.el8_2.x86_64
There is a minor update to the 4.18.0-193.14.2.el8_2 Kernel available.
Edit:
I saw I never mentioned my Suricata version… I have the 5.0.3 release installed.
Without mmap-locked it is: ~0.49%
With mmap-locked it is: ~0.02%
I already read this thread earlier… that is how I found out about the default limit of 64 kB for locked memory. But thanks for pointing it out again, because I did not see the comment in there about the systemd service when I first read it.^^
I had to add LimitMEMLOCK=infinity to the suricata.service file. Now I can start Suricata as a service with mmap-locked enabled. I will monitor Suricata and the drops over the weekend and into next Monday, and report back on Tuesday.
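An alternative to editing the unit file directly is a systemd drop-in override, so package updates don’t overwrite the change (a sketch; the drop-in path is the standard systemd convention, not something from this thread):

```ini
# /etc/systemd/system/suricata.service.d/override.conf
# created e.g. via: systemctl edit suricata.service
[Service]
LimitMEMLOCK=infinity
```

After that, run `systemctl daemon-reload` and restart the service.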
In case everything is fine, I would like to summarize what we have done here and mark it as the solution, unless one of you wants to do that.
Sorry for the late response.
Anyway, I am glad to report that everything is working fine. The dropped packets are way below 1% (0.1% when I last checked).
Thanks for the great help!
First of all, I want to thank everybody who helped me with this problem!
If anybody has a similar problem, here is what we did:
set cluster-type: to cluster_flow
set runmode: to workers
activate and configure the cpu-affinity settings
In the end, what really did the trick, I think, was setting mmap-locked: and tpacket-v3: to yes. But in order to use the mmap-locked option, you have to edit the /etc/security/limits.conf file and add something like the following to the end of the file; otherwise Suricata will fail because it cannot lock enough memory:
suricata hard memlock unlimited
suricata soft memlock unlimited
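Putting the summary together, the relevant suricata.yaml pieces look roughly like this (a sketch based on the settings discussed in this thread; the interface name and cluster-id are examples):

```yaml
runmode: workers

af-packet:
  - interface: enp5s0f1
    cluster-id: 99
    cluster-type: cluster_flow
    # TPACKET_V3 capture mode; avoid in IPS/TAP mode due to added latency
    tpacket-v3: yes
    # requires a raised memlock limit (limits.conf above, plus
    # LimitMEMLOCK=infinity in the systemd unit if started as a service)
    mmap-locked: yes
```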