Suricata 6.0.1 high packet loss

Happy Monday!

After being away from Suricata since v3.*, I'm happy to be back. Lots of things have changed, and some things have not.

Just turned on this version of Suricata (6.0.1) on a CentOS 7 machine and directed the traffic to it (IDS, not IPS).
Right from the start I noticed a very high number of dropped packets (30-50%):
18/1/2021 – 03:23:29 - - Stats for ‘eth5’: pkts: 6121165, drop: 3327089 (54.35%), invalid chksum: 0

AND

18/1/2021 – 04:21:15 - - Stats for ‘eth5’: pkts: 1883180, drop: 733267 (38.94%), invalid chksum: 0

stats.log and suricata.yaml attached: suricata.yaml (71.2 KB), stats_log_1182021.log (6.0 KB)

Also, the result when running ethtool -i eth5:

driver: i40e
version: 2.8.20-k
firmware-version: 7.10 0x800075e1 19.5.12
expansion-rom-version:
bus-info: 0000:3b:00.3
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

I followed the post below, but had no success on my side.

Thanks for checking and helping out in advance.

Hi, welcome to the community!

Could you share some information about the hardware you’re using?

It might be helpful to see your suricata.log file too – if you can upload that, please do.

Good morning Jeff, and thank you for getting back to me.

Here are some screenshots of the hardware we are using:

(screenshot: vm_stats)

Attached you can also find the suricata.log file as requested: suricata.log (11.2 KB)

Let me know if you need anything else; I am happy to provide it.

S.

I can recommend taking a look at 9.5. High Performance Configuration — Suricata 7.0.0-dev documentation. Also, what amount of traffic do you see?

Hi @Andreas_Herz, thank you for getting back to me. In the last 12 days I have been tinkering with the settings and making minor adjustments to the yaml file.

I have now excluded everything besides the alerts output and also excluded all of the stream alerts.

The packet drop is still relatively high, but it has been reduced to around 8% over the last hour.

The amount of traffic is 2 Gbps at most.

Can you share the last update of your stats.log after running for a few hours?

Please make sure the NIC offloading and the other parameters from the docs are in place. A good first try would also be to disable VLAN tracking in the yaml config and see whether that is interfering.
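Something along these lines in suricata.yaml should do it (just illustrating the setting, adjust to your own config):

vlan:
  use-for-tracking: false   # do not use VLAN IDs for flow tracking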

Happy Monday!

stats.log screenshots attached: one after about 2 days of running and another from the last ~2 hrs, after making the change regarding VLAN tracking (still at 16%).

Also added the ethtool show-offload output for reference and attached the current suricata.yaml: suricata.yaml (71.4 KB)

Still battling packet loss here. I didn't change anything on the NIC besides the following:

# reload the i40e driver and take the interface down before reconfiguring
sudo rmmod i40e
sudo modprobe i40e
sudo ifconfig eth5 down
# use 24 combined queues, enable receive hashing and ntuple filters
sudo ethtool -L eth5 combined 24
sudo ethtool -K eth5 rxhash on
sudo ethtool -K eth5 ntuple on
sudo ifconfig eth5 up
# symmetric RSS hash key, spread equally over the 24 queues
sudo ethtool -X eth5 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 24
# disable pause frames and adaptive interrupt moderation, fix interrupt coalescing
sudo ethtool -A eth5 rx off tx off
sudo ethtool -C eth5 adaptive-rx off adaptive-tx off rx-usecs 125
# set the RX ring to 1024 descriptors
sudo ethtool -G eth5 rx 1024
# hash on src/dst IP and ports for TCP/UDP over IPv4 and IPv6
for proto in tcp4 udp4 tcp6 udp6; do
  echo "sudo ethtool -N eth5 rx-flow-hash $proto sdfn"
  sudo ethtool -N eth5 rx-flow-hash $proto sdfn
done

Attached my latest yaml file (I restarted with a blank one as I got lost in all the changes).
suricata.yaml (71.0 KB)

I didn't change any cpu-affinity settings yet, but did turn off VLAN tracking.

Noticed that CPU usage is very high:

Another overview of the traffic for reference.

(screenshot attached)

I was planning to adjust the cpu-affinity settings as follows:

threading:
 cpu-affinity:
   - management-cpu-set:
       cpu: [ "1-10" ]  # include only these CPUs in affinity settings
   - receive-cpu-set:
       cpu: [ "0-10" ]  # include only these CPUs in affinity settings
   - worker-cpu-set:
       cpu: [ "1","3","5","7","9","11","13","15","17","19","21","23","25","27","29","31","33","35","37","39","41","43","45","47" ]
       mode: "exclusive"
       prio:
         low: [ 0 ]
         medium: [ "1" ]
         high: [ "1","3","5","7","9","11","13","15","17","19","21","23","25","27","29","31","33","35","37","39","41","43","45","47" ]
         default: "high"

Also perf shows me:

Just change the runmode to workers.

Might be worth a try to crank the af-packet ring-size and block-size way up.
Have a look at the example values in the high performance tuning guide linked above.
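Roughly along these lines in the af-packet section, together with the workers runmode mentioned above (the values are only illustrative, not tuned for your box; cluster_qm assumes the RSS/queue setup you did with ethtool, otherwise cluster_flow also works):

runmode: workers

af-packet:
  - interface: eth5
    threads: 24                # one worker per RSS queue
    cluster-id: 99
    cluster-type: cluster_qm   # packets from one NIC queue go to one thread
    defrag: yes
    use-mmap: yes
    tpacket-v3: yes
    ring-size: 100000          # frames buffered per thread before the kernel drops
    block-size: 1048576        # tpacket-v3 block size in bytes, example value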

Do you see memcap drops in your stats.log? The reassembly memcap looks a bit low.
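The relevant knobs look roughly like this (example values only; size them to the memory you actually have free):

stream:
  memcap: 1gb
  reassembly:
    memcap: 4gb   # raise if stats.log shows reassembly memcap hits
    depth: 1mb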

Thanks @syoc for your insights.
I made the following adjustments on the mpm side (no hyperscan available):

  • mpm-algo: ac-ks
  • detect.sgh-mpm-context: full

(screenshot attached)
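For reference, in the yaml those two changes correspond to:

mpm-algo: ac-ks

detect:
  sgh-mpm-context: full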

The online documentation mentions increasing ring-size all the way up to 100K. My largest packet according to stats.log is 1433 bytes, and with 48 threads that works out to close to 7 GB of ring memory!!!
Does choosing 10K make more sense?
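Rough math, assuming each ring frame is about the size of the largest packet seen:

af-packet:
  - interface: eth5
    # 100,000 frames x 1,433 B x 48 threads ≈ 6.9 GB of ring memory
    #  10,000 frames x 1,433 B x 48 threads ≈ 0.7 GB
    ring-size: 10000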

I have been running Suricata for about 1 hr now and I am at a dropped packet rate of 1.6%.
Certainly better than the 20-40% I came from. However, given that the traffic is at most 2 Gbps, with a potential increase to 10 Gbps, this is still not good.

Also, CPU usage still seems high, although perf now shows more detail and points towards the MPM?

I haven't tried stream-bypass yet; maybe I should?
https://suricata.readthedocs.io/en/latest/performance/tuning-considerations.html?highlight=mpm#stream-bypass
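If I try it, I assume it would look something like this in the yaml (just a sketch; the depth is whatever reassembly depth is already configured):

stream:
  bypass: yes        # stop inspecting a flow once the reassembly depth is reached
  reassembly:
    depth: 1mb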

Also, I had previously tried the solution from this thread, which did not work for me: https://forum.suricata.io/t/suricata-high-capture-kernel-drops-count

@syoc, with regard to your question, do you mean this:

(screenshot attached)

This is after roughly 15 million packets.

A correction to all of the above: I am now dealing with 64% packet drop :frowning:
I removed the ring-size / block-size changes and I am back to “normal”.

What is that spike in traffic each hour? Do you see the big increase in drops in the same timeframe? This might be an elephant flow. I would recommend plotting the drop rate the same way you do the traffic and seeing if there is a correlation.

Hi Andreas, good morning,

The spike is around 1.5 GB, and the data is being sent to the hardware through a core switch. The firewall logs on the same VLAN tell me there is no hourly spike in terms of packets/bytes/events.

My guess is that the switch is collecting all the data and then releasing it, making it hard for Suricata to reassemble the packets?

Besides that, there is a steady increase in drops relative to the number of packets received, and indeed a steeper ascent at the times when more traffic is being sent to the device.

I notice Splunk in that process list.

For what it's worth, I had similar behavior in an environment once, and it was because of Filebeat/Elasticsearch. Some update at some point caused a huge load increase every hour or so, and packet loss would climb in Suricata (even though it appeared not to be doing much of anything). I noticed errors coming up in dmesg relating to OOM (out of memory). Suricata was CPU-pinned and tuned about as much as your config seems to be, from BIOS to GRUB to the Suricata yaml.

We changed our Elastic/Filebeat configs and updated the kernel from 3.10.?.? to 5.?.? - probably a bit extreme on the kernel side, but it helped tremendously.

I'm not sure if you've ruled out any process interference or noticed anything weird in the service/systemd or dmesg logs, but honestly that is where I would start.

Thanks for the feedback @cthomas! I was just looking at dmesg and this caught my eye:

I have fewer than 30K rules running according to suricata-update.
Attached the most recent .yaml: suricata_backup03022021.yaml (71.6 KB)

When running perf I see the following:

Maybe my .yaml is too complex? @Andreas_Herz

Did you ever try pinning CPUs? This last yaml looks like the default.

GitHub - pevma/SEPTun: Suricata Extreme Performance Tuning guide might help with the amount of loss, but the periodic timing still seems like something curious is going on.

Hi @cthomas, @Andreas_Herz, @pevma, @Jeff_Lucovsky, thank you all for your help and advice so far. I upgraded to 6.0.2 and the overhead is gone.

Also, my dropped packets are at 1%, which I can live with!

So the upgrade did help massively.

Glad to hear the new release has a positive impact.
Thank you for your feedback!