Running DPDK 24.07.0 and Suricata 8 from the main repo (8.0.0-dev, dd71ef0af 2024-11-05) on Red Hat 8 (4.18.0-553.22.1.el8_10.x86_64) for some weeks now. After a restart of Suricata, most of the time one or more 100G spanports go inactive: no link seen.
Today this worked again to get the spanport active:
$ systemctl stop suricata
$ /usr/local/bin/dpdk-devbind.py -b ice 0000:84:00.1
$ /usr/local/bin/dpdk-devbind.py --bind=vfio-pci 0000:84:00.1
$ systemctl start suricata
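For reference, the recovery steps above can be wrapped in a small helper. A hedged sketch, not from the original setup: the function name, the DRY_RUN switch, and the sleep duration are my own assumptions for illustration.

```shell
#!/bin/sh
# recover_port (hypothetical helper): rebind a dead port to the kernel ice
# driver and back to vfio-pci, with Suricata stopped around the rebind.
recover_port() {
    pci="$1"
    # run: execute a command, or just print it when DRY_RUN=1
    run() {
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$*"
        else
            "$@"
        fi
    }
    run systemctl stop suricata
    run /usr/local/bin/dpdk-devbind.py --bind=ice "$pci"
    run sleep 5    # give the kernel ice driver time to bring the link up
    run /usr/local/bin/dpdk-devbind.py --bind=vfio-pci "$pci"
    run systemctl start suricata
}

# usage: recover_port 0000:84:00.1
```

With DRY_RUN=1 the function only prints the commands, which makes it easy to verify the sequence before wiring it into the nightly log rotation.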
What to do? Go back to DPDK 24.03.0, which had no issues, or do you have some ideas/tips to debug this?
We can work on the bug report and then reach out to Intel through the DPDK "users" mailing list.
I guess I would need more info. How do you restart Suricata: is it a rule reload, or stopping it and then starting it again?
I guess the spanport information is not crucial here as it is a configuration outside of the current ice port, am I right?
Does it happen right after Suricata startup, or does Suricata usually need to run for a few hours/days before it happens? If you start and stop Suricata immediately, can you still lose the link?
Especially if it can be reproduced immediately, can you test it with other DPDK apps as well, e.g. dpdk-testpmd?
I've noticed that ice seems to have a delayed startup of the first second or two: Suricata says it has started the workers and everything, but the card only really starts to receive packets after those two seconds. This can also be observed with the aforementioned testpmd app, which shows something like "Link status changed". So maybe the startup lag is a bit longer in this version? Did you give it enough time to start before killing it?
So I assume you monitor multiple ice ports and they randomly go inactive. Is it correct or have you seen particularly one port being more prone to this?
Rule reload has no effect; it is indeed a service restart, or stop/start via systemctl.
Yes you are right
I need to test that one. It restarts every night due to log rotation, and then an interface usually disappears.
I need to test that one
You mean Suricata? I should check the logs; sometimes it takes more time because of closing pcap logging. But nothing changed except the DPDK version and the daily Suricata restart.
Last week three ports went down; yesterday and today one (the same port both times).
I tried to test it with testpmd and it works ok for me.
sudo dpdk-testpmd -a af:00.1 -l 0,1
I tried to run it a dozen times and it worked (but the longest run was about 30 mins).
But I had to wait around 5 seconds until the "Port 0: link state change event" message showed; before that I received 0 packets.
I tried it with Suricata as well and had no problems either - again no long runs, rather starts & stops - but it didn't occur. Tried it with fresh master (278dc24cd).
Maybe your card is acting funky? Maybe something else is intercepting the traffic?
If your testpmd works fine, you could go back to 24.03 and verify that it works 100% all the time.
It was going well for days (since the post), but today the same port went down again after a restart during the log cycle. So maybe it is indeed a funky card, SFP, span config or such. I've cut the Suricata rules down by half (70k to under 40k), but to no avail.
Now I've modified the logrotate script and replaced the Suricata restart with a stop and a start, to see if this has a positive timing effect.
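Not from the original setup, but for illustration, a logrotate fragment along those lines might look like this (the log path, rotation policy, and sleep duration are assumptions):

```
/var/log/suricata/*.log {
    daily
    rotate 7
    compress
    missingok
    sharedscripts
    postrotate
        # stop/start instead of a plain restart, as described above
        systemctl stop suricata
        sleep 5
        systemctl start suricata
    endscript
}
```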
Thanks for your time, much appreciated!
Same behaviour with DPDK 24.03.0, so a flaky card/port/sfp/line I guess.
Only clue I see now:
Nov 11 09:19:19 kernel: ice 0000:84:00.1: IRDMA hardware initialization FAILED init_state=2 status=-12
Nov 11 12:23:50 kernel: ice 0000:84:00.1: IRDMA hardware initialization FAILED init_state=2 status=-12
Nov 11 12:24:15 kernel: ice 0000:84:00.1: IRDMA hardware initialization FAILED init_state=2 status=-12
Nov 11 14:55:56 kernel: ice 0000:84:00.1: IRDMA hardware initialization FAILED init_state=2 status=-12
But that's when unbinding from DPDK, I guess.
Last night the same problem again: startup looks OK, but no data on 84:00.1.
Nov 11 03:35:55 suricata[994583]: Warning: dpdk: 0000:10:00.0: device queue descriptors adjusted (RX: from 32768 to 4096, TX: from 32768 to 4096) [DeviceConfigure:runmode-dpdk.c:1522]
Nov 11 03:35:55 suricata[994583]: Notice: log-pcap: Ring buffer initialized with 99 files. [PcapLogInitRingBuffer:log-pcap.c:987]
Nov 11 03:35:59 suricata[994583]: Warning: dpdk: 0000:84:00.0: device queue descriptors adjusted (RX: from 32768 to 4096, TX: from 32768 to 4096) [DeviceConfigure:runmode-dpdk.c:1522]
Nov 11 03:36:03 suricata[994583]: Warning: dpdk: 0000:10:00.1: device queue descriptors adjusted (RX: from 32768 to 4096, TX: from 32768 to 4096) [DeviceConfigure:runmode-dpdk.c:1522]
Nov 11 03:36:07 suricata[994583]: Warning: dpdk: 0000:0f:00.0: device queue descriptors adjusted (RX: from 32768 to 4096, TX: from 32768 to 4096) [DeviceConfigure:runmode-dpdk.c:1522]
Nov 11 03:36:08 suricata[994583]: Warning: dpdk: 0000:84:00.1: device queue descriptors adjusted (RX: from 32768 to 4096, TX: from 32768 to 4096) [DeviceConfigure:runmode-dpdk.c:1522]
Nov 11 03:36:13 suricata[994583]: Notice: threads: Threads created → W: 102 FM: 5 FR: 5 Engine started. [TmThreadWaitOnThreadRunning:tm-threads.c:1894]
It is the last interface in the suricata.yaml config file, so the last one to start up.
→ /ethdev/link_status,4
{
  "/ethdev/link_status": {
    "status": "DOWN"
  }
}
→
On the side of the switch providing the span data, the port is not connected, and the timestamp matches the Suricata stop time:
Nov 12 02:29:03.776: %LINK-3-UPDOWN: Interface HundredGigE1/1/0/41, changed state to down (UTC)
As a workaround I have been running a logrotate script which first stops Suricata, changes the DPDK binding back to ice, sleeps a few seconds, loads the DPDK driver, sleeps again, and then starts Suricata. But even that is not a 100% guarantee it will work; see the attached file.
Cheers a.yaml (8.9 KB)
It is a text file, but the txt extension was grayed out when selecting the upload.
Hi Lucas, until now yes. It seems to be an issue with Solid Optics QSFPs; yesterday we had to replace 2 others (now with Cisco ones) to bring up spanports in another datacenter. Odd stuff.