Suricata interface keeps going down

I’m fairly certain this is a hardware issue, but I wanted to check here to make sure there isn’t anything Suricata-specific I should troubleshoot before going and replacing the NIC. Recently, I can’t keep Suricata running for more than 24 hours before this happens (suricata.log):

21/8/2023 -- 16:45:16 - <Error> - [ERRCODE: SC_ERR_AFP_READ(191)] - Error reading data from iface 'ixgbe2': (100) Network is down
21/8/2023 -- 16:45:24 - <Error> - [ERRCODE: SC_ERR_AFP_READ(191)] - Error reading data from iface 'ixgbe2': (100) Network is down
21/8/2023 -- 16:45:25 - <Error> - [ERRCODE: SC_ERR_AFP_READ(191)] - Error reading data from iface 'ixgbe2': (100) Network is down
21/8/2023 -- 16:45:34 - <Error> - [ERRCODE: SC_ERR_AFP_READ(191)] - Error reading data from iface 'ixgbe2': (100) Network is down
21/8/2023 -- 16:45:34 - <Error> - [ERRCODE: SC_ERR_AFP_READ(191)] - Error reading data from iface 'ixgbe2': (100) Network is down
21/8/2023 -- 16:45:37 - <Error> - [ERRCODE: SC_ERR_AFP_READ(191)] - Error reading data from iface 'ixgbe2': (100) Network is down
21/8/2023 -- 16:45:50 - <Error> - [ERRCODE: SC_ERR_AFP_READ(191)] - Error reading data from iface 'ixgbe2': (100) Network is down
21/8/2023 -- 16:45:51 - <Error> - [ERRCODE: SC_ERR_AFP_READ(191)] - Error reading data from iface 'ixgbe2': (100) Network is down
21/8/2023 -- 16:45:55 - <Error> - [ERRCODE: SC_ERR_AFP_READ(191)] - Error reading data from iface 'ixgbe2': (100) Network is down
21/8/2023 -- 16:45:55 - <Error> - [ERRCODE: SC_ERR_AFP_READ(191)] - Error reading data from iface 'ixgbe2': (100) Network is down
21/8/2023 -- 16:46:01 - <Error> - [ERRCODE: SC_ERR_AFP_READ(191)] - Error reading data from iface 'ixgbe2': (100) Network is down
21/8/2023 -- 16:46:05 - <Error> - [ERRCODE: SC_ERR_AFP_READ(191)] - Error reading data from iface 'ixgbe2': (100) Network is down
21/8/2023 -- 16:46:07 - <Error> - [ERRCODE: SC_ERR_AFP_READ(191)] - Error reading data from iface 'ixgbe2': (100) Network is down
21/8/2023 -- 16:46:10 - <Error> - [ERRCODE: SC_ERR_AFP_READ(191)] - Error reading data from iface 'ixgbe2': (100) Network is down

Immediately before Suricata logs that the interface is down, this appears in dmesg:

[Mon Aug 21 16:44:50 2023] ixgbe 0000:01:00.0: removed PHC on ixgbe2
[Mon Aug 21 16:44:50 2023] ixgbe 0000:01:00.0: registered PHC device on ixgbe2
[Mon Aug 21 16:44:50 2023] IPv6: ADDRCONF(NETDEV_UP): ixgbe2: link is not ready
[Mon Aug 21 16:44:50 2023] ixgbe 0000:01:00.0 ixgbe2: detected SFP+: 4
[Mon Aug 21 16:44:53 2023] ixgbe 0000:01:00.0 ixgbe2: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[Mon Aug 21 16:44:53 2023] IPv6: ADDRCONF(NETDEV_CHANGE): ixgbe2: link becomes ready

Anything I can/should look at in Suricata to troubleshoot this? We’ve been running on the same system with the same config for years now without issue, which is making me lean toward a hardware issue…

This is Suricata 6.0.13 in af-packet mode, running on fully patched CentOS 7.
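
In case it’s useful, these are the standard ethtool/iproute2 checks I’ve been running to see whether the link itself is flapping (nothing Suricata-specific here):

# Link state and negotiated speed as the driver sees them
ethtool ixgbe2

# NIC/driver counters; errors or CRCs here would point at the card or optics
ethtool -S ixgbe2 | grep -iE 'error|crc'

# Watch link state changes in real time
ip monitor link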

TIA!

Hi @abwhite!

From what I can see, this error comes up while polling for data from the socket. I haven’t seen others report this either, so I’d also lean toward your assumption.
Just to be sure though, did you recently upgrade Suricata before this started? Are you seeing this on any of your other machines too?

Thanks for your response @sbhardwaj! I actually did recently upgrade Suricata from 6.0.6 to 6.0.13. I can’t remember for sure, but I think it was running fine for a couple of weeks before the interface started dropping every day. Is there anything I can check to confirm whether it’s the new version that’s causing the problem?

We have one other nearly identical system (same OS and NIC) that has Suricata on it but has not been in use. I’ll replicate my Suricata config on that system and see if I get the same issue.

@sbhardwaj I just remembered another thing that changed when I upgraded: our original install was built from source, but for the upgrade I installed from the RPM. Could this have anything to do with the issue I’m experiencing?

The change from a source install to the RPM shouldn’t matter here.
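
If you want to rule out a build difference between the old source install and the RPM, you can compare the compile-time options of each binary:

# Compile-time options and library versions of whatever suricata is on the path
suricata --build-info

# Confirm which package version the RPM installed
rpm -q suricata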

I think Suricata is just reporting here that the interface is down; it isn’t the cause. Probably something more to do with your OS network configuration?
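
One way to check that is to look at what the OS-side network services did around the timestamps from your dmesg output (CentOS 7 unit names below; adjust them to whatever actually manages that interface on your box):

# Anything NetworkManager or the legacy network service logged around the flap
journalctl -u NetworkManager -u network --since "2023-08-21 16:40" --until "2023-08-21 16:50"

# See whether anything is managing the capture interface at all
nmcli device status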

Ok, so I’ve now had Suricata running on a second system with identical hardware and OS. It’s been more than 24 hours, and so far the interface has stayed up. I’ll refer to the original system as suri-1 and this second one as suri-2 from now on.

One thing I can’t figure out is why suri-2 is using about twice the RAM and threads of suri-1: ~140 GB RAM and 28 threads on suri-2 vs. ~70 GB RAM and 14 threads on suri-1. I’m using the same suricata.yaml on each system, both have hyperthreading disabled, and both use CPU pinning. Here’s the af-packet section from my suricata.yaml:

af-packet:
  - interface: ixgbe2
    threads: 14
    cluster_id: 99
    cluster_type: cluster_flow
    defrag: yes
    use-mmap: yes
    mmap-locked: no
    tpacket-v3: yes
    ring-size: 400000
    block-size: 393216
    disable-promisc: no

and the output of lscpu:

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    1
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
Stepping:              2
CPU MHz:               1299.224
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              4599.99
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              40960K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single ssbd rsb_ctxsw ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
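
For reference, I’m measuring the thread and memory usage like this (the W# prefix matches the worker-thread names I see in ps output on these boxes; treat that pattern as an assumption for other setups):

# Count Suricata packet worker threads by thread name
ps -T -p "$(pidof suricata)" -o comm= | grep -c '^W#'

# Resident memory of the Suricata process, in kB
ps -p "$(pidof suricata)" -o rss=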

Any ideas?

Are they both seeing the same traffic, and have they been running for the same amount of time?

One could be getting more traffic, or just a higher proportion of more expensive traffic like SMB.
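
If you have the unix socket enabled in suricata.yaml, you can dump the live counters on both boxes and compare the capture and app-layer numbers directly:

# Same counters that go to stats.log, on demand
suricatasc -c dump-counters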

Yes, the same traffic is running through both boxes. suri-2 has been up for the same amount of time as suri-1, but Suricata wasn’t actually running on it until last week, when I started experimenting. Unfortunately, the interface on suri-2 just went down, so a software issue is still a possibility =( This is all I could find in the various system logs right before suricata.log records the interface dropping.

dmesg:

[Tue Aug 29 12:35:37 2023] ixgbe 0000:01:00.0: removed PHC on ixgbe2
[Tue Aug 29 12:35:37 2023] ixgbe 0000:01:00.0: registered PHC device on ixgbe2
[Tue Aug 29 12:35:37 2023] IPv6: ADDRCONF(NETDEV_UP): ixgbe2: link is not ready
[Tue Aug 29 12:35:37 2023] ixgbe 0000:01:00.0 ixgbe2: detected SFP+: 4
[Tue Aug 29 12:35:39 2023] ixgbe 0000:01:00.0 ixgbe2: NIC Link is Up 10 Gbps, Flow Control: None
[Tue Aug 29 12:35:39 2023] IPv6: ADDRCONF(NETDEV_CHANGE): ixgbe2: link becomes ready

/var/log/messages:

Aug 29 12:35:38 itsec-prod-suri-2 kernel: ixgbe 0000:01:00.0: removed PHC on ixgbe2
Aug 29 12:35:38 itsec-prod-suri-2 kernel: ixgbe 0000:01:00.0: registered PHC device on ixgbe2
Aug 29 12:35:38 itsec-prod-suri-2 kernel: IPv6: ADDRCONF(NETDEV_UP): ixgbe2: link is not ready
Aug 29 12:35:38 itsec-prod-suri-2 setup-cap-nic.sh: RX flow hash indirection table for ixgbe2 with 14 RX ring(s):
Aug 29 12:35:38 itsec-prod-suri-2 kernel: ixgbe 0000:01:00.0 ixgbe2: detected SFP+: 4

We were able to narrow this down to our chef-client stomping on firewalld; stopping the service fixed the issue. I appreciate everyone’s time looking into this, and sorry for the red herring.
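
For anyone who hits this later, the immediate fix on our side was just the following (CentOS 7 / systemd; keeping chef-client from turning the service back on is specific to our cookbooks and not shown):

# Stop firewalld and keep it from coming back at boot
systemctl stop firewalld
systemctl disable firewalld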