I have a little problem getting my Suricata install stable in my IDS environment.
First, let me explain what my setup is:
I have fibertap devices that copy both the outgoing and the incoming traffic of my internet feeds to separate interfaces on my IDS probe servers. On these servers I run PF_RING with the ZC license, which I use to create multiple data streams of the traffic so that both Zeek and Suricata can listen to it (zbalance_ipc clusters).
Zeek has been running fine on this setup for almost 2 years now, so I’m convinced that the setup in itself is sane and working.
After a reboot of the server, starting Suricata works fine most of the time, as long as I wait long enough for the zbalance_ipc cluster to settle down. Once the correct ownership and permissions are set on the hugemem files of the zbalance_ipc cluster, I am able to start Suricata.
After some time, when I try to restart Suricata, it fails to start again with this error:
4/11/2020 -- 13:09:10 - <Error> - [ERRCODE: SC_ERR_PF_RING_OPEN(34)] - Failed to open zc:0@2: pfring_open error. Check if zc:0@2 exists and pf_ring module is loaded.
4/11/2020 -- 13:09:11 - <Error> - [ERRCODE: SC_ERR_PF_RING_OPEN(34)] - Failed to open zc:0@3: pfring_open error. Check if zc:0@3 exists and pf_ring module is loaded.
4/11/2020 -- 13:09:11 - <Error> - [ERRCODE: SC_ERR_THREAD_INIT(49)] - thread "W#01-zc:0@2" failed to initialize: flags 0145
4/11/2020 -- 13:09:11 - <Error> - [ERRCODE: SC_ERR_FATAL(171)] - Engine initialization failed, aborting...
After this happens, pfcount also returns an error that it can no longer read the interfaces, so it looks like some internal structures of the pf_ring zbalance_ipc cluster get corrupted, and this can only be fixed by rebooting the whole server.
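For reference, the failing call is the same one pfcount makes. Below is a minimal standalone sketch using the PF_RING C API (my own test program, not Suricata code; the caplen and flag values are just examples) that tries to open the queue Suricata complains about:

```c
/* Minimal sketch: try to attach to ZC cluster 0, queue 2, the same endpoint
 * Suricata fails on. Build against libpfring, e.g.:
 *   gcc zc_open_test.c -o zc_open_test -lpfring -lpcap
 */
#include <stdio.h>
#include <pfring.h>

int main(void) {
    const char *dev = "zc:0@2";   /* queue 2 of ZC cluster 0 */

    /* caplen 1536 and PF_RING_PROMISC are just placeholder values */
    pfring *ring = pfring_open(dev, 1536, PF_RING_PROMISC);
    if (ring == NULL) {
        /* Same failure path as SC_ERR_PF_RING_OPEN above: either the
         * zbalance_ipc queue is gone or the pf_ring module is not loaded. */
        fprintf(stderr, "pfring_open(%s) failed\n", dev);
        return 1;
    }

    if (pfring_enable_ring(ring) != 0)
        fprintf(stderr, "pfring_enable_ring(%s) failed\n", dev);
    else
        printf("%s opened fine\n", dev);

    pfring_close(ring);
    return 0;
}
```

Once the problem state has been triggered, I would expect this to fail just like pfcount does, which would confirm that the cluster itself ends up broken rather than Suricata merely failing to re-attach.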
Could someone help me debug this issue? I suspect something like a memory leak in the Suricata process that leads to this problem state.
Nov 04 15:56:44 idsprobe02.ids.be.nl kernel: W#01-zc:0@2[5119]: segfault at 130 ip 000055c4cfc71c08 sp 00007f4a18fa2418 error 4 in suricata[55c4cfa4d000+61c000]
Nov 04 15:56:44 idsprobe02.ids.be.nl systemd[1]: suricata@0.service: main process exited, code=killed, status=11/SEGV
Nov 04 15:56:44 idsprobe02.ids.be.nl systemd[1]: Unit suricata@0.service entered failed state.
Nov 04 15:56:44 idsprobe02.ids.be.nl systemd[1]: suricata@0.service failed.
I’m going to create a ticket in the Suricata redmine.
The problem is happening multiple times a day in my IDS cluster, so I should be able to create a coredump and gather some additional information.
I’m also in touch with them.
But according to my debugger the error is in:
Core was generated by `/sbin/suricata -c /etc/suricata/cluster0.yaml --pidfile /var/run/suricata/clust'.
Program terminated with signal 11, Segmentation fault.
#0  0x000055efde4e7c08 in StorageGetById (storage=storage@entry=0x128, type=type@entry=STORAGE_FLOW, id=1) at util-storage.c:224
224 return storage[id];
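One detail that stands out when combining the two traces: storage is 0x128 and id is 1, so with 8-byte pointers storage[id] reads address 0x128 + 8 = 0x130, which is exactly the "segfault at 130" in the kernel log. The per-flow storage pointer itself is a tiny bogus value rather than a valid heap address, which looks like a struct field offset read off a NULL (or already freed) object. A simplified illustration of the pattern, not the actual util-storage.c code:

```c
/* Simplified illustration of the crash pattern, not the real Suricata source. */
#include <stddef.h>

typedef void *Storage;   /* one opaque slot per registered storage id */

void *StorageGetById(const Storage *storage, int id) {
    if (storage == NULL)     /* a plain NULL check would not catch 0x128 */
        return NULL;
    return storage[id];      /* with storage == 0x128 and id == 1 this
                                dereferences address 0x130 -> SIGSEGV */
}
```

So the interesting question is how a flow ends up with that bogus storage pointer in the first place.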
No. As we are not seeing this in general usage, I wonder if PF_RING is part of the issue. It would be good to know if it still fails for you when not using PF_RING, if that is an option.
Very difficult, because I won't have any substantial traffic to run through it then.
But have you looked at the issue / backtrace and identified why it is happening? To me it looks like there is some fault state that is not handled properly. Besides that, I see a memory out-of-bounds error, which suggests to me that this could also have a security impact.
I am running 6.0.1 on Red Hat 7.9. The NIC is a 10G Intel card. We consistently see 50-70% utilization.
I am seeing a similar SEGV fault on a couple of my sensors. This just started over the last few days; the sensors have been running 6.0.1 for nearly a month. Since upgrading I have noticed spikes in capture_kernel_packet drops, typically short in duration, after which the drop rate falls again. I also see a consistent 5-10% tcp.segment_memcap_drop, which I believe is due to the stream memcap settings.
Previously we were on 4.18.
suricata.service: main process exited, code=killed, status=11/SEGV
I will see what I can do. I am planning to update the sensors that are exhibiting the SEGV fault to 6.0.2, and I am hoping the various fixes in 6.0.2 will address this.