Suricata breaks after running fine for a while

Hello everyone,

I have a problem keeping my Suricata install stable in my IDS environment.
First, let me explain what my setup is:
I have fibertap devices that copy both the outgoing and the incoming traffic on my internet feeds to separate interfaces on my IDS probe servers. On these servers I run PF_RING with the ZC license and use it to create multiple data streams of the traffic (zbalance_ipc clusters) so that both Zeek and Suricata can listen to it.
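For context, the fan-out is started with something along these lines (just a sketch; the interface name, queue counts and core pinning here are examples, not my exact values):

# Split the tap traffic on ZC cluster 0 into two groups of queues,
# e.g. one group for Zeek and one for Suricata.
zbalance_ipc -i zc:enp1s0f0 -c 0 -n 2,2 -m 1 -g 1

Suricata then attaches to the queues of its own group, zc:0@2 and zc:0@3 in my case.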
Zeek has been running fine on this setup for almost 2 years now, so I’m convinced that the setup in itself is sane and working.
After a reboot of the server, starting Suricata works fine most of the time, as long as I wait long enough for the zbalance_ipc cluster to settle down and the correct ownership and permissions are set on the cluster's hugepage memory files; once that is done Suricata starts without problems.
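The capture section of my Suricata config looks roughly like this (a sketch, with the queue names from my setup):

pfring:
  - interface: zc:0@2
    threads: 1
  - interface: zc:0@3
    threads: 1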
After some time, when I try to restart Suricata, it fails to start again with these errors:

4/11/2020 -- 13:09:10 - - [ERRCODE: SC_ERR_PF_RING_OPEN(34)] - Failed to open zc:0@2: pfring_open error. Check if zc:0@2 exists and pf_ring module is loaded.
4/11/2020 -- 13:09:11 - - [ERRCODE: SC_ERR_PF_RING_OPEN(34)] - Failed to open zc:0@3: pfring_open error. Check if zc:0@3 exists and pf_ring module is loaded.
4/11/2020 -- 13:09:11 - - [ERRCODE: SC_ERR_THREAD_INIT(49)] - thread "W#01-zc:0@2" failed to initialize: flags 0145
4/11/2020 -- 13:09:11 - - [ERRCODE: SC_ERR_FATAL(171)] - Engine initialization failed, aborting...

After this happens, pfcount also returns an error saying it can't read the interfaces anymore, so it looks like some internal structures of the pf_ring zbalance_ipc cluster get corrupted, and the only fix is to restart the whole server.
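(To check this I use something along the lines of the following, with one of the queue names Suricata normally opens:

pfcount -i zc:0@2

and after the failed restart that also errors out.)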

Could someone help me debug this issue? I suspect something like a memory leak in the Suricata process that leaves it in this broken state.

Thanks,
Jan Hugo Prins

I just noticed that Suricata entered this state, in which it can no longer start, after a segfault:

Nov 04 15:50:31 idsprobe02.ids.be.nl suricata[5105]: 4/11/2020 -- 15:50:31 - - all 2 packet processing threads, 4 management threads initialized, engine started.

Nov 04 15:56:44 idsprobe02.ids.be.nl kernel: W#01-zc:0@2[5119]: segfault at 130 ip 000055c4cfc71c08 sp 00007f4a18fa2418 error 4 in suricata[55c4cfa4d000+61c000]

Nov 04 15:56:44 idsprobe02.ids.be.nl systemd[1]: suricata@0.service: main process exited, code=killed, status=11/SEGV
Nov 04 15:56:44 idsprobe02.ids.be.nl systemd[1]: Unit suricata@0.service entered failed state.
Nov 04 15:56:44 idsprobe02.ids.be.nl systemd[1]: suricata@0.service failed.

Jan Hugo

I'm going to create a ticket in the Suricata Redmine.
The problem happens multiple times a day in my IDS cluster, so I should be able to collect a coredump and some additional information.
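To make sure a usable core actually gets written the next time it crashes, I'll put something like this in place first (a sketch; the core path is just an example, the unit name is the one from my setup):

# Allow the suricata@0 service to write core dumps and put them somewhere known.
mkdir -p /etc/systemd/system/suricata@0.service.d
printf '[Service]\nLimitCORE=infinity\n' > /etc/systemd/system/suricata@0.service.d/coredump.conf
sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p
systemctl daemon-reload
systemctl restart suricata@0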

Jan Hugo Prins

If you haven't already, you might want to reach out to the PF_RING community as well; they are the maintainers of our PF_RING support.

Hi,

I'm also in touch with them.
But according to my debugger, the crash happens here:

Core was generated by `/sbin/suricata -c /etc/suricata/cluster0.yaml --pidfile /var/run/suricata/clust’.
Program terminated with signal 11, Segmentation fault.
#0 0x000055efde4e7c08 in StorageGetById (storage=storage@entry=0x128, type=type@entry=STORAGE_FLOW, id=1) at util-storage.c:224
224 return storage[id];
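The numbers line up with the kernel message above: storage is 0x128 and storage[id] with id=1 reads 8 bytes past that, i.e. address 0x130, which matches "segfault at 130". A pointer value as small as 0x128 usually means it was read as a member of an object whose own pointer was NULL. A small illustrative C sketch (hypothetical layout, not Suricata's actual structures):

/* Hypothetical layout, only to show why the fault address is so small. */
#include <stdio.h>
#include <stddef.h>

struct container {
    char other_fields[0x128];
    void *storage_array[2];   /* member at offset 0x128 from the struct start */
};

int main(void) {
    size_t off = offsetof(struct container, storage_array);
    /* With a NULL container pointer, "storage" would end up as 0x128 and
       storage[1] would touch 0x130, the addresses seen in gdb and dmesg. */
    printf("storage: %#zx, storage[1]: %#zx\n", off, off + sizeof(void *));
    return 0;
}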

I have added 2 gdb output files to the ticket.

@ish, do you have any idea how much work it would be to create a patch / workaround?

Jan Hugo Prins

No. As we are not seeing this in general usage, I wonder if PF_RING is part of the issue. It would be good to know if it still fails for you when not using PF_RING, if that is an option.
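For reference, running without PF_RING could mean switching the capture method to af-packet, roughly like this (a sketch; the interface name and cluster-id are placeholders):

af-packet:
  - interface: enp1s0f0
    threads: auto
    cluster-id: 99
    cluster-type: cluster_flow
    defrag: yes

and starting Suricata with the --af-packet option instead of the pfring configuration.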

That is very difficult, because I won't have any substantial traffic to put through it then.
But have you looked at the issue / backtrace and identified why it is happening? To me it looks like some fault state is not handled properly. Besides that, I see an out-of-bounds memory access, which suggests to me that this could also have a security impact.

I have added valgrind output of the crash to the ticket.
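(For reference, the valgrind run was along these lines; the log file path is just an example:

valgrind --tool=memcheck --log-file=/var/tmp/suricata-valgrind.log /sbin/suricata -c /etc/suricata/cluster0.yaml

)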

A bug has been identified and will be fixed in version 6.0.1.