Suricata breaks after running fine for a while

Hello everyone,

I have a problem keeping my Suricata install stable in my IDS environment.
First, let me explain what my setup is:
I have fibertap devices that copy both the outgoing and the incoming traffic on my internet feeds to separate interfaces on my IDS probe servers. On these servers I run PF_RING with the ZC license and use it to create multiple data streams of the traffic (zbalance_ipc clusters) so that both Zeek and Suricata can listen to it.
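For context, the fan-out is started with something along these lines (just a sketch; the interface name, queue counts and core pinning here are examples, not my exact values):

# Split the tap traffic on ZC cluster 0 into two groups of queues,
# e.g. one group for Zeek and one for Suricata.
zbalance_ipc -i zc:enp1s0f0 -c 0 -n 2,2 -m 1 -g 1

Suricata then attaches to the queues of its own group, zc:0@2 and zc:0@3 in my case.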
Zeek has been running fine on this setup for almost 2 years now, so I’m convinced that the setup in itself is sane and working.
After a reboot of the server, starting Suricata works fine most of the time, as long as I wait long enough for the zbalance_ipc cluster to settle down and the correct ownership and permissions are set on the cluster's hugepage memory files; once that is done Suricata starts without problems.
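The capture section of my Suricata config looks roughly like this (a sketch, with the queue names from my setup):

pfring:
  - interface: zc:0@2
    threads: 1
  - interface: zc:0@3
    threads: 1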
After some time, when I try to restart Suricata, it fails to start again with these errors:

4/11/2020 -- 13:09:10 - - [ERRCODE: SC_ERR_PF_RING_OPEN(34)] - Failed to open zc:0@2: pfring_open error. Check if zc:0@2 exists and pf_ring module is loaded.
4/11/2020 -- 13:09:11 - - [ERRCODE: SC_ERR_PF_RING_OPEN(34)] - Failed to open zc:0@3: pfring_open error. Check if zc:0@3 exists and pf_ring module is loaded.
4/11/2020 -- 13:09:11 - - [ERRCODE: SC_ERR_THREAD_INIT(49)] - thread "W#01-zc:0@2" failed to initialize: flags 0145
4/11/2020 -- 13:09:11 - - [ERRCODE: SC_ERR_FATAL(171)] - Engine initialization failed, aborting...

After this happens, pfcount also returns an error saying it can't read the interfaces anymore, so it looks like some internal structures of the pf_ring zbalance_ipc cluster get corrupted, and the only fix is to restart the whole server.
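(To check this I use something along the lines of the following, with one of the queue names Suricata normally opens:

pfcount -i zc:0@2

and after the failed restart that also errors out.)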

Could someone help me debug this issue? I suspect something like a memory leak in the Suricata process that leaves it in this broken state.

Thanks,
Jan Hugo Prins

I just noticed that Suricata entered this state, in which it can no longer start, after a segfault:

Nov 04 15:50:31 idsprobe02.ids.be.nl suricata[5105]: 4/11/2020 -- 15:50:31 - - all 2 packet processing threads, 4 management threads initialized, engine started.

Nov 04 15:56:44 idsprobe02.ids.be.nl kernel: W#01-zc:0@2[5119]: segfault at 130 ip 000055c4cfc71c08 sp 00007f4a18fa2418 error 4 in suricata[55c4cfa4d000+61c000]

Nov 04 15:56:44 idsprobe02.ids.be.nl systemd[1]: suricata@0.service: main process exited, code=killed, status=11/SEGV
Nov 04 15:56:44 idsprobe02.ids.be.nl systemd[1]: Unit suricata@0.service entered failed state.
Nov 04 15:56:44 idsprobe02.ids.be.nl systemd[1]: suricata@0.service failed.

Jan Hugo

I'm going to create a ticket in the Suricata Redmine.
The problem happens multiple times a day in my IDS cluster, so I should be able to collect a coredump and some additional information.
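To make sure a usable core actually gets written the next time it crashes, I'll put something like this in place first (a sketch; the core path is just an example, the unit name is the one from my setup):

# Allow the suricata@0 service to write core dumps and put them somewhere known.
mkdir -p /etc/systemd/system/suricata@0.service.d
printf '[Service]\nLimitCORE=infinity\n' > /etc/systemd/system/suricata@0.service.d/coredump.conf
sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p
systemctl daemon-reload
systemctl restart suricata@0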

Jan Hugo Prins

If you haven't already, you might want to reach out to the PF_RING community as well; they are the maintainers of our PF_RING support.

Hi,

I'm also in touch with them.
But according to my debugger, the crash happens here:

Core was generated by `/sbin/suricata -c /etc/suricata/cluster0.yaml --pidfile /var/run/suricata/clust’.
Program terminated with signal 11, Segmentation fault.
#0 0x000055efde4e7c08 in StorageGetById (storage=storage@entry=0x128, type=type@entry=STORAGE_FLOW, id=1) at util-storage.c:224
224 return storage[id];
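The numbers line up with the kernel message above: storage is 0x128 and storage[id] with id=1 reads 8 bytes past that, i.e. address 0x130, which matches "segfault at 130". A pointer value as small as 0x128 usually means it was read as a member of an object whose own pointer was NULL. A small illustrative C sketch (hypothetical layout, not Suricata's actual structures):

/* Hypothetical layout, only to show why the fault address is so small. */
#include <stdio.h>
#include <stddef.h>

struct container {
    char other_fields[0x128];
    void *storage_array[2];   /* member at offset 0x128 from the struct start */
};

int main(void) {
    size_t off = offsetof(struct container, storage_array);
    /* With a NULL container pointer, "storage" would end up as 0x128 and
       storage[1] would touch 0x130, the addresses seen in gdb and dmesg. */
    printf("storage: %#zx, storage[1]: %#zx\n", off, off + sizeof(void *));
    return 0;
}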

I have added 2 gdb output files to the ticket.

@ish, do you have any idea how much work it would be to create a patch / workaround?

Jan Hugo Prins

No. As we are not seeing this in general usage, I wonder if PF_RING is part of the issue. It would be good to know if it still fails for you when not using PF_RING, if that is an option.
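For reference, running without PF_RING could mean switching the capture method to af-packet, roughly like this (a sketch; the interface name and cluster-id are placeholders):

af-packet:
  - interface: enp1s0f0
    threads: auto
    cluster-id: 99
    cluster-type: cluster_flow
    defrag: yes

and starting Suricata with the --af-packet option instead of the pfring configuration.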

That is very difficult, because I won't have any substantial traffic to put through it then.
But have you looked at the issue / backtrace and identified why it is happening? To me it looks like some fault state is not handled properly. Besides that, I see an out-of-bounds memory access, which suggests to me that this could also have a security impact.

I have added valgrind output of the crash to the ticket.
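(For reference, the valgrind run was along these lines; the log file path is just an example:

valgrind --tool=memcheck --log-file=/var/tmp/suricata-valgrind.log /sbin/suricata -c /etc/suricata/cluster0.yaml

)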

A bug has been identified and will be fixed in version 6.0.1.