Suricata breaks after running fine for a while

Hello everyone,

I have a problem getting my Suricata installation to run stably in my IDS environment.
First, let me explain what my setup is:
I have fiber tap devices that copy both the outgoing and the incoming traffic on my internet feeds to separate interfaces on my IDS probe servers. On these servers I run PF_RING with the ZC license and use it to create multiple data streams of the traffic, so that both Zeek and Suricata can listen to it (zbalance_ipc clusters).
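For reference, each cluster is started with a command along these lines (interface names, queue counts and core binding are illustrative, not my exact production values):

# illustrative zbalance_ipc invocation, not my exact command
# -i ingress devices, -c cluster id, -n queues per application, -m hash mode, -g core binding
zbalance_ipc -i zc:eth4,zc:eth5 -c 0 -n 2,2 -m 1 -g 1

With -c 0 the consumers attach to zc:0@<queue>, and with -n 2,2 the first two queues (zc:0@0, zc:0@1) go to one application (Zeek) and the next two (zc:0@2, zc:0@3) to the other (Suricata), which matches the zc:0@2 / zc:0@3 names in the errors below.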
Zeek has been running fine on this setup for almost 2 years now, so I’m convinced that the setup in itself is sane and working.
After rebooting the server, starting Suricata works fine most of the time, as long as I wait long enough for the zbalance_ipc cluster to settle down. Once the correct ownership and rights are set on the hugepage files of the zbalance_ipc cluster, I am able to start Suricata.
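To be concrete, "setting the rights" means doing something like the following before starting Suricata (the paths and user name are illustrative; the actual hugepage file names depend on the cluster setup):

# illustrative only: give the suricata user access to the hugepage-backed
# files created by zbalance_ipc (exact paths vary per setup)
chown -R suricata:suricata /dev/hugepages
chmod -R ug+rw /dev/hugepages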
After some time, however, when I try to restart Suricata, it fails to start with the following errors:

4/11/2020 -- 13:09:10 - <Error> - [ERRCODE: SC_ERR_PF_RING_OPEN(34)] - Failed to open zc:0@2: pfring_open error. Check if zc:0@2 exists and pf_ring module is loaded.
4/11/2020 -- 13:09:11 - <Error> - [ERRCODE: SC_ERR_PF_RING_OPEN(34)] - Failed to open zc:0@3: pfring_open error. Check if zc:0@3 exists and pf_ring module is loaded.
4/11/2020 -- 13:09:11 - <Error> - [ERRCODE: SC_ERR_THREAD_INIT(49)] - thread "W#01-zc:0@2" failed to initialize: flags 0145
4/11/2020 -- 13:09:11 - <Error> - [ERRCODE: SC_ERR_FATAL(171)] - Engine initialization failed, aborting...

After this happens, pfcount also returns an error saying it can no longer read the interfaces, so it looks like some internal structures of the pf_ring zbalance_ipc cluster are corrupted, and the only fix is to restart the whole server.
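For reference, the pfring section of cluster0.yaml declares these queues roughly as follows (thread counts are illustrative and the rest of the file is omitted):

pfring:
  - interface: zc:0@2
    threads: 1
  - interface: zc:0@3
    threads: 1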

Could someone help me debug this issue? I suspect something like a memory leak in the Suricata process leads to this broken state.

Thanks,
Jan Hugo Prins

I just noticed that Suricata entered this state, in which it is no longer able to start, after a segfault:

Nov 04 15:50:31 idsprobe02.ids.be.nl suricata[5105]: 4/11/2020 -- 15:50:31 - <Notice> - all 2 packet processing threads, 4 management threads initialized, engine started.

Nov 04 15:56:44 idsprobe02.ids.be.nl kernel: W#01-zc:0@2[5119]: segfault at 130 ip 000055c4cfc71c08 sp 00007f4a18fa2418 error 4 in suricata[55c4cfa4d000+61c000]

Nov 04 15:56:44 idsprobe02.ids.be.nl systemd[1]: suricata@0.service: main process exited, code=killed, status=11/SEGV
Nov 04 15:56:44 idsprobe02.ids.be.nl systemd[1]: Unit suricata@0.service entered failed state.
Nov 04 15:56:44 idsprobe02.ids.be.nl systemd[1]: suricata@0.service failed.

Jan Hugo

I’m going to create a ticket in the Suricata Redmine.
The problem happens multiple times a day in my IDS cluster, so I should be able to capture a core dump and some additional information.
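For anyone following along, I plan to capture the core roughly like this (assuming systemd-coredump is available; otherwise ulimit -c / kernel.core_pattern would do the same job):

# allow the service from the journal output above to dump core
systemctl edit suricata@0.service     # add: [Service] + LimitCORE=infinity
systemctl restart suricata@0.service
# after the next crash, pull a backtrace from the stored core
coredumpctl gdb suricata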

Jan Hugo Prins

If you haven’t already, you might want to reach out to the PF_RING community as well; they are the maintainers of our PF_RING support.

Hi,

I’m also in touch with them.
But according to my debugger the error is in:

Core was generated by `/sbin/suricata -c /etc/suricata/cluster0.yaml --pidfile /var/run/suricata/clust'.
Program terminated with signal 11, Segmentation fault.
#0 0x000055efde4e7c08 in StorageGetById (storage=storage@entry=0x128, type=type@entry=STORAGE_FLOW, id=1) at util-storage.c:224
224 return storage[id];
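If I read the backtrace correctly, the crash is an index into a storage array whose base pointer is bogus (0x128 instead of a valid address). As a sketch of the failure mode only, with simplified stand-in types rather than Suricata's real definitions:

/* Sketch of the failure mode, not Suricata's actual code or fix.
 * 'Storage' is simplified to void*; in the crash the 'storage' argument
 * arrives as 0x128, so storage[id] reads unmapped memory and faults. */
#include <stddef.h>

typedef void *Storage;

void *StorageGetByIdSketch(const Storage *storage, int id)
{
    if (storage == NULL)    /* a NULL check would not help here:    */
        return NULL;        /* 0x128 is non-NULL but still invalid, */
    return storage[id];     /* so the real bug is the caller passing
                               a freed or uninitialized pointer */
}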

I have added 2 gdb output files to the ticket.

@ish, do you have any idea how much work it would be to create a patch / workaround?

Jan Hugo Prins

No. As we are not seeing this in general usage, I wonder if PF_RING is part of the issue. It would be good to know if it still fails for you when not using PF_RING, if that is an option.

That would be very difficult, because I wouldn’t have any substantial traffic to run through it then.
But have you looked at the issue / backtrace and identified why it is happening? To me it looks like some fault state is not being handled properly. Besides that, I see an out-of-bounds memory access, which suggests to me that this could also have a security impact.

I have added valgrind output of the crash to the ticket.

A bug has been identified and will be fixed in version 6.0.1.

I have the same problem.

suricata --pfring-int=eth5 --pfring-int=eth4 --pfring-cluster-id=99 --pfring-cluster-type=cluster_flow -c /etc/suricata/suricata.yaml --runmode=workers -D

Suricata version: 6.0.2

I am running 6.0.1 on Red Hat 7.9. The NIC is a 10G Intel card, and we consistently see 50-70% utilization.

I am seeing a similar SEGV fault on a couple of my sensors. This just started over the last few days; the sensors have been running on 6.0.1 for nearly a month. Since upgrading I have noticed spikes in capture_kernel_packet drops, typically short in duration, after which the drop rate falls. I also see a consistent 5-10% tcp.segment_memcap_drop, which I believe is due to the stream memcap settings.
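For reference, the stream section I’m tuning looks something like this (the values are examples, not a recommendation):

stream:
  memcap: 64mb
  reassembly:
    memcap: 256mb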

Previously we were on 4.18.

suricata.service: main process exited, code=killed, status=11/SEGV

suricata -c /etc/suricata/suricata.yaml --pidfile /var/run/suricata.pid -i eth1 --user suricata

I am not using pfring. Sensors are running with af-packet.
af-packet:
  - interface: eth1
    threads: auto
    cluster-id: 99
    cluster-type: cluster_flow
    defrag: yes
    use-mmap: yes
    tpacket-v3: yes
    ring-size: 100000
    block-size: 1048576
    use-emergency-flush: yes

If you see those spikes, could you post the output of perf top -p $(pidof suricata)?

I will see what I can do. I am planning to update the sensors that are exhibiting the SEGV fault to 6.0.2, and I am hoping the various fixes in 6.0.2 will address this.

Can you post or DM the backtraces when the SEGV occurs?

The best way would be to create an issue with the information at https://redmine.openinfosecfoundation.org/