Asymmetric Traffic within Suricata

rikkitikkitavi · January 29, 2021, 7:22am

Hello.

I’m trying to get Suricata working properly, but it writes a flood of erroneous alerts to fast.log that suggest asymmetric traffic flows. Some simple sigs work, but the overwhelming majority of alerts point to a problem with flow assembly. Here are the top 10 sigs firing over 2 minutes, sorted by count:

154942 [1:2210010:2] SURICATA STREAM 3way handshake wrong seq wrong ack
113010 [1:2210020:2] SURICATA STREAM ESTABLISHED packet out of window
68463 [1:2210045:2] SURICATA STREAM Packet with invalid ack
66827 [1:2210029:2] SURICATA STREAM ESTABLISHED invalid ack
21143 [1:2010935:3] ET SCAN Suspicious inbound to MSSQL port 1433
20006 [1:2101411:13] GPL SNMP public access udp
17376 [1:2200040:2] SURICATA UDP invalid header length
11412 [1:2210056:1] SURICATA STREAM bad window update
7782 [1:2210027:2] SURICATA STREAM ESTABLISHED SYN resend with different seq
4492 [1:2260002:1] SURICATA Applayer Detect protocol only one direction
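
A minimal sketch of how a top-N list like the one above can be produced from fast.log. It assumes the usual fast.log layout, where the signature appears as "[gid:sid:rev] message" between two "[**]" markers; the function name `top_signatures` is made up for this example.

```python
import re
from collections import Counter

# Matches the "[gid:sid:rev] message" part between the two "[**]" markers
# (assumed fast.log layout).
ALERT_RE = re.compile(r"\[\*\*\]\s+\[(\d+:\d+:\d+)\]\s+(.*?)\s+\[\*\*\]")


def top_signatures(lines, n=10):
    """Return the n most frequent (signature, count) pairs, highest count first."""
    counts = Counter()
    for line in lines:
        match = ALERT_RE.search(line)
        if match:
            counts[f"[{match.group(1)}] {match.group(2)}"] += 1
    return counts.most_common(n)
```

To use it, pass an open file handle for fast.log (any iterable of lines works) and print each returned pair.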

Troubleshooting suggests the problem is specific to Suricata. The upstream tap and the packet broker (pf_ring) have been verified with tcpdump, and symmetric flows arrive as expected. Only Suricata fails to work properly.

I am using PF_RING (ZC) with an Intel x520 NIC on an AMD Opteron system with 48 cores across four NUMA nodes. The OS is Ubuntu. Years ago this system worked well, with little packet loss while capturing about 1M pps at peak, but it has since been reinstalled.

Since then Suricata has moved from 5.0.1 to 5.0.5 and PF_RING from 7.6 to 7.8. I conclude the problem is neither the tap traffic (checked with tcpdump on the raw interface) nor the packet broker (checked on the pf_ring virtual interfaces): both tests show symmetric flows being delivered as expected. This is why I suspect Suricata.

The NIC and PF_RING are loaded according to PF_RING standards - a single RSS queue, moderate buffer size, hugepages, ethtool to disable all offloading, set_irq_balance script, etc. zbalance fans out to 46 virtual NICs. Attaching tcpdump to these zbalance queues with a filter for a specific IP shows both directions of a flow arriving on the same queue. The next component is Suricata, which reads from those same queues but apparently does not process them correctly.
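
The per-queue symmetry check described above can also be done over many flows at once instead of one IP at a time. A rough sketch, assuming the 5-tuples have already been extracted from a capture on each queue; the names `flow_key` and `split_flows` and the queue labels are hypothetical.

```python
from collections import defaultdict


def flow_key(src, sport, dst, dport, proto):
    """Direction-independent key: both directions of a flow give the same tuple."""
    a, b = (src, sport), (dst, dport)
    return (proto,) + (a + b if a <= b else b + a)


def split_flows(packets_by_queue):
    """Return the flows whose packets were seen on more than one queue.

    packets_by_queue maps a queue label to an iterable of
    (src, sport, dst, dport, proto) tuples taken from that queue's capture.
    """
    queues_for_flow = defaultdict(set)
    for queue, packets in packets_by_queue.items():
        for pkt in packets:
            queues_for_flow[flow_key(*pkt)].add(queue)
    return {key: sorted(qs) for key, qs in queues_for_flow.items() if len(qs) > 1}
```

An empty result means every observed flow stayed on a single queue. It says nothing about flows that were not captured.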

I’ve tried:

toggling workers vs autofp and adjusting related knobs
toggling vlan tracking (there are vlans but they work properly)
adjusting cpu affinity settings (local default was/is 46 cpus on mgmt, recv, worker)
toggling stream checksum checks
toggling interface checksum checks (via suricata pf_ring config)

stats.log shows about 5% packet loss from kernel drops. That is a minor concern at this point and not large enough to explain the problem. I didn’t notice anything else egregious in stats.log.
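
For reference, a drop percentage like this can be computed from stats.log with a few lines. This is only a sketch: it assumes the pipe-delimited "counter | thread | value" layout with a "Total" row, and the counter names capture.kernel_packets and capture.kernel_drops. Whether the packet counter already includes dropped packets depends on the capture method, so treat the ratio as approximate.

```python
def kernel_drop_rate(stats_lines):
    """Return kernel_drops / kernel_packets from the latest "Total" rows, or None."""
    latest = {}
    for line in stats_lines:
        parts = [p.strip() for p in line.split("|")]
        # Keep only "counter | Total | value" rows; later rows overwrite earlier ones.
        if len(parts) == 3 and parts[1] == "Total" and parts[2].isdigit():
            latest[parts[0]] = int(parts[2])
    packets = latest.get("capture.kernel_packets", 0)
    drops = latest.get("capture.kernel_drops", 0)
    return drops / packets if packets else None
```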

After reading all of that, what do you think I should review?

Thanks!

vjulien · January 29, 2021, 6:21pm

Are you able to capture a pcap on one of those zbalance queues and manually inspect that? Of particular interest would be a flow that triggers such alerts in Suricata.

rikkitikkitavi · February 3, 2021, 5:47pm

Good idea. I configured Suricata to log pcap and alert debug logs. (Very useful features). Then I reconfigured suricata.yaml to only listen on two of the 46 queues. (That limits the deluge of information.) The end result is essentially what you suggested, a sampled dataset of alerts and the corresponding pcap.

I then used tshark to review the pcap and compare it to the alerts, although that’s still under review. There are some messages about “ACKed unseen segment, previous segment not captured”. I’ll have to compare with a pre-suricata capture of the traffic to really know if Suricata’s packet processing pipeline is dropping traffic. If not then it may involve dropped packets much farther upstream.
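
To make the “ACKed unseen segment” idea concrete, here is a simplified sketch of the check. It is not tshark's actual implementation: it assumes relative sequence numbers, ignores wraparound and the sequence space consumed by SYN/FIN, and the tuple layout and the name `acked_unseen` are invented for this example.

```python
def acked_unseen(packets):
    """Return indexes of packets whose ACK covers data never seen from the peer.

    packets: iterable of (direction, seq, length, ack) with direction
    "c2s" or "s2c" and relative sequence numbers.
    """
    highest_end = {}  # direction -> highest (seq + length) seen so far
    flagged = []
    for i, (direction, seq, length, ack) in enumerate(packets):
        peer = "s2c" if direction == "c2s" else "c2s"
        # An ACK beyond everything captured from the peer means a segment is missing.
        if peer in highest_end and ack > highest_end[peer]:
            flagged.append(i)
        highest_end[direction] = max(highest_end.get(direction, 0), seq + length)
    return flagged
```

If the same flow is flagged in the Suricata pcap but not in a capture taken upstream, the loss happened in between.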

At this point I’m reading the source code and documentation to better understand how the packet processing pipeline and stream engine behave in my setup. If a system or Suricata configuration problem is the cause, I’d expect to find an indication in stats.log (maybe undersized buffers… or the CPU worker config).

Fwiw I’m seeing the ioctl errors described here [1] but I don’t think they are related. I reviewed runmode-pfring.c and it suggests the errors are generated because the pfring virtual interfaces cannot report or set ethtool interface configuration via ioctls. To be sure, I compared all of the interface settings once more and found they match what runmode-pfring.c would expect.

[1] Failure when trying to set feature via ioctl - #3 by zlvinas

rikkitikkitavi · February 4, 2021, 1:51am

I think the answer is lol. From what I can tell, I had disabled the stream and decoder events on the previous install, so these alerts never showed up back then.

Who keeps stream/decoder events enabled for legitimate research? Maybe they should be disabled by default. Some limited reading suggests that many of the stream and decoder events are triggered by poor or buggy TCP/IP stacks on the endpoints.
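
To put a number on how much of the noise came from these events, here is a small sketch over the top-10 counts from the first post. It treats the "SURICATA " message prefix as the marker for engine-generated events, which is an assumption based on that list and also catches the one app-layer event; the function name is made up.

```python
# Counts and messages copied from the top-10 list in the first post.
TOP10 = [
    (154942, "SURICATA STREAM 3way handshake wrong seq wrong ack"),
    (113010, "SURICATA STREAM ESTABLISHED packet out of window"),
    (68463, "SURICATA STREAM Packet with invalid ack"),
    (66827, "SURICATA STREAM ESTABLISHED invalid ack"),
    (21143, "ET SCAN Suspicious inbound to MSSQL port 1433"),
    (20006, "GPL SNMP public access udp"),
    (17376, "SURICATA UDP invalid header length"),
    (11412, "SURICATA STREAM bad window update"),
    (7782, "SURICATA STREAM ESTABLISHED SYN resend with different seq"),
    (4492, "SURICATA Applayer Detect protocol only one direction"),
]


def engine_event_share(rows, prefix="SURICATA "):
    """Return (engine_event_count, total_count) for (count, message) rows."""
    total = sum(count for count, _ in rows)
    engine = sum(count for count, msg in rows if msg.startswith(prefix))
    return engine, total


engine, total = engine_event_share(TOP10)
print(f"{engine} of {total} alerts ({engine / total:.1%}) are engine events")
```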

Thanks again for the help, awesome software, and continued support.