I’m trying to get Suricata working properly, but it floods fast.log with stream-engine alerts that suggest it is seeing asymmetric traffic. Some simple signatures fire correctly, but the overwhelming majority point to a problem with flow/stream reassembly. Here are the top 10 signatures firing over a 2-minute window, sorted by count:
154942 [1:2210010:2] SURICATA STREAM 3way handshake wrong seq wrong ack
113010 [1:2210020:2] SURICATA STREAM ESTABLISHED packet out of window
68463 [1:2210045:2] SURICATA STREAM Packet with invalid ack
66827 [1:2210029:2] SURICATA STREAM ESTABLISHED invalid ack
21143 [1:2010935:3] ET SCAN Suspicious inbound to MSSQL port 1433
20006 [1:2101411:13] GPL SNMP public access udp
17376 [1:2200040:2] SURICATA UDP invalid header length
11412 [1:2210056:1] SURICATA STREAM bad window update
7782 [1:2210027:2] SURICATA STREAM ESTABLISHED SYN resend with different seq
4492 [1:2260002:1] SURICATA Applayer Detect protocol only one direction
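That list was produced with a quick one-liner along these lines (the log path and the exact fast.log layout are assumptions; adjust to taste):

```shell
# Count alerts per signature: pull the "[gid:sid:rev] message" part of
# each fast.log line, then rank signatures by frequency.
grep -oE '\[1:[0-9]+:[0-9]+\][^[]*' /var/log/suricata/fast.log \
  | sed 's/[[:space:]]*$//' \
  | sort | uniq -c | sort -rn | head -10
```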
Troubleshooting suggests the problem is specific to Suricata. The upstream tap and the packet broker (PF_RING) have both been verified with tcpdump: symmetric flows are delivered as expected. Suricata alone misbehaves.
I am using PF_RING (ZC) with an Intel X520 NIC on an AMD Opteron system, 48 cores across four NUMA nodes, running Ubuntu. Years ago this system worked well, with little packet loss while capturing about 1M pps at peak, but it has since been reinstalled.
Since then, Suricata has moved from 5.0.1 to 5.0.5 and PF_RING from 7.6 to 7.8. I conclude there is no problem with the tap traffic (verified with tcpdump on the raw interface), nor with the packet broker (tcpdump on the PF_RING virtual interfaces): both tests show symmetric flows being delivered as expected. That is why I suspect Suricata itself.
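For context, the capture chain was brought up and verified with commands roughly like the following. This is a sketch of my environment, not a recipe: the device name, cluster id, core bindings, and queue index are placeholders, and exact zbalance_ipc flags may differ across PF_RING versions.

```shell
# Disable NIC offloads and force a single RSS queue (eth1 is a placeholder).
ethtool -L eth1 combined 1
ethtool -K eth1 rx off tx off sg off tso off gso off gro off lro off

# Fan the interface out to 46 ZC queues using an IP-hash (flow-symmetric)
# distribution; cluster id 99 is an arbitrary example.
zbalance_ipc -i zc:eth1 -c 99 -n 46 -m 1 -g 0

# Verify both directions of one host's flows land on the same queue.
tcpdump -i zc:99@5 -nn host 192.0.2.10
```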
The NIC and PF_RING are configured per PF_RING recommendations: a single RSS queue, moderate buffer sizes, hugepages, ethtool to disable all offloading, the set_irq_balance script, etc. zbalance fans out to 46 virtual NICs. Attaching tcpdump to those zbalance queues with a filter for a specific IP shows both directions of a flow arriving on the same queue, so symmetric delivery holds. The next component in the chain is Suricata, reading those same queues, and apparently reading them incorrectly.
Things I have already tried, none of which changed the behavior:
- Toggling workers vs. autofp runmodes and adjusting related knobs
- Toggling VLAN tracking (there are VLANs, but they are handled properly)
- Adjusting CPU affinity settings (local default was/is 46 CPUs on mgmt, recv, worker)
- Toggling stream checksum checks
- Toggling interface checksum checks (via the Suricata pf_ring config)
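For reference, the knobs from that list live roughly here in suricata.yaml; the interface and cluster-id values are placeholders from my setup:

```yaml
pfring:
  - interface: zc:99@0        # one entry per zbalance queue
    threads: 1
    cluster-id: 99
    cluster-type: cluster_flow
    checksum-checks: no       # interface-level checksum toggle

stream:
  checksum-validation: no     # stream-engine checksum toggle

vlan:
  use-for-tracking: true      # include VLAN ids in the flow key
```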
stats.log shows about 5% packet loss via kernel drops. That is a concern at this point, but not significant enough on its own to explain alert volumes like these. I didn't notice anything else egregious in stats.log.
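The 5% figure comes from the capture counters in stats.log. A small sketch of how I compute it (counter names match Suricata's stats.log layout; the sample values below are made up):

```python
import re

def kernel_drop_pct(stats_text: str) -> float:
    """Compute the kernel drop percentage from Suricata stats.log counters."""
    def last_value(counter: str) -> int:
        # stats.log lines look like:
        #   capture.kernel_packets | Total | 200000
        values = re.findall(
            rf"^{re.escape(counter)}\s*\|.*\|\s*(\d+)\s*$",
            stats_text, flags=re.MULTILINE)
        return int(values[-1])  # counters are cumulative; take the latest dump

    packets = last_value("capture.kernel_packets")
    drops = last_value("capture.kernel_drops")
    return 100.0 * drops / packets

sample = """\
capture.kernel_packets                   | Total                     | 200000
capture.kernel_drops                     | Total                     | 10000
"""
print(kernel_drop_pct(sample))  # 5.0
```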
After reading all of that, what do you think I should review?