High overhead from rs_dns_state_get_tx causing packet loss

After first starting Suricata up, everything runs fine for a few hours but eventually I get (seemingly) unrecoverable packet loss.

Monitoring the processes with perf top, ‘rs_dns_state_get_tx’ and ‘AppLayerDefaultGetTxIterator’ slowly creep up in overhead% and eventually overtake ‘DetectRun.part.16’. Once this happens, I start getting packet loss.

This was after 15hrs:
Samples: 2M of event ‘cycles’, 4000 Hz, Event count (approx.): 1171695080430 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
46.41% suricata [.] rs_dns_state_get_tx
28.55% suricata [.] AppLayerDefaultGetTxIterator
9.87% suricata [.] FlowGetProtoMapping
4.15% suricata [.] DetectRun.part.16
1.01% suricata [.] DetectEnginePktInspectionRun
0.91% suricata [.] DetectEngineInspectRulePacketMatches
0.73% suricata [.] rs_sip_state_get_tx
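
Since the creep takes hours, it can help to log the combined share of the two lookup symbols over time rather than watching perf top by hand. A minimal sketch, assuming the rows have been captured as text in the format shown above (the helper names `parse_perf_rows` and `share_of` are hypothetical, and the pattern will not handle symbols that contain spaces):

```python
import re

# Matches rows such as "46.41% suricata [.] rs_dns_state_get_tx".
ROW = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+\[(.)\]\s+(\S+)\s*$")

def parse_perf_rows(text):
    """Return {symbol: overhead_percent} for every line that looks like a perf row."""
    overhead = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            overhead[m.group(4)] = float(m.group(1))
    return overhead

def share_of(overhead, symbols):
    """Sum the overhead of the given symbols, ignoring any that are absent."""
    return sum(overhead.get(s, 0.0) for s in symbols)

# Rows copied from the 15-hour sample above.
SAMPLE = """\
46.41% suricata [.] rs_dns_state_get_tx
28.55% suricata [.] AppLayerDefaultGetTxIterator
9.87% suricata [.] FlowGetProtoMapping
4.15% suricata [.] DetectRun.part.16
"""

rows = parse_perf_rows(SAMPLE)
lookup_share = share_of(rows, ["rs_dns_state_get_tx", "AppLayerDefaultGetTxIterator"])
print(f"tx lookup share: {lookup_share:.2f}%")
```

For the sample above, the two transaction lookup symbols together account for roughly three quarters of the sampled cycles.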

I have attached the stats output from the same time as a text file: statsout.log (6.6 KB)

OS: CentOS8 stream
CPU: 2x Xeon E5-2699 v4 (88 HT cores)
RAM: 128GB
NIC: Napatech NE40E3-4
Data rate: 10 Gbps sustained (pushing bigFlows.pcap from tcpreplay.appneta.com through a packet broker to the Napatech)

It’s slow to diagnose as it takes hours for the function to creep up to the top of the list in perf.

What version of Suricata are you using?

Whoops, that seems like a pretty important detail.

Suricata 6.0.1, compiled from source. Build info attached. buildinfo.log (6.7 KB)

I disabled the DNS parsers and have been running for 17hrs with average 0.2% packet loss, versus the average 7.2% packet loss over my last 15hr run with DNS parsers enabled. When the packet loss starts with DNS parsers, I see about 30% packet loss and I cannot recover until I kill the feed or restart Suricata.

I see that the master branch has some recent changes to src/app-layer-parser.c, including some transaction cleanup. I may try to merge those changes into the 6.0.1 build locally and see how it goes.

I can see that on some deployments as well, we will keep an eye on that. If you could test it with Suricata 5.0.5 that would help us to narrow it down to changes from 5 to 6.

After an 18hr run with 5.0.5, I have 0% packet loss. Same host, same settings, still 10gbps sustained.

We have some fixes in the just released 6.0.2 that might help. Are you able to try it out? (See Suricata 6.0.2 and 5.0.6 released)

I am still seeing the same issue with 6.0.2. Averaging 8.1% packet loss after 14 hours at 10gbps.

I’ll attach the main perf top and annotations for ‘rs_dns_state_get_tx’ and ‘AppLayerDefaultGetTxIterator’.

Are you able to provide a perf top screenshot after running it with the -g option? I’d like to see if we can find out which path leads to these calls.

Are you willing to try a patch or two? I’ve somewhat replicated this by crafting a misbehaving DNS client, but I’ve seen similar behavior in the real world.

Yes, I can try some local patches.

Here’s the updated output from perf:

I’ve found a few cases with TCP DNS where this can happen, in particular where there are DNS TCP streams that are long lived and messages may be lost, or the client floods the server (which I have seen on the real internet).
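
To illustrate why a backlog of transactions on one long-lived stream would show up this way in perf, here is a toy model in Python. It is not Suricata's code: it only assumes, as the perf output suggests, that the default iterator probes each transaction ID in turn and that each lookup is a linear scan over the pending transactions. The function names are made up for the illustration.

```python
def get_tx(transactions, tx_id):
    """Linear scan for a transaction by ID; returns (tx, comparisons made)."""
    comparisons = 0
    for tx in transactions:
        comparisons += 1
        if tx["id"] == tx_id:
            return tx, comparisons
    return None, comparisons

def iterate_all(transactions, min_id, max_id):
    """Probe every ID in [min_id, max_id] via get_tx; return total comparisons."""
    total = 0
    for tx_id in range(min_id, max_id + 1):
        _, comparisons = get_tx(transactions, tx_id)
        total += comparisons
    return total

def backlog(n):
    """Build n pending transactions with IDs 0..n-1 (assumed never cleaned up)."""
    return [{"id": i} for i in range(n)]

# Cost of one full pass over the backlog grows as n * (n + 1) / 2.
for n in (10, 100, 1000):
    print(n, iterate_all(backlog(n), 0, n - 1))
```

Under these assumptions, a 100x larger backlog costs roughly 10,000x more comparisons per pass, which would be consistent with overhead that creeps up slowly and then dominates once lost or unanswered messages leave transactions that are never completed.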

This patch should help with the issue, but we’re looking at better ways as well. Please let me know. If you know there is no TCP DNS in your traffic, this is unlikely to help.

https://github.com/jasonish/suricata/commit/ddb78e60de5a35f09548b6d93e55a57accfb4e05.patch

Thanks.