Measuring kernel capture drop rate


Looking for some input/clarification.

I recently inherited a large installation of Suricata. I just updated to 6.0.1. I am curious about the best way to measure the kernel capture drop rate.

Suricata outputs a stats.log, and we also output stats.json, which is fed into Splunk. The sample rate is every 30 seconds. The previous admin had set up a Splunk dashboard that calculates the drop rate between the 30-second samples to monitor the platform:

previous_sample (taken at t-30sec)

current_sample (taken at t, now)

delta_packets = current_sample.capture.kernel_packets - previous_sample.capture.kernel_packets
delta_drops = current_sample.capture.kernel_drops - previous_sample.capture.kernel_drops

drop_rate = delta_drops / delta_packets
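In Python terms, the dashboard's calculation looks roughly like this (field names follow the stats output the post describes; the sample values are made up for illustration):

```python
# Minimal sketch: delta-based drop rate between two 30-second stats samples.
# Field names mirror Suricata's capture.kernel_packets / capture.kernel_drops
# counters; the sample dicts below are illustrative values only.

def delta_drop_rate(previous, current):
    """Drop rate over one interval from two cumulative counter samples."""
    delta_packets = current["kernel_packets"] - previous["kernel_packets"]
    delta_drops = current["kernel_drops"] - previous["kernel_drops"]
    if delta_packets <= 0:  # counter reset or empty interval
        return 0.0
    return delta_drops / delta_packets

previous_sample = {"kernel_packets": 10_000_000, "kernel_drops": 150_000}
current_sample = {"kernel_packets": 10_600_000, "kernel_drops": 162_000}

print(delta_drop_rate(previous_sample, current_sample))  # 0.02 (2%)
```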

Is calculating the delta_drop rate this way a valid measure?

If I monitor the stats.log file and calculate the drop % strictly from the values in each 30-second measure (capture.kernel_drops / capture.kernel_packets), I consistently see 1-2%.

However, if I calculate the delta drop % between stats.log measures, I notice spikes in the drop rate that can be quite significant (10-40%). Sifting through stats.log, I only see the spikes when I calculate the delta between measures. If I calculate strictly from the capture.kernel_packets and capture.kernel_drops values within each individual measure, the drop % stays at 1-2%.
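To make the difference concrete: the stats counters are cumulative since Suricata started, so dividing the totals gives a lifetime average that a single bad interval barely moves, while the delta isolates that one interval. A sketch with made-up numbers:

```python
# Sketch: why the cumulative ratio hides short spikes. The counters are
# cumulative since Suricata start; the numbers here are invented for
# illustration, not taken from the poster's sensors.
samples = [
    {"kernel_packets": 100_000_000, "kernel_drops": 1_500_000},
    {"kernel_packets": 100_500_000, "kernel_drops": 1_700_000},  # spike interval
]

prev, curr = samples
cumulative = curr["kernel_drops"] / curr["kernel_packets"]
delta = (curr["kernel_drops"] - prev["kernel_drops"]) / (
    curr["kernel_packets"] - prev["kernel_packets"]
)

print(f"cumulative: {cumulative:.1%}")  # cumulative: 1.7% (lifetime average)
print(f"delta:      {delta:.1%}")       # delta:      40.0% (this interval only)
```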

IMHO that (the delta calculation) is the correct approach. Do you see spikes there as well? I would correlate that with other values, like load and network throughput, that might explain the spikes.

I do see spikes in the drop rate when calculating the delta. What is strange to me is that the drop rate spikes every 10 minutes and then drops back to 1-2%. The spike in kernel drops is typically seen in only one 30-second interval; the drops do not persist over several minutes. At the next 10-minute mark I see the spike in the drop rate again.

I have 4 sensors, each with a 10Gb NIC. The sensors are connected to an Ixia load-balanced port group (40G). At the Ixia device I see transmit utilization at 50% on average. I do see transmit spikes at the Ixia, but it typically does not exceed 70% transmit, which I would anticipate my sensors should be able to handle. The spikes at the Ixia seem to correlate with the spikes at the sensors.

I also have 1 sensor connected directly to an Ixia 10G port: no load balancing, single sensor. That sensor also has a 10G NIC and is receiving a consistent 9-10G. I see the same kernel drop spikes every 10 minutes or so on this sensor; drops spike every 10 minutes and then fall back to 1-2%.

I have tried upping my ring-size from 100000 to 200000 with no change in the behavior. Any thoughts on next steps?

My sensors:
Suricata 6.0.2, IDS mode (installed RPM from the COPR repo)
ET PRO Ruleset - 52000 rules.
RHEL 7.9
28 cores / 56 threads. (2 sockets)
250G RAM
10G Intel 82599ES

threads: auto
cluster-type: cluster_flow
defrag: yes
use-mmap: yes
tpacket-v3: yes
ring-size: 200000
block-size: 1048576
use-emergency-flush: yes
checksum-checks: no


  1. Try to narrow down what type of traffic is responsible for those spikes, if it’s some sort of elephant flow you might want to shunt those.

  2. You could try cluster_qm mode and pin the CPU cores to further improve performance; see 9.5. High Performance Configuration in the Suricata 6.0.2 documentation.


Thank you Andreas for the information. I am trying to identify whether I am seeing any elephant flows. I do not have a lot of experience with this, but if elephant flows are the issue, wouldn't I expect to see consistent drop rates as opposed to the short-duration spikes I am seeing? The kernel drop spikes are typically seen during only one stats.log interval and then drop back down. Are there any indicators I can look for in the stats that would help identify "elephant flows"?

On cluster_qm: this would require setting up symmetric hashing and defining the key for the interface. The key value is arbitrary as long as I have the correct number of bytes, right?

The CPU affinity example on the 9.5 Performance page looks pretty straightforward, but I am curious about the suricata.yaml settings: the documentation shows the interface block twice for the same interface, eth1, and each block defines 18 threads. Is this required since we set CPU affinity? Why not define 36 threads for the interface in a single block?

Thank you!

It’s hard to tell from stats.log alone, so it’s better to use other system metrics to narrow it down, for example whether you see traffic spikes at the same time. Ideally you can look into the traffic itself, or you already know about potential flows like that in your network.
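As one rough way to look into the traffic itself, a sketch that surfaces the largest flows from eve.json (this assumes flow logging is enabled; field names follow Suricata's flow event schema, flow.bytes_toserver / flow.bytes_toclient):

```python
# Sketch: rank flows in an eve.json file by total bytes to spot elephant-flow
# candidates. Assumes eve.json flow events are enabled; this is a quick
# offline triage tool, not part of Suricata itself.
import json

def top_flows(path, n=10):
    """Return the n largest flows as (total_bytes, src_ip, dest_ip, dest_port)."""
    flows = []
    with open(path) as fh:
        for line in fh:
            ev = json.loads(line)
            if ev.get("event_type") != "flow":
                continue
            f = ev.get("flow", {})
            total = f.get("bytes_toserver", 0) + f.get("bytes_toclient", 0)
            flows.append((total, ev.get("src_ip"), ev.get("dest_ip"),
                          ev.get("dest_port")))
    return sorted(flows, reverse=True)[:n]
```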

The key needs to be exactly the one provided in the documentation.
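For reference, the key in the documentation is a low-entropy symmetric key (0x6D5A repeated) chosen so that both directions of a flow hash to the same RSS queue; an arbitrary key of the right length will not give symmetric hashing. A sketch of the command (eth1 is a placeholder; the required key length is NIC-dependent, so check the current key size first):

```shell
# Show the current RSS hash key and its length for this NIC.
ethtool -x eth1

# Set the symmetric low-entropy key (0x6D5A repeated) from the Suricata
# High Performance guide. Adjust the key length and queue count ("equal 16")
# to match your NIC.
ethtool -X eth1 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 16
```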

The reason why it’s split is NUMA node awareness, since the CPU numbers are split across the two sockets. But feel free to test it with just one section; finding the perfect setting there is still an open topic.
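The split configuration follows roughly this pattern (abridged sketch of the doc's approach, not a drop-in config: thread counts, cluster-ids, and CPU ranges below are illustrative and must match your own NIC queue count and NUMA layout):

```yaml
af-packet:
  - interface: eth1
    threads: 18            # workers for NUMA node 0 queues
    cluster-id: 97
    cluster-type: cluster_qm
    use-mmap: yes
    tpacket-v3: yes
  - interface: eth1
    threads: 18            # workers for NUMA node 1 queues
    cluster-id: 98
    cluster-type: cluster_qm
    use-mmap: yes
    tpacket-v3: yes

threading:
  cpu-affinity:
    - worker-cpu-set:
        cpu: [ "2-19", "22-39" ]   # illustrative core ranges, one per node
        mode: "exclusive"
```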