High Suricata capture.kernel_drops

Hi,

I seem to be having a weird problem with Suricata. After around 10 to 20 minutes (depending on the amount of traffic; sometimes it takes longer, sometimes shorter), Suricata starts losing packets and the drop count keeps rising. I should also say that I’m using Suricata as part of a Security Onion installation, though I don’t believe this is Security Onion’s fault.

Here are the system/OS specs:
Running on Oracle Linux 9.3, using AF_PACKET mode in Suricata 7.0.3

Intel Core i7-8700 CPU, 64 GB RAM, Intel I350-T4 network card.
Average traffic is around 600 Mbps.
Here are some outputs from commands:

ethtool -i enp1s0f3
driver: igb
version: 5.15.0-203.146.5.1.el9uek.x86_6
firmware-version: 1.59, 0x800008f8
expansion-rom-version:
bus-info: 0000:01:00.3
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
ethtool -l enp1s0f3
Channel parameters for enp1s0f3:
Pre-set maximums:
RX:             n/a
TX:             n/a
Other:          1
Combined:       8
Current hardware settings:
RX:             n/a
TX:             n/a
Other:          1
Combined:       1

I’ve attached other files as well: my suricata.yaml config, the Suricata log file, the latest entry of stats.log, the ethtool -k output, and the ethtool -S output. Any help is appreciated, even if it’s just troubleshooting steps, as I’m at a loss as to what could be going on. It’s as if a buffer fills up somewhere, and once it’s full, packets start getting dropped.
suricata.yaml (8.2 KB)
suricata.log (1.4 KB)
stats.log (36.9 KB)
ethtool-k.log (1.9 KB)
ethtool-S.log (1.5 KB)

Hi,

Is there a reason why you use bond0 instead of enp1s0f3? Maybe it’s related to the bonding.

What does the tcp.reassembly_gap stat look like before the drops start? Maybe you can correlate it with the drops as well.

You could also run perf top -p $(pidof suricata) before the drops start and again while they happen, and send the output. Maybe something stands out there as well.

Hi Andreas,

Thanks for the quick reply! The reason for bond0 is simply that this is how Security Onion sets up and monitors all the interfaces; the problem doesn’t seem to be related to the bonding. Here’s the ifconfig output for bond0:

bond0: flags=5443<UP,BROADCAST,RUNNING,PROMISC,MASTER,MULTICAST>  mtu 9000
        txqueuelen 1000  (Ethernet)
        RX packets 4688447922  bytes 2419995258327 (2.2 TiB)
        RX errors 0  dropped 38  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

And the config of bond0 as well:

Ethernet Channel Bonding Driver: v5.15.0-203.146.5.1.el9uek.x86_64

Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: enp1s0f3
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 5
Slave queue ID: 0

The tcp.reassembly_gap value sits at 17442 before the drops and rises to 18956 during them. I’ve attached the before and during stats.log below:
stats-right-before-packetloss.log (37.1 KB)
stats-during-packetloss.log (37.1 KB)
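To compare two snapshots like these without eyeballing them, a minimal Python sketch could diff selected counters. It assumes the usual stats.log row layout of `counter | TM Name | value` and only reads the `Total` rows; the helper names (`parse_stats`, `diff_stats`, `drop_percent`) are made up for illustration and are not part of Suricata.

```python
import re

# Matches stats.log rows of the form "counter.name | TM Name | value".
ROW = re.compile(r"^\s*([\w.\-]+)\s*\|\s*(\S+)\s*\|\s*(\d+)\s*$")

def parse_stats(text):
    """Return {counter: value} for the 'Total' rows of a stats.log text.

    If the text holds several dumps, later rows overwrite earlier ones,
    so the result reflects the last dump.
    """
    counters = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m and m.group(2) == "Total":
            counters[m.group(1)] = int(m.group(3))
    return counters

def diff_stats(before, after, names):
    """Return {name: after - before} for counters present in both snapshots."""
    return {n: after[n] - before[n] for n in names if n in before and n in after}

def drop_percent(counters):
    """Kernel drops as a percentage of kernel packets, or None if unknown."""
    packets = counters.get("capture.kernel_packets", 0)
    if packets == 0:
        return None
    return 100.0 * counters.get("capture.kernel_drops", 0) / packets

# The two tcp.reassembly_gap values quoted above, as already-parsed snapshots.
before = {"tcp.reassembly_gap": 17442}
after = {"tcp.reassembly_gap": 18956}
print(diff_stats(before, after, ["tcp.reassembly_gap"]))  # {'tcp.reassembly_gap': 1514}
```

With real files, `parse_stats(open(path).read())` on each attachment would give the two dictionaries, and `drop_percent` on the later one shows how large the loss is relative to the packets the kernel saw.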

I’ve had a look at perf top -p $(pidof suricata), and nothing seems to change between before the drops and during them. I’ve attached some screenshots below.
Before Packet Loss:

During Packet Loss:

The kernel overhead looks quite high. Also, the Suricata build is missing the debug symbols, so the perf output currently doesn’t help much.

I would also try manually switching to the actual interface instead of the bond in suricata.yaml, if that’s possible.

Not too sure if I can rebuild Suricata with the debug options.

I’ve changed the interface from bond0 to enp1s0f3 in the suricata.yaml file and it still behaves the same: fine for ~5 minutes, then capture.kernel_drops starts rising.

I didn’t mean debug options, just the debug symbols. Maybe the OS provides those; for example, Debian provides them via "suricata-dbgsym".

Based on your log output, you don’t use any rules, correct?

[11 - Suricata-Main] 2024-02-23 07:05:06 Warning: detect: 1 rule files specified, but no rules were loaded!

So signatures shouldn’t be the issue.

Do you see anything in dmesg?
What does htop look like? Maybe a core is at 100%?

I don’t think Oracle provides the symbols in the OS.

Yes, that is correct, no rules are loaded - I disabled those to see if that was the problem.

Nothing in dmesg.

However, I just checked glances instead, and my CPU usage is usually between 1% and 5%. Currently the interface is receiving about 500 Mbps, and Suricata is only using between 0.1% and 1% of CPU and around 54.8% of memory. Could this be the issue? Or is this normal when no rules are loaded?

The CPU usage seems to be very low, so what might actually be happening is that Suricata is idle and the kernel has an issue, which causes the drops. The memory usage should be fine. Keep in mind that Linux uses caches, so it would be more precise to look at how much RAM is actually in use.
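As a rough way to check the memory point, a small Python sketch can read /proc/meminfo and report the share of RAM that is not reclaimable cache. It assumes a Linux kernel that exposes the `MemAvailable` field; the function names are illustrative only.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for sizes)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info

def used_excluding_cache_percent(info):
    """Share of RAM actually in use: (MemTotal - MemAvailable) / MemTotal."""
    total = info["MemTotal"]
    return 100.0 * (total - info["MemAvailable"]) / total

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            pct = used_excluding_cache_percent(parse_meminfo(f.read()))
        print(f"{pct:.1f}% of RAM in use, excluding reclaimable cache")
    except FileNotFoundError:
        print("/proc/meminfo not found; this sketch is Linux-only")
```

This gives a system-wide figure rather than a per-process one, which is usually enough to tell whether the percentage a monitoring tool reports is inflated by page cache.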

You could check whether Suricata 6.0, or maybe the master branch (8.0), behaves differently. But I’m leaning more and more toward a different issue, related to the kernel, driver or firmware. You could also try a slower runmode, like pcap mode, instead of AF_PACKET. These are just ideas, though, because the behavior is really strange.

Yes, it’s very interesting behavior. Here are some screenshots:

During Suricata restart:

During packet processing:

During packet loss:

Is there an easy way to see what the issue with the kernel might be?

Hmm, the screenshots show a different picture. The second one shows 100% on several cores, which could lead to drops. What’s odd is that the load falls off afterwards; the third screenshot looks more like a potential bug.

How do you forward the traffic to the interface? Maybe there is even something strange within the traffic itself.

Also, could you run htop and focus on the output related to Suricata for each thread spawned? I think that is a better view compared to glances.
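If htop is not convenient to capture, roughly the same per-thread view can be sketched in Python by sampling /proc. This is only an illustration under the assumption of a Linux /proc layout as described in proc(5); the function names are made up, and it measures CPU ticks (utime + stime) per thread over a short interval.

```python
import os
import time

def parse_task_stat(stat_line):
    """Return (thread_name, cpu_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The name sits in parentheses and may itself contain spaces or ')'.
    left, _, right = stat_line.rpartition(")")
    name = left.split("(", 1)[1]
    fields = right.split()
    # Fields after the name start at field 3 (state); utime and stime are
    # fields 14 and 15 in proc(5), i.e. indexes 11 and 12 here.
    return name, int(fields[11]) + int(fields[12])

def thread_ticks(pid):
    """Snapshot {tid: (name, cpu_ticks)} for every thread of a process."""
    ticks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                ticks[int(tid)] = parse_task_stat(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited between listing and reading
    return ticks

def thread_cpu_percent(pid, interval=1.0):
    """Per-thread CPU usage, in percent of one core, over `interval` seconds."""
    hz = os.sysconf("SC_CLK_TCK")
    first = thread_ticks(pid)
    time.sleep(interval)
    second = thread_ticks(pid)
    usage = {}
    for tid, (name, ticks) in second.items():
        if tid in first:
            usage[f"{name} ({tid})"] = 100.0 * (ticks - first[tid][1]) / hz / interval
    return usage
```

Calling `thread_cpu_percent(pid)` with the Suricata PID (for example the value printed by `pidof suricata`) during processing and again during packet loss would show which threads go quiet.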

Here are the htop screenshots:
During Restart:

During Processing:

During Packet Loss:

All the Suricata processes are now at 0.0% CPU, with the odd one here and there jumping to 0.7%.

The traffic is forwarded to the interface via a SPAN port on a switch. Very weird behavior.

Is the traffic encapsulated or “clean” and fully bidirectional?

But seeing the cores spike to 100% is not good; this will lead to drops for sure. When you see those CPU spikes, can you do another run of sudo perf top -p $(pidof suricata), and maybe even sudo perf top -g -p $(pidof suricata)?

Also, please post the output of suricata --build-info.

Sorry, what do you mean by encapsulated or clean? And yes, the traffic is fully bidirectional.

suricata --build-info
This is Suricata version 7.0.3 RELEASE
Features: PCAP_SET_BUFF AF_PACKET HAVE_PACKET_FANOUT LIBCAP_NG LIBNET1.1 HAVE_HTP_URI_NORMALIZE_HOOK PCRE_JIT HAVE_NSS HTTP2_DECOMPRESSION HAVE_LUA HAVE_LUAJIT HAVE_LIBJANSSON TLS TLS_C11 MAGIC RUST
SIMD support: SSE_4_2 SSE_4_1 SSE_3
Atomic intrinsics: 1 2 4 8 16 byte(s)
64-bits, Little-endian architecture
GCC version 11.4.1 20230605 (Red Hat 11.4.1-2.1.0.1), C version 201112
compiled with _FORTIFY_SOURCE=0
L1 cache line size (CLS)=64
thread local storage method: _Thread_local
compiled with LibHTP v0.5.46, linked against LibHTP v0.5.46

Suricata Configuration:
  AF_PACKET support:                       yes
  AF_XDP support:                          no
  DPDK support:                            no
  eBPF support:                            no
  XDP support:                             no
  PF_RING support:                         no
  NFQueue support:                         no
  NFLOG support:                           no
  IPFW support:                            no
  Netmap support:                          no
  DAG enabled:                             no
  Napatech enabled:                        no
  WinDivert enabled:                       no

  Unix socket enabled:                     yes
  Detection enabled:                       yes

  Libmagic support:                        yes
  libjansson support:                      yes
  hiredis support:                         no
  hiredis async with libevent:             no
  PCRE jit:                                yes
  LUA support:                             yes, through luajit
  libluajit:                               yes
  GeoIP2 support:                          yes
  Non-bundled htp:                         no
  Hyperscan support:                       no
  Libnet support:                          yes
  liblz4 support:                          yes
  Landlock support:                        yes

  Rust support:                            yes
  Rust strict mode:                        no
  Rust compiler path:                      /usr/bin/rustc
  Rust compiler version:                   rustc 1.71.1 (eb26296b5 2023-08-03) (Red Hat 1.71.1-1.el9)
  Cargo path:                              /usr/bin/cargo
  Cargo version:                           cargo 1.71.1

  Python support:                          yes
  Python path:                             /usr/bin/python3
  Install suricatactl:                     yes
  Install suricatasc:                      yes
  Install suricata-update:                 yes

  Profiling enabled:                       no
  Profiling locks enabled:                 no
  Profiling rules enabled:                 no

  Plugin support (experimental):           yes
  DPDK Bond PMD:                           no

Development settings:
  Coccinelle / spatch:                     no
  Unit tests enabled:                      no
  Debug output enabled:                    no
  Debug validation enabled:                no
  Fuzz targets enabled:                    no

Generic build parameters:
  Installation prefix:                     /opt/suricata
  Configuration directory:                 /etc/suricata/
  Log directory:                           /var/log/suricata/

  --prefix                                 /opt/suricata
  --sysconfdir                             /etc
  --localstatedir                          /var
  --datarootdir                            /opt/suricata/share

  Host:                                    x86_64-pc-linux-gnu
  Compiler:                                gcc (exec name) / g++ (real)
  GCC Protect enabled:                     no
  GCC march native enabled:                no
  GCC Profile enabled:                     no
  Position Independent Executable enabled: no
  CFLAGS                                   -g -O2 -fPIC -std=c11 -I${srcdir}/../rust/gen -I${srcdir}/../rust/dist
  PCAP_CFLAGS
  SECCFLAGS

I had another look at sudo perf top -p $(pidof suricata) during the 100% spikes and this was the output:

I then tried sudo perf top -g -p $(pidof suricata) and these two caught my eye:

and

Are these normal values? I’ve attached the full output of sudo perf top -g -p $(pidof suricata) here:
perf-top-g-p.log (151.9 KB)

I mean whether the traffic is encapsulated in other protocols, fragmented, not RFC-compliant, etc.
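One way to get a first hint is to look for nonzero tunnel and defragmentation counters in stats.log. The Python sketch below assumes the `counter | TM Name | value` row layout and a hand-picked list of counter names (`decoder.vlan`, `decoder.gre`, `decoder.vxlan` and so on); those names are assumptions that should be checked against your own stats.log, and the function name is illustrative.

```python
import re

# Matches stats.log rows of the form "counter.name | TM Name | value".
ROW = re.compile(r"^\s*([\w.\-]+)\s*\|\s*(\S+)\s*\|\s*(\d+)\s*$")

# Assumed counter names; verify them against your own stats.log.
ENCAP_COUNTERS = (
    "decoder.vlan", "decoder.vlan_qinq", "decoder.gre", "decoder.vxlan",
    "decoder.erspan", "decoder.mpls", "decoder.teredo",
)
FRAG_PREFIX = "defrag."  # assumed prefix of the defragmentation counters

def encapsulation_hints(stats_text):
    """Return nonzero 'Total' counters hinting at encapsulation or fragments."""
    hints = {}
    for line in stats_text.splitlines():
        m = ROW.match(line)
        if not m or m.group(2) != "Total":
            continue
        name, value = m.group(1), int(m.group(3))
        if value > 0 and (name in ENCAP_COUNTERS or name.startswith(FRAG_PREFIX)):
            hints[name] = value
    return hints
```

An empty result does not prove the traffic is clean; it only means none of the listed counters were incremented in that snapshot.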

Also, the build info shows no Hyperscan support, which is a performance drawback.

I would also dig further into the perf details; those functions have a very high overhead, which I wouldn’t expect on a normal system. It could be related to how Suricata is running on the OS, so maybe play around with that as well, although at some point this might become more of a Security Onion question.
Maybe an additional kernel option or optimization is enabled that isn’t obviously active but is causing issues.

As I said, you can also try another runmode, like pcap mode, which is slower but may not show the symptom you currently see.

Depending on the system and the options you have, you could boot a different OS like Debian or Ubuntu, run vanilla Suricata, and see if the same issue shows up.