Are there any Suricata best practices for AWS?

I use C5 instances on AWS. Recently, when peak traffic in my network reached 5 Gbps, Suricata started reporting kernel drops. I have tried both AF-PACKET and PF_RING; in my tests the kernel drop rate with PF_RING is smaller. My host CPUs are fully loaded. Are there any users running Suricata on AWS, and are there best practices to refer to? AWS does not allow much NIC-level tuning, so how should I optimize Suricata to improve performance?

C5 instances

Instance Name    vCPUs    RAM       EBS Bandwidth    Network Bandwidth
c5n.4xlarge      16       42 GiB    3.5 Gbps         Up to 25 Gbps

Have you tried any of the optimizations suggested/explained in the docs here - https://suricata.readthedocs.io/en/suricata-5.0.2/performance/tuning-considerations.html ?

@pevma Yes, I have optimized it according to the documentation.

threading

$ suricata --dump-config | grep threading
threading = (null)
threading.set-cpu-affinity = yes
threading.cpu-affinity = (null)
threading.cpu-affinity.0 = management-cpu-set
threading.cpu-affinity.0.management-cpu-set = (null)
threading.cpu-affinity.0.management-cpu-set.cpu = (null)
threading.cpu-affinity.0.management-cpu-set.cpu.0 = 0
threading.cpu-affinity.1 = worker-cpu-set
threading.cpu-affinity.1.worker-cpu-set = (null)
threading.cpu-affinity.1.worker-cpu-set.cpu = (null)
threading.cpu-affinity.1.worker-cpu-set.cpu.0 = 1-15
threading.cpu-affinity.1.worker-cpu-set.mode = exclusive
threading.cpu-affinity.1.worker-cpu-set.prio = (null)
threading.cpu-affinity.1.worker-cpu-set.prio.low = (null)
threading.cpu-affinity.1.worker-cpu-set.prio.medium = (null)
threading.cpu-affinity.1.worker-cpu-set.prio.medium.0 = 0
threading.cpu-affinity.1.worker-cpu-set.prio.high = (null)
threading.cpu-affinity.1.worker-cpu-set.prio.high.0 = 1-15
threading.cpu-affinity.1.worker-cpu-set.prio.default = high
threading.detect-thread-ratio = 1.0

pfring

$ suricata --dump-config | grep pfring
pfring = (null)
pfring.0 = interface
pfring.0.interface = ens5
pfring.0.threads = 15
pfring.0.cluster-id = 99
pfring.0.cluster-type = cluster_flow
pfring.0.checksum-checks = no
pfring.1 = interface
pfring.1.interface = default

af-packet

$ suricata --dump-config | grep af-packet
af-packet = (null)
af-packet.0 = interface
af-packet.0.interface = ens5
af-packet.0.threads = 15
af-packet.0.cluster-id = 99
af-packet.0.cluster-type = cluster_flow
af-packet.0.defrag = yes
af-packet.0.use-mmap = yes
af-packet.0.mmap-locked = yes
af-packet.0.tpacket-v3 = yes
af-packet.0.ring-size = 100000
af-packet.0.checksum-checks = no
af-packet.1 = interface
af-packet.1.interface = default

mem

$ suricata --dump-config | grep mem
app-layer.protocols.http.memcap = 8gb
defrag.memcap = 16gb
flow.memcap = 16gb
stream.memcap = 16gb
stream.reassembly.memcap = 32gb
host.memcap = 32mb

ethtool

$ ethtool -g ens5
Ring parameters for ens5:
Pre-set maximums:
RX:		16384
RX Mini:	0
RX Jumbo:	0
TX:		1024
Current hardware settings:
RX:		16384
RX Mini:	0
RX Jumbo:	0
TX:		1024

MTU

$ ifconfig
ens5: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST>  mtu 9001
        inet x.x.x.x  netmask 255.255.255.0  broadcast x.x.x.x
        inet6 fe80::4e6:12ff:feb6:8bb0  prefixlen 64  scopeid 0x20<link>
        ether 06:e6:12:b6:8b:b0  txqueuelen 1000  (Ethernet)
        RX packets 28385192619  bytes 35152476276451 (35.1 TB)
        RX errors 0  dropped 14294  overruns 0  frame 0
        TX packets 1696  bytes 86651 (86.6 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Can you please share the last update in the stats.log?

@pevma
yes.
stats.tgz (441.1 KB)

Two things:

  1. In the stats log I see several stream-related invalid counters going up, and tcp.reassembly_gap is also noticeable.

  2. When the drops start to increase, can you try to observe the system load via htop or even perf top (a rough sketch follows below)? Maybe it's just some big elephant flow or a specific traffic type that results in the drops.
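A sketch of what I would look at, assuming the default stats.log location and the 1-15 worker CPU range from your config:

$ top -H -p "$(pidof suricata)"    # per-thread view; look for worker threads pegged at 100%
$ sudo perf top -C 1-15            # sample what the worker cores are actually busy doing
$ tail -f /var/log/suricata/stats.log | grep -E 'capture\.kernel_(packets|drops)'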

Something I noticed that could be of interest:
I don't see any drops in the stats log provided after 14 hours of running:

------------------------------------------------------------------------------------
Date: 4/8/2020 -- 15:32:54 (uptime: 0d, 14h 58m 21s)
------------------------------------------------------------------------------------
Counter                                       | TM Name                   | Value
------------------------------------------------------------------------------------
capture.kernel_packets                        | Total                     | 14635092639
decoder.pkts                                  | Total                     | 14635409432
decoder.bytes                                 | Total                     | 18071874052868
decoder.ipv4                                  | Total                     | 29270816828
....

The first run seems to have a negligible drop rate as well.

I would also suggest bringing these two down to 1-2 GB (rarely is anything bigger needed); sketch below.

defrag.memcap = 16gb
flow.memcap = 16gb
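For example, something along these lines in suricata.yaml (2gb is a starting point to experiment with, not a hard rule):

defrag:
  memcap: 2gb

flow:
  memcap: 2gb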

@Andreas_Herz hi:
One:
tcp.reassembly_gap is because the AWS internal MTU is 9001 and traffic mirroring uses VXLAN, which adds about 50 bytes; packets exceed the MTU, so data is truncated and the streams end up incomplete.

Two:
My traffic mirror carries Nginx traffic, and only TCP is mirrored, which includes both HTTP and HTTPS data.

Maybe it's just some big elephant flow or a specific traffic type that results in the drops.
Could you tell me what I should do? I really want to know what particular traffic is in my mirror.

Here is a screenshot I took earlier under high load.
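One thing I could try, assuming EVE flow logging is enabled and eve.json is at the default path (the jq filter and the top-20 cutoff are just a sketch), is to list the largest recorded flows:

$ jq -r 'select(.event_type=="flow")
    | [.src_ip, .dest_ip, .dest_port, (.flow.bytes_toserver + .flow.bytes_toclient)]
    | @tsv' /var/log/suricata/eve.json | sort -t$'\t' -k4 -nr | head -20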

Hi @pevma,
I think it is because we are a multinational e-commerce company; due to the time difference, our traffic always peaks in the early morning.

I would also suggest bringing these two down to 1-2 GB (rarely is anything bigger needed)

My traffic mirror carries Nginx traffic, and only TCP is mirrored, which includes both HTTP and HTTPS data.
Since the mirror is mostly HTTP data, can I try setting these to 2 GB?

I would suggest increasing the default packet size to accommodate the MTU plus the VXLAN tags etc.

Can you upload a stats log after a full 24 hr run please?

Before that, I had already adjusted the default packet size to match the MTU.

# Preallocated size for packet. Default is 1514 which is the classical
# size for pcap on ethernet. You should adjust this value to the highest
# packet size (MTU + hardware header) on your system.
default-packet-size: 9015

The change is already in place.
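For reference, 9015 is simply the 9001-byte MTU plus the 14-byte Ethernet header. The same value can also be passed on the command line, as a sketch:

$ suricata -vvv --pfring -k none -c /etc/suricata/suricata.yaml --set default-packet-size=9015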

I look forward to that full 24-hour run of the stats.

One other thing I noticed: you are running pfring, so the af-packet config is irrelevant in this case. How big a ring buffer do you set up when you insert the module?

Yes, because the kernel drops with af-packet were worse than with pfring.

What is your command when you load the pfring module into the kernel?

start suricata

$ suricata -vvv --pfring -k none -c /etc/suricata/suricata.yaml

pf_ring

$ cat /proc/net/pf_ring/info
PF_RING Version          : 7.5.0 (dev:14f62e0edb2b54cd614ab9d1f6467ccb8c6c9c32)
Total rings              : 15

Standard (non ZC) Options
Ring slots               : 65536
Slot version             : 17
Capture TX               : No [RX only]
IP Defragment            : No
Socket Mode              : Standard
Cluster Fragment Queue   : 0
Cluster Fragment Discard : 0
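For reference, this is roughly how I insert the module; min_num_slots is what drives the "Ring slots" value shown above, and whether raising it further helps is something I still need to experiment with:

$ sudo rmmod pf_ring
$ sudo modprobe pf_ring min_num_slots=65536 enable_tx_capture=0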

new stats.tgz
stats.tgz (842.9 KB)

Interested in MTU settings here – have you had success with this default-packet-size?

I've tried modifying the MTU via default-packet-size; there was no improvement.

Late feedback.

On an AWS c5n.4xlarge instance, using PF_RING in my environment, stats.log shows kernel drops whenever traffic exceeds 5 Gbps. My workaround is to run multiple Suricata instances and share the Nginx HTTP traffic between them; a rough sketch is below.
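The config file names and the second interface here are placeholders; each config file carries its own capture interface, pfring cluster-id, and a disjoint worker-cpu-set so the instances do not contend:

$ suricata --pfring -k none -c /etc/suricata/suricata-a.yaml --pidfile /var/run/suricata-a.pid -D
$ suricata --pfring -k none -c /etc/suricata/suricata-b.yaml --pidfile /var/run/suricata-b.pid -D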

Hopefully more people will discuss best practices for deploying Suricata for traffic analysis in AWS.

Let me rephrase your question:
Which instance type on AWS is recommended as the best for running Suricata with the pfring module enabled?
I'll check with more experienced experts on that. Thanks.