Feedback for 100Gbit/s Elastic SIEM design (which includes Suricata)

Introduction

Below I’ve collected some information for what I think is a serious Elastic SIEM setup. I would like to share the train of thought behind this design and would appreciate any feedback, namely in terms of scalability, hardware requirements, and utilization. I’m in particular interested in how others distribute the network load over multiple systems. At the moment I’m looking at this appliance.


Related work

A packet broker solution for distributing the load over multiple nodes is discussed in this LRZ presentation.

Preliminary results

  • Current hardware can handle 20+ Gbps without any problems, even without bypass or shunting of "elephant flows"
  • Commodity hardware can achieve 40Gbps; with a dedicated NIC and shunting, even more
  • If the NIC is used only for Suricata, buy a cheaper NIC and invest the money in more cores (the SEPTun recommendation); this still holds up to 40Gbps, but at 100Gbps it will be a different story
  • 100Gbps in a single server is still difficult; bypass/filtering is required
  • NUMA node placement is important
  • PCIe becomes a bottleneck at 100Gbps; again, bypass/filtering is required
  • 2x 40Gbps on a dual-CPU system should be possible on commodity hardware

The related work discusses the following points, which are summarized below.

RSS

Researchers in the related work used an Intel 82599ES 10-Gigabit NIC (X520/X540) for symmetric RSS hashing on the NIC.

RSS does general load balancing of network data over multiple cores or CPUs by calculating a hash value from the IP tuple. The problem from the IDS/IPS perspective is that an IDS/IPS needs to see the traffic exactly as the end client will in order to do its job correctly. The challenge is that RSS was made for another purpose, for example scaling large web/filesharing installations, and therefore does not need the same "flow" consistency as an IDS/IPS deployment. As explained clearly in the Suricata documentation:

“Receive Side Scaling is a technique used by network cards to distribute incoming traffic over various queues on the NIC. This is meant to improve performance, but it is important to realize that it was designed for average traffic, not for the IDS packet capture scenario. RSS using a hash algorithm to distribute the incoming traffic over the various queues. This hash is normally not symmetrical. This means that when receiving both sides of flow, each side may end up in a different queue. Sadly, when deploying Suricata, this is the typical scenario when using span ports or taps.”

More information about RSS is available in this Intel white paper.
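As a concrete starting point, symmetric hashing on such a card can be configured with ethtool. A minimal sketch, assuming the capture interface is eth2 and 16 worker queues (the low-entropy 0x6D5A key is the trick that makes the Toeplitz hash symmetric):

```sh
# Use 16 RSS queues, one per worker core
ethtool -L eth2 combined 16

# Load a low-entropy, repeating hash key so both directions of a flow
# hash to the same queue
ethtool -X eth2 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 16

# Hash TCP/UDP over IPv4 on the full 4-tuple (src/dst IP, src/dst port)
ethtool -N eth2 rx-flow-hash tcp4 sdfn
ethtool -N eth2 rx-flow-hash udp4 sdfn
```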

CPU affinity

Make sure irqbalance is disabled. Pin the NIC interrupts to CPUs on the same NUMA node as the card. Use CPU affinity with Suricata (suricata.yaml) and make sure the af-packet worker threads are not running on NUMA node 0 if that node is otherwise in use (see the related work for more details).
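A minimal sketch of those steps, assuming the NIC is eth2 on NUMA node 1 with cores 16-31 (interface name, node, and core range are assumptions; Intel's driver sources also ship a set_irq_affinity script that does the same):

```sh
# Stop irqbalance so the manual pinning sticks
systemctl stop irqbalance
systemctl disable irqbalance

# Check which NUMA node the NIC sits on
cat /sys/class/net/eth2/device/numa_node

# Pin each eth2 interrupt to its own core on that NUMA node (cores 16-31 here)
core=16
for irq in $(awk -F: '/eth2/ { gsub(/ /, "", $1); print $1 }' /proc/interrupts); do
    echo "$core" > "/proc/irq/$irq/smp_affinity_list"
    core=$((core + 1))
done
```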

If no symmetric RSS hashing is available, you can try xdp_cpumap. Make sure you are running Linux kernel 4.15 or newer, then (a sketch of the resulting af-packet section follows this list):

  • use xdp-cpu-redirect: [xx-xx] (af-packet section of suricata.yaml)
  • use cluster-type: cluster_cpu (af-packet section of suricata.yaml)
  • pin all interrupts of the NIC to one CPU (on the same NUMA node as the NIC)
  • enable CPU affinity so the worker threads run on the cores of the NUMA node where the NIC is located
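Put together, the relevant part of suricata.yaml could look roughly like this (interface name, thread count, CPU range, and filter path are assumptions of this sketch):

```yaml
# suricata.yaml, af-packet section (sketch)
af-packet:
  - interface: eth2
    threads: 16
    cluster-id: 99
    cluster-type: cluster_cpu          # keep each flow on the CPU it was redirected to
    defrag: yes
    xdp-mode: driver                   # native driver XDP, kernel 4.15+
    xdp-filter-file: /usr/libexec/suricata/ebpf/xdp_filter.bpf
    xdp-cpu-redirect: ["1-16"]         # xdp_cpumap: spread frames over these CPUs
```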

XDP

XDP or eXpress Data Path provides a high performance, programmable network data path in the Linux kernel as part of the IO Visor Project. XDP provides bare metal packet processing at the lowest point in the software stack which makes it ideal for speed without compromising programmability. Furthermore, new functions can be implemented dynamically with the integrated fast path without kernel modification. Source.

XDP gives native Linux drivers a serious performance boost. It introduces the XDP bypass functionality to Suricata, which allows elephant flows to be dealt with much earlier in the critical packet path: flows can be bypassed before Suricata processes them, offloading significant work from both Suricata and the kernel and minimizing the performance-intensive work that needs to be done.
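For illustration, bypass is enabled in suricata.yaml roughly as follows (a sketch; it assumes a Suricata build with eBPF support and the stock xdp_filter.bpf installed at the path shown):

```yaml
# suricata.yaml (sketch): XDP-based flow bypass
af-packet:
  - interface: eth2
    bypass: yes                # push bypassed flows down into the XDP filter
    xdp-mode: driver
    xdp-filter-file: /usr/libexec/suricata/ebpf/xdp_filter.bpf

stream:
  bypass: yes                  # bypass a flow once stream reassembly depth is reached
```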

Intel NIC interrupt bugs

One of the most critical bugs at the time of writing is the IRQ reset one (on some kernel version/Intel NIC combinations). It seems that right after Suricata starts and the eBPF code gets injected, all interrupts are pinned to CPU0, while we need them to be spread across cores.

To counter the problem above (and the case where no symmetric RSS is available) on a more permanent basis, the xdp_cpumap concept code was ported and introduced into Suricata by Eric Leblond in January 2018. xdp_cpumap is a recent Linux kernel XDP feature that (among other things) enables redirecting XDP frames to remote CPUs (page 13) without breaking the flow.

Filtering (shunting)

There is no need to waste CPU resources on Netflix streams or encrypted traffic; this can be filtered out. XDP allows us to drop packets at various stages (see the sketch after this list):

  • On the NIC directly (requires compatible hardware, best option)
  • At driver level (requires XDP compatible driver)
  • At kernel level (fallback, slowest option)
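These three stages map directly onto the XDP attach modes of iproute2. A sketch, assuming a compiled filter object xdp_filter.o with its program in the ELF section xdp, and interface eth2:

```sh
# On the NIC itself (hardware offload, requires capable hardware, best option)
ip link set dev eth2 xdpoffload obj xdp_filter.o sec xdp

# At driver level (native XDP, requires an XDP-capable driver)
ip link set dev eth2 xdpdrv obj xdp_filter.o sec xdp

# At kernel level (generic XDP, works everywhere but is the slowest)
ip link set dev eth2 xdpgeneric obj xdp_filter.o sec xdp

# Detach the program again
ip link set dev eth2 xdp off
```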

Hardware requirements

Buildup phases

To gradually get a feeling for the hardware and the required performance, it would be great to expose the cluster to network and log load step by step. With that strategy we can also add hardware in phases and see how well the horizontal scaling is going.

Phase 1

  • Packet broker
  • 1x Suricata node with 2x 25Gbit/s NICs, where each NIC is pinned to its own CPU
  • 1x Logstash node
  • 1x Elastic node (which also hosts the Kibana web interface)

Phase 2… 3… 4…

  • Add more Suricata nodes when the bandwidth reaches the maximum (benchmark whether 25Gbit/s is even possible per NIC)
  • Once the single Logstash node starts to have performance issues, add another Logstash node and place a Kafka node in between to store and forward messages to the primary Logstash node, which then writes the data into Elastic
  • Once performance starts to throttle in Elastic, add another Elastic node
  • Have a dedicated Kibana node, which also acts as a non-data quorum node for the Elastic nodes (see the sketch below)
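The quorum role boils down to a master-eligible, non-data Elasticsearch node on the Kibana host. A sketch using the node.roles syntax of recent Elasticsearch versions (the node name is an assumption; adjust the syntax to the version actually deployed):

```yaml
# elasticsearch.yml on the dedicated Kibana host (sketch)
node.name: kibana-quorum
node.roles: [ master ]   # master-eligible only: listing just "master" disables the data and ingest roles
```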

Architectural drawing


Reaction from Reddit:

I am curious where you end up with the number of Logstash and Elasticsearch boxes. In my experience we ended up needing a lot more Logstash systems than I was expecting; in my head it made more sense that we would need more Elasticsearch systems. This also depends on how heavy your Logstash rules are. What are you expecting to be your final system count? If I had to Kentucky windage it… 3x Suricata systems, 4 or 5 Logstash, and maybe 3 Elasticsearch, each a decently powerful modern system.
Some thoughts:

  • Kafka is great, but I haven’t found a spectacular way to get it working with Suricata. EVE supports Redis by default (see the sketch after this list) and works really well if you don’t need some of the more advanced things that Kafka gives you. The only way I know to get events from Suricata to Kafka is to first write them to disk (or maybe a RAM disk?) and read them with Logstash or something else, which wastes resources.
  • Why a SPAN port? Can you stick the packet broker inline? You will end up losing fewer packets; SPAN ports aren’t ideal when it comes to packet loss.
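For reference, the EVE Redis output mentioned above is configured along these lines (a minimal sketch; server address, key name, and event types are assumptions):

```yaml
# suricata.yaml (sketch): ship EVE events straight to Redis instead of a file
outputs:
  - eve-log:
      enabled: yes
      filetype: redis
      redis:
        server: 127.0.0.1
        port: 6379
        mode: list       # push events onto a Redis list; Logstash can consume it from there
        key: suricata
      types:
        - alert
        - dns
        - tls
        - flow
```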

Keep us informed of the process!

Welcome to the forum.

Regarding the Intel cards: I have used the Intel out-of-tree (OOT) driver in the past. It was easy enough to build and distribute.

As for Elastic - what size are your shards? How do you have the data split up per index? What’s the total volume? I remember barely keeping 1PB for a month.


Thanks for your reply. I’ll probably go for Mellanox cards, since those are also quite well supported and we have more experience with them.

The setup mentioned in the top post is still in development. I’m still trying to figure out what would work with which system resources and use cases. My goal with this post is to exchange design ideas. If someone else has a cluster that needs to process about 100Gbit/s, it would be really cool to compare my theoretical ideas with it, so that I can already make some changes or confirm existing ideas.

Anyone else? If there is more feedback, I’ll include it in my final design. Thanks!