Offload and Acceleration ideas

vjulien · June 30, 2020, 10:41am

The goal of this document is to collect as many potential offload/acceleration cases as possible, no matter how small or trivial. There are lots of different capabilities we can consider:

(regular) NICs often offer some offloads (ethtool -k)
SmartNICs come in many varieties:
- FPGA
- General compute
- Flow processors
- Flexible header parsing
Packet brokers may be able to assist “from a distance”
CPU features (e.g. Intel Quick Assist)
GPU and other ‘co-processors’ (failed for us in the past, but who knows)

If you have anything to add, please add a comment below and I will update the document. We may turn this doc into a ‘wiki post’ later.

Correctness & Tuning

Assisting in correct and optimal deployment of Suricata wrt flow load balancing, NUMA awareness, throughput, etc.

Flow load balancing

Suricata’s threading expects packets from the same flow to be processed by the same thread (symmetric RSS). In practice this is harder with commodity hardware and drivers than it may sound.

Status: supported using Napatech and some Intel NICs
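To illustrate what "symmetric" means here, below is a minimal sketch of a direction-independent flow hash. It is not the Toeplitz-based RSS that NICs implement and not Suricata's own hash; the function names are made up for this example. The only point is that both directions of a flow must map to the same worker thread.

```python
import ipaddress
import zlib


def symmetric_flow_hash(src_ip, src_port, dst_ip, dst_port, proto):
    """Hash a 5-tuple so that both directions give the same value."""
    a = (ipaddress.ip_address(src_ip).packed, src_port)
    b = (ipaddress.ip_address(dst_ip).packed, dst_port)
    # Sorting the two endpoints removes the notion of direction.
    lo, hi = sorted((a, b))
    key = (lo[0] + lo[1].to_bytes(2, "big")
           + hi[0] + hi[1].to_bytes(2, "big")
           + bytes([proto]))
    return zlib.crc32(key)


def pick_thread(flow_hash, n_threads):
    """Map a flow hash onto one of n_threads worker threads."""
    return flow_hash % n_threads


to_server = symmetric_flow_hash("10.0.0.1", 40000, "10.0.0.2", 80, 6)
to_client = symmetric_flow_hash("10.0.0.2", 80, "10.0.0.1", 40000, 6)
print(pick_thread(to_server, 4) == pick_thread(to_client, 4))
```

A plain (asymmetric) hash would treat the two directions as different keys, so a request and its response could end up on different threads.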

NUMA

Suricata will currently not do anything specific on multi-NUMA-node hardware.

Capture methods can help steer traffic to static nodes and keep it there for optimal locality.

Note: NUMA in Suricata is actively being researched.

Status: supported using Napatech.

Ignoring/Bypassing traffic

Speed up Suricata by avoiding inspection/processing of parts of the traffic that are deemed uninteresting. Typical examples are video streams, a nightly backup run, or the encrypted portion of traffic.

BPF

The well-known BPF can be used to filter what Suricata should and should not inspect. BPFs are used in lots of deployments to ignore certain protocols, hosts, ports, sections of the network, or a combination of the above.
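As a small illustration, the sketch below composes an "ignore" filter from lists of hosts, networks and ports. The helper name and the example addresses are made up; the output is an ordinary pcap-filter expression using the standard `host`, `net`, `port`, `or` and `not` primitives.

```python
def build_ignore_filter(hosts=(), nets=(), ports=()):
    """Build a pcap-filter expression that excludes the given traffic."""
    terms = [f"host {h}" for h in hosts]
    terms += [f"net {n}" for n in nets]
    terms += [f"port {p}" for p in ports]
    if not terms:
        return ""  # nothing to ignore: no filter at all
    return "not (" + " or ".join(terms) + ")"


# Example: ignore a backup server and rsync traffic.
print(build_ignore_filter(hosts=["10.0.0.5"], ports=[873]))
# prints: not (host 10.0.0.5 or port 873)
```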

Flow based bypass (Flow Shunting)

The traffic that Suricata doesn’t care about is bypassed based on Suricata settings (stream depth, encrypted traffic setting) and/or rule matches (bypass keyword).

Status: implemented
- Internally (flow engine)
- eBPF (Linux kernel, incl. hw offload on Netronome)
- PF_RING & Napatech
- NF_QUEUE (with special ruleset)
Gain: depends on traffic. In case of lots of uninteresting traffic there is a lot that can be bypassed.
- Best case: we bypass (almost) everything.
- Worst case: we bypass nothing.
Limits: depends on rule language for expressing conditions, plus some hard coded logic. Can’t bypass on what we can’t express.
Risks: reduced visibility
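The decision described above can be sketched roughly as follows. This is a simplified model, not Suricata's actual code: the `FlowState` fields and the `should_bypass` function are hypothetical names, and a stream depth of 0 is taken to mean "unlimited".

```python
from dataclasses import dataclass


@dataclass
class FlowState:
    """Hypothetical per-flow state, for illustration only."""
    bytes_reassembled: int = 0
    is_encrypted: bool = False       # e.g. TLS handshake completed
    rule_set_bypass: bool = False    # a rule with the bypass keyword matched


def should_bypass(flow, stream_depth, bypass_encrypted):
    """Return True if the rest of this flow can be skipped."""
    if flow.rule_set_bypass:
        return True
    if bypass_encrypted and flow.is_encrypted:
        return True
    # Depth 0 is treated as "no limit" in this sketch.
    if stream_depth > 0 and flow.bytes_reassembled >= stream_depth:
        return True
    return False


flow = FlowState(bytes_reassembled=2_000_000)
print(should_bypass(flow, stream_depth=1_048_576, bypass_encrypted=True))
```

Once the answer is True, the flow key would be handed to whichever bypass backend is in use (internal flow engine, eBPF map, capture card) so later packets never reach detection.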

Packet broker bypass

This is a variant of “Flow Based Bypass”, except in this case there would be a back channel to the external packet broker.

Status: not implemented

Flow Slicing

The idea of slicing is that for a part of the traffic Suricata would get only partial packets (packet headers). Suricata does not support this mode as it expects full packets.

Ticket: none
Status: not supported and not interested in adding support.
Gain: Increase performance
Risk: Lose visibility of later payload

Offloading

Accelerate Suricata by handling parts of the processing somewhere other than the Suricata process running on the host CPU.

It is important to note that there are lossy and lossless offloads. Examples: the NIC pre-calculating checksums is lossless. IP-defrag w/o anomaly events in case of overlaps is lossy.

RX CSUM

Use NIC csum offload to avoid recalculating the csums in Suricata. Suricata validates checksums in the stream engine by default. Otherwise they are validated only if rule keywords are used to match on good/bad csums.

Ticket: TODO
Status: not implemented. AF_PACKET doesn’t have this. None of our capture methods do. Suricata does internally support this using the PKT_IGNORE_CHECKSUM flag.
Gain: avoid fairly expensive work when csum validation is enabled (default)
Limits: NIC/driver may not validate every layer in case of encapsulation
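To make the potential saving concrete, here is a minimal sketch of the work that could be skipped. It computes the standard 16-bit ones' complement Internet checksum over an IPv4 header and only does so when the capture method has not already vouched for the packet. The `nic_validated` parameter is an assumption standing in for whatever per-packet status a capture method might report.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum as used by IPv4/TCP/UDP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input
    total = sum(int.from_bytes(data[i:i + 2], "big")
                for i in range(0, len(data), 2))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def checksum_ok(header: bytes, nic_validated: bool) -> bool:
    """Skip the software check when the NIC already validated it."""
    if nic_validated:
        return True
    # A header that includes a correct checksum field sums to zero.
    return internet_checksum(header) == 0


hdr = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
print(checksum_ok(hdr, nic_validated=False))
```

For IP headers this is cheap, but for TCP/UDP the sum covers the whole payload, which is where the cost adds up.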

Flow Hash

Use a calculated hash for the flow from the capture method. Currently Suricata calculates a hash value based on the packet header. A NIC may already have a hash for RSS/load balancing purposes.

Ticket: https://redmine.openinfosecfoundation.org/issues/1741
Status: not supported.
Gain: avoid fairly expensive hash calculations.
Risk: mismatch between capture method flow and Suricata flow.
Risk: minimal gain due to cache miss during flow compare?
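A rough sketch of how a capture-supplied hash could be used, with a software fallback. All names here are hypothetical and the table layout is deliberately naive. Note that the full key comparison remains even when the hash comes for free, which is where the cache-miss concern above applies.

```python
import zlib

N_BUCKETS = 1024


def software_hash(key):
    """Fallback hash, used when the capture method supplies none."""
    return zlib.crc32(repr(key).encode())


def get_or_create_flow(table, key, nic_hash=None):
    """Look up a flow, preferring the hash provided by the NIC."""
    h = nic_hash if nic_hash is not None else software_hash(key)
    bucket = table.setdefault(h % N_BUCKETS, [])
    for flow in bucket:
        # The hash only selects the bucket; the key must still match.
        if flow["key"] == key:
            return flow
    flow = {"key": key, "pkts": 0}
    bucket.append(flow)
    return flow


table = {}
key = ("10.0.0.1", 40000, "10.0.0.2", 80, 6)
flow = get_or_create_flow(table, key, nic_hash=0x1234ABCD)
flow["pkts"] += 1
```

For this to work, the NIC hash has to be identical for both directions of a flow and the key has to be normalized the same way, otherwise the two directions land in different buckets.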

Packet Decoding

Use pre-parsed packets. Some capture methods are capable of sharing the results of packet decoding they have already done. This could be used by Suricata to bypass certain internal checks (like size checks) or to bypass packet decoding completely. Some ideas:

Get Header offsets from Capture methods
Avoid TCP options decoding (Suricata needs the values for various options though)
Avoid IPv6 exthdr decoding

Fast tracks

Suricata has to take into account many evasion possibilities; however, most traffic is not using anything like that. Example: HTTP method can have leading spaces, but how often does this really happen? If offload can deal with the common case but have a fallback for anything anomalous, it could lead to gains.

Packet Decoding
- Skip size checks
Protocol Detection
- First packet of flow will normally contain the full pattern.
App-Layer decoding
- DNS request will have a single query - normally.
- HTTP request line using single spaces, not tabs.
- HTTP end of line is \r\n normally
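The fast-path/fallback pattern could look roughly like this for the HTTP request line. This is an illustrative sketch only and not how Suricata's HTTP parser works; the function names are made up. The fast path accepts only the strictly common form (single spaces, `\r\n` line ending) and everything else goes to a tolerant slow path.

```python
def slow_parse(line: bytes):
    """Tolerant parser: leading whitespace, tabs, bare LF are accepted."""
    parts = line.strip().split()  # splits on any run of ASCII whitespace
    parts += [b""] * (3 - len(parts))  # pad missing URI/version
    return parts[0], parts[1], parts[2]


def parse_request_line(line: bytes):
    """Return ((method, uri, version), path_taken)."""
    if line.endswith(b"\r\n") and b"\t" not in line:
        parts = line[:-2].split(b" ")
        # Exactly three non-empty fields means single-space separators
        # and no leading or trailing spaces.
        if len(parts) == 3 and all(parts):
            return tuple(parts), "fast"
    return slow_parse(line), "slow"


print(parse_request_line(b"GET /index.html HTTP/1.1\r\n"))
print(parse_request_line(b"  GET\t/index.html HTTP/1.1\n"))
```

Both calls yield the same parsed fields; only the second one pays for the tolerant path, and that is also the point where an anomaly event could be raised.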

TCP Stream Normalization

Stream processing such as packet defragmentation, and stream re-assembly could be offloaded and present Suricata with a normalized TCP stream with no overlaps. If Suricata was aware of this it could fast-track the decoding of the stream.

Issues: Handling evasions, or detecting evasion attempts is now on the offload processor, not Suricata.
Limits: handling of encapsulation may not be supported
Opportunity: this could also apply our new AppLayerResult::incomplete logic, where the hw would queue TCP data for a stream until a threshold is reached.
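The queueing idea in the last point can be modelled in a few lines. This is only a sketch of the concept, with hypothetical names: the parser says how many bytes it needs, and the offload side holds back data until that much is available.

```python
class IncompleteBuffer:
    """Queue stream data until the requested number of bytes is present."""

    def __init__(self):
        self.buf = bytearray()
        self.needed = 0

    def request(self, needed: int):
        """Parser signals it cannot proceed with fewer than `needed` bytes."""
        self.needed = needed

    def feed(self, data: bytes):
        """Add segment data; return the queued bytes once enough arrived."""
        self.buf += data
        if len(self.buf) < self.needed:
            return None  # keep queueing, do not wake the parser
        out = bytes(self.buf)
        self.buf.clear()
        self.needed = 0
        return out


q = IncompleteBuffer()
q.request(10)
print(q.feed(b"12345"))    # not enough yet
print(q.feed(b"67890ab"))  # threshold reached, data is released
```

The gain would come from the host-side parser being invoked once with usable data instead of once per small segment.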

GRO/GSO

Suricata by default will disable the GRO & GSO NIC offloads as it needs to be able to see the original packet sizes (for the dsize keyword). A possible optimization would be to re-enable them when no dsize rules are in use.

Status: not implemented
Gain: expect better performance
Limits: default ET ruleset makes heavy use of dsize
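Checking whether a loaded ruleset uses dsize at all could be approximated as below. This is a naive text scan with a made-up function name, not Suricata's rule parser; it errs on the side of reporting dsize as present, which would simply leave GRO/GSO disabled as today.

```python
import re

# "dsize:" at the start of a rule option, i.e. right after "(" or an
# unescaped ";". This may over-match text inside quoted strings.
DSIZE_OPTION = re.compile(r"(?<!\\)[;(]\s*dsize\s*:")


def ruleset_uses_dsize(rules):
    """Return True if any active rule appears to use the dsize keyword."""
    for rule in rules:
        if rule.lstrip().startswith("#"):
            continue  # commented-out rule
        if DSIZE_OPTION.search(rule):
            return True
    return False


rules = [
    'alert tcp any any -> any any (msg:"a"; content:"x"; sid:1;)',
    'alert udp any any -> any 53 (msg:"b"; dsize:>512; sid:2;)',
]
print(ruleset_uses_dsize(rules))
```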

IP defragmentation

Handle IP defrag outside of Suricata.

Status: supported for AF_PACKET (default enabled).
Issues: loss of visibility into evasion attempts
Gain: no need for expensive bookkeeping for fragments and trackers
Limits: frags within tunnels (e.g. VXLAN) might not be covered so we’d still need to process those.

Pipeline splitting

Splitting handling of pure packet detection and stream+app-layer detection. The idea is that a general compute capable SmartNIC would run a part of the packet processing that is not (very) stateful.

Status: not supported
Gain: improve performance by avoiding work on the CPU, reducing active code size
Limits: for encapsulation we may still need full processing on the host
Risks: adds significant complexity

Flow Table Handling

including flow table management offload

Instead of Suri having a relatively expensive flow manager (garbage collector) we could rely on flow messages (NEW, DESTROY, etc).

Ticket: N/A
Status: not supported
Related: might be able to rely on conntrack (through libconntrack) for nfqueue/nflog setups.
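A minimal sketch of a flow table driven by external events rather than by a periodic timeout scan. The event names follow the NEW/DESTROY wording above; the class and everything else are assumptions for illustration.

```python
class EventDrivenFlowTable:
    """Flow table whose lifetime management is driven by external events."""

    def __init__(self):
        self.flows = {}

    def on_event(self, kind, flow_id):
        if kind == "NEW":
            self.flows.setdefault(flow_id, {"pkts": 0})
        elif kind == "DESTROY":
            # No timeout scan needed: the offload tells us the flow ended.
            self.flows.pop(flow_id, None)

    def on_packet(self, flow_id):
        flow = self.flows.get(flow_id)
        if flow is not None:
            flow["pkts"] += 1
        return flow


table = EventDrivenFlowTable()
table.on_event("NEW", 42)
table.on_packet(42)
table.on_event("DESTROY", 42)
print(len(table.flows))
```

The trade-off is that Suricata's view of flow state then depends on the event source being reliable, for example on DESTROY messages not getting lost.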

Fast Capture Support

Suricata supports various capture methods, from the generic libpcap-based support to the more advanced AF_PACKET. More specialized methods like PF_RING, windivert, etc. are also available.

Next there are some specific vendor APIs that are supported: Endace and Napatech.

While capture methods are not necessarily about offloads/acceleration, it’s still useful to track which parts we are missing.

DPDK

Status: community effort in progress
Gain: efficient capture
Limits: focus of DPDK seems to be more on empowering the CPU, less on (Smart)NIC offloads

An interesting DPDK development is explained here: https://www.youtube.com/watch?v=S7WA-r3V9FI It would create acceleration/offload APIs for various parts of the Suricata processing pipeline, while being vendor neutral.

FD.io

TODO: needs explanation of what it is.

Status: not supported

vjulien · July 6, 2020, 12:30pm

Added:

Last week: DPDK, FDio
Today: AppLayerResult::incomplete offload.

bdfreeman1421 · January 4, 2021, 3:27pm

I’m interested in sessionOffload to a SmartNIC or programmable switch - github.com/att/sessionOffload - is that an example of Packet Broker Bypass in Suricata?

bdfreeman1421 · February 24, 2021, 4:30pm

I hacked together a simple hack/demonstration of using the gRPC API for bypass/offload by adding an AFPOPOFBypassCallback to source-af-packet.c (OpenOffload - OPOF) and various other plumbing files. I modeled it after the pattern in AFPBypassCallback and AFPXDPBypassCallback. I haven’t tested it with a real device, but it looks like it’s doing the right thing: when I send TCP and UDP traffic to Suricata, I can see the API calls to a simulated offload daemon.

vjulien · February 26, 2021, 10:41am

@bdfreeman1421 are you planning to submit a pull request?

bdfreeman1421 · March 1, 2021, 8:59pm

It’s nowhere near production quality, but I’d be happy to submit a pull request once I test it against a real device (I’m sure I’ll find things to fix) and add a few of the minimal things needed to make it more fully baked.

srini38 · July 17, 2021, 4:50am

A couple of ideas for DPDK, which has a good collection of libraries and APIs. I think a good start would be to selectively use the DPDK APIs that provide the maximum benefit with the least risk of regressions.

Generic regex offload API as Victor suggested, so that SmartNIC/hardware/software offload can be used transparently
Use of hugepages where needed (e.g. for Hyperscan)
Use of compression APIs so that SmartNIC/hardware/software offload can be used transparently
Fast packet acquisition: use the DPDK netmap compatible layer (more work will be needed for multiqueue support)