The goal of this document is to collect as many potential offload/acceleration cases as possible, no matter how small or trivial. There are many different capabilities we can consider:
- (regular) NICs often offer some offloads (ethtool -k)
- SmartNICs come in many varieties:
  - General compute
  - Flow processors
  - Flexible header parsing
- Packet brokers may be able to assist “from a distance”
- CPU features (e.g. Intel QuickAssist)
- GPU and other ‘co-processors’ (failed for us in the past, but who knows)
If you have anything to add, please add a comment below and I will update the document. We may turn this doc into a ‘wiki post’ later.
Correctness & Tuning
Assisting in correct and optimal deployment of Suricata wrt flow load balancing, NUMA awareness, throughput, etc.
Flow load balancing
Suricata’s threading expects packets from the same flow to be processed by the same thread (symmetric RSS). In practice this is harder with commodity hardware and drivers than it may sound.
- Status: supported using Napatech and some Intel NICs
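To illustrate what "symmetric" means for flow load balancing, here is a minimal Python sketch. It is not what a NIC does (real RSS typically uses a Toeplitz hash with a specially chosen key) and it is not Suricata code; the function names and the CRC32 choice are assumptions made purely for illustration. The point is only that both directions of a flow must map to the same worker.

```python
import ipaddress
import zlib


def symmetric_flow_hash(src_ip, src_port, dst_ip, dst_port, proto):
    """Hash a 5-tuple so that both directions of a flow give the same value."""
    a = (ipaddress.ip_address(src_ip).packed, src_port)
    b = (ipaddress.ip_address(dst_ip).packed, dst_port)
    # Order the two endpoints canonically so A->B and B->A build the same key.
    lo, hi = sorted((a, b))
    key = (
        lo[0] + lo[1].to_bytes(2, "big")
        + hi[0] + hi[1].to_bytes(2, "big")
        + bytes([proto])
    )
    return zlib.crc32(key)


def pick_worker(flow_hash, n_workers):
    """Map a flow hash onto one of n_workers threads (hypothetical helper)."""
    return flow_hash % n_workers


fwd = symmetric_flow_hash("10.0.0.1", 40000, "192.168.1.5", 443, 6)
rev = symmetric_flow_hash("192.168.1.5", 443, "10.0.0.1", 40000, 6)
print(pick_worker(fwd, 8) == pick_worker(rev, 8))
```

A hash that is not symmetric would send the request and the response of the same connection to different threads, which is exactly the situation the threading model cannot handle.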
Suricata will currently not do anything specific on multi-NUMA-node hardware.
Capture methods can help steer traffic to static nodes and keep it there for optimal locality.
Note: NUMA in Suricata is actively being researched.
- Status: supported using Napatech.
Speed up Suricata by avoiding inspection/processing of parts of the traffic that are deemed uninteresting. Typical examples are video streams, a nightly backup run, or the encrypted portion of traffic.
The well-known BPF can be used to filter what Suricata should and should not inspect. BPF filters are used in lots of deployments to ignore certain protocols, hosts, ports, sections of the network, or a combination of the above.
Flow based bypass (Flow Shunting)
The traffic that Suricata doesn’t care about is bypassed based on Suricata settings (stream depth, encrypted traffic setting) and/or rule matches (bypass keyword).
- Status: implemented
  - Internally (flow engine)
  - eBPF (Linux kernel, incl. hw offload on Netronome)
  - PF_RING & Napatech
  - NF_QUEUE (with special ruleset)
- Gain: depends on traffic. If there is a lot of uninteresting traffic, a lot can be bypassed.
  - Best case: we bypass (almost) everything.
  - Worst case: we bypass nothing.
- Limits: depends on rule language for expressing conditions, plus some hard coded logic. Can’t bypass on what we can’t express.
- Risks: reduced visibility
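The bypass decision described in this section can be sketched roughly as follows. This is a simplified model, not Suricata's implementation: `FlowState`, `update_flow` and the `bypass_encrypted` parameter are hypothetical names, and a real setup would additionally push the bypass decision down to the capture method (eBPF, PF_RING, Napatech, etc.) so that later packets never reach Suricata at all.

```python
from dataclasses import dataclass


@dataclass
class FlowState:
    bytes_seen: int = 0
    encrypted: bool = False   # set once the app-layer sees the flow go encrypted
    bypassed: bool = False


def update_flow(flow, payload_len, stream_depth, bypass_encrypted=True):
    """Account for one packet; return True if it should still be inspected."""
    if flow.bypassed:
        # Already bypassed: skip inspection entirely.
        return False
    flow.bytes_seen += payload_len
    if flow.bytes_seen >= stream_depth or (bypass_encrypted and flow.encrypted):
        # Mark the flow; the current packet is still inspected, later ones are not.
        flow.bypassed = True
    return True


flow = FlowState()
for size in (600, 600, 600):
    print(update_flow(flow, size, stream_depth=1000))
```

The best and worst cases from the list above fall out directly: if every flow crosses the depth quickly, almost everything is bypassed; if no flow ever meets a condition, nothing is.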
Packet broker bypass
This is a variant of “Flow Based Bypass”, except in this case there would be a back channel to the external packet broker.
- Status: not implemented
The idea of slicing is that for a part of the traffic Suricata would get only partial packets (packet headers). Suricata does not support this mode as it expects full packets.
- Ticket: none
- Status: not supported and not interested in adding support.
- Gain: Increase performance
- Risk: Lose visibility of later payload
Accelerate Suricata by handling parts of the processing somewhere other than the Suricata process running on the host CPU.
It is important to note that there are lossy and lossless offloads. Examples: the NIC pre-calculating checksums is lossless. IP-defrag w/o anomaly events in case of overlaps is lossy.
Use NIC csum offload to avoid recalculating the csums in Suricata. Suricata validates checksums in the stream engine by default. Otherwise only if rule keywords are used to match on good/bad csums.
- Ticket: TODO
- Status: not implemented. AF_PACKET doesn’t have this, and neither do any of our other capture methods. Suricata does support this internally (PKT_IGNORE_CHECKSUM).
- Gain: avoid fairly expensive work when csum validation is enabled (default)
- Limits: NIC/driver may not validate every layer in case of encapsulation
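To give a feel for the work that could be skipped, here is a small Python sketch of the standard Internet checksum (RFC 1071 style) over an IPv4 header, with a shortcut for when the capture method reports that the NIC already validated it. The `nic_validated` parameter is a hypothetical stand-in for whatever per-packet indication a capture method might provide; inside Suricata the equivalent effect would come from setting PKT_IGNORE_CHECKSUM on the packet.

```python
def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input
    total = sum(int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def ipv4_header_ok(header: bytes, nic_validated: bool) -> bool:
    """Trust the NIC when it vouches for the checksum, else verify in software."""
    if nic_validated:
        return True
    # A header that includes a correct checksum field sums to zero.
    return internet_checksum(header) == 0


hdr = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
print(ipv4_header_ok(hdr, nic_validated=False))
```

Note that trusting the flag blindly also means a corrupted packet passes if the NIC or driver did not actually validate that layer, which is the encapsulation limit mentioned above.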
Use a calculated hash for the flow from the capture method. Currently Suricata calculates a hash value based on the packet header. A NIC may already have a hash for RSS/load balancing purposes.
- Ticket: https://redmine.openinfosecfoundation.org/issues/1741
- Status: not supported.
- Gain: avoid fairly expensive hash calculations.
- Risk: mismatch between capture method flow and Suricata flow.
- Risk: minimal gain due to cache miss during flow compare?
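A rough Python sketch of the idea: the flow table accepts a hash supplied by the capture method and only computes its own when none is given. `FlowTable`, `lookup_or_create` and `nic_hash` are hypothetical names for illustration, not Suricata APIs. The full key compare stays in place either way, which is why the second risk above (cache misses during the flow compare) may limit the gain.

```python
class FlowTable:
    def __init__(self, size=65536):
        self.size = size
        self.buckets = {}  # bucket index -> list of (key, flow) pairs

    def lookup_or_create(self, key, nic_hash=None):
        # Reuse the NIC-provided hash when available; otherwise hash in software.
        h = nic_hash if nic_hash is not None else hash(key)
        bucket = self.buckets.setdefault(h % self.size, [])
        for existing_key, flow in bucket:
            if existing_key == key:  # full compare is still required
                return flow
        flow = {"key": key, "pkts": 0}
        bucket.append((key, flow))
        return flow


table = FlowTable(size=16)
key = ("10.0.0.1", 1234, "10.0.0.2", 80, 6)
flow = table.lookup_or_create(key, nic_hash=0xABCD)
print(table.lookup_or_create(key, nic_hash=0xABCD) is flow)
```

The first risk shows up if the NIC hashes the two directions of a connection differently: the same logical flow would then land in two buckets and be tracked twice.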
Use pre-parsed packets. Some capture methods are capable of sharing the results of packet decoding they have already done. This could be used by Suricata to bypass certain internal checks (like size checks) or to bypass packet decoding completely. Some ideas:
- Get header offsets from capture methods
- Avoid TCP options decoding (Suricata needs the values for various options though)
- Avoid IPv6 exthdr decoding
Suricata has to take into account many evasion possibilities, however most traffic is not using anything like that. Example: HTTP method can have leading spaces, but how often does this really happen? If offload can deal with the common case but have a fallback for anything anomalous, it could lead to gains.
- Packet Decoding
  - Skip size checks
- Protocol Detection
  - First packet of flow will normally contain the full pattern.
- App-Layer decoding
  - DNS request will have a single query - normally.
  - HTTP request line using single spaces, not tabs.
  - HTTP end of line is \r\n normally
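The "common case with fallback" idea for the HTTP request line could look something like the sketch below. It is illustrative only: the regular expression, `parse_request_line` and `slow_parse` are assumptions, and a real slow path would also raise the relevant anomaly events instead of just tolerating the odd input.

```python
import re

# Strict fast path: METHOD SP URI SP VERSION CRLF, nothing unusual allowed.
_FAST = re.compile(rb"^([A-Z]+) (\S+) (HTTP/\d\.\d)\r\n")


def slow_parse(data: bytes):
    """Lenient fallback: tolerates leading whitespace, tabs, bare LF."""
    line = data.split(b"\n", 1)[0].rstrip(b"\r")
    parts = line.split()  # splits on any run of ASCII whitespace
    method = parts[0] if len(parts) > 0 else b""
    uri = parts[1] if len(parts) > 1 else b""
    version = parts[2] if len(parts) > 2 else b""
    return method, uri, version


def parse_request_line(data: bytes):
    """Return (method, uri, version, path_taken)."""
    m = _FAST.match(data)
    if m:
        return m.group(1), m.group(2), m.group(3), "fast"
    return slow_parse(data) + ("slow",)


print(parse_request_line(b"GET /index.html HTTP/1.1\r\nHost: x\r\n"))
print(parse_request_line(b"  GET\t/index.html  HTTP/1.1\nHost: x\n"))
```

The design choice is that the fast path must be strict: anything it does not recognize exactly goes to the slow path, so an evasion attempt costs performance but never visibility.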
TCP Stream Normalization
Stream processing such as packet defragmentation and stream reassembly could be offloaded, presenting Suricata with a normalized TCP stream with no overlaps. If Suricata were aware of this, it could fast-track the decoding of the stream.
- Issues: Handling evasions, or detecting evasion attempts, would then be up to the offload processor, not Suricata.
- Limits: handling of encapsulation may not be supported
- Opportunity: this could also apply our new AppLayerResult::incomplete logic, where the hw would queue TCP data for a stream until a threshold is reached.
Suricata by default will disable the GRO & GSO NIC offloads as it needs to be able to see the original packet sizes (for the dsize keyword). A possible optimization would be to re-enable them when no dsize rules are in use.
- Status: not implemented
- Gain: expect better performance
- Limits: default ET ruleset makes heavy use of dsize
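Deciding whether GRO/GSO could stay enabled comes down to checking the loaded ruleset for the dsize keyword. A minimal heuristic in Python, assuming rules are available as plain text lines; `ruleset_uses_dsize` is a hypothetical helper, and the regex is deliberately crude (it could be fooled by "dsize:" appearing inside a quoted msg string). Inside Suricata itself this would more naturally be tracked while the rules are parsed.

```python
import re

# Matches "dsize:" when it starts a rule option, i.e. follows "(" or ";".
_DSIZE = re.compile(r"[(;]\s*dsize\s*:")


def ruleset_uses_dsize(rule_lines):
    """Return True if any active (non-commented) rule uses the dsize keyword."""
    for line in rule_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and disabled rules
        if _DSIZE.search(line):
            return True
    return False


rules = [
    'alert tcp any any -> any any (msg:"no size check"; content:"x"; sid:1;)',
    'alert tcp any any -> any any (msg:"small packet"; dsize:<10; sid:2;)',
]
print(ruleset_uses_dsize(rules))
```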
Handle IP defrag outside of Suricata.
- Status: supported for AF_PACKET (default enabled).
- Issues: loss of visibility into evasion attempts
- Gain: no need for expensive bookkeeping for fragments and trackers
- Limits: frags within tunnels (e.g. VXLAN) might not be covered so we’d still need to process those.
Splitting the handling of pure packet detection from stream+app-layer detection. The idea is that a general-compute-capable SmartNIC would run the part of the packet processing that is not (very) stateful.
- Status: not supported
- Gain: improve performance by avoiding work on the CPU, reducing active code size
- Limits: for encapsulation we may still need full processing on the host
- Risks: adds significant complexity
Flow Table Handling
including flow table management offload
Instead of Suricata having a relatively expensive flow manager (garbage collector), we could rely on flow messages (NEW, DESTROY, etc.).
- Ticket: N/A
- Status: not supported
- Related: might be able to rely on conntrack (through libnetfilter_conntrack) for nfqueue/nflog setups.
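A toy model of a message-driven flow table, to contrast with a timeout-scanning flow manager: flows are created and removed only in response to events, so no periodic walk over the table is needed. The class and method names are hypothetical, and the message names simply mirror the NEW/DESTROY examples above; how such messages would actually arrive (conntrack, a SmartNIC API, etc.) is left open.

```python
class EventDrivenFlowTable:
    def __init__(self):
        self.flows = {}

    def on_event(self, event, flow_id):
        """Apply a flow message; DESTROY returns the final flow record, if any."""
        if event == "NEW":
            self.flows.setdefault(flow_id, {"pkts": 0})
        elif event == "DESTROY":
            # This is where flow logging and cleanup would be triggered.
            return self.flows.pop(flow_id, None)
        return None

    def on_packet(self, flow_id):
        # Create on demand in case the NEW message was lost or arrived late.
        flow = self.flows.setdefault(flow_id, {"pkts": 0})
        flow["pkts"] += 1


table = EventDrivenFlowTable()
table.on_event("NEW", "flow-1")
table.on_packet("flow-1")
print(table.on_event("DESTROY", "flow-1"))
```

The open question with this approach is what happens when a DESTROY message is lost: without any timeout logic as a backstop, such flows would stay in the table forever.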
Fast Capture Support
Suricata supports various capture methods, from the generic libpcap based support to the more advanced AF_PACKET. More specialized methods like windivert, etc. are also available.
Next there are some specific vendor APIs that are supported: Endace and Napatech.
While capture methods are not necessarily about offloads/acceleration, it’s still useful to track which parts we are missing.
- Status: community effort in progress
- Gain: efficient capture
- Limits: focus of DPDK seems to be more on empowering the CPU, less on (Smart)NIC offloads
An interesting DPDK development is explained here: https://www.youtube.com/watch?v=S7WA-r3V9FI. It would create acceleration/offload APIs for various parts of the Suricata processing pipeline, while being vendor neutral.
FD.io
TODO: needs explanation of what it is.
- Status: not supported