Discussion on DPDK API support in Suricata

Hi All,

The following is an initial attempt at a discussion on how to go about integrating or using the DPDK API in Suricata.

Note: I am new to the Suricata code flow, so I am requesting feedback and corrections on the proposed steps.

Motivation:
Starting with Suricata 3.0, the initial motivation was to use DPDK rx_burst and tx_burst to allow line-rate capture and to measure the throughput limit of a single worker thread with limited zero-copy. The goals were to:

  1. Identify the zero-packet-drop scenario for varying packet sizes.
  2. Run multiple instances in VMs/Docker to scale up/down on demand.
  3. Determine the maximum number of worker threads needed for 40 Gbps processing.

The initial work started with the Intel e1000 driver and was later ported to tap, ixgbe, i40e and vhost. Based on rule/signature additions, the filtering was extended to rule-matched packets, allowing matched packets to be forwarded to a copy interface and processed by the worker threads. In high-data-rate scenarios with no rules, packets simply go into BYPASS mode and the statistics are updated.

Note: the current sample can be found at https://github.com/vipinpv85/DPDK-Suricata_3.0/. Not all scenarios have been tested or validated.

With the release of Suricata 4.1.1, the goals were:

  1. Full worker mode for multiple threads.
  2. Packet reassembly for IPv4/IPv6 fragments.
  3. Static HW RSS with worker pinning.
  4. Deterministic flow-to-worker pinning.
  5. Flattened mbufs for full zero-copy.

The ongoing work can be found at https://github.com/vipinpv85/DPDK_SURICATA-4_1_1

My current goals are to:

  1. Add a seamless interface for integrating DPDK.
  2. Allow use of DPDK 19.11.1 LTS.
  3. Use HW offloads and eBPF-based filter/clone actions from DPDK HW/SW.

For the new model, I would like some feedback on the following two options:

  1. Merged mode - the DPDK threads and the Suricata threads run within the same process.
  2. Split mode - the DPDK threads run as process P1 and the Suricata threads as process P2.

Advantages of Split mode

  • No HW- or vendor-specific code.
  • The Suricata baseline carries only minimal, generic DPDK API usage.
  • Easy to implement the packet-clone feature from either the HW or SW DPDK API.
  • Allows Suricata to update ACL entries when new rules are added.

To achieve this, we can use configuration or YAML entries specific to the DPDK interfacing. An initial pull request is shared at https://github.com/OISF/suricata/pull/4902

Hi Vipin, thanks for your efforts!

Some initial thoughts:

  • please work with our git master, as we’ll not consider such a large feature for our 4.1 or 5.0 stable branches.
  • Do you think it is possible to have a reasonably generic way of supporting DPDK? I’ve seen multiple attempts and all of them were for quite specific scenarios.
  • support should probably be added in smaller steps. First a basic packet source with runmode. Then bigger changes to other parts of Suricata.

Hi Victor, I am only focusing on adding this to the current development branch, not backporting to the 4.1 or 5.0 stable branches.

Split mode can help us achieve that: it keeps the DPDK code and library dependencies minimal.

Summary: description of the DPDK build, library, and run-mode integration into the Suricata development branch.

Introduction:

  1. DPDK is a set of hardware and software libraries that enables packet processing in user space.
  2. DPDK processes are classified as primary and secondary; huge pages and devices are shared between them.
  3. For an existing application to support DPDK, both the build system and the code have to be changed.
  4. One can skip the DPDK lcore threads and service threads, but rte_eal_init and the relevant library calls still have to be invoked (a minimal sketch follows this list).
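
A minimal sketch of point 4 above (illustrative only; the helper name and error handling are assumptions, not code from the pull request):

#include <stdio.h>
#include <rte_eal.h>
#include <rte_errno.h>

/* Initialize the DPDK EAL without launching any work on DPDK lcores: the
 * host application keeps its own threading model and only uses the EAL for
 * hugepage memory, device probing and the PMDs. */
static int dpdk_eal_setup(int argc, char **argv)
{
    int ret = rte_eal_init(argc, argv);   /* must precede any other DPDK call */
    if (ret < 0) {
        fprintf(stderr, "rte_eal_init failed: %s\n", rte_strerror(rte_errno));
        return -1;
    }
    /* Intentionally no rte_eal_remote_launch(): the DPDK lcore and service
     * threads are left unused, as described in point 4. */
    return ret;   /* number of EAL arguments consumed from argv */
}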

image.png

Which mode to use (to be decided: which one to support and start with; a sketch of detecting the process role at runtime follows the comparison):

  1. Primary (singleton/monolithic):
    Pros:
    a) Suricata runs as the primary, managing all DPDK PMDs and libraries.
    b) Requires access to huge pages and root permission.
    c) Does not need ASLR to be disabled.
    d) Can run on bare metal, in a VM, or in Docker.
    e) Can make use of DPDK secondary apps like proc-info, pdump, or any other custom secondary application.
    Cons:
    a) Plausible to run as non-root, but requires DPDK familiarity.
    b) The code becomes bulky.
    c) For HW-vendor or device offloads, the code needs to be updated with generic APIs or SW fallbacks.

  2. Secondary:
    Pros:
    a) Suricata runs as a secondary, with zero or very little management and setup code for PMDs and libraries.
    b) Requires access to huge pages and root permission.
    c) ASLR needs to be disabled for a consistent (or at least higher) chance of starting successfully.
    d) Can run on bare metal, in a VM, or in Docker.
    e) The code becomes lighter.
    Cons:
    a) Plausible to run as non-root, but requires DPDK familiarity.
    b) Cannot make use of DPDK secondary apps like proc-info, pdump, or any other custom secondary application.
    c) Needs to probe the configuration settings for HW-vendor or device offloads.

  3. Detached Primary:
    Pros:
    a) Suricata runs as a primary, getting packets from another DPDK primary via a memif/vhost/AF_XDP interface.
    b) Requires access to huge pages and root permission.
    c) Can run on bare metal, in a VM, or in Docker.
    d) The code becomes lighter because we use generic SW NICs and offloads.
    e) All vendor-specific and non-DPDK offloads can run in the other process.
    f) Useful in scenarios where selective packet mirroring can be implemented in HW or SW and fed to DPDK.
    Cons:
    a) Plausible to run as non-root, but requires DPDK familiarity.
    b) Secondary apps like proc-info, pdump, or any other custom secondary application still work.
    c) Can also make use of XDP (eBPF) to redirect selected traffic.
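
For reference, a minimal sketch (illustrative only, not proposed Suricata code) of how a process can check, after rte_eal_init(), which of the above roles it was started in; the role is normally selected with the EAL option --proc-type=primary|secondary|auto:

#include <stdio.h>
#include <rte_eal.h>

/* Report whether this process came up as the DPDK primary (owns hugepage and
 * device setup) or as a secondary (attaches to the primary's shared memory). */
static void dpdk_report_role(void)
{
    switch (rte_eal_process_type()) {
    case RTE_PROC_PRIMARY:
        printf("DPDK primary: this process configures hugepages and devices\n");
        break;
    case RTE_PROC_SECONDARY:
        printf("DPDK secondary: attached to an existing primary's resources\n");
        break;
    default:
        printf("DPDK process type unknown\n");
        break;
    }
}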

How to do it:

  1. There are ABI and API changes across DPDK releases.
  2. Use a long-term stable release as the de facto DPDK version, e.g. 19.11.1 LTS.
  3. Depending on the individual or distro release, not all NICs, HW, or features are enabled.
  4. Identify and choose the most common NICs, such as memif/pcap/tap/vhost, for ease of building.
  5. Update configure.ac to
    a) honor $RTE_SDK and $RTE_TARGET for custom or distro DPDK packages.
    b) add a new --enable-dpdk flag.
    c) add the necessary CFLAGS and LDFLAGS changes when the flag is enabled.
  6. Add the compiler flag HAVE_DPDK to build DPDK mode.
  7. Start with single- and multi-worker modes.
  8. Code changes in
    a) suricata.c: DPDK initialization, run-mode registration, parsing of the DPDK sections in suricata.yaml, and a hook into rule addition for the DPDK ACL.
    b) source-dpdk, runmode-dpdk: new files to support the DPDK configuration and worker threads (a sketch of the receive loop follows this list).
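
To make 8b) concrete, here is a hedged sketch of the kind of polling loop a source-dpdk worker would run; ProcessMbuf() is a placeholder for the Suricata decode path, not an existing function:

#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Poll one RX queue of one port and hand every received mbuf to the
 * detection path; in the real integration the mbuf data backs a Packet. */
static void dpdk_receive_loop(uint16_t port_id, uint16_t queue_id,
                              volatile int *stop_flag)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    while (!*stop_flag) {
        /* Non-blocking poll: returns 0..BURST_SIZE packets. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            /* ProcessMbuf(bufs[i]);      placeholder for decode/detect */
            rte_pktmbuf_free(bufs[i]);    /* release once processing is done */
        }
    }
}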

Proof of concept with single worker mode: https://github.com/vipinpv85/DPDK-Suricata_3.0
Ongoing work with multi-worker mode: https://github.com/vipinpv85/DPDK_SURICATA-4_1_1

Performance (by redirecting rule-matched packets to the worker threads):

  1. One worker can process around 2.5 Mpps.
  2. At 10 Gbps line rate (64-byte packets) we need 6 workers, each handling 2.5 Mpps of flows.
  3. At 40 Gbps line rate (128-byte packets) we need 16 worker threads, each handling 2.5 Mpps of flows.
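
(For reference: 64-byte frames at 10 GbE line rate are roughly 14.88 Mpps, since each frame occupies 64 + 20 = 84 bytes on the wire, and 14.88 / 2.5 ≈ 6 workers.)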

Please try to upload the image directly here, so it’s correctly embedded. If it doesn’t work, let us know.

On what machine with what ruleset? It might be interesting to establish a solid comparison with the same input and the different options for how to run Suricata, so we can see how much value DPDK could add.

I see that Ubuntu 20.04 has DPDK 19.11.1 packaged, so I think this would be a nice target.

Yes, hence trying to stick to an LTS release of DPDK.

Hi Andreas,

thanks for the suggestion.


Machine: Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz
NIC: Ethernet Controller X710 for 10GbE SFP+ ( 2 * 10G), driverversion=2.1.14-k, firmware=6.01
packet generator: DPDK pktgen for arp, icmp, tcp, udp
test scenario: 64B, 128B, 512B (line rate)
Rule set: single rule with alert/drop action (to stress single worker)

alert udp any any -> any any (ttl:123; prefilter; sid:1;)
drop udp any any -> any any 
drop tcp any any -> any any (msg:"Dir Command - Possible Remote Shell"; content:"dir"; sid:10000001;)

Note: the goal is to stress the environment with either no matches or full matches.

How do the merged/split modes relate to the primary/secondary/etc?

In general if we can keep the integration footprint small it would be great. But this also depends on the performance. If deeper, more intrusive, integration has clear performance and/or usability advantages, then we might still want to go that route.

I think maybe it would be a good idea to start with the simplest and least intrusive integration first and then evaluate when that is complete? Then in a phase 2 we could consider the more complex modes?

How do the merged/split modes relate to the primary/secondary/etc?
Answer> DPDK can work either in standalone mode (as a primary) or in multi-process mode (as primary-secondary).

If we go with the primary/single-process model, the Suricata application needs to have both the management code and the data-processing code in the same binary. As mentioned in the cons, this increases the code size and the maintenance effort.

To address this gap, we have two options:

  1. Make Suricata a secondary process: this reduces the code size, as it only houses the data-processing code. A separate DPDK primary is responsible for configuration, setup, and the addition of new NICs, libraries, and features. As mentioned in the earlier comment, this comes with its own cons.

  2. Make Suricata a separate primary process: now we have two DPDK primary processes, let's call them APP-1 and APP-2 (Suricata). APP-1 is responsible for the HW NIC interfaces, configuration, library setup, and so on. APP-1 connects to APP-2 (Suricata) via memif/vhost interfaces only. This reduces the management and library code, the build dependencies, and the distro-release issues with DPDK packaging. An illustration of the EAL arguments APP-2 could use follows below.
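
A hedged illustration of the kind of EAL arguments APP-2 (Suricata) could use to attach to APP-1 over a virtual device instead of a physical NIC; the vdev name, its parameters and the socket path are placeholders and vary between DPDK releases and PMDs (net_vhost vs net_memif):

/* Illustrative only, not taken from the pull request. */
static char *app2_eal_argv[] = {
    "suricata",
    "--file-prefix=suricata-dpdk",                 /* hugepage namespace separate from APP-1 */
    "--no-pci",                                    /* APP-2 owns no physical NICs */
    "--vdev=net_vhost0,iface=/tmp/app1-app2.sock", /* packets arrive from APP-1 over this vdev */
    "-l", "1-4",                                   /* cores for the Suricata workers */
};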

In general if we can keep the integration footprint small it would be great.
Answer> My recommendation is to use split mode with primary-1 and primary-2.

But this also depends on the performance. If deeper, more intrusive, integration has clear performance and/or usability advantages, then we might still want to go that route.
Answer> As mentioned earlier, the current goal is to add the infrastructure required to fetch packets directly into user space via DPDK. The following features are targeted:

Phase 1:

  1. Allow PMD polling in the receive threads of the DPDK run mode - done
  2. Filter out non-IP packets - done
  3. Use ACL to classify/mark packets that need Suricata processing - done
  4. Use RSS flows to distribute packets to multiple workers (see the sketch after this list) - done
  5. Allow zero copy - done
  6. Avoid Packet_t allocation - done
  7. Run on general processors (x86 Xeon) and SmartNICs - done
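
A hedged sketch of the RSS part of item 4, using the DPDK 19.11-era ethdev flags; the queue counts and hash-function selection are illustrative, not the exact values used in the repository:

#include <rte_ethdev.h>

/* Configure a port so the NIC's RSS hash spreads IP/TCP/UDP flows across
 * nb_rx_queues RX queues; each worker then polls its own queue, giving the
 * deterministic flow-to-worker pinning mentioned earlier. */
static int dpdk_port_configure_rss(uint16_t port_id, uint16_t nb_rx_queues,
                                   uint16_t nb_tx_queues)
{
    struct rte_eth_conf port_conf = {
        .rxmode = { .mq_mode = ETH_MQ_RX_RSS },
        .rx_adv_conf = {
            .rss_conf = {
                .rss_key = NULL,  /* let the PMD choose a default key */
                .rss_hf  = ETH_RSS_IP | ETH_RSS_TCP | ETH_RSS_UDP,
            },
        },
    };

    return rte_eth_dev_configure(port_id, nb_rx_queues, nb_tx_queues, &port_conf);
}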

Phase 2:

  1. Allow autofp - possible
  2. Allow the primary-secondary model - possible
  3. Use perf or VTune on decode/stream/output to find possible DPDK acceleration - to do
  4. Use Hyperscan for better matching - not doing, as Suricata already supports this.
  5. Reassemble fragments before the receive thread - to do
  6. Use user-space packet copy - to do
  7. Use DPDK eBPF for packet clone, tunnel parsing/decap/encap, and marking - prototype is ready, needs to be added to Suricata

I think maybe it would be a good idea to start with the simplest and least intrusive integration first
Answer> The standalone model with memif/vhost is the best choice, as it can be deployed in Docker, a VM, or on bare metal alike. I will also add a simple DPDK reference code.

and then evaluate when that is complete? Then in a phase 2 we could consider the more complex modes?
Answer> Sure

Thanks. I think we can probably start the code contribution process around the ‘phase 1’ features?

2 further thoughts:

  • autofp support doesn’t seem very important.
  • timing: we are approaching a ‘freeze’ as we approach the suricata 6 release date. If you want to get the initial support into 6 there are just over 2 weeks left to get it done. No pressure :smiley:

Sure, thanks for the update. Then let me focus on multi-worker mode only.

updates:

  1. Removed the dependency on a separate INI file.

  2. Added the DPDK configuration as part of suricata.yaml (example below).

  3. Focus on multi-worker mode.

  4. Removed the dependency and lock for separate RX and TX threads.

  5. Started work on fetching the new Suricata release baseline to integrate the DPDK source/mode changes.

# DPDK configuration
dpdk:

  pre-acl: yes
  post-acl: yes
  tx-fragment: no
  rx-reassemble: no
  # BYPASS, IDS, IPS
  mode: IPS
  #mode: BYPASS
  # port index
  input-output-map: ["0-1", "1-0", "2-3", "3-2"]
  # EAL args
  eal-args: ["--log-level=eal,1", "-l 0", "--file-prefix=suricata-dpdk", "-m 2048"]
  # mempool
  mempool-port-common: [name=suricata-port,n=24000,elt_size=2000,private_data_size=0]
  mempool-reas-common: [name=suricatareassembly,n=8000,elt_size=10000,private_data_size=0]
  # port config
  port-config-0: [mempool=portpool,queues=4,rss-tuple=3,ebpf=NULL,jumbo=no,mtu=1500]
  port-config-1: [mempool=portpool,queues=4,rss-tuple=3,ebpf=NULL,jumbo=no,mtu=1500]
  port-config-2: [mempool=portpool,queues=4,rss-tuple=3,ebpf=NULL,jumbo=no,mtu=1500]
  port-config-3: [mempool=portpool,queues=4,rss-tuple=3,ebpf=NULL,jumbo=no,mtu=1500]
  # DPDK pre-acl
  ipv4-preacl: 1024
  ipv6-preacl: 1024
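
A minimal sketch, under the assumption that the eal-args list above is simply turned into an argc/argv pair for the EAL (the actual patch may do this differently):

#include <stdio.h>
#include <rte_eal.h>

/* Hand the example "eal-args" values above to DPDK. argv[0] must be a
 * program name, and rte_eal_init() may reorder the array it is given. */
static int dpdk_init_from_yaml(void)
{
    static char *eal_argv[] = {
        "suricata",
        "--log-level=eal,1",
        "-l", "0",
        "--file-prefix=suricata-dpdk",
        "-m", "2048",
    };
    int eal_argc = (int)(sizeof(eal_argv) / sizeof(eal_argv[0]));

    int ret = rte_eal_init(eal_argc, eal_argv);
    if (ret < 0)
        fprintf(stderr, "EAL initialization from the eal-args entry failed\n");
    return ret;
}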

Thanks for the update!

When you’re ready to share the (draft) code, you can do a PR to the official repo. Select the ‘draft PR’-option if applicable.

Starting work on the DPDK packet acquisition layer for the upcoming release (https://github.com/OISF/suricata). Will update the PR within a week.

Thanks to Vipin Varghese for his work.
Machine: Intel® Xeon® CPU Gold 5115 @ 2.40GHz
NIC: Ethernet Controller X722 for 10GbE SFP+ ( 2 * 10G), driverversion=2.4.6, firmware=3.33
When I was testing suricata-dpdk, the packet loss on the network card kept increasing once the packet rate exceeded 1 Gpps. Using perf to observe function hotspots, the 'common_ring_mp_enqueue', 'DpdkReleaseacket' and 'ReceiveDpdkLoop' functions take up a large proportion (more than 50%). Can you tell me what I can do to reduce the NIC packet drops?

I am a bit at a loss here; I have not shared any pull request on the forum for testing. Hence I have to assume the code base you are referring to is one of my earlier works on GitHub. If that is true, I think you are using a pretty old version of the code base, because:

  1. The current code base on GitHub does not use DPDK ring enqueue/dequeue.

  2. As per my internal testing at 28 Mpps, I do not find DpdkReleaseacket to be a bottleneck.

  3. Without context, saying ReceiveDpdkLoop uses >50% is a bit misleading. If this is based on the older ring implementation, it might be true.