No performance improvements with Hyperscan

Hello,
I have just introduced Hyperscan support in my Suricata 7.0.0, but I don’t see any improvement. I tested it with a few rules (e.g. “rules_loaded”: 115) and with more rules (“rules_loaded”: 35203), and I see no change in performance compared to the same Suricata 7.0.0 without Hyperscan support.

suricata --build-info | grep Hyper
Hyperscan support: yes

In suricata.yaml there are:
mpm-algo: hs
spm-algo: hs

The CPU is Intel Atom C3338 x86_64

Is it possible to verify whether Hyperscan is properly installed and actually being used by Suricata?

If build-info shows Hyperscan it looks correct. You could also run Suricata with -vvvv and check the log output to see if it mentions Hyperscan.
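
For example, something along these lines should show whether the loaded configuration actually ends up using the hs matchers (the config path here is just a placeholder for your setup):

# print the effective settings from the loaded configuration
suricata -c /etc/suricata/suricata.yaml --dump-config | grep -E 'mpm-algo|spm-algo'

# test-load the configuration with extra verbosity and look for Hyperscan/mpm mentions
suricata -T -c /etc/suricata/suricata.yaml -vvvv 2>&1 | grep -i -E 'hyperscan|mpm'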

How did you actually measure the performance impact and are you certain that it was not enabled before, since Hyperscan is enabled by default if available?

In addition to that, the Intel Atom C3338 is a 2-core CPU without HT, which is not a very fast one. So it could just be the CPU being the bottleneck.

Hi Andreas,
thanks for your reply.

The word “hyperscan” is not mentioned in the log, but I can see Debug messages from “mpm-hs”; I don’t know if this is enough to say Suricata is working with Hyperscan.

I have tested 2 different Suricata builds on the same board, the first one compiled without Hyperscan support:
Hyperscan support: no

and I ran the test with that Suricata before installing Hyperscan on the board.

I tested several UDP traffic profiles (fixed frame 1528, fixed frame 85, iMix 420, …) and in all cases the maximum throughput decreases when I load more rules (decreases drastically with 35203 loaded rules).

With more rules loaded (from 115 rules on) I can see the {W#01} and {W#02} threads taking the biggest share of the CPU resources.
I know this behaviour is normal, but what looks strange to me is that nothing changes with or without Hyperscan (neither better nor worse).

Yes, of course I don’t expect great performance on this kind of board, but all things considered, I assumed that the bottleneck is the rule handling, so I also assumed that introducing Hyperscan would help.

Additional info:

Suricata is configured in IPS mode with NFQ.

I upgraded Suricata from 7.0.0 to 7.0.2 yesterday and I’m observing better performance (the maximum throughput increases by around 25%), but there is still no difference with or without Hyperscan.

What does your suricata.yaml look like and how do you start it?
Especially given the amount of queues for the NFQ setup.

There is also a limit to what the pure switch to Hyperscan can achieve; the bottleneck could be something else, since adding more and more rules always increases the pressure on the engine.
It also depends on what the signatures look like; some benefit more from Hyperscan and some less.

How much throughput are we talking about?

suricataIPS_HS.yaml (4.2 KB)

Hello Andreas,

I took some time to answer because I wanted to repeat the test with Suricata 7.0.2, since this release performs much better compared to 7.0.0, at least in my test bed.

In the attached file suricataIPS_HS.yaml you can see the suricata.yaml.

This is the command line:


suricata --pidfile /tmp/suricatatest/suricata.pid -c /tmp/suricatatest/MysuricataminIPSXV_HS.yaml -s /tmp/suricatatest/suricatadir/<rule file>.rules -q 0 -D

The maximum throughput in the best (and unrealistic) condition (fixed frame length 1518):

35236 rules: ~160 Mbps
Rules downloaded from Open/ET source

I can achieve the same value without Hyperscan support (setting mpm-algo: ac-ks and spm-algo: bm).

I can get somewhat better results with Hyperscan only if I set the worker thread affinity to different CPUs (W#01 - CPU 0; W#02 - CPU 2), but I still need to verify this:

hs + hs: ~200 Mbps

ac-ks + bm: ~180 Mbps

Note:

  • I consider the maximum throughput to be the point where the first packets are dropped, but the above values are theoretical (CPU idle is 0%, the traffic contains no threats, only big packets of 1512 bytes, stats and logging disabled); in a real network scenario the maximum throughput would be much lower.
  • I don’t have better values if I use more queues.
  • I can also provide the maximum throughput with fewer rules:
    2 rules: ~ 780 Mbps (ftp-events.rules)
    115 rules: ~ 400 Mbps (decoder-events.rules)

but these 2 values were obtained under different conditions: in order to achieve more than ~ 400 Mbps I need to enable DPDK (my Suricata instance doesn’t have DPDK support) and I need to reserve 1 CPU for it, so there is just 1 core left for Suricata.
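
For reference, the worker pinning mentioned above goes through the threading section of suricata.yaml; a minimal sketch of what I mean (illustrative only, not a copy of my attached file):

threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - worker-cpu-set:
        cpu: [ 0, 2 ]        # W#01 -> CPU 0, W#02 -> CPU 2
        mode: "exclusive"
        prio:
          default: "high"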

Your CPU is not good enough to run Suricata in my opinion.

My CPU is ARM (4xA76 + 4xA55) and the throughput is about 941 Mbps with Hyperscan, while my other CPU is ARM (4xA55) and the throughput is about 500 Mbps with Hyperscan.

A more powerful CPU with more cores is better. Just for your reference.

You don’t have to go with DPDK right away.

Running in worker mode should already provide better performance, as your tests showed.

Instead of NFQUEUE, unless you need it, you could also try the AF_PACKET IPS mode.
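
A minimal af-packet IPS sketch would look roughly like this (eth0/eth1 are placeholders for the two interfaces you would put Suricata inline between), with Suricata then started with --af-packet instead of -q 0:

af-packet:
  - interface: eth0
    threads: 1
    cluster-id: 99
    cluster-type: cluster_flow
    defrag: yes
    copy-mode: ips        # forward each packet to the peer interface after inspection
    copy-iface: eth1
  - interface: eth1
    threads: 1
    cluster-id: 98
    cluster-type: cluster_flow
    defrag: yes
    copy-mode: ips
    copy-iface: eth0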

Using CPU affinity and maybe at least 2 queues instead of 1 could also improve performance.

Please also provide the stats.log from your run and attach the htop output as well as the output of perf top -p $(pidof suricata) to see where most of the overhead is.

But given this CPU, I think around 200 Mbit/s with the full ruleset in IPS mode is not totally off from what I would expect.

Moving to AF_PACKET is complicated in the particular router configuration I’m testing now.

htop and perf are not available on the board. I can provide the top output (top -b -H); you can see 2 different Suricata configuration examples with 35236 loaded rules in the attached files.

  1. Suricata with Hyperscan + 1 queue + worker threads pinned on 2 different CPUs (at the moment this is the best configuration):

topSuricataHS.log = top output

statsHS.log = stats.log

  2. Suricata with Hyperscan + runmode=workers + 2 queues + worker threads pinned on 2 different CPUs:

topSuricataHS_workers.log = top output

statsHS_workers.log = stats.log

topSuricataHS_workers.log (6.8 KB)
topSuricataHS.log (8.1 KB)
statsHS_workers.log (124.1 KB)
statsHS.log (125.8 KB)

Yes, I don’t expect much more either (and I would agree with Samiux: this CPU is not the best option to host Suricata, at least as an IPS with the full ruleset), but since I can obtain almost the same throughput without Hyperscan, I assumed I would get higher values with Hyperscan.

Could it depend on the lack of AVX instructions in the Intel Atom C3338?

What does the nftables config look like regarding the queues?

The pure top output is not enough; it won’t show the exact details of where performance could be lost. I would recommend installing the perf tools to get a better picture.

It could be that some optimizations might not apply with this CPU, but in general Hyperscan should always improve the performance.

Thus I still suspect another bottleneck, which we should see once we have more debugging info (with perf).
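
If perf can be installed on the board, something like this would already give us a usable profile (the 30 second duration is arbitrary):

# sample the running Suricata process with call graphs for 30 seconds
perf record -g -p $(pidof suricata) -- sleep 30

# show the hottest functions from the recording
perf report --stdio | head -n 40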

Hyperscan requires at least SSSE3 extensions.
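
You can check which SIMD extensions your CPU exposes with something like:

# list the SSE/AVX related flags of the first CPU core
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(ssse3|sse4_1|sse4_2|avx|avx2|avx512f)$'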

I collected the data gathered with perf.

You can find the output from “perf top -p $(pidof suricata)” command in the file perf_output.log.

perf_output.log contains data related to 2 different configurations:

  • 1 queue

suricata … -q 0

iptables -I FORWARD -p udp -j NFQUEUE

  • 2 queues

suricata … -q 3 -q 4

iptables -A FORWARD -p udp -j NFQUEUE -m statistic --mode nth --every 2 --packet 0 --queue-num 3

iptables -A FORWARD -p udp -j NFQUEUE -m statistic --mode nth --every 2 --packet 1 --queue-num 4

iptables -A FORWARD -p udp -j NFQUEUE --queue-num 4

I have iptables, not nftables.
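
For what it’s worth, I believe the same 50/50 split could also be expressed with the NFQUEUE --queue-balance option instead of the statistic match, but I have not tested that:

iptables -A FORWARD -p udp -j NFQUEUE --queue-balance 3:4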

For both scenarios you can find the perf output for normal traffic and for when packets are being dropped.

As you can see, when there are dropped packets the nfqnl_recv_verdict function is on top.

Let me know if I can provide more data.
perf_output.log (17.9 KB)