Dataset: Mutex vs SpinLock vs RwLocks

Hi,

We are using Suricata datasets intensively.
On our test system, we have 24 cores performing dataset lookup operations in the DPDK run-mode, and 1 writer thread performing updates.

We have noticed that, even when the writer is disabled, half of the dataset lookup time is spent in pthread_mutex_lock/pthread_mutex_unlock.

I plan to test using spinlocks (#define HRLOCK_SPIN), but it feels like switching to rwlocks would be the most beneficial.
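For context, the lock primitive behind those hash rows is selected at compile time via defines like HRLOCK_SPIN. Below is a minimal standalone sketch of that kind of compile-time selectable row lock, showing how a mutex, a spinlock or an rwlock can sit behind the same interface; the ROWLOCK_* names and the struct are illustrative only, not copied from the Suricata sources.

/* Minimal sketch of a compile-time selectable hash-row lock (illustrative,
 * not Suricata's actual code). Build with -DROWLOCK_SPIN or -DROWLOCK_RW
 * to switch the primitive; the default is a plain mutex. */
#include <pthread.h>

#if defined(ROWLOCK_SPIN)
    /* Spinlock: avoids the futex syscall, good when critical sections are tiny. */
    #define ROWLOCK_T           pthread_spinlock_t
    #define ROWLOCK_INIT(l)     pthread_spin_init((l), PTHREAD_PROCESS_PRIVATE)
    #define ROWLOCK_RDLOCK(l)   pthread_spin_lock(l)
    #define ROWLOCK_WRLOCK(l)   pthread_spin_lock(l)
    #define ROWLOCK_UNLOCK(l)   pthread_spin_unlock(l)
#elif defined(ROWLOCK_RW)
    /* Read-write lock: concurrent readers do not serialize, writers are exclusive. */
    #define ROWLOCK_T           pthread_rwlock_t
    #define ROWLOCK_INIT(l)     pthread_rwlock_init((l), NULL)
    #define ROWLOCK_RDLOCK(l)   pthread_rwlock_rdlock(l)
    #define ROWLOCK_WRLOCK(l)   pthread_rwlock_wrlock(l)
    #define ROWLOCK_UNLOCK(l)   pthread_rwlock_unlock(l)
#else
    /* Default: plain mutex, readers and writers all serialize per row. */
    #define ROWLOCK_T           pthread_mutex_t
    #define ROWLOCK_INIT(l)     pthread_mutex_init((l), NULL)
    #define ROWLOCK_RDLOCK(l)   pthread_mutex_lock(l)
    #define ROWLOCK_WRLOCK(l)   pthread_mutex_lock(l)
    #define ROWLOCK_UNLOCK(l)   pthread_mutex_unlock(l)
#endif

struct hash_row {
    ROWLOCK_T lock;   /* one lock per hash row keeps contention local */
    void *head;       /* bucket chain */
};

static int row_contains(struct hash_row *row, const void *key)
{
    int found = 0;
    ROWLOCK_RDLOCK(&row->lock);  /* lookups only need shared access */
    /* ... walk row->head and compare entries against key ... */
    (void)key;
    ROWLOCK_UNLOCK(&row->lock);
    return found;
}

The appeal of the rwlock is that read-only lookups from the 24 workers would no longer serialize against each other on a row, while the writer still gets exclusive access; a spinlock mainly removes the futex/syscall overhead of an uncontended or briefly contended mutex.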

What would be your suggestion?
Thanks!

Hi,

Could you give us a bit more detail? We’re also looking into some cases where datasets might have a performance impact like that due to locks.

What version?

What does the suricata.yaml look like?

What type of datasets are you using exactly?

Do you have the option to do some tests?
One simple test would be to run just 1 dataset and observe it, and afterwards add a second one in a new run. I’ve seen scenarios where the performance hit already showed up with 2 datasets.

Could you also run perf top -p $(pidof suricata) and share the output in those cases?

Our build is based on Suricata 7.0.3.

YAML summary:

DPDK IPS mode
only the http/dns/tls protocols enabled

flow:
  memcap: 8192mb
  hash-size: 65536
  prealloc: 65536
  emergency-recovery: 30

stream:
  memcap: 32gb
  checksum-validation: yes
  inline: auto
  midstream: true
  async-oneside: true
  reassembly:
    memcap: 32gb
    toserver-chunk-size: 2560
    toclient-chunk-size: 2560
    randomize-chunk-size: yes

threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 1 ]
    - receive-cpu-set:
        cpu: [ 3 ]
    - worker-cpu-set:
        cpu: [ 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46 ]
        mode: "exclusive"
        prio:
          low: [ 0 ]
          medium: [ "1-2" ]
          high: [ 3 ]
          default: "medium"
  detect-thread-ratio: 1.0

For our current test, we have 10 IPv4 datasets and 5 IPv6 datasets, totalling around 400 000 entries spread among them.
(We also need some string datasets, but they are disabled for the moment.)

Yes, we have the option to test different configs.
I already have perf output; see the attached flamegraph.

I guess my only suggestion is to try both and see how it works. If you do, please report your results here.

Could you do 3 different runs and compare the perf top results?

  1. Without any IP Dataset
  2. With just 1 IP Dataset
  3. With just 2 IP Datasets

I want to see if your results compare to mine where I saw a big increase in the overhead once the second IP dataset was active.

Hi,
Here are some benchmarks. My original question was mainly focused on the impact of using locks on dataset queries, so the targeted testing was done on the dataset only, within a specialized unit test.

Mutex vs Spinlock: your mileage may vary depending on the access patterns and number of threads, but here are some stats for runs done on one NUMA node (24 cores) of an old Xeon processor.
The chart shows the number of queries done per thread per second. Overall, adding more threads gets more work done, but individual thread performance drops as the number of threads increases. The numbers are very good for both Mutex and Spinlock; nothing justifies changing the code to spinlocks. (A sketch of the kind of harness behind these numbers is shown after the legend below.)

Query against 1 dataset containing 128 000k IPv4 entries
Legend
X: number of threads
Y: dataset queries per second per thread
M_: mutex, S_: spinlock
#%: percentage of queries that return true
20Write/sec: that test had 1 thread writing 20 updates per second to the dataset while the other threads were querying.
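Here is that sketch: a standalone pthread micro-benchmark that approximates the setup (illustrative only; lookup() is a stand-in for the real dataset query behind a single mutex, and none of the names are Suricata internals). Each reader thread counts its own lookups and reports queries per second per thread, i.e. the Y axis above.

/* Standalone sketch of the micro-benchmark: N reader threads hammer a shared,
 * mutex-protected table for a fixed duration and report per-thread query rates.
 * Build: cc -O2 -pthread bench.c -o bench && ./bench 24 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define TABLE_SIZE  (1u << 17)   /* stand-in for the dataset hash */
#define RUN_SECONDS 5

static uint32_t table[TABLE_SIZE];
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int running = 1;  /* good enough for a sketch */

struct worker {
    pthread_t tid;
    uint64_t queries;
};

/* Stand-in for a dataset lookup: take the lock, probe the table, release. */
static int lookup(uint32_t key)
{
    pthread_mutex_lock(&table_lock);
    int hit = (table[key % TABLE_SIZE] == key);
    pthread_mutex_unlock(&table_lock);
    return hit;
}

static void *reader(void *arg)
{
    struct worker *w = arg;
    uint32_t key = (uint32_t)(uintptr_t)arg;   /* cheap per-thread seed */
    while (running) {
        key = key * 1664525u + 1013904223u;    /* LCG: pseudo-random keys */
        lookup(key);
        w->queries++;
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = argc > 1 ? atoi(argv[1]) : 4;
    struct worker *workers = calloc((size_t)nthreads, sizeof(*workers));

    for (int i = 0; i < nthreads; i++)
        pthread_create(&workers[i].tid, NULL, reader, &workers[i]);

    sleep(RUN_SECONDS);
    running = 0;

    for (int i = 0; i < nthreads; i++) {
        pthread_join(workers[i].tid, NULL);
        printf("thread %2d: %.0f queries/sec\n", i,
               (double)workers[i].queries / RUN_SECONDS);
    }
    free(workers);
    return 0;
}

Swapping the mutex in lookup() for a spinlock or an rwlock gives the M_ vs S_ comparison, and adding one extra thread that periodically takes the lock to write gives the 20Write/sec variant.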

My conclusion is that I won’t worry about the mutex lock for the moment. Overall performance is great and does not require us to spend dev time on optimizing the datasets.

Here is another one, using Mutex only, with multiple combinations of number of threads, number of datasets, and number of datasets queried on each loop.
This test focuses only on the case where 0% of dataset queries return true. (The 1_rr and #_ss query patterns from the legend are sketched in code right after it.)

Query against datasets containing 128 000k IPv4 entries each
Legend
X: number of threads
Y: number of queries per second per thread
#_DS: number of datasets
0%: percentage of dataset queries returning true
1_rr: searches one of the datasets, alternating datasets between loops
#_ss: searches # datasets one after the other on each loop
ex: 4_DS_0%_4_ss: 4 datasets configured, each searched one after the other on every loop
ex: 2_DS_0%_1_rr: 2 datasets configured, one dataset searched per loop, alternating between them
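To make the 1_rr and #_ss patterns concrete, here is roughly what the two query loops look like (dataset_lookup() and NUM_DATASETS are placeholders for illustration, not Suricata APIs):

#include <stdint.h>

#define NUM_DATASETS 4   /* the "#_DS" figure from the legend */

/* Placeholder for a dataset query; returns 1 on a hit. Not a Suricata API. */
extern int dataset_lookup(unsigned dataset_id, uint32_t key);

/* "1_rr": one lookup per loop iteration, alternating across the datasets. */
void query_round_robin(uint32_t key, unsigned iteration)
{
    dataset_lookup(iteration % NUM_DATASETS, key);
}

/* "4_ss": every configured dataset is searched, one after the other,
 * on every loop iteration. */
void query_sequential(uint32_t key)
{
    for (unsigned d = 0; d < NUM_DATASETS; d++)
        dataset_lookup(d, key);
}

The only difference between the two patterns is how many lookups, and therefore lock acquisitions, happen per loop iteration.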

I should be able to provide more numbers from a full run of Suricata configured with multiple datasets and real traffic packets at some point in the near future.
Again, the numbers are very good. I don’t have an explanation for the big drop, but it might be related to the cache size and the number of elements in the dataset. Even after the drop the numbers are good.

Thanks for the detailed analysis.

On the cutoff after 12 threads: is the test CPU perhaps 12 cores / 24 hyperthreads?

It’s 48 cores on 2 sockets/NUMA nodes.
We are using NUMA node 1 for these tests, so we bind the process to cores 1,3,5,7,9,11,…,23.
But I will double-check the tests…

Also, the most interesting discovery these tests gave us, not shown here, is that on CPU-intensive benchmarks like these, a Docker container runs 47% slower than native.
If we disable seccomp (syscall filtering for security) and use host networking and privileged mode, things get better and it runs 32% slower than native. We are still looking for a way to get that 32% back, but will run natively until then.