We are using Suricata datasets intensively.
On our test system, we have 24 cores performing dataset lookup operations in the dpdk run-mode, and 1 writer performing updates.
We have noted that, even when the writer is disabled, half of the dataset lookup time is spent in pthread_mutex_lock/pthread_mutex_unlock.
I plan to test using spinlocks (#define HRLOCK_SPIN), but it feels like switching to rwlocks would be the most beneficial.
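For context, the row locks can be switched at compile time between a mutex and a spinlock. The snippet below is only a minimal sketch of that compile-time switch using plain pthread types; the ROWLOCK_* macro names are illustrative, not the exact Suricata symbols. A rwlock variant would follow the same pattern, taking pthread_rwlock_rdlock on the read-mostly lookup path.

```c
#include <pthread.h>

/* Illustrative compile-time lock selection, similar in spirit to the
 * HRLOCK_SPIN option. Macro and type names here are hypothetical. */
#ifdef ROWLOCK_SPIN
typedef pthread_spinlock_t RowLock;
#define ROWLOCK_INIT(l)   pthread_spin_init((l), PTHREAD_PROCESS_PRIVATE)
#define ROWLOCK_LOCK(l)   pthread_spin_lock(l)
#define ROWLOCK_UNLOCK(l) pthread_spin_unlock(l)
#else /* default: mutex */
typedef pthread_mutex_t RowLock;
#define ROWLOCK_INIT(l)   pthread_mutex_init((l), NULL)
#define ROWLOCK_LOCK(l)   pthread_mutex_lock(l)
#define ROWLOCK_UNLOCK(l) pthread_mutex_unlock(l)
#endif
```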
Could you give us a bit more details? We're also looking into some cases where datasets might have an impact on performance like that due to locks.
What version?
What does the suricata.yaml look like?
What type of datasets are you using exactly?
Do you have the option to do some tests?
One simple test would be to just run 1 dataset and observe it, and afterwards add a second one in a new run. I've seen scenarios where the performance hit was already visible with 2 datasets.
Could you also run perf top -p $(pidof suricata) and share the output in those cases?
For our current test:
We have 10 ipv4 datasets and 5 ipv6 datasets,
totalling around ~400 000 entries spread among them.
(We also need some string datasets, but they are disabled for the moment.)
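For reference, a setup like this would typically be declared in the datasets section of suricata.yaml (or via the dataset keyword in rules). The snippet below is only a hypothetical sketch: the names, paths, sizes and exact type keywords depend on the Suricata version and on our actual configuration.

```yaml
datasets:
  defaults:
    memcap: 100mb
    hashsize: 65536
  blocklist-v4-01:          # one of the ~10 IPv4 sets (name is illustrative)
    type: ipv4
    load: /etc/suricata/data/blocklist-v4-01.lst
  blocklist-v6-01:          # one of the ~5 IPv6 sets (name is illustrative)
    type: ip                # the "ip" type also covers IPv6 in recent versions
    load: /etc/suricata/data/blocklist-v6-01.lst
```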
Hi,
Here are some benchmarks. My original question was mainly focused on the impact of the locks on dataset queries, so targeted testing was done on the dataset code only, within a specialized unit test.
Mutex vs Spinlock: your mileage may vary depending on the access patterns and the number of threads, but here are some stats for runs done on one NUMA node (24 cores) of an old Xeon processor.
It shows the number of queries done per thread per second. Overall, adding more threads results in more total work done, but there is an impact on individual thread performance as the number of threads increases. The numbers are very good for both mutex and spinlock; nothing justifies changing the code to spinlocks.
Queries against 1 dataset containing 128 000k ipv4 entries
Legend
X: number of threads
Y: dataset queries per second per thread
M_: mutex, S_: spinlock
#%: percentage of queries that return true
20Write/sec: that test had 1 thread writing 20 updates per second to the dataset while the other threads were querying
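For those curious about the shape of the targeted test, it was roughly the pattern below: N threads hammering a shared lookup under a single lock for a fixed duration, then reporting queries per second per thread. This is a simplified, self-contained sketch, not the actual unit test; the dummy array stands in for the dataset and is not how Suricata stores them. Build with `cc -O2 -pthread bench.c` and add -DUSE_SPIN to try the spinlock variant.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS     24
#define DURATION_SEC 5
#define NENTRIES     (128 * 1000)

#ifdef USE_SPIN
static pthread_spinlock_t g_lock;
#define LOCK()   pthread_spin_lock(&g_lock)
#define UNLOCK() pthread_spin_unlock(&g_lock)
#else
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
#define LOCK()   pthread_mutex_lock(&g_lock)
#define UNLOCK() pthread_mutex_unlock(&g_lock)
#endif

static uint32_t g_entries[NENTRIES];    /* stand-in for the dataset      */
static atomic_int g_stop;

static int dummy_lookup(uint32_t key)   /* stand-in for a dataset query  */
{
    LOCK();
    int hit = (g_entries[key % NENTRIES] == key);
    UNLOCK();
    return hit;
}

static void *worker(void *arg)
{
    uint64_t *count = arg;
    uint32_t key = (uint32_t)(uintptr_t)arg;   /* cheap per-thread seed  */
    while (!atomic_load(&g_stop)) {
        key = key * 1664525u + 1013904223u;    /* LCG to vary the key    */
        dummy_lookup(key);
        (*count)++;
    }
    return NULL;
}

int main(void)
{
#ifdef USE_SPIN
    pthread_spin_init(&g_lock, PTHREAD_PROCESS_PRIVATE);
#endif
    for (uint32_t i = 0; i < NENTRIES; i++)
        g_entries[i] = i;

    pthread_t tid[NTHREADS];
    static uint64_t counts[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, &counts[i]);

    sleep(DURATION_SEC);
    atomic_store(&g_stop, 1);

    uint64_t total = 0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += counts[i];
    }
    printf("%.0f queries/sec/thread\n",
           (double)total / NTHREADS / DURATION_SEC);
    return 0;
}
```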
The conclusion is that I won't worry about the mutex lock for the moment. Overall performance is great and does not require us to spend dev time on optimizing the datasets.
Here is another one, just using the mutex,
with multiple combinations of number of threads, number of datasets, and number of datasets queried on each loop.
This test focuses only on the case where 0% of dataset queries return true.
Queries against datasets containing 128 000k ipv4 entries each
Legend
X: Number of threads
Y: number of queries per second per thread
#_DS: number of datasets
0%: percentage of dataset queries returning true
1_rr: searches one of the datasets, alternating the dataset between each loop
#_ss: searches # datasets one after the other on each loop
ex: 4_DS_0%_4_ss: 4 datasets configured, searched one after the other on each loop
ex: 2_DS_0%_1_rr: 2 datasets configured, one dataset searched alternately on each loop
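To make the rr/ss naming concrete, the two per-loop query patterns look roughly like this (placeholder types and function names, not the actual test code):

```c
#include <stdint.h>

/* Placeholders standing in for the real dataset handles and query call. */
typedef struct Dataset Dataset;
int dataset_query(Dataset *ds, uint32_t key);

/* "#_ss": query every configured dataset, one after the other, each loop */
static void query_ss(Dataset **datasets, int ndatasets, uint32_t key)
{
    for (int d = 0; d < ndatasets; d++)
        dataset_query(datasets[d], key);
}

/* "1_rr": query a single dataset per loop, rotating across the datasets */
static void query_rr(Dataset **datasets, int ndatasets, uint32_t key,
                     uint64_t loop_count)
{
    dataset_query(datasets[loop_count % ndatasets], key);
}
```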
I should be able to provide more numbers from a full run of Suricata configured with multiple datasets on real traffic at some point in the near future.
Again, the numbers are very good. I don't have an explanation for the big drop, but it might be related to the cache size and the number of elements in the dataset. Even after the drop the numbers are good.
It's 48 cores on 2 sockets/NUMA nodes.
We are using NUMA node 1 for these tests, so we bind the process to cores 1,3,5,7,9,11,…,23.
But I will double check the tests…
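For reference, the binding was done with commands along these lines (illustrative only; with DPDK the affinity is normally also set via the threading section of suricata.yaml, and the exact core list depends on what lscpu/numactl report for the box):

```sh
# Bind CPU and memory to NUMA node 1 (the odd-numbered cores on this box)
numactl --cpunodebind=1 --membind=1 -- suricata -c suricata.yaml --dpdk

# or pin explicitly; taskset accepts a stride syntax, 1-23:2 == 1,3,5,...,23
taskset -c 1-23:2 suricata -c suricata.yaml --dpdk
```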
Also, the most interesting discovery these tests provided us, not shown here, is that
on CPU-intensive benchmarks like these, a Docker container will run 47% slower than native.
If we disable seccomp (syscall filtering for security) and use host networking and privileged mode, things get better and it runs 32% slower than native. We are still looking for ways to get that 32% back, but we will run natively until then.
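For anyone who wants to reproduce the comparison, the relaxed container settings correspond roughly to the following Docker flags (image name and Suricata arguments are placeholders):

```sh
# Default (hardened) container run, for comparison
docker run --rm my-suricata-image suricata -c /etc/suricata/suricata.yaml --dpdk

# Relaxed run: seccomp disabled, host networking, privileged mode
docker run --rm \
  --security-opt seccomp=unconfined \
  --network host \
  --privileged \
  my-suricata-image suricata -c /etc/suricata/suricata.yaml --dpdk
```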