Suricata may have issues matching .* in some PCRE patterns

.* is a regular expression pattern that matches any character (by default, any character except a newline) zero or more times. During some incidental testing, I discovered that Suricata seems to have problems matching certain whitespace characters (such as spaces, \r, \t, etc.) in some packets when .* is used. A specific example is as follows.
The official Snort 2 rule set contains the rule shown below, followed by a simple packet that can trigger it. Please note this part of the PCRE in the rule: (?P<id2>.+?)(?P=m2)(\s|>).* . It matches the id value and the closing quote m2, followed by a whitespace character or >, then zero or more arbitrary characters before the rest of the pattern. However, during testing I found that if the characters matched by .* are whitespace characters other than newlines, only up to nine of them can be matched; with more, the match fails. Relevant packet examples, PCAP files, and rule files are provided below.
Consider this official Snort 2 rule:

alert tcp $EXTERNAL_NET $HTTP_PORTS -> $HOME_NET any (msg:"BROWSER-PLUGINS IBM Access Support ActiveX clsid access"; flow:to_client,established; file_data; content:"74FFE28D-2378-11D5-990C-006094235084"; fast_pattern:only; pcre:"/(<object\s*[^>]*\s*id\s*=\s*(?P<m1>\x22|\x27|)(?P<id1>.+?)(?P=m1)(\s|>)[^>]*\s*classid\s*=\s*(?P<q1>\x22|\x27|)\s*clsid\s*\x3a\s*{?\s*74FFE28D-2378-11D5-990C-006094235084\s*}?\s*(?P=q1)(\s|>).*(?P=id1)\s*\.\s*(GetXMLValue)|<object\s*[^>]*\s*classid\s*=\s*(?P<q2>\x22|\x27|)\s*clsid\s*\x3a\s*{?\s*74FFE28D-2378-11D5-990C-006094235084\s*}?\s*(?P=q2)(\s|>)[^>]*\s*id\s*=\s*(?P<m2>\x22|\x27|)(?P<id2>.+?)(?P=m2)(\s|>).*(?P=id2)\.(GetXMLValue))/siO"; metadata:policy max-detect-ips drop, service http; reference:bugtraq,34228; reference:cve,2009-0215; classtype:attempted-user; sid:16746; rev:10;)

A simple HTTP packet that triggers the rule looks like this:

triggered packet

<object classid='clsid:17A54E7D-A9D4-11D8-9552-00E04CB09903' id=testid>testid.SceneURL

Now, if we modify this packet by simply adding ten spaces after the > (the position matched by .* in the regular expression), the packet that originally triggered the alert no longer does so.

untriggered packet
<object classid='clsid:17A54E7D-A9D4-11D8-9552-00E04CB09903' id=testid>           testid.SceneURL

triggered.pcap (1.5 KB)
untriggered.pcap (1.3 KB)
If this issue is widespread, it could potentially allow bypassing a significant number of Suricata alerts simply by using spaces in these specific places.
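As a sanity check that the pattern itself should match regardless of the number of spaces, the sketch below runs the same expression through Python's re module. This is only a reference point: Python's re is not PCRE and does not apply Suricata's match limits, the s and i modifiers are mapped to re.S and re.I, the O modifier is dropped, and the literal braces are escaped so the pattern parses the same way everywhere. The body() helper is a hypothetical name used for illustration.

```python
import re

CLSID = "17A54E7D-A9D4-11D8-9552-00E04CB09903"

# First alternative of the rule's PCRE: id attribute before classid.
ALT_ID_FIRST = (
    r"<object\s*[^>]*\s*id\s*=\s*(?P<m1>\x22|\x27|)(?P<id1>.+?)(?P=m1)(\s|>)"
    r"[^>]*\s*classid\s*=\s*(?P<q1>\x22|\x27|)\s*clsid\s*\x3a\s*\{?\s*"
    + CLSID
    + r"\s*\}?\s*(?P=q1)(\s|>).*(?P=id1)\s*\.\s*(SceneURL)"
)

# Second alternative: classid attribute before id.
ALT_CLASSID_FIRST = (
    r"<object\s*[^>]*\s*classid\s*=\s*(?P<q2>\x22|\x27|)\s*clsid\s*\x3a\s*\{?\s*"
    + CLSID
    + r"\s*\}?\s*(?P=q2)(\s|>)[^>]*\s*id\s*=\s*(?P<m2>\x22|\x27|)(?P<id2>.+?)(?P=m2)(\s|>)"
    r".*(?P=id2)\.(SceneURL)"
)

# /s -> re.S (dot matches newline), /i -> re.I (case-insensitive).
PATTERN = re.compile("(" + ALT_ID_FIRST + "|" + ALT_CLASSID_FIRST + ")", re.S | re.I)


def body(spaces):
    """Build the classid-first test payload with the given number of spaces after '>'."""
    return "<object classid='clsid:" + CLSID + "' id=testid>" + " " * spaces + "testid.SceneURL"


# Check whether the pattern matches for several space counts.
results = {n: PATTERN.search(body(n)) is not None for n in (0, 9, 10, 20)}
for n, matched in results.items():
    print(f"{n:2d} spaces -> {'match' if matched else 'no match'}")
```

In this reference engine the expression matches for every space count tried, which is why the missing alerts in Suricata look unexpected to me.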

Based on this issue, I have continued to expand and modify the packets, uncovering many peculiar phenomena. While I am unable to attribute these phenomena to specific causes in detail, I still believe they are related to Suricata’s PCRE matching mechanism.

  1. When a rule's PCRE contains several alternatives that match different kinds of packets, the first alternative matches far more reliably than the second.
    Taking this rule as an example, its PCRE is divided into two very similar parts. The first part is: /(<object\s*[^>]*\s*id\s*=\s*(?P<m1>\x22|\x27|)(?P<id1>.+?)(?P=m1)(\s|>)[^>]*\s*classid\s*=\s*(?P<q1>\x22|\x27|)\s*clsid\s*\x3a\s*{?\s*17A54E7D-A9D4-11D8-9552-00E04CB09903\s*}?\s*(?P=q1)(\s|>).*(?P=id1)\s*\.\s*(SceneURL), and the second part is: <object\s*[^>]*\s*classid\s*=\s*(?P<q2>\x22|\x27|)\s*clsid\s*\x3a\s*{?\s*17A54E7D-A9D4-11D8-9552-00E04CB09903\s*}?\s*(?P=q2)(\s|>)[^>]*\s*id\s*=\s*(?P<m2>\x22|\x27|)(?P<id2>.+?)(?P=m2)(\s|>).*(?P=id2)\.(SceneURL). The only difference between the two is the order in which they expect classid and id. However, the first part still triggers alerts with many whitespace characters, while the second part, on my computer, only matches when there are fewer than ten. For instance, the first packet below triggers an alert, whereas the second does not. The corresponding PCAPs are attached below.
triggered packet (pattern 1 with 20 spaces)
<object id=testid classid='clsid:17A54E7D-A9D4-11D8-9552-00E04CB09903'>                    testid.SceneURL
untriggered packet (pattern 2 with only 10 spaces)
<object classid='clsid:17A54E7D-A9D4-11D8-9552-00E04CB09903' id=testid>          testid.SceneURL

However, comparing the two parts of the PCRE shows that they are identical apart from the single difference I mentioned above.
triggered1.pcap (1.6 KB)
untriggered1.pcap (1.6 KB)

  2. Regarding alternations in the PCRE such as (\s|>), it is unclear whether the presence of > affects how the packet is matched. When the (\s|>) group is matched by \s instead of >, fewer spaces are needed to reproduce the whitespace issue above: in my tests, as few as nine spaces were enough to stop the alert from triggering. Although this is only a difference of one space, I believe it arises because the matcher can assign the same characters to either (\s|>) or .* (both can match spaces).
    This could also imply that if two adjacent parts of a regular expression can match the same characters, matches may be missed. (This is merely personal speculation.)
untriggered packet (with 9 spaces and without >)
<object classid='clsid:17A54E7D-A9D4-11D8-9552-00E04CB09903' id=testid         testid.SceneURL

untriggered2.pcap (1.6 KB)
  3. Excess characters after the text matched by the regular expression can also prevent detection. For example, if several non-whitespace characters are appended to the end of the packet above, only three spaces are needed to stop Suricata from triggering an alert.

untriggered packet (only 3 spaces and a long irrelevant tail)
<object classid='clsid:17A54E7D-A9D4-11D8-9552-00E04CB09903' id=testid>   testid.SceneURLaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

untriggered3.pcap (1.6 KB)
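The payload variants used in the three observations above can be generated with a small helper like the one below, which makes it easier to search for the exact space count at which alerts stop. This is only a sketch: make_body and http_response are hypothetical names, and the minimal HTTP response wrapper is my assumption about how the body would be served, not a description of how the attached PCAPs were made.

```python
CLSID = "17A54E7D-A9D4-11D8-9552-00E04CB09903"


def make_body(spaces, id_first=False, close_tag=True, tail=""):
    """Build one test payload.

    spaces    -- number of spaces placed before 'testid.SceneURL'
    id_first  -- True puts id before classid (first PCRE alternative)
    close_tag -- False omits the '>' so that (\\s|>) has to match a space
    tail      -- irrelevant characters appended after the matching text
    """
    classid_attr = "classid='clsid:" + CLSID + "'"
    attrs = ("id=testid " + classid_attr) if id_first else (classid_attr + " id=testid")
    tag = "<object " + attrs + (">" if close_tag else "")
    return tag + " " * spaces + "testid.SceneURL" + tail


def http_response(body):
    """Wrap a payload in a minimal HTTP/1.1 response (assumed framing)."""
    data = body.encode("ascii")
    head = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/html\r\n"
        f"Content-Length: {len(data)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + data


# Variants corresponding to the observations in this post.
variants = {
    "baseline": make_body(0),
    "pattern 1, 20 spaces": make_body(20, id_first=True),
    "pattern 2, 10 spaces": make_body(10),
    "9 spaces, no '>'": make_body(9, close_tag=False),
    "3 spaces, long tail": make_body(3, tail="a" * 50),
}
for name, payload in variants.items():
    print(f"{name}: {len(http_response(payload))} bytes")
```

The tail length of 50 is arbitrary; I have not determined how long the tail needs to be for the three-space case to stop alerting.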

The testing environment I used is Suricata 8.0.0-dev with the default suricata.yaml configuration. The phenomena and simple reflections I mentioned above are merely some interesting observations discovered during the testing process. If these issues are caused by my configuration, I would like to ask how to adjust the configuration to resolve them. If not, I believe these phenomena might provide some insights for developers to further improve Suricata, which is why I feel it is necessary to share them here. If you would like to reproduce the issues mentioned above, you can use the following command:

sudo suricata -k none -c /etc/suricata/suricata.yaml -S <the_rule_below> -r triggered/untriggered.pcap -l logs
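
To replay several PCAPs in a row, the command above can be wrapped in a short script. The sketch below uses only the flags from that command; build_command, count_alerts, and replay are hypothetical helper names. It assumes that the fast.log output is enabled in the configuration, that alert lines carry a [gid:sid:rev] tag with gid 1, and that the user running it can write to the log directory (sudo is omitted here).

```python
import subprocess
from pathlib import Path


def build_command(rule_file, pcap, log_dir, config="/etc/suricata/suricata.yaml"):
    """Return the suricata command line used in this post as an argument list."""
    return [
        "suricata", "-k", "none",
        "-c", str(config),
        "-S", str(rule_file),
        "-r", str(pcap),
        "-l", str(log_dir),
    ]


def count_alerts(log_dir, sid):
    """Count fast.log lines that mention the given sid (assumes gid 1)."""
    fast_log = Path(log_dir) / "fast.log"
    if not fast_log.exists():
        return 0
    marker = f"[1:{sid}:"
    lines = fast_log.read_text(errors="replace").splitlines()
    return sum(1 for line in lines if marker in line)


def replay(rule_file, pcap, log_dir, sid):
    """Run suricata on one pcap and return the number of alerts for sid."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_command(rule_file, pcap, log_dir), check=True)
    return count_alerts(log_dir, sid)
```

For example, calling replay with the rule file, untriggered.pcap, a fresh log directory per PCAP, and the sid of the rule under test should return 0 if the behaviour described here reproduces.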

My configuration:
suricata.yaml (83.8 KB)

I would greatly appreciate any attention or feedback on this matter. Thank you very much for your time and consideration.