Commit graph

106 commits

Author SHA1 Message Date
Ali Mohammad Pur 27a38932da LibRegex: Account for extra explicit And/Or in class parser assertion
Fixes #23691.
2024-03-24 08:24:46 +01:00
Ali Mohammad Pur e265d81277 LibRegex: Correct And/Or and inversion interplay semantics
This commit also fixes an incorrect test case from very early on; our
behaviour now matches the ECMA262 spec in this case.

Fixes #21786.
2024-01-11 11:36:09 +01:00
Ali Mohammad Pur 267040dde7 LibRegex: Error out on Eof when parsing nonempty class range elements
Fixes #22507.
2023-12-31 15:36:42 +01:00
Ali Mohammad Pur 5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep \.cpp\|\.h)
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Timothy Flynn e122039c99 LibRegex: Support non-ASCII case-insensitive character comparisons
Specifically, when the Unicode flag is set, use Unicode-aware case
folding to case-insensitively compare code points.
2023-11-08 12:54:26 -05:00
Timothy Flynn 3fbf33bd37 LibRegex: Change a couple function parameters to east-const
Automatically done by clang-format-17 (and clang-format-16 leaves these
alone afterwards).
2023-11-08 12:54:26 -05:00
Ali Mohammad Pur 4d71f4edc4 LibRegex: Don't add the Repeat instruction size to its jump target
This was causing the calculated jump target to become invalid, leading
to possibly invalid optimisations and (more likely) crashes.
Fixes #21047.
2023-09-15 18:07:23 +03:30
Ali Mohammad Pur 4d27257c45 LibRegex: Treat backwards jumps to IP 0 as normal backwards jumps too
This shows up in something like /\d+|x/, where the `+` ends up jumping
to the start of its own alternative.
2023-08-16 22:20:24 +03:30
Ali Mohammad Pur e689422564 LibRegex: Keep track of instruction positions for backwards tree jumps
2023-08-05 16:40:04 +02:00
Ali Mohammad Pur 4e69eb89e8 LibRegex: Generate a search tree when patterns would benefit from it
This takes the previous alternation optimisation and applies it to all
the alternation blocks instead of just the few instructions at the
start.
By generating a trie of instructions, all logically equivalent
instructions will be consolidated into a single node, allowing the
engine to avoid checking the same thing multiple times.
For instance, given the pattern /abc|ac|ab/, this optimisation would
generate the following tree:
    - a
    | - b
    | | - c
    | | | - <accept>
    | | - <accept>
    | - c
    | | - <accept>
which attempts to match 'a' or 'b' only once, and also limits the amount
of backtracking performed when an alternative fails to match.

This optimisation is currently gated behind a simple cost model that
estimates the number of instructions generated, which is pessimistic for
small patterns, though the change in performance in such patterns is not
particularly large.
2023-07-31 05:31:33 +02:00
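The trie described in 4e69eb89e8 can be illustrated with a small Python sketch. This is not LibRegex's implementation (which works on bytecode instructions in C++); it assumes each alternative is a plain string of literal characters, and the names `build_trie`, `render` and `match_length` are made up for the example.

```python
ACCEPT = "<accept>"

def build_trie(alternatives):
    """Merge alternatives that share a prefix into a single path."""
    root = {}
    for alt in alternatives:
        node = root
        for ch in alt:
            node = node.setdefault(ch, {})
        node[ACCEPT] = {}  # mark the end of one alternative
    return root

def render(node, depth=0):
    """Render the trie in the same layout the commit message uses."""
    lines = []
    for key, child in node.items():
        lines.append("| " * depth + "- " + key)
        lines.extend(render(child, depth + 1))
    return lines

def match_length(node, text, pos=0):
    """Depth-first walk; returns the end index of the first accepting path."""
    for key, child in node.items():
        if key == ACCEPT:
            return pos
        if pos < len(text) and text[pos] == key:
            end = match_length(child, text, pos + 1)
            if end is not None:
                return end
    return None

trie = build_trie(["abc", "ac", "ab"])
print("\n".join(render(trie)))
```

Because the shared 'a' lives in one node, the walk compares it once no matter how many alternatives start with it.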
Timothy Flynn 8b668da9d5 LibRegex: Bail parsing class set characters upon early EOF
Otherwise, we reach a skip() invocation at the end of this function,
which crashes due to EOF. Caught by test262.
2023-06-23 20:22:45 +02:00
Ali Mohammad Pur b1ca2e5e39 LibRegex: Do not treat repeats followed by fallthroughs as atomic
2023-06-14 06:41:17 +02:00
Ali Mohammad Pur eba466b8e7 LibRegex: Avoid calling GenericLexer::consume() past EOF
The consume(size_t) overload consumes "at most" as many bytes as
requested, but consume() consumes exactly one byte.
This commit makes sure to avoid consuming past EOF.

Fixes #18324.
Fixes #18325.
2023-04-14 12:33:54 +02:00
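The difference between the two consume variants in eba466b8e7 can be sketched with a toy lexer. This is a hypothetical Python stand-in, not AK's GenericLexer: Python has no overloads, so the "at most N" variant is named `consume_up_to` here, and raising `IndexError` at EOF is an assumption made for the example.

```python
class ToyLexer:
    def __init__(self, text):
        self.text = text
        self.index = 0

    def is_eof(self):
        return self.index >= len(self.text)

    def consume(self):
        """Consume exactly one character; the caller must check for EOF first."""
        if self.is_eof():
            raise IndexError("consume() called at EOF")
        ch = self.text[self.index]
        self.index += 1
        return ch

    def consume_up_to(self, count):
        """Consume at most `count` characters, stopping quietly at EOF."""
        start = self.index
        self.index = min(len(self.text), self.index + count)
        return self.text[start:self.index]

def consume_if_available(lexer):
    """Guarded call: only consume a single character when one is left."""
    return None if lexer.is_eof() else lexer.consume()
```

The guard in `consume_if_available` mirrors the kind of EOF check the commit describes adding before single-character consumes.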
Ali Mohammad Pur 6fc9f5fa28 LibRegex: Make ^ and $ accept all LineTerminators instead of just '\n'
Also adds a couple tests.
2023-03-25 15:44:05 +01:00
Ali Mohammad Pur 7f530c0753 LibRegex: Bail out of atomic rewrite if a block doesn't contain compares
If a block jumps before performing a compare, we'd need to recursively
find the first of the jumped-to block. While this is doable, it's not
really worth spending the time as most such cases won't actually qualify
for atomic loop rewrite anyway.
Fixes an invalid rewrite when `.+` is followed by an alternation, e.g.
/.+(a|b|c)/.
2023-02-15 10:14:26 +01:00
Ali Mohammad Pur af441bb939 LibRegex: Consider the inverse=true case when finding pattern overlap
Previously we were only checking for overlap when the range wasn't in
inverse mode, which made us miss things like /[^x]x/; this patch makes
it so we don't miss that.
2023-02-15 10:14:26 +01:00
Ali Mohammad Pur 936a9fd759 LibRegex: Make '.' reject matching LF / LS / PS as per the ECMA262 spec
Previously we allowed it to match those, but the ECMA262 spec disallows
these (except in DotAll).
2023-02-15 10:14:26 +01:00
Ali Mohammad Pur 1e022295c4 Tests: Use .is_flag_set() instead of bitwise & in Regex flag tests
The default flag might not be zero, so don't assume masking off flags
will yield zero.
2023-02-15 10:14:26 +01:00
Linus Groh 6e7459322d AK: Remove StringBuilder::build() in favor of to_deprecated_string()
Having an alias function that only wraps another one is silly, and
keeping the more obvious name should flush out more uses of deprecated
strings.
No behavior change.
2023-01-27 20:38:49 +00:00
Timothy Flynn 1edb96376b AK+Everywhere: Make UTF-8 and UTF-32 to UTF-16 converters fallible
These could fail to allocate the underlying storage needed to store the
UTF-16 data. Propagate these errors.
2023-01-08 12:13:15 +01:00
Ben Wiederhake 3281050359 Everywhere: Remove "LibC/" includes, add lint-rule against it
2023-01-07 10:01:37 -07:00
Eli Youngs 87a961534f LibRegex: Prevent patterns from matching the empty string twice
Previously, if a pattern matched the empty string (e.g. ".*"), it would
match the string twice instead of once. Among other issues, this caused
a Regex replacement to duplicate its expected output, since it would
replace "both" empty matches.
2023-01-06 13:52:21 -07:00
Ben Wiederhake 8a331d4fa0 Everywhere: Move AK/Debug.h include to using files or remove
2023-01-02 20:27:20 -05:00
Ben Wiederhake b83cb09db1 Everywhere: Fix badly-formatted includes
In 7c5e30daaa, the focus was "only" on
Userland/Libraries/, whereas this commit cleans up the remaining
headers in the repo, and any new badly-formatted include.
2023-01-02 11:06:15 -05:00
Linus Groh 57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and to track down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh 6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Linus Groh d26aabff04 Everywhere: Run clang-format
2022-12-03 23:52:23 +00:00
Ali Mohammad Pur 00326a63ed LibRegex: Don't treat ForkReplace* as new forks
2022-11-09 21:28:54 +01:00
Andrew Kaster 51ebf20200 Tests: Remove LibRegex benchmark test file that has become stale
This test file had #ifdef macros at the top that caused none of the
content to be compiled unless a developer manually wanted to run the
specific benchmarks within. As such, it has become stale. Remove it for
now, if someone wants to restore it in an always-runnable state, we can
restore the specific tests it's trying to benchmark.
2022-10-10 12:23:12 +02:00
Ali Mohammad Pur 660d2b53b1 LibRegex: Account for eof after \<x> when 'x' leads to legacy behaviour
2022-09-12 16:03:57 +04:30
Timothy Flynn fc8bf7ac3e LibUnicode+Userland: Migrate generated CLDR data to LibLocaleData
Currently, LibUnicodeData contains the generated UCD and CLDR data. Move
the UCD data to the main LibUnicode library, and rename LibUnicodeData
to LibLocaleData. This is another preparatory change to migrate to
LibLocale.
2022-09-05 14:37:16 -04:00
Timothy Flynn 48cb15283a LibRegex: Explicitly check if a character falls into a table-based range
Previously, for a regex such as /[a-sy-z]/i, we would incorrectly think
the character "u" fell into the range "a-s" because neither of the
conditions "u > s && U > s" or "u < a && U < a" would be true, resulting
in the lookup falling back to assuming the character is in the range.

Instead, first explicitly check if the character falls into the range,
rather than checking if it falls outside the range. If the explicit
checks fail, then we know the character is outside the range.
2022-08-29 16:34:47 -04:00
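The range-check bug described in 48cb15283a can be reproduced with a short Python sketch. Both functions are illustrative reconstructions from the commit message, not the C++ code in LibRegex, and their names are invented for the example.

```python
def in_range_flawed(ch, lo, hi):
    """Old approach: try to rule the character out, otherwise assume a match."""
    lower, upper = ch.lower(), ch.upper()
    if lower > hi and upper > hi:
        return False
    if lower < lo and upper < lo:
        return False
    return True  # falls through for 'u' against a-s: 'u' > 's' but 'U' < 'a'

def in_range_explicit(ch, lo, hi):
    """New approach: the character matches only if one of its cases is inside."""
    lower, upper = ch.lower(), ch.upper()
    return lo <= lower <= hi or lo <= upper <= hi

print(in_range_flawed("u", "a", "s"))    # True, the bug
print(in_range_explicit("u", "a", "s"))  # False
```

For "u" the two case variants land on opposite sides of "a-s", so neither exclusion test fires; checking inclusion directly avoids that gap.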
Ali Mohammad Pur 598dc74a76 LibRegex: Partially implement the ECMAScript unicodeSets proposal
This skips the new string unicode properties additions, along with \q{}.
2022-07-20 21:25:59 +01:00
sin-ack c8585b77d2 Everywhere: Replace single-char StringView op. arguments with chars
This prevents us from needing a sv suffix, and potentially reduces the
need to run generic code for a single character (as contains,
starts_with, ends_with etc. for a char will be just a length and
equality check).

No functional changes.
2022-07-12 23:11:35 +02:00
sin-ack 3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
Ali Mohammad Pur d348eaf305 LibRegex: Treat inverted Compare entries as disjunctions
[^XYZ] means not(X | Y | Z). We used to translate this to
not(X) | not(Y) | not(Z); this commit makes LibRegex interpret the
pattern as not(X) & not(Y) & not(Z).
2022-07-10 14:26:03 +02:00
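The two translations from d348eaf305 behave very differently, which a minimal Python sketch makes visible. The function names are hypothetical, and Python's `re` module is used only as an independent reference for what `[^XYZ]` should accept.

```python
import re

def inverted_class_or(ch, members):
    """Old translation: not(X) | not(Y) | not(Z)."""
    return any(ch != m for m in members)

def inverted_class_and(ch, members):
    """New translation: not(X) & not(Y) & not(Z), i.e. not(X | Y | Z)."""
    return all(ch != m for m in members)

for ch in "XYZa":
    expected = re.fullmatch(r"[^XYZ]", ch) is not None
    print(ch, inverted_class_or(ch, "XYZ"), inverted_class_and(ch, "XYZ"), expected)
```

With more than one member, the "or" form accepts every character, since any character differs from at least one of X, Y and Z.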
Ali Mohammad Pur b85666b3d2 LibRegex: Fix lookup table-based range checks in Compare
The lowercase version of a range is not required to be a valid range.
Instead of casefolding the range and making it invalid, check twice with
both cases of the input character (which are the same as the input if
the match is not case-insensitive).
This time includes an actual test :^)
2022-07-09 01:00:44 +00:00
Ali Mohammad Pur 7d01ee63d6 LibRegex: Use proper CharRange constructor instead of bit_casting
Otherwise the range order would be inverted.
2022-07-05 07:19:13 +02:00
Ali Mohammad Pur 6e655b7f89 LibRegex: Fully interpret the Compare Op when looking for overlaps
We had a really naive and simplistic implementation, which led to
various issues where the optimiser incorrectly rewrote the regex to use
atomic groups; this commit fixes that.
2022-07-04 23:09:53 +02:00
Ali Mohammad Pur 1409a48da6 LibRegex: Check inverse_matched after every op, not just at the end
Fixes #13755.

Co-Authored-By: Damien Firmenich <fir.damien@gmail.com>
2022-04-22 10:02:39 +02:00
Idan Horowitz 086969277e Everywhere: Run clang-format
2022-04-01 21:24:45 +01:00
Ali Mohammad Pur 97a333608e LibRegex: Make codegen+optimisation for alternatives much faster
Just a little thinking outside the box, and we can now parse and
optimise a million copies of "a|" chained together in just a second :^)
2022-02-20 11:53:59 +01:00
Ali Mohammad Pur 4be7239626 LibRegex: Make parse_disjunction() consume all disjunctions in one frame
This helps us not blow up when too many disjunctions are chained together
in the regex we're parsing.
Fixes #12615.
2022-02-20 11:53:59 +01:00
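The idea behind 4be7239626, handling every `|` in one stack frame instead of recursing per alternative, can be sketched in Python. This is a toy that assumes a pattern made only of literals and `|` (no escapes, groups or classes); both function names are invented and neither reflects the actual C++ parser.

```python
def parse_disjunction_recursive(pattern, start=0):
    """One stack frame per '|': overflows the stack on long chains."""
    bar = pattern.find("|", start)
    if bar == -1:
        return [pattern[start:]]
    return [pattern[start:bar]] + parse_disjunction_recursive(pattern, bar + 1)

def parse_disjunction_iterative(pattern):
    """Consume all alternatives in a single frame using a loop."""
    alternatives = []
    start = 0
    while True:
        bar = pattern.find("|", start)
        if bar == -1:
            alternatives.append(pattern[start:])
            return alternatives
        alternatives.append(pattern[start:bar])
        start = bar + 1

long_pattern = "a|" * 100_000
print(len(parse_disjunction_iterative(long_pattern)))
```

The recursive version raises `RecursionError` on `long_pattern` under Python's default recursion limit, while the loop's stack depth stays constant regardless of how many alternatives are chained.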
Ali Mohammad Pur 627bbee055 LibRegex: Allow quantifiers after quantifiable assertions
While quantifying assertions is very much meaningless, the specification
allows them with annex B's extended grammar for browsers, so read and
apply the quantifiers.
Fixes #12373.
2022-02-20 11:53:59 +01:00
Ali Mohammad Pur 3b0943d24c LibRegex: Correct the alternative matching order when one is empty
Previously we were compiling `/a|/` into what effectively would be
`/|a/`, which is clearly incorrect.
2022-02-14 11:30:50 +01:00
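Why the order matters in 3b0943d24c can be seen with any backtracking engine that tries alternatives left to right. The sketch below uses Python's `re` module as a stand-in, not LibRegex itself; `first_match` is a helper made up for the example.

```python
import re

def first_match(pattern, text):
    """Return the text matched at position 0, or None if nothing matches."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

print(repr(first_match(r"a|", "a")))  # 'a': the non-empty alternative is tried first
print(repr(first_match(r"|a", "a")))  # '': the empty alternative wins immediately
```

Swapping the alternatives therefore changes what is matched, even though both patterns accept the same set of strings.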
Ali Mohammad Pur 6a4c8a66ae LibRegex: Only skip full instructions when optimizing alternations
It makes no sense to skip half of an instruction, so make sure to skip
only full instructions!
2022-02-09 21:02:24 +00:00
Timothy Flynn 2212aa2388 LibRegex: Support non-ASCII whitespace characters when matching \s or \S
ECMA-262 defines \s as:

    Return the CharSet containing all characters corresponding to a code
    point on the right-hand side of the WhiteSpace or LineTerminator
    productions.

The LineTerminator production is simply: U+000A, U+000D, U+2028, or
U+2029. Unfortunately there isn't a Unicode property that covers just
those code points.

The WhiteSpace production is: U+0009, U+000B, U+000C, U+FEFF, or any
code point with the Space_Separator general category.

If the Unicode generators are disabled, this will fall back to ASCII
space code points.
2022-02-05 22:30:10 +03:30
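The code point set listed in 2212aa2388 can be written down as a small Python predicate. This is a sketch built only from the code points named in the commit message, using `unicodedata.category(ch) == "Zs"` for the Space_Separator general category; it is not how LibRegex or LibUnicode implement the check, and the function name is invented.

```python
import unicodedata

# LineTerminator production, as listed in the commit message.
LINE_TERMINATORS = {0x000A, 0x000D, 0x2028, 0x2029}
# Explicitly listed WhiteSpace code points; the rest come from category Zs.
EXPLICIT_WHITESPACE = {0x0009, 0x000B, 0x000C, 0xFEFF}

def is_ecma_whitespace(ch):
    """True if `ch` would be in the \\s CharSet described above."""
    cp = ord(ch)
    if cp in LINE_TERMINATORS or cp in EXPLICIT_WHITESPACE:
        return True
    return unicodedata.category(ch) == "Zs"

for ch in [" ", "\u00a0", "\u2028", "\ufeff", "a"]:
    print(f"U+{ord(ch):04X}", is_ecma_whitespace(ch))
```

The line terminators need their own set because, as the commit notes, no single Unicode property covers exactly those four code points.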
Ali Mohammad Pur a962ee020a LibJS+LibRegex: Don't repeat regex match in regexp_exec()
LibRegex already implements this loop in a more performant way, so all
LibJS has to do here is to return things in the right shape, and not
loop over the input string.
Previously this was a quadratic operation on string length, which led
to crazy execution times on failing regexps - now it's nice and fast :^)

Note that a Regex test has to be updated to remove the stateful flag as
it repeats matching on multiple strings.
2022-02-05 00:09:32 +01:00
Ali Mohammad Pur 2b028f6faa LibRegex+LibJS: Avoid searching for more than one match in JS RegExps
All of JS's regular expression APIs only want a single match, so avoid
trying to produce more (which will be discarded anyway).
2022-02-05 00:09:32 +01:00
Ali Mohammad Pur 5fac41f733 LibRegex: Implement ECMA262 multiline matching without splitting lines
As ECMA262 regex allows `[^]` and literal newlines to match newlines in
the input string, we shouldn't split the input string into lines, rather
simply make boundaries and catchall patterns capable of checking for
these conditions specifically.
2022-01-26 00:53:09 +03:30