This is just a straight (and fairly inefficient) implementation of IPv6
parsing and serialization from the URL spec.
Note that we don't use AK::IPv6Address here because the URL spec
requires a specific serialization behavior.
This now searches the memory in blocks, which should be slightly more
efficient. However, it doesn't make much difference (e.g. ~1% in LZMA
compression) in most real-world applications, as the non-hint function
is more expensive by orders of magnitude.
The "operation modes" of this function have very different focuses, and
trying to combine both in a way where we share the most amount of code
probably results in the worst performance.
Instead, split up the function into "existing distances" and "no
existing distances" so that we can optimize either case separately.
We will be adding extra logic to the CircularBuffer to optimize
searching, but this would negatively impact the performance of
CircularBuffer users that don't need that functionality.
I was debugging a different issue in Ladybird, and noticed that
completing relative file URLs with URL::complete_url didn't seem to work
right. This test case covers both the working https case, as well as the
file URL case fixed by the previous commit.
Change the name and return type of
`IPv6Address::to_deprecated_string()` to `IPv6Address::to_string()`
with return type `ErrorOr<String>`.
It will now propagate errors that occur when writing to the
StringBuilder.
There are two users of `to_deprecated_string()` that now use
`to_string()`:
1. `Formatted<IPv6Address>`: it now propagates errors.
2. `inet_ntop`: it now sets errno to ENOMEM and returns.
That's what this class really is; in fact that's what the first line of
the comment says it is.
This commit does not rename the main files, since those will contain
other time-related classes in a little bit.
Note that in some cases (in particular SQL::Result and PDFErrorOr),
there is no Formatter defined for the error type, hence TRY_OR_FAIL
cannot work as-is. Furthermore, this commit leaves untouched the places
where MUST could be replaced by TRY_OR_FAIL.
Inspired by:
https://github.com/SerenityOS/serenity/pull/18710#discussion_r1186892445
"The official project language is American English […]."
5d2e915623/CONTRIBUTING.md?plain=1#L30
Here's a short statistic of the occurrences of the word "behavio(u)r":
$ git grep -IPioh 'behaviou?r' | sort | uniq -c | sort -n
2 BEHAVIOR
24 Behaviour
32 behaviour
407 Behavior
992 behavior
Therefore, it is clear that "behaviour" (56 occurrences) should be
regarded a typo, and "behavior" (1401 occurrences) should be preferred.
Note that The occurrences in LibJS are intentionally NOT changed,
because there are taken verbatim from the specification. Hence:
$ git grep -IPioh 'behaviou?r' | sort | uniq -c | sort -n
2 BEHAVIOR
10 behaviour
24 Behaviour
407 Behavior
1014 behavior
GCC 13 was released on 2023-04-26. This commit fixes Lagom build errors
when using an updated host toolchain:
- Adds a workaround for a bug in constraint handling, which made LibJS
fail to compile: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109683
- Silences the new `-Wdangling-reference` diagnostic globally. It
produces multiple false positives with no clear way to silence them
without `#pragmas`.
- Silences `-Wself-move` in `RefPtr` tests as GCC 13 adds this
previously Clang-exclusive warning.
We now null out smart pointers *before* calling unref on the pointee.
This ensures that the same smart pointer can't be used to acquire a new
reference to the pointee after its destruction has begun.
I ran into this when destroying a non-empty IntrusiveList of RefPtrs,
but the problem was more general so this fixes it for all of RefPtr,
NonnullRefPtr, OwnPtr and NonnullOwnPtr.
This allows accessing and looping over the path segments in a URL
without necessarily allocating a new vector if you want them percent
decoded too (which path_segment_at_index() has an option for).
This now defaults to serializing the path with percent decoded segments
(which is what all callers expect), but has an option not to. This fixes
`file://` URLs with spaces in their paths.
The name has been changed to serialize_path() path to make it more clear
that this method will generate a new string each call (except for the
cannot_be_a_base_url() case). A few callers have then been updated to
avoid repeatedly calling this function.
Previously, if we copied the last byte for a length of 100, we'd
recalculate the read span 100 times and memmove one byte 100 times,
which resulted in a lot of overhead.
Now, if we know that we have two consecutive copies of the data, we just
extend the distance to cover both copies, which halves the number of
times that we recalculate the span and actually call memmove.
This takes the running time of the attached benchmark case from 150ms
down to 15ms.
As noted in serval comments doing this goes against the WC3 spec,
and breaks parsing then re-serializing URLs that contain percent
encoded data, that was not encoded using the same character set as
the serializer.
For example, previously if you had a URL like:
https:://foo.com/what%2F%2F (the path is what + '//' percent encoded)
Creating URL("https:://foo.com/what%2F%2F").serialize() would return:
https://foo.com/what//
Which is incorrect and not the same as the URL we passed. This is
because the re-serializing uses the PercentEncodeSet::Path which
does not include '/'.
Only doing the percent encoding in the setters fixes this, which
is required to navigate to Google Street View (which includes a
percent encoded URL in its URL).
Seems to fix#13477 too
`vformat()` can now accept format specifiers of the form
{:'[numeric-type]}. This will output a number with a comma separator
every 3 digits.
For example:
`dbgln("{:'d}", 9999999);` will output 9,999,999.
Binary, octal and hexadecimal numbers can also use this feature, for
example:
`dbgln("{:'x}", 0xffffffff);` will output ff,fff,fff.
This is very similar to the LittleEndianInputBitStream bit buffer change
from 8e834d4bb2.
We currently buffer one byte of data for the underlying stream. And when
we put bits onto that buffer, we do so 1 bit at a time.
This replaces the u8 buffer with a u64. And instead of looping at all,
we perform bitwise operations to write the desired number of bits.
Using the "enwik8" file as a test (100MB uncompressed, commonly used in
benchmarks: https://www.mattmahoney.net/dc/enwik8.zip), compression time
decreases from:
13.62s to 10.9s on Serenity (cold)
13.62s to 9.22s on Serenity (warm)
2.93s to 2.32s on Linux
One caveat is that this requires explicitly flushing any leftover bits
when the caller is done with the stream. The byte buffer implementation
implicitly flushed its data every time the buffer was byte-aligned, as
doing so would always fill the byte. This is no longer the case. But for
now, this should be fine as the one user of this class, DEFLATE, already
has a "flush everything now that we're done" finalizer.
With Clang, the previous/next pointers in buckets of an
`OrderedHashTable` are not cleared when a bucket is being shifted up as
a result of a removed bucket. As a result, an unfortunate pointer mixup
could lead to an infinite loop in the `HashTable` iterator, which was
exposed in `HashMap::keys()`.
Co-authored-by: Luke Wilde <lukew@serenityos.org>
Similar to POSIX read, the basic read and write functions of AK::Stream
do not have a lower limit of how much data they read or write (apart
from "none at all").
Rename the functions to "read some [data]" and "write some [data]" (with
"data" being omitted, since everything here is reading and writing data)
to make them sufficiently distinct from the functions that ensure to
use the entire buffer (which should be the go-to function for most
usages).
No functional changes, just a lot of new FIXMEs.
We currently fully casefold the left- and right-hand sides to compare
two strings with case-insensitivity. Now, we casefold one code point at
a time, storing the result in a view for comparison, until we exhaust
both strings.
For example, the code point U+002F could be encoded as UTF-8 with the
bytes 0x80 0xAF. This trick has historically been used to bypass
security checks.
This is needed to have code for creating an in-memory sRGB profile using
the (floating-ppoint) numbers from the sRGB spec and having the
fixed-point values in the profile match what they are in other software
(such as GIMP).
It has the side effect of making the FixedPoint ctor no longer constexpr
(which seems fine; nothing was currently relying on that).
Some of FixedPoint's member functions don't round yet, which requires
tweaking a test.
`consume_until(foo)` stops before foo, and so does
`ignore_until(Predicate)`, so let's make the other `ignore_until()`
overloads consistent with that so they're less confusing.
The output of the DeprecatedString::bijective_base_from() is now
correct for numbers larger than base^2.
This makes column names display correctly in Spreadsheet.
This naming scheme matches Vector.
This also changes `take_last` to move the value it takes, and delete by
known pointer, avoiding a full lookup and potential copies.
Instead of rehashing on collisions, we use Robin Hood hashing: a simple
linear probe where we keep track of the distance between the bucket and
its ideal position. On insertion, we allow a new bucket to "steal" the
position of "rich" buckets (those near their ideal position) and move
them further down.
On removal, we shift buckets back up into the freed slot, decrementing
their distance while doing so.
This behavior automatically optimizes the number of required probes for
any value, and removes the need for periodic rehashing (except when
expanding the capacity).
This approximation tries to generate values within 0.1% of their actual
expected value. Microbenchmarks indicate that this iterative SIMD
version can be up to 60x faster than `AK::SIMD::exp`.
For example the words "can't" and "32.3" should not have boundaries
detected on the "'" and "." code points, respectively.
The String test cases fixed here are because "b'ar" is now considered
one word.
This is done by providing Traits<ByteBuffer>::equals functions for
(Readonly)Bytes, as the base GenericTraits<T>::equals is unable to
convert the ByteBuffer to (Readonly)Bytes to then use Span::operator==
This allows us to check if a Vector<ByteBuffer> contains a
(Readonly)Bytes without having to making a copy of it into a ByteBuffer
first. The initial use of this is in LibWeb with CORS-preflight, where
we check the split contents of the Access-Control headers with
Fetch::Infrastructure::Request::method() and static StringViews
such as "*"sv.bytes().
The AnyString concept is currently broken because it checks whether a
StringView is constructible from a type T. The StringView constructors,
however, only accept constant rvalue references - i.e. `T const&`.
This also adds a test to ensure this continues to work.
Having an alias function that only wraps another one is silly, and
keeping the more obvious name should flush out more uses of deprecated
strings.
No behavior change.
As a nearby comment says, "This is a terrible approximation".
This doesn't make things less terrible, but it does make things
more correct in the given framework of terribleness.
Fixes#17156.
In cases where we know a string literal will fit in the short string
storage, we can do so at compile time without needing to handle error
propagation. If the provided string literal is too long, a compilation
error will be emitted due to the failed VERIFY statement being a non-
constant expression.
The Unicode spec defines much more complicated caseless matching
algorithms in its Collation spec. This implements the "basic" case
folding comparison.
The existing `is_i32()` and friends only check if `i32` is their
internal type, but a value such as `0` could be literally any integer
type internally. `is_integer<T>()` instead determines whether the
contained value is an integer and can fit inside T.
This error was introduced by 9a7accdd and had a significant impact on
`BufferedFile` behavior. Hence, we started seeing crash in test262.
By itself, the issue was a wrong calculation of the internal reading
spans when using the `read` and `until` parameters. Which can lead to
at worse crash in VERIFY and at least weird behaviors as missed needles
or detections out of bounds.
It was also accompanied by an erroneous test.
This patch fixes the bug, the test and also provides more tests.
This parameter allows to start searching after an offset. For example,
to resume a search.
It is unfortunately a breaking change in API so this patch also modifies
one user and one test.
This implements a FlyString that will de-duplicate String instances. The
FlyString will store the raw encoded data of the String instance: If the
String is a short string, FlyString holds the String::ShortString bytes;
otherwise FlyString holds a pointer to the Detail::StringData.
FlyString itself does not know about String's storage or how to refcount
its Detail::StringData. It defers to String to implement these details.
Since AK can't refer to LibUnicode directly, the strategy here is that
if you need case transformations, you can link LibUnicode and receive
them. If you try to use either of these methods without linking it, then
you'll of course get a linker error (note we don't do any fallbacks to
e.g. ASCII case transformations). If you don't need these methods, you
don't have to link LibUnicode.
DeprecatedFlyString relies heavily on DeprecatedString's StringImpl, so
let's rename it to A) match the name of DeprecatedString, B) write a new
FlyString class that is tied to String.
This allows us to make all comparision operators on the class constexpr
without pulling in a bunch of boilerplate. We don't use the `<compare>`
header because it doesn't compile in the main serenity cross-build due
to the include paths to LibC being incompatible with how libc++ expects
them to be for clang builds.
These instances were detected by searching for files that include
AK/StdLibExtras.h, but don't match the regex:
\\b(abs|AK_REPLACED_STD_NAMESPACE|array_size|ceil_div|clamp|exchange|for
ward|is_constant_evaluated|is_power_of_two|max|min|mix|move|_RawPtr|RawP
tr|round_up_to_power_of_two|swap|to_underlying)\\b
(Without the linebreaks.)
This regex is pessimistic, so there might be more files that don't
actually use any "extra stdlib" functions.
In theory, one might use LibCPP to detect things like this
automatically, but let's do this one step after another.
These instances were detected by searching for files that include
AK/Format.h, but don't match the regex:
\\b(CheckedFormatString|critical_dmesgln|dbgln|dbgln_if|dmesgln|FormatBu
ilder|__FormatIfSupported|FormatIfSupported|FormatParser|FormatString|Fo
rmattable|Formatter|__format_value|HasFormatter|max_format_arguments|out
|outln|set_debug_enabled|StandardFormatter|TypeErasedFormatParams|TypeEr
asedParameter|VariadicFormatParams|v_critical_dmesgln|vdbgln|vdmesgln|vf
ormat|vout|warn|warnln|warnln_if)\\b
(Without the linebreaks.)
This regex is pessimistic, so there might be more files that don't
actually use any formatting functions.
Observe that this revealed that Userland/Libraries/LibC/signal.cpp is
missing an include.
In theory, one might use LibCPP to detect things like this
automatically, but let's do this one step after another.
In practice, this function does not take any perceptible amount of time.
However, this benchmark demonstrates that for extreme values, the
internal for-loop does matter.
Using policy based design `SinglyLinkedList` and
`SinglyLinkedListWithCount` can be combined into one class which takes
a policy to determine how to keep track of the size of the list. The
default policy is to use list iteration to count the items in the list
each time. The `WithCount` form is a different policy which tracks the
size, but comes with the overhead of storing the count and
incrementing/decrementing on each modification.
This model is extensible to have other forms of counting by
implementing only a new policy instead of implementing a totally new
type.
In 7c5e30daaa, the focus was "only" on
Userland/Libraries/, whereas this commit cleans up the remaining
headers in the repo, and any new badly-formatted include.
The class is very similar to `CircularDuplexStream` in its behavior.
Main differences are that `CircularBuffer`:
- does not inherit from `AK::Stream`
- uses `ErrorOr` for its API
- is heap allocated (and OOM-Safe)
This patch also add some tests.
Previously any backslash and the character following it were ignored.
This commit adds a fall through to match the character following the
backslash without checking whether it is "special".
This allows callers to use the following semantics:
using MyVariant = Variant<Empty, int>;
template<typename T>
size_t size() { return TypeList<T>::size; }
auto s = size<MyVariant>();
This will be needed for an upcoming IPC change, which will result in us
knowing the Variant type, but not the underlying variadic types that the
Variant holds.
`OwnPtrWithCustomDeleter` was a decorator which provided the ability
to add a custom deleter to `OwnPtr` by wrapping and taking the deleter
as a run-time argument to the constructor. This solution means that no
additional space is needed for the `OwnPtr` because it doesn't need to
store a pointer to the deleter, but comes at the cost of having an
extra type that stores a pointer for every instance.
This logic is moved directly into `OwnPtr` by adding a template
argument that is defaulted to the default deleter for the type. This
means that the type itself stores the pointer to the deleter instead
of every instance and adds some type safety by encoding the deleter in
the type itself instead of taking a run-time argument.
Note that this still keeps the old behaviour of putting things in std by
default on serenity so the tools can be happy, but if USING_AK_GLOBALLY
is unset, AK behaves like a good citizen and doesn't try to put things
in the ::std namespace.
std::nothrow_t and its friends get to stay because I'm being told that
compilers assume things about them and I can't yeet them into a
different namespace...for now.
Implement insertion sort in AK. The cutoff value 7 is a magic number
here, values [5, 15] should work well. Main idea of the cutoff is to
reduce recursion performed by quicksort to speed up sorting
of small partitions.
DeprecatedString (formerly String) has been with us since the start,
and it has served us well. However, it has a number of shortcomings
that I'd like to address.
Some of these issues are hard if not impossible to solve incrementally
inside of DeprecatedString, so instead of doing that, let's build a new
String class and then incrementally move over to it instead.
Problems in DeprecatedString:
- It assumes string allocation never fails. This makes it impossible
to use in allocation-sensitive contexts, and is the reason we had to
ban DeprecatedString from the kernel entirely.
- The awkward null state. DeprecatedString can be null. It's different
from the empty state, although null strings are considered empty.
All code is immediately nicer when using Optional<DeprecatedString>
but DeprecatedString came before Optional, which is how we ended up
like this.
- The encoding of the underlying data is ambiguous. For the most part,
we use it as if it's always UTF-8, but there have been cases where
we pass around strings in other encodings (e.g ISO8859-1)
- operator[] and length() are used to iterate over DeprecatedString one
byte at a time. This is done all over the codebase, and will *not*
give the right results unless the string is all ASCII.
How we solve these issues in the new String:
- Functions that may allocate now return ErrorOr<String> so that ENOMEM
errors can be passed to the caller.
- String has no null state. Use Optional<String> when needed.
- String is always UTF-8. This is validated when constructing a String.
We may need to add a bypass for this in the future, for cases where
you have a known-good string, but for now: validate all the things!
- There is no operator[] or length(). You can get the underlying data
with bytes(), but for iterating over code points, you should be using
an UTF-8 iterator.
Furthermore, it has two nifty new features:
- String implements a small string optimization (SSO) for strings that
can fit entirely within a pointer. This means up to 3 bytes on 32-bit
platforms, and 7 bytes on 64-bit platforms. Such small strings will
not be heap-allocated.
- String can create substrings without making a deep copy of the
substring. Instead, the superstring gets +1 refcount from the
substring, and it acts like a view into the superstring. To make
substrings like this, use the substring_with_shared_superstring() API.
One caveat:
- String does not guarantee that the underlying data is null-terminated
like DeprecatedString does today. While this was nifty in a handful of
places where we were calling C functions, it did stand in the way of
shared-superstring substrings.
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.
One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)