Commit graph

41 commits

Author SHA1 Message Date
Luke Wilde eaa4048870 LibTextCodec: Add "get output encoding" from the Encoding specification 2023-06-19 06:12:26 +02:00
Timothy Flynn 00fa23237a LibTextCodec: Change UTF-8's decoder to replace invalid code points
The UTF-8 decoder will currently crash if it is provided invalid UTF-8
input. Instead, change its behavior to match that of all other decoders
to replace invalid code points with U+FFFD. This is required by the web.
2023-05-12 05:47:36 +02:00
Andreas Kling a504ac3e2a Everywhere: Rename equals_ignoring_case => equals_ignoring_ascii_case
Let's make it clear that these functions deal with ASCII case only.
2023-03-10 13:15:44 +01:00
Luke Wilde e864444fe3 LibTextCodec/Latin1: Iterate over input string with u8 instead of char
Using char causes bytes equal to or over 0x80 to be treated as a
negative value and produce incorrect results when implicitly casting to
u32.

For example, `atob` in LibWeb uses this decoder to convert non-ASCII
values to UTF-8, but non-ASCII values are >= 0x80 and thus produces
incorrect results in such cases:
```js
Uint8Array.from(atob("u660"), c => c.charCodeAt(0));
```
This used to produce [253, 253, 253] instead of [187, 174, 180].

Required by Cloudflare's IUAM challenges.
2023-02-28 08:46:06 +00:00
Sam Atkins 2db168acc1 LibTextCodec+Everywhere: Port Decoders to new Strings 2023-02-19 17:15:47 +01:00
Sam Atkins 3c5090e172 LibTextCodec: Return Optional<Decoder&> from bom_sniff_to_decoder() 2023-02-19 17:15:47 +01:00
Sam Atkins f2a9426885 LibTextCodec+Everywhere: Return Optional<Decoder&> from decoder_for() 2023-02-19 17:15:47 +01:00
Sam Atkins d6075ef5b5 LibTextCodec+Everywhere: Make TextCodec::decoder_for() take a StringView
We don't need a full String/DeprecatedString inside this function, so we
might as well not force users to create one.
2023-02-15 12:48:26 -05:00
Nico Weber eac2b2382c LibTextCodec: Add a MacRoman decoder
Allows displaying `<meta charset="x-mac-roman">` html files.
(`:set fenc=macroman`, `:w` in vim to save in that encoding.)
2023-01-24 14:37:20 +00:00
Nico Weber b14b5a4d06 LibTextCodec: Simplify Latin1Decoder::process() a tiny bit 2023-01-24 14:37:20 +00:00
Nico Weber 3423b54eb9 LibTextCodec: Make utf-16be and utf-16le codecs actually work
There were two problems:

1. They didn't handle surrogates
2. They used signed chars, leading to eg 0x00e4 being treated as 0xffe4

Also add a basic test that catches both issues.
There's some code duplication with Utf16CodePointIterator::operator*(),
but let's get things working first.
2023-01-22 21:30:44 +00:00
Linus Groh 57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh 6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Tim Schumacher 7834e26ddb Everywhere: Explicitly link all binaries against the LibC target
Even though the toolchain implicitly links against -lc, it does not know
where it should get LibC from except for the sysroot. In the case of
Clang this causes it to pick up the LibC stub instead, which might be
slightly outdated and feature missing symbols.

This is currently not an issue that manifests because we pass through
the dependency on LibC and other libraries by accident, which causes
CMake to link against the LibC target (instead of just the library),
and thus points the linker at the build output directory.

Since we are looking to fix that in the upcoming commits, let's make
sure that everything will still be able to find the proper LibC first.
2022-11-01 14:49:09 +00:00
sin-ack 3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
Idan Horowitz 086969277e Everywhere: Run clang-format 2022-04-01 21:24:45 +01:00
Karol Kosek b006a60366 LibTextCodec: Pass code points instead of bytes on UTF-8 string process
Previously we were passing raw UTF-8 bytes as code points, which caused
CSS content properties to display incorrect characters.

This makes bullet separators in Wikipedia templates display correctly.
2022-03-29 01:01:32 +02:00
Hendiadyoin1 6a95df2526 LibTextCodec: Don't allocate Strings on encoding normalisation
This ripples down to LibWeb's HTML and XHR decoders, which therefore
become less allocation heavy.
2022-03-21 10:48:17 +01:00
Jelle Raaijmakers 9c2a7c0e03 LibTextCodec: Add support for the UTF16-LE encoding 2022-03-08 14:51:06 +01:00
Luke Wilde 0e0f98a45e LibTextCodec: Add x-user-defined decoder
It's a pretty simple charset: the bottom 128 bytes (0x00-0x7F) are
standard ASCII, while the top 128 bytes (0x80-0xFF) are mapped to a
portion of the Unicode Private Use Area, specifically 0xF780-0xF7FF.

This is used by Google Maps for certain blobs.
2022-02-12 12:53:28 +01:00
Luke Wilde 835a344337 LibTextCodec: Add decoder function that overrides given decoder on BOM
This functions takes a user-provided decoder and will only use it if no
BOM is in the input.

If there is a BOM, it will ignore the given decoder and instead decode
the input with the appropriate Unicode decoder for the detected BOM.

This is only to be used where it's specifically needed, for example XHR
uses this for compatibility with deployed content. As such, it has an
obnoxious name to discourage usage.
2022-02-12 12:53:28 +01:00
Luke Wilde 94965ba28d LibTextCodec: Add BOM sniffer
This takes the input and sniffs it for a BOM. If it has the UTF-8 or
UTF-16BE BOM, it will return their respective decoder. Currently we
don't have a UTF-16LE decoder, so it will assert TODO if it detects
a UTF-16LE BOM. If there is no recognisable BOM, it will return no
decoder.
2022-02-12 12:53:28 +01:00
Daniel Bertalan 6003b6f4d3 LibTextCodec: Do not allocate the various decoders
These objects contain no data members, so there is no point in creating
1-byte heap allocations for them. We don't need to have them as static
local variables, as they are trivially constructible, so they can simply
be global variables.
2022-01-28 23:31:00 +01:00
Dmitry Petrov 6f5102f435 LibTextCodec: Add alternate Cyrillic (aka Koi8-r) encoding
Fixes #6840.
2021-12-16 22:44:45 +01:00
Andreas Kling 8b1108e485 Everywhere: Pass AK::StringView by value 2021-11-11 01:27:46 +01:00
Sam Atkins d7ffa51424 LibTextCodec: Ignore BYTE ORDER MARK at the start of utf8/16 strings
Before, this was getting included as part of the output text, which was
confusing the HTML parser. Nobody needs the BOM after we have identified
the codec, so now we remove it when converting to UTF-8.
2021-09-15 17:00:18 +02:00
sin-ack e6818388e4 LibTextCodec: Add "process" API for allocation-free code point iteration
This commit adds a new process method to all Decoder subclasses which
do what to_utf8 used to do, and allows callers to customize the handling
of individiual UTF-8 code points through a callback. Decoder::to_utf8
now uses this API to generate a string via StringBuilder, preserving the
original behavior.
2021-08-30 00:08:40 +02:00
Andreas Kling ed7a2f21ff LibTextCodec: Remove unused is_standardized_encoding() 2021-08-20 15:31:46 +02:00
Aatos Majava 3b2a528b33 LibTextCodec: Add Turkish (aka ISO-8859-9, Windows-1254) encoding 2021-06-23 16:32:47 +01:00
Aatos Majava 7597cca5c6 LibTextCodec: Add ISO-8859-15 (aka Latin-9) encoding 2021-06-15 15:12:09 +01:00
Max Wipfli d325403cb5 LibTextCodec: Use Optional<String> for get_standardized_encoding
This patch changes get_standardized_encoding to use an Optional<String>
return type instead of just returning the null string when unable to
match the provided encoding to one of the canonical encoding names.

This is part of an effort to move away from using null strings towards
explicitly using Optional<String> to indicate that the String may not
have a value.
2021-05-18 21:02:07 +02:00
Idan Horowitz 87cabda80d LibTextCodec: Implement a Windows-1251 decoder
This encoding (a superset of ascii that adds in the cyrillic alphabet)
is currently the third most used encoding on the web, and because
cyrillic glyphs were added by Dmitrii Trifonov recently, we can now
support it as well :^)
2021-05-01 17:59:08 +02:00
Brian Gianforcaro 1682f0b760 Everything: Move to SPDX license identifiers in all files.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.

See: https://spdx.dev/resources/use/#identifiers

This was done with the `ambr` search and replace tool.

 ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
2021-04-22 11:22:27 +02:00
Idan Horowitz 4a2c0d721f LibTextCodec: Implement a Windows-1255 decoder.
This is a superset of ascii that adds in the hebrew alphabet.
(Google currently assumes we are running windows due to not
recognizing Serenity as the OS in the user agent, resulting
in this encoding instead of UTF8 in google search results)
2021-04-17 18:13:20 +02:00
Nicholas-Baron c4ede38542 Everything: Add -Wnon-virtual-dtor flag
This flag warns on classes which have `virtual` functions but do not
have a `virtual` destructor.

This patch adds both the flag and missing destructors. The access level
of the destructors was determined by a two rules of thumb:
1. A destructor should have a similar or lower access level to that of a
   constructor.
2. Having a `private` destructor implicitly deletes the default
   constructor, which is probably undesirable for "interface" types
   (classes with only virtual functions and no data).

In short, most of the added destructors are `protected`, unless the
compiler complained about access.
2021-04-15 20:57:13 +02:00
Idan Horowitz c9f25bca04 LibTextCodec: Make UTF16BEDecoder read only up to an even offset
Reading up to the end of the input string of odd length results in
an out-of-bounds read
2021-03-15 16:08:12 +01:00
Luke 7427e3a8a6 LibTextCodec: Fix IBM666 => IBM866 typo 2021-03-14 11:27:52 +01:00
Andreas Kling 0538dc4e28 LibTextCodec: Add a simple UTF-16BE decoder 2021-02-16 17:31:22 +01:00
Ben Wiederhake aa3d915260 LibTextCodec: Avoid duplicate definition of standard encodings 2021-02-01 09:54:32 +01:00
asynts 01879d27c2 Everywhere: Replace a bundle of dbg with dbgln.
These changes are arbitrarily divided into multiple commits to make it
easier to find potentially introduced bugs with git bisect.
2021-01-16 11:54:35 +01:00
Andreas Kling 13d7c09125 Libraries: Move to Userland/Libraries/ 2021-01-12 12:17:46 +01:00