This patch adds a random offset between 0 and 4096 to the initial
stack pointer in new processes. Since the stack has to be 16-byte
aligned, the bottom bits can't be randomized.
Yet another thing to make things less predictable. :^)
We were doing stack and syscall-origin region validations before
taking the big process lock. There was a window of time where those
regions could then be unmapped/remapped by another thread before we
proceed with our syscall.
This patch closes that window, and makes sys$get_stack_bounds() rely
on the fact that we now know the userspace stack pointer to be valid.
Thanks to @BenWiederhake for spotting this! :^)
If we try to align a number above 0xfffff000 to the next multiple of
the page size (4 KiB), it would wrap around to 0. This is most likely
never what we want, so let's assert if that happens.
Let's be a little more expressive when inducing a kernel panic. :^)
PANIC(...) passes any arguments you give it to dmesgln(), then prints
a backtrace and hangs the machine.
Now that we no longer need to support the signal trampolines being
user-accessible inside the kernel memory range, we can get rid of the
"kernel" and "user-accessible" flags on Region and simply use the
address of the region to determine whether it's kernel or user.
This also tightens the page table mapping code, since it can now set
user-accessibility based solely on the virtual address of a page.
The signal trampoline was previously in kernelspace memory, but with
a special exception to make it user-accessible.
This patch moves it into each process's regular address space so we
can stop supporting user-allowed memory above 0xc0000000.
If we're flushing user space pointers and the process only has one
thread, we do not need to broadcast this to other processors as
they will all discard that request anyway.
We were failing to round down the base of partial VM ranges. This led
to split regions being constructed that could have a non-page-aligned
base address. This would then trip assertions in the VM code.
Found by fuzz-syscalls. :^)
If a program attempts to write from more than a million different locations,
there is likely shenaniganery afoot! Refuse to write to prevent kmem exhaustion.
Found by fuzz-syscalls. Can be reproduced by running this in the Shell:
$ syscall writev 1 [ 0 ] 0x08000000
Found by fuzz-syscalls. Can be reproduced by running this in the Shell:
$ syscall exit_thread
This leaves the process in the 'Dying' state but never actually removes it.
Therefore, avoid this scenario by pretending to exit the entire process.
Since the payload size is user-controlled, this could be used to
overflow the kernel stack.
We should probably also be breaking things into smaller packets at a
higher level, e.g TCPSocket::protocol_send(), but let's do that as
a separate exercise.
Fixes#5310.
Not sure why this was 4 MiB in the first place, but that's a lot of
memory to reserve for each thread when we're running with 512 MiB
total in the default testing setup. :^)
* We don't have to lock the "all IPv4 sockets" in exclusive mode, shared mode is
enough for just reading the list (as opposed to modifying it).
* We don't have to lock socket's own lock at all, the IPv4Socket::did_receive()
implementation takes care of this.
* Most importantly, we don't have to hold the "all IPv4 sockets" across the
IPv4Socket::did_receive() call(s). We can copy the current ICMP socket list
while holding the lock, then release the lock, and then call
IPv4Socket::did_receive() on all the ICMP sockets in our list.
These changes fix a deadlock triggered by receiving ICMP messages when using tap
networking setup (as opposed to QEMU's default user/SLIRP networking) on the host.
The way we read/write directories is very inefficient, and this doesn't
solve any of that. It does however reduce memory usage of directory
entry vectors by 25% which has nice immediate benefits.
CLion doesn't understand that we switch compilers mid-build (which I
can understand since it's a bit unusual.) Defining __serenity__ makes
the majority of IDE features work correctly in the kernel context.
sys$fork() already takes care of children inheriting the parent's root
directory, so there was no need to do the same thing when creating a
new user process.
Add a per-process ptrace lock and use it to prevent ptrace access to a
process after it decides to commit to a new executable in sys$execve().
Fixes#5230.
This patch adds Space, a class representing a process's address space.
- Each Process has a Space.
- The Space owns the PageDirectory and all Regions in the Process.
This allows us to reorganize sys$execve() so that it constructs and
populates a new Space fully before committing to it.
Previously, we would construct the new address space while still
running in the old one, and encountering an error meant we had to do
tedious and error-prone rollback.
Those problems are now gone, replaced by what's hopefully a set of much
smaller problems and missing cleanups. :^)
Wrap thread creation in a Thread::try_create() helper that first
allocates a kernel stack region. If that allocation fails, we propagate
an ENOMEM error to the caller.
This avoids the situation where a thread is half-constructed, without a
valid kernel stack, and avoids having to do messy cleanup in that case.
We now build the kernel with partial UBSAN support.
The following -fsanitize sub-options are enabled:
* nonnull-attribute
* bool
If the kernel detects UB at runtime, it will now print a debug message
with a stack trace. This is very cool! I'm leaving it on by default for
now, but we'll probably have to re-evaluate this as more options are
enabled and slowdown increases.
This achieves two things:
- Programs can now intentionally perform arbitrary syscalls by calling
syscall(). This allows us to work on things like syscall fuzzing.
- It restricts the ability of userspace to make syscalls to a single
4KB page of code. In order to call the kernel directly, an attacker
must now locate this page and call through it.
Leaking macros across headers is a terrible thing, but I can't think of
a better way of achieving this.
- We need some way of modifying debug macros from CMake to implement
ENABLE_ALL_THE_DEBUG_MACROS.
- We need some way of modifying debug macros in specific source files
because otherwise we need to rebuild too many files.
This was done using the following script:
sed -i -E 's/#cmakedefine01 ([A-Z0-9_]+)/#ifndef \1\n\0\n#endif\n/' AK/Debug.h.in
sed -i -E 's/#cmakedefine01 ([A-Z0-9_]+)/#ifndef \1\n\0\n#endif\n/' Kernel/Debug.h.in
There's no need for this to be generic and support running from an
arbitrary thread context. Perf events are always generated from within
the thread being profiled, so take advantage of that to simplify the
code. Also use Vector capacity to avoid heap allocations.
We were checking for size_t (unsigned) overflow but the current offset
is actually stored as off_t (signed). Fix this, and also fail with
EOVERFLOW correctly.
This change can be actually seen as two logical changes, the first
change is about to ensure we only read the ATA Status register only
once, because if we read it multiple times, we acknowledge interrupts
unintentionally. To solve this issue, we always use the alternate Status
register and only read the original status register in the IRQ handler.
The second change is how we handle interrupts - if we use DMA, we can
just complete the request and return from the IRQ handler. For PIO mode,
it's more complicated. For PIO write operation, after setting the ATA
registers, we send out the data to IO port, and wait for an interrupt.
For PIO read operation, we set the ATA registers, and wait for an
interrupt to fire, then we just read from the data IO port.
This patch adds sys$msyscall() which is loosely based on an OpenBSD
mechanism for preventing syscalls from non-blessed memory regions.
It works similarly to pledge and unveil, you can call it as many
times as you like, and when you're finished, you call it with a null
pointer and it will stop accepting new regions from then on.
If a syscall later happens and doesn't originate from one of the
previously blessed regions, the kernel will simply crash the process.
We had two ways of creating a new Ext2FS inode. Either they were empty,
or they started with some pre-allocated size.
In practice, the pre-sizing code path was only used for new directories
and it didn't actually improve anything as far as I can tell.
This patch simplifies inode creation by simply always allocating empty
inodes. Block allocation and block list generation now always happens
on the same code path.
This prevented from dmidecode to get the right PAGE_SIZE when using the
sysconf syscall.
I found this bug, when I tried to figure why dmidecode fails to mmap
/dev/mem when I passed --no-procfs, and the conclusion is that it tried
to mmap unaligned physical address 0xf5ae0 (SMBIOS data), and that was
caused by a wrong value returned after using the sysconf syscall to get
the plaform page size, therefore, allowing to send an unaligned address
to the mmap syscall.
Because it was 'static const' and also shared with userland programs,
the default keymap was defined in multiple places. This commit should
save several kilobytes! :^)
The enumeration code is already enumerating all buses, recursively
enumerating bridges (which are buses) makes devices on bridges being
enumerated multiple times. Also, the PCI code was incorrectly mixing up
terminology; let's settle down on bus, device and function because ever
since PCIe came along "slots" isn't really a thing anymore.
We had an exception that allowed SOL_SOCKET + SO_PEERCRED on local
socket to support LibIPC's PID exchange mechanism. This is no longer
needed so let's just remove the exception.
It's useful for programs to change their thread names to say something
interesting about what they are working on. Let's not require "thread"
for this since single-threaded programs may want to do it without
pledging "thread".
This used to be in Kernel/, next to the build-root-filesystem.sh script,
which was then moved to Meta/ during the transition to CMake but has the
working directory set to Build/, effectively expecting it there - which
seems silly.
TL;DR: Very confusing. Use an explicit path relative to SERENITY_ROOT
instead and update the .gitignore files.
This replaces the current disk detection and disk access code with
code based on https://wiki.osdev.org/IDE
This allows the system to boot on VirtualBox with serial debugging
enabled and VMWare Player.
I believe there were several issues with the current code:
- It didn't utilise the last 8 bits of the LBA in 24-bit mode.
- {read,write}_sectors_with_dma was not setting the obsolete bits,
which according to OSdev wiki aren't used but should be set.
- The PIO and DMA methods were using slightly different copy
and pasted access code, which is now put into a single
function called "ata_access"
- PIO mode doesn't work. This doesn't fix that and should
be looked into in the future.
- The detection code was not checking for ATA/ATAPI.
- The detection code accidentally had cyls/heads/spt as 8-bit,
when they're 16-bit.
- The capabilities of the device were not considered. This is now
brought in and is currently used to check if the device supports
LBA. If not, use CHS.
This prevents sys$mmap() and sys$mprotect() from creating executable
memory mappings in pledged programs that don't have this promise.
Note that the dynamic loader runs before pledging happens, so it's
unaffected by this.
The random address proposals were not checked with the size so it was
increasely likely to try to allocate outside of available space with
larger and larger sizes.
Now they will be ignored instead of triggering a Kernel assertion
failure.
This is a continuation of: c8e7baf4b8
This adds another layer of defense against introducing new code into a
running process. The only permitted way of doing so is by mmapping an
open file with PROT_READ | PROT_EXEC.
This does make any future JIT implementations slightly more complicated
but I think it's a worthwhile trade-off at this point. :^)
This patch adds enforcement of two new rules:
- Memory that was previously writable cannot become executable
- Memory that was previously executable cannot become writable
Unfortunately we have to make an exception for text relocations in the
dynamic loader. Since those necessitate writing into a private copy
of library code, we allow programs to transition from RW to RX under
very specific conditions. See the implementation of sys$mprotect()'s
should_make_executable_exception_for_dynamic_loader() for details.
When mounting an Ext2FS, a block device source is required. All other
filesystem types are unaffected, as most of them ignore the source file
descriptor anyway.
Fixes#5153.
`allocate_randomized` assert an already sanitized size but `mmap` were
just forwarding whatever the process asked so it was possible to
trigger a kernel panic from an unpriviliged process just by asking some
randomly placed memory and a size non alligned with the page size.
This fixes this issue by rounding up to the next page size before
calling `allocate_randomized`.
Fixes#5149
This allows us to get rid of the thread lists in SchedulerData.
Also, instead of iterating over all threads to find a thread by id,
just use a lookup table. In the rare case of having to iterate over
all threads, just iterate the lookup table.
This can be used to request random VM placement instead of the highly
predictable regular mmap(nullptr, ...) VM allocation strategy.
It will soon be used to implement ASLR in the dynamic loader. :^)
This broke with the change that gave each process a list of its own
threads. Since threads are removed slightly earlier from that list
during process teardown, we're not able to use it for generating
coredump backtraces. Fortunately we have the "threads for coredump"
list for just this purpose. :^)
This adds an optional argument to get_good_random_bytes that can be
used to only return randomness if it doesn't have to block.
Also add a SpinLock around using FortunaPRNG.
Fixes#5132
Since each Process now has its own list of threads, we don't need
to treat colonel any different anymore. This also means that it
reports all kernel threads, not just the idle threads.
In ab14b0ac64, mmap was changed so that
the size of the region is aligned before it was passed to the device
driver. The previous logic would assert when the framebuffer size was
not a multiple of the page size. I've also taken the liberty of
returning an error on mmap failure rather than asserting.
We need to make sure other processors can grab the MM lock while we
wait, so release it when we might block. Reading the page from
disk may also block, so release it during that time as well.
Rather than walking all Thread instances and putting them into
a vector to be sorted by priority, queue them into priority sorted
linked lists as soon as they become ready to be executed.
Attempt to wake idle processors to get threads to be scheduled more quickly.
We don't want to wait until the next timer tick if we have processors that
aren't doing anything.
This eliminates the window between calling Processor::current and
the member function where a thread could be moved to another
processor. This is generally not as big of a concern as with
Processor::current_thread, but also slightly more light weight.
Change Thread::current to be a static function and read using the fs
register, which eliminates a window between Processor::current()
returning and calling a function on it, which can trigger preemption
and a move to a different processor, which then causes operating
on the wrong object.
We also need to store m_in_critical in the Thread upon switching,
and we need to restore it. This solves a problem where threads
moving between different processors could end up with an unexpected
value.
This allows us to determine what the previous mode (user or kernel)
was, e.g. in the timer interrupt. This is used e.g. to determine
whether a signal handler should be set up.
Fixes#5096
If we find ourselves with a user-accessible, non-shared Region backed by
a SharedInodeVMObject, that's pretty bad news, so let's just panic the
kernel instead of getting abused.
There might be a better place for this kind of check, so I've added a
FIXME about putting more thought into that.
This was exploitable since the shared flag determines whether inode
permission checks are applied in sys$mprotect().
The bug was pretty hard to spot due to default arguments being used
instead. This patch removes the default arguments to make explicit
at each call site what's being done.
When passing nullptr for either promises or execpromises to pledge(),
the expected behaviour is to not change their current value at all - we
were accidentally resetting them to 0, effectively dropping previously
pledge()'d promises.
We now move the execpromises state into the regular promises, and clear
the execpromises state.
Also make sure to duplicate the promise state on fork.
This fixes an issue where "su" would launch a shell which immediately
crashed due to not having pledged "stdio".
Let's force callers to provide a VM range when allocating a region.
This makes ENOMEM error handling more visible and removes implicit
VM allocation which felt a bit magical.
This tells the kernel that the process wants to use pledge, but without
pledging anything - effectively restricting it to syscalls that don't
require a certain promise. This is part of OpenBSD's pledge() as well,
which served as basis for Serenity's.
We were enabling interrupts too early, before the first context switch to
a thread was complete. This could then trigger another context switch
within the context switch, which lead to a crash.
There is a window between acquiring/releasing the lock with the atomic
variables and subsequently waiting or waking threads. With a single
processor this window was closed by using a critical section, but
this doesn't prevent other processors from running these code paths.
To solve this, set a flag in the WaitQueue while holding m_lock which
determines if threads should be blocked at all.
Instead of letting each File subclass do range allocation in their
mmap() override, do it up front in sys$mmap().
This makes us honor alignment requests for file-backed memory mappings
and simplifies the code somwhat.
This was done with the help of several scripts, I dump them here to
easily find them later:
awk '/#ifdef/ { print "#cmakedefine01 "$2 }' AK/Debug.h.in
for debug_macro in $(awk '/#ifdef/ { print $2 }' AK/Debug.h.in)
do
find . \( -name '*.cpp' -o -name '*.h' -o -name '*.in' \) -not -path './Toolchain/*' -not -path './Build/*' -exec sed -i -E 's/#ifdef '$debug_macro'/#if '$debug_macro'/' {} \;
done
# Remember to remove WRAPPER_GERNERATOR_DEBUG from the list.
awk '/#cmake/ { print "set("$2" ON)" }' AK/Debug.h.in
Booting old computers without RDRAND/RDSEED and without a disk makes
the system severely starved for entropy. Uses interrupts as a source
to side-step that issue.
Also warn whenever the system is starved of entropy, because that's
a non-obvious failure mode.
For some reason we were keeping the bits 04777 in file modes. That
doesn't seem right and I can't think of a reason why the set-uid bit
should be allowed to slip through.
(mode & S_IFDIR) is not enough to check if "mode" is a directory,
we have to check all the bits in the S_IFMT mask.
Use the is_directory() helper to fix this bug.
Since devices are enumerable and can compute their own name inside the
/dev hierarchy, there is no need to try and parse "root=/dev/xxx" by
hand.
This also makes any block device a candidate for the boot device, which
now includes ramdisk devices, so SerenityOS can now boot diskless too.
The disk image generated for QEMU is suitable, as long as it fits in
memory with room to spare for the rest of the system.
Besides removing the monolithic DevFSDeviceInode::determine_name()
method, being able to determine a device's name inside the /dev
hierarchy outside of DevFS has its uses.
The kernel ignored the first 8 MiB of RAM while parsing the memory map
because the kmalloc heaps and the super physical pages lived here. Move
all that stuff inside the .bss segment so that those memory regions are
accounted for, otherwise we risk overwriting boot modules placed next
to the kernel.
This was just an alias for "unix" that I added early on back when there
was some belief that we might be compatible with OpenBSD. We're clearly
never going to be compatible with their pledges so just drop the alias.
It was possible to signal a process while it was paging in an inode
backed VM object. This would cause the inode read to EINTR, and the
page fault handler would assert.
Solve this by simply not unblocking threads due to signals if they are
currently busy handling a page fault. This is probably not the best way
to solve this issue, so I've added a FIXME to that effect.
..and allow implicit creation of KResult and KResultOr from ErrnoCode.
This means that kernel functions that return those types can finally
do "return EINVAL;" and it will just work.
There's a handful of functions that still deal with signed integers
that should be converted to return KResults.
This way, if something goes wrong, we get to keep the actual error.
Also, KResults are nodiscard, so we have to deal with that in Ext2FS
instead of just silently ignoring I/O errors(!)
Similar to LibC storing an assertion message before aborting, process
death by pledge violation now sets a "pledge_violation" key with the
respective pledge name as value in its coredump metadata, which the
CrashReporter will then show.
Path resolution will now refuse to follow symlinks in some cases where
you don't own the symlink, or when it's in a sticky world-writable
directory and the link has a different owner than the directory.
The point of all this is to prevent classic TOCTOU bugs in /tmp etc.
Fixes#4934
Both ESP and GDTR are left undefined by the Multiboot specification and
OS images must not rely on these values to be valid. Fix the undefined
behaviors so that booting with PXELINUX does not triple-fault the CPU.