This class had slightly confusing semantics and the added weirdness
doesn't seem worth it just so we can say "." instead of "->" when
iterating over a vector of NNRPs.
This patch replaces NonnullRefPtrVector<T> with Vector<NNRP<T>>.
The handling of page tables is very architecture specific, so belongs
in the Arch directory. Some parts were already architecture-specific,
however this commit moves the rest of the PageDirectory class into the
Arch directory.
While we're here the aarch64/PageDirectory.{h,cpp} files are updated to
be aarch64 specific, by renaming some members and removing x86_64
specific code.
This was previously hardcoded this to be the physical memory range,
since we identity mapped the memory, however we now run the kernel at
a high virtual memory address.
Also changes PageDirectory.h to store up-to 512 pages, as the code now
needs access to more than 4 pages.
Reduce inclusion of limits.h as much as possible at the same time.
This does mean that kmalloc.h is now including Kernel/API/POSIX/limits.h
instead of LibC/limits.h, but the scope could be limited a lot more.
Basically every file in the kernel includes kmalloc.h, and needs the
limits.h include for PAGE_SIZE.
These instances were detected by searching for files that include
AK/Memory.h, but don't match the regex:
\\b(fast_u32_copy|fast_u32_fill|secure_zero|timing_safe_compare)\\b
This regex is pessimistic, so there might be more files that don't
actually use any memory function.
In theory, one might use LibCPP to detect things like this
automatically, but let's do this one step after another.
This step would ideally not have been necessary (increases amount of
refactoring and templates necessary, which in turn increases build
times), but it gives us a couple of nice properties:
- SpinlockProtected inside Singleton (a very common combination) can now
obtain any lock rank just via the template parameter. It was not
previously possible to do this with SingletonInstanceCreator magic.
- SpinlockProtected's lock rank is now mandatory; this is the majority
of cases and allows us to see where we're still missing proper ranks.
- The type already informs us what lock rank a lock has, which aids code
readability and (possibly, if gdb cooperates) lock mismatch debugging.
- The rank of a lock can no longer be dynamic, which is not something we
wanted in the first place (or made use of). Locks randomly changing
their rank sounds like a disaster waiting to happen.
- In some places, we might be able to statically check that locks are
taken in the right order (with the right lock rank checking
implementation) as rank information is fully statically known.
This refactoring even more exposes the fact that Mutex has no lock rank
capabilites, which is not fixed here.
We add this basic functionality to the Kernel so Userspace can request a
particular virtual memory mapping to be immutable. This will be useful
later on in the DynamicLoader code.
The annotation of a particular Kernel Region as immutable implies that
the following restrictions apply, so these features are prohibited:
- Changing the region's protection bits
- Unmapping the region
- Annotating the region with other virtual memory flags
- Applying further memory advises on the region
- Changing the region name
- Re-mapping the region
We were previously using a 32-bit unsigned integer for this, which
caused us to start truncating region sizes when multiplied with
`PAGE_SIZE` on hardware with a lot of memory.
Our implementation for Jails resembles much of how FreeBSD jails are
working - it's essentially only a matter of using a RefPtr in the
Process class to a Jail object. Then, when we iterate over all processes
in various cases, we could ensure if either the current process is in
jail and therefore should be restricted what is visible in terms of
PID isolation, and also to be able to expose metadata about Jails in
/sys/kernel/jails node (which does not reveal anything to a process
which is in jail).
A lifetime model for the Jail object is currently plain simple - there's
simpy no way to manually delete a Jail object once it was created. Such
feature should be carefully designed to allow safe destruction of a Jail
without the possibility of releasing a process which is in Jail from the
actual jail. Each process which is attached into a Jail cannot leave it
until the end of a Process (i.e. when finalizing a Process). All jails
are kept being referenced in the JailManagement. When a last attached
process is finalized, the Jail is automatically destroyed.
For the initial page tables we only need to identity map the kernel
image, the rest of the memory will be managed by the MemoryManager. The
linker script is updated to get the kernel image start and end
addresses.
We no longer require to lock the m_inode_lock in the SharedInodeVMObject
code as the methods write_bytes and read_bytes of the Inode class do
this for us now.
According to Dr. POSIX, we should allow to call mmap on inodes even on
ranges that currently don't map to any actual data. Trying to read or
write to those ranges should result in SIGBUS being sent to the thread
that did violating memory access.
To implement this restriction, we simply check if the result of
read_bytes on an Inode returns 0, which means we have nothing valid to
map to the program, hence it should receive a SIGBUS in that case.
This class is intended to replace all IOAddress usages in the Kernel
codebase altogether. The idea is to ensure IO can be done in
arch-specific manner that is determined mostly in compile-time, but to
still be able to use most of the Kernel code in non-x86 builds. Specific
devices that rely on x86-specific IO instructions are already placed in
the Arch/x86 directory and are omitted for non-x86 builds.
The reason this works so well is the fact that x86 IO space acts in a
similar fashion to the traditional memory space being available in most
CPU architectures - the x86 IO space is essentially just an array of
bytes like the physical memory address space, but requires x86 IO
instructions to load and store data. Therefore, many devices allow host
software to interact with the hardware registers in both ways, with a
noticeable trend even in the modern x86 hardware to move away from the
old x86 IO space to exclusively using memory-mapped IO.
Therefore, the IOWindow class encapsulates both methods for x86 builds.
The idea is to allow PCI devices to be used in either way in x86 builds,
so when trying to map an IOWindow on a PCI BAR, the Kernel will try to
find the proper method being declared with the PCI BAR flags.
For old PCI hardware on non-x86 builds this might turn into a problem as
we can't use port mapped IO, so the Kernel will gracefully fail with
ENOTSUP error code if that's the case, as there's really nothing we can
do within such case.
For general IO, the read{8,16,32} and write{8,16,32} methods are
available as a convenient API for other places in the Kernel. There are
simply no direct 64-bit IO API methods yet, as it's not needed right now
and is not considered to be Arch-agnostic too - the x86 IO space doesn't
support generating 64 bit cycle on IO bus and instead requires two 2
32-bit accesses. If for whatever reason it appears to be necessary to do
IO in such manner, it could probably be added with some neat tricks to
do so. It is recommended to use Memory::TypedMapping struct if direct 64
bit IO is actually needed.
The RTC and CMOS are currently only supported for x86 platforms and use
specific x86 instructions to produce only certain x86 plaform operations
and results, therefore, we move them to the Arch/x86 specific directory.
According to Dr. POSIX, we should allow to call mmap on inodes even on
ranges that currently don't map to any actual data. Trying to read or
write to those ranges should result in SIGBUS being sent to the thread
that did violating memory access.
Otherwise we would be holding the MM global data lock and the Process
address space locks in reversed order to the rest of the system, which
can lead to deadlocks.
This commit updates the lock function from Spinlock and
RecursiveSpinlock to return the InterruptsState of the processor,
instead of the processor flags. The unlock functions would only look at
the interrupt flag of the processor flags, so we now use the
InterruptsState enum to clarify the intent, and such that we can use the
same Spinlock code for the aarch64 build.
To not break the build, all the call sites are updated aswell.
Globally shared MemoryManager state is now kept in a GlobalData struct
and wrapped in SpinlockProtected.
A small set of members are left outside the GlobalData struct as they
are only set during boot initialization, and then remain constant.
This allows us to access those members without taking any locks.
I believe this to be safe, as the main thing that LockRefPtr provides
over RefPtr is safe copying from a shared LockRefPtr instance. I've
inspected the uses of RefPtr<PhysicalPage> and it seems they're all
guarded by external locking. Some of it is less obvious, but this is
an area where we're making continuous headway.
We're not accessing any of the MM members here. Also remove some
redundant code to update CR3, since it calls activate_page_directory()
which does exactly the same thing.
Now that AddressSpace itself is always SpinlockProtected, we don't
need to also wrap the RegionTree. Whoever has the AddressSpace locked
is free to poke around its tree.
This allows sys$mprotect() to honor the original readable & writable
flags of the open file description as they were at the point we did the
original sys$mmap().
IIUC, this is what Dr. POSIX wants us to do:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/mprotect.html
Also, remove the bogus and racy "W^X" checking we did against mappings
based on their current inode metadata. If we want to do this, we can do
it properly. For now, it was not only racy, but also did blocking I/O
while holding a spinlock.
This forces anyone who wants to look into and/or manipulate an address
space to lock it. And this replaces the previous, more flimsy, manual
spinlock use.
Note that pointers *into* the address space are not safe to use after
you unlock the space. We've got many issues like this, and we'll have
to track those down as wlel.
We were holding the MM lock across all of the region unmapping code.
This was previously necessary since the quickmaps used during unmapping
required holding the MM lock.
Now that it's no longer necessary, we can leave the MM lock alone here.
This makes locking them much more straightforward, and we can remove
a bunch of confusing use of AddressSpace::m_lock. That lock will also
be converted to use of SpinlockProtected in a subsequent patch.
You're still required to disable interrupts though, as the mappings are
per-CPU. This exposed the fact that our CR3 lookup map is insufficiently
protected (but we'll address that in a separate commit.)
While the "regular" quickmap (used to temporarily map a physical page
at a known address for quick access) has been per-CPU for a while,
we also have the PD (page directory) and PT (page table) quickmaps
used by the memory management code to edit page tables. These have been
global, which meant that SMP systems had to keep fighting over them.
This patch makes *all* quickmaps per-CPU. We reserve virtual addresses
for up to 64 CPUs worth of quickmaps for now.
Note that all quickmaps are still protected by the MM lock, and we'll
have to fix that too, before seeing any real throughput improvements.
Until now, our kernel has reimplemented a number of AK classes to
provide automatic internal locking:
- RefPtr
- NonnullRefPtr
- WeakPtr
- Weakable
This patch renames the Kernel classes so that they can coexist with
the original AK classes:
- RefPtr => LockRefPtr
- NonnullRefPtr => NonnullLockRefPtr
- WeakPtr => LockWeakPtr
- Weakable => LockWeakable
The goal here is to eventually get rid of the Lock* classes in favor of
using external locking.
Instead of having two separate implementations of AK::RefCounted, one
for userspace and one for kernelspace, there is now RefCounted and
AtomicRefCounted.
All users which relied on the default constructor use a None lock rank
for now. This will make it easier to in the future remove LockRank and
actually annotate the ranks by searching for None.
As soon as we've saved CR2 (the faulting address), we can re-enable
interrupt processing. This should make the kernel more responsive under
heavy fault loads.
Region::physical_page() now takes the VMObject lock while accessing the
physical pages array, and returns a RefPtr<PhysicalPage>. This ensures
that the array access is safe.
Region::physical_page_slot() now VERIFY()'s that the VMObject lock is
held by the caller. Since we're returning a reference to the physical
page slot in the VMObject's physical page array, this is the best we
can do here.
We really only need the VMObject lock when accessing the physical pages
array, so once we have a strong pointer to the physical page we want to
remap, we can give up the VMObject lock.
This fixes a deadlock I encountered while building DOOM on SMP.
When handling a page fault, we only need to remap the faulting region in
the current process. There's no need to traverse *all* regions that map
the same VMObject and remap them cross-process as well.
Those other regions will get remapped lazily by their own page fault
handlers eventually. Or maybe they won't and we avoided some work. :^)
- Instead of holding the VMObject lock across physical page allocation
and quick-map + copy, we now only hold it when updating the VMObject's
physical page slot.
We ensure that when we call SharedInodeVMObject::sync we lock the inode
lock before calling Inode virtual write_bytes method directly to avoid
assertion on the unlocked inode lock, as it was regressed recently. This
is not a complete fix as the need to lock from each path before calling
the write_bytes method should be avoided because it can lead to
hard-to-find bugs, and this commit only fixes the problem temporarily.
Previously, when starved for pages, *all* clean file-backed memory
would be released, which is quite excessive.
This patch instead releases just 1 page, since only 1 page is needed
to satisfy the request to `allocate_physical_page()`
Previously, we could only release *all* clean pages.
This patch makes it possible to release a specific amount of clean
pages. If the attempted number of pages to release is more than the
amount of clean pages, all clean pages will be released.
At the point at which we try to map the Region it was already added to
the Process region tree, so we have to make sure to remove it before
freeing it in the mapping failure path, otherwise the tree will contain
a dangling pointer to the free'd instance.
Until now, our only backup plan when running out of physical pages
was to try and purge volatile memory. If that didn't work out, we just
hung userspace out to dry with an ENOMEM.
This patch improves the situation by also considering clean, file-backed
pages (that we could page back in from disk).
This could be better in many ways, but it already allows us to boot to
WindowServer with 256 MiB of RAM. :^)
This deadlock was introduced with the creation of this API. The lock
order is such that we always need to take the page directory lock
before we ever take the MM lock.
This function violated that, as both Region creation and region
destruction require the pd and mm locks, but with the mm lock
already acquired we deadlocked with SMP mode enabled while other
threads were allocating regions.
With this change SMP boots to the desktop successfully for me,
(and then subsequently has other issues). :^)
There's no real value in separating physical pages to supervisor and
user types, so let's remove the concept and just let everyone to use
"user" physical pages which can be allocated from any PhysicalRegion
we want to use. Later on, we will remove the "user" prefix as this
prefix is not needed anymore.
We are limited on the amount of supervisor pages we can allocate, so
don't allocate from that pool. Supervisor pages are always below 16 MiB
barrier so using those was crucial when we used devices like the ISA
SoundBlaster 16 card, because that device required very low physical
addresses to be used.
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
Uncommitted pages (shared zero pages) can not contain any existing data
and can not be modified, so there's no point to committing a bunch of
extra pages to cover for them in the forked child.
Since both the parent process and child process hold a reference to the
COW committed set, once the child process exits, the committed COW
pages are effectively leaked, only being slowly re-claimed each time
the parent process writes to one of them, realizing it's no longer
shared, and uncommitting it.
In order to mitigate this we now hold a weak reference the parent
VMObject from which the pages are cloned, and we use it on destruction
when available to drop the reference to the committed set from it as
well.
This is basically unchanged since the beginning of 2020, which is a year
before we had proper ASLR.
Now that we have a proper ASLR implementation, we can turn this down a
bit, as it is no longer our only protection against predictable dynamic
loader addresses, and it actually obstructs the default loading address
of x86_64 quite frequently.
There's nothing stopping a userspace program from keeping a bunch of
threads around with a custom signal stack in a suspended state with
their normal thread stack mprotected to PROT_NONE.
OpenJDK seems to do this, for example.
This new type of VMObject will be used to coordinate switching safely
from graphical mode to text mode and vice-versa, by supplying a way to
remap all Regions that were created with this object, so mappings can be
changed according to the given state of system mode. This makes it quite
easy to give applications like WindowServer the feeling of having full
access to the framebuffer device from a DisplayConnector, but still keep
the Kernel in control to be able to safely switch to text console.
If we unregister from the RegionTree before unmapping, there's a race
where a new region can get inserted at the same address that we're about
to unmap. If this happens, ~Region() will then unmap the newly inserted
region, which now finds itself with cleared-out page table entries.
This had no business being in RegionTree, since RegionTree doesn't track
identity-mapped regions anyway. (We allow *any* address to be identity
mapped, not just the ones that are part of the RegionTree's range.)