Commit graph

450 commits

Author SHA1 Message Date
Andreas Kling ecdd9a5bc6 Kernel: Reduce code duplication a little bit in Region allocation
This patch reduces the number of code paths that lead to the allocation
of a Region object. It's quite hard to follow the various ways in which
this can happen, so this is an effort to simplify.
2020-03-01 15:56:23 +01:00
Andreas Kling 5e0c4d689f Kernel: Move ProcessPagingScope to its own files 2020-03-01 15:38:09 +01:00
Andreas Kling fee20bd8de Kernel: Remove some more harmless InodeVMObject miscasts 2020-03-01 12:27:03 +01:00
Andreas Kling b614462079 Kernel: Include the dirty bits when cloning an InodeVMObject
Now that (private) InodeVMObjects can be CoW-cloned on fork(), we need
to make sure we clone the dirty bits as well.
2020-03-01 12:11:50 +01:00
Andreas Kling 48bbfe51fb Kernel: Add some InodeVMObject type assertions in Region::clone()
Let's make sure that we're never cloning shared inode-backed objects
as if they were private, and vice versa.
2020-03-01 11:23:10 +01:00
Andreas Kling 88b334135b Kernel: Remove some Region construction helpers
It's now up to the caller to provide a VMObject when constructing a new
Region object. This will make it easier to handle things going wrong,
like allocation failures, etc.
2020-03-01 11:23:10 +01:00
Andreas Kling fddc3c957b Kernel: CoW-clone private inode-backed memory regions on fork()
When forking a process, we now turn all of the private inode-backed
mmap() regions into copy-on-write regions in both the parent and child.

This patch also removes an assertion that becomes irrelevant.
2020-03-01 11:23:10 +01:00
Andreas Kling 7cd1bdfd81 Kernel: Simplify some dbg() logging
We don't have to log the process name/PID/TID, dbg() automatically adds
that as a prefix to every line.

Also we don't have to do .characters() on Strings passed to dbg() :^)
2020-02-29 13:39:06 +01:00
Andreas Kling 5f7056d62c Kernel: Expose the VMObject type of each Region in /proc/PID/vm 2020-02-28 23:25:40 +01:00
Andreas Kling aa1e209845 Kernel: Remove some unnecessary indirection in InodeFile::mmap()
InodeFile now directly calls Process::allocate_region_with_vmobject()
instead of taking an awkward detour via a special Region constructor.
2020-02-28 20:29:14 +01:00
Andreas Kling 651417a085 Kernel: Split InodeVMObject into two subclasses
We now have PrivateInodeVMObject and SharedInodeVMObject, corresponding
to MAP_PRIVATE and MAP_SHARED respectively.

Note that PrivateInodeVMObject is not used yet.
2020-02-28 20:20:35 +01:00
Andreas Kling 07a26aece3 Kernel: Rename InodeVMObject => SharedInodeVMObject 2020-02-28 20:07:51 +01:00
Liav A d16b26f83a MemoryManager: Use dbg() instead of dbgprintf() 2020-02-27 13:05:12 +01:00
Liav A 42665817d1 RangeAllocator: Use dbg() instead of dbgprintf() 2020-02-27 13:05:12 +01:00
Liav A 3f2d5f2774 PhysicalPage: Use dbg() instead of dbgprintf() 2020-02-27 13:05:12 +01:00
Liav A 24d2aeda8e Region: Use dbg() instead of dbgprintf() 2020-02-27 13:05:12 +01:00
Liav A 3f95a7fc97 InodeVMObject: Use dbg() instead of dbgprintf() 2020-02-27 13:05:12 +01:00
Liav A 62adbbc598 PageDirectory: Use dbg() instead of dbgprintf() 2020-02-27 13:05:12 +01:00
Andreas Kling ceec1a7d38 AK: Make Vector use size_t for its size and capacity 2020-02-25 14:52:35 +01:00
Andreas Kling 30a8991dbf Kernel: Make Region weakable and use WeakPtr<Region> instead of Region*
This turns use-after-free bugs into null pointer dereferences instead.
2020-02-24 13:32:45 +01:00
Andreas Kling 0763f67043 AK: Make Bitmap use size_t for its size
Also rework its API's to return Optional<size_t> instead of int with -1
as the error value.
2020-02-24 09:56:07 +01:00
Andreas Kling 7ec758773c Kernel: Dump all kernel regions when we hit a page fault during IRQ
This way you can try to figure out what the faulting address is.
2020-02-23 11:10:52 +01:00
Andreas Kling f020081a38 Kernel: Put "Couldn't find user region" spam behind MM_DEBUG
This basically never tells us anything actionable anyway, and it's a
real annoyance when doing something validation-heavy like profiling.
2020-02-22 10:09:54 +01:00
Andreas Kling b298c01e92 Kernel: Log instead of crashing when getting a page fault during IRQ
This is definitely a bug, but it seems to happen randomly every now
and then and we need more info to track it down, so let's log for now.
2020-02-21 19:05:45 +01:00
Andreas Kling 04e40da188 Kernel: Fix crash when reading /proc/PID/vmobjects
InodeVMObjects can have nulled-out physical page slots. That just means
we haven't cached that page from disk right now.
2020-02-21 16:03:56 +01:00
Andreas Kling 59b9e49bcd Kernel: Don't trigger page faults during profiling stack walk
The kernel sampling profiler will walk thread stacks during the timer
tick handler. Since it's not safe to trigger page faults during IRQ's,
we now avoid this by checking the page tables manually before accessing
each stack location.
2020-02-21 15:49:39 +01:00
Andreas Kling d46071c08f Kernel: Assert on page fault during IRQ
We're not equipped to deal with page faults during an IRQ handler,
so add an assertion so we can immediately tell what's wrong.

This is why profiling sometimes hangs the system -- walking the stack
of the profiled thread causes a page fault and things fall apart.
2020-02-21 15:49:34 +01:00
Andreas Kling a87544fe8b Kernel: Refuse to allocate 0 bytes of virtual address space 2020-02-19 22:19:55 +01:00
Andreas Kling f17c377a0c Kernel: Use bitfields in Region
This makes Region 4 bytes smaller and we can use bitfield initializers
since they are allowed in C++20. :^)
2020-02-19 12:03:11 +01:00
Andreas Kling 4b16ac0034 Kernel: Purging a page should point it back to the shared zero page
Anonymous VM objects should never have null entries in their physical
page list. Instead, "empty" or untouched pages should refer to the
shared zero page.

Fixes #1237.
2020-02-18 09:56:11 +01:00
Andreas Kling 48f7c28a5c Kernel: Replace "current" with Thread::current and Process::current
Suggested by Sergey. The currently running Thread and Process are now
Thread::current and Process::current respectively. :^)
2020-02-17 15:04:27 +01:00
Andreas Kling 31e1af732f Kernel+LibC: Allow sys$mmap() callers to specify address alignment
This is exposed via the non-standard serenity_mmap() call in userspace.
2020-02-16 12:55:56 +01:00
Andreas Kling 7533d61458 Kernel: Fix weird whitespace mistake in RangeAllocator 2020-02-16 08:01:33 +01:00
Andreas Kling 635ae70b8f Kernel: More header dependency reduction work 2020-02-16 02:15:33 +01:00
Andreas Kling e28809a996 Kernel: Add forward declaration header 2020-02-16 01:50:32 +01:00
Andreas Kling 1d611e4a11 Kernel: Reduce header dependencies of MemoryManager and Region 2020-02-16 01:33:41 +01:00
Andreas Kling a356e48150 Kernel: Move all code into the Kernel namespace 2020-02-16 01:27:42 +01:00
Andreas Kling 5507945306 Kernel: Widen PhysicalPage refcount to 32 bits
A 16-bit refcount is just begging for trouble right nowl.
A 32-bit refcount will be begging for trouble later down the line,
so we'll have to revisit this eventually. :^)
2020-02-15 22:34:48 +01:00
Andreas Kling c624d3875e Kernel: Use a shared physical page for zero-filled pages until written
This patch adds a globally shared zero-filled PhysicalPage that will
be mapped into every slot of every zero-filled AnonymousVMObject until
that page is written to, achieving CoW-like zero-filled pages.

Initial testing show that this doesn't actually achieve any sharing yet
but it seems like a good design regardless, since it may reduce the
number of page faults taken by programs.

If you look at the refcount of MM.shared_zero_page() it will have quite
a high refcount, but that's just because everything maps it everywhere.
If you want to see the "real" refcount, you can build with the
MAP_SHARED_ZERO_PAGE_LAZILY flag, and we'll defer mapping of the shared
zero page until the first NP read fault.

I've left this behavior behind a flag for future testing of this code.
2020-02-15 13:17:40 +01:00
Andreas Kling 27f0102bbe Kernel: Add getter and setter for the X86 CR3 register
This gets rid of a bunch of inline assembly.
2020-02-10 20:00:32 +01:00
Andreas Kling ccfee3e573 Kernel: Remove more <LibBareMetal/Output/kstdio.h> includes 2020-02-10 12:07:48 +01:00
Andreas Kling 6cbd72f54f AK: Remove bitrotted Traits::dump() mechanism
This was only used by HashTable::dump() which I used when doing the
first HashTable implementation. Removing this allows us to also remove
most includes of <AK/kstdio.h>.
2020-02-10 11:55:34 +01:00
Liav A 99ea80695e Kernel: Use VirtualAddress & PhysicalAddress classes from LibBareMetal 2020-02-09 19:38:17 +01:00
Liav A e559af2008 Kernel: Apply changes to use LibBareMetal definitions 2020-02-09 19:38:17 +01:00
Andreas Kling 00d8ec3ead Kernel: The inode fault handler should grab the VMObject lock earlier
It doesn't look healthy to create raw references into an array before
a temporary unlock. In fact, that temporary unlock looks generally
unhealthy, but it's a different problem.
2020-02-08 12:55:21 +01:00
Andreas Kling a9d7902bb7 x86: Simplify region unmapping a bit
Add PageTableEntry::clear() to zero out a whole PTE, and use that for
unmapping instead of clearing individual fields.
2020-02-08 12:49:38 +01:00
Andreas Kling f91b3aab47 Kernel: Cloned shared regions should also be marked as shared 2020-02-08 02:39:46 +01:00
Andreas Kling bf5b7c32d8 Kernel: Add some sanity assertions in RangeAllocator::deallocate()
We should never end up deallocating an empty range, or a range that
ends before it begins.
2020-01-30 21:51:27 +01:00
Andreas Kling 31a141bd10 Kernel: Range::contains() should reject ranges with 2^32 wrap-around 2020-01-30 21:51:27 +01:00
Andreas Kling a27c5d2fb7 Kernel: Fail with EFAULT for any address+size that would wrap around
Previously we were only checking that each of the virtual pages in the
specified range were valid.

This made it possible to pass in negative buffer sizes to some syscalls
as long as (address) and (address+size) were on the same page.
2020-01-29 12:56:07 +01:00
Andreas Kling c17f80e720 Kernel: AnonymousVMObject::create_for_physical_range() should fail more
Previously it was not possible for this function to fail. You could
exploit this by triggering the creation of a VMObject whose physical
memory range would wrap around the 32-bit limit.

It was quite easy to map kernel memory into userspace and read/write
whatever you wanted in it.

Test: Kernel/bxvga-mmap-kernel-into-userspace.cpp
2020-01-28 20:48:07 +01:00
Andreas Kling 8131875da6 Kernel: Remove outdated comment in MemoryManager
Regions *do* zero-fill on demand now. :^)
2020-01-28 10:28:04 +01:00
Andreas Kling 3de5439579 AK: Let's call decrementing reference counts "unref" instead of "deref"
It always bothered me that we're using the overloaded "dereference"
term for this. Let's call it "unreference" instead. :^)
2020-01-23 15:14:21 +01:00
Andreas Kling f38cfb3562 Kernel: Tidy up debug logging a little bit
When using dbg() in the kernel, the output is automatically prefixed
with [Process(PID:TID)]. This makes it a lot easier to understand which
thread is generating the output.

This patch also cleans up some common logging messages and removes the
now-unnecessary "dbg() << *current << ..." pattern.
2020-01-21 16:16:20 +01:00
Liav A 200a5b0649 Kernel: Remove map_for_kernel() in MemoryManager
We don't need to have this method anymore. It was a hack that was used
in many components in the system but currently we use better methods to
create virtual memory mappings. To prevent any further use of this
method it's best to just remove it completely.

Also, the APIC code is disabled for now since it doesn't help booting
the system, and is broken since it relies on identity mapping to exist
in the first 1MB. Any call to the APIC code will result in assertion
failed.

In addition to that, the name of the method which is responsible to
create an identity mapping between 1MB to 2MB was changed, to be more
precise about its purpose.
2020-01-21 11:29:58 +01:00
Andreas Kling a0b716cfc5 Add AnonymousVMObject::create_with_physical_page()
This can be used to create a VMObject for a single PhysicalPage.
2020-01-20 13:13:03 +01:00
Andreas Kling 4ebff10bde Kernel: Write-only regions should still be mapped as present
There is no real "read protection" on x86, so we have no choice but to
map write-only pages simply as "present & read/write".

If we get a read page fault in a non-readable region, that's still a
correctness issue, so we crash the process. It's by no means a complete
protection against invalid reads, since it's trivial to fool the kernel
by first causing a write fault in the same region.
2020-01-20 13:13:03 +01:00
Andreas Kling 4b7a89911c Kernel: Remove some unnecessary casts to uintptr_t
VirtualAddress is constructible from uintptr_t and const void*.
PhysicalAddress is constructible from uintptr_t but not const void*.
2020-01-20 13:13:03 +01:00
Andreas Kling a246e9cd7e Use uintptr_t instead of u32 when storing pointers as integers
uintptr_t is 32-bit or 64-bit depending on the target platform.
This will help us write pointer size agnostic code so that when the day
comes that we want to do a 64-bit port, we'll be in better shape.
2020-01-20 13:13:03 +01:00
Andreas Kling 05836757c6 Kernel: Oops, fix bad sort order of available VM ranges
This made the allocator perform worse, so here's another second off of
the Kernel/Process.cpp compile time from a simple bugfix! (31s to 30s)
2020-01-19 15:53:43 +01:00
Andreas Kling 6eab7b398d Kernel: Make ProcessPagingScope restore CR3 properly
Instead of restoring CR3 to the current process's paging scope when a
ProcessPagingScope goes out of scope, we now restore exactly whatever
the CR3 value was when we created the ProcessPagingScope.

This fixes breakage in situations where a process ends up with nested
ProcessPagingScopes. This was making profiling very fragile, and with
this change it's now possible to profile g++! :^)
2020-01-19 13:44:53 +01:00
Andreas Kling ad3f931707 Kernel: Optimize VM range deallocation a bit
Previously, when deallocating a range of VM, we would sort and merge
the range list. This was quite slow for large processes.

This patch optimizes VM deallocation in the following ways:

- Use binary search instead of linear scan to find the place to insert
  the deallocated range.

- Insert at the right place immediately, removing the need to sort.

- Merge the inserted range with any adjacent range(s) in-line instead
  of doing a separate merge pass into a list copy.

- Add Traits<Range> to inform Vector that Range objects are trivial
  and can be moved using memmove().

I've also added an assertion that deallocated ranges are actually part
of the RangeAllocator's initial address range.

I've benchmarked this using g++ to compile Kernel/Process.cpp.
With these changes, compilation goes from ~41 sec to ~35 sec.
2020-01-19 13:29:59 +01:00
Andreas Kling f7b394e9a1 Kernel: Assert that copy_to/from_user() are called with user addresses
This will panic the kernel immediately if these functions are misused
so we can catch it and fix the misuse.

This patch fixes a couple of misuses:

    - create_signal_trampolines() writes to a user-accessible page
      above the 3GB address mark. We should really get rid of this
      page but that's a whole other thing.

    - CoW faults need to use copy_from_user rather than copy_to_user
      since it's the *source* pointer that points to user memory.

    - Inode faults need to use memcpy rather than copy_to_user since
      we're copying a kernel stack buffer into a quickmapped page.

This should make the copy_to/from_user() functions slightly less useful
for exploitation. Before this, they were essentially just glorified
memcpy() with SMAP disabled. :^)
2020-01-19 09:18:55 +01:00
Andreas Kling 2cd212e5df Kernel: Let's say that everything < 3GB is user virtual memory
Technically the bottom 2MB is still identity-mapped for the kernel and
not made available to userspace at all, but for simplicity's sake we
can just ignore that and make "address < 0xc0000000" the canonical
check for user/kernel.
2020-01-19 08:58:33 +01:00
Andreas Kling 862b3ccb4e Kernel: Enforce W^X between sys$mmap() and sys$execve()
It's now an error to sys$mmap() a file as writable if it's currently
mapped executable by anyone else.

It's also an error to sys$execve() a file that's currently mapped
writable by anyone else.

This fixes a race condition vulnerability where one program could make
modifications to an executable while another process was in the kernel,
in the middle of exec'ing the same executable.

Test: Kernel/elf-execve-mmap-race.cpp
2020-01-18 23:40:12 +01:00
Andreas Kling 6fea316611 Kernel: Move all CPU feature initialization into cpu_setup()
..and do it very very early in boot.
2020-01-18 10:11:29 +01:00
Andreas Kling 94ca55cefd Meta: Add license header to source files
As suggested by Joshua, this commit adds the 2-clause BSD license as a
comment block to the top of every source file.

For the first pass, I've just added myself for simplicity. I encourage
everyone to add themselves as copyright holders of any file they've
added or modified in some significant way. If I've added myself in
error somewhere, feel free to replace it with the appropriate copyright
holder instead.

Going forward, all new source files should include a license header.
2020-01-18 09:45:54 +01:00
Andreas Kling 19c31d1617 Kernel: Always dump kernel regions when dumping process regions 2020-01-18 08:57:18 +01:00
Andreas Kling 345f92d5ac Kernel: Remove two unused MemoryManager functions 2020-01-18 08:57:18 +01:00
Andreas Kling 3e8b60c618 Kernel: Clean up MemoryManager initialization a bit more
Move the CPU feature enabling to functions in Arch/i386/CPU.cpp.
2020-01-18 00:28:16 +01:00
Andreas Kling a850a89c1b Kernel: Add a random offset to the base of the per-process VM allocator
This is not ASLR, but it does de-trivialize exploiting the ELF loader
which would previously always parse executables at 0x01001000 in every
single exec(). I've taken advantage of this multiple times in my own
toy exploits and it's starting to feel cheesy. :^)
2020-01-17 23:29:54 +01:00
Andreas Kling 536c0ff3ee Kernel: Only clone the bottom 2MB of mappings from kernel to processes 2020-01-17 22:34:36 +01:00
Andreas Kling 122c76d7fa Kernel: Don't allocate per-process PDPT from super pages either
The default system is now down to 3 super pages allocated on boot. :^)
2020-01-17 22:34:36 +01:00
Andreas Kling ad1f79fb4a Kernel: Stop allocating page tables from the super pages pool
We now use the regular "user" physical pages for on-demand page table
allocations. This was by far the biggest source of super physical page
exhaustion, so that bug should be a thing of the past now. :^)

We still have super pages, but they are barely used. They remain useful
for code that requires memory with a low physical address.

Fixes #1000.
2020-01-17 22:34:36 +01:00
Andreas Kling f71fc88393 Kernel: Re-enable protection of the kernel image in memory 2020-01-17 22:34:36 +01:00
Andreas Kling 59b584d983 Kernel: Tidy up the lowest part of the address space
After MemoryManager initialization, we now only leave the lowest 1MB
of memory identity-mapped. The very first (null) page is not present.
All other pages are RW but not X. Supervisor only.
2020-01-17 22:34:36 +01:00
Andreas Kling 545ec578b3 Kernel: Tidy up the types imported from boot.S a little bit 2020-01-17 22:34:36 +01:00
Andreas Kling 7e6f0efe7c Kernel: Move Multiboot memory map parsing to its own function 2020-01-17 22:34:36 +01:00
Andreas Kling ba8275a48e Kernel: Clean up ensure_pte() 2020-01-17 22:34:36 +01:00
Andreas Kling e362b56b4f Kernel: Move kernel above the 3GB virtual address mark
The kernel and its static data structures are no longer identity-mapped
in the bottom 8MB of the address space, but instead move above 3GB.

The first 8MB above 3GB are pseudo-identity-mapped to the bottom 8MB of
the physical address space. But things don't have to stay this way!

Thanks to Jesse who made an earlier attempt at this, it was really easy
to get device drivers working once the page tables were in place! :^)

Fixes #734.
2020-01-17 22:34:26 +01:00
Liav A d2b41010c5 Kernel: Change Region allocation helpers
We now can create a cacheable Region, so when map() is called, if a
Region is cacheable then all the virtual memory space being allocated
to it will be marked as not cache disabled.

In addition to that, OS components can create a Region that will be
mapped to a specific physical address by using the appropriate helper
method.
2020-01-14 15:38:58 +01:00
Andreas Kling 5c3c2a9bac Kernel: Copy Region's "is_mmap" flag when cloning regions for fork()
Otherwise child processes will not be allowed to munmap(), madvise(),
etc. on the cloned regions!
2020-01-10 19:24:01 +01:00
Andreas Kling 62c45850e1 Kernel: Page allocation should not use memset_user() when zeroing
We're not zeroing new pages through a userspace address, so this should
not use memset_user().
2020-01-10 10:57:33 +01:00
Andreas Kling 197e73ee31 Kernel+LibELF: Enable SMAP protection during non-syscall exec()
When loading a new executable, we now map the ELF image in kernel-only
memory and parse it there. Then we use copy_to_user() when initializing
writable regions with data from the executable.

Note that the exec() syscall still disables SMAP protection and will
require additional work. This patch only affects kernel-originated
process spawns.
2020-01-10 10:57:06 +01:00
Andreas Kling 8e7420ddf2 Kernel: Harden memory mapping of the kernel image
We now map the kernel's text and rodata segments read+execute.
We also make the data and bss segments non-executable.

Thanks to q3k for the idea! :^)
2020-01-06 13:55:39 +01:00
Andreas Kling 9eef39d68a Kernel: Start implementing x86 SMAP support
Supervisor Mode Access Prevention (SMAP) is an x86 CPU feature that
prevents the kernel from accessing userspace memory. With SMAP enabled,
trying to read/write a userspace memory address while in the kernel
will now generate a page fault.

Since it's sometimes necessary to read/write userspace memory, there
are two new instructions that quickly switch the protection on/off:
STAC (disables protection) and CLAC (enables protection.)
These are exposed in kernel code via the stac() and clac() helpers.

There's also a SmapDisabler RAII object that can be used to ensure
that you don't forget to re-enable protection before returning to
userspace code.

THis patch also adds copy_to_user(), copy_from_user() and memset_user()
which are the "correct" way of doing things. These functions allow us
to briefly disable protection for a specific purpose, and then turn it
back on immediately after it's done. Going forward all kernel code
should be moved to using these and all uses of SmapDisabler are to be
considered FIXME's.

Note that we're not realizing the full potential of this feature since
I've used SmapDisabler quite liberally in this initial bring-up patch.
2020-01-05 18:14:51 +01:00
Andreas Kling aba7829724 Kernel: InodeVMObject can't call Inode::size() with interrupts disabled
Inode::size() may try to take a lock, so we can't be calling it with
interrupts disabled.

This fixes a kernel hang when trying to execute a binary in a TmpFS.
2020-01-03 15:40:03 +01:00
Andreas Kling 0f9800ca57 Kernel: Make the loop that marks the bottom 1MB NX a little less busy 2020-01-02 22:02:29 +01:00
Andreas Kling 32ec1e5aed Kernel: Mask kernel addresses in backtraces and profiles
Addresses outside the userspace virtual range will now show up as
0xdeadc0de in backtraces and profiles generated by unprivileged users.
2020-01-02 20:51:31 +01:00
Andreas Kling 3dcec260ed Kernel: Validate the full range of user memory passed to syscalls
We now validate the full range of userspace memory passed into syscalls
instead of just checking that the first and last byte of the memory are
in process-owned regions.

This fixes an issue where it was possible to avoid rejection of invalid
addresses that sat between two valid ones, simply by passing a valid
address and a size large enough to put the end of the range at another
valid address.

I added a little test utility that tries to provoke EFAULT in various
ways to help verify this. I'm sure we can think of more ways to test
this but it's at least a start. :^)

Thanks to mozjag for pointing out that this code was still lacking!

Incidentally this also makes backtraces work again.

Fixes #989.
2020-01-02 02:17:12 +01:00
Andreas Kling ea1911b561 Kernel: Share code between Region::map() and Region::remap_page()
These were doing mostly the same things, so let's just share the code.
2020-01-01 19:32:55 +01:00
Andreas Kling 5aeaab601e Kernel: Move CPU feature detection to Arch/x86/CPU.{cpp.h}
We now refuse to boot on machines that don't support PAE since all
of our paging code depends on it.

Also let's only enable SSE and PGE support if the CPU advertises it.
2020-01-01 12:57:00 +01:00
Andreas Kling 8602fa5b49 Kernel: Enable x86 SMEP (Supervisor Mode Execution Protection)
This prevents the kernel from jumping to code in userspace memory.
2020-01-01 01:59:52 +01:00
Andreas Kling c9ec415e2f Kernel: Always reject never-userspace addresses before checking regions
At the moment, addresses below 8MB and above 3GB are never accessible
to userspace, so just reject them without even looking at the current
process's memory regions.
2019-12-31 03:45:54 +01:00
Andreas Kling 66d5ebafa6 Kernel: Let's also not consider kernel regions to be valid user stacks
This one is less obviously exploitable than the previous one, but still
a bug nonetheless.
2019-12-31 00:28:14 +01:00
Andreas Kling 0fc24fe256 Kernel: User pointer validation should reject kernel-only addresses
We were happily allowing syscalls with pointers into kernel-only
regions (virtual address >= 0xc0000000).

This patch fixes that by only considering user regions in the current
process, and also double-checking the Region::is_user_accessible() flag
before approving an access.

Thanks to Fire30 for finding the bug! :^)
2019-12-31 00:24:35 +01:00
Andreas Kling 1f31156173 Kernel: Add a mode flag to sys$purge and allow purging clean inodes 2019-12-29 13:16:53 +01:00
Andreas Kling c74cde918a Kernel+SystemMonitor: Expose amount of per-process clean inode memory
This is memory that's loaded from an inode (file) but not modified in
memory, so still identical to what's on disk. This kind of memory can
be freed and reloaded transparently from disk if needed.
2019-12-29 12:45:58 +01:00
Andreas Kling 0d5e0e4cad Kernel+SystemMonitor: Expose amount of per-process dirty private memory
Dirty private memory is all memory in non-inode-backed mappings that's
process-private, meaning it's not shared with any other process.

This patch exposes that number via SystemMonitor, giving us an idea of
how much memory each process is responsible for all on its own.
2019-12-29 12:28:32 +01:00
Andreas Kling c1f8291ce4 Kernel: When physical page allocation fails, try to purge something
Instead of panicking right away when we run out of physical pages,
we now try to find a PurgeableVMObject with some volatile pages in it.
If we find one, we purge that entire object and steal one of its pages.

This makes it possible for the kernel to keep going instead of dying.
Very cool. :^)
2019-12-26 11:45:36 +01:00
Conrad Pankoff 17aef7dc99 Kernel: Detect support for no-execute (NX) CPU features
Previously we assumed all hosts would have support for IA32_EFER.NXE.
This is mostly true for newer hardware, but older hardware will crash
and burn if you try to use this feature.

Now we check for support via CPUID.80000001[20].
2019-12-26 10:05:51 +01:00
Andreas Kling 9e55bcb7da Kernel: Make kernel memory regions be non-executable by default
From now on, you'll have to request executable memory specifically
if you want some.
2019-12-25 22:41:34 +01:00
Andreas Kling 0b7a2e0a5a Kernel: Set NX bit for virtual addresses 0-1MB and 2-8MB
This removes the ability to jump into kmalloc memory, etc.
Only the kernel image itself is allowed to exec, located between 1-2MB.
2019-12-25 22:24:28 +01:00
Andreas Kling ce5f7f6c07 Kernel: Use the CPU's NX bit to enforce PROT_EXEC on memory mappings
Now that we have PAE support, we can ask the CPU to crash processes for
trying to execute non-executable memory. This is pretty cool! :^)
2019-12-25 13:35:57 +01:00
Andreas Kling 52deb09382 Kernel: Enable PAE (Physical Address Extension)
Introduce one more (CPU) indirection layer in the paging code: the page
directory pointer table (PDPT). Each PageDirectory now has 4 separate
PageDirectoryEntry arrays, governing 1 GB of VM each.

A really neat side-effect of this is that we can now share the physical
page containing the >=3GB kernel-only address space metadata between
all processes, instead of lazily cloning it on page faults.

This will give us access to the NX (No eXecute) bit, allowing us to
prevent execution of memory that's not supposed to be executed.
2019-12-25 13:35:57 +01:00
Andreas Kling c087abc48d Kernel: Rename PageDirectory::find_by_pdb() => find_by_cr3()
I caught myself wondering what "pdb" stood for, so let's rename this
to something more obvious.
2019-12-25 02:58:03 +01:00
Andreas Kling 7a0088c4d2 Kernel: Clean up Region access bit setters a little 2019-12-25 02:58:03 +01:00
Andreas Kling c9a5253ac2 Kernel: Uh, actually *actually* turn on CR4.PGE
I'm not sure how I managed to misread the location of this bit twice.
But I did! Here is finally the correct value, according to Intel:

    "Page Global Enable (bit 7 of CR4)"

Jeez! :^)
2019-12-25 02:58:03 +01:00
Andreas Kling 3623e35978 Kernel: Oops, actually enable CR4.PGE (page table global bit)
Turns out we were setting the wrong bit here. Now we will actually keep
kernel memory mappings in the TLB across context switches.
2019-12-24 22:45:27 +01:00
Andreas Kling ae2d72377d Kernel: Enable the x86 WP bit to catch invalid memory writes in ring 0
Setting this bit will cause the CPU to generate a page fault when
writing to read-only memory, even if we're executing in the kernel.

Seemingly the only change needed to make this work was to have the
inode-backed page fault handler use a temporary mapping for writing
the read-from-disk data into the newly-allocated physical page.
2019-12-21 16:21:13 +01:00
Andreas Kling 62c2309336 Kernel: Fix some warnings about passing non-POD to kprintf 2019-12-20 20:19:46 +01:00
Andreas Kling b6ee8a2c8d Kernel: Rename vmo => vmobject everywhere 2019-12-19 19:15:27 +01:00
Andreas Kling 1d4d6f16b2 Kernel: Add a specific-page variant of Region::commit() 2019-12-18 22:43:32 +01:00
Andreas Kling 0a75a46501 Kernel: Make sure the kernel info page is read-only for userspace
To enforce this, we create two separate mappings of the same underlying
physical page. A writable mapping for the kernel, and a read-only one
for userspace (the one returned by sys$get_kernel_info_page.)
2019-12-15 22:21:28 +01:00
Andreas Kling 931e4b7f5e Kernel+SystemMonitor: Prevent userspace access to process ELF image
Every process keeps its own ELF executable mapped in memory in case we
need to do symbol lookup (for backtraces, etc.)

Until now, it was mapped in a way that made it accessible to the
program, despite the program not having mapped it itself.
I don't really see a need for userspace to have access to this right
now, so let's lock things down a little bit.

This patch makes it inaccessible to userspace and exposes that fact
through /proc/PID/vm (per-region "user_accessible" flag.)
2019-12-15 20:11:57 +01:00
Andreas Kling 05a441afb2 Kernel: Don't turn private read-only regions into shared ones on fork
Even if they are read-only now, they can be mprotect(PROT_WRITE)'d in
the future, so we have to make sure they are CoW mapped.
2019-12-15 16:53:46 +01:00
Andreas Kling 3fbc50a350 Kernel+SystemMonitor: Expose the number of set CoW bits in each Region
This number tells us how many more pages in a given region will trigger
a CoW fault if written to.
2019-12-15 16:53:00 +01:00
Andreas Kling 9ad151c665 Kernel: Improve comment about the system virtual memory map a bit 2019-12-15 16:13:08 +01:00
Andreas Kling 65229a4082 Kernel: Move VMObject::for_each_region() to MemoryManager.h
It can't be in VMObject.h since it depends on MemoryManager.h
2019-12-09 20:06:03 +01:00
Andreas Kling a22b7f96fc Kernel: Remap all regions referring to a PurgeableVMObject on purge
Otherwise we won't get page faults next time you try to access the
purged memory.
2019-12-09 20:05:04 +01:00
Andreas Kling dbb644f20c Kernel: Start implementing purgeable memory support
It's now possible to get purgeable memory by using mmap(MAP_PURGEABLE).
Purgeable memory has a "volatile" flag that can be set using madvise():

- madvise(..., MADV_SET_VOLATILE)
- madvise(..., MADV_SET_NONVOLATILE)

When in the "volatile" state, the kernel may take away the underlying
physical memory pages at any time, without notifying the owner.
This gives you a guilt discount when caching very large things. :^)

Setting a purgeable region to non-volatile will return whether or not
the memory has been taken away by the kernel while being volatile.
Basically, if madvise(..., MADV_SET_NONVOLATILE) returns 1, that means
the memory was purged while volatile, and whatever was in that piece
of memory needs to be reconstructed before use.
2019-12-09 19:12:38 +01:00
Andreas Kling 05c65fb4f1 Kernel: Don't CoW non-writable pages
A page fault in a page marked for CoW should not trigger a CoW if the
page is non-writable. I think this makes sense.
2019-12-02 19:20:09 +01:00
Andreas Kling f41ae755ec Kernel: Crash on memory access in non-readable regions
This patch makes it possible to make memory regions non-readable.
This is enforced using the "present" bit in the page tables.
A process that hits an not-present page fault in a non-readable
region will be crashed.
2019-12-02 19:18:52 +01:00
Andreas Kling 7dc9c90f83 Kernel: Fix bug where mprotect() would ignore setting PROT_WRITE
A typo in Region::set_writable() caused us to update the readable flag
rather than the writable flag.
2019-12-02 18:15:36 +01:00
Andreas Kling cde0a1eeb5 Kernel: Put some debug spam behind PAGE_FAULT_DEBUG 2019-12-01 16:03:24 +01:00
Andreas Kling e56daf547c Kernel: Disallow syscalls from writeable memory
Processes will now crash with SIGSEGV if they attempt making a syscall
from PROT_WRITE memory.

This neat idea comes from OpenBSD. :^)
2019-11-29 16:30:05 +01:00
Andreas Kling 2d1bcce34a Kernel: Fix triple-fault when clicking on SystemServer in SystemMonitor
The fault was happening when retrieving a current backtrace for the
SystemServer process.

To generate a backtrace, we go into the paging scope of the process,
meaning we temporarily switch to using its page directory as our own.

Because kernel VM is allocated on demand, it's possible for a process's
mappings above the 3GB mark to be out-of-date. Normally this just gets
fixed up transparently by the page fault handler (which simply copies
the PDE from the canonical MM.kernel_page_directory() into the current
process.)

However, if the current kernel *stack* is in a piece of memory that
the backtraced process lacks up-to-date PDE's for, we still get a page
fault, but are unable to handle it, since the CPU wants to push to the
stack as part of calling the page fault handler. So we're screwed and
it's a triple-fault.

Fix this by always updating the kernel VM mappings before switching
into a paging scope. In practical terms, this is a 1KB memcpy() that
happens when generating a backtrace, or doing exec().
2019-11-27 12:40:42 +01:00
Andreas Kling 5b8cf2ee23 Kernel: Make syscall counters and page fault counters per-thread
Now that we show individual threads in SystemMonitor and "top",
it's also very nice to have individual counters for the threads. :^)
2019-11-26 21:37:38 +01:00
Andreas Kling 3dc87be891 Kernel: Mark mmap()-created regions with a special bit
Then only allow regions with that bit to be manipulated via munmap()
and mprotect(). This prevents messing with non-mmap()ed regions in
a process's address space (stacks, shared buffers, ...)
2019-11-24 12:26:21 +01:00
Andreas Kling 9a157b5e81 Revert "Kernel: Move Kernel mapping to 0xc0000000"
This reverts commit bd33c66273.

This broke the network card drivers, since they depended on kmalloc
addresses being identity-mapped.
2019-11-23 17:27:09 +01:00
Jesse Buhagiar bd33c66273 Kernel: Move Kernel mapping to 0xc0000000
The kernel is now no longer identity mapped to the bottom 8MiB of
memory, and is now mapped at the higher address of `0xc0000000`.

The lower ~1MiB of memory (from GRUB's mmap), however is still
identity mapped to provide an easy way for the kernel to get
physical pages for things such as DMA etc. These could later be
mapped to the higher address too, as I'm not too sure how to
go about doing this elegantly without a lot of address subtractions.
2019-11-22 16:23:23 +01:00
Andreas Kling 794758df3a Kernel: Implement some basic stack pointer validation
VM regions can now be marked as stack regions, which is then validated
on syscall, and on page fault.

If a thread is caught with its stack pointer pointing into anything
that's *not* a Region with its stack bit set, we'll crash the whole
process with SIGSTKFLT.

Userspace must now allocate custom stacks by using mmap() with the new
MAP_STACK flag. This mechanism was first introduced in OpenBSD, and now
we have it too, yay! :^)
2019-11-17 12:15:43 +01:00
Liav A bce510bf6f Kernel: Fix the search method of free userspace physical pages (#742)
Now the userspace page allocator will search through physical regions,
and stop the search as it finds an available page.

Also remove an "address of" sign since we don't need that when
counting size of physical regions
2019-11-08 22:39:29 +01:00
supercomputer7 c3c905aa6c Kernel: Removing hardcoded offsets from Memory Manager
Now the kernel page directory and the page tables are located at a
safe address, to prevent from paging data colliding with garbage.
2019-11-08 17:38:23 +01:00
Andreas Kling 19398cd7d5 Kernel: Reorganize memory layout a bit
Move the kernel image to the 1 MB physical mark. This prevents it from
colliding with stuff like the VGA memory. This was causing us to end
up with the BIOS screen contents sneaking into kernel memory sometimes.

This patch also bumps the kmalloc heap size from 1 MB to 3 MB. It's not
the perfect permanent solution (obviously) but it should get the OOM
monkey off our backs for a while.
2019-11-04 12:04:35 +01:00
Andreas Kling a6e9119537 Kernel: Tweak some outdated kprintfs in Region 2019-11-04 00:48:45 +01:00
Andreas Kling d67c6a92db Kernel: Move page fault handling from MemoryManager to Region
After the page fault handler has found the region in which the fault
occurred, do the rest of the work in the region itself.

This patch also makes all fault types consistently crash the process
if a new page is needed but we're all out of pages.
2019-11-04 00:47:03 +01:00
Andreas Kling 0e8f1d7cb6 Kernel: Don't expose a region's page directory to the outside world
Now that region manages its own mapping/unmapping, there's no need for
the outside world to be able to grab at its page directory.
2019-11-04 00:26:00 +01:00
Andreas Kling 6ed9cc4717 Kernel: Remove Region API's for setting/unsetting the page directory
This is done implicitly by mapping or unmapping the region.
2019-11-04 00:24:20 +01:00
Andreas Kling e3dda4e87b Kernel: Fix weird Region constructor that took nullable RefPtr<Inode>
It's never valid to construct a Region with a null Inode pointer using
this constructor, so just take a NonnullRefPtr<Inode> instead.
2019-11-04 00:21:08 +01:00
Andreas Kling 9b2dc36229 Kernel: Merge MemoryManager::map_region_at_address() into Region::map() 2019-11-04 00:05:57 +01:00
Andreas Kling 98b328754e Kernel: Fix bad setup of CoW faults for offset regions
Regions with an offset into their VMObject were incorrectly adding the
page offset when indexing into the CoW bitmap.
2019-11-03 23:54:35 +01:00
Andreas Kling 5b7f8634e3 Kernel: Set the G (global) bit for kernel page tables
Since the kernel page tables are shared between all processes, there's
no need to (implicitly) flush the TLB for them on every context switch.

Setting the G bit on kernel page tables allows the CPU to keep the
translation caches around.
2019-11-03 23:51:55 +01:00
Andreas Kling 4bf1a72d21 Kernel: Teach Region how to remap itself
Now remapping (i.e flushing kernel metadata to the CPU page tables)
is done by simply calling Region::remap().
2019-11-03 21:11:08 +01:00
Andreas Kling 3dce0f23f4 Kernel: Regions should be mapped into a PageDirectory, not a Process
This patch changes the parameter to Region::map() to be a PageDirectory
since that matches how we think about the memory model:

Regions are views onto VMObjects, and are mapped into PageDirectories.
Each Process has a PageDirectory. The kernel also has a PageDirectory.
2019-11-03 21:11:08 +01:00
Andreas Kling 2cfc43c982 Kernel: Move region map/unmap operations into the Region class
The more Region can take care of itself, the better.
2019-11-03 21:11:08 +01:00
Andreas Kling a221cddeec Kernel: Clean up a bunch of wrong-looking Region/VMObject code
Since a Region is merely a "window" onto a VMObject, it can both begin
and end at a distance from the VMObject's boundaries.
Therefore, we should always be computing indices into a VMObject's
physical page array by adding the Region's "first_page_index()".

There was a whole bunch of code that forgot to do that. This fixes
many wrong behaviors for Regions that start part-way into a VMObject.
2019-11-03 15:44:13 +01:00
Andreas Kling fe455c5ac4 Kernel: Move page remapping into Region::remap_page(index)
Let Region deal with this, instead of everyone calling MemoryManager.
2019-11-03 15:32:11 +01:00
Andreas Kling b0321bf290 Kernel: Zero-fill faults should not temporarily enable interrupts
We were doing a temporary STI/CLI in MemoryManager::zero_page() to be
able to acquire the VMObject's lock before zeroing out a page.

This logic was inherited from the inode fault handler, where we need
to enable interrupts anyway, since we might need to interact with the
underlying storage device.

Zero-fill faults don't actually need to lock the VMObject, since they
are already guaranteed exclusivity by interrupts being disabled when
entering the fault handler.

This is different from inode faults, where a second thread can often
get an inode fault for the same exact page in the same VMObject before
the first fault handler has received a response from the disk.
This is why the lock exists in the first place, to prevent this race.

This fixes an intermittent crash in sys$execve() that was made much
more visible after I made userspace stacks lazily allocated.
2019-11-01 17:59:47 +01:00
Tom 00a7c48d6e APIC: Enable APIC and start APs 2019-10-16 19:14:02 +02:00
Andreas Kling 35138437ef Kernel+SystemMonitor: Add fault counters
This patch adds three separate per-process fault counters:

- Inode faults

    An inode fault happens when we've memory-mapped a file from disk
    and we end up having to load 1 page (4KB) of the file into memory.

- Zero faults

    Memory returned by mmap() is lazily zeroed out. Every time we have
    to zero out 1 page, we count a zero fault.

- CoW faults

    VM objects can be shared by multiple mappings that make their own
    unique copy iff they want to modify it. The typical reason here is
    memory shared between a parent and child process.
2019-10-02 14:13:49 +02:00
Andreas Kling d481ae95b5 Kernel: Defer creation of Region CoW bitmaps until they're needed
Instead of allocating and populating a Copy-on-Write bitmap for each
Region up front, wait until we actually clone the Region for sharing
with another process.

In most cases, we never need any CoW bits and we save ourselves a lot
of kmalloc() memory and time.
2019-10-01 19:58:41 +02:00
Andreas Kling c58d1868cb Kernel: Fix munmap() bad splitting of already-split Regions
When splitting an Region that's already the result of an earlier split,
we have to take the Region's offset-in-VMObject into account since it
may be non-zero.
2019-10-01 11:40:40 +02:00
Andreas Kling ac20919b13 Kernel: Make it possible to turn off VM guard pages at compile time
This might be useful for debugging since guard pages introduce a fair
amount of noise in the virtual address space.
2019-09-30 17:22:16 +02:00
Conrad Pankoff fa20a447a9 Kernel: Repair unaligned regions supplied by the boot loader
We were just blindly trusting that the bootloader would only give us
page-aligned memory regions. This is apparently not always the case,
so now we can try to repair those regions.

Fixes #601
2019-09-28 09:23:52 +02:00
Andreas Kling 2584636d19 Kernel: Fix partial munmap() deallocating still-in-use VM
We were always returning the full VM range of the partially-unmapped
Region to the range allocator. This caused us to re-use those addresses
for subsequent VM allocations.

This patch also skips creating a new VMObject in partial munmap().
Instead we just make split regions that point into the same VMObject.

This fixes the mysterious GCC ICE on large C++ programs.
2019-09-27 20:21:52 +02:00
Andreas Kling 7f9a33dba1 Kernel: Make Region single-owner instead of ref-counted
This simplifies the ownership model and makes Region easier to reason
about. Userspace Regions are now primarily kept by Process::m_regions.

Kernel Regions are kept in various OwnPtr<Regions>'s.

Regions now only ever get unmapped when they are destroyed.
2019-09-27 14:25:42 +02:00
Andreas Kling 9c549c178a Kernel: Pad virtual address space allocations with guard pages
Put one unused page on each side of VM allocations to make invalid
accesses more likely to generate crashes.

Note that we will not add this guard padding for mmap() at a specific
memory address, only to "mmap it anywhere" requests.
2019-09-22 15:12:29 +02:00
Conrad Pankoff 224fbb7910 Kernel: Fix returning pages to regions >= 2GB 2019-09-17 09:27:23 +02:00
Conrad Pankoff 9c5e3cd818 Kernel: Ignore memory the bootloader gives us above 2^32 2019-09-17 09:27:23 +02:00
Andreas Kling a34f3a3729 Kernel: Fix some bitrot in MemoryManager debug logging code 2019-09-16 14:45:44 +02:00
Andreas Kling 5d491fa1cd Kernel: Add a simple slab allocator for small allocations
This is a freelist allocator with static size classes that works as a
complement to the generic kmalloc(). It's a lot faster than kmalloc()
since allocation just means popping from the freelist.

It's also significantly more compact when there are a lot of objects
smaller than the minimum kmalloc chunk size (32 bytes.)

This patch enables it for the Region and PhysicalPage classes.
In the PhysicalPage (8 bytes) case, it's a huge improvement since we
no longer waste 75% of the storage allocated.

There are also a number of ways this can be improved, so let's keep
working on it going forward.
2019-09-16 10:33:27 +02:00
Andreas Kling 1c692e87a6 Kernel: Move kmalloc() into a Kernel/Heap/ directory 2019-09-16 09:01:44 +02:00
Andreas Kling e60bbadbbc Kernel: Add LogStream operator<< for PhysicalAddress 2019-09-15 20:47:49 +02:00
Andreas Kling a40afc4562 Kernel: Get rid of MemoryManager::allocate_page_table()
We can just use the physical page allocator directly, there's no need
for a dedicated function for page tables.
2019-09-15 20:34:03 +02:00
Andreas Kling 73fdbba59c AK: Rename <AK/AKString.h> to <AK/String.h>
This was a workaround to be able to build on case-insensitive file
systems where it might get confused about <string.h> vs <String.h>.

Let's just not support building that way, so String.h can have an
objectively nicer name. :^)
2019-09-06 15:36:54 +02:00
Andreas Kling bf43d94a2f Kernel: Disable interrupts throughout ~Region()
We don't want an interrupt handler to access the VM data structures
while their internal consistency is broken.
2019-09-05 11:15:05 +02:00
Andreas Kling e25ade7579 Kernel: Rename "vmo" to "vmobject" everywhere 2019-09-04 11:27:14 +02:00
Andreas Kling 0e53b1d1ad Kernel: Add some convenient getters to Region
Add getters for the underlying Range, the access bits, and also add
contains(Range) which just wraps m_range.contains().
2019-08-29 20:55:40 +02:00
Andreas Kling 10e0e13bf3 Kernel: Add LogStream operator<< for Range 2019-08-29 20:54:50 +02:00
Andreas Kling dde10f534f Revert "Kernel: Avoid a memcpy() of the whole block when paging in from inode"
This reverts commit 11896d0e26.

This caused a race where other processes using the same InodeVMObject
could end up accessing the newly-mapped physical page before we've
actually filled it with bytes from disk.

It would be nice to avoid these copies without breaking anything.
2019-08-26 13:50:55 +02:00
Andreas Kling f5d779f47e Kernel: Never forcibly page in entire executables
We were doing this for the initial kernel-spawned userspace process(es)
to work around instability in the page fault handler. Now that the page
fault handler is more robust, we can stop worrying about this.

Specifically, the page fault handler was previous not able to handle
getting a page fault in anything but the currently executing task's
page directory.
2019-08-26 13:20:01 +02:00
Andreas Kling e29fd3cd20 Kernel: Display virtual addresses as V%p instead of L%x
The L was a leftover from when these were called linear addresses.
2019-08-26 11:31:58 +02:00
Andreas Kling 11896d0e26 Kernel: Avoid a memcpy() of the whole block when paging in from inode 2019-08-25 14:34:53 +02:00
Andreas Kling b018cd653f Kernel: Fix oversized InodeVMObject after inode size changes 2019-08-24 19:35:47 +02:00
Andreas Kling 179158bc97 Kernel: Put debug spam about already-paged-in inode pages behind #ifdef 2019-08-19 17:30:36 +02:00
Andreas Kling 9104d32341 Kernel: Use range-for with InlineLinkedList 2019-08-08 13:40:58 +02:00
Andreas Kling 07425580a8 Kernel: Put all Regions on InlineLinkedLists (separated by user/kernel)
Remove the global hash tables and replace them with InlineLinkedLists.
This significantly reduces the kernel heap pressure from doing many
small mmap()'s.
2019-08-08 11:11:22 +02:00
Andreas Kling a96d76fd90 Kernel: Put all VMObjects in an InlineLinkedList instead of a HashTable
Using a HashTable to track "all instances of Foo" is only useful if we
actually need to look up entries by some kind of index. And since they
are HashTable (not HashMap), the pointer *is* the index.

Since we have the pointer, we can just use it directly. Duh.
This increase sizeof(VMObject) by two pointers, but removes a global
table that had an entry for every VMObject, where the cost was higher.
It also avoids all the general hash tabling business when creating or
destroying VMObjects. Generally we should do more of this. :^)
2019-08-08 11:11:22 +02:00
Andreas Kling 98ce498922 Kernel: Remove unused MemoryManager::remove_identity_mapping()
This was not actually used and just sitting there being confusing.
2019-08-07 22:13:10 +02:00
Andreas Kling f5ff796970 Kernel: Always give back VM to the RangeAllocator when unmapping Region
We were only doing this in Process::deallocate_region(), which meant
that kernel-only Regions never gave back their VM.

With this patch, we can start reusing freed-up address space! :^)
2019-08-07 21:57:39 +02:00
Andreas Kling b67200dfea Kernel: Use a FixedArray for VMObject::m_physical_pages
This makes VMObject 8 bytes smaller since we can use the array size as
the page count.

The size() is now also computed from the page count instead of being
a separate value. This makes sizes always be a multiple of PAGE_SIZE,
which is sane.
2019-08-07 20:12:50 +02:00
Andreas Kling 6bdb81ad87 Kernel: Split VMObject into two classes: Anonymous- and InodeVMObject
InodeVMObject is a VMObject with an underlying Inode in the filesystem.
AnonymousVMObject has no Inode.

I'm happy that InodeVMObject::inode() can now return Inode& instead of
VMObject::inode() return Inode*. :^)
2019-08-07 18:09:32 +02:00
Andreas Kling cb2d572a14 Kernel: Remove "allow CPU caching" flag on VMObject
This wasn't really thought-through, I was just trying anything to see
if it would make WindowServer faster. This doesn't seem to make much of
a difference either way, so let's just not do it for now.

It's easy to bring back if we think we need it in the future.
2019-08-07 16:34:00 +02:00
Andreas Kling 3364da388f Kernel: Remove VMObject names
The VMObject name was always either the owning region's name, or the
absolute path of the underlying inode.

We can reconstitute this information if wanted, no need to keep copies
of these strings around.
2019-08-07 16:14:08 +02:00
Andreas Kling c973a51a23 Kernel: Make KBuffer lazily populated
KBuffers are now zero-filled on demand instead of up front. This means
that you can create a huge KBuffer and it will only take up VM, not
physical pages (until you access them.)
2019-08-06 15:06:31 +02:00
Andreas Kling a0bb592b4f Kernel: Allow zero-fill page faults on kernel-only pages
We were short-circuiting the page fault handler a little too eagerly
for page-not-present faults in kernel memory.
If the current page directory already has up-to-date mapps for kernel
memory, allow it to progress to checking for zero-fill conditions.

This will enable us to have lazily populated kernel regions.
2019-08-06 15:06:31 +02:00
Andreas Kling 2ad963d261 Kernel: Add mapping from page directory base (PDB) to PageDirectory
This allows the page fault code to find the owning PageDirectory and
corresponding process for faulting regions.

The mapping is implemented as a global hash map right now, which is
definitely not optimal. We can come up with something better when it
becomes necessary.
2019-08-06 11:30:26 +02:00
Andreas Kling 8d07bce12a Kernel: Break region_from_vaddr() into {user,kernel}_region_from_vaddr
Sometimes you're only interested in either user OR kernel regions but
not both. Let's break this into two functions so the caller can choose
what he's interested in.
2019-08-06 10:33:22 +02:00
Andreas Kling e27e1b3fb2 Kernel: Add LogStream operator<< for VirtualAddress 2019-08-06 10:28:46 +02:00
Andreas Kling 945f8eb22a Kernel: Don't treat read faults like CoW exceptions
I'm not sure why we would have a non-readable CoW region, but I suppose
we could, so let's not Copy-on-Read in those cases.
2019-08-06 09:39:39 +02:00
Andreas Kling af4cf01560 Kernel: Clean up the page fault handling code a bit
Not using "else" after "return" unnests the code and makes it easier to
follow. Also use an enum for the two different page fault types.
2019-08-06 09:33:35 +02:00
Andreas Kling da6c8fe3f8 Kernel: On kernel NP fault, always copy into *active* page directory
If we were using a ProcessPagingScope to temporarily go into another
process's page tables, things would fall apart when hitting a kernel
NP fault, since we'd clone the kernel page directory entry into the
*currently active process's* page directory rather than cloning it
into the *currently active* page directory.
2019-08-06 07:28:35 +02:00
Andreas Kling 79e22acb22 Kernel: Use KBuffers for ProcFS and SynthFS
Instead of generating ByteBuffers and keeping those lying around, have
these filesystems generate KBuffers instead. These are way less spooky
to leave around for a while.

Since FileDescription will keep a generated file buffer around until
userspace has read the whole thing, this prevents trivially exhausting
the kmalloc heap by opening many files in /proc for example.

The code responsible for generating each /proc file is not perfectly
efficient and many of them still use ByteBuffers internally but they
at least go away when we return now. :^)
2019-08-05 11:37:48 +02:00
Andreas Kling b5f1a4ac07 Kernel: Flush the TLB (page only) when copying in a new kernel mapping
Not flushing the TLB here puts us in an infinite page fault loop.
2019-08-04 21:22:11 +02:00
Andreas Kling 1f8f739ea2 Kernel: Simplify PhysicalPage construction.
There was some leftover cruft from the times when PhysicalPage was allocated
using different allocators depending on lifetime.
2019-07-24 06:29:47 +02:00
Andreas Kling f8beb0f665 Kernel: Share the "return to ring 0/3 from signal" trampolines globally.
Generate a special page containing the "return from signal" trampoline code
on startup and then route signalled threads to it. This avoids a page
allocation in every process that ever receives a signal.
2019-07-19 17:01:16 +02:00
Andreas Kling fdf931cfce Kernel: Remove accidental use of removed Region::set_user_accessible(). 2019-07-19 16:22:09 +02:00
Andreas Kling 5b2447a27b Kernel: Track user accessibility per Region.
Region now has is_user_accessible(), which informs the memory manager how
to map these pages. Previously, we were just passing a "bool user_allowed"
to various functions and I'm not at all sure that any of that was correct.

All the Region constructors are now hidden, and you must go through one of
these helpers to construct a region:

- Region::create_user_accessible(...)
- Region::create_kernel_only(...)

That ensures that we don't accidentally create a Region without specifying
user accessibility. :^)
2019-07-19 16:11:52 +02:00
Andreas Kling 3dac1f8ac5 Kernel: Remove use of [[gnu::pure]].
I was messing around with this to tell the compiler that these functions
always return the same value no matter how many times you call them.

It doesn't really seem to improve code generation and it looks weird so
let's just get rid of it.
2019-07-16 13:44:41 +02:00
Andreas Kling 5254a320d8 Kernel: Remove use of copy_ref() in favor of regular RefPtr copies.
This is obviously more readable. If we ever run into a situation where
ref count churn is actually causing trouble in the future, we can deal with
it then. For now, let's keep it simple. :^)
2019-07-11 15:40:04 +02:00
Andreas Kling 149fd7e045 Kernel: Move PhysicalAddress.h into VM/ 2019-07-09 15:04:45 +02:00
Andreas Kling eca5c2bdf8 Kernel: Move VirtualAddress.h into VM/ 2019-07-09 15:04:45 +02:00
Andreas Kling 27f699ef0c AK: Rename the common integer typedefs to make it obvious what they are.
These types can be picked up by including <AK/Types.h>:

* u8, u16, u32, u64 (unsigned)
* i8, i16, i32, i64 (signed)
2019-07-03 21:20:13 +02:00
VAN BOSSUYT Nicolas 802d4dcb6b Meta: Removed all gitignore in the source tree only keeping the root one 2019-06-30 10:41:26 +02:00
Andreas Kling 601b0a8c68 Kernel: Use NonnullRefPtrVector in parts of the kernel. 2019-06-27 13:35:02 +02:00
Andreas Kling 8f3f5ac8ce Kernel: Automatically populate page tables with lazy kernel regions.
If we get an NP page fault in a process, and the fault address is in the
kernel address range (anywhere above 0xc0000000), we probably just need
to copy the page table info over from the kernel page directory.

The kernel doesn't allocate address space until it's needed, and when it
does allocate some, it only puts the info in the kernel page directory,
and any *new* page directories created from that point on. Existing page
directories need to be updated, and that's what this patch fixes.
2019-06-26 22:27:41 +02:00
Andreas Kling 183205d51c Kernel: Make the x86 paging code slightly less insane.
Instead of PDE's and PTE's being weird wrappers around dword*, just have
MemoryManager::ensure_pte() return a PageDirectoryEntry&, which in turn has
a PageTableEntry* entries().

I've been trying to understand how things ended up this way, and I suspect
it was because I inadvertently invoked the PageDirectoryEntry copy ctor in
the original work on this, which must have made me very confused..

Anyways, now things are a bit saner and we can move forward towards a better
future, etc. :^)
2019-06-26 21:45:56 +02:00
Andreas Kling 46a06c23e3 Kernel: Fix all compiler warnings. 2019-06-22 16:22:34 +02:00
Andreas Kling d343fb2429 AK: Rename Retainable.h => RefCounted.h. 2019-06-21 18:58:45 +02:00
Andreas Kling 550b0b062b AK: Rename RetainPtr.h => RefPtr.h, Retained.h => NonnullRefPtr.h. 2019-06-21 18:45:59 +02:00
Andreas Kling 90b1354688 AK: Rename RetainPtr => RefPtr and Retained => NonnullRefPtr. 2019-06-21 18:37:47 +02:00
Andreas Kling 77b9fa89dd AK: Rename Retainable => RefCounted.
(And various related renames that go along with it.)
2019-06-21 15:30:03 +02:00
Sergey Bugaev d900fe98e2 VM: Remove PhysicalPage::create_eternal().
Now that it is possible to create non-eternal non-freeable
pages, PageDirectory can do just that.
2019-06-14 16:14:49 +02:00
Sergey Bugaev 010314ee66 VM: Make VMObject::create_for_physical_range() create non-freeable pages.
This method is used in BXVGADevice to create pages for the framebuffer;
we should neither make the PhysicalPage instances eternal, nor hand over
actual physical pages to the memory allocator.
2019-06-14 16:14:49 +02:00
Sergey Bugaev a8e86841ce VM: Support non-freeable, non-eternal PhysicalPages. 2019-06-14 16:14:49 +02:00
Sergey Bugaev 6bb7c80365 VM: Fix leaking PhysicalPage instances.
After PhysicalPage::return_to_freelist(), an actual physical page
is returned back to the memory manager; which will create a new
PhysicalPage instance if it decides to reuse the physical page. This
means this PhysicalPage instance should be freed; otherwise it would
get leaked.
2019-06-14 16:14:49 +02:00
Sergey Bugaev 118cb391dd VM: Pass a PhysicalPage by rvalue reference when returning it to the freelist.
This makes no functional difference, but it makes it clear that
MemoryManager and PhysicalRegion take over the actual physical
page represented by this PhysicalPage instance.
2019-06-14 16:14:49 +02:00
Sergey Bugaev 7710e48d83 VM: Fix freeing physical pages.
Pages created with PhysicalPage::create_eternal() should *not* be
returnable to the freelist; and pages created with the regular
PhysicalPage::create() should be; not the other way around.
2019-06-14 16:14:49 +02:00
Andreas Kling 1c5677032a Kernel: Replace the last "linear" with "virtual". 2019-06-13 21:42:12 +02:00
Conrad Pankoff b29a83d554 Kernel: Wrap around to region start if necessary in take_free_page 2019-06-12 15:38:17 +02:00
Conrad Pankoff aee9317d86 Kernel: Refactor MemoryManager to use a Bitmap rather than a Vector
This significantly reduces the pressure on the kernel heap when
allocating a lot of pages.

Previously at about 250MB allocated, the free page list would outgrow
the kernel's heap. Given that there is no longer a page list, this does
not happen.

The next barrier will be the kernel memory used by the page records for
in-use memory. This kicks in at about 1GB.
2019-06-12 15:38:17 +02:00
Andreas Kling 9da62f52a1 Kernel: Use the Multiboot memory map info to inform our paging setup.
This makes it possible to run Serenity with more than 64 MB of RAM.
Because each physical page is represented by a PhysicalPage object, and such
objects are allocated using kmalloc_eternal(), more RAM means more pressure
on kmalloc_eternal(), so we're gonna need a better strategy for this.

But for now, let's just celebrate that we can use the 128 MB of RAM we've
been telling QEMU to run with. :^)
2019-06-09 11:48:58 +02:00
Andreas Kling de65c960e9 Kernel: Tweak some String&& => const String&.
String&& is just not very practical. Also return const String& when the
returned string is a member variable. The call site is free to make a copy
if he wants, but otherwise we can avoid the retain count churn.
2019-06-07 20:58:12 +02:00
Andreas Kling 736092a087 Kernel: Move i386.{cpp,h} => Arch/i386/CPU.{cpp,h}
There's a ton of work that would need to be done before we could spin up on
another architecture, but let's at least try to separate things out a bit.
2019-06-07 20:02:01 +02:00
Andreas Kling 39d1a9ae66 Meta: Tweak .clang-format to not wrap braces after enums. 2019-06-07 17:13:23 +02:00
Andreas Kling e42c3b4fd7 Kernel: Rename LinearAddress => VirtualAddress. 2019-06-07 12:56:50 +02:00
Andreas Kling bc951ca565 Kernel: Run clang-format on everything. 2019-06-07 11:43:58 +02:00
Andreas Kling 49768524d4 VM: Get rid of KernelPagingScope.
Every page directory inherits the kernel page directory, so there's no need
to explicitly enter the kernel's paging scope anymore.
2019-06-01 17:51:48 +02:00
Andreas Kling 02e21de20a VM: Always flush TLB for kernel page directory changes.
Since the kernel page directory is inherited by all other page directories,
we should always flush the TLB when it's updated.
2019-06-01 17:25:36 +02:00
Andreas Kling ba58b4617d VM: Don't remap each Region page twice in page_in().
page_in_from_inode() will map the page after reading it from disk, so we
don't need to remap it once again.
2019-06-01 15:45:50 +02:00
Andreas Kling baaede1bf9 Kernel: Make the Process allocate_region* API's understand "int prot".
Instead of having to inspect 'prot' at every call site, make the Process
API's take care of that so we can just pass it through.
2019-05-30 16:14:37 +02:00
Robin Burchell 0dc9af5f7e Add clang-format file
Also run it across the whole tree to get everything using the One True Style.
We don't yet run this in an automated fashion as it's a little slow, but
there is a snippet to do so in makeall.sh.
2019-05-28 17:31:20 +02:00
Andreas Kling 7afc0fb9c8 Kernel: Forked children should inherit their RangeAllocator by copy.
Otherwise we'll start handing out addresses that are very likely already in
use by existing ranges.
2019-05-22 13:24:28 +02:00
Andreas Kling bcc6ddfb6b Kernel: Let PageDirectory own the associated RangeAllocator.
Since we transition to a new PageDirectory on exec(), we need a matching
RangeAllocator to go with the new directory. Instead of juggling this in
Process and MemoryManager, simply attach the RangeAllocator to the
PageDirectory instead.

Fixes #61.
2019-05-20 04:46:29 +02:00
Andreas Kling b33cc7f772 Kernel: Remove some RangeAllocator debug spam. 2019-05-18 03:59:16 +02:00
Andreas Kling 87b54a82c7 Kernel: Let Region keep a Range internally. 2019-05-17 04:32:08 +02:00
Andreas Kling 4a6fcfbacf Kernel: Use a RangeAllocator for kernel-only virtual space allocation too. 2019-05-17 04:02:29 +02:00
Andreas Kling c414e65498 Kernel: Implement a simple virtual address range allocator.
This replaces the previous virtual address allocator which was basically
just "m_next_address += size;"

With this in place, virtual addresses can get reused, which cuts down on
the number of page tables created. When we implement ASLR some day, we'll
probably have to do page table deallocation, but for now page tables are
only deallocated once the process dies.
2019-05-17 03:40:15 +02:00
Andreas Kling 176f683f66 Kernel: Move Inode to its own files. 2019-05-16 03:02:37 +02:00
Andreas Kling 01ffcdfa31 Kernel: Encapsulate the Region's COW map a bit better. 2019-05-14 17:31:57 +02:00
Andreas Kling 7c10a93d48 Kernel: Make allocate_kernel_region() commit the region automatically.
This means that kernel regions will eagerly get physical pages allocated.
It would be nice to zero-fill these on demand instead, but that would
require a bunch of MemoryManager changes.
2019-05-14 15:38:00 +02:00
Andreas Kling c8a216b107 Kernel: Allocate kernel stacks for threads using the region allocator.
This patch moves away from using kmalloc memory for thread kernel stacks.
This reduces pressure on kmalloc (16 KB per thread adds up fast) and
prevents kernel stack overflow from scribbling all over random unrelated
kernel memory.
2019-05-14 11:51:00 +02:00
Andreas Kling 6228503c16 Kernel: Add a bit of logging in VMObject::inode_size_changed(). 2019-05-04 21:15:59 +02:00
Andreas Kling 34c5db61aa Kernel: Simplify VMObject::is_anonymous().
This doesn't need a separate flag. A VMObject is always anonymous if it has
no backing inode.
2019-05-02 23:34:28 +02:00
Andreas Kling b8e60b6652 Kernel: Remove unused Region::is_bitmap(). 2019-05-02 23:31:11 +02:00
Andreas Kling c3b7ace3e0 Kernel: Assign Lock names in class member initializers. 2019-05-02 03:28:20 +02:00
Andreas Kling 3f6408919f AK: Improve smart pointer ergonomics a bit. 2019-04-14 02:36:06 +02:00
Andreas Kling a58d7fd8bb Kernel: Get rid of Kernel/types.h, separate LinearAddress/PhysicalAddress. 2019-04-06 14:29:29 +02:00
Andreas Kling b9738fa8ac Kernel: Move VM-related files into Kernel/VM/.
Also break MemoryManager.{cpp,h} into one file per class.
2019-04-03 15:13:07 +02:00