This brings mmap more in line with other operating systems. Prior to
this, it was impossible to request memory that was definitely committed,
instead MAP_PURGEABLE would provide a region that was not actually
purgeable, but also not fully committed, which meant that using such memory
still could cause crashes when the underlying pages could no longer be
allocated.
This fixes some random crashes in low-memory situations where non-volatile
memory is mapped (e.g. malloc, tls, Gfx::Bitmap, etc) but when a page in
these regions is first accessed, there is insufficient physical memory
available to commit a new page.
Rather than lazily committing regions by default, we now commit
the entire region unless MAP_NORESERVE is specified.
This solves random crashes in low-memory situations where e.g. the
malloc heap allocated memory, but using pages that haven't been
used before triggers a crash when no more physical memory is available.
Use this flag to create large regions without actually committing
the backing memory. madvise() can be used to commit arbitrary areas
of such regions after creating them.
This adds the ability for a Region to define volatile/nonvolatile
areas within mapped memory using madvise(). This also means that
memory purging takes into account all views of the PurgeableVMObject
and only purges memory that is not needed by all of them. When calling
madvise() to change an area to nonvolatile memory, return whether
memory from that area was purged. At that time also try to remap
all memory that is requested to be nonvolatile, and if insufficient
pages are available notify the caller of that fact.
Instead of specifying the boot argument to be root=/dev/hdXY, now
one can write root=PARTUUID= with the right UUID, and if the partition
is found, the kernel will boot from it.
This feature is mainly used with GUID partitions, and is considered to
be the most reliable way for the kernel to identify partitions.
RTTI is still disabled for the Kernel, and for the Dynamic Loader. This
allows for much less awkward navigation of class heirarchies in LibCore,
LibGUI, LibWeb, and LibJS (eventually). Measured RootFS size increase
was < 1%, and libgui.so binary size was ~3.3%. The small binary size
increase here seems worth it :^)
Use the GNU LD option --no-dynamic-linker. This allows uncommenting some
code in the Kernel that gets upset if your ELF interpreter has its own
interpreter.
Compared to version 10 this fixes a bunch of formatting issues, mostly
around structs/classes with attributes like [[gnu::packed]], and
incorrect insertion of spaces in parameter types ("T &"/"T &&").
I also removed a bunch of // clang-format off/on and FIXME comments that
are no longer relevant - on the other hand it tried to destroy a couple of
neatly formatted comments, so I had to add some as well.
BlockCondition::unblock should return true if it unblocked at
least one thread, not if iterating the blockers had been stopped.
This is a regression introduced by 49a76164c.
Fixes#4670
This assertion cannot be safely/reliably made in the
~SharedInodeVMObject destructor. The problem is that
Inode::is_shared_vmobject holds a weak reference to the instance
that is being destroyed (ref count 0). Checking the pointer using
WeakPtr::unsafe_ptr will produce nullptr depending on timing in
this case, and WeakPtr::safe_ref will reliably produce a nullptr
as soon as the reference count drops to 0. The only case where
this assertion could succeed is when WeakPtr::unsafe_ptr returned
the pointer because it won the race against revoking it. And
because WeakPtr::safe_ref will always return a nullptr, we cannot
reliably assert this from the ~SharedInodeVMObject destructor.
Fixes#4621
If a heap expansion is triggered by allocating from e.g. the
RangeAllocator, which may be holding a spin lock, we cannot
immediately allocate another block of backup memory, which could
require the same locks to be acquired. So, defer allocating the
backup memory
Fixes#4675
When doing the cast to u64 on the page directory physical address,
the sign bit was being extended. This only beomes an issue when
crossing the 2 GiB boundary. At >= 2 GiB, the physical address
has the sign bit set. For example, 0x80000000.
This set all the reserved bits in the PDPTE, causing a GPF
when loading the PDPT pointer into CR3. The reserved bits are
presumably there to stop you writing out a physical address that
the CPU physically cannot handle, as the size of the reserved bits
is determined by the physical address width of the CPU.
This fixes this by casting to FlatPtr instead. I believe the sign
extension only happens when casting to a bigger type. I'm also using
FlatPtr because it's a pointer we're writing into the PDPTE.
sizeof(FlatPtr) will always be the same size as sizeof(void*).
This also now asserts that the physical address in the PDPTE is
within the max physical address the CPU supports. This is better
than getting a GPF, because CPU::handle_crash tries to do the same
operation that caused the GPF in the first place. That would cause
an infinite loop of GPFs until the stack was exhausted, causing a
triple fault.
As far as I know and tested, I believe we can now use the full 32-bit
physical range without crashing.
Fixes#4584. See that issue for the full debugging story.
The unblock_all variant used to ASSERT if a blocker didn't unblock,
but it wasn't clear from the name that it would do that. Because
the BlockCondition already asserts that no blockers are left at
destruction time, it would still catch blockers that haven't been
unblocked for whatever reason.
Fixes#4496
* Add SERENITY_ARCH option to CMake for selecting the target toolchain
* Port all build scripts but continue to use i686
* Update GitHub Actions cache to include BuildIt.sh
Anything above or equal to the 2 GB mark has the left most bit set
(0x8000...), which was falsely interpreted as negative due to
local_offset being signed.
This makes it unsigned by using FlatPtr. To check for underflow as
was intended, lets use Checked instead.
Fixes#4585
The partitioning code was very outdated, and required a full refactor.
The new subsystem removes duplicated code and uses more AK containers.
The most important change is that all implementations of the
PartitionTable class conform to one interface, which made it possible
to remove unnecessary code in the EBRPartitionTable class.
Finding partitions is now done in the StorageManagement singleton,
instead of doing so in init.cpp.
Also, now we don't try to find partitions on demand - the kernel will
try to detect if a StorageDevice is partitioned, and if so, will check
what is the partition table, which could be MBR, GUID or EBR.
Then, it will create DiskPartitionMetadata object for each partition
that is available in the partition table. This object will be used
by the partition enumeration code to create a DiskPartition with the
correct minor number.
The DevFS along with DevPtsFS give a complete solution for populating
device nodes in /dev. The main purpose of DevFS is to eliminate the
need of device nodes generation when building the system.
Later on, DevFS will assist with exposing disk partition nodes.
BlockBasedFileSystem::read_block method should get a reference of
a UserOrKernelBuffer.
If we need to force caching a block, we will call other method to do so.
clang trunk with -std=c++20 doesn't seem to properly look for an
aggregate initializer here when the type being constructed is a simple
aggregate (e.g. `struct Thing { int a; int b; };`). This template fails
to compile in a usage added 12/16/2020 in `AK/Trie.h`.
Both forms of initialization are supposed to call the
aggregate-initializers but direct-list-initialization delegating to
aggregate initializers is a new addition in c++20 that might not be
implemented yet.
The PIT is now also running at a rate of ~250 ticks/second, so rather
than assuming there are 1000 ticks/second we need to query the timer
being used for the actual frequency.
Fixes#4508
This was a goofy kernel API where you could assign an icon_id (int) to
a process which referred to a global shbuf with a 16x16 icon bitmap
inside it.
Instead of this, programs that want to display a process icon now
retrieve it from the process executable instead.
Dumping core can happen at the end of a profiling run, and in that case
we have to protect the target process and take the lock while iterating
over its region map.
Fixes#4509.
When the ExpandableHeap calls the remove_memory function, the
subheap is assumed to be removed and freed entirely. remove_memory
may drop the underlying memory at any time, but it also may cause
further allocation requests. Not removing it from the list before
calling remove_memory could cause a memory allocation in that
subheap while remove_memory is executing. which then causes issues
once the underlying memory is actually freed.
This can happen when an unveil follows another with a path that is a
sub-path of the other one:
```c++
unveil("/home/anon/.config/whoa.ini", "rw");
unveil("/home/anon", "r"); // this would fail, as "/home/anon" inherits
// the permissions of "/", which is None.
```
Problem:
- C functions with no arguments require a single `void` in the argument list.
Solution:
- Put the `void` in the argument list of functions in C header files.
This new flag controls two things:
- Whether the kernel will generate core dumps for the process
- Whether the EUID:EGID should own the process's files in /proc
Processes are automatically made non-dumpable when their EUID or EGID is
changed, either via syscalls that specifically modify those ID's, or via
sys$execve(), when a set-uid or set-gid program is executed.
A process can change its own dumpable flag at any time by calling the
new sys$prctl(PR_SET_DUMPABLE) syscall.
Fixes#4504.
This is instead of the UID:GID, since that was allowing some very bad
information leaks like spawning "su" as an unprivileged user and having
full /proc access to it.
Work towards #4504.
If the allocation fails (e.g ENOMEM) we want to simply return an error
from sys$execve() and continue executing the current executable.
This patch also moves make_userspace_stack_for_main_thread() out of the
Thread class since it had nothing in particular to do with Thread.
Process had a couple of members whose only purpose was holding on to
some temporary data while building the auxiliary vector. Remove those
members and move the vector building to a free function in execve.cpp
Make it possible to bail out of ELF::Image::for_each_program_header()
and then do exactly that if something goes wrong during executable
loading in the kernel.
Also make the errors we return slightly more nuanced than just ENOEXEC.
Get rid of the lambda functions and put the logic inline in the program
header traversal loop instead. This makes the code quite a bit shorter
and hopefully makes it easier to see what's going on.
This commit gets rid of ELF::Loader entirely since its very ambiguous
purpose was actually to load executables for the kernel, and that is
now handled by the kernel itself.
This patch includes some drive-by cleanup in LibDebug and CrashDaemon
enabled by the fact that we no longer need to keep the ref-counted
ELF::Loader around.
It was really weird that ELF loading was performed by the ELF::Loader
class instead of just being done by the kernel itself. This patch moves
all the layout logic from ELF::Loader over to sys$execve().
The kernel no longer cares about ELF::Loader and instead only uses an
ELF::Image as an interpreting wrapper around executables.
Now that the CrashDaemon symbolicates crashes in userspace, let's take
this one step further and stop trying to symbolicate userspace programs
in the kernel at all.
ProcFS /proc/<pid>/vm map info no longer contains two `purgeable` keys.
The second `purgeable` key has been removed and replaced with keys for
`kernel` and `cacheable`.
We were casting the address to Userspace<T> without validating it first
which is no good and will trap an assertion soon after.
Let's catch this sooner with an ASSERT in the Userspace<T> constructor
and update the PT_PEEK and PT_POKE handlers to avoid it.
Fixes#4505.
This is a crude protection against IOPL elevation attacks. If for
any reason we find ourselves about to switch to a user mode thread
with IOPL != 0, we'll now simply panic the kernel.
If this happens, it basically means that something tricked the kernel
into incorrectly modifying the IOPL of a thread, so it's no longer
safe to trust the kernel anyway.
It was possible to overwrite the entire EFLAGS register since we didn't
do any masking in the ptrace and sigreturn syscalls.
This made it trivial to gain IO privileges by raising IOPL to 3 and
then you could talk to hardware to do all kinds of nasty things.
Thanks to @allesctf for finding these issues! :^)
Their exploit/write-up: https://github.com/allesctf/writeups/blob/master/2020/hxpctf/wisdom2/writeup.md
Since they're all covered by the same spec sheet, we can expect
the same code to cover most of the devices.
It can't currently differentiate between them, which would be nice to
add for determining what registers we can access.
And make an effort to propagate errors out from the inner parts.
This fixes an issue where the kernel would infinitely loop in coredump
generation if the TmpFS filled up.
When enumerating the hardware using MMIO mode, it would attempt to
create a physical ID first. To create a physical ID, it needs to
retrieve the capabilities of the device.
When enumerating the first device, there would be no device
configuration space mappings. Access::get_capabilities_pointer
calls PCI::read16, which in turn goes to MMIOAccess::read16_field.
MMIOAccess::read16_field attempts to get a device configuration space
and fully expects to get one. However, since this is the first device,
there are none and it crashes with an m_has_value assertion failure.
This fixes this by creating the device configuration space mapping
before creating the physical ID.
Testing with VMware Player 16.1.0.
This implements a number of changes related to time:
* If a HPET is present, it is now used only as a system timer, unless
the Local APIC timer is used (in which case the HPET timer will not
trigger any interrupts at all).
* If a HPET is present, the current time can now be as accurate as the
chip can be, independently from the system timer. We now query the
HPET main counter for the current time in CPU #0's system timer
interrupt, and use that as a base line. If a high precision time is
queried, that base line is used in combination with quering the HPET
timer directly, which should give a much more accurate time stamp at
the expense of more overhead. For faster time stamps, the more coarse
value based on the last interrupt will be returned. This also means
that any missed interrupts should not cause the time to drift.
* The default system interrupt rate is reduced to about 250 per second.
* Fix calculation of Thread CPU usage by using the amount of ticks they
used rather than the number of times a context switch happened.
* Implement CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE and use it
for most cases where precise timestamps are not needed.
The StorageManagement class has 2 roles:
1. During boot, it should find all storage controllers in the machine,
and then determine what is the boot device.
2. Later on boot, it is a registrar of all storage controllers and
storage devices. Thus, it could be used to show information about these
devices when implemented.
This change allows the user to specify a boot driver other than /dev/hda
and if it's connected in the machine - it will boot.
Previously, the indexing scheme was that 0 is Primary-Master, 1 is
Primary-Slave, 2 is Secondary-Master, 3 is Secondary-Slave.
Instead of merely matching between numbers to the channel & position,
the IDEController code will try to find all available drives connected to
the two channels, then it will create a Vector with nonnull RefPtr to
them. Then we take use the given index with this Vector.
This new subsystem is somewhat replacing the IDE disk code we had with a
new flexible design.
StorageDevice is a generic class that represent a generic storage
device. It is meant that specific storage hardware will override the
interface. StorageController is a generic class that represent
a storage controller that can be found in a machine.
The IDEController class governs two IDEChannels. An IDEChannel is
responsible to manage the master & slave devices of the channel,
therefore an IDEChannel is an IRQHandler.
IRQ 7 and 15 on the PIC architecture are used for spurious interrupts.
IRQ 7 could also be used for LPT connection, and IRQ 15 can be used for
the secondary IDE channel. Therefore, we need to allow to install a
real IRQ handler and check if a real IRQ was asserted. If so, we handle
them in the usual way.
A note on this fix - unregistering or registering a new IRQ handler
after we already registered one in the spurious interrupt handler is
not supported yet.
Such device is not an IRQHandler by itself, but actually a controller of
many IRQ or MSI devices. The purpose of this class is to manage multiple
sources of interrupts.
For example, a generic ISA IDE controller controls 2 IRQ sources - 14
and 15. So, when we initialize the IDE controller, it will initialize
two IDE channels (also known as PATAChannels) to utilize IRQ 14 and 15,
respectively. NVMe with MSI-X support can theoretically handle up to
2048 interrupts.
Problem:
- `(void)` simply casts the expression to void. This is understood to
indicate that it is ignored, but this is really a compiler trick to
get the compiler to not generate a warning.
Solution:
- Use the `[[maybe_unused]]` attribute to indicate the value is unused.
Note:
- Functions taking a `(void)` argument list have also been changed to
`()` because this is not needed and shows up in the same grep
command.
Switch on the new credentials before loading the new executable into
memory. This ensures that attempts to ptrace() the program from an
unprivileged process will fail.
This covers one bug that was exploited in the 2020 HXP CTF:
https://hxp.io/blog/79/hxp-CTF-2020-wisdom2/
Thanks to yyyyyyy for finding the bug! :^)
The overrides of this function don't need to know how the original
packet was stored, so let's just give them a ReadonlyBytes view of
the raw packet data.
We need to stop assuming that KBuffer allocation always succeeds.
This patch adds the following API:
- static OwnPtr<KBuffer> KBuffer::create_with_size(size_t);
All KBuffer clients should move towards using this (and handling any
failures with grace.)
ACPI 2 declared the third revision of FADT, that should have
IAPC_BOOT_ARCH flags in it, also to indicate if i8042 is present.
Q35 machine reports that it has FADT with revision 3, but the code
in QEMU simply ignores these flags and put zero on them no matter
the revision of FADT.
We need to account for how many shared lock instances the current
thread owns, so that we can properly release such references when
yielding execution.
We also need to release the process lock when donating.
LexicalPath is a big and heavy class that's really meant as a helper
for extracting parts of a path, not for storage or passing around.
Instead, pass paths around as strings and use LexicalPath locally
as needed.
When a process crashes, we generate a coredump file and write it in
/tmp/coredumps/.
The coredump file is an ELF file of type ET_CORE.
It contains a segment for every userspace memory region of the process,
and an additional PT_NOTE segment that contains the registers state for
each thread, and a additional data about memory regions
(e.g their name).
This adds an allocate_tls syscall through which a userspace process
can request the allocation of a TLS region with a given size.
This will be used by the dynamic loader to allocate TLS for the main
executable & its libraries.
When the main executable needs an interpreter, we load the requested
interpreter program, and pass to it an open file decsriptor to the main
executable via the auxiliary vector.
Note that we do not allocate a TLS region for the interpreter.
This fixes an issue where TCP sockets could get into the Established
state too quickly and fail to unblock a subsequent sys$select() call.
This makes websites load *significantly* faster. :^)
Since the process lock is using the Lock class, re-locking the process
lock may cause another call to Thread::block. This caused some problems
with multiple blockers attempting to be used at the same time. To solve
this problem, remember if the process lock was held, and if it was then
relock after we're done with the blockers, just before returning.
This prevents zombies created by multi-threaded applications and brings
our model back to closer to what other OSs do.
This also means that SIGSTOP needs to halt all threads, and SIGCONT needs
to resume those threads.
This is necessary because if a process changes the state to Stopped
or resumes from that state, a wait entry is created in the parent
process. So, if a child process does this before disown is called,
we need to clear those entries to avoid leaking references/zombies
that won't be cleaned up until the former parent exits.
This also should solve an even more unlikely corner case where another
thread is waiting on a pid that is being disowned by another thread.
Fix some problems with join blocks where the joining thread block
condition was added twice, which lead to a crash when trying to
unblock that condition a second time.
Deferred block condition evaluation by File objects were also not
properly keeping the File object alive, which lead to some random
crashes and corruption problems.
Other problems were caused by the fact that the Queued state didn't
handle signals/interruptions consistently. To solve these issues we
remove this state entirely, along with Thread::wait_on and change
the WaitQueue into a BlockCondition instead.
Also, deliver signals even if there isn't going to be a context switch
to another thread.
Fixes#4336 and #4330
Instead of flushing the TLB on the current processor first and then
notifying the other processors to do the same, notify the others
first, and while waiting on the others flush our own.
Move counting interrupts out of the handle_interrupt method so that
it is done in all cases without the interrupt handler having to
implement it explicitly.
Also make the counter an atomic value as e.g. the LocalAPIC interrupts
may be triggered on multiple processors simultaneously.
Fixes#4297
This allows us to use blocking timeouts with either monotonic or
real time for all blockers. Which means that clock_nanosleep()
now also supports CLOCK_REALTIME.
Also, switch alarm() to use CLOCK_REALTIME as per specification.
We need to be able to guarantee that a timer won't be executing after
TimerQueue::cancel_timer returns. In the case of multiple processors
this means that we may need to wait while the timer handler finishes
execution on another core.
This also fixes a problem in Thread::block and Thread::wait_on where
theoretically the timer could execute after the function returned
and the Thread disappeared.
This should catch more malformed ELF files earlier than simply
checking the ELF header alone. Also change the API of
validate_program_headers to take the interpreter_path by pointer. This
makes it less awkward to call when we don't care about the interpreter,
and just want the validation.
This changes the Thread::wait_on function to not enable interrupts
upon leaving, which caused some problems with page fault handlers
and in other situations. It may now be called from critical
sections, with interrupts enabled or disabled, and returns to the
same state.
This also requires some fixes to Lock. To aid debugging, a new
define LOCK_DEBUG is added that enables checking for Lock leaks
upon finalization of a Thread.
New Thread objects should be adopted into a RefPtr upon creation.
If creating a thread failed (e.g. out of memory), releasing the RefPtr
will destruct the partially created object, but in the successful case
the thread will add an additional reference that it keeps until it
finishes execution. Adopting will drop it to 1 when returning from
create_thread, or 0 if the thread could not be fully constructed.
This makes the Scheduler a lot leaner by not having to evaluate
block conditions every time it is invoked. Instead evaluate them as
the states change, and unblock threads at that point.
This also implements some more waitid/waitpid/wait features and
behavior. For example, WUNTRACED and WNOWAIT are now supported. And
wait will now not return EINTR when SIGCHLD is delivered at the
same time.
This adds the ability to pass a pointer to kernel thread/process.
Also add the ability to use a closure as thread function, which
allows passing information to a kernel thread more easily.
Use the TimerQueue to expire blocking operations, which is one less thing
the Scheduler needs to check on every iteration.
Also, add a BlockTimeout class that will automatically handle relative or
absolute timeouts as well as overriding timeouts (e.g. socket timeouts)
more consistently.
Also, rework the TimerQueue class to be able to fire events from
any processor, which requires Timer to be RefCounted. Also allow
creating id-less timers for use by blocking operations.
Rather than waiting until we get the first mouse packet, enable the
absolute mode immediately. This avoids having to click first to be
able to move the mouse.
We need to not only add a record for a reference, but we need
to copy the reference count on fork as well, because the code
in the fork assumes that it has the same amount of references,
still.
Also, once all references are dropped when a process is disowned,
delete the shared buffer.
Fixes#4076
This makes misses in the BlockBasedFS's LRU block cache faster by
storing the cache entries in one of two doubly-linked list.
Dirty and clean cache entries are kept in two separate lists, and
move between them when their state changes. This can probably be
improved upon further.
If the inode's block list cache is empty, we forgot to assign the
result of computing the block list. The fact that this worked anyway
makes me wonder when we actually don't have a cache..
Thanks to szyszkienty for spotting this! :^)
Instead of doing a linear scan of the entire cache when doing a lookup,
we now have a nice O(1) HashMap in front of the cache.
The cache miss case can still be improved, this patch really only helps
the cache hit case.
This dramatically improves cached filesystem I/O. :^)
Since we're using UserOrKernelBuffers, SMAP will be automatically
disabled when we actually access the buffer later on. There's no need
to disable it wholesale across the entire read/write operations.
This is a new "browse" permission that lets you open (and subsequently list
contents of) directories underneath the path, but not regular files or any other
types of files.
We should never resume a thread by directly setting it to Running state.
Instead, if a thread was in Running state when stopped, record the state
as Runnable.
Fixes#4150
We were leaking a ref on the executed inode in successful calls to
sys$execve(). This meant that once a binary had ever been executed,
it was impossible to remove it from the file system.
The execve system call is particularly finicky since the function
does not return normally on success, so extra care must be taken to
ensure nothing is kept alive by stack variables.
There is a big NOTE comment about this, and yet the bug still got in.
It would be nice to enforce this, but I'm unsure how.
e2fsck was complaining about blocks being allocated in an inode's list
of direct blocks while at the same time being free in the block bitmap.
It was easy to reproduce by creating a file with non-zero length and
then truncating it. This fixes the issue by clearing out the direct
block list when resizing a file to 0.
We need to create a reference for the new PID for each shared buffer
that the process had a reference to. If the process subsequently
get replaced through exec, those references will be dropped again.
But if exec for some reason fails then other code, such as global
destructors could still expect having access to them.
Fixes#4076
The time returned by sys$clock_gettime() was not aligned with the delay
calculations in sys$clock_nanosleep(). This patch fixes that by taking
the system's ticks_per_second value into account in both functions.
This patch also removes the need for Thread::sleep_until() and uses
Thread::sleep() for both absolute and relative sleeps.
This was causing the nesalizer emulator port to sleep for a negative
amount of time at the end of each frame, making it run way too fast.
Problem:
- C-style arrays do not automatically provide bounds checking and are
less type safe overall.
- `__builtin_memcmp` is not a constant expression in the current gcc.
Solution:
- Change private m_data to be AK::Array.
- Eliminate constructor from C-style array.
- Change users of the C-style array constructor to use the default
constructor.
- Change `operator==()` to be a hand-written comparison loop and let
the optimizer figure out to use `memcmp`.
We won't be receiving full PS/2 mouse packets when the VMWareBackdoor
absolute mouse mode is enabled. So, read just one byte every time
and retrieve the latest mouse packet from VMWareBackdoor immediately.
Fixes#4086
This allows issuing asynchronous requests for devices and waiting
on the completion of the request. The requests can cascade into
multiple sub-requests.
Since IRQs may complete at any time, if the current process is no
longer the same that started the process, we need to swich the
paging context before accessing user buffers.
Change the PATA driver to use this model.
Rework the PS/2 keyboard and mouse drivers to use a common 8042
controller driver. Also, reset and reconfigure the 8042 controller
as they are not guaranteed to be in the state that we expect.
When two processors send each others a SMP message at the same time
they need to process messages while waiting for delivery of the
message they just sent, or they will deadlock.
This makes most operations thread safe, especially so that they
can safely be used in the Kernel. This includes obtaining a strong
reference from a weak reference, which now requires an explicit
call to WeakPtr::strong_ref(). Another major change is that
Weakable::make_weak_ref() may require the explicit target type.
Previously we used reinterpret_cast in WeakPtr, assuming that it
can be properly converted. But WeakPtr does not necessarily have
the knowledge to be able to do this. Instead, we now ask the class
itself to deliver a WeakPtr to the type that we want.
Also, WeakLink is no longer specific to a target type. The reason
for this is that we want to be able to safely convert e.g. WeakPtr<T>
to WeakPtr<U>, and before this we just reinterpret_cast the internal
WeakLink<T> to WeakLink<U>, which is a bold assumption that it would
actually produce the correct code. Instead, WeakLink now operates
on just a raw pointer and we only make those constructors/operators
available if we can verify that it can be safely cast.
In order to guarantee thread safety, we now use the least significant
bit in the pointer for locking purposes. This also means that only
properly aligned pointers can be used.
Most systems (Linux, OpenBSD) adjust 0.5 ms per second, or 0.5 us per
1 ms tick. That is, the clock is sped up or slowed down by at most
0.05%. This means adjusting the clock by 1 s takes 2000 s, and the
clock an be adjusted by at most 1.8 s per hour.
FreeBSD adjusts 5 ms per second if the remaining time adjustment is
>= 1 s (0.5%) , else it adjusts by 0.5 ms as well. This allows adjusting
by (almost) 18 s per hour.
Since Serenity OS can lose more than 22 s per hour (#3429), this
picks an adjustment rate up to 1% for now. This allows us to
adjust up to 36s per hour, which should be sufficient to adjust
the clock fast enough to keep up with how much time the clock
currently loses. Once we have a fancier NTP implementation that can
adjust tick rate in addition to offset, we can think about reducing
this.
adjtime is a bit old-school and most current POSIX-y OSs instead
implement adjtimex/ntp_adjtime, but a) we have to start somewhere
b) ntp_adjtime() is a fairly gnarly API. OpenBSD's adjfreq looks
like it might provide similar functionality with a nicer API. But
before worrying about all this, it's probably a good idea to get
to a place where the kernel APIs are (barely) good enough so that
we can write an ntp service, and once we have that we should write
a way to automatically evaluate how well it keeps the time adjusted,
and only then should we add improvements ot the adjustment mechanism.
This addresses the issue first enountered in #3644. If a path is
first unveiled with "c" permissions, we should NOT return ENOENT
if the node does not exist on the disk, as the program will most
likely be creating it at a later time.
When computing the list of blocks to deallocate when freeing an inode,
we would stop collecting blocks after reaching the inode's block count.
Since we're getting rid of the inode, we need to also include the meta
blocks used by the on-disk block list itself.
* Change the register structures to use the volatile keyword explicitly
on the register values. This avoids accidentally omitting it as any
access will be guaranteed volatile.
* Don't assume we can read/write 64 bit value to the main counter and
the comparator. Not all HPET implementations may support this. So,
just use 32 bit words to access the registers. This ultimately works
around a bug in Bochs 2.6.11 that loses 32 bits of a 64 bit write to
a timer's comparator register (it internally writes one half and
clears the Tn_VAL_SET_CNF bit, and then because it's cleared it
fails to write the second half).
* Properly calculate the tick duration in calculate_ticks_in_nanoseconds
* As per specification, changing the frequency of one periodic timer
requires a restart of all periodic timers as it requires the main
counter to be reset.
This allows issuing asynchronous requests for devices and waiting
on the completion of the request. The requests can cascade into
multiple sub-requests.
Since IRQs may complete at any time, if the current process is no
longer the same that started the process, we need to swich the
paging context before accessing user buffers.
Change the PATA driver to use this model.
Because allocating/freeing regions may require locks that need to
wait on other processors for completion, this needs to be delayed
until it's safer. Otherwise it is possible to deadlock because we're
holding the global heap lock.
Function calls that are deferred will be executed before a thread
enters a pre-emptable state (meaning it is not in a critical section
and it is not in an irq handler). If it is not already in such a
state, it will be called immediately.
This is meant to be used from e.g. IRQ handlers where we might want
to block a thread until an interrupt happens.
If you try to do this (e.g "mv directory directory"), sys$rename() will
now fail with EDIRINTOSELF.
Dr. POSIX says we should return EINVAL for this, but a custom error
code allows us to print a much more helpful error message when this
problem occurs. :^)
Remapping these registers every time we try to read from or write to
them causes a lot of SMP broadcasts and a lot of other overhead.
This improves boot time noticeably.
Instead of mapping a 4KB region to access device configuration space
each time we call one of the PCI helpers, just map them once during
the boot process.
Then, if we request to access one of those devices, we can ask the
PCI subsystem to give us the virtual address where the device's
configuration space is mapped.
We were stripping the L3 headers from packets received on raw sockets.
This didn't match what other systems do, so let's adjust our behavior.
Thanks to @SpencerCDixon for noticing this! :^)
g_scheduler_lock cannot safely be acquired after Thread::m_lock
because another processor may already hold g_scheduler_lock and wait
for the same Thread::m_lock.
It's possible that we broadcast an IPI message right at the same time
another processor requests a halt. Rather than spinning forever waiting
for that message to be handled, check if we should halt while waiting.
This enables the APIC timer on all CPUs, which means Scheduler::timer_tick
is now called on all CPUs independently. We still don't do anything on
the APs as it instantly crashes due to a number of other problems.
This makes it possible to start _everything_ under UserspaceEmulator, by
setting `init_args` to `--report-to-debug,/bin/SystemServer` and `init`
to `/bin/UserspaceEmulator`.
With the UE patches before this, we get to spawn WindowServer, and crash
because of FLD_RM32 (nothing tested past that) in graphical mode.
But we get a working shell in text mode :^) (and DHCPClient fails when
setting whatever settings it has received)
This fixes an issue where making a TCP connection to localhost didn't
work correctly since the loopback interface is currently synchronous.
(Sending something to localhost would enqueue a packet on the same
interface and then immediately wake the network task to process that
packet.)
This was preventing the TCP handshake from working correctly with
localhost since we'd send out the SYN packet before moving to the
SynSent state. The lock is now held long enough for this operation
to be atomic.
Problem:
- `constexpr` functions are decorated with the `inline` specifier
keyword. This is redundant because `constexpr` functions are
implicitly `inline`.
- [dcl.constexpr], §7.1.5/2 in the C++11 standard): "constexpr
functions and constexpr constructors are implicitly inline (7.1.2)".
Solution:
- Remove the redundant `inline` keyword.
Notice that we ensured that the size is a multiple of the page size and
that there is at least one page there, otherwise, this change would be
invalid.
We create an empty region and then expand it:
// First iteration.
m_user_physical_regions.append(PhysicalRegion::create(addr, addr));
// Following iterations.
region->expand(region->lower(), addr);
So if the memory region only has one page, we would end up with an empty
region. Thus we need to do one more iteration.
If something goes wrong when trying to write out a perfcore file during
process finalization, there's nowhere to report an error to, other than
the debug log. So write it to the debug log.
Problem: Defining the destructor violates the "rule of 0" and prevents
the copy/move constructor/assignment operators from being provided by
the compiler.
Solution: Change the constructor and destructor to be the default
compiler-provided definition.
In case we want to rely more on TSC in time keeping in the future, idk
This adds:
- RDTSCP, for when the RDTSCP instruction is available
- CONSTANT_TSC, for when the TSC has a constant frequency, invariant
under things like the CPU boosting its frequency.
- NONSTOP_TSC, for when the TSC doesn't pause when the CPU enters
sleep states.
AMD cpus and newer intel cpus set the INVSTC bit (bit 8 in edx of
extended cpuid 0x8000000008), which implies both CONSTANT_TSC and
NONSTOP_TSC. Some older intel processors have CONSTANT_TSC but not
NONSTOP_TSC; this is set based on cpu model checks.
There isn't a ton of documentation on this, so this follows Linux
terminology and http://blog.tinola.com/?e=54
CONSTANT_TSC:
39b3a79105
NONSTOP_TSC:
40fb17152c
qemu disables invtsc (bit 8 in edx of extended cpuid 0x8000000008)
by default even if the host cpu supports it. It can be enabled by
running with `SERENITY_QEMU_CPU=host,migratable=off` set.
The block group indices are 1-based for some reason. Because of that,
we were forgetting to check in the very last block group when doing
block allocation. This caused block allocation to fail even when the
superblock indicated that we had free blocks.
Fixes#3674.
When obeying FD_CLOEXEC, we don't need to explicitly call close() on
all the FileDescriptions. We can just clear them out from the process
fd table. ~FileDescription() will call close() anyway.
This fixes an issue where TelnetServer would shut down accepted sockets
when exec'ing a shell for them. Since the parent process still has the
socket open, we should not force-close it. Just let go.
This finally takes care of the kind-of excessive boilerplate code that were the
ctype adapters. On the other hand, I had to link `LibC/ctype.cpp` to the Kernel
(for `AK/JsonParser.cpp` and `AK/Format.cpp`). The previous commit actually makes
sense now: the `string.h` includes in `ctype.{h,cpp}` would require to link more LibC
stuff to the Kernel when it only needs the `_ctype_` array of `ctype.cpp`, and there
wasn't any string stuff used in ctype.
Instead of all this I could have put static derivatives of `is_any_of()` in the
concerned AK files, however that would have meant more boilerplate and workarounds;
so I went for the Kernel approach.
Similar to Process, we need to make Thread refcounted. This will solve
problems that will appear once we schedule threads on more than one
processor. This allows us to hold onto threads without necessarily
holding the scheduler lock for the entire duration.
The thread joining logic hadn't been updated to account for the subtle
differences introduced by software context switching. This fixes several
race conditions related to thread destruction and joining, as well as
finalization which did not properly account for detached state and the
fact that threads can be joined after termination as long as they're not
detached.
Fixes#3596
This function is not avaliable in the kernel.
In the future it would be nice to have some sort of <charconv> header
that does this for all integer types and then call it in strtoull and et
cetera.
The difference would be that this function say 'from_chars' would return
an Optional and not just interpret anything invalid as zero.
We need to assert if interrupts are not disabled when changing the
interrupt number of an interrupt handler.
Before this fix, any change like this would lead to a crash,
because we are using InterruptDisabler in IRQHandler::change_irq_number.