After the page fault handler has found the region in which the fault
occurred, do the rest of the work in the region itself.
This patch also makes all fault types consistently crash the process
if a new page is needed but we're all out of pages.
Since the kernel page tables are shared between all processes, there's
no need to (implicitly) flush the TLB for them on every context switch.
Setting the G bit on kernel page tables allows the CPU to keep the
translation caches around.
This patch changes the parameter to Region::map() to be a PageDirectory
since that matches how we think about the memory model:
Regions are views onto VMObjects, and are mapped into PageDirectories.
Each Process has a PageDirectory. The kernel also has a PageDirectory.
Since a Region is merely a "window" onto a VMObject, it can both begin
and end at a distance from the VMObject's boundaries.
Therefore, we should always be computing indices into a VMObject's
physical page array by adding the Region's "first_page_index()".
There was a whole bunch of code that forgot to do that. This fixes
many wrong behaviors for Regions that start part-way into a VMObject.
When creating a new directory, we set the initial size to 1 block.
This meant that we were allocating a block up front, but the Inode's
internal block list cache was not populated with this block.
This broke write_bytes() on a new directory, since it assumed that
the block list cache would be up to date if the call to write_bytes()
would not change the directory's size.
This patch fixes the issue in two ways: First, we cache the initial
block list created for new directories.
Second, we now repopulate the block list cache in write_bytes() if it
is empty when we get there. This is basically just a safety fallback
to avoid having this kind of bug in the future.
Ports/.port_include.sh, Toolchain/BuildIt.sh, Toolchain/UseIt.sh
have been left largely untouched due to use of Bash-exclusive
functions and variables such as $BASH_SOURCE, pushd and popd.
We can't be calling the virtual FS::flush_writes() in order to flush
the disk cache from within the disk cache, since an FS subclass may
try to do cache stuff in its flush_writes() implementation.
Instead, separate out the implementation of DiskBackedFS's flushing
logic into a flush_writes_impl() and call that from the cache code.
Also cache the block group descriptor table in a KBuffer on file system
initialization, instead of on first access.
This reduces pressure on the kmalloc heap somewhat.
We were writing out the full block list whenever Ext2FSInode::resize()
was called, even if the old and new sizes were identical.
This patch makes it a no-op, which drastically improves "cp" speed
since we now take full advantage of the up-front call to ftruncate().
If there are not enough free blocks in the filesystem to accomodate
growing an Inode, we should fail with ENOSPC before even starting to
allocate blocks.
Add a simple cache to Ext2FS where we keep block bitmaps along with a
dirty bit. This allows us to coalesce bitmap flushes, giving us a nice
~3x improvement in disk_benchmark write speeds.
Store the cached super block as an ext2_super_block member instead of
caching it in a ByteBuffer and using a casting helper everywhere.
This patch also combines reading/writing of the super block into a
single disk device operation (instead of two.)
Oops, we had a little mistake here. We were flushing whenever !NOFLSH,
not just when generating a signal.
This broke arrow keys in the terminal (you would only get A/B/C/D when
pressing arrow keys, instead of the full escape sequence.)
We were doing a temporary STI/CLI in MemoryManager::zero_page() to be
able to acquire the VMObject's lock before zeroing out a page.
This logic was inherited from the inode fault handler, where we need
to enable interrupts anyway, since we might need to interact with the
underlying storage device.
Zero-fill faults don't actually need to lock the VMObject, since they
are already guaranteed exclusivity by interrupts being disabled when
entering the fault handler.
This is different from inode faults, where a second thread can often
get an inode fault for the same exact page in the same VMObject before
the first fault handler has received a response from the disk.
This is why the lock exists in the first place, to prevent this race.
This fixes an intermittent crash in sys$execve() that was made much
more visible after I made userspace stacks lazily allocated.
Add text.startup to the .text block, add .ctors as well.
Use them in init.cpp to call global constructors after
gtd and idt init. That way any funky constructors should be ok.
Also defines some Itanium C++ ABI methods that probably shouldn't be,
but without them the linker gets very angry.
If the code ever actually tries to use __dso_handle or call
__cxa_atexit, there's bigger problems with the kernel.
Bit of a hack would be an understatement but hey. It works :)
Make userspace stacks lazily allocated and allow them to grow up to
4 megabytes. This avoids a lot of silly crashes we were running into
with software expecting much larger stacks. :^)
Add dedicated internal types for Int64 and UnsignedInt64. This makes it
a bit more straightforward to work with 64-bit numbers (instead of just
implicitly storing them as doubles.)
This is just a wrapper around strstr() for now. There are many better
ways to search for a string within a string, but I'm just adding a nice
API at the moment. :^)
Add the ability to both pass arguments to scripts with shebangs
(./script argument1 argument2) and to specify them in the shebang line
(#!/usr/local/bin/bash -x -e)
Fixes#585
I have no idea why this was here. It makes no sense. If you're trying
to find out if something is a directory, why wouldn't you be allowed to
ask that about a FIFO? :^)
Thanks to Brandon for spotting this!
Also, while we're here, cache the directory state in a bool member so
we don't have to keep fetching inode metadata when checking this
repeatedly. This is important since sys$read() now calls it.
This makes tcgetpgrp() on a master PTY return the PGID of the slave PTY
which is probably what you are looking for. I'm not sure how correct or
standardized this is, but it makes sense to me right now.
We now return EISDIR whenever a program attempts to call sys$read
on a directory. Previously, attempting to read a directory could
either return junk data or, in the case of /proc/, cause a kernel
panic.
Fixed an issue with operator precedence in calls to `send_byte()`, in
which a value of `1` was being sent to the function. This had the
nasty side-effect of selecting the slave drive if the value of
`head` was equal to one. A read/write would fail in the case, as
it would attempt to read from the slave drive (not good).
I've also added a seek to the top of the read/write code, which seems
to have fixed an issue with Linux not detecting the disk images after
they have been unmounted from Serenity. This isn't specified in the
datasheet, but a few other drivers have it so we should too :^)
Thread::make_userspace_stack_for_main_thread is only ever called from
Process::do_exec, after all the fun ELF loading and TSS setup has
occured.
The calculations in there that check if the combined argv + envp
size will exceed the default stack size are not used in the rest of
the stack setup. So, it should be safe to move this to the beginning
of do_exec and bail early with -E2BIG, just like the man pages say.
Additionally, advertise this limit in limits.h to be a good POSIX.1
citizen. :)
Previously, procfs$pid_fds would return nothing when called
for a process that had either no open files or a non-existent
handle. This could cause problems when a userspace program
expected a valid Json response.
Procfs$pid_fs now returns an empty array in the aforementioned
cases.
ELFLoader::layout() had a "failed" variable that was never set. This
patch checks the return value of each hook (alloc/map section and tls)
and fails the load if they return null.
I also needed to patch Process so that the alloc_section_hook and
map_section_hook actually return nullptr when allocating a region fails.
Fixes#664 :)
The TTY driver now respects the ICANON flag, enabling basic line
editing like VKILL, VERASE, VEOF and VWERASE. Additionally,
ICANON is now set by default.
Basic echoing has can now be enabled via the ECHO flag, though
more complicated echoing like ECHOCTL or ECHONL has not been
implemented.
This reverts commit 1cca5142af.
This appears to be causing intermittent triple-faults and I don't know
why yet, so I'll just revert it to keep the tree in decent shape.
Background: DoubleBuffer is a handy buffer class in the kernel that
allows you to keep writing to it from the "outside" while the "inside"
reads from it. It's used for things like LocalSocket and PTY's.
Internally, it has a read buffer and a write buffer, but the two will
swap places when the read buffer is exhausted (by reading from it.)
Before this patch, it was internally implemented as two Vector<u8>
that we would swap between when the reader side had exhausted the data
in the read buffer. Now instead we preallocate a large KBuffer (64KB*2)
on DoubleBuffer construction and use that throughout its lifetime.
This removes all the kmalloc heap traffic caused by DoubleBuffers :^)
After we clear the FPU state in a thread when it uses the FPU for the
first time, we also save the clean slate in the thread's FPU state
buffer. When we're doing that, let's write through current->fpu_state()
just to make it clear what's going on.
It was actually safe, since we'd just overwritten the g_last_fpu_thread
pointer anyway, but this patch improves the communication of intent.
Spotted by Bryan Steele, thanks!
The way it gets the entropy and blasts it to the buffer is pretty
ugly IMHO, but it does work for now. (It should be replaced, by
not truncating a u32.)
It implements an (unused for now) flags argument, like Linux but
instead of OpenBSD's. This is in case we want to distinguish
between entropy sources or any other reason and have to implement
a new syscall later. Of course, learn from Linux's struggles with
entropy sourcing too.
Cloned threads (basically, forked processes) inherit the complete FPU
state of their origin thread. There was a bug in the lazy FPU state
save/restore mechanism where a cloned thread would believe it had a
buffer full of valid FPU state (because the inherited flag said so)
but the origin thread had never actually copied any FPU state into it.
This patch fixes that by forcing out an FPU state save after doing
the initial FPU initialization (FNINIT) in a thread. :^)
We were leaking 512 bytes of kmalloc memory for every new thread.
This patch fixes that, and also makes sure to zero out the FPU state
buffer after allocating it, and finally also makes the LogStream
operator<< for Thread look a little bit nicer. :^)
This is obviously not ideal, and it would be better to teach it how to
allocate more pages, etc. But since the physical page allocator itself
currently uses SlabAllocator, it's a little bit tricky :^)
Make sure we don't move accepted sockets to the Completed setup state
until we've actually constructed a FileDescription for them.
This is important, since this state transition will trigger connect()
to unblock on the client side, and the client may try writing to the
socket right away.
This makes DNS lookups way more reliable since we don't just fail to
write() right after connect()ing to LookupServer sometimes. :^)
This was causing connect() to unblock immediately for local sockets,
since that's exactly what ConnectBlocker checks for.
Instead, just move to SetupState::Completed when it's accept()ed.
The Cache entries found in `DiskBackedFileSystem` are now stored in a
`KBuffer` object, instead of relying on `kmalloc_eternal`. The number
of entries was exceeding that of the number of bytes allocated to
`kmalloc_eternal`, which in turn caused `mount()` to fail epically
when called.
Now programs can catch the SIGSEGV signal when they segfault.
This commit also introduced the send_urgent_signal_to_self method,
which is needed to send signals to a thread when handling exceptions
caused by the same thread.
Added the exception_code field to RegisterDump, removing the need
for RegisterDumpWithExceptionCode. To accomplish this, I had to
push a dummy exception code during some interrupt entries to properly
pad out the RegisterDump. Note that we also needed to change some code
in sys$sigreturn to deal with the new RegisterDump layout.
Also added a script to handle creation of GPT partitioned disk (with
GRUB config file). Block limit will be used to disallow potential access
to other partitions.
This patch adds three separate per-process fault counters:
- Inode faults
An inode fault happens when we've memory-mapped a file from disk
and we end up having to load 1 page (4KB) of the file into memory.
- Zero faults
Memory returned by mmap() is lazily zeroed out. Every time we have
to zero out 1 page, we count a zero fault.
- CoW faults
VM objects can be shared by multiple mappings that make their own
unique copy iff they want to modify it. The typical reason here is
memory shared between a parent and child process.
If we didn't find anything else that wants to run, we don't need to
update the current thread's TSS since we're just gonna return to the
same thread anyway.
This patch overloads Inode::is_directory() with a faster version that
doesn't require instantiating the whole InodeMetadata.
If you have an Ext2FSInode&, calling is_directory() should be instant
since we can just look directly at the raw inode bits.
We need these for PhysicalPage objects. Ultimately I'd like to get rid
of these objects entirely, but while we still have to deal with them,
let's at least handle large demand a bit better.
Instead of allocating and populating a Copy-on-Write bitmap for each
Region up front, wait until we actually clone the Region for sharing
with another process.
In most cases, we never need any CoW bits and we save ourselves a lot
of kmalloc() memory and time.
When splitting an Region that's already the result of an earlier split,
we have to take the Region's offset-in-VMObject into account since it
may be non-zero.
If we want to make a new entry in the disk cache when it's completely
full of dirty blocks, we'll now synchronously flush the writes at that
point. Maybe it's not ideal, but at least we can keep going.
This way clients are not required to have instantiated ByteBuffers
and can choose whatever memory scheme works best for them.
Also converted some of the Ext2FS code to use stack buffers instead.
Instead of having DiskBackedFS allocate a ByteBuffer, leave it to each
client to provide buffer space.
This is significantly faster in many cases where we can use a stack
buffer and avoid heap allocation entirely.
The hashmap cache was ridiculously slow and hurt us more than it helped
us. This patch replaces it with a flat memory cache that keeps up to
10'000 blocks in cache with a simple dirty bit.
The syncd task will wake up periodically and call flush_writes() on all
file systems, which now causes us to traverse the cache and write all
dirty blocks to disk.
There's a ton of room for improvement here, but this itself is already
drastically better when doing repeated GCC invocations.
This is a simple command that can be used to display HTML from a given
file, or from the standard input, in an HtmlView. It replaces the `tho`
(test HTML output) command.
It turns out some BIOS vendors don't support non-BCD date/time mode, but
we were relying on it being available. We no longer do this, but instead
check whether the BIOS claims to provide BCD or regular binary values for
its date/time data.
We were just blindly trusting that the bootloader would only give us
page-aligned memory regions. This is apparently not always the case,
so now we can try to repair those regions.
Fixes#601
We were always returning the full VM range of the partially-unmapped
Region to the range allocator. This caused us to re-use those addresses
for subsequent VM allocations.
This patch also skips creating a new VMObject in partial munmap().
Instead we just make split regions that point into the same VMObject.
This fixes the mysterious GCC ICE on large C++ programs.
This simplifies the ownership model and makes Region easier to reason
about. Userspace Regions are now primarily kept by Process::m_regions.
Kernel Regions are kept in various OwnPtr<Regions>'s.
Regions now only ever get unmapped when they are destroyed.
If we can't already read when we enter recvfrom() on a LocalSocket,
we'll now block the current thread until we can.
Also added a buffer_for(FileDescription&) helper so that the client
and server can share some of the code. :^)
1) Off-by-one in block allocation when block size != 1 KB
Due to a quirk in the on-disk layout of ext2, file systems with a block
size of 1 KB have block #1 as their first block, while all others start
on block #0.
2) We had no fallback mechanism when the preferred group was full
We now allocate blocks from the preferred block group as long as it's
possible, and fall back to a simple scan through all groups when the
preferred one is full.
Put one unused page on each side of VM allocations to make invalid
accesses more likely to generate crashes.
Note that we will not add this guard padding for mmap() at a specific
memory address, only to "mmap it anywhere" requests.
Made getsockopt() and setsockopt() virtual so we can handle them in the
various Socket subclasses. The subclasses map kinda nicely to "levels".
This will allow us to implement things like "traceroute", although..
I spent some time trying to do that, but then hit a wall when it turned
out that the user-mode networking in QEMU doesn't preserve TTL in the
ICMP packets passing through.
We now no longer hardcode the sigreturn syscall in
the signal trampoline. Because of the way inline asm inputs
work, I've had to enclose the trampoline in the function
signal_trampoline_dummy.
This is a freelist allocator with static size classes that works as a
complement to the generic kmalloc(). It's a lot faster than kmalloc()
since allocation just means popping from the freelist.
It's also significantly more compact when there are a lot of objects
smaller than the minimum kmalloc chunk size (32 bytes.)
This patch enables it for the Region and PhysicalPage classes.
In the PhysicalPage (8 bytes) case, it's a huge improvement since we
no longer waste 75% of the storage allocated.
There are also a number of ways this can be improved, so let's keep
working on it going forward.
This patch makes it possible to *run* text files that start with the
characters "#!" followed by an interpreter.
I've tested this with both the Serenity built-in shell and the Bash
shell, and it works as expected. :^)
If we receive an IRQ while the idle task is running, prevent it from
re-halting the CPU after the IRQ handler returns.
Instead have the idle task yield to the scheduler, so we can see if
the IRQ has unblocked something.
The introduction fchdir in the middle of the EUMERATE_SYSCALLS
expression changed other syscall numbers, which broke compatibility.
This commit fixes that by moving it to the end.
The fchdir() function is equivalent to chdir() except that the
directory that is to be the new current working directory is
specified by a file descriptor.
Also added some assertions to DirectoryEntry in case someone tries to
instantiate them with names that would overflow the name buffer.
DirectoryEntry is a crappy data structure, and the name buffer is also
crappy. Added a FIXME about replacing it with something nicer.
Before this patch, the DirectoryEntry::name buffer would overflow if
you did "touch extremely-long-file-name". Duh.
Fixes#538.
Due to the changes in signal handling m_kernel_stack_for_signal_handler_region
and m_signal_stack_user_region are no longer necessary, and so, have been
removed. I've also removed the similarly reduntant m_tss_to_resume_kernel.