vm_phys_enq_chunk() inserts a run of pages into the buddy queues. When
lazy initialization is enabled, only the first page of each run is
initialized; vm_phys_enq_chunk() thus initializes the page following the
just-inserted run.
This fails to account for the possibility that the page following the
run doesn't belong to the segment. Handle that in vm_phys_enq_chunk().
Reported by: KASAN
Reported by: syzbot+1097ef4cee8dfb240e31@syzkaller.appspotmail.com
Fixes: b16b4c22d2 ("vm_page: Implement lazy page initialization")
Have PCTRIE_RECLAIM_CALLBACK typecast one function pointer type to
another, to relieve the writer of the call back function from having
to cast its first argument from void* to member type.
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D45586
vm_phys_seg_paddr_to_vm_page() expects a PA that's in bounds, but
vm_phys_find_range() purposefully returns a pointer to the end of the
last page in a segment.
Fixes: 69cbb18746 ("vm_phys: Add a vm_phys_seg_paddr_to_vm_page() helper")
FreeBSD's boot times have decreased to the point where vm_page array
initialization represents a significant fraction of the total boot time.
For example, when booting FreeBSD in Firecracker (a VMM designed to
support lightweight VMs) with 128MB and 1GB of RAM, vm_page
initialization consumes 9% (3ms) and 37% (21.5ms) of the kernel boot
time, respectively. This is generally relevant in cloud environments,
where one wants to be able to spin up VMs as quickly as possible.
This patch implements lazy initialization of (most) page structures,
following a suggestion from cperciva@. The idea is to introduce a new
free pool, VM_FREEPOOL_LAZYINIT, into which all vm_page structures are
initially placed. For this to work, we need only initialize the first
free page of each chunk placed into the buddy allocator. Then, early
page allocations draw from the lazy init pool and initialize vm_page
chunks (up to 16MB, 4096 pages) on demand. Once APs are started, an
idle-priority thread drains the lazy init pool in the background to
avoid introducing extra latency in the allocator. With this scheme,
almost all of the initialization work is moved out of the critical path.
A couple of vm_phys operations require the pool to be drained before
they can run: vm_phys_find_range() and vm_phys_unfree_page(). However,
these are rare operations. I believe that
vm_phys_find_freelist_contig() does not require any special treatment,
as it only ever accesses the first page in a power-of-2-sized free page
chunk, which is always initialized.
For now the new pool is only used on amd64 and arm64, since that's where
I can easily test and those platforms would get the most benefit.
Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D40403
A subsequent patch will make this factoring more worthwhile.
No functional change intended.
Reviewed by: dougm, alc, kib, emaste
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D40400
This is useful for a subsequent patch which implements lazy
initialization of vm_page structures using a dedicate vm_phys free page
pool.
No functional change intended.
Reviewed by: alc, kib, emaste
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D40399
jemalloc performs two types of virtual memory allocations: (1) large
chunks of virtual memory, where the chunk size is a multiple of a
superpage and explicitly aligned, and (2) small allocations, mostly
128KB, where no alignment is requested. Typically, it starts with a
small allocation, and over time it makes both types of allocation.
With anon_loc being updated on every allocation, we wind up with a
repeating pattern of a small allocation, a large gap, and a large,
aligned allocation. (As an aside, we wind up allocating a reservation
for these small allocations, but it will never fill because the next
large, aligned allocation updates anon_loc, leaving a gap that will
never be filled with other small allocations.)
With this change, anon_loc isn't updated on every allocation. So, the
small allocations will be clustered together, the large allocations will
be clustered together, and there will be fewer gaps between the
anonymous memory allocations. In addition, I see a small reduction in
reservations allocated (e.g., 1.6% during buildworld), fewer partially
populated reservations, and a small increase in 64KB page promotions on
arm64.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39845
Replace the lookup-remove loop in swp_pager_meta_free_all with a call
to SWAP_PCTRIE_RECLAIM_CALLBACK, to eliminate repeated trie searches.
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D45583
Define a page_range struct to pair up the two values passed to
freerange functions. Have swp_pager_freeswapspace also take a
page_range argument rather than a pair of arguments.
In swp_pager_meta_free_all, drop a needless test and use a new
helper function to do the cleanup for each swap block.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D45562
Drop an unneeded test, a branch and a needless computation to save a
few instructions.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D45558
This reduces work done under vm_page_insert for large objects.
Reviewed by: alc, dougm, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D45486
Use the new pctrie combined lookup/insert. This is an easy application
of the new facility. There are other places where we do this for pages
that may need more plumbing to use combined lookup/insert.
Reviewed by: kib (previous version), dougm, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D45396
When vm_object_collapse() was changed in commit 98087a0 to call
vm_object_terminate(), rather than destroying the object directly, its
call to vm_reserv_break_all() should have been removed, as
vm_object_terminate() calls vm_reserv_break_all().
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D45495
Make the tlb shootdown function as a pointer. By default, it still
points to the system function smp_targeted_tlb_shootdown(). It allows
other implemenations to overwrite in the future.
Reviewed by: kib
Tested by: whu
Authored-by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
Co-Authored-by: Erni Sri Satya Vennela <ernis@microsoft.com>
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D45174
One of these changes saves two instructions on an amd64
GENERIC-NODEBUG build. The rest are entirely cosmetic, because the
compiler can deduce that x is nonzero, and avoid the needless test.
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D45331
Implement a simple heuristic to skip pointless promotion attempts by
pmap_enter_quick_locked() and moea64_enter(). Specifically, when
vm_fault() calls pmap_enter_quick() to map neighboring pages at the end
of a copy-on-write fault, there is no point in attempting promotion in
pmap_enter_quick_locked() and moea64_enter(). Promotion will fail
because the base pages have differing protection.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D45431
MFC after: 1 week
In three instances where fls(x)-1 is used, the compiler does not know
that x is nonzero and so adds needless zero checks. Using ilog(x)
instead saves, in each instance, about 4 instructions, including a
conditional, and 16 or so bytes, on an amd64 build.
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D45330
I cannot find a time where the function was not named this.
Reviewed by: kib, markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45383
The LK_NOWAIT was added to suppress a witness warning, but LK_NOWITNESS
is more what we mean. This makes pbuf_ctor() more consistent with
buf_alloc(), although, unlike buf_alloc(), for pbuf there should not be
any danger of a wild locker relying on the type stability of the buf to
attempt a lock. That is, this is essentially cosmetic.
Relevant history:
- 531f8cfea0 Use dedicated lock name for pbufs
- 5875b94c74 buf_alloc(): lock the buffer with LK_NOWAIT
- c9e023541a pbuf_ctor(): lock the buffer with LK_NOWAIT
- 1fb00c8f10 buf_alloc(): Stop using LK_NOWAIT, use LK_NOWITNESS
Reviewed by: rew, kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D45360
UMA_MD_SMALL_ALLOC was recently replaced by UMA_USE_DMAP, but
da76d349b6 missed some improper uses of the old symbol.
This change makes sure that UMA_USE_DMAP is used properly in
code that selects uma_small_alloc.
Fixes: da76d349b6
Reported by: eduardo, rlibby
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D45368
This commit introduces the MINIDUMP_STARTUP_PAGE_TRACKING symbol and
uses it to simplify several instances of a complex preprocessor conditional
for adding pages allocated when bootstraping the kernel to minidumps.
Reviewed by: markj, mhorne
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D45085
This commit refactors the UMA small alloc code and
removes most UMA machine-dependent code.
The existing machine-dependent uma_small_alloc code is almost identical
across all architectures, except for powerpc where using the direct
map addresses involved extra steps in some cases.
The MI/MD split was replaced by a default uma_small_alloc
implementation that can be overridden by architecture-specific code by
defining the UMA_MD_SMALL_ALLOC symbol. Furthermore, UMA_USE_DMAP was
introduced to replace most UMA_MD_SMALL_ALLOC uses.
Reviewed by: markj, kib
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D45084
In vm_pageout_scan_inactive, release the object lock when we go to
refill the scan batch queue so that someone else has a chance to acquire
it. This improves access latency to the object when the pagedaemon is
processing many consecutive pages from a single object, and also in any
case avoids a hiccup during refill for the last touched object.
Reviewed by: alc, markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D45288
Whenever file is created, the vnode_create_vobject() function will
try to determine its size by calling vn_getsize_locked() as size 0
is ambigious: it means either the file size is 0 or the file size
is unknown.
Introduce special value for the size argument: VNODE_NO_SIZE.
Only when it is given, the vnode_create_vobject() will try to obtain
file's size on its own.
Introduce dedicated vnode_disk_create_vobject() for use by
g_vfs_open(), so we don't have to call vn_isdisk() in the common case
(for regular files).
Handle the case of mediasize==0 in g_vfs_open().
Reviewed by: alc, kib, markj, olce
Approved by: oshogbo (mentor), allanjude (mentor)
Differential Revision: https://reviews.freebsd.org/D45244
per allocated vm_object. Otherwise, since constructors are not
idempotent, we e.g. leak device reference in case of non-managed pager.
PR: 278826
Reported by: Austin Zhang <austin.zhang@dell.com>
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D45113
vm_object_page_remove() wants to busy the page, but that won't work
here. (Kernel stack pages are always busy.)
Make the error handling path look more like vm_thread_stack_dispose().
Reported by: pho
Reviewed by: kib, bnovkov
Fixes: 7a79d06697 ("vm: improve kstack_object pindex calculation to avoid pindex holes")
Differential Revision: https://reviews.freebsd.org/D45019
fork() may allocate a new thread in one of two ways: from UMA, or cached
in a freed proc that was just allocated from UMA. In either case, KASAN
and KMSAN need to initialize some state; in particular they need to
initialize the shadow mapping of the new thread's stack.
This is done differently between KASAN and KMSAN, which is confusing.
This patch improves things a bit:
- Add a new thread_recycle() function, which moves all kernel stack
handling out of kern_fork.c, since it doesn't really belong there.
- Then, thread_alloc_stack() has only one local caller, so just inline
it.
- Avoid redundant shadow stack initialization: thread_alloc()
initializes the KMSAN shadow stack (via kmsan_thread_alloc()) even
through vm_thread_new() already did that.
- Add kasan_thread_alloc(), for consistency with kmsan_thread_alloc().
No functional change intended.
Reviewed by: khng
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D44891
This commit replaces the linear transformation of kernel virtual
addresses to kstack_object pindex values with a non-linear
scheme that circumvents physical memory fragmentation caused by
kernel stack guard pages. The new mapping scheme is used to
effectively "skip" guard pages and assign pindices for
non-guard pages in a contiguous fashion.
The new allocation scheme requires that all default-sized kstack KVAs
come from a separate, specially aligned region of the KVA space.
For this to work, this commited introduces a dedicated per-domain
kstack KVA arena used to allocate kernel stacks of default size.
The behaviour on 32-bit platforms remains unchanged due to a
significatly smaller KVA space.
Aside from fullfilling the requirements imposed by the new scheme, a
separate kstack KVA arena facilitates superpage promotion in the rest
of kernel and causes most kstacks to have guard pages at both ends.
Reviewed by: alc, kib, markj
Tested by: markj
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D38852
This fixes compiler warnings when -Wunused-arguments is enabled and
not quieted.
Reviewed by: kib, markj
Obtained from: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D44623
The swap pager itself allocates readahead pages, so should take care to
unbusy them after a read error, just as it does in the non-error case.
PR: 277538
Reviewed by: olce, dougm, alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D44646
Add a function to check whether an aligned block of vm pages are
allocated, for use with impending changes to arm64 superpage
managment.
Reviewed by: alc
Differential Revision: http://reviews.freebsd.org/D44575
Remove sys/cdefs.h and sys/socket.h includes.
Order sys/ includes alphabetically.
Do not check for NULL before free().
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
DIfferential revision: https://reviews.freebsd.org/D43444
The loop 'skip clean blocks' checking for the clean blocks in the dirty
pages might end up setting the in_hole to true when exactly at EOF at
the middle of the block, without advancing the prev_offset value. Then
the next block is not dirty, and next_offset is clipped back to poffset
+ maxsize, equal to prev_offset, failing the assertion.
Instead of asserting prev_offset < next_offset, we must skip the write.
Reported by: asomers
PR: 276191
Reviewed by: alc, markj
Tested by: asomers
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43358
Commit 2619c5ccfe ("Avoid waiting on physical allocations that can't
possibly be satisfied") changed the return value from bool to errno.
Adjust the function description to match reality.
- Change vm_page_reclaim_contig[_domain] to return an errno instead
of a boolean. 0 indicates a successful reclaim, ENOMEM indicates
lack of available memory to reclaim, with any other error (currently
only ERANGE) indicating that reclamation is impossible for the
specified address range. Change all callers to only follow
up with vm_page_wait* in the ENOMEM case.
- Introduce vm_domainset_iter_ignore(), which marks the specified
domain as unavailable for further use by the iterator. Use this
function to ignore domains that can't possibly satisfy a physical
allocation request. Since WAITOK allocations run the iterators
repeatedly, this avoids the possibility of infinitely spinning
in domain iteration if no available domain can satisfy the
allocation request.
PR: 274252
Reported by: kevans
Tested by: kevans
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D42706
Function vm_phys_free_contig does not always free memory properly when
the npages parameter is less than max block size. Change it so that it does.
Note that this function is not currently invoked, and this error was
not triggered in earlier versions of the code.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D42891
Both system calls were stubs returning EOPNOTSUPP and libc did not
provide _ or __sys_ prefixed symbols. The actual implementation of
sbrk(2) is on top of the undocumented break(2) system call.
Technically this is a change in ABI, but no non-contrived program ever
called these syscalls.
Reviewed by: kib, emaste
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D42872
Add a function like kva_alloc that allows us to specify the alignment
of the virtual address space returned.
Reviewed by: alc, kib, markj
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D42788