system/freebsd-src

mirror of https://github.com/freebsd/freebsd-src synced 2024-09-19 16:23:29 +00:00

Author	SHA1	Message	Date
Mark Johnston	0401989282	vm: Round up npages and alignment for contig reclamation When searching for runs to reclaim, we need to ensure that the entire run will be added to the buddy allocator as a single unit. Otherwise, it will not be visible to vm_phys_alloc_contig() as it is currently implemented. This is a problem for allocation requests that are not a power of 2 in size, as with 9KB jumbo mbuf clusters. Reported by: alc Reviewed by: alc MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28924	2021-03-02 10:21:02 -05:00
Max Laier	14b5a3c7d5	vm pqbatch: move unmanaged page assert under pagequeue lock This KASSERT is overzealous because of the following race condition: 1) A managed page which is currently in PQ_LAUNDRY is freed. vm_page_free_prep calls vm_page_dequeue_deferred(). The page state is: PQ_LAUNDRY, PGA_DEQUEUE|PGA_ENQUEUED 2) The laundry worker comes around and picks up the page and calls vm_pageout_defer(m, PQ_LAUNDRY, true) to check if page is still in the queue. We do a vm_page_astate_load and get PQ_LAUNDRY, PGA_DEQUEUE|PGA_ENQUEUED as per above. 3) The laundry worker is pre-empted and another thread allocates our page from the free pool. For example vm_page_alloc_domain_after calls vm_page_dequeue() and sets VPO_UNMANAGED because we are allocating for an OBJT_UNMANAGED object. The page state is: PQ_NONE, 0 - VPO_UNMANAGED 4) The laundry worker resumes, and processes vm_pageout_defer based on the stale astate which leads to a call to vm_page_pqbatch_submit, which will trip on the KASSERT. Submitted by: mlaier Reviewed by: markj, rlibby Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D28563	2021-02-24 15:56:16 -08:00
Mark Johnston	537f92cd35	uma: Update the comment above startup_alloc() to reflect reality The scheme used for early slab allocations changed in commit `a81c400e75`. Reported by: alc Reviewed by: alc MFC after: 1 week	2021-02-22 18:22:51 -05:00
Mark Johnston	23e875fd97	vm_kern: Avoid sign extension in the KVA_QUANTUM definition Otherwise, on a powerpc64 NUMA system with hashed page tables, the first-level superpage reservation size is large enough that the value of the kernel KVA arena import quantum, KVA_NUMA_IMPORT_QUANTUM, is negative and gets sign-extended when passed to vmem_set_import(). This results in a boot-time hang on such platforms. Reported by: bdragon MFC after: 3 days	2021-02-22 15:50:09 -05:00
Alex Richardson	fa2528ac64	Use atomic loads/stores when updating td->td_state KCSAN complains about racy accesses in the locking code. Those races are fine since they are inside a TD_SET_RUNNING() loop that expects the value to be changed by another CPU. Use relaxed atomic stores/loads to indicate that this variable can be written/read by multiple CPUs at the same time. This will also prevent the compiler from doing unexpected re-ordering. Reported by: GENERIC-KCSAN Test Plan: KCSAN no longer complains, kernel still runs fine. Reviewed By: markj, mjg (earlier version) Differential Revision: https://reviews.freebsd.org/D28569	2021-02-18 14:02:48 +00:00
John Baldwin	67932460c7	Add a VA_IS_CLEANMAP() macro. This macro returns true if a provided virtual address is contained in the kernel's clean submap. In CHERI kernels, the buffer cache and transient I/O map are allocated as separate regions. Abstracting this check reduces the diff relative to FreeBSD. It is perhaps slightly more readable as well. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D28710	2021-02-17 16:32:11 -08:00
Mark Johnston	5c18744ea9	vm: Honour the "noreuse" flag to vm_page_unwire_managed() This flag indicates that the page should be enqueued near the head of the inactive queue, skipping the LRU queue. It is used when unwiring pages from the buffer cache following direct I/O or after I/O when POSIX_FADV_NOREUSE or _DONTNEED advice was specified, or when sendfile(SF_NOCACHE) completes. For the direct I/O and sendfile cases we only enqueue the page if we decide not to free it, typically because it's mapped. Pass "noreuse" through to vm_page_release_toq() so that we actually honour the desired LRU policy for these scenarios. Reported by: bdrewery Reviewed by: alc, kib MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28555	2021-02-10 11:10:27 -05:00
Ryan Stone	660344ca44	Add a VM flag to prevent reclaim on a failed contig allocation If a M_WAITOK contig alloc fails, the VM subsystem will try to reclaim contiguous memory twice before actually failing the request. On a system with 64GB of RAM I've observed this take 400-500ms before it finally gives up, and I believe that this will only be worse on systems with even more memory. In certain contexts this delay is extremely harmful, so add a flag that will skip reclaim for allocation requests to allow those paths to opt-out of doing an expensive reclaim. Sponsored by: Dell Inc Differential Revision: https://reviews.freebsd.org/D28422 Reviewed by: markj, kib	2021-02-03 16:16:51 -05:00
Brooks Davis	7a1591c1b6	Rename kern_mmap_req to kern_mmap Replace all uses of kern_mmap with kern_mmap_req, remove the old kern_mmap, and rename kern_mmap_req to kern_mmap. The helper saved some code churn initially, but having multiple interfaces is sub-optimal. Obtained from: CheriBSD Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28292	2021-01-25 21:50:37 +00:00
Konstantin Belousov	420d4be3e4	vm_map_protect(): remove not needed recalculations of new_prot, new_maxprot Requested by: alc Sponsored by: The FreeBSD Foundation	2021-01-14 10:02:43 +02:00
Konstantin Belousov	0659df6fad	vm_map_protect: allow to set prot and max_prot in one go. This prevents a situation where other thread modifies map entries permissions between setting max_prot, then relocking, then setting prot, confusing the operation outcome. E.g. you can get an error that is not possible if operation is performed atomic. Also enable setting rwx for max_prot even if map does not allow to set effective rwx protection. Reviewed by: brooks, markj (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28117	2021-01-13 01:35:22 +02:00
Konstantin Belousov	9402bb44f1	vmspace_fork: preserve wx settings in the child vm map after fork Noted by: markj Sponsored by: The FreeBSD Foundation	2021-01-12 08:09:59 +02:00
Konstantin Belousov	2e1c94aa1f	Implement enforcing write XOR execute mapping policy. It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE | PROT_EXEC are never specified together, if vm_map has MAP_WX flag set. FreeBSD control flag allows specific binary to request WX exempt, and there are per ABI boolean sysctls kern.elf{32,64}.allow_wx to enable/disable globally. Reviewed by: emaste, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28050	2021-01-12 01:15:43 +02:00
Mark Johnston	663de81f85	uma: Avoid unmapping direct-mapped slabs startup_alloc() uses pmap_map() to map slabs used for bootstrapping the VM. pmap_map() may ignore the hint address and simply return a range from the direct map. In this case we must not unmap the range in startup_free(). UMA uses bootstart and bootmem to track the range of KVA into which slabs are mapped if the direct map is not used. Unmap a startup slab only if it was mapped into that range. Reported by: alc Reviewed by: alc, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27885	2021-01-03 11:50:31 -05:00
Ryan Libby	942951ba46	uma dbg: catch more corruption with atomics Use atomic testandset and testandclear to catch concurrent double free, and to reduce the number of atomic operations. Submitted by: jeff Reviewed by: cem, kib, markj (all previous version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22703	2020-12-31 13:02:45 -08:00
Mark Johnston	81846def34	vm: Fix some bugs in the page busying code In vm_page_busy_acquire(), load the object pointer using atomic_load_ptr() as we do elsewhere. Per the comment, the object identity must be consistent across sleeps. In vm_page_grab_sleep(), pass the correct pindex to _vm_page_busy_sleep(). The pindex is used to re-check the page's identity before going to sleep. In particular, vm_page_grab_sleep() is used in unlocked grab, so the object lock is not necessarily held when verifying the page's identity, and the pindex may change if the page is moved, or freed and re-allocated. I believe this can result in spurious VM_PAGER_FAILs from vm_page_grab_valid_unlocked() or early termination of vm_page_grab_pages_unlocked(). In vm_page_grab_pages(), pass the correct pindex to vm_page_grab_sleep(). Otherwise I believe vm_page_grab_pages() will effectively spin when attempting to busy a busy page after the first index in the range. Reviewed by: alc, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27607	2020-12-27 17:01:44 -05:00
Mark Johnston	d2f1c44bc9	uma: Remove the MINBUCKET flag from the flag name list This should have been done in r368399 / commit `f8b6c51538`. Reported by: rlibby Sponsored by: The FreeBSD Foundation	2020-12-27 17:01:33 -05:00
Bryan Drewery	5fee468e83	Revert r368523 which fixed contig allocs waiting forever. This needs to account for empty NUMA domains or domains which do not satisfy the requested range. Discussed with: markj	2020-12-15 19:38:16 +00:00
Bryan Drewery	bbfec1633b	contig allocs: Don't retry forever on M_WAITOK. This restores behavior from before domain iterators were added in r327895 and r327896. The vm_domainset_iter_policy() will do a vm_wait_doms() and then restart its iterator when M_WAITOK is set. It will also force the containing loop to have M_NOWAIT. So we get an unbounded retry loop rather than the intended bounded retries that kmem_alloc_contig_pages() already handles. This also restores M_WAITOK to the vmem_alloc() call in kmem_alloc_attr_domain() and kmem_alloc_contig_domain(). Reviewed by: markj, kib MFC after: 2 weeks Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D27507	2020-12-10 20:44:29 +00:00
Mark Johnston	e574d407ae	uma: Make uma_zone_set_maxcache() work better with small limits The old implementation chose the largest bucket zone such that if the per-CPU caches are fully populated, the total number of items cached is no larger than the specified limit. If no such zone existed, UMA would not do any caching. We can now use uz_bucket_size_max to set a precise limit on the number of items in a zone's bucket, so the total size of per-CPU caches can be bounded more easily. Implement a new policy in uma_zone_set_maxcache(): choose a bucket size such that up to half of the limit can be cached in per-CPU caches, with the rest going to the full bucket cache. This fixes a problem with the kstack_cache zone: the limit of 4 * mp_ncpus items meant that the zone would not do any caching, defeating the whole purpose of the zone. That's because the smallest bucket size holds up to 2 items and we may cache up to 3 full buckets per CPU, and 2 * 3 * mp_ncpus > 4 * mp_ncpus. Reported by: mjg Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27168	2020-12-06 22:45:50 +00:00
Mark Johnston	f8b6c51538	uma: Enforce the use of uz_bucket_size_max in the free path uz_bucket_size_max is the maximum permitted bucket size. When filling a new bucket to satisfy uma_zalloc(), the bucket is populated with at most uz_bucket_size_max items. The maximum number of entries in the bucket may be larger. When freeing items, however, we will fill per-CPU buckets up to their maximum number of entries, potentially exceeding uz_bucket_size_max. This makes it difficult to precisely limit the number of items that may be cached in a zone. For example, if one wants to limit buckets to 1 entry for a particular zone, that's not possible since the smallest bucket holds up to 2 entries. Try to solve the problem by using uz_bucket_size_max to limit the number of entries in a bucket. Note that the ub_entries field is initialized upon every bucket allocation. Most zones are not affected since they do not impose any specific limit on the maximum bucket size. While here, remove the UMA_ZONE_MINBUCKET flag. It was unused and we now have uma_zone_set_maxcache() to control the zone's cache size more precisely. Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27167	2020-12-06 22:45:39 +00:00
Mark Johnston	8a6776ca0f	uma: Use atomic load for uz_sleepers This field is updated locklessly. Sponsored by: The FreeBSD Foundation	2020-12-06 22:45:22 +00:00
Mark Johnston	991f23ef20	uma: Avoid allocating buckets with the cross-domain lock held Allocation of a bucket can trigger a cross-domain free in the bucket zone, e.g., if the per-CPU alloc bucket is empty, we free it and get migrated to a remote domain. This can lead to deadlocks since a bucket zone may allocate buckets from itself or a pair of bucket zones could be allocating from each other. Fix the problem by dropping the cross-domain lock before allocating a new bucket and handling refill races. Use a list of empty buckets to ensure that we can make forward progress. Reported by: imp, mjg (witness(9) warnings) Discussed with: jeff Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27341	2020-11-30 16:18:33 +00:00
Konstantin Belousov	cd85379104	Make MAXPHYS tunable. Bump MAXPHYS to 1M. Replace MAXPHYS by runtime variable maxphys. It is initialized from MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys. Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer cache buffers exactly to atop(maxbcachebuf) (currently it is sized to atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1. The +1 for pbufs allows several pbuf consumers, among them vmapbuf(), to use unaligned buffers still sized to maxphys, esp. when such buffers come from userspace (). Overall, we save significant amount of otherwise wasted memory in b_pages[] for buffer cache buffers, while bumping MAXPHYS to desired high value. Eliminate all direct uses of the MAXPHYS constant in kernel and driver sources, except a place which initializes maxphys. Some random (and arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted straight. Some drivers, which use MAXPHYS to size embedded structures, get private MAXPHYS-like constant; their conversion is out of scope for this work. Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, dev/siis, were either submitted by, or based on changes by mav. Suggested by: mav () Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27225	2020-11-28 12:12:51 +00:00
Mark Johnston	1fea4b25c9	Wrap a long line in vm_pqbatch_process_page()	2020-11-19 15:41:42 +00:00
Mark Johnston	9e3e737608	Micro-optimize vm_page_pqbatch_submit() Avoid calling vm_page_domain() twice. Discussed with: alc (in D27207)	2020-11-19 15:40:58 +00:00
Mark Johnston	431fb8abd7	vm_phys: Try to clean up NUMA KPIs It can be useful for code outside the VM system to look up the NUMA domain of a page backing a virtual or physical address, specifically when creating NUMA-aware data structures. We have _vm_phys_domain() for this, but the leading underscore implies that it's an internal function, and vm_phys.h has dependencies on a number of other headers. Rename vm_phys_domain() to vm_page_domain(), and _vm_phys_domain() to vm_phys_domain(). Make the latter an inline function. Add _vm_phys.h and define struct vm_phys_seg there so that it's easier to use in other headers. Include it from vm_page.h so that vm_page_domain() can be defined there. Include machine/vmparam.h from _vm_phys.h since it depends directly on some constants defined there. Reviewed by: alc Reviewed by: dougm, kib (earlier versions) Differential Revision: https://reviews.freebsd.org/D27207	2020-11-19 03:59:21 +00:00
Mark Johnston	20f02659d6	vm_map: Handle kernel map entry allocator recursion On platforms without a direct map[*], vm_map_insert() may in rare situations need to allocate a kernel map entry in order to allocate kernel map entries. This poses a problem similar to the one solved for vmem boundary tags by vmem_bt_alloc(). In fact the kernel map case is a bit more complicated since we must allocate entries with the kernel map locked, whereas vmem can recurse into itself because boundary tags are allocated up-front. The solution is to add a custom slab allocator for kmapentzone which allocates KVA directly from kernel_map, bypassing the kmem_ layer. This avoids mutual recursion with the vmem btag allocator. Then, when vm_map_insert() allocates a new kernel map entry, it avoids triggering allocation of a new slab with M_NOVM until after the insertion is complete. Instead, vm_map_insert() allocates from the reserve and sets a flag in kernel_map to trigger re-population of the reserve just before the map is unlocked. This places an implicit upper bound on the number of kernel map entries that may be allocated before the kernel map lock is released, but in general a bound of 1 suffices. [*] This also comes up on amd64 with UMA_MD_SMALL_ALLOC undefined, a configuration required by some kernel sanitizers. Discussed with: kib, rlibby Reported by: andrew Tested by: pho (i386 and amd64 with !UMA_MD_SMALL_ALLOC) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26851	2020-11-11 17:16:39 +00:00
Jonathan T. Looney	7b516613aa	When destroying a UMA zone which has a reserve (set with uma_zone_reserve()), messages like the following appear on the console: "Freed UMA keg (Test zone) was not empty (0 items). Lost 528 pages of memory." When keg_drain_domain() is draining the zone, it tries to keep the number of items specified in the reservation. However, when we are destroying the UMA zone, we do not need to keep those items. Therefore, when destroying a non-secondary and non-cache zone, we should reset the keg reservation to 0 prior to draining the zone. Reviewed by: markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27129	2020-11-10 18:12:09 +00:00
Mateusz Guzik	3a440a421d	Add more per-cpu zones. This covers powers of 2 up to 64. Example pending user is ZFS.	2020-11-09 00:34:23 +00:00
Leandro Lupori	e2d6c417e3	Implement superpages for PowerPC64 (HPT) This change adds support for transparent superpages for PowerPC64 systems using Hashed Page Tables (HPT). All pmap operations are supported. The changes were inspired by RISC-V implementation of superpages, by @markj (r344106), but heavily adapted to fit PPC64 HPT architecture and existing MMU OEA64 code. While these changes are not better tested, superpages support is disabled by default. To enable it, use vm.pmap.superpages_enabled=1. In this initial implementation, when superpages are disabled, system performance stays at the same level as without these changes. When superpages are enabled, buildworld time increases a bit (~2%). However, for workloads that put a heavy pressure on the TLB the performance boost is much bigger (see HPC Challenge and pgbench on D25237). Reviewed by: jhibbits Sponsored by: Eldorado Research Institute (eldorado.org.br) Differential Revision: https://reviews.freebsd.org/D25237	2020-11-06 14:12:45 +00:00
Mateusz Guzik	2dee296a3d	Rationalize per-cpu zones. The 2 provided zones had inconsistent naming between each other ("int" and "64") and other allocator zones (which use bytes). Follow malloc by naming them "pcpu-" + size in bytes. This is a step towards replacing ad-hoc per-cpu zones with general slabs.	2020-11-05 15:08:56 +00:00
Mark Johnston	f7db0c9532	vmspace: Convert to refcount(9) This is mostly mechanical except for vmspace_exit(). There, use the new refcount_release_if_last() to avoid switching to vmspace0 unless other processes are sharing the vmspace. In that case, upon switching to vmspace0 we can unconditionally release the reference. Remove the volatile qualifier from vm_refcnt now that accesses are protected using refcount(9) KPIs. Reviewed by: alc, kib, mmel MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27057	2020-11-04 16:30:56 +00:00
Alan Cox	ccfd886a1b	Conditionally compile struct vm_phys_seg's md_first field. This field is only used by arm64's pmap. Reviewed by: kib, markj, scottph Differential Revision: https://reviews.freebsd.org/D26907	2020-10-23 06:24:38 +00:00
Ed Maste	575a4437a9	uma: fix KTR message after r366840 Reported by: bz Sponsored by: The FreeBSD Foundation	2020-10-19 18:54:44 +00:00
Mark Johnston	f09cbea31a	uma: Respect uk_reserve in keg_drain() When a reserve of free items is configured for a zone, the reserve must not be reclaimed under memory pressure. Modify keg_drain() to simply respect the reserved pool. While here remove an always-false uk_freef == NULL check (kegs that shouldn't be drained should set _NOFREE instead), and make sure that the keg_drain() KTR statement does not reference an uninitialized variable. Reviewed by: alc, rlibby Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26772	2020-10-19 16:57:40 +00:00
Mark Johnston	1b2dcc8c54	uma: Avoid depleting keg reserves when filling a bucket zone_import() fetches a free or partially free slab from the keg and then uses its items to populate an array, typically filling a bucket. If a single allocation causes the keg to drop below its minimum reserve, the inner loop ends. However, if the bucket is still not full and M_USE_RESERVE is specified, the outer loop will continue to fetch items from the keg. If M_USE_RESERVE is specified and the number of free items is below the reserved limit, we should return only a single item. Otherwise, if the bucket size is larger than the reserve, all of the reserved items may end up in a single per-CPU bucket, invisible to other CPUs. Reviewed by: rlibby MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26771	2020-10-19 16:55:03 +00:00
Konstantin Belousov	6f3b523c9a	Avoid dump_avail[] redefinition. Move dump_avail[] extern declaration and inlines into a new header vm/vm_dumpset.h. This fixes default gcc build for mips. Reviewed by: alc, scottph Tested by: kevans (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26741	2020-10-14 22:51:40 +00:00
Bryan Drewery	c2c6fb90e0	Use unlocked page lookup for inmem() to avoid object lock contention Reviewed By: kib, markj Submitted by: mlaier Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D26653	2020-10-09 23:49:42 +00:00
Konstantin Belousov	42f96162c3	vm_page_dump_index_to_pa(): Add braces to the expression involving + and &. The precedence of the '&' operator is less than of '+'. Added braces do change the order of evaluation into the natural one, in my opinion. On the other hand, the value of the expression should not change since all elements should have page-aligned values. This fixes a gcc warning reported. Reported by: adrian Sponsored by: The FreeBSD Foundation MFC after: 1 week	2020-10-08 22:46:15 +00:00
Mark Johnston	2913cc4637	vm_pageout: Avoid rounding down the inactive scan target With helper page daemon threads, enabled by default in r364786, we divide the inactive target by the number of threads, rounding down, and sum the total number of pages freed by the threads. This sum is compared with the original target, but by rounding down we might lose pages, causing the page daemon control loop to conclude that inactive queue scanning isn't keeping up with demand for free pages. Typically this results in excessive swapping. Fix the problem by accounting for the error in the main pagedaemon thread's target. Note that by default the problem will manifest only in systems with >16 CPUs in a NUMA domain. Reviewed by: cem Discussed with: dougm Reported and tested by: dhw, glebius Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26610	2020-10-02 19:16:06 +00:00
Mark Johnston	06d8bdcbf7	uma: Use the bucket cache for cross-domain allocations uma_zalloc_domain() allocates from the requested domain instead of following a first-touch policy (the default for most zones). Currently it is only used by malloc_domainset(), and consumers free returned items with free(9) since r363834. Previously uma_zalloc_domain() worked by always going to the keg for an item. As a result, the use of UMA zone caches was unbalanced: we free items to the caches, but always allocate from the keg, skipping the caches. Make some effort to allocate from the UMA caches when performing a cross-domain allocation. This avoids blowing up the caches when something is performing many transient allocations with malloc_domainset(). Reported and tested by: dhw, glebius Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26427	2020-10-02 19:04:29 +00:00
Mark Johnston	5afdf5c1ca	uma: Use LIFO for non-SMR bucket caches When SMR was introduced, zone_put_bucket() was changed to always place full buckets at the end of the queue. However, it is generally preferable to use recently used buckets since their items are more likely to be resident in cache. So, for buckets that have no constraint on item reuse, use a last-in-first-out ordering as we did before. Reviewed by: rlibby Tested by: dhw, glebius Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26426	2020-10-02 19:04:09 +00:00
Mark Johnston	952c8964ba	uma: Remove newlines from panic messages Sponsored by: The FreeBSD Foundation	2020-10-02 19:03:42 +00:00
Mark Johnston	f31695cc64	Implement sparse core dumps Currently we allocate and map zero-filled anonymous pages when dumping core. This can result in lots of needless disk I/O and page allocations. This change tries to make the core dumper more clever and represent unbacked ranges of virtual memory by holes in the core dump file. Add a new page fault type, VM_FAULT_NOFILL, which causes vm_fault() to clean up and return an error when it would otherwise map a zero-filled page. Then, in the core dumper code, prefault all user pages and handle errors by simply extending the size of the core file. This also fixes a bug related to the fact that vn_io_fault1() does not attempt partial I/O in the face of errors from vm_fault_quick_hold_pages(): if a truncated file is mapped into a user process, an attempt to dump beyond the end of the file results in an error, but this means that valid pages immediately preceding the end of the file might not have been dumped either. The change reduces the core dump size of trivial programs by a factor of ten simply by excluding unaccessed libc.so pages. PR: 249067 Reviewed by: kib Tested by: pho MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26590	2020-10-02 17:50:22 +00:00
Mark Johnston	114484b7ec	Flag vm_reserv and vm_phys sysctls as MPSAFE. Nothing in these subsystems relies on Giant. MFC after: 1 week	2020-09-23 19:36:07 +00:00
Mark Johnston	78257765f2	Add a vmparam.h constant indicating pmap support for large pages. Enable SHM_LARGEPAGE support on arm64. Reviewed by: alc, kib Sponsored by: Juniper Networks, Inc., Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26467	2020-09-23 19:34:21 +00:00
D Scott Phillips	de03184698	arm64/pmap: Sparsify pv_table Reviewed by: markj, kib Approved by: scottl (implicit) MFC after: 1 week Sponsored by: Ampere Computing, Inc. Differential Revision: https://reviews.freebsd.org/D26132	2020-09-21 22:23:57 +00:00
D Scott Phillips	7988971a99	vm_reserv: Sparsify the vm_reserv_array when VM_PHYSSEG_SPARSE On an Ampere Altra system, the physical memory is populated sparsely within the physical address space, with only about 0.4% of physical addresses backed by RAM in the range [0, last_pa]. This is causing the vm_reserv_array to be over-sized by a few orders of magnitude, wasting roughly 5 GiB on a system with 256 GiB of RAM. The sparse allocation of vm_reserv_array is controlled by defining VM_PHYSSEG_SPARSE, with the dense allocation still remaining for platforms with VM_PHYSSEG_DENSE. Reviewed by: markj, alc, kib Approved by: scottl (implicit) MFC after: 1 week Sponsored by: Ampere Computing, Inc. Differential Revision: https://reviews.freebsd.org/D26130	2020-09-21 22:22:53 +00:00
D Scott Phillips	00e6614750	Sparsify the vm_page_dump bitmap On Ampere Altra systems, the sparse population of RAM within the physical address space causes the vm_page_dump bitmap to be much larger than necessary, increasing the size from ~8 Mib to > 2 Gib (and overflowing `int` for the size). Changing the page dump bitmap also changes the minidump file format, so changes are also necessary in libkvm. Reviewed by: jhb Approved by: scottl (implicit) MFC after: 1 week Sponsored by: Ampere Computing, Inc. Differential Revision: https://reviews.freebsd.org/D26131	2020-09-21 22:21:59 +00:00
D Scott Phillips	ab041f713a	Move vm_page_dump bitset array definition to MI code These definitions were repeated by all architectures, with small variations. Consolidate the common definitions in machine independent code and use bitset(9) macros for manipulation. Many opportunities for deduplication remain in the machine dependent minidump logic. The only intended functional change is increasing the bit index type to vm_pindex_t, allowing the indexing of pages with address of 8 TiB and greater. Reviewed by: kib, markj Approved by: scottl (implicit) MFC after: 1 week Sponsored by: Ampere Computing, Inc. Differential Revision: https://reviews.freebsd.org/D26129	2020-09-21 22:20:37 +00:00
Eric van Gyzen	f9cc8410e1	vm_ooffset_t is now unsigned vm_ooffset_t is now unsigned. Remove some tests for negative values, or make other adjustments accordingly. Reported by: Coverity Reviewed by: kib markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D26214	2020-09-18 16:48:08 +00:00
Mark Johnston	97458520cc	Increase the default vm.max_user_wired value. Since r347532 (merged to stable/12) we only count user-wired pages towards the system limit. However, we now also treat pages wired by hypervisors (bhyve and virtualbox) as user-wired, so starting VMs with large amounts of RAM tends to fail due to the low limit. The purpose of the limit is to provide a seatbelt, not to impose some policy on the use of wired memory. Thus, increase the default limit to allow reasonable VM configurations to work without tuning. Reviewed by: kib Discussed with: dougm MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26424	2020-09-17 16:49:28 +00:00
Konstantin Belousov	d301b3580f	Support for userspace non-transparent superpages (largepages). Created with shm_open2(SHM_LARGEPAGE) and then configured with FIOSSHMLPGCNF ioctl, largepages posix shared memory objects guarantee that all userspace mappings of it are served by superpage non-managed mappings. Only amd64 for now, both 2M and 1G superpages can be requested, the latter requires CPU feature. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 22:12:51 +00:00
Konstantin Belousov	e2e80fb3de	vm_map: Add a map entry kind that can only be clipped at specific boundary. The entries and their clip boundaries must be aligned on supported superpages sizes from pagesizes[]. vm_map operations return Mach error KERN_INVALID_ARGUMENT, which is usually translated to EINVAL, if it would require clip not at the boundary. In other words, entries force preserving virtual addresses superpage properties. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 22:02:30 +00:00
Konstantin Belousov	6cadbcd203	Add pmap_enter(9) PMAP_ENTER_LARGEPAGE flag and implement it on amd64. The flag requests entry of non-managed superpage mapping of size pagesizes[psind] into the page table. Pmap supports fake wiring of the largepage mappings. Only attributes of the largepage mapping can be changed by calling pmap_enter(9) over existing mapping, physical address of the page must be unchanged. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 21:50:24 +00:00
Konstantin Belousov	7a9f2da33c	Add vm_map_find_aligned(9). Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 21:44:59 +00:00
Konstantin Belousov	60cd9c95c5	Move MAP_32BIT_MAX_ADDR definition to sys/mman.h. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 21:39:06 +00:00
Konstantin Belousov	e8f77c204b	Prepare to handle non-trivial errors from vm_map_delete(). Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 21:34:31 +00:00
Konstantin Belousov	a720b31c2a	Allow consumer to customize physical pager. Add support for user-supplied callbacks into phys pager operations, providing custom getpages(), haspage(), and populate() methods implementations. Pager stores user data ptr/val in the object to provide context. Add phys_pager_allocate() helper that takes user ops table as one of the arguments. Current code for these methods is moved to the 'default' ops table, assigned automatically when vm_pager_alloc() is used. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 00:00:43 +00:00
Konstantin Belousov	67a659d282	Add kern_mmap_racct_check(), a helper to verify limits in vm_mmap*(). Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-08 23:48:19 +00:00
Konstantin Belousov	89d2fb14d5	Add interruptible variant of vm_wait(9), vm_wait_intr(9). Also add msleep flags argument to vm_wait_doms(9). Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-08 23:28:09 +00:00
Mark Johnston	aec9e7d8b0	vm_object_split(): Handle orig_object type changes. orig_object->type can change from OBJT_DEFAULT to OBJT_SWAP while vm_object_split() is sleeping. In this case some pages in new_object may be left unbusied, but vm_object_split() attempts to unbusy all of them. Track the beginning of the busied range. Add an assertion to verify that pages are not re-added to the source object while sleeping. Reported by: Olympios Petrakis <olympios.petrakis@netapp.com> Reviewed by: alc, kib Tested by: pho MFC after: 1 week Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26223	2020-09-07 23:28:33 +00:00
Mark Johnston	a2d704d19f	Avoid unnecessary object locking in vm_page_grab_pages_unlocked(). We were needlessly acquiring the object lock to call vm_page_grab_pages() even when all of the requested pages were looked up locklessly. Fix that, stop testing for count == 0 in vm_page_grab_pages(), and add assertions to help catch this kind of mistake. Reported by: cem Reviewed by: alc, cem, dougm, jeff Differential Revision: https://reviews.freebsd.org/D26304	2020-09-02 19:59:25 +00:00
Mark Johnston	847ab36bf2	Include the psind in data returned by mincore(2). Currently we use a single bit to indicate whether the virtual page is part of a superpage. To support a forthcoming implementation of non-transparent 1GB superpages, it is useful to provide more detailed information about large page sizes. The change converts MINCORE_SUPER into a mask for MINCORE_PSIND(psind) values, indicating a mapping of size psind, where psind is an index into the pagesizes array returned by getpagesizes(3), which in turn comes from the hw.pagesizes sysctl. MINCORE_PSIND(1) is equal to the old value of MINCORE_SUPER. For now, two bits are used to record the page size, permitting values of MAXPAGESIZES up to 4. Reviewed by: alc, kib Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26238	2020-09-02 18:16:43 +00:00
Mateusz Guzik	c3aa3bf97c	vm: clean up empty lines in .c and .h files	2020-09-01 21:20:45 +00:00
Vladimir Kondratyev	5d4bf0578f	LinuxKPI: Implement ksize() function. In Linux, ksize() gets the actual amount of memory allocated for a given object. This commit adds malloc_usable_size() to FreeBSD KPI which does the same. It also maps LinuxKPI ksize() to newly created function. ksize() function is used by drm-kmod. Reviewed by: hselasky, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D26215	2020-08-29 19:26:31 +00:00
Eric van Gyzen	609de97e04	vm_pageout_scan_active: ensure ps_delta is initialized Reported by: Coverity Reviewed by: markj MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D26212	2020-08-28 19:59:02 +00:00
Eric van Gyzen	a2e194654f	memstat_kvm_uma: fix reading of uma_zone_domain structures Coverity flagged the scaling by sizeof(uzd). That is the type of the pointer, so the scaling was already done by pointer arithmetic. However, this was also passing a stack frame pointer to kvm_read, so it was doubly wrong. Move ZDOM_GET into the !_KERNEL section and use it in libmemstat. Reported by: Coverity Reviewed by: markj MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D26213	2020-08-28 19:50:40 +00:00
Mark Johnston	aea9103e06	Use a large kmem arena import size on NUMA systems. This helps minimize internal fragmentation that occurs when 2MB imports are interleaved across NUMA domains. Virtually all KVA allocations on direct map platforms consume more than one page, so the fragmentation manifests as runs of 511 4KB page mappings in the kernel. Reviewed by: alc, kib Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26050	2020-08-26 14:31:48 +00:00
Conrad Meyer	74f5530d7a	vm_pageout: Scale worker threads with CPUs Autoscale vm_pageout worker threads from r364129 with CPU count. The default is arbitrarily chosen to be 16 CPUs per worker thread, but can be adjusted with the vm.pageout_cpus_per_thread tunable. There will never be less than 1 thread per populated NUMA domain, and the previous arbitrary upper limit (at most ncpus/2 threads per NUMA domain) is preserved. Care is taken to gracefully handle asymmetric NUMA nodes, such as empty node systems (e.g., AMD 2990WX) and systems with nodes of varying size (e.g., some larger >20 core Intel Haswell/Broadwell Xeon). Reviewed by: kib, markj Sponsored by: Isilon Differential Revision: https://reviews.freebsd.org/D26152	2020-08-25 21:36:56 +00:00
Mark Johnston	411096d034	Permit vm_page_wire() to be called on pages not belonging to an object. For such pages ref_count is effectively a consumer-managed field, but there is no harm in calling vm_page_wire() on them. vm_page_unwire_noq() handles them as well. Relax the vm_page_wire() assertions to permit this case which is triggered by some out-of-tree code. [1] Also guard a conditional assertion with INVARIANTS. Otherwise the conditions are evaluated even though the result is unused. [2] Reported by: bz, cem [1], kib [2] Reviewed by: dougm, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26173	2020-08-25 13:45:06 +00:00
Matt Macy	9e5787d228	Merge OpenZFS support in to HEAD. The primary benefit is maintaining a completely shared code base with the community allowing FreeBSD to receive new features sooner and with less effort. I would advise against doing 'zpool upgrade' or creating indispensable pools using new features until this change has had a month+ to soak. Work on merging FreeBSD support in to what was at the time "ZFS on Linux" began in August 2018. I first publicly proposed transitioning FreeBSD to (new) OpenZFS on December 18th, 2018. FreeBSD support in OpenZFS was finally completed in December 2019. A CFT for downstreaming OpenZFS support in to FreeBSD was first issued on July 8th. All issues that were reported have been addressed or, for a couple of less critical matters there are pull requests in progress with OpenZFS. iXsystems has tested and dogfooded extensively internally. The TrueNAS 12 release is based on OpenZFS with some additional features that have not yet made it upstream. Improvements include: project quotas, encrypted datasets, allocation classes, vectorized raidz, vectorized checksums, various command line improvements, zstd compression. Thanks to those who have helped along the way: Ryan Moeller, Allan Jude, Zack Welch, and many others. Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D25872	2020-08-25 02:21:27 +00:00
Mateusz Guzik	feabaaf995	cache: drop the always curthread argument from reverse lookup routines Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs. Tested by: pho	2020-08-24 08:57:02 +00:00
Andrew Gallatin	791dda877f	uma: record allocation failures due to zone limits The zone limit mechanism was recently reworked, and allocation failures due to limits being exceeded were inadvertently no longer being recorded. This would lead to, for example, mbuf allocation failures not being indicated in netstat -m or vmstat -z Reviewed by: markj Sponsored by: Netflix	2020-08-21 18:31:57 +00:00
Mateusz Guzik	7ad2a82da2	vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error Most consumers pass NULL.	2020-08-19 02:51:17 +00:00
Mark Johnston	b21b022a81	Revert r364310. Some of the resulting fallout in CAM does not appear straightforward to fix, so simply revert the commit for now in the absence of a better solution. Discussed with: mjg Reported by: dhw	2020-08-18 14:09:49 +00:00
Gleb Smirnoff	1921bb7b68	With INVARIANTS panic immediately if M_WAITOK is requested in a non-sleepable context. Previously only _sleep() would panic. This will catch misuse of M_WAITOK at development stage rather than at stress load stage. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D26027	2020-08-17 15:37:08 +00:00
Mark Johnston	7efe14cb99	Commit a missing piece of r364302. This had failed to apply due to a merge conflict. Reported by: Jenkins MFC with: r364302	2020-08-17 14:06:51 +00:00
Mark Johnston	7dd979dfef	Remove the VM map zone. Today, the zone is only used to allocate a trio of kernel maps: the kernel map itself, and the exec and pipe submaps. Maps for user processes are dynamically allocated but are embedded in the vmspace structure, which is allocated from its own zone. Make the aforementioned kernel maps statically allocated and get rid of the zone. While here, remove a stale comment above vmspace_alloc() and change the names of locks initialized in vm_map_init() to match vmspace_zinit(). Reported by: alc Reviewed by: alc, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26052	2020-08-17 13:02:01 +00:00
Konstantin Belousov	ffae7ea935	vm_object: allow paging_in_progress to be acquired after object termination. The vm objects are type-stable, and can be accessed even after the last reference is dropped, or in case of vnode objects, after vgone() destroyed it as well. Stop asserting that pip == 0 after vm_object_terminate() waited for existing owners to drop it, we only want to drain them before setting OBJ_DEAD flag. Also stop asserting pip == 0 in object destructor. Update comments explaining the interaction between paging_in_progress and termination. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D25968	2020-08-16 20:57:02 +00:00
Konstantin Belousov	419e5698a0	Atomically update vm_object vnp_size, where atomic is available. This will be used later, where it matters on 32bit arches. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D25968	2020-08-16 20:52:24 +00:00
Mateusz Guzik	a92a971bbb	vfs: remove the thread argument from vget It was already asserted to be curthread. Semantic patch: @@ expression arg1, arg2, arg3; @@ - vget(arg1, arg2, arg3) + vget(arg1, arg2)	2020-08-16 17:18:54 +00:00
Conrad Meyer	ea7b737a6f	vm_pageout: Correct threshold calculation on single-CPU systems Reported by: Michael Butler X-MFC-With: r364129	2020-08-14 18:48:48 +00:00
Conrad Meyer	b7883452d4	Back out unrelated change Reported by: kib, markj X-MFC-With: r364129	2020-08-12 00:21:30 +00:00
Conrad Meyer	0292c54bdb	Add support for multithreading the inactive queue pageout within a domain. In very high throughput workloads, the inactive scan can become overwhelmed as you have many cores producing pages and a single core freeing. Since Mark's introduction of batched pagequeue operations, we can now run multiple inactive threads working on independent batches. To avoid confusing the pid and other control algorithms, I (Jeff) do this in a mpi-like fan out and collect model that is driven from the primary page daemon. It decides whether the shortfall can be overcome with a single thread and if not dispatches multiple threads and waits for their results. The heuristic is based on timing the pageout activity and averaging a pages-per-second variable which is exponentially decayed. This is visible in sysctl and may be interesting for other purposes. I (Jeff) have verified that this does indeed double our paging throughput when used with two threads. With four we tend to run into other contention problems. For now I would like to commit this infrastructure with only a single thread enabled. The number of worker threads per domain can be controlled with the 'vm.pageout_threads_per_domain' tunable. Submitted by: jeff (earlier version) Discussed with: markj Tested by: pho Sponsored by: probably Netflix (based on contemporary commits) Differential Revision: https://reviews.freebsd.org/D21629	2020-08-11 20:37:45 +00:00
Mark Johnston	af32cefd7c	Check the UMA zone's full bucket cache before short-circuiting an alloc. The global "bucketdisable" flag indicates that we are in a low memory situation and should avoid allocating buckets. However, in the allocation path we were checking it before the full bucket cache and bailing even if the cache is non-empty. Defer the check so that we have a shot at allocating from the cache. This came up because M_NOWAIT allocations from the buf trie node zone must always succeed. In one scenario, all of the preallocated trie nodes were in the bucket list, and a new slab allocation could not succeed due to a memory shortage. The short-circuiting caused an allocation failure which triggered a panic. Reported by: pho Reviewed by: cem Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25980	2020-08-10 20:34:45 +00:00
Brooks Davis	9f9cc3f989	Preserve ASLR vm_map flags across fork In the most common case (fork+execve) this doesn't matter, but further attempts to apply entropy would fail in (e.g.) a pre-fork server. Reported by: Alfredo Mazzinghi Reviewed by: kib, markj Obtained from: CheriBSD MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D25966	2020-08-06 16:20:20 +00:00
Mark Johnston	efec381dd1	Remove most lingering references to the page lock in comments. Finish updating comments to reflect new locking protocols introduced over the past year. In particular, vm_page_lock is now effectively unused. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25868	2020-08-04 14:59:43 +00:00
Mark Johnston	96ad26eefb	Remove free_domain() and uma_zfree_domain(). These functions were introduced before UMA started ensuring that freed memory gets placed in domain-local caches. They no longer serve any purpose since UMA now provides their functionality by default. Remove them to simplify the kernel memory allocator interfaces a bit. Reviewed by: cem, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25937	2020-08-04 13:58:36 +00:00
Mark Johnston	958d8f527c	Remove the volatile qualifier from busy_lock. Use atomic(9) to load the lock state. Some places were doing this already, so it was inconsistent. In initialization code, the lock state is still initialized with plain stores. Reviewed by: alc, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25861	2020-07-29 19:38:49 +00:00
Mark Johnston	f72e5be58a	vm_page_xbusy_claim(): Use atomics to update busy lock state. vm_page_xbusy_claim() could clobber the waiter bit. For its original use, kernel memory pages, this was not a problem since nothing would ever block on the busy lock for such pages. r363607 introduced a new use where this could in principle be a problem. Fix the problem by using atomic_cmpset to update the lock owner. Since this macro is defined only for INVARIANTS kernels the extra overhead doesn't seem prohibitive. Reported by: vangyzen Reviewed by: alc, kib, vangyzen Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25859	2020-07-28 19:50:39 +00:00
Mark Johnston	782ebde52e	vm_page_free_invalid(): Relax the xbusy assertion. vm_page_assert_xbusied() asserts that the busying thread is the current thread. For some uses of vm_page_free_invalid() (e.g., error handling in vnode_pager_generic_getpages_done()), this condition might not hold. Reported by: Jenkins via trasz Reviewed by: chs, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25828	2020-07-27 14:25:10 +00:00
Doug Moore	00fd73d2da	Fix an overflow bug in the blist allocator that needlessly capped max swap size by dividing a value, which was always a multiple of 64, by 64. Remove the code that reduced max swap size down to that cap. Eliminate the distinction between BLIST_BMAP_RADIX and BLIST_META_RADIX. Call them both BLIST_RADIX. Make improvements to the blist self-test code to silence compiler warnings and to test larger blists. Reported by: jmallett Reviewed by: alc Discussed with: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D25736	2020-07-25 18:29:10 +00:00
Mateusz Guzik	ee74412269	vm: fix swap reservation leak and clean up surrounding code The code did not subtract from the global counter if per-uid reservation failed. Cleanup highlights: - load overcommit once - move per-uid manipulation to dedicated routines - don't fetch wire count if requested size is below the limit - convert return type from int to bool - ifdef the routines with _KERNEL to keep vm.h compilable by userspace Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D25787	2020-07-24 13:23:32 +00:00
Mateusz Guzik	126a2470b9	vm: annotate swap_reserved with __exclusive_cache_line The counter keeps being updated all the time and variables read afterwards share the cacheline. Note this still fundamentally does not scale and needs to be replaced, in the meantime gets a bandaid. brk1_processes -t 52 ops/s: before: 8598298 after: 9098080	2020-07-23 08:42:16 +00:00
Chuck Silvers	1bd12a3bb2	Fix vnode_pager handling of read ahead/behind pages when a disk read fails. Rather than marking the read ahead/behind pages valid even though they were not initialized, free them using the new function vm_page_free_invalid(). Reviewed by: markj, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D25430	2020-07-17 23:10:35 +00:00
Chuck Silvers	4dfa06e114	Add a new function vm_page_free_invalid() for freeing invalid pages that might be wired. If the page is wired then it cannot be freed now, but the thread that eventually unwires it will free it at that point. Reviewed by: markj, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D25430	2020-07-17 23:09:36 +00:00
Chuck Silvers	c3dbadc1fd	Revert my change from r361855 in favor of a better fix. Reviewed by: markj, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D25430	2020-07-17 23:08:01 +00:00
Mark Johnston	a7752896f0	Add vm_map_valid_range_KBI(). This is required for standalone module builds. Reported by: hselasky Reviewed by: dougm, hselasky, kib MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D25650	2020-07-13 16:39:27 +00:00
Scott Long	ffc568ba8b	Revert r362998, r326999 while a better compatibility strategy is devised.	2020-07-09 22:38:36 +00:00
Scott Long	b302c2e5c9	Migrate the feature of excluding RAM pages to use "excludelist" as its nomenclature. MFC after: 1 week	2020-07-07 20:33:11 +00:00
Conrad Meyer	8a64110e43	vm: Add missing WITNESS warnings for M_WAITOK allocation vm_map_clip_{end,start} and lookup_clip_start allocate memory M_WAITOK for !system_map vm_maps. Add WITNESS warning annotation for !system_map callers who may be holding non-sleepable locks. Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D25283	2020-06-29 16:54:00 +00:00
Mark Johnston	8c277118d8	Fix UMA's first-touch policy on systems with empty domains. Suppose a thread is running on a CPU in a NUMA domain with no physical RAM. When an item is freed to a first-touch zone, it ends up in the cross-domain bucket. When the bucket is full, it gets placed in another domain's bucket queue. However, when allocating an item, UMA will always go to the keg upon a per-CPU cache miss because the empty domain's bucket queue will always be empty. This means that a non-empty domain's bucket queues can grow very rapidly on such systems. For example, it can easily cause mbuf allocation failures when the zone limit is reached. Change cache_alloc() to follow a round-robin policy when running on an empty domain. Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25355	2020-06-28 21:35:04 +00:00
Konstantin Belousov	ee06cffcd2	vm_page_free_prep(): correct description of the required page and object state. Reviewed by: markj Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D25482	2020-06-27 02:31:39 +00:00
Mark Johnston	84242cf68a	Call swap_pager_freespace() from vm_object_page_remove(). All vm_object_page_remove() callers, except linux_invalidate_mapping_pages() in the LinuxKPI, free swap space when removing a range of pages from an object. The LinuxKPI case appears to be an unintentional omission that could result in leaked swap blocks, so unconditionally free swap space in vm_object_page_remove() to protect against similar bugs in the future. Reviewed by: alc, kib Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25329	2020-06-25 15:21:21 +00:00
Jeff Roberson	c8b0a88b8d	Clarify some language. Favor primary where both master and primary were used in conjunction with secondary.	2020-06-20 20:21:04 +00:00
Edward Tomasz Napierala	52c81be11a	Add linux_madvise(2) instead of having Linux apps call the native FreeBSD madvise(2) directly. While some of the flag values match, most don't. PR: kern/230160 Reported by: markj Reviewed by: markj Discussed with: brooks, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25272	2020-06-20 18:29:22 +00:00
Mark Johnston	cdd02f43b9	Revert r362360. This commit was simply wrong since two different objects are locked. Reported by: lwhsu, pho Pointy hat: markj	2020-06-19 11:04:49 +00:00
Mark Johnston	f034074034	Restore a check unintentionally dropped in r362361. MFC with: r362361	2020-06-19 04:18:20 +00:00
Mark Johnston	0f1e6ec591	Add a helper function for validating VA ranges. Functions which take untrusted user ranges must validate against the bounds of the map, and also check for wraparound. Instead of having the same logic duplicated in a number of places, add a function to check. Reviewed by: dougm, kib Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D25328	2020-06-19 03:32:04 +00:00
Mark Johnston	61b006887e	Fix a double object unlock in vm_object_backing_collapse_wait(). Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25327	2020-06-19 03:31:46 +00:00
Conrad Meyer	a116b5d3e4	vm: Drop vm_map_clip_{start,end} macro wrappers No functional change. Reviewed by: dougm, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D25282	2020-06-16 22:53:56 +00:00
Eric van Gyzen	8cc8c5864a	Honor db_pager_quit in some vm_object ddb commands These can be rather verbose. MFC after: 2 weeks Sponsored by: Dell EMC Isilon	2020-06-12 21:53:08 +00:00
Mateusz Guzik	7ce3a31286	vm: rework swap_pager_status to execute in constant time The lock-protected iteration is trivially avoidable. This removes a serialisation point from Linux binaries (which end up calling here from the sysinfo syscall).	2020-06-09 14:16:18 +00:00
Chuck Silvers	bd7d64f548	Don't mark pages as valid if reading the contents from disk fails. Instead, just skip marking pages valid if the read fails. Future attempts to access such pages will notice that they are not marked valid and try to read them from disk again. Reviewed by: kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D25138	2020-06-06 00:47:59 +00:00
Ed Maste	4d13f78444	Correct terminology in vm.imply_prot_max sysctl description As with r361769 (man page), PROT_* are properly called protections, not permissions. MFC after: 1 week MFC with: r361769 Sponsored by: The FreeBSD Foundation	2020-06-04 01:49:29 +00:00
Mateusz Guzik	1c58c09f5a	uma: hide item_domain under ifdef NUMA Fixes build warnings on mips.	2020-05-29 08:30:35 +00:00
Mark Johnston	81302f1d77	Fix boot on systems where NUMA domain 0 is unpopulated. - Add vm_phys_early_add_seg(), complementing vm_phys_early_alloc(), to ensure that segments registered during hammer_time() are placed in the right domain. Otherwise, since the SRAT is not parsed at that point, we just add them to domain 0, which may be incorrect and results in a domain with only several MB worth of memory. - Fix uma_startup1() to try allocating memory for zones from any domain. If domain 0 is unpopulated, the allocation will simply fail, resulting in a page fault slightly later during boot. - Change _vm_phys_domain() to return -1 for addresses not covered by the affinity table, and change vm_phys_early_alloc() to handle wildcard domains. This is necessary on amd64, where the page array is dense and pmap_page_array_startup() may allocate page table pages for non-existent page frames. Reported and tested by: Rafael Kitover <rkitover@gmail.com> Reviewed by: cem (earlier version), kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25001	2020-05-28 19:41:00 +00:00
Konstantin Belousov	fe0dcc402f	Simplify the condition to enable superpage mappings in vm_fault_soft_fast(). The list of arches listed there matches the list of arches where default VM_NRESERVLEVEL > 0. Before sparc64 removal, that was the only arch that defined VM_NRESERVLEVEL > 0 to help with cache coloring, but did not implement superpages. Now it can be simplified. Submitted by: alc Reviewed by: markj	2020-05-27 21:44:26 +00:00
Justin Hibbits	d4ed51f329	Properly sort ifdef archs in vm_fault_soft_fast superpage guards. Sort broken in r360887.	2020-05-27 01:35:46 +00:00
Mark Johnston	dc2b320563	Allocate UMA per-CPU counters earlier. Otherwise anything counted before SI_SUB_VM_CONF is discarded. However, it is useful to be able to see stats from allocations done early during boot. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24756	2020-05-14 16:06:54 +00:00
Kyle Evans	c79cee7136	kernel: provide panicky version of __unreachable __builtin_unreachable doesn't raise any compile-time warnings/errors on its own, so problems with its usage can't be easily detected. While it would be nice for this situation to change and compilers to at least add a warning for trivial cases where local state means the instruction can't be reached, this isn't the case at the moment and likely will not happen. This commit adds an __assert_unreachable, whose intent is incredibly clear: it asserts that this instruction is unreachable. On INVARIANTS builds, it's a panic(), and on non-INVARIANTS it expands to __unreachable(). Existing users of __unreachable() are converted to __assert_unreachable, to improve debuggability if this assumption is violated. Reviewed by: mjg Differential Revision: https://reviews.freebsd.org/D23793	2020-05-13 18:07:37 +00:00
Justin Hibbits	65bbba25d2	powerpc64: Implement Radix MMU for POWER9 CPUs Summary: POWER9 supports two MMU formats: traditional hashed page tables, and Radix page tables, similar to what's present on most other architectures. The PowerISA also specifies a process table -- a table of page table pointers -- which on the POWER9 is only available with the Radix MMU, so we can take advantage of it with the Radix MMU driver. Written by Matt Macy. Differential Revision: https://reviews.freebsd.org/D19516	2020-05-11 02:33:37 +00:00
Mark Johnston	a9ea09e548	Re-check for wirings after busying the page in vm_page_release_locked(). A concurrent unlocked lookup can wire the page after vm_page_release_locked() releases the last wiring, in which case vm_page_release_locked() must not free the page. Once the xbusy lock is acquired, that, the object lock and the fact that the page is unmapped ensure that the wire count cannot increase, so re-check for new wirings after the page is xbusied. Update the comment above vm_page_wired() to reflect the new synchronization rules. Reported by: glebius Reviewed by: alc, jeff, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24592	2020-04-28 13:51:41 +00:00
Mark Johnston	f13fa9df05	Use a single VM object for kernel stacks. Previously we allocated a separate VM object for each kernel stack. However, fully constructed kernel stacks are cached by UMA, so there is no harm in using a single global object for all stacks. This reduces memory consumption and makes it easier to define a memory allocation policy for kernel stack pages, with the aim of reducing physical memory fragmentation. Add a global kstack_object, and use the stack KVA address to index into the object like we do with kernel_object. Reviewed by: kib Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24473	2020-04-26 20:08:57 +00:00
Mark Johnston	33655d9546	Factor out the kmem contig page alloc and reclamation code. kmem_alloc_attr_domain() and kmem_alloc_contig_domain() duplicated each other's page allocation and reclamation logic. Place it in a single function to make it easier to add additional consumers. No functional change intended. Reviewed by: jeff, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24475	2020-04-21 16:01:44 +00:00
Mark Johnston	303b77029b	Minimize conditional compilation for handling of M_EXEC. This simplifies some planned changes. No functional change intended. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24474	2020-04-21 15:55:28 +00:00
Mark Johnston	70e68b19a4	Handle trashed queue pointers in vm_page_acquire_unlocked(). vm_page_acquire_unlocked() relies on type-stability of vm_page structures and assumes that the listq linkage pointers always point to a vm_page or are NULL. QUEUE_MACRO_DEBUG_TRASH breaks that assumption, so add an explicit check for a trashed queue pointer before dereferencing. Reported and tested by: pho Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24472	2020-04-20 14:45:17 +00:00
Bryan Drewery	adc0388117	Remove dead code leftover from r331018. Sponsored by: Dell EMC	2020-03-31 01:12:53 +00:00
Konstantin Belousov	abfdf76791	VOP_GETPAGES_ASYNC(): consistently call iodone() callback in case of error. Reviewed by: glebius, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24038	2020-03-30 21:44:30 +00:00
Konstantin Belousov	a7c55b3e1b	ddb show pginfo: print pages reference value in hex. It is more useful this way after the VPRC_ flags were introduced. Sponsored by: The FreeBSD Foundation	2020-03-28 12:21:52 +00:00
Jeff Roberson	d1105e9441	Check for busy or wired in vm_page_relookup(). Some callers will only keep a page wired and expect it to still be present. Reported by: delphij@FreeBSD.org Reviewed by: kib	2020-03-11 22:25:45 +00:00
Mark Johnston	54007ce8ae	Clean up uma_int.h a bit. This makes it easier to write libkvm programs that access UMA data structures. - Remove a couple of unused slab functions and make others local to uma_core.c. Similarly move SLAB_BITSETS, which affects the layout of slab structures, to uma_core.c. - Stop defining the slab structures under _KERNEL. There's no real reason they can't be visible to userspace like the rest of UMA's structures are. - Group KEG_ASSERT_COLD with other keg macros. - Convert an assertion about MAXMEMDOM to use _Static_assert. No functional change intended. Discussed with: jeff Reviewed by: rlibby Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23980	2020-03-07 15:37:23 +00:00
Mark Johnston	3fba886874	Move SMR pointer type definition and access macros to smr_types.h. The intent is to provide a header that can be included by other headers without introducing too much pollution. smr.h depends on various headers and will likely grow over time, but is less likely to be required by system headers. Rename SMR_TYPE_DECLARE() to SMR_POINTER(): - One might use SMR to protect more than just pointers; it could be used for resizeable arrays, for example, so TYPE seems too generic. - It is useful to be able to define anonymous SMR-protected pointer types and the _DECLARE suffix makes that look wrong. Reviewed by: jeff, mjg, rlibby Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23988	2020-03-07 00:55:46 +00:00
Brooks Davis	3823a5990a	Remove an apparently incorrect assertion. Without this change mips64 fails to boot. Discussed with: markj Sponsored by: DARPA	2020-03-06 23:31:09 +00:00
Mark Johnston	d869a17e62	Use COUNTER_U64_DEFINE_EARLY() in places where it simplifies things. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23978	2020-03-06 19:10:00 +00:00
Brooks Davis	d718de812f	Introduce kern_mmap_req(). This presents an extensible interface to the generic mmap(2) implementation via a struct pointer intended to use a designated initializer or compound literal. We take advantage of the mandatory zeroing of fields not listed in the initializer. Remove kern_mmap_fpcheck() and use kern_mmap_req(). The motivation for this change is a desire to keep the core implementation from growing an ever-increasing number of arguments that must be specified in the correct order for the lowest-level implementations. In CheriBSD we have already added two more arguments. Reviewed by: kib Discussed with: kevans Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D23164	2020-03-04 21:27:12 +00:00
Mark Johnston	1ed42f6fdd	Avoid doubly wiring a newly allocated page in vm_page_grab_valid(). This fixes a regression from r358363. Reported by: manu, jbeich Tested by: jbeich	2020-03-01 22:09:11 +00:00
Mateusz Guzik	7f746c9fcc	vm: add debug to uma_zone_set_smr Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D23902	2020-03-01 21:49:16 +00:00
Jeff Roberson	6be21eb778	Provide a lock free alternative to resolve bogus pages. This is not likely to be much of a perf win, just a nice code simplification. Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D23866	2020-02-28 21:42:48 +00:00
Jeff Roberson	7aaf252c96	Convert a few trivial consumers to the new unlocked grab API. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23847	2020-02-28 20:34:30 +00:00
Jeff Roberson	3f39f80ab3	Support the NOCREAT flag for grab_valid_unlocked. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23865	2020-02-28 20:32:35 +00:00
Jeff Roberson	1a0c234eb2	Simplify vref() code in object_reference. The local temporary is no longer necessary. Fix formatting errors. Reported by: mjg Discussed with: kib	2020-02-28 20:30:53 +00:00
Mark Johnston	c99d0c5801	Add a blocking counter KPI. refcount(9) was recently extended to support waiting on a refcount to drop to zero, as this was needed for a lockless VM object paging-in-progress counter. However, this adds overhead to all uses of refcount(9) and doesn't really match traditional refcounting semantics: once a counter has dropped to zero, the protected object may be freed at any point and it is not safe to dereference the counter. This change removes that extension and instead adds a new set of KPIs, blockcount_*, for use by VM object PIP and busy. Reviewed by: jeff, kib, mjg Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23723	2020-02-28 16:05:18 +00:00
Jeff Roberson	fe835cbf5f	A pair of performance improvements. Swap buckets on free as well as alloc so that alloc is always the most cache-hot data. When selecting a zone domain for the round-robin bucket cache use the local domain unless there is a severe imbalance. This does not affinitize memory, only locks and queues. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D23824	2020-02-27 08:23:10 +00:00
Jeff Roberson	c49be4f1c6	Add unlocked grab* function variants that use lockless radix code to lookup pages. These variants will fall back to their locked counterparts if the page is not present. Discussed with: kib, markj Differential Revision: https://reviews.freebsd.org/D23449	2020-02-27 02:37:27 +00:00
Ed Maste	acb8858f05	Return ENOTSUP for mmap/mprotect if prot not subset of prot_max From POSIX, [ENOTSUP] The implementation does not support the combination of accesses requested in the prot argument. This fits the case that prot contains permissions which are not a subset of prot_max. Reviewed by: brooks, cem Relnotes: Yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23843	2020-02-26 20:03:43 +00:00
Pawel Biernacki	7029da5c36	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is a non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT. Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718	2020-02-26 14:26:36 +00:00
Doug Moore	36b01270d1	The last argument to swp_pager_getswapspace is always 1. Remove that argument. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23810	2020-02-24 04:01:09 +00:00
Mark Johnston	7ca5539285	Allow swap_pager_putpages() to allocate one block at a time. The minimum allocation size of 4 blocks is an old policy that came with the "new" swap pager in r42957. Since then the blist allocator has gotten better at reducing fragmentation; for example, with r349777 it can return a range that spans multiple leaves. When swap space is close to being exhausted, the minimum of 4 blocks most likely exacerbates memory pressure, so reduce it to 1. Reported by: alc Tested by: pho Reviewed by: alc, dougm, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23763	2020-02-23 17:59:51 +00:00
Ryan Libby	eaa17d4291	sys/vm: quiet -Wwrite-strings Discussed with: kib Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23796	2020-02-23 03:32:04 +00:00
Mark Johnston	0464f16e91	Constify uma_zcache_create() and uma_zsecond_create()'s "name" argument. It is already internally handled as a pointer to a const string, in particular by uma_zcreate(). Fix indentation while here. MFC after: 1 week	2020-02-22 17:44:28 +00:00
Kyle Evans	cef81f8f01	vm_radix: prefer __builtin_unreachable() to an unreachable panic() This provides the needed hint to GCC and offers an annotation for readers to observe that it's in fact impossible to hit this point. We'll get hit with a -Wswitch error if the enum applicable to the switch above were to get expanded without the new value(s) being handled.	2020-02-22 16:20:04 +00:00
Jeff Roberson	226dd6db47	Add an atomic-free tick moderated lazy update variant of SMR. This enables very cheap read sections with free-to-use latencies and memory overhead similar to epoch. On a recent AMD platform a read section cost 1ns vs 5ns for the default SMR. On Xeon the numbers should be more like 1 ns vs 11. The memory consumption should be proportional to the product of the free rate and 2*1/hz while normal SMR consumption is proportional to the product of free rate and maximum read section time. While here refactor the code to make future additions more straightforward. Name the overall technique Global Unbound Sequences (GUS) and adjust some comments accordingly. This helps distinguish discussions of the general technique (SMR) vs this specific implementation (GUS). Discussed with: rlibby, markj	2020-02-22 03:44:10 +00:00
Warner Losh	cafbf0c664	Don't convert all lower-layer errors to EIO. Don't convert all lower layer errors to EIO. Instead, pass the actual error up the stack. This will allow the upper layers that look for ENXIO to react properly to that signal from the lower layers and, for UFS, unmount the filesystem. Reviewed by: kib@ Differential Revision: https://reviews.freebsd.org/D23755	2020-02-20 01:33:01 +00:00
Warner Losh	65252dc903	Don't spam the console with an additional, and useless, error message. There's no need to spam the console with this error message. If there's an I/O error, the disk/cam driver will report it at the lower levels. If that's an actual problem, the upper layers will report that. Reviewed by: kib@ Differential Revision: https://reviews.freebsd.org/D23756	2020-02-20 00:34:46 +00:00
Jeff Roberson	4b3dac72b3	Silence a gcc warning about no return from a function that handles every possible enum in a switch statement. I verified that this emits nothing as expected on clang. radix relies on constant propagation to eliminate any branching from these access routines. Reported by: lwhsu/tinderbox	2020-02-19 22:34:22 +00:00
Jeff Roberson	1ddda2eb24	Use SMR to provide a safe unlocked lookup for vm_radix. The tree is kept correct for readers with store barriers and careful ordering. The existing object lock serializes writers. Consumers will be introduced in later commits. Reviewed by: markj, kib Differential Revision: https://reviews.freebsd.org/D23446	2020-02-19 19:58:31 +00:00
Jeff Roberson	c6fd3e23f7	Use per-domain locks for the bucket cache. This gives much better concurrency when there are a large number of cores per-domain and multiple domains. Avoid taking the lock entirely if it will not be productive. ROUNDROBIN domains will have mixed memory in each domain and will load balance to all domains. While here refactor the zone/domain separation and bucket limits to simplify callers. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23673	2020-02-19 18:48:46 +00:00
Jeff Roberson	e9ceb9dd11	Don't release xbusy on kmem pages. After lockless page lookup we will not be able to guarantee that they can be reacquired without blocking. Reviewed by: kib Discussed with: markj Differential Revision: https://reviews.freebsd.org/D23506	2020-02-19 09:10:11 +00:00
Jeff Roberson	6c5f36ff30	Eliminate some unnecessary uses of UMA_ZONE_VM. Only zones involved in virtual address or physical page allocation need to be marked with this flag. Reviewed by: markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D23712	2020-02-19 08:17:27 +00:00
Mark Johnston	34e2051faf	Remove swblk_t. It was used only to store the bounds of each swap device. However, since swblk_t is a signed 32-bit int and daddr_t is a signed 64-bit int, swp_pager_isondev() may return an invalid result if swap devices are repeatedly added and removed and sw_end for a device ends up becoming a negative number. Note that the removed comment about maximum swap size still applies. Reviewed by: jeff, kib Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23666	2020-02-17 15:11:07 +00:00
Mark Johnston	725b4ff001	Fix a swap block allocation race. putpages' allocation of swap blocks is done under the global sw_dev lock. Previously it would drop that lock before inserting the allocated blocks into the object's trie, creating a window in which swap blocks are allocated but are not visible to swapoff. This can cause swp_pager_strategy() to fail and panic the system. Fix the problem bluntly, by allocating swap blocks under the object lock. Reviewed by: jeff, kib Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23665	2020-02-17 15:10:41 +00:00
Mark Johnston	c90d075be4	Fix object locking races in swapoff(2). swap_pager_swapoff_object()'s goal is to allocate pages for all valid swap blocks belonging to the object, for which there is no resident page. If the page corresponding to a block is already resident and valid, the block can simply be discarded. The existing implementation tries to minimize the number of I/Os used. For each cluster of swap blocks, it finds maximal runs of valid swap blocks not resident in memory, and valid resident pages. During this processing, the object lock may be dropped in several places: when calling getpages, or when blocking on a busy page in vm_page_grab_pages(). While the lock is dropped, another thread may free swap blocks, causing getpages to page in stale data. Fix the problem following a suggestion from Jeff: use getpages' readahead capability to perform clustering rather than doing it ourselves. This simplifies the code a bit without reintroducing the old behaviour of performing one I/O per page. Reviewed by: jeff Reported by: dhw, gallatin Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23664	2020-02-17 15:09:40 +00:00
Jeff Roberson	ed581bf68f	Add a simple accessor that returns the bytes of memory consumed by a zone.	2020-02-17 01:59:55 +00:00
Jeff Roberson	f212367b42	Refactor _vm_page_busy_sleep to reduce the delta between the various sleep routines and introduce a variant that supports lockless sleep. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23612	2020-02-17 01:08:00 +00:00
Jeff Roberson	70260874ac	UMA has become more particular about zone types. Use the right allocator calls in uma_zwait().	2020-02-17 01:06:18 +00:00
Jeff Roberson	6d88d784f8	Slightly restructure uma_zalloc* to generate better code from clang and reduce duplication among zalloc functions. Reviewed by: markj Discussed with: mjg Differential Revision: https://reviews.freebsd.org/D23672	2020-02-16 01:07:19 +00:00
Mateusz Guzik	3379d2f926	vm: use new capsicum helpers	2020-02-15 01:29:07 +00:00
Mateusz Guzik	23ed568caa	vm: remove no longer needed atomic_load_ptr casts	2020-02-14 23:16:29 +00:00
Mark Johnston	06ef60525f	Fix handling of WAITFAIL in vm_page_grab() and vm_page_grab_pages(). After sleeping through a memory shortage, we must return NULL rather than retry. Discussed with: jeff Reported by: pho Sponsored by: The FreeBSD Foundation	2020-02-13 23:18:35 +00:00
Mark Johnston	cefc92e1a2	Update the zone-global count of cached items in bucket_cache_reclaim(). This was missed in r351673. The count is used to enforce cache limits, which are rarely used. Discussed with: jeff Sponsored by: The FreeBSD Foundation	2020-02-13 23:15:21 +00:00
Jeff Roberson	543117bed8	Fix a case where ub_seq would fail to be set if the cross bucket was flushed due to memory pressure. Reviewed by: markj Differential Revision: http://reviews.freebsd.org/D23614	2020-02-13 20:58:51 +00:00
Mateusz Guzik	3acb6572fc	Store offset into zpcpu allocations in the per-cpu area. This shortens zpcpu_get and allows more optimizations. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D23570	2020-02-12 11:11:22 +00:00
Mark Johnston	4ab3aee8fb	Reduce lock hold time in keg_drain(). Maintain a count of free slabs in the per-domain keg structure and use that to clear the free slab list in constant time for most cases. This helps minimize lock contention induced by reclamation, in preparation for proactive trimming of excesses of free memory. Reviewed by: jeff, rlibby Tested by: pho Differential Revision: https://reviews.freebsd.org/D23532	2020-02-11 20:06:33 +00:00
Jonathan T. Looney	3c200db9d2	Modify the vm.panic_on_oom sysctl to take a count of events. Currently, the vm.panic_on_oom sysctl is a boolean which controls the behavior of the VM system when it encounters an out-of-memory situation. If set to 0, the VM system kills the largest process. If set to any other value, the VM system will initiate a panic. This change makes the sysctl a count of events. If set to 0, the VM system kills the largest process. If set to any other value, the VM system will kill the largest process until it has seen the specified number of out-of-memory events. Once it reaches the specified number of events, it will initiate a panic. This change is helpful in capturing cores when the system is in a perpetual cycle of out-of-memory events (as opposed to just hitting one or two sporadic out-of-memory events). Reviewed by: kib MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D23601	2020-02-10 18:06:38 +00:00
Ryan Libby	bae55c4aec	uma: remove UMA_ZFLAG_CACHEONLY flag UMA_ZFLAG_CACHEONLY was essentially the same thing as UMA_ZONE_VM, but with a more confusing name. Remove the flag, make UMA_ZONE_VM an inherit flag, and replace all references. Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23516	2020-02-06 08:32:25 +00:00
Ryan Libby	33e5a1ea3b	uma: multipage chicken switch Add a switch to allow disabling multipage slabs, in order to facilitate measuring memory usage and performance effects. The tunable vm.debug.uma_multipage_slabs defaults to 1 and can be set to 0 to disable. The name may change soon. Reviewed by: markj (previous version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23487	2020-02-04 22:40:45 +00:00
Ryan Libby	27ca37acb7	uma: grow slabs to enforce minimum memory efficiency Memory efficiency can be poor with awkward item sizes (e.g. 1/2 or 1 page size + epsilon). In order to achieve a minimum memory efficiency, select a slab size with a potentially larger number of pages if it yields a lower portion of waste. This may mean using page_alloc instead of uma_small_alloc, which could be more costly. Discussed with: jeff, mckusick Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23239	2020-02-04 22:40:34 +00:00
Ryan Libby	ec0d828071	uma: add UMA_ZONE_CONTIG, and a default contig_alloc For now, copy the mbuf allocator. Reviewed by: jeff, markj (previous version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23237	2020-02-04 22:40:11 +00:00
Ryan Libby	5ba16cf3d7	uma: pcpu_page_free needs to startup_free pages from startup_alloc After r357392, it is apparent that we do have some early-boot PCPU zones. Make it so we can safely free pages from them if they are actually used during early boot. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23496	2020-02-04 22:39:58 +00:00
Jeff Roberson	ee9e43f8dd	Add an explicit busy state for free pages. This improves behavior with potential bugs that access freed pages as well as providing a path towards lockless page lookup. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23444	2020-02-04 20:33:01 +00:00
Jeff Roberson	e84130a0c0	Use literal bucket sizes for smaller buckets rather than the rounding system. Small bucket sizes already pack well even if they are an odd number of words. This prevents any potential new instances of the problem fixed in r357463 as well as making the system easier to understand. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23494	2020-02-04 20:28:06 +00:00
Konstantin Belousov	8d34a3bf7d	Enable vm_object_mightbedirty() and vm_object_page_clean() for swap objects backing tmpfs vnodes data. The clean scan is limited to only remove write permissions from the mapped pages of the objects. This fixes the issue that tmpfs vnode mtime is not updated from writes to the mmaped area after the initial page-in. Noted by: mjg Reviewed by: markj Discussed with: jeff Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D23432	2020-02-04 19:03:37 +00:00
Jeff Roberson	dc3915c8c6	Use STAILQ instead of TAILQ for bucket lists. We only need FIFO behavior and this is more space efficient. Stop queueing recently used buckets to the head of the list. If the bucket goes to a different processor the cache coherency will be more expensive. We already try to encourage cache-hot behavior in the per-cpu layer. Reviewed by: rlibby Differential Revision: https://reviews.freebsd.org/D23493	2020-02-04 02:41:24 +00:00
Mark Johnston	36cb95c736	Disable the smallest UMA bucket size on 32-bit platforms. With r357314, sizeof(struct uma_bucket) grew to 16 bytes on 32-bit platforms, so BUCKET_SIZE(4) is 0. This resulted in the creation of a bucket zone for buckets with zero capacity. A more general fix is planned, but for now this bandaid allows 32-bit platforms to boot again. PR: 243837 Discussed with: jeff Reported by: pho, Jenkins via lwhsu Tested by: pho Sponsored by: The FreeBSD Foundation	2020-02-03 19:29:02 +00:00
Warner Losh	58aa35d429	Remove sparc64 kernel support Remove all sparc64 specific files Remove all sparc64 ifdefs Remove indirect sparc64 ifdefs	2020-02-03 17:35:11 +00:00
Mateusz Guzik	f1fa1ba3d0	Fix up various vnode-related asserts which did not dump the used vnode	2020-02-03 14:25:32 +00:00
Jeff Roberson	f96d4157a7	Fix a bug in r356776 where the page allocator was not properly restored to the percpu page allocator after it had been temporarily overridden by startup_alloc. Reported by: pho, bdragon	2020-02-01 23:46:30 +00:00
Mark Johnston	f0a273c00f	Remove a couple of lingering usages of the page lock. Update vm_page_scan_contig() and vm_page_reclaim_run() to stop using vm_page_change_lock(). It has no use after r356157. Remove vm_page_change_lock() now that it has no users. Remove an unnecessary check for wirings in vm_page_scan_contig(), which was previously checking twice. The check is racy until vm_page_reclaim_run() ensures that the page is unmapped, so one check is sufficient. Reviewed by: jeff, kib (previous versions) Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D23279	2020-02-01 18:23:51 +00:00
Mateusz Guzik	643656cfaf	vfs: replace VOP_MARKATIME with VOP_MMAPPED The routine is only provided by ufs and is only used on mmap and exec. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23422	2020-02-01 06:46:55 +00:00
Jeff Roberson	9e47b34110	Fix LINT build with MEMGUARD.	2020-01-31 02:03:22 +00:00
Jeff Roberson	d4665eaa66	Implement a safe memory reclamation feature that is tightly coupled with UMA. This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is a unique algorithm. This has 3x the performance of epoch in a write heavy workload with less than half of the read side cost. The memory overhead is significantly lessened by limiting the free-to-use latency. A synthetic test uses 1/20th of the memory vs Epoch. There is significant further discussion in the comments and code review. This code should be considered experimental. I will write a man page after it has settled. After further validation the VM will begin using this feature to permit lockless page lookups. Both markj and cperciva tested on arm64 at large core counts to verify fences on weaker ordering architectures. I will commit a stress testing tool in a follow-up. Reviewed by: mmacy, markj, rlibby, hselasky Discussed with: sbahara Differential Revision: https://reviews.freebsd.org/D22586	2020-01-31 00:49:51 +00:00
Konstantin Belousov	b70f6e1513	Restore OOM logic on page fault after r357026. Right now OOM is initiated unconditionally on the page allocation failure, after the wait. Reported by: Mark Millard <marklmi@yahoo.com> Reviewed by: cy, markj Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D23409	2020-01-29 12:02:47 +00:00
Konstantin Belousov	cd0047f3a9	Handle a race of collapse with a retrying fault. Both vm_object_scan_all_shadowed() and vm_object_collapse_scan() might observe an invalid page left in the default backing object by the fault handler that retried. Check for the condition and refuse to collapse. Reported and tested by: pho Reviewed by: jeff Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D23331	2020-01-24 19:42:53 +00:00
Doug Moore	c7b23459b2	Most uses of vm_map_clip_start follow a call to vm_map_lookup. Define an inline function vm_map_lookup_clip_start that invokes them both and use it in places that invoke both. Drop a couple of local variables made unnecessary by this function. Reviewed by: markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D22987	2020-01-24 07:48:11 +00:00
Mark Johnston	e6bd3a812d	vm_map_submap(): Avoid unnecessary clipping. A submap can only be created from an entry spanning the entire request range. In particular, if vm_map_lookup_entry() returns false or the returned entry contains "end". Since the only use of submaps in FreeBSD is for the static pipe and execve argument KVA maps, this has no functional effect. Github PR: https://github.com/freebsd/freebsd/pull/420 Submitted by: Wuyang Chung <wuyang.chung1@gmail.com> (original) Reviewed by: dougm, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23299	2020-01-23 16:45:10 +00:00
Jeff Roberson	fb4d37eac1	(fault 9/9) Move zero fill into a dedicated function to make the object lock state more clear. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23326	2020-01-23 05:23:37 +00:00
Jeff Roberson	be9d4fd6b4	(fault 8/9) Restructure some code to reduce duplication and simplify flow control. Reviewed by: dougm, kib, markj Differential Revision: https://reviews.freebsd.org/D23321	2020-01-23 05:22:02 +00:00
Jeff Roberson	df794f5caf	(fault 7/9) Move fault population and allocation into a dedicated function Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23320	2020-01-23 05:19:39 +00:00
Jeff Roberson	5909dafea9	(fault 6/9) Move getpages and associated logic into a dedicated function. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23311	2020-01-23 05:18:00 +00:00
Jeff Roberson	91eb2e908f	(fault 5/9) Move the backing_object traversal into a dedicated function. Reviewed by: dougm, kib, markj Differential Revision: https://reviews.freebsd.org/D23310	2020-01-23 05:14:41 +00:00
Jeff Roberson	5936b6a8f1	(fault 4/9) Move copy-on-write into a dedicated function. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23304	2020-01-23 05:11:01 +00:00
Jeff Roberson	fcb0475833	(fault 3/9) Move map relookup into a dedicated function. Add a new VM return code KERN_RESTART which means, deallocate and restart in fault. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23303	2020-01-23 05:07:01 +00:00
Jeff Roberson	c308a3a6c9	(fault 2/9) Move map lookup into a dedicated function. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23302	2020-01-23 05:05:39 +00:00
Jeff Roberson	2c2f4413cc	(fault 1/9) Move a handful of stack variables into the faultstate. This additionally fixes a potential bug/pessimization where we could fail to reload the original fault_type on restart. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23301	2020-01-23 05:03:34 +00:00
Ryan Libby	8d1c459ae5	uma: fix zone domain overlaying pcpu cache with disabled cpus UMA zone structures have two arrays at the end which are sized according to the machine: an array of CPU count length, and an array of NUMA domain count length. The CPU counting was wrong in the case where some CPUs are disabled (when mp_ncpus != mp_maxid + 1), and this caused the second array to be overlaid with the first. Reported by: olivier Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23318	2020-01-23 04:56:38 +00:00
Ryan Libby	7e2406774e	uma: report leaks more accurately Previously UMA had some false negatives in the leak report at keg destruction time, where it only reported leaks if there were free items in the slab layer (rather than allocated items), which notably would not be true for single-item slabs (large items). Now, report a leak if there are any allocated pages, and calculate and report the number of allocated items rather than free items. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23275	2020-01-23 04:56:34 +00:00
Jeff Roberson	91e31c3c08	Consistently use busy and vm_page_valid() rather than touching page bits directly. This improves API compliance, asserts, etc. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23283	2020-01-23 04:54:49 +00:00
Jeff Roberson	530cc6a25d	Some architectures with DMAP still consume boot kva. Simplify the test for claiming kva in uma_startup2() to handle this. Reported by: bdragon	2020-01-23 03:37:35 +00:00
Jeff Roberson	5949b1ca8c	Move readahead and dropbehind fault functionality into a helper routine for clarity. Reviewed by: dougm, kib, markj Differential Revision: https://reviews.freebsd.org/D23282	2020-01-21 00:12:57 +00:00
Jeff Roberson	1e40fe41c5	Reduce object locking in vm_fault. Once we have an exclusively busied page we no longer need an object lock. This reduces the longest hold times and eliminates some trylock code blocks. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23034	2020-01-20 22:49:52 +00:00
Jeff Roberson	d6e13f3b4d	Don't hold the object lock while calling getpages. The vnode pager does not want the object lock held. Moving this out allows further object lock scope reduction in callers. While here add some missing paging in progress calls and an assert. The object handle is now protected explicitly with pip. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23033	2020-01-19 23:47:32 +00:00
Jeff Roberson	9c83ff2d86	It has not been possible to recursively terminate a vnode object for some time now. Eliminate the dead code that supports it. Approved by: kib, markj Differential Revision: https://reviews.freebsd.org/D22908	2020-01-19 18:36:03 +00:00
Jeff Roberson	98087a066f	Make collapse synchronization more explicit and allow it to complete during paging. Shadow objects are marked with a COLLAPSING flag while they are collapsing with their backing object. This gives us an explicit test rather than overloading paging-in-progress. While split is on-going we mark an object with SPLIT. These two operations will modify the swap tree so they must be serialized and swap_pager_getpages() can now directly detect these conditions and page more conservatively. Callers to vm_object_collapse() now will reliably wait for a collapse to finish so that the backing chain is as short as possible before other decisions are made that may inflate the object chain. For example, split, coalesce, etc. It is now safe to run fault concurrently with collapse. It is safe to increase or decrease paging in progress with no lock so long as there is another valid ref on increase. This change makes collapse more reliable as a secondary benefit. The primary benefit is making it safe to drop the object lock much earlier in fault or never acquire it at all. This was tested with a new shadow chain test script that uncovered long standing bugs and will be integrated with stress2. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22908	2020-01-19 18:30:23 +00:00
Andrew Gallatin	2052680238	pcpu_page_alloc: guard against empty NUMA domains Some systems, such as higher end Threadripper, may have NUMA domains with no physical memory. Don't allocate from these domains. This fixes a "panic: vm_wait in early boot" on my 2990WX desktop. Reviewed by: jeff Sponsored by: Netflix	2020-01-18 18:25:37 +00:00
Jeff Roberson	5844774900	Fix a long standing bug that was made worse in r355765. When we are cowing a page that was previously mapped read-only it exists in pmap until pmap_enter() returns. However, we held no reference to the original page after the copy was complete. This allowed vm_object_scan_all_shadowed() to collapse an object that still had pages mapped. To resolve this, add another page pointer to the faultstate so we can keep the page xbusy until we're done with pmap_enter(). Handle busy pages in scan_all_shadowed. This is already done in vm_object_collapse_scan(). Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23155	2020-01-17 03:44:04 +00:00
Jeff Roberson	a81c400e75	Simplify VM and UMA startup by eliminating boot pages. Instead use careful ordering to allocate early pages in the same way boot pages were but only as needed. After the KVA allocator has started up we allocate the KVA that we consumed during boot. This also makes the boot pages freeable since they have vm_page structures allocated with the rest of memory. Parts of this patch were written and tested by markj. Reviewed by: glebius, markj Differential Revision: https://reviews.freebsd.org/D23102	2020-01-16 05:01:21 +00:00
Alexander Motin	ace409ce9c	Restore loop break in vm_pageout_lowmem(). r355004 removed return statement from this loop with intention to also call uma_reclaim_wakeup(). But in case of vm.lowmem_period=0 it causes infinite loop. Reviewed by: markj Sponsored by: iXsystems, Inc.	2020-01-14 03:27:57 +00:00
Ryan Libby	9b8db4d0a0	uma: split slabzone into two sizes By allowing more items per slab, we can improve memory efficiency for small allocs. If we were just to increase the bitmap size of the slabzone, we would then waste slabzone memory. So, split slabzone into two zones, one especially for 8-byte allocs (512 per slab). The practical effect should be reduced memory usage for counter(9). Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23149	2020-01-14 02:14:15 +00:00
Ryan Libby	e63a1c2f52	uma: fixup some ktr messages Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23148	2020-01-14 02:13:46 +00:00
Mateusz Guzik	a314aba874	vm: add missing CTLFLAG_MPSAFE annotations This covers all vm/* files.	2020-01-12 05:08:57 +00:00
Gleb Smirnoff	9328cbc047	Always multiply vm.pgcache_zone_max by the number of CPUs, and rename it accordingly. The tunable controls the size of the per-CPU vm page cache. Previously the value was split among all CPUs in the system, so configuring the same value on machines with different CPU counts yielded a different cache size available to each CPU. Reviewed by: markj Obtained from: Netflix	2020-01-10 19:32:08 +00:00
Mark Johnston	860bb7a04c	UMA: Don't destroy zones after the system shutdown process starts. Some kernel subsystems, notably ZFS, will destroy UMA zones from a shutdown eventhandler. This causes the zone to be drained. For slabs that are mapped into KVA this can be very expensive and so it needlessly delays the shutdown process. Add a new state to the "booted" variable, BOOT_SHUTDOWN. Once kern_reboot() starts invoking shutdown handlers, turn uma_zdestroy() into a no-op, provided that the zone does not have a custom finalization routine. PR: 242427 Reviewed by: jeff, kib, rlibby MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23066	2020-01-09 19:17:42 +00:00
Ryan Libby	4a8b575c6b	uma: unify layout paths and improve efficiency Unify the keg layout selection paths (keg_small_init, keg_large_init, keg_cachespread_init), and slightly improve memory efficiency by: - using the padding of the final item to store the slab header, - not going OFFPAGE if we have a choice unless it improves efficiency. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23048	2020-01-09 02:03:17 +00:00
Ryan Libby	54c5ae804f	uma: reorganize flags - Garbage collect UMA_ZONE_PAGEABLE & UMA_ZONE_STATIC. - Move flag VTOSLAB from public to private. - Introduce public NOTPAGE flag and make HASH private. - Introduce public NOTOUCH flag and make OFFPAGE private. - Update man page. The net effect of this should be to make the contract with clients more clear. Clients should choose constraints, UMA will figure out how to implement them. This also breaks the confusing double meaning of OFFPAGE. Reviewed by: jeff, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D23016	2020-01-09 02:03:03 +00:00
Jeff Roberson	79c9f9429a	Fix uma boot pages calculations on NUMA machines that also don't have MD_UMA_SMALL_ALLOC. This is unusual but not impossible. Fix the alignment of zones while here. This was already correct because uz_cpu strongly aligned the zone structure but the specified alignment did not match reality and involved redundant defines. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D23046	2020-01-06 02:51:19 +00:00
Jeff Roberson	bfb6b7a121	The fix in r356353 was insufficient. Not every architecture returns 0 for EARLY_COUNTER. Only amd64 seems to. Suggested by: markj Reported by: lwhsu Reviewed by: markj PR: 243117	2020-01-05 22:54:25 +00:00
Kyle Evans	2180f6c6f1	kern_mmap: restore character deleted in transit Pointy hat to: kevans X-MFC-With: r356359	2020-01-04 23:51:44 +00:00
Kyle Evans	18348a2369	kern_mmap: add a variant that allows caller to inspect fp Linux mmap rejects mmap() on a write-only file with EACCES. linux_mmap_common currently does a fun dance to grab the fp associated with the passed in fd, validates it, then drops the reference and calls into kern_mmap(). Doing so is perhaps both fragile and premature; there's still plenty of chance for the request to get rejected with a more appropriate error, and it's prone to a race where the file we ultimately mmap has changed after it drops its reference. This change alleviates the need to do this by providing a kern_mmap variant that allows the caller to inspect the fp just before calling into the fileop layer. The callback takes flags, prot, and maxprot as one could imagine scenarios where any of these, in conjunction with the file itself, may influence a caller's decision. The file type check in the linux compat layer has been removed; EINVAL is seemingly not an appropriate response to the file not being a vnode or device. The fileop layer will reject the operation with ENODEV if it's not supported, which more closely matches the common linux description of mmap(2) return values. If we discover that we're allowing an mmap() on a file type that Linux normally wouldn't, we should restrict those explicitly. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22977	2020-01-04 23:39:58 +00:00
Jeff Roberson	31c251a046	Fix an assertion introduced in r356348. On architectures without UMA_MD_SMALL_ALLOC vmem has a more complicated startup sequence that violated the new assert. Resolve this by rewriting the COLD asserts to look at the per-cpu allocation counts for evidence of api activity. Discussed with: rlibby Reviewed by: markj Reported by: lwhsu	2020-01-04 19:29:25 +00:00
Jeff Roberson	dfe13344f5	UMA NUMA flag day. UMA_ZONE_NUMA was a source of confusion. Make the names more consistent with other NUMA features as UMA_ZONE_FIRSTTOUCH and UMA_ZONE_ROUNDROBIN. The system will now select a default depending on kernel configuration. API users need only specify one if they want to override the default. Remove the UMA_XDOMAIN and UMA_FIRSTTOUCH kernel options and key only off of NUMA. XDOMAIN is now fast enough in all cases to enable whenever NUMA is. Reviewed by: markj Discussed with: rlibby Differential Revision: https://reviews.freebsd.org/D22831	2020-01-04 18:48:13 +00:00
Jeff Roberson	91d947bfbe	Sort cross-domain frees into per-domain buckets before inserting these onto their respective bucket lists. This is a several order of magnitude improvement in contention on the keg lock under heavy free traffic while requiring only an additional bucket per-domain worth of memory. Discussed with: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22830	2020-01-04 07:56:28 +00:00
Jeff Roberson	8b987a7769	Use per-domain keg locks. This provides both a lock and separate space accounting for each NUMA domain. Independent keg domain locks are important with cross-domain frees. Hashed zones are non-numa and use a single keg lock to protect the hash table. Reviewed by: markj, rlibby Differential Revision: https://reviews.freebsd.org/D22829	2020-01-04 03:30:08 +00:00
Jeff Roberson	727c691857	Use a separate lock for the zone and keg. This provides concurrency between populating buckets from the slab layer and fetching full buckets from the zone layer. Eliminate some nonsense locking patterns where we lock to fetch a single variable. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D22828	2020-01-04 03:15:34 +00:00
Jeff Roberson	4bd61e19a2	Use atomics for the zone limit and sleeper count. This relies on the sleepq to serialize sleepers. This patch retains the existing sleep/wakeup paradigm to limit 'thundering herd' wakeups. It resolves a missing wakeup in one case but otherwise should be bug for bug compatible. In particular, there are still various races surrounding adjusting the limit via sysctl that are now documented. Discussed with: markj Reviewed by: rlibby Differential Revision: https://reviews.freebsd.org/D22827	2020-01-04 03:04:46 +00:00
Mateusz Guzik	b249ce48ea	vfs: drop the mostly unused flags argument from VOP_UNLOCK Filesystems which want to use it in limited capacity can employ the VOP_UNLOCK_FLAGS macro. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21427	2020-01-03 22:29:58 +00:00
Mark Johnston	f7607c300b	Clear queue operation flags when migrating a page to another queue. The page daemon loops may move pages back to the active queue if references are detected. In this case we must take care to clear existing queue operation flags. In particular, PGA_REQUEUE_HEAD may be set, and that flag is only valid if the page belongs to the inactive queue. Also fix a bug in the active queue scan where we were updating "old" instead of "new". This would only have been hit in rare cases where the page moved out of the active queue after the beginning of the scan. Reported by: Bob Prohaska, Idwer Vollering Tested by: Idwer Vollering Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D23001	2020-01-02 19:26:04 +00:00
Doug Moore	668a8aa83b	The map-entry clipping functions modify start and end entries of an entry in the vm_map, making invariants related to the max_free entry field invalid. Move the clipping work into vm_map_entry_link, so that linking is okay when the new entry clips a current entry, and the vm_map doesn't have to be briefly corrupted. Change assertions and conditions in SPLAY_{LEFT,RIGHT}_STEP since the max_free invariants can now be trusted in all cases. Tested by: pho Reviewed by: alc Differential Revision: https://reviews.freebsd.org/D22897	2019-12-31 22:20:54 +00:00
Mark Johnston	758b2c02bb	Restore a vm_page_wired() check in vm_page_mvqueue() after r356156. We now set PGA_DEQUEUE on a managed page when it is wired after allocation, and vm_page_mvqueue() ignores pages with this flag set, ensuring that they do not end up in the page queues. However, this is not sufficient for managed fictitious pages or pages managed by the TTM. In particular, the TTM makes use of the plinks.q queue linkage fields for its own purposes. PR: 242961 Reported and tested by: Greg V <greg@unrelenting.technology>	2019-12-29 20:01:03 +00:00
Mark Johnston	9b888dd9bd	Clear queue op flags in vm_page_mvqueue(). This fixes a regression in r356155, introduced at the last minute. In particular, we must clear PGA_REQUEUE_HEAD before inserting into any queue besides PQ_INACTIVE since that operation is implemented only for PQ_INACTIVE. Reported by: pho, Jenkins via lwhsu	2019-12-29 15:39:43 +00:00
Mark Johnston	727150ff03	Remove some unused functions. The previous series of patches orphaned some vm_page functions, so remove them. Reviewed by: dougm, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22886	2019-12-28 19:04:29 +00:00
Mark Johnston	dc71caa037	Update the vm_page.h block comment to reflect recent changes. Explain the new locking rules for per-page queue state updates. Reviewed by: jeff, kib Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22884	2019-12-28 19:04:15 +00:00
Mark Johnston	9f5632e6c8	Remove page locking for queue operations. With the previous reviews, the page lock is no longer required in order to perform queue operations on a page. It is also no longer needed in the page queue scans. This change effectively eliminates remaining uses of the page lock and also the false sharing caused by multiple pages sharing a page lock. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22885	2019-12-28 19:04:00 +00:00
Mark Johnston	b7f30bff2f	Generalize lazy dequeue logic for wired pages. Some recent work aims to remove the use of the page lock for synchronizing updates to page queue state. This change adds a mechanism to preserve the existing behaviour of lazily dequeuing wired pages, which was previously synchronized using the page lock. Handle this by setting PGA_DEQUEUE when a managed page's wire count transitions from 0 to 1. When the page daemon encounters a page with a flag in PGA_QUEUE_OP_MASK set, it creates a batch queue entry for that page, but in so doing it does not modify the page itself and thus racing with a concurrent free of the page is harmless. The flag is advisory; the page daemon still checks for wirings after acquiring the object and page xbusy locks. vm_page_unwire_managed() now clears PGA_DEQUEUE on a 1->0 transition. It must do this before dropping the reference to avoid a use-after-free but also handles races with concurrent wirings to ensure that PGA_DEQUEUE is not left unset on a wired page. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22882	2019-12-28 19:03:46 +00:00
Mark Johnston	f3f38e2580	Start implementing queue state updates using fcmpset loops. This is in preparation for eliminating the use of the vm_page lock for protecting queue state operations. Introduce the vm_page_pqstate_commit_*() functions. These functions act as helpers around vm_page_astate_fcmpset() and are specialized for specific types of operations. vm_page_pqstate_commit() wraps these functions. Convert a number of routines to use these new helpers. Use vm_page_release_toq() in vm_page_unwire() and vm_page_release() to atomically release a wiring reference and release the page into a queue. This has the side effect that vm_page_unwire() will leave the page in the active queue if it is already present there. Convert the page queue scans to use the new helpers. Simplify vm_pageout_reinsert_inactive(), which requeues pages that were found to be busy during an inactive queue scan, to avoid duplicating the work of vm_pqbatch_process_page(). In particular, if PGA_REQUEUE or PGA_REQUEUE_HEAD is set, let that be handled during batch processing. Reviewed by: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22770 Differential Revision: https://reviews.freebsd.org/D22771 Differential Revision: https://reviews.freebsd.org/D22772 Differential Revision: https://reviews.freebsd.org/D22773 Differential Revision: https://reviews.freebsd.org/D22776	2019-12-28 19:03:32 +00:00
Mark Johnston	3c01c56b0e	Don't update per-page activation counts in the swapout code. This avoids duplicating the work of the page daemon's active queue scan. Moreover, this duplication was inconsistent: - PGA_REFERENCED is not counted in act_count unless pmap_ts_referenced() returned 0, but the page daemon always counts PGA_REFERENCED towards the activation count. - The swapout daemon always activates a referenced page, but the page daemon only does so when the containing object is mapped at least once. The main purpose of swapout_deactivate_pages() is to shrink the number of pages mapped into a given pmap. To do this without unmapping active pages, use the non-destructive pmap_is_referenced() instead of the destructive pmap_ts_referenced() and deactivate pages accordingly. This simplifies some future changes to the locking protocol for page queue state. Reviewed by: kib Discussed with: jeff Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D22674	2019-12-28 19:03:17 +00:00
Konstantin Belousov	df8db6ddb9	vm_object_shadow(): fix object reference leak. In r355270 by me, vm_object_shadow() was changed to handle the reference counting for the shared case, but the extra reference that was done in vmspace_fork() for the shared/need_copy case was not removed. Submitted by: jeff	2019-12-28 16:40:44 +00:00
Mark Johnston	5541eb27d6	Remove some stale comments from the page allocator. Since r352110 the page lock is not required to wire pages in any context.	2019-12-27 23:19:21 +00:00

4706 commits