Commit graph

373 commits

Ryan Libby 8d1c459ae5 uma: fix zone domain overlaying pcpu cache with disabled cpus
UMA zone structures have two arrays at the end which are sized according
to the machine: an array of CPU count length, and an array of NUMA
domain count length.  The CPU counting was wrong in the case where some
CPUs are disabled (when mp_ncpus != mp_maxid + 1), and this caused the
second array to be overlaid with the first.

Reported by:	olivier
Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23318
2020-01-23 04:56:38 +00:00
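
A minimal user-space sketch of the sizing bug described in the commit above
(struct and variable names are illustrative, not the kernel's):

#include <stdlib.h>

struct uma_cache { long uc_allocs; };	/* per-CPU cache slot */
struct zone_domain { long zd_nitems; };	/* per-NUMA-domain state */

/*
 * The zone ends in two trailing arrays carved out of one allocation.
 * With disabled CPUs, mp_ncpus < mp_maxid + 1, so sizing the per-CPU
 * array by mp_ncpus places the domain array on top of live CPU slots.
 */
static void *
zone_alloc(size_t hdr, unsigned mp_maxid, unsigned ndomains)
{
	size_t cpus = (size_t)(mp_maxid + 1) * sizeof(struct uma_cache);
	size_t doms = (size_t)ndomains * sizeof(struct zone_domain);
	return (calloc(1, hdr + cpus + doms));	/* fixed: mp_maxid + 1 */
}
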
Ryan Libby 7e2406774e uma: report leaks more accurately
Previously UMA had some false negatives in the leak report at keg
destruction time, where it only reported leaks if there were free items
in the slab layer (rather than allocated items), which notably would not
be true for single-item slabs (large items).  Now, report a leak if
there are any allocated pages, and calculate and report the number of
allocated items rather than free items.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23275
2020-01-23 04:56:34 +00:00
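
The corrected accounting in outline, as a sketch with invented names (the
real code walks the keg structures): a leak exists whenever pages remain
allocated, and its size is the item capacity of those pages minus the free
items.

#include <stdio.h>

static void
keg_leak_report(unsigned pages, unsigned ppera, unsigned ipers,
    unsigned free_items)
{
	if (pages == 0)
		return;			/* nothing outstanding: no leak */
	unsigned slabs = pages / ppera;	/* pages-per-slab -> slab count */
	unsigned leaked = slabs * ipers - free_items;
	printf("zone leaked %u item(s) over %u page(s)\n", leaked, pages);
}
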
Jeff Roberson 530cc6a25d Some architectures with DMAP still consume boot kva. Simplify the test for
claiming kva in uma_startup2() to handle this.

Reported by:	bdragon
2020-01-23 03:37:35 +00:00
Andrew Gallatin 2052680238 pcpu_page_alloc: guard against empty NUMA domains
Some systems, such as higher-end Threadripper, may have
NUMA domains with no physical memory.  Don't allocate
from these domains.

This fixes a "panic: vm_wait in early boot" on my 2990WX desktop.

Reviewed by:	jeff
Sponsored by:	Netflix
2020-01-18 18:25:37 +00:00
Jeff Roberson a81c400e75 Simplify VM and UMA startup by eliminating boot pages. Instead, use
careful ordering to allocate early pages in the same way boot pages were
allocated, but only as needed.  After the KVA allocator has started up, we
allocate the KVA that we consumed during boot.  This also makes the boot
pages freeable, since they have vm_page structures allocated with the rest
of memory.

Parts of this patch were written and tested by markj.

Reviewed by:	glebius, markj
Differential Revision:	https://reviews.freebsd.org/D23102
2020-01-16 05:01:21 +00:00
Ryan Libby 9b8db4d0a0 uma: split slabzone into two sizes
By allowing more items per slab, we can improve memory efficiency for
small allocs.  If we were just to increase the bitmap size of the
slabzone, we would then waste slabzone memory.  So, split slabzone into
two zones, one especially for 8-byte allocs (512 per slab).  The
practical effect should be reduced memory usage for counter(9).

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23149
2020-01-14 02:14:15 +00:00
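
The arithmetic behind the 512-per-slab figure, as a sketch (the 4 KB page
size is an assumption, not taken from the commit): an 8-byte item zone can
pack 512 items into one page, but then needs a 512-bit free bitmap in its
offpage slab header, which would be wasted on zones with far fewer items
per slab — hence two slabzone sizes.

#include <stdio.h>

#define	SKETCH_PAGE_SIZE	4096	/* assumed page size */

int
main(void)
{
	unsigned ipers = SKETCH_PAGE_SIZE / 8;	/* 512 8-byte items/page */
	unsigned bitmap = (ipers + 7) / 8;	/* 64-byte free bitmap */
	printf("%u items/slab need a %u-byte bitmap\n", ipers, bitmap);
	/* A zone with 32 items per slab needs only a 4-byte bitmap, so
	 * one worst-case slabzone would waste 60 bytes per header. */
	printf("32 items/slab need only a %u-byte bitmap\n", (32 + 7) / 8);
	return (0);
}
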
Ryan Libby e63a1c2f52 uma: fixup some ktr messages
Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23148
2020-01-14 02:13:46 +00:00
Mateusz Guzik a314aba874 vm: add missing CTLFLAG_MPSAFE annotations
This covers all vm/* files.
2020-01-12 05:08:57 +00:00
Mark Johnston 860bb7a04c UMA: Don't destroy zones after the system shutdown process starts.
Some kernel subsystems, notably ZFS, will destroy UMA zones from a
shutdown eventhandler.  This causes the zone to be drained.  For slabs
that are mapped into KVA this can be very expensive and so it needlessly
delays the shutdown process.

Add a new state to the "booted" variable, BOOT_SHUTDOWN.  Once
kern_reboot() starts invoking shutdown handlers, turn uma_zdestroy()
into a no-op, provided that the zone does not have a custom finalization
routine.

PR:		242427
Reviewed by:	jeff, kib, rlibby
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23066
2020-01-09 19:17:42 +00:00
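
A self-contained sketch of the check (names follow the commit text; the
real teardown path is more involved):

enum uma_boot_state { BOOT_COLD, BOOT_RUNNING, BOOT_SHUTDOWN };

struct zone {
	void	(*uz_fini)(void *item, int size);  /* finalizer, or NULL */
};

static enum uma_boot_state booted;

static void
zdestroy_sketch(struct zone *zone)
{
	/*
	 * Once kern_reboot() is invoking shutdown handlers, draining a
	 * zone only delays reboot: skip it, unless a custom finalization
	 * routine still has to run on the cached items.
	 */
	if (booted == BOOT_SHUTDOWN && zone->uz_fini == NULL)
		return;
	/* ... normal teardown: drain caches, free slabs, unlink zone ... */
}
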
Ryan Libby 4a8b575c6b uma: unify layout paths and improve efficiency
Unify the keg layout selection paths (keg_small_init, keg_large_init,
keg_cachespread_init), and slightly improve memory efficiency by:
 - using the padding of the final item to store the slab header,
 - not going OFFPAGE if we have a choice unless it improves efficiency.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23048
2020-01-09 02:03:17 +00:00
Ryan Libby 54c5ae804f uma: reorganize flags
- Garbage collect UMA_ZONE_PAGEABLE & UMA_ZONE_STATIC.
- Move flag VTOSLAB from public to private.
- Introduce public NOTPAGE flag and make HASH private.
- Introduce public NOTOUCH flag and make OFFPAGE private.
- Update man page.

The net effect of this should be to make the contract with clients more
clear.  Clients should choose constraints, UMA will figure out how to
implement them.  This also breaks the confusing double meaning of
OFFPAGE.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23016
2020-01-09 02:03:03 +00:00
Jeff Roberson 79c9f9429a Fix uma boot pages calculations on NUMA machines that also don't have
MD_UMA_SMALL_ALLOC.  This is unusual but not impossible.  Fix the
alignment of zones while here.  This was already correct in practice
because uz_cpu strongly aligned the zone structure, but the specified
alignment did not match reality and involved redundant defines.

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D23046
2020-01-06 02:51:19 +00:00
Jeff Roberson bfb6b7a121 The fix in r356353 was insufficient. Not every architecture returns 0 for
EARLY_COUNTER.  Only amd64 seems to.

Suggested by:	markj
Reported by:	lwhsu
Reviewed by:	markj
PR:		243117
2020-01-05 22:54:25 +00:00
Jeff Roberson 31c251a046 Fix an assertion introduced in r356348. On architectures without
UMA_MD_SMALL_ALLOC, vmem has a more complicated startup sequence that
violated the new assert.  Resolve this by rewriting the COLD asserts to
look at the per-CPU allocation counts for evidence of API activity.

Discussed with:	rlibby
Reviewed by:	markj
Reported by:	lwhsu
2020-01-04 19:29:25 +00:00
Jeff Roberson dfe13344f5 UMA NUMA flag day. UMA_ZONE_NUMA was a source of confusion. Make the names
more consistent with other NUMA features as UMA_ZONE_FIRSTTOUCH and
UMA_ZONE_ROUNDROBIN.  The system will now select a default depending on
kernel configuration, as sketched below.  API users need only specify one
if they want to override the default.

Remove the UMA_XDOMAIN and UMA_FIRSTTOUCH kernel options and key only off
of NUMA.  XDOMAIN is now fast enough in all cases to enable whenever NUMA
is.

Reviewed by:	markj
Discussed with:	rlibby
Differential Revision:	https://reviews.freebsd.org/D22831
2020-01-04 18:48:13 +00:00
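
The default-selection contract, sketched (the flag values here are
placeholders): an explicit policy from the API user wins; otherwise UMA
picks one based on whether the kernel is a NUMA configuration.

#define	UMA_ZONE_FIRSTTOUCH	0x0100	/* placeholder value */
#define	UMA_ZONE_ROUNDROBIN	0x0200	/* placeholder value */

static unsigned
zone_domain_policy(unsigned flags, int numa_configured)
{
	/* API users need only set a flag to override the default. */
	if (flags & (UMA_ZONE_FIRSTTOUCH | UMA_ZONE_ROUNDROBIN))
		return (flags);
	return (flags | (numa_configured ?
	    UMA_ZONE_FIRSTTOUCH : UMA_ZONE_ROUNDROBIN));
}
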
Jeff Roberson 91d947bfbe Sort cross-domain frees into per-domain buckets before inserting these
onto their respective bucket lists.  This is a several-orders-of-magnitude
improvement in contention on the keg lock under heavy free traffic, while
requiring only an additional bucket per domain worth of memory.

Discussed with:		markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D22830
2020-01-04 07:56:28 +00:00
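
The batching that wins back the keg lock, in miniature (data structures
simplified; item_domain() and domain_insert() are stand-ins for the slab
lookup and the locked list insertion):

#define	NDOMAINS	4		/* illustrative domain count */
#define	BUCKET_SIZE	128		/* illustrative bucket capacity */

struct bucket { void *b_items[BUCKET_SIZE]; int b_cnt; };

int	item_domain(void *item);	/* stand-in: find item's domain */
void	domain_insert(int dom, struct bucket *b);  /* takes lock once */

static void
free_sorted(void **items, int nitems)	/* assumes nitems <= BUCKET_SIZE */
{
	struct bucket sort[NDOMAINS] = { 0 };
	int i, dom;

	/* First pass: sort frees into per-domain staging buckets. */
	for (i = 0; i < nitems; i++) {
		dom = item_domain(items[i]);
		sort[dom].b_items[sort[dom].b_cnt++] = items[i];
	}
	/* Second pass: one lock acquisition per touched domain. */
	for (dom = 0; dom < NDOMAINS; dom++)
		if (sort[dom].b_cnt != 0)
			domain_insert(dom, &sort[dom]);
}
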
Jeff Roberson 8b987a7769 Use per-domain keg locks. This provides both a lock and separate space
accounting for each NUMA domain.  Independent keg domain locks are important
with cross-domain frees.  Hashed zones are non-NUMA and use a single keg
lock to protect the hash table.

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D22829
2020-01-04 03:30:08 +00:00
Jeff Roberson 727c691857 Use a separate lock for the zone and keg. This provides concurrency
between populating buckets from the slab layer and fetching full buckets
from the zone layer.  Eliminate some nonsense locking patterns where
we lock to fetch a single variable.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22828
2020-01-04 03:15:34 +00:00
Jeff Roberson 4bd61e19a2 Use atomics for the zone limit and sleeper count. This relies on the
sleepq to serialize sleepers.  This patch retains the existing sleep/wakeup
paradigm to limit 'thundering herd' wakeups.  It resolves a missing wakeup
in one case but otherwise should be bug-for-bug compatible.  In particular,
there are still various races surrounding adjusting the limit via sysctl
that are now documented.

Discussed with:	markj
Reviewed by:	rlibby
Differential Revision:	https://reviews.freebsd.org/D22827
2020-01-04 03:04:46 +00:00
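
A user-space sketch of the lock-free limit check (C11 atomics stand in for
the kernel's atomic(9) operations; field names are invented): the item
count is reserved atomically, and only a failed reservation falls back to
the sleep path, where the sleepqueue serializes waiters.

#include <stdatomic.h>
#include <stdbool.h>

struct zone_limit {
	atomic_long	zl_items;	/* items currently allocated */
	long		zl_max;		/* administrative limit */
	atomic_int	zl_sleepers;	/* threads waiting for space */
};

static bool
zone_reserve(struct zone_limit *z, long n)
{
	long old = atomic_fetch_add(&z->zl_items, n);
	if (old + n <= z->zl_max)
		return (true);			/* fast path, lock-free */
	atomic_fetch_sub(&z->zl_items, n);	/* back out, then sleep */
	return (false);		/* caller blocks on the sleepq and retries */
}
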
Jeff Roberson cc7ce83ae0 Further reduce the cacheline footprint of fast allocations by duplicating
the zone size and flags fields in the per-cpu caches.  This allows fast
allocations to proceed while touching only the single per-CPU cacheline, and
simplifies the common case when no ctor/dtor is specified.

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D22826
2019-12-25 20:57:24 +00:00
Jeff Roberson 376b1ba394 Optimize fast path allocations by storing bucket headers in the per-cpu
cache area.  This allows us to check on bucket space for all per-cpu
buckets with a single cacheline access and fewer branches.

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D22825
2019-12-25 20:50:53 +00:00
Jeff Roberson 3639ac42e5 Fix a bug with _NUMA domains introduced in r339686. When M_NOWAIT is
specified there was no loop termination condition in keg_fetch_slab().

Reported by:	pho
Reviewed by:	markj
2019-12-25 19:26:35 +00:00
Ryan Libby 815db20425 uma dbg: flexible size for slab debug bitset too
Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative.  Now, make the debug bitset
size dynamic too.  The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that result from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.

Reviewed by:	jeff (previous version), markj (previous version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22759
2019-12-14 05:21:56 +00:00
Mark Johnston 325c4ced0d Restore the reservation of boot pages for bucket zones after r355707.
uma_startup2() sets booted = BOOT_BUCKETS after calling bucket_init(),
but before that assignment, startup_alloc() will use pages from the
reserved pool, so the bucket zones themselves are still allocated using
startup pages.

Reviewed by:	rlibby
Reported by:	Jenkins via lwhsu
Differential Revision:	https://reviews.freebsd.org/D22797
2019-12-13 18:28:01 +00:00
Ryan Libby d82c8ffb16 Revert r355706 & r355710
The quick fix didn't work.  I'll sort it out tomorrow.

Revert r355710: "libmemstat: unbreak build"
Revert r355706: "uma dbg: flexible size for slab debug bitset too"
2019-12-13 11:21:28 +00:00
Ryan Libby f7af501519 uma: report slab efficiency
Reviewed by:	jeff
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22766
2019-12-13 09:32:09 +00:00
Ryan Libby 3182660a85 uma: delay bucket_init() until we might actually enable buckets
This helps with a bootstrapping problem in upcoming work.

We don't first enable buckets until uma_startup2(), so we can delay
bucket creation until then.  The other two paths to bucket_enable() are
both later, one in the pageout daemon (SI_SUB_KTHREAD_PAGE vs SI_SUB_VM)
and one in uma_timeout() (first activated in uma_startup3()).  Note that
although some bucket functions are accessible before uma_startup2()
(e.g. bucket_select() in zone_ctor()), none of them inspect ubz_zone.

Discussed with:	jeff
Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22765
2019-12-13 09:32:03 +00:00
Ryan Libby 7508f15ff1 uma dbg: flexible size for slab debug bitset too
Recently (r355315) the size of the struct uma_slab bitset field us_free
became dynamic instead of conservative.  Now, make the debug bitset
size dynamic too.  The debug bitset is INVARIANTS-only, so in fact we
don't care too much about the space savings that result from this, but
enabling minimally-sized slabs on INVARIANTS builds is still important
in order to be able to test new slab layouts effectively.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22759
2019-12-13 09:31:59 +00:00
Ryan Libby 6d204a6a0e uma: pretty print zone flags sysctl
Requested by:	jeff
Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22748
2019-12-11 06:50:55 +00:00
Jeff Roberson 3b490537f4 Fix two problems with r355149. The sysctl name collision code assumed that
zones would never be freed.  In the case of tmpfs this was not true.  While
here, test for the right bit to disable the keg-related sysctls for zones
that don't have kegs.

Reported by:	pho
Reviewed by:	rlibby
Differential Revision:	https://reviews.freebsd.org/D22655
2019-12-08 01:55:23 +00:00
Jeff Roberson 1e0701e1e5 Use a variant slab structure for offpage zones. This saves space in
embedded slabs but also is an opportunity to tidy up code and add
accessor inlines.

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D22609
2019-12-08 01:15:06 +00:00
Andrew Turner b75c4efcd2 Fix the signature for zone_import and zone_release
These are cast to the uma_import and uma_release function pointer types.
Use those signatures in the zone functions themselves.

This was found with an experimental kernel CFI.  It will complain if the
signature is different from what a function pointer expects.  The
simplest way to fix these is to correct the signature.

Reviewed by:	rlibby
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22671
2019-12-04 18:40:05 +00:00
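
The pattern behind the fix, in isolation (the typedef approximates UMA's
import hook; the body is illustrative): under CFI, an indirect call traps
if the callee's actual type differs from the pointer's type, so the callee
must be declared with the pointer's exact signature and cast internally.

struct zone;				/* opaque here */

typedef int (*import_fn)(void *arg, void **store, int cnt, int domain,
    int flags);

/*
 * Wrong: declaring zone_import with 'struct zone *' as the first
 * parameter and casting at the assignment satisfies the compiler but
 * trips kernel CFI, because call-site and callee types differ.
 * Right: match the pointer type exactly and cast inside.
 */
static int
zone_import(void *arg, void **store, int cnt, int domain, int flags)
{
	struct zone *zone = arg;	/* cast here, not at the call site */
	(void)zone; (void)store; (void)domain; (void)flags;
	/* ... fill 'store' with up to 'cnt' items from the keg ... */
	return (cnt);
}

static import_fn imp = zone_import;	/* types match: no cast, no trap */
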
Jeff Roberson 9b78b1f433 Use a precise bit count for the slab free items in UMA. This significantly
shrinks embedded slab structures.

Reviewed by:	markj, rlibby (prior version)
Differential Revision:	https://reviews.freebsd.org/D22584
2019-12-02 22:44:34 +00:00
Jeff Roberson 6d6a03d7a8 Handle large mallocs by going directly to kmem. Taking a detour through
UMA does not provide any additional value.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22563
2019-11-29 03:14:10 +00:00
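
The dispatch in outline (the threshold and helper names are stand-ins; the
real cutoff is the largest malloc zone size): allocations too big for any
malloc bucket go straight to the kernel VM and skip UMA entirely.

#include <stddef.h>

#define	LARGE_CUTOFF	(64 * 1024)	/* stand-in threshold */

void	*kmem_pages(size_t size);	/* stand-in for the kmem path */
void	*uma_bucket_item(size_t size);	/* stand-in for the UMA path */

static void *
malloc_sketch(size_t size)
{
	if (size > LARGE_CUTOFF)
		return (kmem_pages(size));	/* direct, page-granular */
	return (uma_bucket_item(size));		/* sized malloc zone */
}
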
Jeff Roberson 584061b480 Garbage collect the mostly unused us_keg field. Use appropriately named
union members in vm_page.h to store the zone and slab.  Remove some nearby
dead code.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22564
2019-11-28 07:49:25 +00:00
Ryan Libby 35ec24f362 uma: move sysctl vm.uma defn out from under INVARIANTS
Fix non-INVARIANTS builds after r355149.

Reported by:	Michael Butler <imb@protected-networks.net>
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22588
2019-11-28 04:15:16 +00:00
Jeff Roberson 20a4e15451 Implement a sysctl tree for uma zones to assist in debugging and provide
more statistics than are exported via the ABI-stable vmstat interface.
Rename uz_count to uz_bucket_size because even I was confused by the
name after returning to the source years later.

Reviewed by:	rlibby
Differential Revision:	https://reviews.freebsd.org/D22554
2019-11-28 00:19:09 +00:00
Jeff Roberson 0a81b4395e Refactor uma_zfree_arg into several functions to make control flow
clearer and icache usage cleaner.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22491
2019-11-27 23:19:06 +00:00
Ryan Libby ca293436d1 uma: trash memory when ctor/dtor supplied too
On INVARIANTS kernels, UMA has a use-after-free detection mechanism.
This mechanism previously required that all of the ctor/dtor/uminit/fini
arguments to uma_zcreate() be NULL in order to function.  Now it only
requires that uminit and fini be NULL; the trash ctor and dtor will
be called in addition to any supplied ctor or dtor.

Also do a little refactoring for readability of the resulting logic.

This enables use-after-free detection for more zones, and will allow for
simplification of some callers that worked around the previous
restriction (see kern_mbuf.c).

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20722
2019-11-27 19:49:55 +00:00
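
A sketch of the combined hook (the poison pattern and names are
illustrative): the trash check validates free-time poisoning before any
user ctor runs, so detection no longer requires a zone with no ctor/dtor.

#include <assert.h>
#include <stdint.h>

#define	POISON	0xdeadc0deU		/* illustrative poison word */

static void
trash_check(void *mem, int size)
{
	uint32_t *p = mem;
	for (int i = 0; i < size / 4; i++)
		assert(p[i] == POISON);	/* else: write after free */
}

static int
item_ctor(void *mem, int size, int (*user_ctor)(void *, int))
{
	trash_check(mem, size);			/* trash ctor always runs */
	if (user_ctor != NULL)
		return (user_ctor(mem, size));	/* then the zone's own ctor */
	return (0);
}
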
Jeff Roberson beb8beef81 Refactor uma_zalloc_arg(). It is a mess of gotos and code which doesn't
make sense after many partial refactors.  Attempt to make a smaller cache
footprint for the fast path.

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D22470
2019-11-26 22:17:02 +00:00
Mark Johnston 003cf08ba9 Revise the page cache size policy.
In r353734 the use of the page caches was limited to systems with a
relatively large amount of RAM per CPU.  This was to mitigate some
reported issues where the system was unable to keep up with memory pressure
in cases where it had been able to do so prior to the addition of the
direct free pool cache.  This change re-enables those caches.

The change modifies uma_zone_set_maxcache(), which was introduced
specifically for the page cache zones.  Rather than using it to limit
only the full bucket cache, have it also set uz_count_max to provide an
upper bound on the per-CPU cache size that is consistent with the number
of items requested.  Remove its return value since it has no use.

Enable the page cache zones unconditionally, and limit them to 0.1% of
the domain's pages.  The limit can be overridden by the
vm.pgcache_zone_max tunable as before.

Change the item size parameter passed to uma_zcache_create() to the
correct size, and stop setting UMA_ZONE_MAXBUCKET.  This allows the page
cache buckets to be adaptively sized, like the rest of UMA's caches.
This also causes the initial bucket size to be small, so only systems
which benefit from large caches will get them.

Reviewed by:	gallatin, jeff
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22393
2019-11-22 16:30:47 +00:00
Jeff Roberson 71353f7a2f When we set OFFPAGE to limit fragmentation, we should also set VTOSLAB
so that we avoid the hash tables.  The hash table is now only required if
a zone is created with OFFPAGE specified initially, not internally.  This
flag signals to UMA that it can't touch the allocated memory and so
can't store a slab pointer in the containing page.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22453
2019-11-20 01:57:33 +00:00
Konstantin Belousov 08034d1006 Include cache zones into zone_foreach() where appropriate.
Revision r354367 is reverted since it is subsumed by this more complete
approach.

Suggested by:	markj
Reviewed by:	alc, glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22242
2019-11-10 09:25:19 +00:00
Konstantin Belousov 432fc36da1 Switch cache zones from early counters to real implementation.
The early counter mock can be used only on the BSP for amd64; when APs try
to update it, that causes random memory corruption.

N.B.  This is a temporary patch to plug the corruption for now, while
a proper solution for handling cache zones in zone_foreach() is being
developed.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation, Mellanox Technologies
2019-11-05 21:38:48 +00:00
Mark Johnston 1de9724e55 Avoid reloading bucket pointers in uma_vm_zone_stats().
The correctness of per-CPU cache accounting in that function is
dependent on reading per-CPU pointers exactly once.  Ensure that
the compiler does not emit multiple loads of those pointers.

Reported and tested by:	pho
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22081
2019-10-22 14:20:06 +00:00
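
The defensive idiom, sketched with C11 atomics in place of the kernel's
atomic_load_ptr(): snapshot the racy per-CPU pointer into a local exactly
once, then use only the local, so the compiler cannot legally re-read a
pointer that another CPU may have swapped in between.

#include <stdatomic.h>
#include <stddef.h>

struct bucket { int b_cnt; };

struct cpu_cache {
	_Atomic(struct bucket *) cc_bucket;	/* updated by other CPUs */
};

static int
cache_count(struct cpu_cache *cc)
{
	/* Exactly one load; never dereference cc->cc_bucket twice. */
	struct bucket *b = atomic_load_explicit(&cc->cc_bucket,
	    memory_order_relaxed);
	return (b != NULL ? b->b_cnt : 0);
}
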
Conrad Meyer 0223790f8d Fix braino in r353429
cy@ points out that I got parameter order backwards between the definition
and invocation of the helper function.  He is totally correct.  The earlier
version of this patch predated the XFree column, so this is a bug I
introduced rather than one from the original author.

Submitted by:	cy
Reported by:	cy
X-MFC-With:	r353429
2019-10-11 06:02:03 +00:00
Conrad Meyer 46d70077be ddb: Add CSV option, sorting to 'show (malloc|uma)'
Add /i option for machine-parseable CSV output.  This allows ready
copy/pasting into more sophisticated tooling outside of DDB.

Add total zone size ("Memory Use") as a new column for UMA.

For both, sort the displayed list on size (print the largest zones/types
first).  This is handy for quickly diagnosing "where has my memory gone?" at
a high level.

Submitted by:	Emily Pettigrew <Emily.Pettigrew AT isilon.com> (earlier version)
Sponsored by:	Dell EMC Isilon
2019-10-11 01:31:31 +00:00
Mark Johnston 08cfa56ea3 Extend uma_reclaim() to permit different reclamation targets.
The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure.  This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.

With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets.  Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure.  In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim().  When trimming, only
items in excess of the estimate are reclaimed.  Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.

Now, when under memory pressure, the page daemon will trim zones
rather than draining them.  As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.

Reviewed by:	jeff
Tested by:	pho (an earlier version)
MFC after:	2 months
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D16667
2019-09-01 22:22:43 +00:00
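
The resulting interface in outline (the verb names are from the commit;
the numeric values here are placeholders): callers choose how aggressive
reclamation should be, and the page daemon now asks for a trim.

#define	UMA_RECLAIM_TRIM	1	/* reclaim only excess over estimate */
#define	UMA_RECLAIM_DRAIN	2	/* empty the full-bucket caches */
#define	UMA_RECLAIM_DRAIN_CPU	3	/* also flush per-CPU caches */

void	uma_reclaim(int req);

static void
pagedaemon_reclaim_sketch(void)
{
	/* Trim under routine pressure; heavily used zones keep their
	 * working-set estimate of cached buckets. */
	uma_reclaim(UMA_RECLAIM_TRIM);
}
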
Mark Johnston b48d4efe75 Handle UMA_ANYDOMAIN in kstack_import().
The kernel thread stack zone performs first-touch allocations by
default, and must handle the case where the local memory domain
is empty.  For most UMA zones this is handled in the keg layer,
but cache zones currently must implement a policy for this case.
Simply use a round-robin policy if UMA_ANYDOMAIN is passed.

Reported and tested by:	bcran
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
2019-08-25 21:14:46 +00:00
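
A sketch of the fallback policy (the rotor is illustrative): a cache
zone's import function may be handed UMA_ANYDOMAIN, and then simply
rotates through the domains rather than insisting on a local, possibly
empty, one.

#include <stdatomic.h>

#define	UMA_ANYDOMAIN	(-1)		/* no NUMA domain preference */

static atomic_uint rr_rotor;

static int
pick_domain(int domain, int ndomains)
{
	if (domain == UMA_ANYDOMAIN)	/* round-robin fallback */
		return ((int)(atomic_fetch_add(&rr_rotor, 1) % ndomains));
	return (domain);
}
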
Jeff Roberson eda1b01647 Implement a MINBUCKET zone flag so we can use minimal caching on zones that
may be expensive to cache.

Reviewed by:	markj, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20930
2019-08-06 23:04:59 +00:00