Commit graph

2825 commits

Author SHA1 Message Date
Alan Cox 9648f3447d Clean up the start of vm_page_alloc(). In particular, eliminate an
assertion that is no longer required.  Long ago, calls to vm_page_alloc()
from an interrupt handler had to specify VM_ALLOC_INTERRUPT so that
vm_page_alloc() would not attempt to reclaim a PQ_CACHE page from another vm
object.  Today, with the synchronization on a vm object's collection of
PQ_CACHE pages, this is no longer an issue.  In fact, VM_ALLOC_INTERRUPT now
reclaims PQ_CACHE pages just like VM_ALLOC_{NORMAL,SYSTEM}.

MFC after:	3 weeks
2011-01-16 17:33:34 +00:00
Konstantin Belousov c6c9025b65 For consistency, use kernel_object instead of &kernel_object_store
when initializing the object mutex. Do the same for kmem_object.

Discussed with:	alc
MFC after:	1 week
2011-01-15 21:56:38 +00:00
Alan Cox ff5958e785 For some time now, the kernel and kmem objects have been ordinary
OBJT_PHYS objects.  Thus, there is no need for handling them specially
in vm_fault().  In fact, this special case handling would have led to
an assertion failure just before the call to pmap_enter().

Reviewed by:	kib@
MFC after:	6 weeks
2011-01-15 19:21:28 +00:00
John Baldwin 58ccf5b41c Remove unneeded includes of <sys/linker_set.h>. Other headers that use
it internally contain nested includes.

Reviewed by:	bde
2011-01-11 13:59:06 +00:00
Konstantin Belousov 50a57dfbec Move repeated MAXSLP definition from machine/vmparam.h to sys/vmmeter.h.
Update the outdated comments describing MAXSLP and the process
selection algorithm for swap out.

Comments wording and reviewed by:	alc
2011-01-09 12:50:44 +00:00
Alan Cox 27772ddf45 Eliminate a redundant alignment directive on the page locks array. 2011-01-09 04:34:02 +00:00
Alan Cox ce8a13bdb9 Eliminate the counting of vm_page_pa_tryrelock calls. We really don't
need it anymore.  Moreover, its implementation had a type mismatch, a
long is not necessarily an uint64_t.  (This mismatch was hidden by
casting.)  Move the remaining two counters up a level in the sysctl
hierarchy.  There is no reason for them to be under the vm.pmap node.

Reviewed by:	kib
2011-01-08 22:45:22 +00:00
Alan Cox 17f6a17bf7 Release the page lock early in vm_pageout_clean(). There is no reason to
hold this lock until the end of the function.

With the aforementioned change to vm_pageout_clean(), page locks don't need
to support recursive (MTX_RECURSE) or duplicate (MTX_DUPOK) acquisitions.

Reviewed by:	kib
2011-01-03 00:41:56 +00:00
Alan Cox edf93b25d3 Make a couple refinements to r216799 and r216810. In particular, revise
a comment and move it to its proper place.

Reviewed by:	kib
2011-01-01 17:39:38 +00:00
Rebecca Cran 4c18dec9a9 There can be more than 0x20000000 swap meta blocks allocated if a swap-backed
md(4) device is used. Don't panic when deallocating such a device if swap
has been used.

PR:	kern/133170
Discussed with:	kib
MFC after:	3 days
2011-01-01 16:59:05 +00:00
Konstantin Belousov 50cfe7fa50 Remove OBJ_CLEANING flag. The vfs_setdirty_locked_object() is the only
consumer of the flag, and it used the flag because OBJ_MIGHTBEDIRTY
was cleared early in vm_object_page_clean, before the cleaning pass
was done. This is no longer true after r216799.

 Moreover, since OBJ_CLEANING is a flag, and not the counter, it could
be reset too prematurely when parallel vm_object_page_clean() are
performed.

Reviewed by:	alc (as a part of the bigger patch)
MFC after:	1 month (after r216799 is merged)
2010-12-29 22:26:49 +00:00
Alan Cox fef87167c9 There is no point in vm_contig_launder{,_page}() flushing held pages,
instead skip over them.  As long as a page is held, it can't be reclaimed by
contigmalloc(M_WAITOK).  Moreover, a held page may be undergoing
modification, e.g., vmapbuf(), so even if the hold were released before the
completion of contigmalloc(), the page might have to be flushed again.

MFC after:	3 weeks
2010-12-29 20:35:36 +00:00
Konstantin Belousov 3280870dca Move the increment of vm object generation count into
vm_object_set_writeable_dirty().

Fix an issue where restart of the scan in vm_object_page_clean() did
not removed write permissions for newly added pages or, if the mapping
for some already scanned page changed to writeable due to fault.
Merge the two loops in vm_object_page_clean(), doing the remove of
write permission and cleaning in the same loop. The restart of the
loop then correctly downgrade writeable mappings.

Fix an issue where a second caller to msync() might actually return
before the first caller had actually completed flushing the
pages. Clear the OBJ_MIGHTBEDIRTY flag after the cleaning loop, not
before.

Calls to pmap_is_modified() are not needed after pmap_remove_write()
there.

Proposed, reviewed and tested by:	alc
MFC after:	1 week
2010-12-29 12:53:53 +00:00
Alan Cox a5dbab5444 Correct a typo in vm_fault_quick_hold_pages().
Reported by:	Bartosz Stec
2010-12-28 20:02:30 +00:00
Alan Cox 4de2261903 Move vm_object_print()'s prototype to the expected place. 2010-12-27 07:12:22 +00:00
Alan Cox 0b47b37621 Retire vm_fault_quick(). It's no longer used.
Reviewed by:	kib@
2010-12-25 23:54:50 +00:00
Alan Cox 82de724fe1 Introduce and use a new VM interface for temporarily pinning pages. This
new interface replaces the combined use of vm_fault_quick() and
pmap_extract_and_hold() throughout the kernel.

In collaboration with:	kib@
2010-12-25 21:26:56 +00:00
Alan Cox acd11c7499 Introduce vm_fault_hold() and use it to (1) eliminate a long-standing race
condition in proc_rwmem() and to (2) simplify the implementation of the
cxgb driver's vm_fault_hold_user_pages().  Specifically, in proc_rwmem()
the requested read or write could fail because the targeted page could be
reclaimed between the calls to vm_fault() and vm_page_hold().

In collaboration with:	kib@
MFC after:	6 weeks
2010-12-20 22:49:31 +00:00
Alan Cox 8c22654d7e Implement and use a single optimized function for unholding a set of pages.
Reviewed by:	kib@
2010-12-17 22:41:22 +00:00
Alan Cox 7984ab250b Change memguard_fudge() so that it can handle km_max being zero. Not
every platform defines VM_KMEM_SIZE_MAX, and on those platforms km_max
will be zero.

Reviewed by:	mdf
Tested by:	marius
2010-12-14 05:47:35 +00:00
Max Laier a5db445da4 Fix a long standing (from the original 4.4BSD lite sources) race between
vmspace_fork and vm_map_wire that would lead to "vm_fault_copy_wired: page
missing" panics.  While faulting in pages for a map entry that is being
wired down, mark the containing map as busy.  In vmspace_fork wait until the
map is unbusy, before we try to copy the entries.

Reviewed by:	kib
MFC after:	5 days
Sponsored by:	Isilon Systems, Inc.
2010-12-09 21:02:22 +00:00
Jayachandran C. 48772ca4aa Revert the vm/vm_page.c change in r216317.
This adds back changes in r216141, which was reverted by the above
check in.
2010-12-09 07:39:06 +00:00
Jayachandran C. aa93efedd8 swi_vm() for mips. 2010-12-09 06:54:06 +00:00
Edward Tomasz Napierala a2f510e8ec Fix comment intentation. 2010-12-04 17:41:58 +00:00
Warner Losh 6f1a8765be To make minidumps work properly on mips for memory that's direct
mapped and entered via vm_page_setup, keep track of it like we do
for amd64.

# A separate commit will be made to move this to a capability-based ifdef
# rather than arch-based ifdef.

Submitted by:	alc@
MFC after:	1 week
2010-12-03 04:39:48 +00:00
Edward Tomasz Napierala ef694c1ac4 Replace pointer to "struct uidinfo" with pointer to "struct ucred"
in "struct vm_object".  This is required to make it possible to account
for per-jail swap usage.

Reviewed by:	kib@
Tested by:	pho@
Sponsored by:	FreeBSD Foundation
2010-12-02 17:37:16 +00:00
Alan Cox 05cb58f669 Correct an error in the allocation of the vm_page_dump array in
vm_page_startup().  Specifically, the dump_avail array should be used
instead of the phys_avail array to calculate the size of vm_page_dump.  For
example, the pages for the message buffer are allocated prior to
vm_page_startup() by subtracting them from the last entry in the phys_avail
array, but the first thing that vm_page_startup() does after creating the
vm_page_dump array is to set the bits corresponding to the message buffer
pages in that array.  However, these bits might not actually exist in the
array, because the size of the array is determined by the current value in
the last entry of the phys_avail array.  In general, the only reason why
this doesn't always result in an out-of-bounds array access is that the size
of the vm_page_dump array is rounded up to the next page boundary.  This
change eliminates that dependence on rounding (and luck).

MFC after:	6 weeks
2010-12-01 03:35:19 +00:00
Jayachandran C. aa54636620 Fix issue noted by alc while reviewing r215938:
The current implementation of vm_page_alloc_freelist() does not handle
order > 0 correctly. Remove order parameter to the function and use it
only for order 0 pages.

Submitted by:	alc
2010-11-28 05:51:31 +00:00
Konstantin Belousov 780636b72a After the sleep caused by encountering a busy page, relookup the page.
Submitted and reviewed by:	alc
Reprted and tested by:	pho
MFC after:	5 days
2010-11-24 12:25:17 +00:00
Konstantin Belousov 3157c50313 Eliminate the mab, maf arrays and related variables.
The change also fixes off-by-one error in the calculation of mreq.

Suggested and reviewed by:	alc
Tested by:	pho
MFC after:	5 days
2010-11-21 10:18:28 +00:00
Alan Cox 17ea6f00d5 Optimize vm_object_terminate().
Reviewed by:	kib
MFC after:	1 week
2010-11-20 22:30:09 +00:00
Konstantin Belousov 4c7b9a2063 The runlen returned from vm_pageout_flush() might be zero legitimately,
when mreq page has status VM_PAGER_AGAIN.

MFC after:	5 days
2010-11-20 17:27:38 +00:00
Alan Cox 00f8bffc22 Reduce the amount of detail printed by vm_page_free_toq() when it panics.
Reviewed by:	kib
2010-11-19 17:49:08 +00:00
Max Laier 85f2a0c91e Off by one page in vm_reserv_reclaim_contig(): Also reclaim reservations
with only a single free page if that satisfies the requested size.

MFC after:	3 days
Reviewed by:	alc
2010-11-19 04:30:33 +00:00
Konstantin Belousov 1e8a675c73 vm_pageout_flush() might cache the pages that finished write to the
backing storage. Such pages might be then reused, racing with the
assert in vm_object_page_collect_flush() that verified that dirty
pages from the run (most likely, pages with VM_PAGER_AGAIN status) are
write-protected still. In fact, the page indexes for the pages that
were removed from the object page list should be ignored by
vm_object_page_clean().

Return the length of successfully written run from vm_pageout_flush(),
that is, the count of pages between requested page and first page
after requested with status VM_PAGER_AGAIN. Supply the requested page
index in the array to vm_pageout_flush(). Use the returned run length
to forward the index of next page to clean in vm_object_page_clean().

Reported by:	avg
Reviewed by:	alc
MFC after:	1 week
2010-11-18 21:09:02 +00:00
Konstantin Belousov 4166faaee0 Only increment object generation count when inserting the page into
object page list.  The only use of object generation count now is a
restart of the scan in vm_object_page_clean(), which makes sense to do
on the page addition. Page removals do not affect the dirtiness of the
object, as well as manipulations with the shadow chain.

Suggested and reviewed by:	alc
MFC after:    1 week
2010-11-18 20:46:28 +00:00
Konstantin Belousov 7022f954c3 Do not use __FreeBSD_version prefix for the special osrel version.
The ports/Mk/bsd.port.mk uses sys/param.h to fetch osrel, and cannot
grok several constants with the prefix.

Reported and tested by:	    swell.k gmail com
MFC after:   1 week
2010-11-14 21:59:11 +00:00
Konstantin Belousov 94bce4535d Use symbolic names instead of hardcoding values for magic p_osrel constants.
MFC after:   1 week
2010-11-14 18:24:12 +00:00
Konstantin Belousov 9a6d144ff8 Implement a (soft) stack guard page for auto-growing stack mappings.
The unmapped page separates the tip of the stack and possible adjanced
segment, making some uses of stack overflow harder.  The stack growing
code refuses to expand the segment to the last page of the reseved
region when sysctl security.bsd.stack_guard_page is set to 1. The
default value for sysctl and accompanying tunable is 0.

Please note that mmap(MAP_FIXED) still can place a mapping right up to
the stack, making continuous region.

Reviewed by:	alc
MFC after:	1 week
2010-11-14 17:53:52 +00:00
Alan Cox 2cf36c8f67 Enable reservation-based physical memory allocation. Even without the
creation of large page mappings in the pmap, it can provide modest
performance benefits.  In particular, for a "buildworld" on a 2x 1GHz
Ultrasparc IIIi it reduced the wall clock time by 2.2% and the system
time by 12.6%.

Tested by:	marius@
2010-11-10 17:57:34 +00:00
Alan Cox e48262487a In case the stack size reaches its limit and its growth must be restricted,
ensure that grow_amount is a multiple of the page size.  Otherwise, the
kernel may crash in swap_reserve_by_uid() on HEAD and FreeBSD 8.x, and
produce a core file with a missing stack on FreeBSD 7.x.

Diagnosed and reported by: jilles
Reviewed by:	kib
MFC after:	1 week
2010-11-07 21:40:34 +00:00
Oleksandr Tymoshenko 903ba3da86 - Add minidump support for FreeBSD/mips 2010-11-07 03:09:02 +00:00
John Baldwin e9a069d8af Update startup_alloc() to support multi-page allocations and allow internal
zones whose objects are larger than a page to use startup_alloc().  This
allows allocation of zone objects during early boot on machines with a large
number of CPUs since the resulting zone objects are larger than a page.

Submitted by:	trema
Reviewed by:	attilio
MFC after:	1 week
2010-11-04 15:33:50 +00:00
Alan Cox d689bc0082 Correct some format strings used by sysctls.
MFC after:	1 week
2010-10-30 18:00:53 +00:00
John Baldwin 1a587ef2a5 - Make 'vm_refcnt' volatile so that compilers won't be tempted to treat
its value as a loop invariant.  Currently this is a no-op because
  'atomic_cmpset_int()' clobbers all memory on current architectures.
- Use atomic_fetchadd_int() instead of an atomic_cmpset_int() loop to drop
  a reference in vmspace_free().

Reviewed by:	alc
MFC after:	1 month
2010-10-21 17:29:32 +00:00
Andriy Gapon 55144670c2 PG_BUSY -> VPO_BUSY, PG_WANTED -> VPO_WANTED in manual pages and comments
Reviewed by:	alc
MFC after:	4 days
2010-10-20 05:17:23 +00:00
Matthew D Fleming 20ed0cb0c6 uma_zfree(zone, NULL) should do nothing, to match free(9).
Noticed by:	Ron Steinke <rsteinke at isilon dot com>
MFC after:	3 days
2010-10-19 16:06:00 +00:00
Lawrence Stewart 1c6cae9711 Change uma_zone_set_max to return the effective value of "nitems" after
rounding. The same value can also be obtained with uma_zone_get_max, but this
change avoids a caller having to make two back-to-back calls.

Sponsored by:	FreeBSD Foundation
Reviewed by:	gnn, jhb
2010-10-16 04:41:45 +00:00
Lawrence Stewart c4ae7908a7 - Simplify implementation of uma_zone_get_max.
- Add uma_zone_get_cur which returns the current approximate occupancy of
  a zone. This is useful for providing stats via sysctl amongst other things.

Sponsored by:	FreeBSD Foundation
Reviewed by:	gnn, jhb
MFC after:	2 weeks
2010-10-16 04:14:45 +00:00
Alan Cox f8616ebfae If vm_map_find() is asked to allocate a superpage-aligned region of virtual
addresses that is greater than a superpage in size but not a multiple of
the superpage size, then vm_map_find() is not always expanding the kernel
pmap to support the last few small pages being allocated.  These failures
are not commonplace, so this was first noticed by someone porting FreeBSD
to a new architecture.  Previously, we grew the kernel page table in
vm_map_findspace() when we found the first available virtual address.
This works most of the time because we always grow the kernel pmap or page
table by an amount that is a multiple of the superpage size.  Now, instead,
we defer the call to pmap_growkernel() until we are committed to a range
of virtual addresses in vm_map_insert().  In general, there is another
reason to prefer calling pmap_growkernel() in vm_map_insert().  It makes
it possible for someone to do the equivalent of an mmap(MAP_FIXED) on the
kernel map.

Reported by:	Svatopluk Kraus
Reviewed by:	kib@
MFC after:	3 weeks
2010-10-04 16:49:40 +00:00
Matthew D Fleming d69b01efc2 Replace an XXX comment with the appropriate code.
Submitted by:	alc
2010-09-20 20:41:59 +00:00
Alan Cox da0483096d Allow a POSIX shared memory object that is opened for read but not for
write to nonetheless be mapped PROT_WRITE and MAP_PRIVATE, i.e.,
copy-on-write.

(This is a regression in the new implementation of POSIX shared memory
objects that is used by HEAD and RELENG_8.  This bug does not exist in
RELENG_7's user-level, file-based implementation.)

PR:		150260
MFC after:	3 weeks
2010-09-19 19:42:04 +00:00
Alan Cox 8304adaac6 Make refinements to r212824. In particular, don't make
vm_map_unlock_nodefer() part of the synchronization interface for maps.

Add comments to vm_map_unlock_and_wait() and vm_map_wakeup() describing
how they should be used.  In particular, describe the deferred deallocations
issue with vm_map_unlock_and_wait().

Redo the implementation of vm_map_unlock_and_wait() so that it passes
along the caller's file and line information, just like the other map
locking primitives.

Reviewed by:	kib
X-MFC after:	r212824
2010-09-19 17:43:22 +00:00
Konstantin Belousov 0b367bd8c0 Adopt the deferring of object deallocation for the deleted map entries
on map unlock to the lock downgrade and later read unlock operation.

System map entries cannot be backed by OBJT_VNODE objects, no need to
defer deallocation for them. Map entries from user maps do not require
the owner map for deallocation, and can be accumulated in the
thread-local list for freeing when a user map is unlocked.

Move the collection of entries for deferred reclamation into
vm_map_delete(). Create helper vm_map_process_deferred(), that is
called from locations where processing is feasible. Do not process
deferred entries in vm_map_unlock_and_wait() since map_sleep_mtx is
held.

Reviewed by:	alc, rstone (previous versions)
Tested by:	pho
MFC after:	2 weeks
2010-09-18 15:03:31 +00:00
Matthew D Fleming 4e6571599b Re-add r212370 now that the LOR in powerpc64 has been resolved:
Add a drain function for struct sysctl_req, and use it for a variety
of handlers, some of which had to do awkward things to get a large
enough SBUF_FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing
NUL byte.  This behaviour was preserved, though it should not be
necessary.

Reviewed by:    phk (original patch)
2010-09-16 16:13:12 +00:00
Matthew D Fleming 404a593e28 Revert r212370, as it causes a LOR on powerpc. powerpc does a few
unexpected things in copyout(9) and so wiring the user buffer is not
sufficient to perform a copyout(9) while holding a random mutex.

Requested by: nwhitehorn
2010-09-13 18:48:23 +00:00
Matthew D Fleming dd67e2103c Add a drain function for struct sysctl_req, and use it for a variety of
handlers, some of which had to do awkward things to get a large enough
FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing NUL
byte.  This behaviour was preserved, though it should not be necessary.

Reviewed by:	phk
2010-09-09 18:33:46 +00:00
Nathan Whitehorn 42768fec0f On architectures with non-tree-based page tables like PowerPC, every page
in a range must be checked when calling pmap_remove(). Calling
pmap_remove() from vm_pageout_map_deactivate_pages() with the entire range
of the map could result in attempting to demap an extraordinary number
of pages (> 10^15), so iterate through each map entry and unmap each of
them individually.

MFC after:	6 weeks
2010-09-09 13:32:58 +00:00
Ryan Stone d473d3a164 Fix a typo in r212281. uintptr -> uintptr_t
Pointy hat to:  rstone

Approved by:    emaste (mentor)
MFC after:      2 weeks
2010-09-07 02:51:11 +00:00
Ryan Stone 0d41964095 In munmap() downgrade the vm_map_lock to a read lock before taking a read
lock on the pmc-sx lock.  This prevents a deadlock with
pmc_log_process_mappings, which has an exclusive lock on pmc-sx and tries
to get a read lock on a vm_map.  Downgrading the vm_map_lock in munmap
allows pmc_log_process_mappings to continue, preventing the deadlock.

Without this change I could cause a deadlock on a multicore 8.1-RELEASE
system by having one thread constantly mmap'ing and then munmap'ing a
PROT_EXEC mapping in a loop while I repeatedly invoked and stopped pmcstat
in system-wide sampling mode.

Reviewed by:	fabient
Approved by:	emaste (mentor)
MFC after:	2 weeks
2010-09-07 00:23:45 +00:00
Andriy Gapon a9b89cf1c1 vm_page.c: include opt_msgbuf.h for MSGBUF_SIZE use in vm_page_startup
vm_page_startup uses MSGBUF_SIZE value for adding msgbuf pages to minidump.
If opt_msgbuf.h is not included and MSGBUF_SIZE is overriden in kernel
config, then not all msgbuf pages will be dumped.  And most importantly,
struct msgbuf itself will not be included.  Thus the dump would look
corrupted/incomplete to tools like kgdb, dmesg, etc that try to access
struct msgbuf as one of the first things they do when working on a crash
dump.

MFC after:	5 days
2010-09-03 10:40:53 +00:00
Matthew D Fleming a2a200a24d Have memguard(9) crash with an easier-to-debug message on double-free.
Reviewed by:    zml
MFC after:      3 weeks
2010-08-31 17:43:47 +00:00
Matthew D Fleming 6d3ed393d6 The realloc case for memguard(9) will copy too many bytes when
reallocating to a smaller-sized allocation.  Fix this issue.

Noticed by:     alc
Reviewed by:    alc
Approved by:    zml (mentor)
MFC after:      3 weeks
2010-08-31 16:57:58 +00:00
Alan Cox 74ffb9af15 Add the MAP_PREFAULT_READ option to mmap(2).
Reviewed by:	jhb, kib
2010-08-28 16:57:07 +00:00
Andre Oppermann e49471b04b Add uma_zone_get_max() to obtain the effective limit after a call
to uma_zone_set_max().

The UMA zone limit is not exactly set to the value supplied but
rounded up to completely fill the backing store increment (a page
normally).  This can lead to surprising situations where the number
of elements allocated from UMA is higher than the supplied limit
value.  The new get function reads back the effective value so that
the supplied limit value can be adjusted to the real limit.

Reviewed by:	jeffr
MFC after:	1 week
2010-08-16 14:24:00 +00:00
Matthew D Fleming f02d86e269 Fix compile. It seemed better to have memguard.c include opt_vm.h in
case future compile-time knobs were added that it wants to use.
Also add include guards and forward declarations to vm/memguard.h.

Approved by:    zml (mentor)
MFC after:      1 month
2010-08-12 16:54:43 +00:00
Matthew D Fleming e3813573bd Rework memguard(9) to reserve significantly more KVA to detect
use-after-free over a longer time.  Also release the backing pages of
a guarded allocation at free(9) time to reduce the overhead of using
memguard(9).  Allow setting and varying the malloc type at run-time.
Add knobs to allow:

 - randomly guarding memory
 - adding un-backed KVA guard pages to detect underflow and overflow
 - a lower limit on the size of allocations that are guarded

Reviewed by:    alc
Reviewed by:    brueffer, Ulrich Spörlein <uqs spoerlein net> (man page)
Silence from:   -arch
Approved by:    zml (mentor)
MFC after:      1 month
2010-08-11 22:10:37 +00:00
Konstantin Belousov 3979450b4c Add new make_dev_p(9) flag MAKEDEV_ETERNAL to inform devfs that created
cdev will never be destroyed. Propagate the flag to devfs vnodes as
VV_ETERNVALDEV. Use the flags to avoid acquiring devmtx and taking a
thread reference on such nodes.

In collaboration with:	pho
MFC after:	1 month
2010-08-06 09:42:15 +00:00
John Baldwin a3870a1826 Very rough first cut at NUMA support for the physical page allocator. For
now it uses a very dumb first-touch allocation policy.  This will change in
the future.
- Each architecture indicates the maximum number of supported memory domains
  via a new VM_NDOMAIN parameter in <machine/vmparam.h>.
- Each cpu now has a PCPU_GET(domain) member to indicate the memory domain
  a CPU belongs to.  Domain values are dense and numbered from 0.
- When a platform supports multiple domains, the default freelist
  (VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain.
  The MD code is required to populate an array of mem_affinity structures.
  Each entry in the array defines a range of memory (start and end) and a
  domain for the range.  Multiple entries may be present for a single
  domain.  The list is terminated by an entry where all fields are zero.
  This array of structures is used to split up phys_avail[] regions that
  fall in VM_FREELIST_DEFAULT into per-domain freelists.
- Each memory domain has a separate lookup-array of freelists that is
  used when fulfulling a physical memory allocation.  Right now the
  per-domain freelists are listed in a round-robin order for each domain.
  In the future a table such as the ACPI SLIT table may be used to order
  the per-domain lookup lists based on the penalty for each memory domain
  relative to a specific domain.  The lookup lists may be examined via a
  new vm.phys.lookup_lists sysctl.
- The first-touch policy is implemented by using PCPU_GET(domain) to
  pick a lookup list when allocating memory.

Reviewed by:	alc
2010-07-27 20:33:50 +00:00
Edward Tomasz Napierala fd6f4ffb27 Fix commented out resource limit check in mlockall(2). It's still racy,
but at least less misleading.
2010-07-27 19:26:18 +00:00
Alan Cox 2af6e14d39 Introduce exec_alloc_args(). The objective being to encapsulate the
details of the string buffer allocation in one place.

Eliminate the portion of the string buffer that was dedicated to storing
the interpreter name.  The pointer to the interpreter name can simply be
made to point to the appropriate argument string.

Reviewed by:	kib
2010-07-27 17:31:03 +00:00
Alan Cox 9e4e511499 Change the order in which the file name, arguments, environment, and
shell command are stored in exec*()'s demand-paged string buffer.  For
a "buildworld" on an 8GB amd64 multiprocessor, the new order reduces
the number of global TLB shootdowns by 31%.  It also eliminates about
330k page faults on the kernel address space.

Change exec_shell_imgact() to use "args->begin_argv" consistently as
the start of the argument and environment strings.  Previously, it
would sometimes use "args->buf", which is the start of the overall
buffer, but no longer the start of the argument and environment
strings.  While I'm here, eliminate unnecessary passing of "&length"
to copystr(), where we don't actually care about the length of the
copied string.

Clean up the initialization of the exec map.  In particular, use the
correct size for an entry, and express that size in the same way that
is used when an entry is allocated.  The old size was one page too
large.  (This discrepancy originated in 2004 when I rewrote
exec_map_first_page() to use sf_buf_alloc() instead of the exec map
for mapping the first page of the executable.)

Reviewed by:	kib
2010-07-25 17:43:38 +00:00
Jayachandran C. 49ca10d40c Redo the page table page allocation on MIPS, as suggested by
alc@.

The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.

This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards  will introduce
a race condtion(noted by alc@).

The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the  function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().

Reviewed by:	alc
2010-07-21 09:27:00 +00:00
Alan Cox b99348e5ea Add support for the VM_ALLOC_COUNT() hint to vm_page_alloc(). Consequently,
the maintenance of vm_pageout_deficit can be localized to just two places:
vm_page_alloc() and vm_pageout_scan().

This change also corrects an off-by-one error in the maintenance of
vm_pageout_deficit.  Historically, the buffer cache functions, allocbuf()
and vm_hold_load_pages(), have not taken into account that vm_page_alloc()
already increments vm_pageout_deficit by one.

Reviewed by:	kib
2010-07-09 19:38:30 +00:00
Konstantin Belousov 1d9e77f6bf Make VM_ALLOC_RETRY flag mandatory for vm_page_grab(). Assert that the
flag is always provided, and unconditionally retry after sleep for the
busy page or failed allocation.

The intent is to remove VM_ALLOC_RETRY eventually.

Proposed and reviewed by:	alc
2010-07-08 08:37:51 +00:00
Konstantin Belousov 5f195aa32e Add the ability for the allocflag argument of the vm_page_grab() to
specify the increment of vm_pageout_deficit when sleeping due to page
shortage. Then, in allocbuf(), the code to allocate pages when extending
vmio buffer can be replaced by a call to vm_page_grab().

Suggested and reviewed by:	alc
MFC after:	2 weeks
2010-07-05 21:13:32 +00:00
Konstantin Belousov 757216f3a5 Several cleanups for the r209686:
- remove unused defines;
- remove unused curgeneration argument for vm_object_page_collect_flush();
- always assert that vm_object_page_clean() is called for OBJT_VNODE;
- move vm_page_find_least() into for() statement initial clause.

Submitted by:	alc
2010-07-04 19:02:32 +00:00
Konstantin Belousov e239bb9730 Reimplement vm_object_page_clean(), using the fact that vm object memq
is ordered by page index. This greatly simplifies the implementation,
since we no longer need to mark the pages with VPO_CLEANCHK to denote
the progress. It is enough to remember the current position by index
before dropping the object lock.

Remove VPO_CLEANCHK and VM_PAGER_IGNORE_CLEANCHK as unused.
Garbage-collect vm.msync_flush_flags sysctl.

Suggested and reviewed by:	alc
Tested by:	pho
2010-07-04 11:26:56 +00:00
Konstantin Belousov b382c10a57 Introduce a helper function vm_page_find_least(). Use it in several places,
which inline the function.

Reviewed by:	alc
Tested by:	pho
MFC after:	1 week
2010-07-04 11:13:33 +00:00
Alan Cox b64400a03f Improve the comment and man page for vm_page_alloc(). Specifically,
document one of the optional flags; clarify which of the flags are
optional (and which are not), and remove mention of a restriction on
the reclamation of cached pages that no longer holds since version 7.

MFC after:	1 week
2010-07-03 18:25:37 +00:00
Alan Cox cdaba1f2be Push down the acquisition of the page queues lock into
vm_pageout_page_stats().  In particular, avoid acquiring the page
queues lock unless iterating over the active queue.
2010-07-02 20:56:22 +00:00
Alan Cox 4b0640310a Use vm_page_prev() instead of vm_page_lookup() in the implementation of
vm_fault()'s automatic delete-behind heuristic.
vm_page_prev() is typically faster.
2010-07-02 19:59:18 +00:00
Alan Cox 9cf5198832 With the demise of page coloring, the page queue macros no longer serve any
useful purpose.  Eliminate them.

Reviewed by:	kib
2010-07-02 15:02:51 +00:00
Alan Cox 95976f3f38 Simplify entry to vm_pageout_clean(). Expect the page to be locked.
Previously, the caller unlocked the page, and vm_pageout_clean()
immediately reacquired the page lock.  Also, assert rather than test
that the page is neither busy nor held.  Since vm_pageout_clean() is
called with the object and page locked, the page can't have changed
state since the caller verified that the page is neither busy nor
held.
2010-06-30 17:20:33 +00:00
Alan Cox 91b4f42767 Introduce vm_page_next() and vm_page_prev(), and use them in
vm_pageout_clean().  When iterating over a range of pages, these functions
can be cheaper than vm_page_lookup() because their implementation takes
advantage of the vm_object's memq being ordered.

Reviewed by:	kib@
MFC after:	3 weeks
2010-06-21 23:27:24 +00:00
Sean Bruno bf96595915 Add a new column to the output of vmstat -z to indicate the number
of times the system was forced to sleep when requesting a new allocation.

Expand the debugger hook, db_show_uma, to display these results as well.

This has proven to be very useful in out of memory situations when
it is not known why systems have become sluggish or fail in odd ways.

Reviewed by:	rwatson alc
Approved by:	scottl (mentor) peter
Obtained from:	Yahoo Inc.
2010-06-15 19:28:37 +00:00
Alan Cox 9ee2165f5d Eliminate checks for a page having a NULL object in vm_pageout_scan()
and vm_pageout_page_stats().  These checks were recently introduced by
the first page locking commit, r207410, but they are not needed.  At
the same time, eliminate some redundant accesses to the page's object
field.  (These accesses should have neen eliminated by r207410.)

Make the assertion in vm_page_flag_set() stricter.  Specifically, only
managed pages should have PG_WRITEABLE set.

Add a comment documenting an assertion to vm_page_flag_clear().

It has long been the case that fictitious pages have their wire count
permanently set to one.  Add comments to vm_page_wire() and
vm_page_unwire() documenting this.  Add assertions to these functions
as well.

Update the comment describing vm_page_unwire().  Much of the old
comment had little to do with vm_page_unwire(), but a lot to do with
_vm_page_deactivate().  Move relevant parts of the old comment to
_vm_page_deactivate().

Only pages that belong to an object can be paged out.  Therefore, it
is pointless for vm_page_unwire() to acquire the page queues lock and
enqueue such pages in one of the paging queues.  Generally speaking,
such pages are immediately freed after the call to vm_page_unwire().
Previously, it was the call to vm_page_free() that reacquired the page
queues lock and removed these pages from the paging queues.  Now, we
will never acquire the page queues lock for this case.  (It is also
worth noting that since both vm_page_unwire() and vm_page_free()
occurred with the page locked, the page daemon never saw the page with
its object field set to NULL.)

Change the panic with vm_page_unwire() to provide a more precise message.

Reviewed by:	kib@
2010-06-14 19:54:19 +00:00
John Baldwin 3aa6d94e0c Update several places that iterate over CPUs to use CPU_FOREACH(). 2010-06-11 18:46:34 +00:00
Alan Cox ce18658792 Reduce the scope of the page queues lock and the number of
PG_REFERENCED changes in vm_pageout_object_deactivate_pages().
Simplify this function's inner loop using TAILQ_FOREACH(), and shorten
some of its overly long lines.  Update a stale comment.

Assert that PG_REFERENCED may be cleared only if the object containing
the page is locked.  Add a comment documenting this.

Assert that a caller to vm_page_requeue() holds the page queues lock,
and assert that the page is on a page queue.

Push down the page queues lock into pmap_ts_referenced() and
pmap_page_exists_quick().  (As of now, there are no longer any pmap
functions that expect to be called with the page queues lock held.)

Neither pmap_ts_referenced() nor pmap_page_exists_quick() should ever
be passed an unmanaged page.  Assert this rather than returning "0"
and "FALSE" respectively.

ARM:

Simplify pmap_page_exists_quick() by switching to TAILQ_FOREACH().

Push down the page queues lock inside of pmap_clearbit(), simplifying
pmap_clear_modify(), pmap_clear_reference(), and pmap_remove_write().
Additionally, this allows for avoiding the acquisition of the page
queues lock in some cases.

PowerPC/AIM:

moea*_page_exits_quick() and moea*_page_wired_mappings() will never be
called before pmap initialization is complete.  Therefore, the check
for moea_initialized can be eliminated.

Push down the page queues lock inside of moea*_clear_bit(),
simplifying moea*_clear_modify() and moea*_clear_reference().

The last parameter to moea*_clear_bit() is never used.  Eliminate it.

PowerPC/BookE:

Simplify mmu_booke_page_exists_quick()'s control flow.

Reviewed by:	kib@
2010-06-10 16:56:35 +00:00
Jayachandran C. 17dca144a2 Make vm_contig_grow_cache() extern, and use it when vm_phys_alloc_contig()
fails to allocate MIPS page table pages.  The current usage of VM_WAIT in
case of vm_phys_alloc_contig() failure is not correct, because:

"There is no guarantee that any of the available free (or cached) pages
after the VM_WAIT will fall within the range of suitable physical
addresses.  Every time this function sleeps and a single page is freed
(or cached) by someone else, this function will be reawakened.  With
a little bad luck, you could spin indefinitely."

We also add low and high parameters to vm_contig_grow_cache() and
vm_contig_launder() so that we restrict vm_contig_launder() to the range
of pages we are interested in.

Reported by: alc

Reviewed by:	alc
Approved by:	rrs (mentor)
2010-06-04 06:35:36 +00:00
Konstantin Belousov 4d65036b4f Do not leak vm page lock in vm_contig_launder(), vm_pageout_page_lock()
always returns with the page locked.

Submitted by:	alc
Pointy hat to:	kib
2010-06-03 18:34:34 +00:00
Konstantin Belousov 2bbfbc3fe2 Add assertion and comment in vm_page_flag_set() describing the expectations
when the PG_WRITEABLE flag is set.

Reviewed by:	alc
2010-06-03 10:11:45 +00:00
Alan Cox f4e10cdaa6 Maintain the pretense that we support 32KB pages for the sake of the ia64
LINT build.
2010-06-03 02:24:53 +00:00
Alan Cox c8fa870982 Minimize the use of the page queues lock for synchronizing access to the
page's dirty field.  With the exception of one case, access to this field
is now synchronized by the object lock.
2010-06-02 15:46:37 +00:00
Alan Cox 8f0d5d3b9f When I pushed down the page queues lock into pmap_is_modified(), I created
an ordering dependence: A pmap operation that clears PG_WRITEABLE and calls
vm_page_dirty() must perform the call first.  Otherwise, pmap_is_modified()
could return FALSE without acquiring the page queues lock because the page
is not (currently) writeable, and the caller to pmap_is_modified() might
believe that the page's dirty field is clear because it has not seen the
effect of the vm_page_dirty() call.

When I pushed down the page queues lock into pmap_is_modified(), I
overlooked one place where this ordering dependence is violated:
pmap_enter().  In a rare situation pmap_enter() can be called to replace a
dirty mapping to one page with a mapping to another page.  (I say rare
because replacements generally occur as a result of a copy-on-write fault,
and so the old page is not dirty.)  This change delays clearing PG_WRITEABLE
until after vm_page_dirty() has been called.

Fixing the ordering dependency also makes it easy to introduce a small
optimization: When pmap_enter() used to replace a mapping to one page with a
mapping to another page, it freed the pv entry for the first mapping and
later called the pv entry allocator for the new mapping.  Now, pmap_enter()
attempts to recycle the old pv entry, saving two calls to the pv entry
allocator.

There is no point in setting PG_WRITEABLE on unmanaged pages, so don't.
Update a comment to reflect this.

Tidy up the variable declarations at the start of pmap_enter().
2010-05-29 17:10:45 +00:00
Alan Cox c46b90e90a Push down page queues lock acquisition in pmap_enter_object() and
pmap_is_referenced().  Eliminate the corresponding page queues lock
acquisitions from vm_map_pmap_enter() and mincore(), respectively.  In
mincore(), this allows some additional cases to complete without ever
acquiring the page queues lock.

Assert that the page is managed in pmap_is_referenced().

On powerpc/aim, push down the page queues lock acquisition from
moea*_is_modified() and moea*_is_referenced() into moea*_query_bit().
Again, this will allow some additional cases to complete without ever
acquiring the page queues lock.

Reorder a few statements in vm_page_dontneed() so that a race can't lead
to an old reference persisting.  This scenario is described in detail by a
comment.

Correct a spelling error in vm_page_dontneed().

Assert that the object is locked in vm_page_clear_dirty(), and restrict the
page queues lock assertion to just those cases in which the page is
currently writeable.

Add object locking to vnode_pager_generic_putpages().  This was the one
and only place where vm_page_clear_dirty() was being called without the
object being locked.

Eliminate an unnecessary vm_page_lock() around vnode_pager_setsize()'s call
to vm_page_clear_dirty().

Change vnode_pager_generic_putpages() to the modern-style of function
definition.  Also, change the name of one of the parameters to follow
virtual memory system naming conventions.

Reviewed by:	kib
2010-05-26 18:00:44 +00:00
Alan Cox e98d019d3c Eliminate the acquisition and release of the page queues lock from
vfs_busy_pages().  It is no longer needed.

Submitted by:	kib
2010-05-25 02:26:25 +00:00
Alan Cox 567e51e18c Roughly half of a typical pmap_mincore() implementation is machine-
independent code.  Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().

Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified().  Assert that these
functions are never passed an unmanaged page.

Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization.  Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.

Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean().  Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.

Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().

Reviewed by:	kib (an earlier version)
2010-05-24 14:26:57 +00:00
Konstantin Belousov a6e38685f3 When waiting for the busy page, do not unlock the object unless unlock
cannot be avoided.

Reviewed by:	alc
MFC after:	1 week
2010-05-20 08:51:01 +00:00
Alan Cox aa12e8b71d The page queues lock is no longer required by vm_page_set_invalid(), so
eliminate it.

Assert that the object containing the page is locked in
vm_page_test_dirty().  Perform some style clean up while I'm here.

Reviewed by:	kib
2010-05-18 16:40:29 +00:00