Commit graph

592 commits

Author SHA1 Message Date
Konstantin Belousov 4648ba0a0f Make mmap(MAP_STACK) search for the available address space, similar
to !MAP_STACK mapping requests.  For MAP_STACK | MAP_FIXED, clear any
mappings which could previously exist in the used range.

For this, teach vm_map_find() and vm_map_fixed() to handle
MAP_STACK_GROWS_DOWN or _UP cow flags, by calling a new
vm_map_stack_locked() helper, which is factored out from
vm_map_stack().

The side effect of the change is that MAP_STACK started obeying
MAP_ALIGNMENT and MAP_32BIT flags.

Reported by:	rwatson
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-06-09 03:37:41 +00:00
Alan Cox dd05fa1945 Add a page size field to struct vm_page. Increase the page size field when
a partially populated reservation becomes fully populated, and decrease this
field when a fully populated reservation becomes partially populated.

Use this field to simplify the implementation of pmap_enter_object() on
amd64, arm, and i386.

On all architectures where we support superpages, the cost of creating a
superpage mapping is roughly the same as creating a base page mapping.  For
example, both kinds of mappings entail the creation of a single PTE and PV
entry.  With this in mind, use the page size field to make the
implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little
smarter.  Previously, if MAP_PREFAULT_PARTIAL was specified to
vm_map_pmap_enter(), that function would only map base pages.  Now, it will
create up to 96 base page or superpage mappings.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-06-07 17:12:26 +00:00
Konstantin Belousov 5930251a9d Remove the assert which can be triggered by the userspace. The
situation checked by assert is verified to not take place in
vm_map_wire(), and protection permissions on the wired entry can be
revoked afterward.

Reported by:	markj
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-05-28 00:45:35 +00:00
Konstantin Belousov 7032434e98 When exec_new_vmspace() decides that current vmspace cannot be reused
on execve(2), it calls vmspace_exec(), which frees the current
vmspace.  The thread executing an exec syscall gets new vmspace
assigned, and old vmspace is freed if only referenced by the current
process.  The free operation includes pmap_release(), which
de-constructs the paging structures used by hardware.

If the calling process is multithreaded, other threads are suspended
in the thread_suspend_check(), and need to be unsuspended and run to
be able to exit on successfull exec.  Now, since the old vmspace is
destroyed, paging structures are invalid, threads are resumed on the
non-existent pmaps (page tables), which leads to triple fault on x86.

To fix, postpone the free of old vmspace until the threads are resumed
and exited.  To avoid modifications to all image activators all of
which use exec_new_vmspace(), memoize the current (old) vmspace in
kern_execve(), and notify it about the need to call vmspace_free()
with a thread-private flag TDP_EXECVMSPC.

http://bugs.debian.org/743141

Reported by:	Ivo De Decker <ivo.dedecker@ugent.be> through secteam
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-05-20 09:19:35 +00:00
Alan Cox afaa41f6b8 On a fork allow read-only wired pages to be copy-on-write shared between the
parent and child processes.  Previously, we copied these pages even though
they are read only.  However, the reason for copying them is historical and
no longer exists.  In recent times, vm_map_protect() has developed the
ability to copy pages when write access is added to wired copy-on-write
pages.  So, in this case, copy-on-write sharing of wired pages is not to be
feared.  It is not going to lead to copy-on-write faults on wired memory.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-13 13:20:23 +00:00
Alan Cox dd006a1b14 With the new-and-improved vm_fault_copy_entry() (r265843), we can always
avoid soft page faults when adding write access to user wired entries in
vm_map_protect().  Previously, we only avoided the soft page fault when
the underlying pages were copy-on-write.  In other words, we avoided the
pages faults that might sleep on page allocation, but not the trivial
page faults to update the physical map.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-11 17:41:29 +00:00
Alan Cox d9a9209abe About 9% of the pmap_protect() calls being performed by vm_map_copy_entry()
are unnecessary.  Eliminate the unnecessary calls.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-10 19:47:00 +00:00
Konstantin Belousov 44bbc3b77d When printing the map with the ddb 'show procvm' command, do not dump
page queues for the backing objects.  The queues are huge and clutter
the display, when mostly the map entries and its backing storage is
interesting.

The page queues can be seen with ddb 'show object' command.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-05-10 16:36:13 +00:00
Konstantin Belousov 3d95614f9d Print the entry address in addition to the object. The variable is
typically optimized out and debuggers cannot find its value.

Sponsored by:	    The FreeBSD Foundation
MFC after:	1 week
2014-05-10 16:30:48 +00:00
Bryan Drewery 44f1c91610 Rename global cnt to vm_cnt to avoid shadowing.
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from:	arch@
Sponsored by:	EMC / Isilon Storage Division
2014-03-22 10:26:09 +00:00
Konstantin Belousov 997ac6905f Initialize vm_map_entry member wiring_thread on the map entry creation.
This was missed in r253190.

Reported by:	hps, peter
Tested by:	hps
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-03-21 13:55:57 +00:00
Konstantin Belousov b61a53d43d Do not coalesce stack entry, vm_map_stack() asserts that the requested
region is claimed by a new entry.

Pass MAP_STACK_GROWS_DOWN and MAP_STACK_GROWS_UP flags to
vm_map_insert() from vm_map_stack(), to really turn off coalescing
code and call to vm_map_simplify_entry() [1].

Reported by:	avg, peter, many
Tested by:	avg, peter
Noted by:	avg [1]
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-12-27 16:59:47 +00:00
Konstantin Belousov 79e9451f07 Vm map code performs clipping when map entry covers region which is
larger than the operational region.  If the op region size is zero,
clipping would create a zero-sized map entry.  The result is that vm
map splay starts behaving inconsistently, sometimes returning
zero-sized entry, sometimes the next (or previous) entry.

One step further, it could result in e.g. vm_map_wire() setting
MAP_ENTRY_IN_TRANSITION on the zero-sized entry, but failing to clear
it in the done part.  The vm_map_delete() than hangs forever waiting
for the flag removal.

Verify for zero-length requests and act as if it is always successfull
without performing any action on the address space.

Diagnosed by:	pho
Tested by:	pho (previous version)
Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-20 09:03:48 +00:00
Konstantin Belousov ff3ae454c0 Add assertions to cover all places in the wiring and unwiring code
where MAP_ENTRY_IN_TRANSITION is set or cleared.

Tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-20 08:47:54 +00:00
Alan Cox f872f6eaf5 Both the vm_map and vmspace zones are defined as "no free". So, there is no
point in defining a fini function for these zones.

Reviewed by:	kib
Approved by:	re (glebius)
Sponsored by:	EMC / Isilon Storage Division
2013-09-22 17:48:10 +00:00
Neel Natu 74d1d2b7cc Merge the following changes from projects/bhyve_npt_pmap:
- add fields to 'struct pmap' that are required to manage nested page tables.
- add a parameter to 'vmspace_alloc()' that can be used to override the
  default pmap initialization routine 'pmap_pinit()'.

These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap'
in anticipation of the upcoming KBI freeze for 10.0.

Reviewed by:	kib@, alc@
Approved by:	re (glebius)
2013-09-20 17:06:49 +00:00
John Baldwin edb572a38c Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use
an address in the first 2GB of the process's address space.  This flag should
have the same semantics as the same flag on Linux.

To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address.  While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.

Reviewed by:	alc
Approved by:	re (kib)
2013-09-09 18:11:59 +00:00
Alan Cox 51321f7c31 Significantly reduce the cost, i.e., run time, of calls to madvise(...,
MADV_DONTNEED) and madvise(..., MADV_FREE).  Specifically, introduce a new
pmap function, pmap_advise(), that operates on a range of virtual addresses
within the specified pmap, allowing for a more efficient implementation of
MADV_DONTNEED and MADV_FREE.  Previously, the implementation of
MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as
pmap_clear_reference().  Intuitively, the problem with this implementation
is that the pmap-level locks are acquired and released and the page table
traversed repeatedly, once for each resident page in the range
that was specified to madvise(2).  A more subtle flaw with the previous
implementation is that pmap_clear_reference() would clear the reference bit
on all mappings to the specified page, not just the mapping in the range
specified to madvise(2).

Since our malloc(3) makes heavy use of madvise(2), this change can have a
measureable impact.  For example, the system time for completing a parallel
"buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%.

Note: This change only contains pmap_advise() implementations for a subset
of our supported architectures.  I will commit implementations for the
remaining architectures after further testing.  For now, a stub function is
sufficient because of the advisory nature of pmap_advise().

Discussed with: jeff, jhb, kib
Tested by:      pho (i386), marcel (ia64)
Sponsored by:   EMC / Isilon Storage Division
2013-08-29 15:49:05 +00:00
Konstantin Belousov e68c64f0ba Revert r254501. Instead, reuse the type stability of the struct pmap
which is the part of struct vmspace, allocated from UMA_ZONE_NOFREE
zone.  Initialize the pmap lock in the vmspace zone init function, and
remove pmap lock initialization and destruction from pmap_pinit() and
pmap_release().

Suggested and reviewed by:	alc (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-22 18:12:24 +00:00
John Baldwin 5aa60b6f21 Add new mmap(2) flags to permit applications to request specific virtual
address alignment of mappings.
- MAP_ALIGNED(n) requests a mapping aligned on a boundary of (1 << n).
  Requests for n >= number of bits in a pointer or less than the size of
  a page fail with EINVAL.  This matches the API provided by NetBSD.
- MAP_ALIGNED_SUPER is a special case of MAP_ALIGNED.  It can be used
  to optimize the chances of using large pages.  By default it will align
  the mapping on a large page boundary (the system is free to choose any
  large page size to align to that seems best for the mapping request).
  However, if the object being mapped is already using large pages, then
  it will align the virtual mapping to match the existing large pages in
  the object instead.
- Internally, VMFS_ALIGNED_SPACE is now renamed to VMFS_SUPER_SPACE, and
  VMFS_ALIGNED_SPACE(n) is repurposed for specifying a specific alignment.
  MAP_ALIGNED(n) maps to using VMFS_ALIGNED_SPACE(n), while
  MAP_ALIGNED_SUPER maps to VMFS_SUPER_SPACE.
- mmap() of a device object now uses VMFS_OPTIMAL_SPACE rather than
  explicitly using VMFS_SUPER_SPACE.  All device objects are forced to
  use a specific color on creation, so VMFS_OPTIMAL_SPACE is effectively
  equivalent.

Reviewed by:	alc
MFC after:	1 month
2013-08-16 21:13:55 +00:00
Jeff Roberson 5df87b21d3 Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

 - Normalize functions that allocate memory to use kmem_*
 - Those that allocate address space are named kva_*
 - Those that operate on maps are named kmap_*
 - Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-08-07 06:21:20 +00:00
Tim Kientzle 763d9566fe Clear entire map structure including locks so that the
locks don't accidentally appear to have been already
initialized.

In particular, this fixes a consistent kernel crash on
armv6 with:
  panic: lock "vm map (user)" 0xc09cc050 already initialized
that appeared with r251709.

PR: arm/180820
2013-07-25 03:48:37 +00:00
John Baldwin ff74a3fa6b Be more aggressive in using superpages in all mappings of objects:
- Add a new address space allocation method (VMFS_OPTIMAL_SPACE) for
  vm_map_find() that will try to alter the alignment of a mapping to match
  any existing superpage mappings of the object being mapped.  If no
  suitable address range is found with the necessary alignment,
  vm_map_find() will fall back to using the simple first-fit strategy
  (VMFS_ANY_SPACE).
- Change mmap() without MAP_FIXED, shmat(), and the GEM mapping ioctl to
  use VMFS_OPTIMAL_SPACE instead of VMFS_ANY_SPACE.

Reviewed by:	alc (earlier version)
MFC after:	2 weeks
2013-07-19 19:06:15 +00:00
Konstantin Belousov 0acea7dfde The mlockall() or VM_MAP_WIRE_HOLESOK does not interact properly with
parallel creation of the map entries, e.g. by mmap() or stack growing.
It also breaks when other entry is wired in parallel.

The vm_map_wire() iterates over the map entries in the region, and
assumes that map entries it finds are marked as in transition before,
also that any entry marked as in transition, are marked by the current
invocation of vm_map_wire().  This is not true for new entries in the
holes.

Add the thread owner of the MAP_ENTRY_IN_TRANSITION flag to struct
vm_map_entry.  In vm_map_wire() and vm_map_unwire(), only process the
entries which transition owner is the current thread.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-07-11 05:55:08 +00:00
Dag-Erling Smørgrav 5b3e02570a Fix a bug that allowed a tracing process (e.g. gdb) to write
to a memory-mapped file in the traced process's address space
even if neither the traced process nor the tracing process had
write access to that file.

Security:	CVE-2013-2171
Security:	FreeBSD-SA-13:06.mmap
Approved by:	so
2013-06-18 07:02:35 +00:00
Attilio Rao 9af6d512f5 o Relax locking assertions for vm_page_find_least()
o Relax locking assertions for pmap_enter_object() and add them also
  to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
  operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() to work
  mostl of the times only with readlocks.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-21 20:38:19 +00:00
Konstantin Belousov b9781cf650 Fix the assertions for the state of the object under the map entry
with the MAP_ENTRY_VN_WRITECNT flag:
- Move the assertion that verifies the state of the v_writecount and
  vnp.writecount, under the block where the object is locked.
- Check that the object type is OBJT_VNODE before asserting.

Reported by:	avg
Reviewed by:	alc
MFC after:	1 week
2013-04-09 10:04:10 +00:00
Attilio Rao 89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Attilio Rao a4915c21d9 Merge from vmc-playground branch:
Replace the sub-optimal uma_zone_set_obj() primitive with more modern
uma_zone_reserve_kva().  The new primitive reserves before hand
the necessary KVA space to cater the zone allocations and allocates pages
with ALLOC_NOOBJ.  More specifically:
- uma_zone_reserve_kva() does not need an object to cater the backend
  allocator.
- uma_zone_reserve_kva() can cater M_WAITOK requests, in order to
  serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() uses directly the direct-mapping
  by uma_small_alloc() rather than relying on the KVA / offset
  combination.

The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed.  This function is replaced by
   direct calls to mtx_init() as there is no need to export it anymore
   and the calls aren't either homogeneous anymore: there are now small
   differences between arguments passed to mtx_init().

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc (which also offered almost all the comments)
Tested by:	pho, jhb, davide
2013-02-26 23:35:27 +00:00
Andrey Zonov 1cc20081df - Get rid of unused function vmspace_wired_count().
Reviewed by:	alc
Approved by:	kib (mentor)
MFC after:	1 week
2013-01-14 12:12:56 +00:00
Andrey Zonov 3ac7d29722 - Reduce kernel size by removing unnecessary pointer indirections.
GENERIC kernel size reduced in 16 bytes and RACCT kernel in 336 bytes.

Suggested by:	alc
Reviewed by:	alc
Approved by:	kib (mentor)
MFC after:	1 week
2013-01-10 12:43:58 +00:00
Andrey Zonov 7e19eda4aa - Fix locked memory accounting for maps with MAP_WIREFUTURE flag.
- Add sysctl vm.old_mlock which may turn such accounting off.

Reviewed by:	avg, trasz
Approved by:	kib (mentor)
MFC after:	1 week
2012-12-18 07:35:01 +00:00
Alan Cox 2863482058 In the past four years, we've added two new vm object types. Each time,
similar changes had to be made in various places throughout the machine-
independent virtual memory layer to support the new vm object type.
However, in most of these places, it's actually not the type of the vm
object that matters to us but instead certain attributes of its pages.
For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain
fictitious pages.  In other words, in most of these places, we were
testing the vm object's type to determine if it contained fictitious (or
unmanaged) pages.

To both simplify the code in these places and make the addition of future
vm object types easier, this change introduces two new vm object flags
that describe attributes of the vm object's pages, specifically, whether
they are fictitious or unmanaged.

Reviewed and tested by:	kib
2012-12-09 00:32:38 +00:00
Alan Cox a922d312b0 Make a few small changes to vm_map_pmap_enter():
Add detail to the comment describing this function.  In particular,
describe what MAP_PREFAULT_PARTIAL does.

Eliminate the abrupt change in behavior when the specified address range
grows from MAX_INIT_PT pages to MAX_INIT_PT plus one pages.  Instead of
doing nothing, i.e., preloading no mappings whatsoever, map any resident
pages that fall within the start of the specified address range, i.e.,
[addr, addr + ulmin(size, ptoa(MAX_INIT_PT))).

Long ago, the vm object's list of resident pages was not ordered, so
this function had to choose between probing the global hash table of
all resident pages and iterating over the vm object's unordered list of
resident pages.  Now, the list is ordered, so there is no reason for
MAP_PREFAULT_PARTIAL to be concerned with the vm object's count of
resident changes.

MFC after:	14 days
2012-11-25 19:42:36 +00:00
Attilio Rao 2ebcd458e3 Fix DDB command "show map XXX":
- Check that an argument is always available, otherwise current map
  printing before to recurse is garbage.
- Spit out a message if an argument is not provided.
- Remove unread nlines variable.
- Use an explicit recursive function, disassociated from the
  DB_SHOW_COMMAND() body, in order to make clear prototype and recursion
  of the above mentioned function.  The code results now much less
  obscure.

Submitted by:	gianni
2012-11-12 00:30:40 +00:00
Andrey Zonov cfe52ecf0e - After r240026 sgrowsiz should be used in a safer maner.
Approved by:	kib (mentor)
MCF after:	1 week
2012-09-03 09:34:46 +00:00
Alan Cox e30df26e7b Add new pmap layer locks to the predefined lock order. Change the names
of a few existing VM locks to follow a consistent naming scheme.
2012-06-27 03:45:25 +00:00
John Baldwin 6fbe60fa8b Move the per-thread deferred user map entries list into a private list
in vm_map_process_deferred() which is then iterated to release map entries.
This avoids having a nested vm map unlock operation called from the loop
body attempt to recuse into vm_map_process_deferred().  This can happen if
the vm_map_remove() triggers the OOM killer.

Reviewed by:	alc, kib
MFC after:	1 week
2012-06-20 18:00:26 +00:00
Konstantin Belousov 83ce08538a Use the previous stack entry protection and max protection to correctly
propagate the stack execution permissions when stack is grown down.

First, curproc->p_sysent->sv_stackprot specifies maximum allowed stack
protection for current ABI, so the new stack entry was typically marked
executable always. Second, for non-main stack MAP_STACK mapping,
the PROT_ flags should be used which were specified at the mmap(2) call
time, and not sv_stackprot.

MFC after:	1 week
2012-06-10 11:31:50 +00:00
Alan Cox 13458803f4 Give vm_fault()'s sequential access optimization a makeover.
There are two aspects to the sequential access optimization: (1) read ahead
of pages that are expected to be accessed in the near future and (2) unmap
and cache behind of pages that are not expected to be accessed again.  This
revision changes both aspects.

The read ahead optimization is now more effective.  It starts with the same
initial read window as before, but arithmetically grows the window on
sequential page faults.  This can yield increased read bandwidth.  For
example, on one of my machines, a program using mmap() to read a file that
is several times larger than the machine's physical memory takes about 17%
less time to complete.

The unmap and cache behind optimization is now more selectively applied.
The read ahead window must grow to its maximum size before unmap and cache
behind is performed.  This significantly reduces the number of times that
pages are unmapped and cached only to be reactivated a short time later.

The unmap and cache behind optimization now clears each page's referenced
flag.  Previously, in the case of dirty pages, if the containing file was
still mapped at the time that the page daemon examined the dirty pages,
they would be reactivated.

From a stylistic standpoint, this revision also cleanly separates the
implementation of the read ahead and unmap/cache behind optimizations.

Glanced at:	kib
MFC after:	2 weeks
2012-05-10 15:16:42 +00:00
John Baldwin 92a5994685 Fix madvise(MADV_WILLNEED) to properly handle individual mappings larger
than 4GB.  Specifically, the inlined version of 'ptoa' of the the 'int'
count of pages overflowed on 64-bit platforms.  While here, change
vm_object_madvise() to accept two vm_pindex_t parameters (start and end)
rather than a (start, count) tuple to match other VM APIs as suggested
by alc@.
2012-03-19 18:47:34 +00:00
Konstantin Belousov 126d60823a In vm_object_page_clean(), do not clean OBJ_MIGHTBEDIRTY object flag
if the filesystem performed short write and we are skipping the page
due to this.

Propogate write error from the pager back to the callers of
vm_pageout_flush().  Report the failure to write a page from the
requested range as the FALSE return value from vm_object_page_clean(),
and propagate it back to msync(2) to return EIO to usermode.

While there, convert the clearobjflags variable in the
vm_object_page_clean() and arguments of the helper functions to
boolean.

PR:	kern/165927
Reviewed by:	alc
MFC after:	2 weeks
2012-03-17 23:00:32 +00:00
Alan Cox 79e538388f Simplify vmspace_fork()'s control flow by copying immutable data before
the vm map locks are acquired.  Also, eliminate redundant initialization
of the new vm map's timestamp.

Reviewed by:	kib
MFC after:	3 weeks
2012-02-25 17:49:59 +00:00
Konstantin Belousov 84110e7e0b Account the writeable shared mappings backed by file in the vnode
v_writecount.  Keep the amount of the virtual address space used by
the mappings in the new vm_object un_pager.vnp.writemappings
counter. The vnode v_writecount is incremented when writemappings gets
non-zero value, and decremented when writemappings is returned to
zero.

Writeable shared vnode-backed mappings are accounted for in vm_mmap(),
and vm_map_insert() is instructed to set MAP_ENTRY_VN_WRITECNT flag on
the created map entry.  During deferred map entry deallocation,
vm_map_process_deferred() checks for MAP_ENTRY_VN_WRITECOUNT and
decrements writemappings for the vm object.

Now, the writeable mount cannot be demoted to read-only while
writeable shared mappings of the vnodes from the mount point
exist. Also, execve(2) fails for such files with ETXTBUSY, as it
should be.

Noted by:	tegge
Reviewed by:	tegge (long time ago, early version), alc
Tested by:	pho
MFC after:	3 weeks
2012-02-23 21:07:16 +00:00
Konstantin Belousov 8211bd45bc Close a race due to dropping of the map lock between creating map entry
for a shared mapping and marking the entry for inheritance.
Other thread might execute vmspace_fork() in between (e.g. by fork(2)),
resulting in the mapping becoming private.

Noted and reviewed by:	alc
MFC after:	1 week
2012-02-11 17:29:07 +00:00
Attilio Rao 9fde98bba3 Introduce the same mutex-wise fix in r227758 for sx locks.
The functions that offer file and line specifications are:
- sx_assert_
- sx_downgrade_
- sx_slock_
- sx_slock_sig_
- sx_sunlock_
- sx_try_slock_
- sx_try_xlock_
- sx_try_upgrade_
- sx_unlock_
- sx_xlock_
- sx_xlock_sig_
- sx_xunlock_

Now vm_map locking is fully converted and can avoid to know specifics
about locking procedures.
Reviewed by:	kib
MFC after:	1 month
2011-11-21 12:59:52 +00:00
Attilio Rao ccdf233323 Introduce macro stubs in the mutex implementation that will be always
defined and will allow consumers, willing to provide options, file and
line to locking requests, to not worry about options redefining the
interfaces.
This is typically useful when there is the need to build another
locking interface on top of the mutex one.

The introduced functions that consumers can use are:
- mtx_lock_flags_
- mtx_unlock_flags_
- mtx_lock_spin_flags_
- mtx_unlock_spin_flags_
- mtx_assert_
- thread_lock_flags_

Spare notes:
- Likely we can get rid of all the 'INVARIANTS' specification in the
  ppbus code by using the same macro as done in this patch (but this is
  left to the ppbus maintainer)
- all the other locking interfaces may require a similar cleanup, where
  the most notable case is sx which will allow a further cleanup of
  vm_map locking facilities
- The patch should be fully compatible with older branches, thus a MFC
  is previewed (infact it uses all the underlying mechanisms already
  present).

Comments review by:	eadler, Ben Kaduk
Discussed with:		kib, jhb
MFC after:	1 month
2011-11-20 16:33:09 +00:00
Edward Tomasz Napierala afcc55f318 All the racct_*() calls need to happen with the proc locked. Fixing this
won't happen before 9.0.  This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".
2011-07-06 20:06:44 +00:00
Alan Cox 6bbee8e28a Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings.  Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write().  It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by:	kib
2011-06-29 16:40:41 +00:00
Edward Tomasz Napierala 1ba5ad4210 Add accounting for most of the memory-related resources.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-04-05 20:23:59 +00:00
Jeff Roberson e4cd31dd3c - Merge changes to the base system to support OFED. These include
a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND,
   and other miscellaneous small features.
2011-03-21 09:40:01 +00:00
Alan Cox 0cc74f144e Since the last parameter to vm_object_shadow() is a vm_size_t and not a
vm_pindex_t, it makes no sense for its callers to perform atop().  Let
vm_object_shadow() do that instead.
2011-02-04 21:49:24 +00:00
Alan Cox d2a444c0da Reenable the call to vm_map_simplify_entry() from vm_map_insert() for non-
MAP_STACK_* entries.  (See r71983 and r74235.)

In some cases, performing this call to vm_map_simplify_entry() halves the
number of vm map entries used by the Sun JDK.
2011-01-29 15:23:02 +00:00
Max Laier a5db445da4 Fix a long standing (from the original 4.4BSD lite sources) race between
vmspace_fork and vm_map_wire that would lead to "vm_fault_copy_wired: page
missing" panics.  While faulting in pages for a map entry that is being
wired down, mark the containing map as busy.  In vmspace_fork wait until the
map is unbusy, before we try to copy the entries.

Reviewed by:	kib
MFC after:	5 days
Sponsored by:	Isilon Systems, Inc.
2010-12-09 21:02:22 +00:00
Edward Tomasz Napierala ef694c1ac4 Replace pointer to "struct uidinfo" with pointer to "struct ucred"
in "struct vm_object".  This is required to make it possible to account
for per-jail swap usage.

Reviewed by:	kib@
Tested by:	pho@
Sponsored by:	FreeBSD Foundation
2010-12-02 17:37:16 +00:00
Konstantin Belousov 9a6d144ff8 Implement a (soft) stack guard page for auto-growing stack mappings.
The unmapped page separates the tip of the stack and possible adjanced
segment, making some uses of stack overflow harder.  The stack growing
code refuses to expand the segment to the last page of the reseved
region when sysctl security.bsd.stack_guard_page is set to 1. The
default value for sysctl and accompanying tunable is 0.

Please note that mmap(MAP_FIXED) still can place a mapping right up to
the stack, making continuous region.

Reviewed by:	alc
MFC after:	1 week
2010-11-14 17:53:52 +00:00
Alan Cox e48262487a In case the stack size reaches its limit and its growth must be restricted,
ensure that grow_amount is a multiple of the page size.  Otherwise, the
kernel may crash in swap_reserve_by_uid() on HEAD and FreeBSD 8.x, and
produce a core file with a missing stack on FreeBSD 7.x.

Diagnosed and reported by: jilles
Reviewed by:	kib
MFC after:	1 week
2010-11-07 21:40:34 +00:00
John Baldwin 1a587ef2a5 - Make 'vm_refcnt' volatile so that compilers won't be tempted to treat
its value as a loop invariant.  Currently this is a no-op because
  'atomic_cmpset_int()' clobbers all memory on current architectures.
- Use atomic_fetchadd_int() instead of an atomic_cmpset_int() loop to drop
  a reference in vmspace_free().

Reviewed by:	alc
MFC after:	1 month
2010-10-21 17:29:32 +00:00
Alan Cox f8616ebfae If vm_map_find() is asked to allocate a superpage-aligned region of virtual
addresses that is greater than a superpage in size but not a multiple of
the superpage size, then vm_map_find() is not always expanding the kernel
pmap to support the last few small pages being allocated.  These failures
are not commonplace, so this was first noticed by someone porting FreeBSD
to a new architecture.  Previously, we grew the kernel page table in
vm_map_findspace() when we found the first available virtual address.
This works most of the time because we always grow the kernel pmap or page
table by an amount that is a multiple of the superpage size.  Now, instead,
we defer the call to pmap_growkernel() until we are committed to a range
of virtual addresses in vm_map_insert().  In general, there is another
reason to prefer calling pmap_growkernel() in vm_map_insert().  It makes
it possible for someone to do the equivalent of an mmap(MAP_FIXED) on the
kernel map.

Reported by:	Svatopluk Kraus
Reviewed by:	kib@
MFC after:	3 weeks
2010-10-04 16:49:40 +00:00
Alan Cox 8304adaac6 Make refinements to r212824. In particular, don't make
vm_map_unlock_nodefer() part of the synchronization interface for maps.

Add comments to vm_map_unlock_and_wait() and vm_map_wakeup() describing
how they should be used.  In particular, describe the deferred deallocations
issue with vm_map_unlock_and_wait().

Redo the implementation of vm_map_unlock_and_wait() so that it passes
along the caller's file and line information, just like the other map
locking primitives.

Reviewed by:	kib
X-MFC after:	r212824
2010-09-19 17:43:22 +00:00
Konstantin Belousov 0b367bd8c0 Adopt the deferring of object deallocation for the deleted map entries
on map unlock to the lock downgrade and later read unlock operation.

System map entries cannot be backed by OBJT_VNODE objects, no need to
defer deallocation for them. Map entries from user maps do not require
the owner map for deallocation, and can be accumulated in the
thread-local list for freeing when a user map is unlocked.

Move the collection of entries for deferred reclamation into
vm_map_delete(). Create helper vm_map_process_deferred(), that is
called from locations where processing is feasible. Do not process
deferred entries in vm_map_unlock_and_wait() since map_sleep_mtx is
held.

Reviewed by:	alc, rstone (previous versions)
Tested by:	pho
MFC after:	2 weeks
2010-09-18 15:03:31 +00:00
Konstantin Belousov b382c10a57 Introduce a helper function vm_page_find_least(). Use it in several places,
which inline the function.

Reviewed by:	alc
Tested by:	pho
MFC after:	1 week
2010-07-04 11:13:33 +00:00
Alan Cox c46b90e90a Push down page queues lock acquisition in pmap_enter_object() and
pmap_is_referenced().  Eliminate the corresponding page queues lock
acquisitions from vm_map_pmap_enter() and mincore(), respectively.  In
mincore(), this allows some additional cases to complete without ever
acquiring the page queues lock.

Assert that the page is managed in pmap_is_referenced().

On powerpc/aim, push down the page queues lock acquisition from
moea*_is_modified() and moea*_is_referenced() into moea*_query_bit().
Again, this will allow some additional cases to complete without ever
acquiring the page queues lock.

Reorder a few statements in vm_page_dontneed() so that a race can't lead
to an old reference persisting.  This scenario is described in detail by a
comment.

Correct a spelling error in vm_page_dontneed().

Assert that the object is locked in vm_page_clear_dirty(), and restrict the
page queues lock assertion to just those cases in which the page is
currently writeable.

Add object locking to vnode_pager_generic_putpages().  This was the one
and only place where vm_page_clear_dirty() was being called without the
object being locked.

Eliminate an unnecessary vm_page_lock() around vnode_pager_setsize()'s call
to vm_page_clear_dirty().

Change vnode_pager_generic_putpages() to the modern-style of function
definition.  Also, change the name of one of the parameters to follow
virtual memory system naming conventions.

Reviewed by:	kib
2010-05-26 18:00:44 +00:00
Alan Cox 9b55fc0429 Correct an error of omission in r206819. If VMFS_TLB_ALIGNED_SPACE is
specified to vm_map_find(), then retry the vm_map_findspace() if
vm_map_insert() fails because the aligned space is already partly used.

Reported by:	Neel Natu
2010-05-02 01:25:03 +00:00
Juli Mallett ca596a25f0 o) Add a VM find-space option, VMFS_TLB_ALIGNED_SPACE, which searches the
address space for an address as aligned by the new pmap_align_tlb()
   function, which is for constraints imposed by the TLB. [1]
o) Add a kmem_alloc_nofault_space() function, which acts like
   kmem_alloc_nofault() but allows the caller to specify which find-space
   option to use. [1]
o) Use kmem_alloc_nofault_space() with VMFS_TLB_ALIGNED_SPACE to allocate the
   kernel stack address on MIPS. [1]
o) Make pmap_align_tlb() on MIPS align addresses so that they do not start on
   an odd boundary within the TLB, so that they are suitable for insertion as
   wired entries and do not have to share a TLB entry with another mapping,
   assuming they are appropriately-sized.
o) Eliminate md_realstack now that the kstack will be appropriately-aligned on
   MIPS.
o) Increase the number of guard pages to 2 so that we retain the proper
   alignment of the kstack address.

Reviewed by:	[1] alc
X-MFC-after:	Making sure alc has not come up with a better interface.
2010-04-18 22:32:07 +00:00
Alan Cox 92351f162e Make _vm_map_init() the one place where the vm map's pmap field is
initialized.

Reviewed by:	kib
2010-04-03 19:07:05 +00:00
Alan Cox 0ef12795b5 Re-enable the call to pmap_release() by vmspace_dofree(). The accounting
problem that is described in the comment has been addressed.

Submitted by:	kib
Tested by:	pho (a few months ago)
MFC after:	6 weeks
2010-04-03 16:20:22 +00:00
Konstantin Belousov 41c2274481 The MAP_ENTRY_NEEDS_COPY flag belongs to protoeflags, cow variable
uses different namespace.

Reported by:	Jonathan Anderson <jonathan.anderson cl cam ac uk>
MFC after:	3 days
2010-01-29 19:25:45 +00:00
Alan Cox a6d42a0d62 Replace VM_PROT_OVERRIDE_WRITE by VM_PROT_COPY. VM_PROT_OVERRIDE_WRITE has
represented a write access that is allowed to override write protection.
Until now, VM_PROT_OVERRIDE_WRITE has been used to write breakpoints into
text pages.  Text pages are not just write protected but they are also
copy-on-write.  VM_PROT_OVERRIDE_WRITE overrides the write protection on the
text page and triggers the replication of the page so that the breakpoint
will be written to a private copy.  However, here is where things become
confused.  It is the debugger, not the process being debugged that requires
write access to the copied page.  Nonetheless, the copied page is being
mapped into the process with write access enabled.  In other words, once the
debugger sets a breakpoint within a text page, the program can write to its
private copy of that text page.  Whereas prior to setting the breakpoint, a
SIGSEGV would have occurred upon a write access.  VM_PROT_COPY addresses
this problem.  The combination of VM_PROT_READ and VM_PROT_COPY forces the
replication of a copy-on-write page even though the access is only for read.
Moreover, the replicated page is only mapped into the process with read
access, and not write access.

Reviewed by:	kib
MFC after:	4 weeks
2009-11-26 05:16:07 +00:00
Alan Cox 2db65ab46e Simplify both the invocation and the implementation of vm_fault() for wiring
pages.

(Note: Claims made in the comments about the handling of breakpoints in
wired pages have been false for roughly a decade.  This and another bug
involving breakpoints will be fixed in coming changes.)

Reviewed by:	kib
2009-11-18 18:05:54 +00:00
Alan Cox 2fafce9e4c Avoid pointless calls to pmap_protect().
Reviewed by:	kib
2009-11-02 17:45:39 +00:00
Konstantin Belousov 210a688642 When protection of wired read-only mapping is changed to read-write,
install new shadow object behind the map entry and copy the pages
from the underlying objects to it. This makes the mprotect(2) call to
actually perform the requested operation instead of silently do nothing
and return success, that causes SIGSEGV on later write access to the
mapping.

Reuse vm_fault_copy_entry() to do the copying, modifying it to behave
correctly when src_entry == dst_entry.

Reviewed by:	alc
MFC after:	3 weeks
2009-10-27 10:15:58 +00:00
Konstantin Belousov 6fecb26b6b Move the annotation for vm_map_startup() immediately before the function.
MFC after:	3 days
2009-10-01 12:48:35 +00:00
John Baldwin 013818111a Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to
a device pager (OBJT_DEVICE) object in that it uses fictitious pages to
provide aliases to other memory addresses.  The primary difference is that
it uses an sglist(9) to determine the physical addresses for a given offset
into the object instead of invoking the d_mmap() method in a device driver.

Reviewed by:	alc
Approved by:	re (kensmith)
MFC after:	2 weeks
2009-07-24 13:50:29 +00:00
Konstantin Belousov 529ab57b9a When VM_MAP_WIRE_HOLESOK is not specified and vm_map_wire(9) encounters
non-readable and non-executable map entry, the entry is skipped from
wiring and loop is aborted. But, since MAP_ENTRY_WIRE_SKIPPED was not
set for the map entry, its wired_count is later erronously decremented.
vm_map_delete(9) for such map entry stuck in "vmmaps".

Properly set MAP_ENTRY_WIRE_SKIPPED when aborting the loop.

Reported by:	John Marshall <john.marshall riverwillow com au>
Approved by:	re (kensmith)
2009-07-12 12:37:38 +00:00
Konstantin Belousov 121fd46175 When forking a vm space that has wired map entries, do not forget to
charge the objects created by vm_fault_copy_entry. The object charge
was set, but reserve not incremented.

Reported by:	Greg Rivers <gcr+freebsd-current tharned org>
Reviewed by:	alc (previous version)
Approved by:	re (kensmith)
2009-07-03 22:17:37 +00:00
Konstantin Belousov 3364c323e6 Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.

The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.

The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.

The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation if performed by
kernel, e.g. md(4).

Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.

In collaboration with:	pho
Reviewed by:	alc
Approved by:	re (kensmith)
2009-06-23 20:45:22 +00:00
Alan Cox 1136ed06b9 Eliminate an unnecessary restriction on the vm object type from
vm_map_pmap_enter().  The immediate effect of this change is that automatic
prefaulting by mmap() for small mappings is performed on POSIX shared memory
objects just the same as it is on ordinary files.
2009-06-09 17:04:39 +00:00
Alan Cox 0a2e596a93 Eliminate unnecessary obfuscation when testing a page's valid bits. 2009-06-07 19:38:26 +00:00
Alan Cox f9855e177d Allow valid pages to be mapped for read access when they have a non-zero
busy count.  Only mappings that allow write access should be prevented by
a non-zero busy count.

(The prohibition on mapping pages for read access when they have a non-
zero busy count originated in revision 1.202 of i386/i386/pmap.c when
this code was a part of the pmap.)

Reviewed by:	tegge
2009-04-19 00:34:34 +00:00
Konstantin Belousov 6d7e809123 When vm_map_wire(9) is allowed to skip holes in the wired region, skip
the mappings without any of read and execution rights, in particular,
the PROT_NONE entries. This makes mlockall(2) work for the process
address space that has such mappings.

Since protection mode of the entry may change between setting
MAP_ENTRY_IN_TRANSITION and final pass over the region that records
the wire status of the entries, allocate new map entry flag
MAP_ENTRY_WIRE_SKIPPED to mark the skipped PROT_NONE entries.

Reported and tested by:	Hans Ottevanger <fbsdhackers beasties demon nl>
Reviewed by:	alc
MFC after:	3 weeks
2009-04-10 10:16:03 +00:00
Konstantin Belousov 655c349022 Revert the addition of the freelist argument for the vm_map_delete()
function, done in r188334. Instead, collect the entries that shall be
freed, in the deferred_freelist member of the map. Automatically purge
the deferred freelist when map is unlocked.

Tested by:	pho
Reviewed by:	alc
2009-02-24 20:57:43 +00:00
Konstantin Belousov 3a0916b8ea Add the assertion macros for the map locks. Use them in several map
manipulation functions.

Tested by:	pho
Reviewed by:	alc
2009-02-24 20:43:29 +00:00
Konstantin Belousov e608cc3c8d Update the comment after the r188334.
Reviewed by:	alc
2009-02-24 20:23:16 +00:00
Konstantin Belousov b0994946c7 Improve comments, correct English.
Submitted by:	alc
2009-02-08 20:52:09 +00:00
Konstantin Belousov 897d81a020 Do not call vm_object_deallocate() from vm_map_delete(), because we
hold the map lock there, and might need the vnode lock for OBJT_VNODE
objects. Postpone object deallocation until caller of vm_map_delete()
drops the map lock. Link the map entries to be freed into the freelist,
that is released by the new helper function vm_map_entry_free_freelist().

Reviewed by:	tegge, alc
Tested by:	pho
2009-02-08 20:39:17 +00:00
Konstantin Belousov e53fa61bf2 In vm_map_sync(), do not call vm_object_sync() while holding map lock.
Reference object, drop the map lock, and then call vm_object_sync().
The object sync might require vnode lock for OBJT_VNODE type objects.

Reviewed by:	tegge
Tested by:	pho
2009-02-08 20:30:51 +00:00
Konstantin Belousov 7fd10fb3c7 Add the comments to vm_map_simplify_entry() and vmspace_fork(),
describing why several calls to vm_deallocate_object() with locked map
do not result in the acquisition of the vnode lock after map lock.

Suggested and reviewed by:	tegge
2009-02-08 20:00:33 +00:00
Konstantin Belousov 1fac7d7f35 Lock the new map in vmspace_fork(). The newly allocated map should not
be accessible outside vmspace_fork() yet, but locking it would satisfy
the protocol of the vm_map_entry_link() and other functions called
from vmspace_fork().

Use trylock that is supposedly cannot fail, to silence WITNESS warning
of the nested acquisition of the sx lock with the same name.

Suggested and reviewed by:	tegge
2009-02-08 19:55:03 +00:00
Konstantin Belousov 9f6acfd1a8 Do not leak the MAP_ENTRY_IN_TRANSITION flag when copying map entry
on fork. Otherwise, copied entry cannot be removed in the child map.

Reviewed by:	tegge
MFC after:	2 weeks
2009-02-08 19:41:08 +00:00
Alan Cox 05a8c41419 Resurrect shared map locks allowing greater concurrency during some map
operations, such as page faults.

An earlier version of this change was ...

Reviewed by:	kib
Tested by:	pho
MFC after:	6 weeks
2009-01-01 00:31:46 +00:00
Alan Cox e2abaaaa2b Update or eliminate some stale comments. 2008-12-31 05:44:05 +00:00
Alan Cox 7438d60b4b Avoid an unnecessary memory dereference in vm_map_entry_splay(). 2008-12-30 21:52:18 +00:00
Alan Cox 095104ac36 Style change to vm_map_lookup(): Eliminate a macro of dubious value. 2008-12-30 20:51:07 +00:00
Alan Cox 4c3ef59e3d Move the implementation of the vm map's fast path on address lookup from
vm_map_lookup{,_locked}() to vm_map_lookup_entry().  Having the fast path
in vm_map_lookup{,_locked}() limits its benefits to page faults.  Moving
it to vm_map_lookup_entry() extends its benefits to other operations on
the vm map.
2008-12-30 19:48:03 +00:00
Alan Cox c1f02198d1 KERNBASE is not necessarily an address within the kernel map, e.g.,
PowerPC/AIM.  Consequently, it should not be used to determine the maximum
number of kernel map entries.  Intead, use VM_MIN_KERNEL_ADDRESS, which marks
the start of the kernel map on all architectures.

Tested by:	marcel@ (PowerPC/AIM)
2008-06-21 21:02:13 +00:00
Alan Cox 26c538ffcd Generalize vm_map_find(9)'s parameter "find_space". Specifically, add
support for VMFS_ALIGNED_SPACE, which requests the allocation of an
address range best suited to superpages.  The old options TRUE and FALSE
are mapped to VMFS_ANY_SPACE and VMFS_NO_SPACE, so that there is no
immediate need to update all of vm_map_find(9)'s callers.

While I'm here, correct a misstatement about vm_map_find(9)'s return
values in the man page.
2008-05-10 18:55:35 +00:00
Alan Cox b8ca4ef2e3 vm_map_fixed(), unlike vm_map_find(), does not update "addr", so it can be
passed by value.
2008-04-28 05:30:23 +00:00
Alan Cox c416972587 Update a comment to vm_map_pmap_enter(). 2008-04-04 19:14:58 +00:00
Jeff Roberson 6617724c5f Remove kernel support for M:N threading.
While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential.  Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.
2008-03-12 10:12:01 +00:00
Konstantin Belousov 77bc7900bc In the vm_map_stack(), check for the specified stack region wraparound.
Reported and tested by:	Peter Holm
Reviewed by:	alc
MFC after:	3 days
2008-01-04 04:33:13 +00:00
Pawel Jakub Dawidek 8ce2d00a04 Change unused 'user_wait' argument to 'timo' argument, which will be
used to specify timeout for msleep(9).

Discussed with:	alc
Reviewed by:	alc
2007-11-07 21:56:58 +00:00
Konstantin Belousov 89b57fcf01 Fix for the panic("vm_thread_new: kstack allocation failed") and
silent NULL pointer dereference in the i386 and sparc64 pmap_pinit()
when the kmem_alloc_nofault() failed to allocate address space. Both
functions now return error instead of panicing or dereferencing NULL.

As consequence, vmspace_exec() and vmspace_unshare() returns the errno
int. struct vmspace arg was added to vm_forkproc() to avoid dealing
with failed allocation when most of the fork1() job is already done.

The kernel stack for the thread is now set up in the thread_alloc(),
that itself may return NULL. Also, allocation of the first process
thread is performed in the fork1() to properly deal with stack
allocation failure. proc_linkup() is separated into proc_linkup()
called from fork1(), and proc_linkup0(), that is used to set up the
kernel process (was known as swapper).

In collaboration with:	Peter Holm
Reviewed by:	jhb
2007-11-05 11:36:16 +00:00
Alan Cox 7b0e72d184 Correct an error in vm_map_sync(), nee vm_map_clean(), that has existed
since revision 1.1.  Specifically, neither traversal of the vm map checks
whether the end of the vm map has been reached.  Consequently, the first
traversal can wrap around and bogusly return an error.

This error has gone unnoticed for so long because no one had ever before
tried msync(2)ing a region above the stack.

Reported by:	peter
MFC after:	1 week
2007-10-22 05:21:05 +00:00
Alan Cox 7bfda801a8 Change the management of cached pages (PQ_CACHE) in two fundamental
ways:

(1) Cached pages are no longer kept in the object's resident page
splay tree and memq.  Instead, they are kept in a separate per-object
splay tree of cached pages.  However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock.  Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.

This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE).  The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held.  Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.

Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case.  Cached pages
are reclaimed far, far more often than they are reactivated.  Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.

(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.

Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated.  Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page.  Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.

Discussed with: many over the course of the summer, including jeff@,
   Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
Konstantin Belousov d239bd3ccc Do not drop vm_map lock between doing vm_map_remove() and vm_map_insert().
For this, introduce vm_map_fixed() that does that for MAP_FIXED case.

Dropping the lock allowed for parallel thread to occupy the freed space.

Reported by:	Tijl Coosemans <tijl ulyssis org>
Reviewed by:	alc
Approved by:	re (kensmith)
MFC after:	2 weeks
2007-08-20 12:05:45 +00:00
Attilio Rao 2feb50bf7d Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)
2007-05-31 22:52:15 +00:00
Attilio Rao f9819486e5 Add functions sx_xlock_sig() and sx_slock_sig().
These functions are intended to do the same actions of sx_xlock() and
sx_slock() but with the difference to perform an interruptible sleep, so
that sleep can be interrupted by external events.
In order to support these new featueres, some code renstruction is needed,
but external API won't be affected at all.

Note: use "void" cast for "int" returning functions in order to avoid tools
like Coverity prevents to whine.

Requested by: rwatson
Tested by: rwatson
Reviewed by: jhb
Approved by: jeff (mentor)
2007-05-31 09:14:48 +00:00
Alan Cox cf4682ae23 Eliminate the reactivation of cached pages in vm_fault_prefault() and
vm_map_pmap_enter() unless the caller is madvise(MADV_WILLNEED).  With
the exception of calls to vm_map_pmap_enter() from
madvise(MADV_WILLNEED), vm_fault_prefault() and vm_map_pmap_enter()
are both used to create speculative mappings.  Thus, always
reactivating cached pages is a mistake.  In principle, cached pages
should only be reactivated by an actual access.  Otherwise, the
following misbehavior can occur.  On a hard fault for a text page the
clustering algorithm fetches not only the required page but also
several of the adjacent pages.  Now, suppose that one or more of the
adjacent pages are never accessed.  Ultimately, these unused pages
become cached pages through the efforts of the page daemon.  However,
the next activation of the executable reactivates and maps these
unused pages.  Consequently, they are never replaced.  In effect, they
become pinned in memory.
2007-05-22 04:45:59 +00:00
Jeff Roberson 222d01951f - define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts.  This can be used to abstract away pcpu details but also changes
   to use atomics for all counters now.  This means sched lock is no longer
   responsible for protecting counts in the switch routines.

Contributed by:		Attilio Rao <attilio@FreeBSD.org>
2007-05-18 07:10:50 +00:00
Alan Cox 17afe8befe Remove some code from vmspace_fork() that became redundant after
revision 1.334 modified _vm_map_init() to initialize the new vm map's
flags to zero.
2007-04-26 05:48:17 +00:00
Alan Cox 8fece8c367 Two small changes to vm_map_pmap_enter():
1) Eliminate an unnecessary check for fictitious pages.  Specifically,
only device-backed objects contain fictitious pages and the object is
not device-backed.

2) Change the types of "psize" and "tmpidx" to vm_pindex_t in order to
prevent possible wrap around with extremely large maps and objects,
respectively.  Observed by: tegge (last summer)
2007-03-25 19:33:40 +00:00
Alan Cox 9f5c801b94 Change the way that unmanaged pages are created. Specifically,
immediately flag any page that is allocated to a OBJT_PHYS object as
unmanaged in vm_page_alloc() rather than waiting for a later call to
vm_page_unmanage().  This allows for the elimination of some uses of
the page queues lock.

Change the type of the kernel and kmem objects from OBJT_DEFAULT to
OBJT_PHYS.  This allows us to take advantage of the above change to
simplify the allocation of unmanaged pages in kmem_alloc() and
kmem_malloc().

Remove vm_page_unmanage().  It is no longer used.
2007-02-25 06:14:58 +00:00
Alan Cox 9fea8cad08 Eliminate unnecessary PG_BUSY tests. They originally served a purpose
that is now handled by vm object locking.
2006-10-21 21:02:04 +00:00
Alan Cox 2cf139527c Retire debug.mpsafevm. None of the architectures supported in CVS require
it any longer.
2006-07-21 23:22:49 +00:00
Alan Cox 379fb6429d Use ptoa(psize) instead of size to compute the end of the mapping in
vm_map_pmap_enter().
2006-06-17 08:45:01 +00:00
Alan Cox d2d9e24a89 Correct an error in the previous revision that could lead to a panic:
Found mapped cache page.  Specifically, if cnt.v_free_count dips below
cnt.v_free_reserved after p_start has been set to a non-NULL value,
then vm_map_pmap_enter() would break out of the loop and incorrectly
call pmap_enter_object() for the remaining address range.  To correct
this error, this revision truncates the address range so that
pmap_enter_object() will not map any cache pages.

In collaboration with: tegge@
Reported by: kris@
2006-06-14 17:48:45 +00:00
Alan Cox ce142d9ec0 Introduce the function pmap_enter_object(). It maps a sequence of resident
pages from the same object.  Use it in vm_map_pmap_enter() to reduce the
locking overhead of premapping objects.

Reviewed by: tegge@
2006-06-05 20:35:27 +00:00
Tor Egge 57051fdc4b Close race between vmspace_exitfree() and exit1() and races between
vmspace_exitfree() and vmspace_free() which could result in the same
vmspace being freed twice.

Factor out part of exit1() into new function vmspace_exit().  Attach
to vmspace0 to allow old vmspace to be freed earlier.

Add new function, vmspace_acquire_ref(), for obtaining a vmspace
reference for a vmspace belonging to another process.  Avoid changing
vmspace refcount from 0 to 1 since that could also lead to the same
vmspace being freed twice.

Change vmtotal() and swapout_procs() to use vmspace_acquire_ref().

Reviewed by:	alc
2006-05-29 21:28:56 +00:00
Warner Losh 62a59e8f0d Remove leading __ from __(inline|const|signed|volatile). They are
obsolete.  This should reduce diffs to NetBSD as well.
2006-03-08 06:31:46 +00:00
Alan Cox 997e1c252b Use the new macros abstracting the page coloring/queues implementation.
(There are no functional changes.)
2006-01-27 07:28:51 +00:00
Alan Cox 717f7d5962 Simplify vmspace_dofree(). 2005-12-04 22:55:41 +00:00
Alan Cox 51016cdfd7 Eliminate unneeded preallocation at initialization.
Reviewed by: tegge
2005-12-03 22:41:15 +00:00
Alan Cox 97a0c226d6 Eliminate pmap_init2(). It's no longer used. 2005-11-20 06:09:49 +00:00
Alan Cox ba8bca610c Pass a value of type vm_prot_t to pmap_enter_quick() so that it determine
whether the mapping should permit execute access.
2005-09-03 18:20:20 +00:00
Alan Cox 15d2d31372 Eliminate an incorrect (and unnecessary) cast. 2005-07-20 18:41:08 +00:00
Alan Cox b7903e65fb Remove GIANT_REQUIRED from vmspace_exec().
Prodded by: jeff
2005-05-02 07:05:20 +00:00
Alan Cox 986b43f845 Add checks to vm_map_findspace() to test for address wrap. The conditions
where this could occur are very rare, but possible.

Submitted by: Mark W. Krentel
MFC after: 2 weeks
2005-01-18 19:50:09 +00:00
Warner Losh 60727d8b86 /* -> /*- for license, minor formatting changes 2005-01-07 02:29:27 +00:00
Alan Cox 1f70d62298 Modify pmap_enter_quick() so that it expects the page queues to be locked
on entry and it assumes the responsibility for releasing the page queues
lock if it must sleep.

Remove a bogus comment from pmap_enter_quick().

Using the first change, modify vm_map_pmap_enter() so that the page queues
lock is acquired and released once, rather than each time that a page
is mapped.
2004-12-23 20:16:11 +00:00
Alan Cox 85f5b24573 In the common case, pmap_enter_quick() completes without sleeping.
In such cases, the busying of the page and the unlocking of the
containing object by vm_map_pmap_enter() and vm_fault_prefault() is
unnecessary overhead.  To eliminate this overhead, this change
modifies pmap_enter_quick() so that it expects the object to be locked
on entry and it assumes the responsibility for busying the page and
unlocking the object if it must sleep.  Note: alpha, amd64, i386 and
ia64 are the only implementations optimized by this change; arm,
powerpc, and sparc64 still conservatively busy the page and unlock the
object within every pmap_enter_quick() call.

Additionally, this change is the first case where we synchronize
access to the page's PG_BUSY flag and busy field using the containing
object's lock rather than the global page queues lock.  (Modifications
to the page's PG_BUSY flag and busy field have asserted both locks for
several weeks, enabling an incremental transition.)
2004-12-15 19:55:05 +00:00
Alan Cox 94ddc7076d Push Giant deep into vm_forkproc(), acquiring it only if the process has
mapped System V shared memory segments (see shmfork_myhook()) or requires
the allocation of an ldt (see vm_fault_wire()).
2004-09-03 05:11:32 +00:00
Alan Cox c1fbc251cd - Introduce and use a new tunable "debug.mpsafevm". At present, setting
"debug.mpsafevm" results in (almost) Giant-free execution of zero-fill
   page faults.  (Giant is held only briefly, just long enough to determine
   if there is a vnode backing the faulting address.)

   Also, condition the acquisition and release of Giant around calls to
   pmap_remove() on "debug.mpsafevm".

   The effect on performance is significant.  On my dual Opteron, I see a
   3.6% reduction in "buildworld" time.

 - Use atomic operations to update several counters in vm_fault().
2004-08-16 06:16:12 +00:00
Brian Feldman 7c938963a4 Rather than bringing back all of the changes to make VM map deletion
wait for system wires to disappear, do so (much more trivially) by
instead only checking for system wires of user maps and not kernel maps.

Alternative by:	tor
Reviewed by:	alc
2004-08-16 03:11:09 +00:00
Alan Cox 6eaee3fee4 Remove spl calls. 2004-08-14 18:57:41 +00:00
Alan Cox 0164e05781 Replace the linear search in vm_map_findspace() with an O(log n)
algorithm built into the map entry splay tree.  This replaces the
first_free hint in struct vm_map with two fields in vm_map_entry:
adj_free, the amount of free space following a map entry, and
max_free, the maximum amount of free space in the entry's subtree.
These fields make it possible to find a first-fit free region of a
given size in one pass down the tree, so O(log n) amortized using
splay trees.

This significantly reduces the overhead in vm_map_findspace() for
applications that mmap() many hundreds or thousands of regions, and
has a negligible slowdown (0.1%) on buildworld.  See, for example, the
discussion of a micro-benchmark titled "Some mmap observations
compared to Linux 2.6/OpenBSD" on -hackers in late October 2003.

OpenBSD adopted this approach in March 2002, and NetBSD added it in
November 2003, both with Red-Black trees.

Submitted by: Mark W. Krentel
2004-08-13 08:06:34 +00:00
Tor Egge 19dc560756 The vm map lock is needed in vm_fault() after the page has been found,
to avoid later changes before pmap_enter() and vm_fault_prefault()
has completed.

Simplify deadlock avoidance by not blocking on vm map relookup.

In collaboration with: alc
2004-08-12 20:14:49 +00:00
Brian Feldman c5f60ffccf Re-delete the comment from r1.352. 2004-08-12 17:22:28 +00:00
Brian Feldman 0ada205ee6 Back out all behavioral chnages. 2004-08-10 14:42:48 +00:00
Brian Feldman 9689d5e5ee Revamp VM map wiring.
* Allow no-fault wiring/unwiring to succeed for consistency;
  however, the wired count remains at zero, so it's a special case.

* Fix issues inside vm_map_wire() and vm_map_unwire() where the
  exact state of user wiring (one or zero) and system wiring
  (zero or more) could be confused; for example, system unwiring
  could succeed in removing a user wire, instead of being an
  error.

* Require all mappings to be unwired before they are deleted.
  When VM space is still wired upon deletion, it will be waited
  upon for the following unwire.  This makes vslock(9) work
  rather than allowing kernel-locked memory to be deleted
  out from underneath of its consumer as it would before.
2004-08-09 19:52:29 +00:00
Alan Cox 91491c3566 Remove a stale comment from vm_map_lookup() that pertains to share maps.
(The last vestiges of the share map code were removed in revisions 1.153
and 1.159.)
2004-08-09 18:15:46 +00:00
Alan Cox 684a62b7bf - Push down the acquisition and release of Giant into pmap_enter_quick()
on those architectures without pmap locking.
 - Eliminate the acquisition and release of Giant in vm_map_pmap_enter().
2004-08-04 22:03:16 +00:00
Brian Feldman b23f72e98a * Add a "how" argument to uma_zone constructors and initialization functions
so that they know whether the allocation is supposed to be able to sleep
  or not.
* Allow uma_zone constructors and initialation functions to return either
  success or error.  Almost all of the ones in the tree currently return
  success unconditionally, but mbuf is a notable exception: the packet
  zone constructor wants to be able to fail if it cannot suballocate an
  mbuf cluster, and the mbuf allocators want to be able to fail in general
  in a MAC kernel if the MAC mbuf initializer fails.  This fixes the
  panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
  the default.

Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.
2004-08-02 00:18:36 +00:00
Alan Cox 9bb0e06861 - Push down the acquisition and release of Giant into pmap_protect() on
those architectures without pmap locking.
 - Eliminate the acquisition and release of Giant from vm_map_protect().

(Translation: mprotect(2) runs to completion without touching Giant on
alpha, amd64, i386 and ia64.)
2004-07-30 20:38:30 +00:00
Maxime Henrion 12c649749c Get rid of another lockmgr(9) consumer by using sx locks for the user
maps.  We always acquire the sx lock exclusively here, but we can't
use a mutex because we want to be able to sleep while holding the
lock.  This is completely equivalent to what we were doing with the
lockmgr(9) locks before.

Approved by:	alc
2004-07-30 09:10:28 +00:00
Alan Cox 1a276a3f91 - Use atomic ops for updating the vmspace's refcnt and exitingcnt.
- Push down Giant into shmexit().  (Giant is acquired only if the vmspace
   contains shm segments.)
 - Eliminate the acquisition of Giant from proc_rwmem().
 - Reduce the scope of Giant in exit1(), uncovering the destruction of the
   address space.
2004-07-27 03:53:41 +00:00
Alan Cox 57a21aba93 Make the code and comments for vm_object_coalesce() consistent. 2004-07-25 07:48:47 +00:00
Alan Cox 51ab6c2890 Simplify vmspace initialization. The bcopy() of fields from the old
vmspace to the new vmspace in vmspace_exec() is mostly wasted effort.  With
one exception, vm_swrss, the copied fields are immediately overwritten.
Instead, initialize these fields to zero in vmspace_alloc(), eliminating a
bcopy() from vmspace_exec() and a bzero() from vmspace_fork().
2004-07-24 07:40:35 +00:00
Peter Wemm 5476633aed Semi-gratuitous change. Move two refcount operations to their own lines
rather than be buried inside an if (expression).  And now that the if
expression is the same in both exit paths, use the same ordering.
2004-07-21 05:08:10 +00:00
Peter Wemm 3f25cbddc2 Move the initialization and teardown of pmaps to the vmspace zone's
init and fini handlers.  Our vm system removes all userland mappings at
exit prior to calling pmap_release.  It just so happens that we might
as well reuse the pmap for the next process since the userland slate
has already been wiped clean.

However.  There is a functional benefit to this as well.  For platforms
that share userland and kernel context in the same pmap, it means that
the kernel portion of a pmap remains valid after the vmspace has been
freed (process exit) and while it is in uma's cache.  This is significant
for i386 SMP systems with kernel context borrowing because it avoids
a LOT of IPIs from the pmap_lazyfix() cleanup in the usual case.

Tested on:  amd64, i386, sparc64, alpha
Glanced at by:  alc
2004-07-21 00:29:21 +00:00