Commit graph

2451 commits

Author SHA1 Message Date
Konstantin Belousov 11041003c6 Use the VM_ALLOC_INTERRUPT for the page requests when allocating memory
for the bio for swapout write. It allows the page allocator to drain
free page list deeper. As result, a deadlock where pageout deamon sleeps
waiting for bio to be allocated for swapout is no more reproducable in
practice.

Alan said that M_USE_RESERVE shall be ressurrected and used there, but
until this is implemented, M_NOWAIT does exactly what is needed.

Tested by:	pho, kris
Reviewed by:	alc
No objections from:	phk
MFC after:	2 weeks (RELENG_7 only)
2008-07-11 11:27:42 +00:00
Alan Cox b89eaf4e9f Enable the creation of a kmem map larger than 4GB.
Submitted by: Tz-Huan Huang

Make several variables related to kmem map auto-sizing static.
Found by: CScout
2008-07-05 19:34:33 +00:00
Alan Cox 5cfa90e902 Make preparations for increasing the size of the kernel virtual address space
on the amd64 architecture.  The amd64 architecture requires kernel code and
global variables to reside in the highest 2GB of the 64-bit virtual address
space.  Thus, the memory allocated during bootstrap, before the call to
kmem_init(), starts at KERNBASE, which is not necessarily the same as
VM_MIN_KERNEL_ADDRESS on amd64.
2008-06-22 04:54:27 +00:00
Alan Cox c1f02198d1 KERNBASE is not necessarily an address within the kernel map, e.g.,
PowerPC/AIM.  Consequently, it should not be used to determine the maximum
number of kernel map entries.  Intead, use VM_MIN_KERNEL_ADDRESS, which marks
the start of the kernel map on all architectures.

Tested by:	marcel@ (PowerPC/AIM)
2008-06-21 21:02:13 +00:00
Stephan Uphoff 11be8415c9 Fix vm object creation locking to allow SHARED vnode locking for vnode_create_vobject.
(Not currently used)

Noticed by: kib@
2008-06-12 20:46:47 +00:00
Alan Cox 8bcd3b1998 Essentially, neither madvise(..., MADV_DONTNEED) nor madvise(..., MADV_FREE)
work.  (Moreover, I don't believe that they have ever worked as intended.)
The explanation is fairly simple.  Both MADV_DONTNEED and MADV_FREE perform
vm_page_dontneed() on each page within the range given to madvise().  This
function moves the page to the inactive queue.  Specifically, if the page is
clean, it is moved to the head of the inactive queue where it is first in
line for processing by the page daemon.  On the other hand, if it is dirty,
it is placed at the tail.  Let's further examine the case in which the page
is clean.  Recall that the page is at the head of the line for processing by
the page daemon.  The expectation of vm_page_dontneed()'s author was that
the page would be transferred from the inactive queue to the cache queue by
the page daemon.  (Once the page is in the cache queue, it is, in effect,
free, that is, it can be reallocated to a new vm object by vm_page_alloc()
if it isn't reactivated quickly enough by a user of the old vm object.)  The
trouble is that nowhere in the execution of either MADV_DONTNEED or
MADV_FREE is either the machine-independent reference flag (PG_REFERENCED)
or the reference bit in any page table entry (PTE) mapping the page cleared.
Consequently, the immediate reaction of the page daemon is to reactivate the
page because it is referenced.  In effect, the madvise() was for naught.
The case in which the page was dirty is not too different.  Instead of being
laundered, the page is reactivated.

Note: The essential difference between MADV_DONTNEED and MADV_FREE is
that MADV_FREE clears a page's dirty field.  So, MADV_FREE is always
executing the clean case above.

This revision changes vm_page_dontneed() to clear both the machine-
independent reference flag (PG_REFERENCED) and the reference bit in all PTEs
mapping the page.

MFC after:	6 weeks
2008-06-06 18:38:43 +00:00
Alan Cox ba3042115f To date, our implementation of munmap(2) has required that the
entirety of the specified range be mapped.  Specifically, it has
returned EINVAL if the entire range is not mapped.  There is not,
however, any basis for this in either SuSv2 or our own man page.
Moreover, neither Linux nor Solaris impose this requirement.  This
revision removes this requirement.

Submitted by: Tijl Coosemans
PR: 118510
MFC after: 6 weeks
2008-05-24 21:57:16 +00:00
Stephan Uphoff 2ac78f0e1a Allow VM object creation in ufs_lookup. (If vfs.vmiodirenable is set)
Directory IO without a VM object will store data in 'malloced' buffers
severely limiting caching of the data. Without this  change VM objects for
directories are only created on an open() of the directory.
TODO: Inline test if VM object already exists to avoid locking/function call
overhead.

Tested by: kris@
Reviewed by: jeff@
Reported by: David Filo
2008-05-20 19:05:43 +00:00
Alan Cox 1ec1304bdb Retire pmap_addr_hint(). It is no longer used. 2008-05-18 04:16:57 +00:00
Alan Cox d0a83a83bf In order to map device memory using superpages, mmap(2) must find a
superpage-aligned virtual address for the mapping.  Revision 1.65
implemented an overly simplistic and generally ineffectual method for
finding a superpage-aligned virtual address.  Specifically, it rounds
the virtual address corresponding to the end of the data segment up to
the next superpage-aligned virtual address.  If this virtual address
is unallocated, then the device will be mapped using superpages.
Unfortunately, in modern times, where applications like the X server
dynamically load much of their code, this virtual address is already
allocated.  In such cases, mmap(2) simply uses the first available
virtual address, which is not necessarily superpage aligned.

This revision changes mmap(2) to use a more robust method,
specifically, the VMFS_ALIGNED_SPACE option that is now implemented by
vm_map_find().
2008-05-17 19:32:48 +00:00
Alan Cox e46cd4132c Preset a device object's alignment ("pg_color") based upon the
physical address of the device's memory.  This enables
pmap_align_superpage() to propose a virtual address for mapping the
device memory that permits the use of superpage mappings.
2008-05-17 16:26:34 +00:00
Alan Cox f578838754 Don't call vm_reserv_alloc_page() on device-backed objects. Otherwise, the
system may panic because there is no reservation structure corresponding to
the physical address of the device memory.

Reported by: Giorgos Keramidas
2008-05-15 18:52:31 +00:00
Alan Cox 6ac3ab7f98 Provide the new argument to kmem_suballoc(). 2008-05-10 23:39:27 +00:00
Alan Cox 3202ed7523 Introduce a new parameter "superpage_align" to kmem_suballoc() that is
used to request superpage alignment for the submap.

Request superpage alignment for the kmem_map.

Pass VMFS_ANY_SPACE instead of TRUE to vm_map_find().  (They are currently
equivalent but VMFS_ANY_SPACE is the new preferred spelling.)

Remove a stale comment from kmem_malloc().
2008-05-10 21:46:20 +00:00
Alan Cox 26c538ffcd Generalize vm_map_find(9)'s parameter "find_space". Specifically, add
support for VMFS_ALIGNED_SPACE, which requests the allocation of an
address range best suited to superpages.  The old options TRUE and FALSE
are mapped to VMFS_ANY_SPACE and VMFS_NO_SPACE, so that there is no
immediate need to update all of vm_map_find(9)'s callers.

While I'm here, correct a misstatement about vm_map_find(9)'s return
values in the man page.
2008-05-10 18:55:35 +00:00
Alan Cox d3249b142b Introduce pmap_align_superpage(). It increases the starting virtual
address of the given mapping if a different alignment might result in more
superpage mappings.
2008-05-09 16:48:07 +00:00
Kip Macy c8c7ad9260 add malloc flag to blist so that it can be used in ithread context
Reviewed by: alc, bsdimp
2008-05-05 19:48:54 +00:00
Alan Cox 2bc24aa956 Eliminate pointless casts from kmem_suballoc(). 2008-04-28 17:25:27 +00:00
Alan Cox b8ca4ef2e3 vm_map_fixed(), unlike vm_map_find(), does not update "addr", so it can be
passed by value.
2008-04-28 05:30:23 +00:00
Jeff Roberson 8df78c41d6 - Make SCHED_STATS more generic by adding a wrapper to create the
variables and sysctl nodes.
 - In reset walk the children of kern_sched_stats and reset the counters
   via the oid_arg1 pointer.  This allows us to add arbitrary counters to
   the tree and still reset them properly.
 - Define a set of switch types to be passed with flags to mi_switch().
   These types are named SWT_*.  These types correspond to SCHED_STATS
   counters and are automatically handled in this way.
 - Make the new SWT_ types more specific than the older switch stats.
   There are now stats for idle switches, remote idle wakeups, remote
   preemption ithreads idling, etc.
 - Add switch statistics for ULE's pickcpu algorithm.  These stats include
   how much migration there is, how often affinity was successful, how
   often threads were migrated to the local cpu on wakeup, etc.

Sponsored by:	Nokia
2008-04-17 04:20:10 +00:00
Alan Cox 44aab2c3de Introduce vm_reserv_reclaim_contig(). This function is used by
contigmalloc(9) as a last resort to steal pages from an inactive,
partially-used superpage reservation.

Rename vm_reserv_reclaim() to vm_reserv_reclaim_inactive() and
refactor it so that a separate subroutine is responsible for breaking
the selected reservation.  This subroutine is also used by
vm_reserv_reclaim_contig().
2008-04-06 18:09:28 +00:00
Alan Cox 2fbced6574 Eliminate an unnecessary test from vm_phys_unfree_page(). 2008-04-05 05:02:53 +00:00
Alan Cox c416972587 Update a comment to vm_map_pmap_enter(). 2008-04-04 19:14:58 +00:00
Alan Cox 7630c26507 Reintroduce UMA_SLAB_KMAP; however, change its spelling to
UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM.
(UMA_SLAB_KMAP met its original demise in revision 1.30 of
vm/uma_core.c.)  UMA_SLAB_KERNEL is now required by the jumbo frame
allocators.  Without it, UMA cannot correctly return pages from the
jumbo frame zones to the VM system because it resets the pages' object
field to NULL instead of the kernel object.  In more detail, the jumbo
frame zones are created with the option UMA_ZONE_REFCNT.  This causes
UMA to overwrite the pages' object field with the address of the slab.
However, when UMA wants to release these pages, it doesn't know how to
restore the object field, so it sets it to NULL.  This change teaches
UMA how to reset the object field to the kernel object.

Crashes reported by: kris
Fix tested by: kris
Fix discussed with: jeff
MFC after: 6 weeks
2008-04-04 18:41:12 +00:00
Alan Cox 24dedba9f5 Eliminate an unnecessary printf() from kmem_suballoc(). The subsequent
panic() can be extended to convey the same information.
2008-03-30 20:08:59 +00:00
Jeff Roberson 52481a9a9d - Use vm_object_reference_locked() directly from
vm_object_reference().  This is intended to get rid of vget()
   consumers who don't wish to acquire a lock.  This is functionally
   the same as calling vref(). vm_object_reference_locked() already
   uses vref.

Discussed with:	alc
2008-03-29 07:06:13 +00:00
Konstantin Belousov 91a35e7870 Do not dereference cdev->si_cdevsw, use the dev_refthread() to properly
obtain the reference. In particular, this fixes the panic reported in
the PR. Remove the comments stating that this needs to be done.

PR:	kern/119422
MFC after:	1 week
2008-03-20 16:08:42 +00:00
Alan Cox e5b006ffca Rename vm_pageq_requeue() to vm_page_requeue() on account of its recent
migration to vm/vm_page.c.
2008-03-19 20:24:35 +00:00
Jeff Roberson 374ae2a393 - Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
 - Reflect these changes in the proc.h documentation and consumers throughout
   the kernel.  This is a substantial reduction in locking cost for these
   fields and was made possible by recent changes to threading support.
2008-03-19 06:19:01 +00:00
Alan Cox 1fa94a36b1 Almost seven years ago, vm/vm_page.c was split into three parts:
vm/vm_contig.c, vm/vm_page.c, and vm/vm_pageq.c.  Today, vm/vm_pageq.c
has withered to the point that it contains only four short functions,
two of which are only used by vm/vm_page.c.  Since I can't foresee any
reason for vm/vm_pageq.c to grow, it is time to fold the remaining
contents of vm/vm_pageq.c back into vm/vm_page.c.

Add some comments.  Rename one of the functions, vm_pageq_enqueue(),
that is now static within vm/vm_page.c to vm_page_enqueue().
Eliminate PQ_MAXCOUNT as it no longer serves any purpose.
2008-03-18 06:52:15 +00:00
Alan Cox ec96dca788 Simplify the inner loop of vm_fault()'s delete-behind heuristic.
Instead of checking each page for PG_UNMANAGED, perform a one-time
check whether the object is OBJT_PHYS.  (PG_UNMANAGED pages only
belong to OBJT_PHYS objects.)
2008-03-16 17:37:19 +00:00
Robert Watson 237fdd787b In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
Jeff Roberson 6617724c5f Remove kernel support for M:N threading.
While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential.  Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.
2008-03-12 10:12:01 +00:00
Jeff Roberson c5aa6b581d - Pass the priority argument from *sleep() into sleepq and down into
sched_sleep().  This removes extra thread_lock() acquisition and
   allows the scheduler to decide what to do with the static boost.
 - Change the priority arguments to cv_* to match sleepq/msleep/etc.
   where 0 means no priority change.  Catch -1 in cv_broadcastpri() and
   convert it to 0 for now.
 - Set a flag when sleeping in a way that is compatible with swapping
   since direct priority comparisons are meaningless now.
 - Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
   controls the boost behavior.  Turning it off gives better performance
   in some workloads but needs more investigation.
 - While we're modifying sleepq, change signal and broadcast to both
   return with the lock held as the lock was held on enter.

Reviewed by:	jhb, peter
2008-03-12 06:31:06 +00:00
Alan Cox 593e717ec9 Eliminate an unnecessary test from vm_fault's delete-behind heuristic.
Specifically, since the delete-behind heuristic is never applied to a
device-backed object, there is no point in checking whether each of the
object's pages is fictitious.  (Only device-backed objects have
fictitious pages.)
2008-03-09 06:08:58 +00:00
Marcel Moolenaar 8775db6f50 Make the vm_pmap field of struct vmspace the last field in the
structure. This allows per-CPU variations of struct pmap on a
single architecture without affecting the machine-independent
fields. As such, the PMAP variations don't affect the ABI. They
become part of it.
2008-03-01 22:54:42 +00:00
Alan Cox 688559667f Correct a long-standing error in vm_object_page_remove(). Specifically,
pmap_remove_all() must not be called on fictitious pages.  To date,
fictitious pages have been allocated from zeroed memory, effectively
hiding this problem because the fictitious pages appear to have an empty
pv list.  Submitted by: Kostik Belousov

Rewrite the comments describing vm_object_page_remove() to better
describe what it does.  Add an assertion.  Reviewed by: Kostik Belousov

MFC after: 1 week
2008-02-26 17:16:48 +00:00
Alan Cox 4c8e0452e0 Correct a long-standing error in vm_object_deallocate(). Specifically,
only anonymous default (OBJT_DEFAULT) and swap (OBJT_SWAP) objects should
ever have OBJ_ONEMAPPING set.  However, vm_object_deallocate() was
setting it on device (OBJT_DEVICE) objects.  As a result,
vm_object_page_remove() could be called on a device object and if that
occurred pmap_remove_all() would be called on the device object's pages.
However, a device object's pages are fictitious, and fictitious pages do
not have an initialized pv list (struct md_page).

To date, fictitious pages have been allocated from zeroed memory,
effectively hiding this problem.  Now, however, the conversion of rotting
diagnostics to invariants in the amd64 and i386 pmaps has revealed the
problem.  Specifically, assertion failures have occurred during the
initialization phase of the X server on some hardware.

MFC after: 1 week
Discussed with: Kostik Belousov
Reported by: Michiel Boland
2008-02-24 18:03:56 +00:00
Attilio Rao 22db15c06f VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
2008-01-13 14:44:15 +00:00
Pawel Jakub Dawidek 79c2840d1d When one tries to allocate memory with the M_WAITOK flag and we are short in
address space in kmem map call vm_lowmem event in a loop and wait a bit for
subsystems to reclaim some memory which in turn will reclaim address space as
well.

Note, this is a work-around.

Reviewed by:	alc
Approved by:	alc
MFC after:	3 days
2008-01-10 08:36:38 +00:00
Attilio Rao cb05b60a89 vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by:	Diego Sardina <siarodx at gmail dot com>,
		Andrea Di Pasquale <whyx dot it at gmail dot com>
2008-01-10 01:10:58 +00:00
John Baldwin 8e38aeff17 Add a new file descriptor type for IPC shared memory objects and use it to
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
  object which provides the backing store.  Each descriptor starts off with
  a size of zero, but the size can be altered via ftruncate(2).  The shared
  memory file descriptors also support fstat(2).  read(2), write(2),
  ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
  memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
  manage shared memory file descriptors.  The virtual namespace that maps
  pathnames to shared memory file descriptors is implemented as a hash
  table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
  of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
  path argument to shm_open(2).  In this case, an unnamed shared memory
  file descriptor will be created similar to the IPC_PRIVATE key for
  shmget(2).  Note that the shared memory object can still be shared among
  processes by sharing the file descriptor via fork(2) or sendmsg(2), but
  it is unnamed.  This effectively serves to implement the getmemfd() idea
  bandied about the lists several times over the years.
- The backing store for shared memory file descriptors are garbage
  collected when they are not referenced by any open file descriptors or
  the shm_open(2) virtual namespace.

Submitted by:	dillon, peter (previous versions)
Submitted by:	rwatson (I based this on his version)
Reviewed by:	alc (suggested converting getmemfd() to shm_open())
2008-01-08 21:58:16 +00:00
Christian S.J. Peron 35918c55e5 When MAC is enabled in the kernel, fix a panic triggered by a locking
assertion hit in swapoff_one() when we un-mount a swap partition.  We
should be using curthread where we used thread0 before.  This change
also replaces the thread argument with a credential argument, as the
MAC framework only requires the cred.

It should be noted that this allows the machine to be rebooted without
panicing with "cannot differ from curthread or NULL" when MAC is enabled.

Submitted by:	rwatson
Reviewed by:	attilio
MFC after:	2 weeks
2008-01-08 14:58:41 +00:00
Konstantin Belousov 77bc7900bc In the vm_map_stack(), check for the specified stack region wraparound.
Reported and tested by:	Peter Holm
Reviewed by:	alc
MFC after:	3 days
2008-01-04 04:33:13 +00:00
Alan Cox eb2a051720 Add an access type parameter to pmap_enter(). It will be used to implement
superpage promotion.

Correct a style error in kmem_malloc(): pmap_enter()'s last parameter is
a Boolean.
2008-01-03 07:34:34 +00:00
Alan Cox 273bf93c8d Defer setting either PG_CACHED or PG_FREE until after the free page
queues lock is acquired.  Otherwise, the state of a reservation's
pages' flags and its population count can be inconsistent.  That could
result in a page being freed twice.

Reported by:	kris
2008-01-02 04:43:47 +00:00
Alan Cox af6ce1660a Correct a style error that was introduced in revision 1.77. 2008-01-01 20:36:04 +00:00
Alan Cox f8a47341fe Add the superpage reservation system. This is "part 2 of 2" of the
machine-independent support for superpages.  (The earlier part was
the rewrite of the physical memory allocator.)  The remainder of the
code required for superpages support is machine-dependent and will
be added to the various pmap implementations at a later date.

Initially, I am only supporting one large page size per architecture.
Moreover, I am only enabling the reservation system on amd64.  (In
an emergency, it can be disabled by setting VM_NRESERVLEVELS to 0
in amd64/include/vmparam.h or your kernel configuration file.)
2007-12-29 19:53:04 +00:00
Alan Cox 3df92083af Add a list of reservations to the vm object structure.
Recycle the vm object's "pg_color" field to represent the color of the
first virtual page address at which the object is mapped instead of the
color of the object's first physical page.  Since an object may not be
mapped, introduce a flag "OBJ_COLORED" that indicates whether "pg_color"
is valid.
2007-12-27 17:56:35 +00:00
Alan Cox ae0fee95e1 Add the superpage reservation type. 2007-12-27 17:08:11 +00:00