Commit graph

13261 commits

Author SHA1 Message Date
Lawrence Stewart 0963c8e431 When a previous call to sbsndptr() leaves sb->sb_sndptroff at the start of an
mbuf that was fully consumed by the previous call, the mbuf ptr returned by the
current call ends up being the previous mbuf in the sb chain to the one that
contains the data we want.

This does not cause any observable issues because the mbuf copy routines happily
walk the mbuf chain to get to the data at the moff offset, which in this case
means they effectively skip over the mbuf returned by sbsndptr().

We can't adjust sb->sb_sndptr during the previous call for this case because the
next mbuf in the chain may not exist yet. We therefore need to detect the
condition and make the adjustment during the current call.

Fix by detecting the special case of moff being at the start of the next mbuf in
the chain and adjust the required accounting variables accordingly.

Reviewed by:	andre
MFC after:	2 weeks
2013-06-19 03:08:01 +00:00
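The special case described in the commit above can be illustrated with a minimal sketch. This is not the sbsndptr() code: `struct buf` and `chain_seek()` are hypothetical stand-ins for an mbuf chain and the offset adjustment, assuming only that each buffer records its length and a pointer to the next buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an mbuf: just a length and a next pointer. */
struct buf {
	struct buf *next;
	size_t len;
};

/*
 * Advance past buffers that the offset has fully consumed, so the
 * returned buffer is the one that actually holds the data at *off.
 * If no next buffer exists yet, stay on the current one and leave
 * the offset pointing at its end.
 */
struct buf *
chain_seek(struct buf *b, size_t *off)
{
	assert(off != NULL);
	while (b != NULL && *off >= b->len && b->next != NULL) {
		*off -= b->len;
		b = b->next;
	}
	return (b);
}
```

The adjustment happens at lookup time rather than when the offset is stored, which mirrors the reasoning in the message: the next buffer may not exist when the previous call finishes.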
Lawrence Stewart ec41a9a1bd The fix committed in r250951 replaced the reported panic with a deadlock... gold
star for me. EVENTHANDLER_DEREGISTER() attempts to acquire the lock which is
held by the event handler framework while executing event handler functions,
leading to deadlock.

Move EVENTHANDLER_DEREGISTER() to alq_load_handler() and thus deregister the ALQ
shutdown_pre_sync handler at module unload time, which takes care of the
originally reported panic and fixes the deadlock introduced in r250951.

Reported by:	Luiz Otavio O Souza
MFC after:	3 days
X-MFC with:	250951
2013-06-17 09:49:07 +00:00
Ed Schouten 2381f6ef8c Change callout use counter to use C11 atomics.
In order to get some coverage of C11 atomics in kernelspace, switch at
least one piece of code in kernelspace to use C11 atomics instead of
<machine/atomic.h>.

While there, slightly improve the code by adding an assertion to prevent
the use count from going negative.
2013-06-16 09:30:35 +00:00
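A use counter built on C11 atomics, with an assertion guarding against a negative count, might look like the sketch below. The names `use_count`, `use_acquire()` and `use_release()` are hypothetical and userspace `<stdatomic.h>` is assumed; this is not the kernel callout code touched by the commit.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical use counter; static storage makes it start at zero. */
atomic_int use_count;

/* Take a reference and return the new count. */
int
use_acquire(void)
{
	return (atomic_fetch_add(&use_count, 1) + 1);
}

/* Drop a reference and return the new count. */
int
use_release(void)
{
	int old = atomic_fetch_sub(&use_count, 1);

	assert(old > 0);	/* the count must never go negative */
	return (old - 1);
}
```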
Lawrence Stewart 38f080cb04 Move hhook's per-vnet initialisation to an earlier SYSINIT SI_SUB stage to
ensure all per-vnet related hhook initialisation is completed prior to any
virtualised hhook points attempting registration.

vnet_register_sysinit() requires that a stage later than SI_SUB_VNET be chosen.
There are no per-vnet initialisers in the source tree at this time which run
earlier than SI_SUB_INIT_IF. A quick audit of non-virtualised SYSINITs indicates
there are no subsystems pre SI_SUB_MBUF that would likely be interested in
registering a virtualised hhook point.

Settle on SI_SUB_MBUF as hhook's per-vnet initialisation stage as it's the first
overtly network-related initialisation stage to run after SI_SUB_VNET. If a
subsystem that initialises earlier than SI_SUB_MBUF ends up wanting to register
virtualised hhook points in future, hhook's use of SI_SUB_MBUF will need to be
revisited and would probably warrant creating a dedicated SI_SUB_HHOOK which
runs immediately after SI_SUB_VNET.

MFC after:	1 week
2013-06-15 10:08:34 +00:00
Lawrence Stewart 933a8bff73 Cleanup and simplification in khelp_{register|deregister}_helper(). No
functional changes.

MFC after:	1 week
2013-06-15 06:45:17 +00:00
Lawrence Stewart 58261d30e1 Add a private KPI between hhook and khelp that allows khelp modules to insert
hook functions into hhook points which register after the modules were loaded -
potentially useful during boot or if hhook points are dynamically registered.

MFC after:	1 week
2013-06-15 05:57:29 +00:00
Lawrence Stewart b1f53277ec Internalise handling of virtualised hook points inside
hhook_{add|remove}_hook_lookup() so that khelp (and other potential API
consumers) do not have to care when they attempt to (un)hook a particular hook
point identified by id and type.

Reviewed by:	scottl
MFC after:	1 week
2013-06-15 04:03:40 +00:00
Lawrence Stewart bfe72a58e2 Fix a major oversight in r251732 which causes non-VIMAGE kernels to trigger a
KASSERT during TCP hhook registration at boot. Virtualised hook points only
require extra housekeeping and sanity checking when "options VIMAGE" is present.

Reported by:	bdrewery,jh,dhw
Tested by:	dhw
MFC after:	1 week
X-MFC with:	251732
2013-06-14 18:11:21 +00:00
Lawrence Stewart 601d4c7543 Add support for non-virtualised hhook points, which are uniquely identified by
type and id, as compared to virtualised hook points which are now uniquely
identified by type, id and a vid (which for vimage is the pointer to the vnet
that the hhook resides in).

All hhook_head structs for both virtualised and non-virtualised hook points
coexist in hhook_head_list, and a separate list is maintained for hhook points
within each vnet to simplify some vimage-related housekeeping.

Reviewed by:	scottl
MFC after:	1 week
2013-06-14 04:10:34 +00:00
Lawrence Stewart 86241d89a9 Fix a potential NULL-pointer dereference that would trigger if the hhook
registration site did not provide storage for a copy of the hhook_head struct.

MFC after:	3 days
2013-06-14 02:25:40 +00:00
Jeff Roberson 17a2737732 - Add a BIT_FFS() macro and use it to replace cpusetffs_obj()
Discussed with:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-06-13 20:46:03 +00:00
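For readers unfamiliar with find-first-set over a multi-word bit set, a portable sketch follows. `bitset_ffs()` and `WORD_BITS` are hypothetical names; this is not the BIT_FFS() macro itself, only an illustration of the operation, assuming the set is stored as an array of `unsigned long` words.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define	WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)

/*
 * Return the 1-based index of the lowest set bit across
 * words[0..nwords-1], or 0 if no bit is set.
 */
int
bitset_ffs(const unsigned long *words, size_t nwords)
{
	assert(words != NULL);
	for (size_t i = 0; i < nwords; i++) {
		unsigned long w = words[i];
		int bit = 1;

		if (w == 0)
			continue;	/* nothing set in this word */
		while ((w & 1UL) == 0) {
			w >>= 1;
			bit++;
		}
		return ((int)(i * WORD_BITS) + bit);
	}
	return (0);
}
```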
Konstantin Belousov 1d7466bca4 Fix two issues with the spin loops in the umtx(2) implementation.
- When looping, check for a pending suspension.  Otherwise, another
  usermode thread which races with the looping one could try to
  prevent the process from stopping or exiting.

- Add missing checks for faults from casuword*().  The code is
  structured in a way which makes the loops exit if the specified
  address is invalid, since both fuword() and casuword() return -1 on
  a fault.  But if the address is mapped read-only, the typical value
  read by fuword() is different from -1, while casuword() returns -1.
  Absent the checks for casuword() faults, this is interpreted as a
  race with another thread and causes non-interruptible spinning in the
  kernel.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-06-13 09:33:22 +00:00
Marcel Moolenaar 4612275fdb Revert r251590. It unexpectedly broke the build and there were some
questions on locking. As part of commit-bit grooming, I'd like Steve
to handle this, but can't leave things broken in the mean time.
2013-06-10 15:22:27 +00:00
Marcel Moolenaar 8c7ca16f63 Add vfs_mounted and vfs_unmounted events so that components can be informed
about mount and unmount events. This is used by Juniper to implement a more
optimal implementation of NetBSD's veriexec.

Submitted by:	stevek@juniper.net
Obtained from:	Juniper Networks, Inc
2013-06-09 23:51:26 +00:00
Gleb Smirnoff 8d1aa3c6b4 aio_mlock() added:
- Regen for r251526.
- Bump __FreeBSD_version.
2013-06-08 13:30:13 +00:00
Gleb Smirnoff 6160e12c10 Add new system call - aio_mlock(). The name speaks for itself. It allows
the mlock(2) operation, which can consume a lot of time, to be performed
under control of aio(4).

Reviewed by:	kib, jilles
Sponsored by:	Nginx, Inc.
2013-06-08 13:27:57 +00:00
Gleb Smirnoff f95c13db04 Separate LIO_SYNC processing into a separate function aio_process_sync(),
and rename aio_process() into aio_process_rw().

Reviewed by:	kib
Sponsored by:	Nginx, Inc.
2013-06-08 13:02:43 +00:00
John Baldwin c9813d0a37 Do not compare the existing mask of a cpuset with a new mask when changing
the mask of a cpuset.  Also, change the cpuset's mask before updating the
masks of all children.  Previously changing a cpuset's mask first required
setting the mask to a super-set of both the old and new masks and then
changing it a second time to the new mask.
2013-06-06 14:43:19 +00:00
Alan Cox 27a18d6a23 Don't busy the page unless we are likely to release the object lock.
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2013-06-06 06:17:20 +00:00
Jeff Roberson ba39d89bc9 - Consolidate duplicate code into support functions.
- Split the bqlock into bqclean and bqdirty locks.
- Only acquire the wakeup synchronization locks when we cross a
  threshold requiring them.
- Restructure the way flushbufqueues() targets work so they are more
  smp friendly and sane.

Reviewed by:	kib
Discussed with:	mckusick, attilio
Sponsored by:	EMC / Isilon Storage Division

M    vfs_bio.c
2013-06-05 23:53:00 +00:00
Gleb Smirnoff 82e825c4c9 Improve r250890, so that we stop processing of a message with zero
descriptors as early as possible, and assert that number of descriptors
is positive in unp_freerights().

Reviewed by:	mjg, pjd, jilles
2013-06-04 11:19:08 +00:00
John Baldwin 24150d37d3 - Fix a couple of inverted panic messages for shared/exclusive mismatches
  of a lock within a single thread.
- Fix handling of interlocks in WITNESS by properly requiring the interlock
  to be held exactly once if it is specified.
2013-06-03 17:41:11 +00:00
John Baldwin 95d28652af - Handle the recursed/not recursed flags with RA_RLOCKED in rw_assert().
- Tweak a panic message.
2013-06-03 17:38:57 +00:00
Konstantin Belousov d39116f5d5 Be more generous when donating the current thread time to the owner of
the vnode lock while iterating over the free vnode list.  Instead of
yielding, pause for 1 tick.  The change is reported to help in some
virtualized environments.

Submitted by:	Roger Pau Monné <roger.pau@citrix.com>
Discussed with:	jilles
Tested by:	pho
MFC after:	2 weeks
2013-06-03 17:36:43 +00:00
Konstantin Belousov 1e65d73c74 Do not map the shared page COW. If the process wired its address
space, fork(2) would cause shadowing of the physical object and
copying of the shared page into private copy, effectively preventing
updates for the exported timehands structure and stopping the clock.

Specify the maximum allowed permissions for the page to be read and
execute, preventing write from the user mode.

Reported and tested by:	<huanghwh@yahoo.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-06-03 04:32:53 +00:00
Konstantin Belousov 92fab43f7f When auto-sizing the buffer cache, limit the amount of physical memory
used as the estimation of size, to 32GB.  This provides around 100K of
buffer headers and corresponding KVA for buffer map at the peak.
Sizing the cache larger is not useful and also wastes and exhausts
KVA on large machines.

Reported and tested by:	bdrewery
Sponsored by:	The FreeBSD Foundation
2013-06-03 04:16:48 +00:00
Alan Cox 39a4cd0cec Reduce the scope of the VM object locking in brelse(). In my tests, this
change reduced the total number of VM object lock acquisitions by brelse()
by 74%.

Sponsored by:	EMC / Isilon Storage Division
2013-06-02 16:18:03 +00:00
Marius Strobl 0ad17e4b32 Move an assertion to the right spot; only bus_dmamap_load_mbuf(9)
requires a pkthdr being present but that's not the case for either
_bus_dmamap_load_mbuf_sg() or bus_dmamap_load_mbuf_sg(9).

Reported by:	sbruno
MFC after:	1 week
2013-06-01 11:42:47 +00:00
John Baldwin 3d4c503cf0 Style fixes to vn_ioctl().
Suggested by:	bde
2013-05-31 16:15:22 +00:00
Jeff Roberson 22a722605d - Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by:	EMC / Isilon Storage Division
Discussed with:	mckusick, kib, mdf
2013-05-31 00:43:41 +00:00
Julian Elischer 4591f0d339 Initialising the new fibnum field to a known value turns out to
be a GOOD IDEA (TM).
Apparently MOST users set this (e.g. tcp and friends) but there are a few
users that just assume that it is a sensible value but then go on to read it.
These include SCTP, pf and the FLOWTABLE option (and maybe others).
2013-05-24 02:18:37 +00:00
Lawrence Stewart 7639c9be45 Ensure alq's shutdown_pre_sync event handler is deregistered on module unload to
avoid a dangling pointer and eventual panic on system shutdown.

Reported by:	Ali <comnetboy at gmail.com>
Tested by:	Ali <comnetboy at gmail.com>
MFC after:	1 week
2013-05-24 00:49:12 +00:00
Pawel Jakub Dawidek 92981fdf9e Use proper malloc type for ioctls white-list.
Reported by:	pho
Tested by:	pho
2013-05-23 21:07:26 +00:00
Luigi Rizzo 4b62214f4a Increase the (arbitrary) limit for the number of packets per tick
from 1k to 20k. The previous value was good 10 years ago, but not
anymore now.

More importantly, lots of good surprises:
polling is incredibly effective under virtualization, and not only
prevents livelock but also saves most of the VM exit overhead in
receive mode.

Using polling, a FreeBSD instance under qemu-kvm remains perfectly
responsive even when bombed with 10 Mpps over an emulated e1000,
and happily processes 1.7 Mpps through ipfw.

Note that some incompatibilities still remain: e.g. polling is not
(yet) compatible with netmap, and seems to freeze the guest when
kern.polling.idle_poll=1

MFC after:	3 days
2013-05-22 16:32:18 +00:00
Mateusz Guzik ecbb2a1819 passing fd over unix socket: fix a corner case where caller
wants to pass no descriptors.

Previously the kernel would leak memory and try to free a potentially
arbitrary pointer.

Reviewed by:	pjd
2013-05-21 21:58:00 +00:00
Attilio Rao bed927ee17 vm_object locking is not needed there as pages are already wired.
Sponsored by:	EMC / Isilon storage division
Submitted by:	alc
2013-05-21 20:54:03 +00:00
Konstantin Belousov f85769eb75 Regenerate. 2013-05-21 11:41:08 +00:00
Konstantin Belousov 48947eccee Fix the wait6(2) on 32bit architectures and for the compat32, by using
the right type for the argument in syscalls.master.  Also fix the
posix_fallocate(2) and posix_fadvise(2) compat32 syscalls on the
architectures which require padding of the 64bit argument.

Noted and reviewed by:	jhb
Pointy hat to:	kib
MFC after:	1 week
2013-05-21 11:40:16 +00:00
Pawel Jakub Dawidek 9b9ff7d390 Style nits. 2013-05-19 23:30:24 +00:00
Pawel Jakub Dawidek 9b1040a574 Use SDT_PROBE1() instead of SDT_PROBE(). 2013-05-19 23:29:22 +00:00
Jamie Gritton 761d2bb5b9 Refine the "nojail" rc keyword, adding "nojailvnet" for files that don't
apply to most jails but do apply to vnet jails.  This includes adding
a new sysctl "security.jail.vnet" to identify vnet jails.

PR:		conf/149050
Submitted by:	mdodd
MFC after:	3 days
2013-05-19 04:10:34 +00:00
Attilio Rao e3ed7ff03f Use readlocking now that assertions on vm_page_lookup() are relaxed.
Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	flo, pho
2013-05-17 20:03:55 +00:00
Jaakko Heinonen c532f8c4d6 A library function shall not set errno to 0.
Reviewed by:	mdf
2013-05-16 18:13:10 +00:00
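The rule in the commit above (a library function never leaves errno set to 0) can be sketched as follows. `parse_long()` is a hypothetical helper, not the function changed by the commit; it assumes the common pattern of clearing errno internally to detect strtol() overflow and then restoring the caller's value on success.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a decimal long.  Returns 0 on success and -1 on failure.
 * On success the caller's errno is left exactly as it was; on
 * failure errno holds a non-zero error code.
 */
int
parse_long(const char *s, long *out)
{
	int saved = errno;
	char *end;
	long v;

	assert(s != NULL && out != NULL);
	errno = 0;		/* needed to detect ERANGE from strtol() */
	v = strtol(s, &end, 10);
	if (errno != 0)
		return (-1);	/* keep the error strtol() reported */
	if (end == s || *end != '\0') {
		errno = EINVAL;	/* no digits, or trailing garbage */
		return (-1);
	}
	errno = saved;		/* success: never leave errno at 0 */
	*out = v;
	return (0);
}
```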
Jeff Roberson f2cc1285c2 - Add a new general purpose path-compressed radix trie which can be used
  with any structure containing a uint64_t index.  The tree code
  auto-generates type safe wrappers.
- Eliminate the buf splay and replace it with pctrie.  This is not only
  significantly faster with large files but also allows for the possibility
  of shared locking.

Reviewed by:    alc, attilio
Sponsored by:   EMC / Isilon Storage Division
2013-05-12 04:05:01 +00:00
Konstantin Belousov 0fc6daa72d - Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The
  null_hashget() obtains the reference on the nullfs vnode, which must
  be dropped.

- Fix a wart which existed from the introduction of the nullfs
  caching, do not unlock lower vnode in the nullfs_reclaim_lowervp().
  It should be innocent, but now it is also formally safe.  Inform the
  nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on
  nullfs inode.

- Add a callback to the upper filesystems for the lower vnode
  unlinking. When inactivating a nullfs vnode, check if the lower
  vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC
  on the lower vnode, and reclaim upper vnode if so.  This allows
  nullfs to purge cached vnodes for the unlinked lower vnode, avoiding
  excessive caching.

Reported by:	Göran Löwkrantz <goran.lowkrantz@ismobile.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-05-11 11:17:44 +00:00
Eitan Adler 7a2b450ff8 Fix a bunch of typos.
PR:	misc/174625
Submitted by:	Jeremy Chadwick <jdc@koitsu.org>
2013-05-10 16:41:26 +00:00
Marcel Moolenaar e63091ea6c Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE
locks. To support this, VNODE locks are created with the LK_IS_VNODE
flag. This flag is propagated down using the LO_IS_VNODE flag.

Note that WITNESS still records the LOR. Only the printing and the
optional entering into the kernel debugger is bypassed with the
WITNESS_NO_VNODE option.
2013-05-09 16:28:18 +00:00
Konstantin Belousov 3328431dec Item 1 in r248830 causes earlier exits from the sendfile(2), before
all requested data was sent.  The reason is that xfsize <= 0 condition
must not be tested at all if space == loopbytes.  Otherwise, the done
is set to 1, and sendfile(2) is aborted too early.

Instead of moving the condition to exiting the inner loop after the
xfersize check, directly check for the completed transfer before the
testing of the available space in the socket buffer, and revert item 1
of r248830.  It is arguably another bug to sleep waiting for socket
buffer space (or return EAGAIN for non-blocking socket) if all bytes
are already transferred.

Reported by:	pho
Discussed with:	scottl, gibbs
Tested by:	scottl (stable/9 backport), pho
2013-05-09 16:05:51 +00:00
Andre Oppermann 6753da1356 When the accept queue is full print the number of already pending
new connections instead of by how many we're over the limit, which
is always 1.

Noticed by:	jmallet
MFC after:	1 week
2013-05-08 14:13:14 +00:00
Scott Long ab8f55b9fd Add a sysctl vfs.read_min to complement the existing vfs.read_max. It
defaults to 1, meaning that it's off.

When read-ahead is enabled on a file, the vfs cluster code deliberately
breaks a read into 2 I/O transactions; one to satisfy the actual read,
and one to perform read-ahead.  This makes sense in low-latency
circumstances, but often produces unbalanced i/o transactions that
penalize disks.  By setting vfs.read_min, we can tell the algorithm to
fetch a larger transaction than what we asked for, achieving the same
effect as the read-ahead but without the doubled, unbalanced transaction
and the slightly lower latency.  This significantly helps our workloads
with video streaming.

Submitted by:	emax
Reviewed by:	kib
Obtained from:	Netflix
2013-05-07 08:16:21 +00:00