Commit graph

1298 commits

Author SHA1 Message Date
Andrey V. Elsukov 52c57247d3 Remove unused variable.
PR:		173521
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-04-17 06:40:11 +00:00
Andrey V. Elsukov 4fd913364f Properly release the in6_multi lock.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-04-12 02:05:31 +00:00
Kevin Lo d1b18731d9 Minor style cleanups. 2014-04-07 01:55:53 +00:00
Kevin Lo e06e816f67 Add support for UDP-Lite protocol (RFC 3828) to IPv4 and IPv6 stacks.
Tested with vlc and a test suite [1].

[1] http://www.erg.abdn.ac.uk/~gerrit/udp-lite/files/udplite_linux.tar.gz

Reviewed by:	jhb, glebius, adrian
2014-04-07 01:53:03 +00:00
Andrey V. Elsukov cd71804c84 Remove unused label.
MFC after:	1 week
2014-03-31 14:40:35 +00:00
Andrey V. Elsukov 27aa751c90 Don't generate an ICMPv6 error message if packet was consumed by filter.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-03-31 14:27:22 +00:00
Robert Watson 7527624efa Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation.  This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.

(1) Merge a software implementation of the Toeplitz hash specified in
    RSS implemented by David Malone.  This is used to allow suitable
    pcbgroup placement of connections before the first packet is
    received from the NIC.  Software hashing is generally avoided,
    however, due to high cost of the hash on general-purpose CPUs.

(2) In in_rss.c, maintain authoritative versions of RSS state intended
    to be pushed to each NIC, including keying material, hash
    algorithm/configuration, and buckets.  Provide software-facing
    interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
    the RSS standardised Toeplitz and a 'naive' variation with a hash
    efficient in software but with poor distribution properties.
    Implement rss_m2cpuid() to be used by netisr and other load
    balancing code to look up the CPU on which an mbuf should be
    processed.

(3) In the Ethernet link layer, allow netisr distribution using RSS as
    a source of policy as an alternative to source ordering; continue
    to default to direct dispatch (i.e., don't try and requeue packets
    for processing on the 'right' CPU if they arrive in a directly
    dispatchable context).

(4) Allow RSS to control tuning of connection groups in order to align
    groups with RSS buckets.  If a packet arrives on a protocol using
    connection groups, and contains a suitable hardware-generated
    hash, use that hash value to select the connection group for pcb
    lookup for both IPv4 and IPv6.  If no hardware-generated Toeplitz
    hash is available, we fall back on regular PCB lookup risking
    contention rather than pay the cost of Toeplitz in software --
    this is a less scalable but, at my last measurement, faster
    approach.  As core counts go up, we may want to revise this
    strategy despite CPU overhead.

Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP.  This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers.  Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing.  This will hopefully prove
a useful starting point for refinement.

No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.

Sponsored by:   Juniper Networks (original work)
Sponsored by:   EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
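The software Toeplitz hash mentioned in item (1) of the RSS commit above (7527624efa) can be sketched roughly as follows. This is a minimal userland illustration, not the code that was merged: the function name `toeplitz_hash_sketch` and its argument order are assumptions made for this example. For every set bit of the input, the current 32-bit window of the key is XORed into the result, and the window then slides one bit along the key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's implementation): compute a
 * Toeplitz hash of data[0..datalen) under key[0..keylen).
 */
uint32_t
toeplitz_hash_sketch(const uint8_t *key, size_t keylen,
    const uint8_t *data, size_t datalen)
{
	uint32_t hash = 0;
	uint32_t window;
	size_t i;
	int bit;

	assert(key != NULL && keylen >= 4);
	assert(data != NULL || datalen == 0);

	/* Start with the leftmost 32 bits of the key. */
	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | (uint32_t)key[3];

	for (i = 0; i < datalen; i++) {
		for (bit = 7; bit >= 0; bit--) {
			/* XOR in the window for every set input bit. */
			if (data[i] & (1u << bit))
				hash ^= window;
			/* Slide the window one bit along the key. */
			window <<= 1;
			if (i + 4 < keylen && (key[i + 4] & (1u << bit)))
				window |= 1;
		}
	}
	return (hash);
}
```

The inner loop runs once per input bit, which hints at why the commit message calls software hashing costly on general-purpose CPUs.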
Gleb Smirnoff aa69c61235 Since both netinet/ and netinet6/ call into netipsec/ and netpfil/,
the protocol specific mbuf flags are shared between them.

- Move all M_FOO definitions into a single place: netinet/in6.h, to
  avoid future clashes.
- Resolve clash between M_DECRYPTED and M_SKIP_FIREWALL which resulted
  in a failure of operation of IPSEC and packet filters.

Thanks to Nicolas and Georgios for all the hard work on bisecting,
testing and finally finding the root of the problem.

PR:			kern/186755
PR:			kern/185876
In collaboration with:	Georgios Amanakis <gamanakis gmail.com>
In collaboration with:	Nicolas DEFFAYET <nicolas-ml deffayet.com>
Sponsored by:		Nginx, Inc.
2014-03-12 14:29:08 +00:00
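The kind of clash fixed by the commit above (aa69c61235) can be illustrated with a small sketch. All bit values and the `SK_` prefix below are hypothetical and chosen only for this example; they are not the values in the FreeBSD headers. The idea is that when every protocol-specific alias is defined in one place, a compile-time check can catch two aliases that share a bit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical generic protocol flag bits, for illustration only. */
#define SK_M_PROTO1		0x00001000u
#define SK_M_PROTO2		0x00002000u
#define SK_M_PROTO3		0x00004000u

/* All protocol-specific aliases live side by side in one place. */
#define SK_M_DECRYPTED		SK_M_PROTO1
#define SK_M_SKIP_FIREWALL	SK_M_PROTO3

/* Fails the build if the two aliases ever end up on the same bit. */
static_assert((SK_M_DECRYPTED & SK_M_SKIP_FIREWALL) == 0,
    "protocol-specific mbuf flags must not overlap");

/* Returns nonzero when 'flag' is set in 'flags'. */
int
sk_flag_is_set(uint32_t flags, uint32_t flag)
{
	return ((flags & flag) != 0);
}
```

With overlapping definitions, marking a packet as decrypted would also make it look as if it should skip the firewall, which matches the failure mode the commit describes.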
Gleb Smirnoff e3a7aa6f56 - Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
  removes another cache-thrashing ++ from the packet forwarding path.
- Create zinit/fini methods for the rtentry UMA zone, and initialize the
  mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with:		melifaro
Sponsored by:		Netflix
Sponsored by:		Nginx, Inc.
2014-03-05 01:17:47 +00:00
John Baldwin 5b26ea5df3 Remove more constants related to static sysctl nodes. The MAXID constants
were primarily used to size the sysctl name list macros that were removed
in r254295.  A few other constants either did not have an associated
sysctl node, or the associated node used OID_AUTO instead.

PR:		ports/184525 (exp-run)
2014-02-25 18:44:33 +00:00
Craig Rodrigues 47a79fadc6 Remove KASSERT from in6p_lookup_mcast_ifp().
When the devel/jenkins port, version 1.551, was started,
the kernel would panic if INVARIANTS was enabled in the kernel config.

Suggested by: bms
2014-02-23 01:27:22 +00:00
Gleb Smirnoff 0ff96b4f55 o Remove at compile time the HASH_ALL code, which was never
  tested and is unfinished. However, I've tested my version, and
  it works okay. As before, it is unfinished: timeouts aren't
  driven by TCP session state. To enable the HASH_ALL mode,
  one needs in kernel config:

	options FLOWTABLE_HASH_ALL

o Reduce the alignment on flentry to 64 bytes. Without
  the FLOWTABLE_HASH_ALL option, flows consume half as
  much memory.
o API to ip_output()/ip6_output() got even thinner: a one-liner.
o Remove unused unions. Simply use fle->f_key[].
o Merge all IPv4 code into flowtable_lookup_ipv4(), and do the same
  for flowtable_lookup_ipv6(). Stop copying data to on-stack
  sockaddr structures; simply use key[] on the stack.
o Move code from flowtable_lookup_common() that actually works
  on insertion into flowtable_insert().

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-17 11:50:56 +00:00
Alexander V. Chernikov f6990c4e3e Further simplify nd6_output_lle.
Currently we have 3 usage patterns:
1) nd6_output (most traffic flow, no lle supplied, lle RLOCK sufficient)
2) corner cases for output (no lle, STALE lle, so on). lle WLOCK needed.
3) nd* internal machinery (WLOCK'ed lle provided, perform packet queueing).

We separate case 1 and implement it inside its only customer - nd6_output.
This leads to some code duplication (especially SEND stuff, which should be
hooked to output in a different way), but simplifies locking and control
flow logic for nd6_output_lle.

Reviewed by:	ae
MFC after:	3 weeks
Sponsored by:	Yandex LLC
2014-02-13 19:09:04 +00:00
Andrey V. Elsukov e4c77ca0c0 Drop packets to multicast address whose scop field contains the
reserved value 0.

MFC after:	1 week
Sponsored by:	Yandex LLC
2014-02-13 14:10:44 +00:00
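The check described in the commit above (e4c77ca0c0) can be sketched as follows. In an IPv6 multicast address the first byte is 0xff and the low four bits of the second byte carry the scop field. The helper name `mcast_scope_is_reserved0` and the raw 16-byte array argument are assumptions for this example, not the kernel's interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true when addr is an IPv6 multicast
 * address whose scop field holds the reserved value 0.
 */
bool
mcast_scope_is_reserved0(const uint8_t addr[16])
{
	assert(addr != NULL);

	/* Multicast addresses start with the byte 0xff. */
	if (addr[0] != 0xff)
		return (false);

	/* High nibble of byte 1 is flgs, low nibble is scop. */
	return ((addr[1] & 0x0f) == 0);
}
```

A receive path built on such a helper would drop the packet when the destination address makes it return true.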
Christian Brueffer d37872314f Only count table lookups when we're actually processing packets.
PR:		183462
Submitted by:	Sven-Thorsten Dietrich <thebigcorporation at gmail.com>
Reviewed by:	bms
MFC after:	1 month
2014-02-10 14:47:51 +00:00
Christian Brueffer 1b55364ed9 For IPv6, return the same error code as IPv4 when mrouter is not initialized.
PR:		178472
Submitted by:	Sven-Thorsten Dietrich <sven at vyatta.com>
Reviewed by:	bms
2014-02-10 14:36:51 +00:00
Alexander V. Chernikov 9dffa6a3f3 Simplify nd6_output_lle:
* Check ND6_IFF_IFDISABLED before acquiring any locks
* Assume m is always non-NULL
* remove 'bad' case not used anymore
* Simplify if_output conditional

MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-02-10 12:52:33 +00:00
Gleb Smirnoff 5d6d7e756b o Revamp API between flowtable and netinet, netinet6.
  - ip_output() and ip6_output() simply call flowtable_lookup(),
    passing mbuf and address family. That's the only code under
    #ifdef FLOWTABLE in the protocols code now.
o Revamp statistics gathering and export.
  - Remove hand made pcpu stats, and utilize counter(9).
  - Snapshot of statistics is available via 'netstat -rs'.
  - All sysctls are moved into net.flowtable namespace, since
    spreading them over net.inet isn't correct.
o Properly separate at compile time INET and INET6 parts.
o General cleanup.
  - Remove chain of multiple flowtables. We simply have one for
    IPv4 and one for IPv6.
  - Flowtables are allocated in flowtable.c, symbols are static.
  - With proper argument to SYSINIT() we no longer need flowtable_ready.
  - Hash salt doesn't need to be per-VNET.
  - Removed rudimentary debugging, which is quite useless in the dtrace era.

The runtime behavior of flowtable shouldn't be changed by this commit.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-07 15:18:23 +00:00
Andrey V. Elsukov 74a976fffd Unlock entry before retry.
Submitted by:	melifaro
MFC after:	1 week
2014-02-07 10:58:46 +00:00
Andrey V. Elsukov 51eecdc35a Take exclusive lock only when lle isn't NULL. We don't need write access
to lle in most cases.

MFC after:	1 week
Sponsored by:	Yandex LLC
2014-02-02 07:28:04 +00:00
Alexander V. Chernikov f6b84910bb Further rework netinet6 address handling code:
* Set ia address/mask values BEFORE attaching to address lists.
Inet6 address assignment is not atomic, so the simplest way to
do this atomically is to fill in ia before attach.
* Validate irfa->ia_addr field before use (we permit ANY sockaddr in old code).
* Do some renamings:
  in6_ifinit -> in6_notify_ifa (interaction with other subsystems is here)
  in6_setup_ifa -> in6_broadcast_ifa (LLE/Multicast/DaD code)
  in6_ifaddloop -> nd6_add_ifa_lle
  in6_ifremloop -> nd6_rem_ifa_lle
* Split working with LLE and route announce code for last two.
Add temporary in6_newaddrmsg() function to mimic current rtsock behaviour.
* Call device SIOCSIFADDR handler IFF we're adding first address.
In IPv4 we have to call it on every address change since ARP record
is installed by arp_ifinit() which is called by given handler.
The IPv6 stack, by contrast, is itself responsible for calling nd6_add_ifa_lle(), so
there is no reason to call SIOCSIFADDR often.
2014-01-19 16:07:27 +00:00
Alexander V. Chernikov 0c5d4bde90 Use in6_localip() instead of hand-rolled cycle.
MFC after:	2 weeks
2014-01-18 20:54:55 +00:00
Alexander V. Chernikov 9080e7d023 Add in6_prepare_ifra() function to ease preparing in-kernel IPv6
address requests.

MFC after:	2 weeks
2014-01-18 20:32:59 +00:00
Alexander V. Chernikov b6a16fc853 Do some style(9) not done in r260851 to improve readability.
MFC after:	2 weeks
2014-01-18 15:57:43 +00:00
Alexander V. Chernikov 60d7c722a5 Split in6_update_ifa() into smaller pieces leaving functionality intact.
Discussed with:	ae
MFC after:	2 weeks
2014-01-18 15:52:52 +00:00
Andrey V. Elsukov e74966f60b Mechanically replace direct access to if_xname with the if_name() macro. 2014-01-10 12:33:28 +00:00
John-Mark Gurney f2effe745c revert part of r260485 which changes how part of the header gets
included..  netstat uses -DKERNEL=1 to get these parts and breaks the
build w/o it...

melifaro@ says that ae@ is probably asleep, and the PR doesn't have
this part of the patch...  Probably a local change got in by accident..

PR:		185148
Pointy hat to:	ae@
2014-01-09 22:41:18 +00:00
Andrey V. Elsukov 78415d1082 Remove extra nesting from X_ip6_mforward() function.
Also remove disabled definitions from ip6_mroute.h.

PR:		185148
Sponsored by:	Yandex LLC
2014-01-09 15:38:28 +00:00
Andrey V. Elsukov 0a6b0ffa54 Add MRT6_DLOG() macro for debugging.
Reduce number of MRT6DEBUG ifdefs and fix some broken format strings.

MFC after:	1 week
Sponsored by:	Yandex LLC
2014-01-09 14:58:06 +00:00
Alexander V. Chernikov 1dc8f6a82c Introduce IN6_MASK_ADDR() macro to unify various hand-rolled code
to do IPv6 addr & mask in different places.

MFC after:	2 weeks
2014-01-08 22:13:32 +00:00
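The "addr & mask" operation that the commit above (1dc8f6a82c) unifies can be sketched in portable C. The struct `sk_in6_addr` and the function `sk_in6_mask_addr` are hypothetical stand-ins invented for this example; the actual IN6_MASK_ADDR() macro operates on the kernel's own address type and may be written differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an IPv6 address: 16 raw bytes. */
struct sk_in6_addr {
	uint8_t bytes[16];
};

/* Apply 'mask' to 'addr' in place, one byte at a time. */
void
sk_in6_mask_addr(struct sk_in6_addr *addr, const struct sk_in6_addr *mask)
{
	size_t i;

	assert(addr != NULL && mask != NULL);

	for (i = 0; i < sizeof(addr->bytes); i++)
		addr->bytes[i] &= mask->bytes[i];
}
```

Applying a /64 mask to a full address this way leaves only the prefix bytes set, which is the usual reason for such a helper.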
Andrey V. Elsukov b88aef1dcf Use pointer to struct sockaddr_in6 in lla_lookup() call.
This prevents triggering the KASSERT in in6_lltable_lookup.
2014-01-03 02:40:56 +00:00
Andrey V. Elsukov e2d14d9317 Add IF_AFDATA_WLOCK_ASSERT() in case lla_lookup() is called with
LLE_CREATE flag.

MFC after:	1 week
2014-01-03 02:32:05 +00:00
Andrey V. Elsukov ea0c377602 lla_lookup() does modification only when LLE_CREATE is specified.
Thus we can use IF_AFDATA_RLOCK() instead of IF_AFDATA_LOCK() when doing
lla_lookup() without LLE_CREATE flag.

Reviewed by:	glebius, adrian
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-01-02 08:40:37 +00:00
Adrian Chadd c445d2520d Use an RLOCK here instead of an RWLOCK - matching all the other calls
to lla_lookup().

This drastically reduces the very high lock contention when doing parallel
TCP throughput tests (> 1024 sockets) with IPv6.

Tested:

* parallel IPv6 TCP bulk data exchange, 8192 sockets

MFC after:	1 week
Sponsored by:	Netflix, Inc.
2014-01-01 00:56:26 +00:00
Bjoern A. Zeeb 010c2b8192 Correct warnings comparing unsigned variables < 0 constantly reported
while building kernels.  All instances removed are indeed unsigned so
the expressions could not be true.

MFC after:	1 week
2013-12-25 20:08:44 +00:00
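Why the comparisons removed in the commit above (010c2b8192) could never be true can be shown with a short sketch. Once a negative value has been converted to an unsigned type it has already wrapped to a large positive value, so any sign check has to happen while the value is still signed. The helper `length_is_valid` is hypothetical and exists only for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical validation helper: reject negative or oversized values
 * before converting to unsigned. Returns 1 and stores the converted
 * value on success, 0 on failure.
 */
int
length_is_valid(long long requested, unsigned int *out)
{
	assert(out != NULL);

	/* The sign test is only meaningful on the signed value. */
	if (requested < 0 || requested > (long long)UINT_MAX)
		return (0);

	*out = (unsigned int)requested;
	return (1);
}
```

By contrast, writing `len < 0` for an `unsigned int len` is a test the compiler can prove false, which is what produces the warnings the commit message mentions.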
Dimitry Andric 6c5a340e56 In sys/netinet6/in6_mcast.c, in6m_is_ifp_detached() is only used
whenever KTR is defined, so put it between #ifdef KTR guards.  This
avoids a warning about an unused function if KTR is not enabled.

MFC after:	3 days
2013-12-24 20:30:13 +00:00
Andrey V. Elsukov 569aad57d2 Free mbuf in case of error.
MFC after:	1 week
2013-12-17 10:53:17 +00:00
Attilio Rao 54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
  option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invocation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency on opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows a new bug: DTRACE-derived support for debug
in sfxge is broken and was never really tested.  As it did not
include opt_kdtrace.h correctly before, it was never enabled, so it
stayed broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Andrey V. Elsukov ee674966f4 Fix panic with RADIX_MPATH, when RTFREE_LOCKED() called for already
unlocked route. Use in6_rtalloc() instead of in6_rtalloc1. This helps
simplify the code and remove several now unused variables.

PR:		156283
MFC after:	2 weeks
2013-11-11 12:49:00 +00:00
Gleb Smirnoff 555036b5f6 Remove never used ioctls that originate from KAME. The proof
of their zero usage was exp-run from misc/183538.
2013-11-11 05:39:42 +00:00
Michael Tuexen b54ddf225f Changes from upstream to improve compilation when INET or INET6
or none of them is defined.

MFC after: 3 days
2013-11-02 20:12:19 +00:00
Gleb Smirnoff c3322cb91c Include necessary headers that now are available due to pollution
via if_var.h.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-28 07:29:16 +00:00
Gleb Smirnoff eedc7fd9e8 Provide includes that are needed in these files, and before were read
in implicitly via if.h -> if_var.h pollution.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 18:18:50 +00:00
Gleb Smirnoff 76039bc84f The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
for this event by adding if_var.h to files that do need it. Also, include
all headers that are now included only due to implicit pollution via if_var.h.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
Andrey V. Elsukov baa09f1891 Initialize inc_fibnum for properly handling ICMP6_PACKET_TOO_BIG errors
in multifib environment.

PR:		183265
MFC after:	1 week
2013-10-25 01:02:25 +00:00
Gleb Smirnoff 7caf4ab7ac - Utilize counter(9) to accumulate statistics on interface addresses. Add
  four counters to struct ifaddr. This kills '+=' on variables shared
  between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
  right now in place and do IN_IFADDR_RUNLOCK(). This removes atomic(9)
  for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
  rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentally fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
  took if_data not from the ifaddr, but from ifaddr's ifnet. [2]

Submitted by:	melifaro [1], pluknet[2]
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 11:37:57 +00:00
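The per-CPU counting idea behind the counter(9) conversion in the commit above (7caf4ab7ac) can be sketched in userland C. This is only a conceptual model under stated assumptions: the names `sk_counter`, `sk_counter_add`, `sk_counter_fetch`, and the fixed `SK_NCPU` are hypothetical, and the real counter(9) implementation manages per-CPU storage itself rather than taking a CPU index from the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed CPU count for the sketch. */
#define SK_NCPU	4

/* One slot per CPU, so an update never touches another CPU's slot. */
struct sk_counter {
	uint64_t slot[SK_NCPU];
};

/* Add 'inc' to the slot owned by 'cpu'. */
void
sk_counter_add(struct sk_counter *c, unsigned int cpu, uint64_t inc)
{
	assert(c != NULL && cpu < SK_NCPU);
	c->slot[cpu] += inc;
}

/* Sum all slots to produce the value reported to readers. */
uint64_t
sk_counter_fetch(const struct sk_counter *c)
{
	uint64_t sum = 0;
	size_t i;

	assert(c != NULL);
	for (i = 0; i < SK_NCPU; i++)
		sum += c->slot[i];
	return (sum);
}
```

Updates stay cheap because each CPU writes only its own slot, while the less frequent read, such as filling if_data for a sysctl, pays the cost of summing.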
Gleb Smirnoff 4675896098 Remove ifa_init() and provide ifa_alloc() that will allocate and setup
struct ifaddr internally.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 10:31:42 +00:00
Gleb Smirnoff 6ed910fabe Hide 'struct ifaddr' definition from userland. Two tools left that use it,
namely ipftest(1) and ifmcstat(1). These sniff structure definition using
_WANT_IFADDR define.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 10:19:24 +00:00
Gleb Smirnoff 3fa98cf9ac Remove unsigned < 0 check. 2013-10-15 10:12:19 +00:00
Gleb Smirnoff ca695e0807 Remove useless check of ia6 against NULL, right after dereferencing it. 2013-10-15 10:11:23 +00:00