Commit graph

2347 commits

Author SHA1 Message Date
Dimitry Andric 50207b2de9 Adjust function definition in nd6.c to avoid clang 15 warnings
With clang 15, the following -Werror warning is produced:

    sys/netinet6/nd6.c:247:12: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    nd6_destroy()
               ^
                void

This is nd6_destroy() is declared with a (void) argument list, but
defined with an empty argument list. Make the definition match the
declaration.

MFC after:	3 days
2022-07-26 21:25:09 +02:00
Alexander V. Chernikov 50fa27e795 netinet6: fix interface handling for loopback traffic
Currently, processing of IPv6 local traffic is partially broken:
 link-local connection fails and global unicast connect() takes
 3 seconds to complete.
This happens due to the combination of multiple factors.
IPv6 code passes original interface "origifp" when passing
traffic via loopack to retain the scope that is mandatory for the
correct hadling of link-local traffic. First problem is that the logic
of passing source interface is not working correcly for TCP connections,
resulting in passing "origifp" on the first 2 connection attempts and
lo0 on the subsequent ones. Second problem is that source address
validation logic skips its checks iff the source interface is loopback,
which doesn't cover "origifp" case.
More detailed description is available at https://reviews.freebsd.org/D35732

Fix the first problem by untangling&simplifying ifp/origifp logic.
Fix the second problem by switching source address validation check to
using M_LOOP mbuf flag instead of interface type.

PR:		265089
Reviewed by:	ae, bz(previous version)
Differential Revision:	https://reviews.freebsd.org/D35732
MFC after:	2 weeks
2022-07-10 12:47:47 +00:00
Alexander V. Chernikov 2756774c3f netinet6: simplify selectroute()
Effectively selectroute() addresses two different cases:
 providing interface info for multicast destinations and providing
 nexthop data for unicast ones. Current implementation intertwines
 handling of both cases, especially in the error handling part.
Factor out all route lookup logic in a separate function,
 lookup_route() to simplify the code.
Ensure consistent KPI: no error means *retifp is set and otherwise.

Differential Revision: https://reviews.freebsd.org/D35711
MFC after:	2 weeks
2022-07-08 11:27:16 +00:00
Alexander V. Chernikov 81a235ecde netinet6: factor out cached route lookups from selectroute().
Currently selectroute() contains two nearly-identical versions of
 the route lookup logic - one for original destination and another
for the case when IPV6_NEXTHOP option was set on the socket.

Factor out handling these route lookups in a separation function to
 improve readability.
This change also fixes handling of link-local IPV6_NEXTHOPs.

Differential Revision: https://reviews.freebsd.org/D35710
MFC after:	2 weeks
2022-07-08 08:58:55 +00:00
Alexander V. Chernikov 0ed7253785 netinet6: perform out-of-bounds check for loX multicast statistics
Currently, some per-mbuf multicast statistics is stored in
 the per-interface ip6stat.ip6s_m2m[] array of size 32 (IP6S_M2MMAX).
Check that loopback ifindex falls within 0.. IP6S_M2MMAX-1 range to
 avoid silent data corruption. The latter cat happen with large
 number of VNETs.

Reviewed by:	glebius
Differential Revision: https://reviews.freebsd.org/D35715
MFC after:	2 weeks
2022-07-05 11:44:30 +00:00
Mark Johnston a14465e1b9 rip6: Fix a lock order reversal in rip6_bind()
See also commit 71a1539e37.

Reported by:	syzbot+9b461b6a07a83cc10daa@syzkaller.appspotmail.com
Reported by:	syzbot+b6ce0aec16f5fdab3282@syzkaller.appspotmail.com
Reviewed by:	glebius
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35472
2022-06-14 12:00:59 -04:00
Kristof Provost a7f20faa07 netinet6: fix panic on kldunload pfsync
Commit d6cd20cc5 ("netinet6: fix ndp proxying") caused us to panic when
unloading pfsync:

	Fatal trap 12: page fault while in kernel mode
	cpuid = 19; apic id = 38
	fault virtual address	= 0x20
	fault code		= supervisor read data, page not present
	instruction pointer	= 0x20:0xffffffff80dfe7f4
	stack pointer	        = 0x28:0xfffffe015d4f8ac0
	frame pointer	        = 0x28:0xfffffe015d4f8ae0
	code segment		= base 0x0, limit 0xfffff, type 0x1b
				= DPL 0, pres 1, long 1, def32 0, gran 1
	processor eflags	= interrupt enabled, resume, IOPL = 0
	current process		= 5477 (kldunload)
	trap number		= 12
	panic: page fault
	cpuid = 19
	time = 1654023100
	KDB: stack backtrace:
	db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe015d4f8880
	vpanic() at vpanic+0x17f/frame 0xfffffe015d4f88d0
	panic() at panic+0x43/frame 0xfffffe015d4f8930
	trap_fatal() at trap_fatal+0x387/frame 0xfffffe015d4f8990
	trap_pfault() at trap_pfault+0xab/frame 0xfffffe015d4f89f0
	calltrap() at calltrap+0x8/frame 0xfffffe015d4f89f0
	--- trap 0xc, rip = 0xffffffff80dfe7f4, rsp = 0xfffffe015d4f8ac0, rbp = 0xfffffe015d4f8ae0 ---
	in6_purge_proxy_ndp() at in6_purge_proxy_ndp+0x14/frame 0xfffffe015d4f8ae0
	if_purgeaddrs() at if_purgeaddrs+0x24/frame 0xfffffe015d4f8b90
	if_detach_internal() at if_detach_internal+0x1c2/frame 0xfffffe015d4f8bf0
	if_detach() at if_detach+0x71/frame 0xfffffe015d4f8c20
	pfsync_clone_destroy() at pfsync_clone_destroy+0x1dd/frame 0xfffffe015d4f8c70
	if_clone_destroyif() at if_clone_destroyif+0x239/frame 0xfffffe015d4f8cc0
	if_clone_detach() at if_clone_detach+0xc8/frame 0xfffffe015d4f8cf0
	vnet_pfsync_uninit() at vnet_pfsync_uninit+0xda/frame 0xfffffe015d4f8d10
	vnet_deregister_sysuninit() at vnet_deregister_sysuninit+0x85/frame 0xfffffe015d4f8d40
	linker_file_sysuninit() at linker_file_sysuninit+0x147/frame 0xfffffe015d4f8d70
	linker_file_unload() at linker_file_unload+0x269/frame 0xfffffe015d4f8db0
	kern_kldunload() at kern_kldunload+0x18d/frame 0xfffffe015d4f8e00
	amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe015d4f8f30
	fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe015d4f8f30
	--- syscall (444, FreeBSD ELF64, sys_kldunloadf), rip = 0x1601eab28cba, rsp = 0x1601e9c363f8, rbp = 0x1601e9c36c50 ---

This happens because ifp->if_afdata[AF_INET6] is NULL. Check for this,
just as we already do in a few other places.
See also c139b3c19b ("arp/nd: Cope with late calls to
iflladdr_event").

Reviewed by:	melifaro
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D35374
2022-06-01 09:26:15 +02:00
Arseny Smalyuk d18b4bec98 netinet6: Fix mbuf leak in NDP
Mbufs leak when manually removing incomplete NDP records with pending packet via ndp -d.
It happens because lltable_drop_entry_queue() rely on `la_numheld`
counter when dropping NDP entries (lles). It turned out NDP code never
increased `la_numheld`, so the actual free never happened.

Fix the issue by introducing unified lltable_append_entry_queue(),
common for both ARP and NDP code, properly addressing packet queue
maintenance.

Reviewed By: melifaro
Differential Revision: https://reviews.freebsd.org/D35365
MFC after:	2 weeks
2022-05-31 21:06:14 +00:00
KUROSAWA Takahiro d6cd20cc5c netinet6: fix ndp proxying
We could insert proxy NDP entries by the ndp command, but the host
with proxy ndp entries had not responded to Neighbor Solicitations.
Change the following points for proxy NDP to work as expected:
* join solicited-node multicast addresses for proxy NDP entries
  in order to receive Neighbor Solicitations.
* look up proxy NDP entries not on the routing table but on the
  link-level address table when receiving Neighbor Solicitations.

Reviewed By: melifaro
Differential Revision: https://reviews.freebsd.org/D35307
MFC after:	2 weeks
2022-05-30 10:53:33 +00:00
KUROSAWA Takahiro 77001f9b6d lltable: introduce the llt_post_resolved callback
In order to decrease ifdef INET/INET6s in the lltable implementation,
introduce the llt_post_resolved callback and implement protocol-dependent
code in the protocol-dependent part.

Reviewed By: melifaro
Differential Revision: https://reviews.freebsd.org/D35322
MFC after:	2 weeks
2022-05-30 10:53:33 +00:00
Dmitry Chagin 31d1b816fe sysent: Get rid of bogus sys/sysent.h include.
Where appropriate hide sysent.h under proper condition.

MFC after:	2 weeks
2022-05-28 20:52:17 +03:00
Gleb Smirnoff 6890b58814 sockbuf: improve sbcreatecontrol()
o Constify memory pointer.  Make length unsigned.
o Make it never fail with M_WAITOK and assert that length is sane.
2022-05-17 10:10:42 -07:00
Gleb Smirnoff b46667c63e sockbuf: merge two versions of sbcreatecontrol() into one
No functional change.
2022-05-17 10:10:42 -07:00
Gleb Smirnoff 808b7d80e0 mbuf: remove PH_vt alias for mbuf packet header persistent shared data
Mechanical sed change s/PH_vt\.vt_nrecs/vt_nrecs/g
2022-05-13 13:32:43 -07:00
Kristof Provost 797b94504f udp6: allow udp_tun_func_t() to indicate it did not eat the packet
Implement the same filter feature we implemented for UDP over IPv6 in
742e7210d. This was missed in that commit.

Pointed out by:	markj
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-04-22 16:55:23 +02:00
Mark Johnston 5d691ab4f0 mld6: Ensure that mld_domifattach() always succeeds
mld_domifattach() does a memory allocation under the global MLD mutex
and so can fail, but no error handling prevents a null pointer
dereference in this case.  The mutex is only needed when updating the
global softc list; the allocation and static initialization of the softc
does not require this mutex.  So, reduce the scope of the mutex and use
M_WAITOK for the allocation.

PR:		261457
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34943
2022-04-21 13:23:59 -04:00
John Baldwin a98bb75f80 netinet6: Use __diagused for variables only used in KASSERT(). 2022-04-13 16:08:19 -07:00
Kristof Provost 742e7210d0 udp: allow udp_tun_func_t() to indicate it did not eat the packet
Allow udp tunnel functions to indicate they have not taken ownership of
the packet, and that normal UDP processing should continue.

This is especially useful for scenarios where the kernel has taken
ownership of a socket that was originally created by userspace. It
allows the tunnel function to pass through certain packets for userspace
processing.

The primary user of this is if_ovpn, when it receives messages from
unknown peers (which might be a new client).

Reviewed by:	tuexen
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D34883
2022-04-12 10:04:59 +02:00
Andrey V. Elsukov 7d98cc096b Fix ipfw fwd that doesn't work in some cases
For IPv4 use dst pointer as destination address in fib4_lookup().
It keeps destination address from IPv4 header and can be changed
when PACKET_TAG_IPFORWARD tag was set by packet filter.

For IPv6 override destination address with address from dst_sa.sin6_addr,
that was set from PACKET_TAG_IPFORWARD tag.

Reviewed by:	eugen
MFC after:	1 week
PR:		256828, 261697, 255705
Differential Revision: https://reviews.freebsd.org/D34732
2022-04-11 14:16:43 +03:00
Mark Johnston 990a6d18b0 net: Fix memory leaks in lltable_calc_llheader() error paths
Also convert raw epoch_call() calls to lltable_free_entry() calls, no
functional change intended.  There's no need to asynchronously free the
LLEs in that case to begin with, but we might as well use the lltable
interfaces consistently.

Noticed by code inspection; I believe lltable_calc_llheader() failures
do not generally happen in practice.

Reviewed by:	bz
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34832
2022-04-08 11:47:25 -04:00
Mark Johnston dd91d84486 net: Fix LLE lock leaks
Historically, lltable_try_set_entry_addr() would release the LLE lock
upon failure.  After some refactoring, it no longer does so, but
consumers were not adjusted accordingly.

Also fix a leak that can occur if lltable_calc_llheader() fails in the
ARP code, but I suspect that such a failure can only occur due to a code
bug.

Reviewed by:	bz, melifaro
Reported by:	pho
Fixes:		0b79b007eb ("[lltable] Restructure nd6 code.")
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34831
2022-04-08 11:46:19 -04:00
John Baldwin bf73b06771 ip6_mroute: Mark a variable only used in a debug trace as unused. 2022-04-06 16:45:29 -07:00
John Baldwin 8b6ccfb6c7 multicast code: Quiet unused warnings for variables used for KTR traces.
For nallow and nblock, move the variables under #ifdef KTR.

For return values from functions logged in KTR traces, mark the
variables as __unused rather than having to #ifdef the assignment of
the function return value.
2022-04-06 16:45:28 -07:00
Warner Losh c7761ca93e pim6_input: eliminate write only variable rc
Sponsored by:		Netflix
2022-04-04 22:30:52 -06:00
Gordon Bergling c55ecce1c1 netinet6: Fix a typo in a source code comment
- s/maping/mapping/

MFC after:	3 days
2022-03-28 19:32:10 +02:00
Andrew Gallatin 9ba117960e Fix a memory leak when ip_output_send() returns EAGAIN due to send tag issues
When ip_output_send() returns EAGAIN due to issues with send tags (route
change, lagg failover, etc), it must free the mbuf. This is because
ip_output_send() was written as a wrapper/replacement for a direct
call to  if_output(), and the contract with if_output() has
historically been that it owns the mbufs once called. When
ip_output_send() failed to free mbufs, it violated this assumption
and lead to leaked mbufs.

This was noticed when using NIC TLS in combination with hardware
rate-limited connections. When seeing lots of NIC output drops
triggered ratelimit send tag changes, we noticed we were leaking
ktls_sessions, send tags and mbufs. This was due ip_output_send()
leaking mbufs which held references to ktls_sessions, which in
turn held references to send tags.

Many thanks to jbh, rrs, hselasky and markj for their help in
debugging this.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34054
Reviewed by: hselasky, jhb, rrs
MFC after: 2 weeks
2022-01-27 10:34:34 -05:00
Thomas Steen Rasmussen bc6abdd97e nd6: use CARP link level address in SLLAO for NS sent out
When sending an NS, check if we are using a IPv6 CARP address
and if we do, then put proper CARP link level address into
ND_OPT_SOURCE_LINKADDR option and also put PACKET_TAG_CARP tag
on the packet.  The latter will enforce CARP link level address
at the data link layer too, which might be necessary for broken
implementations.
The code really follows what NA sending code has been doing since
introduction of carp(4).  While here, bring to style(9) the whole
block of code.

PR:			193280
Differential revision:	https://reviews.freebsd.org/D33858
2022-01-24 21:02:47 -08:00
Gleb Smirnoff 644ca0846d domains: make domain_init() initialize only global state
Now that each module handles its global and VNET initialization
itself, there is no VNET related stuff left to do in domain_init().

Differential revision:	https://reviews.freebsd.org/D33541
2022-01-03 10:15:22 -08:00
Gleb Smirnoff 89128ff3e4 protocols: init with standard SYSINIT(9) or VNET_SYSINIT
The historical BSD network stack loop that rolls over domains and
over protocols has no advantages over more modern SYSINIT(9).
While doing the sweep, split global and per-VNET initializers.

Getting rid of pr_init allows to achieve several things:
o Get rid of ifdef's that protect against double foo_init() when
  both INET and INET6 are compiled in.
o Isolate initializers statically to the module they init.
o Makes code easier to understand and maintain.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D33537
2022-01-03 10:15:21 -08:00
Michael Tuexen 657fcf5807 udp6: remove assignments not being used
MFC after:	3 days
Sponsored by:	Netflix, Inc.
2022-01-01 19:25:47 +01:00
Michael Tuexen 2de2ae331b sctp: improve sctp_pathmtu_adjustment()
Allow the resending of DATA chunks to be controlled by the caller,
which allows retiring sctp_mtu_size_reset() in a separate commit.
Also improve the computaion of the overhead and use 32-bit integers
consistently.
Thanks to Timo Voelker for pointing me to the code.

MFC after:	3 days
2021-12-30 15:16:05 +01:00
Alexander V. Chernikov ff3a85d324 [lltable] Add per-family lltable getters.
Introduce a new function, lltable_get(), to retrieve lltable pointer
 for the specified interface and family.
Use it to avoid all-iftable list traversal when adding or deleting
 ARP/ND records.

Differential Revision: https://reviews.freebsd.org/D33660
MFC after:	2 weeks
2021-12-29 20:57:15 +00:00
Gleb Smirnoff a057769205 in_pcb: use jenkins hash over the entire IPv6 (or IPv4) address
The intent is to provide more entropy than can be provided
by just the 32-bits of the IPv6 address which overlaps with
6to4 tunnels.  This is needed to mitigate potential algorithmic
complexity attacks from attackers who can control large
numbers of IPv6 addresses.

Together with:		gallatin
Reviewed by:		dwmalone, rscheff
Differential revision:	https://reviews.freebsd.org/D33254
2021-12-26 10:47:28 -08:00
Gleb Smirnoff eb8dcdeac2 jail: network epoch protection for IP address lists
Now struct prison has two pointers (IPv4 and IPv6) of struct
prison_ip type.  Each points into epoch context, address count
and variable size array of addresses.  These structures are
freed with network epoch deferred free and are not edited in
place, instead a new structure is allocated and set.

While here, the change also generalizes a lot (but not enough)
of IPv4 and IPv6 processing. E.g. address family agnostic helpers
for kern_jail_set() are provided, that reduce v4-v6 copy-paste.

The fast-path prison_check_ip[46]_locked() is also generalized
into prison_ip_check() that can be executed with network epoch
protection only.

Reviewed by:		jamie
Differential revision:	https://reviews.freebsd.org/D33339
2021-12-26 10:45:50 -08:00
Mateusz Guzik 71a1539e37 inet6: fix a LOR between rip and rawinp
Running sys/netpfil/pf/fragmentation v6 results in:

lock order reversal:
 1st 0xfffffe00050429a8 rip (rip, sleep mutex) @ /usr/src/sys/netinet6/raw_ip6.c:803
 2nd 0xfffff8009491e1d0 rawinp (rawinp, rw) @ /usr/src/sys/netinet6/raw_ip6.c:804
lock order rawinp -> rip established at:
0xffffffff8068e26a at witness_lock_order_add+0x28a
0xffffffff8068d087 at witness_checkorder+0x627
0xffffffff805a9f05 at __mtx_lock_flags+0x205
0xffffffff808102e4 at in_pcballoc+0x204
0xffffffff808d53c6 at rip6_attach+0x116
0xffffffff806dc4e8 at socreate+0x368
0xffffffff806eaedc at kern_socket+0xfc
0xffffffff806eadcd at sys_socket+0x2d
0xffffffff80abc774 at syscallenter+0x5c4
0xffffffff80abbeeb at amd64_syscall+0x1b
 0xffffffff80a8044b at fast_syscall_common+0xf8
lock order rip -> rawinp attempted at:
0xffffffff8068dc2a at witness_checkorder+0x11ca
0xffffffff805d1b7f at _rw_wlock_cookie+0x18f
0xffffffff808d596c at rip6_connect+0x19c
0xffffffff806e0842 at soconnectat+0x142
0xffffffff806ebe36 at kern_connectat+0x136
0xffffffff806ebcdf at sys_connect+0x4f
0xffffffff80abc774 at syscallenter+0x5c4
0xffffffff80abbeeb at amd64_syscall+0x1b
0xffffffff80a8044b at fast_syscall_common+0xf8

Reviewed by:	glebius
Fixes:	de2d47842e ("SMR protection for inpcbs")
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D33508
2021-12-19 14:43:04 +00:00
Kristof Provost 9f5432d5e5 netinet6: ip6_setpktopt() requires NET_EPOCH
ip6_setpktopt() can call ifnet_byindex() which requires epoch. Mark the
function as requiring NET_EPOCH, and ensure we enter it priot to calling
it.

Reported-by: syzbot+92526116441688fea8a3@syzkaller.appspotmail.com
Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D33462
2021-12-17 17:30:36 +01:00
Gleb Smirnoff 185e659c40 inpcb: use locked variant of prison_check_ip*()
The pcb lookup always happens in the network epoch and in SMR section.
We can't block on a mutex due to the latter.  Right now this patch opens
up a race.  But soon that will be addressed by D33339.

Reviewed by:		markj, jamie
Differential revision:	https://reviews.freebsd.org/D33340
Fixes:			de2d47842e
2021-12-14 09:38:52 -08:00
Gleb Smirnoff e3044071de in6p_set_multicast_if(): fix malloc(M_WAITOK) with epoch
Fixes:	d74b7baeb0
2021-12-06 14:33:23 -08:00
Gleb Smirnoff d74b7baeb0 ifnet_byindex() actually requires network epoch
Sweep over potentially unsafe calls to ifnet_byindex() and wrap them
in epoch.  Most of the code touched remains unsafe, as the returned
pointer is being used after epoch exit.  Mark that with a comment.

Validate the index argument inside the function, reducing argument
validation requirement from the callers and making V_if_index
private to if.c.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D33263
2021-12-06 09:32:31 -08:00
Cy Schubert db0ac6ded6 Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816"
This reverts commit 266f97b5e9, reversing
changes made to a10253cffe.

A mismerge of a merge to catch up to main resulted in files being
committed which should not have been.
2021-12-02 14:45:04 -08:00
Cy Schubert 266f97b5e9 wpa: Import wpa_supplicant/hostapd commit 14ab4a816
This is the November update to vendor/wpa committed upstream 2021-11-26.

MFC after:      1 month
2021-12-02 13:35:14 -08:00
Gleb Smirnoff de2d47842e SMR protection for inpcbs
With introduction of epoch(9) synchronization to network stack the
inpcb database became protected by the network epoch together with
static network data (interfaces, addresses, etc).  However, inpcb
aren't static in nature, they are created and destroyed all the
time, which creates some traffic on the epoch(9) garbage collector.

Fairly new feature of uma(9) - Safe Memory Reclamation allows to
safely free memory in page-sized batches, with virtually zero
overhead compared to uma_zfree().  However, unlike epoch(9), it
puts stricter requirement on the access to the protected memory,
needing the critical(9) section to access it.  Details:

- The database is already build on CK lists, thanks to epoch(9).
- For write access nothing is changed.
- For a lookup in the database SMR section is now required.
  Once the desired inpcb is found we need to transition from SMR
  section to r/w lock on the inpcb itself, with a check that inpcb
  isn't yet freed.  This requires some compexity, since SMR section
  itself is a critical(9) section.  The complexity is hidden from
  KPI users in inp_smr_lock().
- For a inpcb list traversal (a pcblist sysctl, or broadcast
  notification) also a new KPI is provided, that hides internals of
  the database - inp_next(struct inp_iterator *).

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33022
2021-12-02 10:48:48 -08:00
Gleb Smirnoff 565655f4e3 inpcb: reduce some aliased functions after removal of PCBGROUP.
Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33021
2021-12-02 10:48:48 -08:00
Gleb Smirnoff 93c67567e0 Remove "options PCBGROUP"
With upcoming changes to the inpcb synchronisation it is going to be
broken. Even its current status after the move of PCB synchronization
to the network epoch is very questionable.

This experimental feature was sponsored by Juniper but ended never to
be used in Juniper and doesn't exist in their source tree [sjg@, stevek@,
jtl@]. In the past (AFAIK, pre-epoch times) it was tried out at Netflix
[gallatin@, rrs@] with no positive result and at Yandex [ae@, melifaro@].

I'm up to resurrecting it back if there is any interest from anybody.

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33020
2021-12-02 10:48:48 -08:00
Gleb Smirnoff 1cec1c5831 Allow to compile RSS without PCBGROUP.
Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33019
2021-12-02 10:48:48 -08:00
Gordon Bergling 3cf59750eb netinet6: Fix a typo in a sysctl description
- remove a double 'a'

MFC after:	3 days
2021-11-30 07:24:44 +01:00
Mark Johnston 44775b163b netinet: Remove unneeded mb_unmapped_to_ext() calls
in_cksum_skip() now handles unmapped mbufs on platforms where they're
permitted.

Reviewed by:	glebius, jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33097
2021-11-24 13:31:16 -05:00
Kristof Provost 19dc644511 if_stf: add 6rd support
Implement IPv6 Rapid Deployment (RFC5969) on top of the existing 6to4
(RFC3056) if_stf code.

PR:		253328
Reviewed by:	hrs
Obtained from:	pfSense
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D33037
2021-11-20 19:29:01 +01:00
Gleb Smirnoff 3850d1837b in6_rmx: remove unnecessary TCP includes 2021-11-18 00:54:29 -08:00
Gleb Smirnoff 1817be481b Add net.inet6.ip6.source_address_validation
Drop packets arriving from the network that have our source IPv6
address.  If maliciously crafted they can create evil effects
like an RST exchange between two of our listening TCP ports.
Such packets just can't be legitimate.  Enable the tunable
by default.  Long time due for a modern Internet host.

Reviewed by:		melifaro, donner, kp
Differential revision:	https://reviews.freebsd.org/D32915
2021-11-12 09:01:40 -08:00
Gleb Smirnoff 9c89392f12 Add in_localip_fib(), in6_localip_fib().
Check if given address/FIB exists locally.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D32913
2021-11-12 08:59:42 -08:00
Gleb Smirnoff 3ea9a7cf7b blackhole(4): disable for locally originated TCP/UDP packets
In most cases blackholing for locally originated packets is undesired,
leads to different kind of lags and delays. Provide sysctls to enforce
it, e.g. for debugging purposes.

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D32718
2021-11-03 13:02:44 -07:00
Roy Marples 5c5340108e net: Allow binding of unspecified address without address existance
Previously in_pcbbind_setup returned EADDRNOTAVAIL for empty
V_in_ifaddrhead (i.e., no IPv4 addresses configured) and in6_pcbbind
did the same for empty V_in6_ifaddrhead (no IPv6 addresses).

An equivalent test has existed since 4.4-Lite.  It was presumably done
to avoid extra work (assuming the address isn't going to be found
later).

In normal system operation *_ifaddrhead will not be empty: they will
at least have the loopback address(es).  In practice no work will be
avoided.

Further, this case caused net/dhcpd to fail when run early in boot
before assignment of any addresses.  It should be possible to bind the
unspecified address even if no addresses have been configured yet, so
just remove the tests.

The now-removed "XXX broken" comments were added in 59562606b9,
which converted the ifaddr lists to TAILQs.  As far as I (emaste) can
tell the brokenness is the issue described above, not some aspect of
the TAILQ conversion.

PR:		253166
Reviewed by:	ae, bz, donner, emaste, glebius
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32563
2021-10-20 19:25:51 -04:00
Gleb Smirnoff 0f617ae48a Add in_pcb_var.h for KPIs that are private to in_pcb.c and in6_pcb.c. 2021-10-18 10:19:57 -07:00
Gleb Smirnoff 147f018a72 Move in6_pcbsetport() to in6_pcb.c
This function was originally carved out of in6_pcbbind(), which
is in in6_pcb.c. This function also uses KPI private to the PCB
database - in_pcb_lport().
2021-10-18 10:19:03 -07:00
Mark Johnston 2d5c48eccd sctp: Tighten up locking around sctp_aloc_assoc()
All callers of sctp_aloc_assoc() mark the PCB as connected after a
successful call (for one-to-one-style sockets).  In all cases this is
done without the PCB lock, so the PCB's flags can be corrupted.  We also
do not atomically check whether a one-to-one-style socket is a listening
socket, which violates various assumptions in solisten_proto().

We need to hold the PCB lock across all of sctp_aloc_assoc() to fix
this.  In order to do that without introducing lock order reversals, we
have to hold the global info lock as well.

So:
- Convert sctp_aloc_assoc() so that the inp and info locks are
  consistently held.  It returns with the association lock held, as
  before.
- Fix an apparent bug where we failed to remove an association from a
  global hash if sctp_add_remote_addr() fails.
- sctp_select_a_tag() is called when initializing an association, and it
  acquires the global info lock.  To avoid lock recursion, push locking
  into its callers.
- Introduce sctp_aloc_assoc_connected(), which atomically checks for a
  listening socket and sets SCTP_PCB_FLAGS_CONNECTED.

There is still one edge case in sctp_process_cookie_new() where we do
not update PCB/socket state correctly.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31908
2021-09-11 10:15:21 -04:00
Mark Johnston 353783964c ip6mrouter: Make the expiration callout MPSAFE
- Protect the `expire_upcalls` callout with the MFC6 mutex.  The callout
  handler needs this mutex anyway.
- Convert the MROUTER6 mutex to a sleepable sx lock.  It is only used
  when configuring the global v6 multicast routing socket, so is only
  used in system call paths where sleeping is safe.  This lets us drain
  the callout without having to drop the lock.
- For all locking macros in the file, convert to using a _LOCKPTR macro.

Reported by:	mav
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31836
2021-09-07 11:19:29 -04:00
Mark Johnston 9a94097cd0 nd6: Make the DAD callout MPSAFE
Interface addresses with pending duplicate address detection (DAD) live
in a global queue.  In this case, a callout is associated with each
entry.  The callout transmits neighbour solicitations until the system
decides the address is no longer tentative, or until a duplicate address
is discovered.  At this point the entry is dequeued and freed.  DAD may
be manually stopped as well.

The callout currently runs (and potentially transmits packets) with
Giant held.  Reorganize DAD queue locking to interlock properly with the
callout:

- Configure the callout to acquire the DAD queue lock before running.
  The lock is dropped before transmitting any packets.  Stop protecting
  the callout with Giant.
- When looking up DAD queue entries for an incoming NS or NA, don't
  bother fiddling with the DAD queue entry reference count.
- Split nd6_dad_starttimer() so that the caller is responsible to
  transmitting a NS if it so desires.
- Remove the DAD entry from the queue before stopping the timer.  Use a
  temporary reference to make sure that the entry doesn't get freed by
  the callout while we're draining.

Reported by:	mav
Reviewed by:	bz, hrs
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D31826
2021-09-07 11:19:29 -04:00
Gordon Bergling 10e0082fff inet6(4): Fix a few common typos in source code comments
- s/reshedule/reschedule/

MFC after:	3 days
2021-08-28 18:53:59 +02:00
Hiroki Sato 9823a0c0ac
inet6(4): add a missing IPPROTO_ETHERIP entry
bridge(4) + gif(4) did not work when the outer protocol was IPv6.

Submitted by:	Masahiro Kozuka
PR:		256820
MFC after:	3 days
2021-08-27 17:14:35 +09:00
Alexander V. Chernikov f8c1b1a929 lltable: fix crash introduced in c541bd368f.
Reported by:	cy
MFC after:	2 weeks
2021-08-22 08:49:18 +00:00
Alexander V. Chernikov c541bd368f lltable: Add support for "child" LLEs holding encap for IPv4oIPv6 entries.
Currently we use pre-calculated headers inside LLE entries as prepend data
 for `if_output` functions. Using these headers allows saving some
 CPU cycles/memory accesses on the fast path.

However, this approach makes adding L2 header for IPv4 traffic with IPv6
 nexthops more complex, as it is not possible to store multiple
 pre-calculated headers inside lle. Additionally, the solution space is
 limited by the fact that PCB caching saves LLEs in addition to the nexthop.

Thus, add support for creating special "child" LLEs for the purpose of holding
 custom family encaps and store mbufs pending resolution. To simplify handling
 of those LLEs, store them in a linked-list inside a "parent" (e.g. normal) LLE.
 Such LLEs are not visible when iterating LLE table. Their lifecycle is bound
 to the "parent" LLE - it is not possible to delete "child" when parent is alive.
 Furthermore, "child" LLEs are static (RTF_STATIC), avoding complex state
 machine used by the standard LLEs.

nd6_lookup() and nd6_resolve() now accepts an additional argument, family,
 allowing to return such child LLEs. This change uses `LLE_SF()` macro which
 packs family and flags in a single int field. This is done to simplify merging
 back to stable/. Once this code lands, most of the cases will be converted to
 use a dedicated `family` parameter.

Differential Revision: https://reviews.freebsd.org/D31379
MFC after:	2 weeks
2021-08-21 17:34:35 +00:00
Mateusz Guzik 8afe9481cf frag6: do less work in frag6_slowtimo if possible
frag6_slowtimo avoidably uses CPU on otherwise idle boxes

Reviewed by:	kp
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D31528
2021-08-14 18:51:00 +02:00
Mateusz Guzik c17ae18080 frag6: drop the volatile keyword from frag6_nfrags and mark with __exclusive_cache_line
The keyword adds nothing as all operations on the var are performed
through atomic_*

Reviewed by:	kp
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D31528
2021-08-14 18:50:29 +02:00
Mark Johnston 663428ea17 nd6: Mark several callouts as MPSAFE
The use of Giant here is vestigal and does not provide any useful
synchronization.  Furthermore, non-MPSAFE callouts can cause the
softclock threads to block waiting for long-running newbus operations to
complete.

Reported by:	mav
Reviewed by:	bz
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31470
2021-08-09 13:27:52 -04:00
Mark Johnston 8ee0826f75 in6: Enter the net epoch in in6_tmpaddrtimer()
We need to do so to safely traverse the ifnet list.

Reviewed by:	bz
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31470
2021-08-09 13:27:52 -04:00
Michael Tuexen a8d54fc903 ipv6: Fix getsockopt() for some IPPROTO_IPV6 level socket options
Fix getsockopt() for the IPPROTO_IPV6 level socket options with the
following names: IPV6_HOPOPTS, IPV6_RTHDR, IPV6_RTHDRDSTOPTS,
IPV6_DSTOPTS, and IPV6_NEXTHOP.

Reviewed by:		markj
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D31458
2021-08-09 09:29:13 +02:00
Gordon Bergling 04389c855e Fix some common typos in comments
- s/configuraiton/configuration/
- s/specifed/specified/
- s/compatiblity/compatibility/

MFC after:	5 days
2021-08-08 10:16:06 +02:00
Michael Tuexen 784692c740 sctp: improve handling of IPv4 addresses on IPV6 sockets
Reported by:	syzbot+08fe66e4bfc2777cba95@syzkaller.appspotmail.com
MFC after:	3 days
2021-08-07 17:27:56 +02:00
Alexander V. Chernikov 0b79b007eb [lltable] Restructure nd6 code.
Factor out lltable locking logic from lltable_try_set_entry_addr()
 into a separate lltable_acquire_wlock(), so the latter can be used
 in other parts of the code w/o duplication.

Create nd6_try_set_entry_addr() to avoid code duplication in nd6.c
 and nd6_nbr.c.

Move lle creation logic from nd6_resolve_slow() into a separate
 nd6_get_llentry() to simplify the former.

These changes serve as a pre-requisite for implementing
 RFC8950 (IPv4 prefixes with IPv6 nexthops).

Differential Revision: https://reviews.freebsd.org/D31432
MFC after:	2 weeks
2021-08-07 09:59:11 +00:00
Andrey V. Elsukov d477a7feed Fix panic in IPv6 multicast code.
Add check that ifp supports IPv6 multicasts in in6_getmulti.
This fixes panic when user application tries to join into multicast
group on an interface that doesn't support IPv6 multicasts, like
IFT_PFLOG interfaces.

PR:             257302
Reviewed by:	melifaro
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D31420
2021-08-06 12:57:59 +03:00
Alexander V. Chernikov 8482aa7748 Use lltable calculated header when sending lle holdchain after successful lle resolution.
Subscribers: imp, ae, bz

Differential Revision: https://reviews.freebsd.org/D31391
2021-08-05 20:44:36 +00:00
Alexander V. Chernikov f3a3b06121 [lltable] Unify datapath feedback mechamism.
Use newly-create llentry_request_feedback(),
 llentry_mark_used() and llentry_get_hittime() to
 request datapatch usage check and fetch the results
 in the same fashion both in IPv4 and IPv6.

While here, simplify llentry_provide_feedback() wrapper
 by eliminating 1 condition check.

MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D31390
2021-08-04 22:52:43 +00:00
Roy Marples 7045b1603b socket: Implement SO_RERROR
SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.

This is really really important for programs that use route(4) to keep
in sync with the system. If we loose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.

Reviewed by:	philip (network), kbowling (transport), gbe (manpages)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D26652
2021-07-28 09:35:09 -07:00
Michael Tuexen 105b68b42d sctp: Fix errno in case of association setup failures
Do not report always ETIMEDOUT, but only when appropriate. In
other cases report ECONNABORTED.

MFC after:	3 days
2021-07-09 23:19:25 +02:00
Michael Tuexen c7f048ab35 sctp: initialize sequence numbers for ECN correctly
MFC after:	3 days
Reported by:	Junseok Yang (for the userland stack)
2021-06-27 20:14:48 +02:00
Ryan Stone 2290dfb40f Enter the net epoch before calling ip6_setpktopts
ip6_setpktopts() can look up ifnets via ifnet_by_index(), which
is only safe in the net epoch.  Ensure that callers are in the net
epoch before calling this function.

Sponsored by: Dell EMC Isilon
MFC after: 4 weeks
Reviewed by: donner, kp
Differential Revision: https://reviews.freebsd.org/D30630
2021-06-04 13:18:11 -04:00
Mark Johnston d8acd2681b Fix mbuf leaks in various pru_send implementations
The various protocol implementations are not very consistent about
freeing mbufs in error paths.  In general, all protocols must free both
"m" and "control" upon an error, except if PRUS_NOTREADY is specified
(this is only implemented by TCP and unix(4) and requires further work
not handled in this diff), in which case "control" still must be freed.

This diff plugs various leaks in the pru_send implementations.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30151
2021-05-12 13:00:09 -04:00
Mark Johnston c1dd4d642f nd6: Avoid using an uninitialized sockaddr in nd6_prefix_offlink()
Commit 81728a538 ("Split rtinit() into multiple functions.") removed
the initialization of sa6, but not one of its uses.  This meant that we
were passing an uninitialized sockaddr as the address to
lltable_prefix_free().  Remove the variable outright to fix the problem.
The caller is expected to hold a reference on pr.

Fixes:		81728a538 ("Split rtinit() into multiple functions.")
Reported by:	KMSAN
Reviewed by:	donner
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30166
2021-05-12 12:52:06 -04:00
Kristof Provost 2ef5d803e3 in6_mcast: Return EADDRINUSE when we've already joined the group
Distinguish between truly invalid requests and those that fail because
we've already joined the group. Both cases fail, but differentiating
them allows userspace to make more informed decisions about what the
error means.

For example. radvd tries to join the all-routers group on every SIGHUP.
This fails, because it's already joined it, but this failure should be
ignored (rather than treated as a sign that the interface's multicast is
broken).

This puts us in line with OpenBSD, NetBSD and Linux.

Reviewed by:	donner
MFC after:	1 week
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D30111
2021-05-10 09:48:51 +02:00
Mark Johnston f161d294b9 Add missing sockaddr length and family validation to various protocols
Several protocol methods take a sockaddr as input.  In some cases the
sockaddr lengths were not being validated, or were validated after some
out-of-bounds accesses could occur.  Add requisite checking to various
protocol entry points, and convert some existing checks to assertions
where appropriate.

Reported by:	syzkaller+KASAN
Reviewed by:	tuexen, melifaro
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29519
2021-05-03 13:35:19 -04:00
Mark Johnston 8e8f1cc9bb Re-enable network ioctls in capability mode
This reverts a portion of 274579831b ("capsicum: Limit socket
operations in capability mode") as at least rtsol and dhcpcd rely on
being able to configure network interfaces while in capability mode.

Reported by:	bapt, Greg V
Sponsored by:	The FreeBSD Foundation
2021-04-23 09:22:49 -04:00
Gleb Smirnoff 1db08fbe3f tcp_input: always request read-locking of PCB for any pure SYN segment.
This is further rework of 08d9c92027.  Now we carry the knowledge of
lock type all the way through tcp_input() and also into tcp_twcheck().
Ideally the rlocking for pure SYNs should propagate all the way into
the alternative TCP stacks, but not yet today.

This should close a race when socket is bind(2)-ed but not yet
listen(2)-ed and a SYN-packet arrives racing with listen(2), discovered
recently by pho@.
2021-04-20 10:02:20 -07:00
Michael Tuexen 9e644c2300 tcp: add support for TCP over UDP
Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special priviledges or specific support by the OS.
This is joint work with rrs.

Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D29469
2021-04-18 16:16:42 +02:00
Gleb Smirnoff 08d9c92027 tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets
When packet is a SYN packet, we don't need to modify any existing PCB.
Normally SYN arrives on a listening socket, we either create a syncache
entry or generate syncookie, but we don't modify anything with the
listening socket or associated PCB. Thus create a new PCB lookup
mode - rlock if listening. This removes the primary contention point
under SYN flood - the listening socket PCB.

Sidenote: when SYN arrives on a synchronized connection, we still
don't need write access to PCB to send a challenge ACK or just to
drop. There is only one exclusion - tcptw recycling. However,
existing entanglement of tcp_input + stacks doesn't allow to make
this change small. Consider this patch as first approach to the problem.

Reviewed by:	rrs
Differential revision:	https://reviews.freebsd.org/D29576
2021-04-12 08:25:31 -07:00
Mark Johnston 274579831b capsicum: Limit socket operations in capability mode
Capsicum did not prevent certain privileged networking operations,
specifically creation of raw sockets and network configuration ioctls.
However, these facilities can be used to circumvent some of the
restrictions that capability mode is supposed to enforce.

Add capability mode checks to disallow network configuration ioctls and
creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET
internet sockets.

Reviewed by:	oshogbo
Discussed with:	emaste
Reported by:	manu
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D29423
2021-04-07 14:32:56 -04:00
Kyle Evans f187d6dfbf base: remove if_wg(4) and associated utilities, manpage
After length decisions, we've decided that the if_wg(4) driver and
related work is not yet ready to live in the tree.  This driver has
larger security implications than many, and thus will be held to
more scrutiny than other drivers.

Please also see the related message sent to the freebsd-hackers@
and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on
2021/03/16, with the subject line "Removing WireGuard Support From Base"
for additional context.
2021-03-17 09:14:48 -05:00
Kyle Evans 74ae3f3e33 if_wg: import latest fixup work from the wireguard-freebsd project
This is the culmination of about a week of work from three developers to
fix a number of functional and security issues.  This patch consists of
work done by the following folks:

- Jason A. Donenfeld <Jason@zx2c4.com>
- Matt Dunwoodie <ncon@noconroy.net>
- Kyle Evans <kevans@FreeBSD.org>

Notable changes include:
- Packets are now correctly staged for processing once the handshake has
  completed, resulting in less packet loss in the interim.
- Various race conditions have been resolved, particularly w.r.t. socket
  and packet lifetime (panics)
- Various tests have been added to assure correct functionality and
  tooling conformance
- Many security issues have been addressed
- if_wg now maintains jail-friendly semantics: sockets are created in
  the interface's home vnet so that it can act as the sole network
  connection for a jail
- if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0
- if_wg now exports via ioctl a format that is future proof and
  complete.  It is additionally supported by the upstream
  wireguard-tools (which we plan to merge in to base soon)
- if_wg now conforms to the WireGuard protocol and is more closely
  aligned with security auditing guidelines

Note that the driver has been rebased away from using iflib.  iflib
poses a number of challenges for a cloned device trying to operate in a
vnet that are non-trivial to solve and adds complexity to the
implementation for little gain.

The crypto implementation that was previously added to the tree was a
super complex integration of what previously appeared in an old out of
tree Linux module, which has been reduced to crypto.c containing simple
boring reference implementations.  This is part of a near-to-mid term
goal to work with FreeBSD kernel crypto folks and take advantage of or
improve accelerated crypto already offered elsewhere.

There's additional test suite effort underway out-of-tree taking
advantage of the aforementioned jail-friendly semantics to test a number
of real-world topologies, based on netns.sh.

Also note that this is still a work in progress; work going further will
be much smaller in nature.

MFC after:	1 month (maybe)
2021-03-14 23:52:04 -05:00
Alexander V. Chernikov b1d63265ac Flush remaining routes from the routing table during VNET shutdown.
Summary:
This fixes rtentry leak for the cloned interfaces created inside the
 VNET.

PR:	253998
Reported by:	rashey at superbox.pl
MFC after:	3 days

Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown).
Thus, any route table operations are too late to schedule.
As the intent of the vnet teardown procedures to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`.
It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish.

Test Plan:
```
set_skip:set_skip_group_lo  ->  passed  [0.053s]
tail -n 200 /var/log/messages | grep rtentry
```

Reviewers: #network, kp, bz

Reviewed By: kp

Subscribers: imp, ae

Differential Revision: https://reviews.freebsd.org/D29116
2021-03-10 21:10:14 +00:00
Alexander V. Chernikov 7634919e15 Fix 'in6_purgeaddr: err=65, destination address delete failed' message.
P2P ifa may require 2 routes: one is the loopback route, another is
 the "prefix" route towards its destination.

Current code marks loopback routes existence with IFA_RTSELF and
 "prefix" p2p routes with IFA_ROUTE.

For historic reasons, we fill in ifa_dstaddr for loopback interfaces.
To avoid installing the same route twice, we preemptively set
 IFA_RTSELF when adding "prefix" route for loopback.
However, the teardown part doesn't have this hack, so we try to
 remove the same route twice.

Fix this by checking if ifa_dstaddr is different from the ifa_addr
 and moving this logic into a separate function.

Reviewed By: kp
Differential Revision: https://reviews.freebsd.org/D29121
MFC after:	3 days
2021-03-08 21:28:35 +00:00
Alexander V. Chernikov d5be41beb7 Fix dpdk/ldradix fib lookup algorithm preference calculation.
The current preference number were copied from IPv4 code,
 assuming 500k routes to be the full-view. Adjust with the current
 reality (100k full-view).

Reported by:	Marek Zarychta <zarychtam at plan-b.pwste.edu.pl>
MFC after:	3 days
2021-03-07 22:17:53 +00:00
Kristof Provost bb4a7d94b9 net: Introduce IPV6_DSCP(), IPV6_ECN() and IPV6_TRAFFIC_CLASS() macros
Introduce convenience macros to retrieve the DSCP, ECN or traffic class
bits from an IPv6 header.

Use them where appropriate.

Reviewed by:	ae (previous version), rscheff, tuexen, rgrimes
MFC after:	2 weeks
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29056
2021-03-04 20:56:48 +01:00
Alexander V. Chernikov cc3fa1e29f Fix crash with rtadv-originated multipath IPv6 routes.
PR:		253800
Reported by:	Frederic Denis <freebsdml at hecian.net>
MFC after:	immediately
2021-02-24 16:44:10 +00:00
Alexander V. Chernikov 9c4a8d24f0 Fix nd6 rib_action() handling.
rib_action() guarantees valid rc filling IFF it returns without error.
Check rib_action() return code instead of checking rc fields.

PR:		253800
Reported by:	Frederic Denis <freebsdml@hecian.net>
MFC after:	immediately
2021-02-23 22:40:01 +00:00
Kristof Provost c139b3c19b arp/nd: Cope with late calls to iflladdr_event
When tearing down vnet jails we can move an if_bridge out (as
part of the normal vnet_if_return()). This can, when it's clearing out
its list of member interfaces, change its link layer address.
That sends an iflladdr_event, but at that point we've already freed the
AF_INET/AF_INET6 if_afdata pointers.

In other words: when the iflladdr_event callbacks fire we can't assume
that ifp->if_afdata[AF_INET] will be set.

Reviewed by:	donner@, melifaro@
MFC after:	1 week
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D28860
2021-02-23 13:54:07 +01:00
Alexander V. Chernikov 8268d82cff Remove per-packet ifa refcounting from IPv6 fast path.
Currently ip6_input() calls in6ifa_ifwithaddr() for
 every local packet, in order to check if the target ip
 belongs to the local ifa in proper state and increase
 its counters.

in6ifa_ifwithaddr() references found ifa.
With epoch changes, both `ip6_input()` and all other current callers
 of `in6ifa_ifwithaddr()` do not need this reference
 anymore, as epoch provides stability guarantee.

Given that, update `in6ifa_ifwithaddr()` to allow
 it to return ifa without referencing it, while preserving
 option for getting referenced ifa if so desired.

MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D28648
2021-02-15 22:33:12 +00:00
Alexander V. Chernikov 605284b894 Enforce net epoch in in6_selectsrc().
in6_selectsrc() may call fib6_lookup() in some cases, which requires
 epoch. Wrap in6_selectsrc* calls into epoch inside its users.
Mark it as requiring epoch by adding NET_EPOCH_ASSERT().

MFC after:	1 weeek
Differential Revision:	https://reviews.freebsd.org/D28647
2021-02-15 22:33:12 +00:00
Alexander V. Chernikov 1bd44b11e5 Do not reference returned ifa in in6_ifawithifp().
The only place where in6_ifawithifp() is used is ip6_output(),
 which uses the returned ifa to bump traffic counters.
Given ifa stability guarantees is provided by epoch, do not refcount ifa.

This eliminates 2 atomic ops from IPv6 fast path.

Reviewed By:	rstone
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28649
2021-02-14 10:11:18 +00:00
Andrey V. Elsukov 3c782d9c91 [udp6] fix possible panic due to lack of locking.
The lookup for a IPv6 multicast addresses corresponding to
the destination address in the datagram is protected by the
NET_EPOCH section. Access to each PCB is protected by INP_RLOCK
during comparing. But access to socket's so_options field is
not protected. And in some cases it is possible, that PCB
pointer is still valid, but inp_socket is not. The patch wides
lock holding to protect access to inp_socket. It copies locking
strategy from IPv4 UDP handling.

PR:	232192
Obtained from:	Yandex LLC
MFC after:	3 days
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D28232
2021-02-11 12:00:25 +03:00
Alexander V. Chernikov 924d1c9a05 Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors."
Wrong version of the change was pushed inadvertenly.

This reverts commit 4a01b854ca.
2021-02-08 22:32:32 +00:00
Alexander V. Chernikov 4a01b854ca SO_RERROR indicates that receive buffer overflows should be handled as errors.
Historically receive buffer overflows have been ignored and programs
could not tell if they missed messages or messages had been truncated
because of overflows. Since programs historically do not expect to get
receive overflow errors, this behavior is not the default.

This is really really important for programs that use route(4) to keep in sync
with the system. If we loose a message then we need to reload the full system
state, otherwise the behaviour from that point is undefined and can lead
to chasing bogus bug reports.
2021-02-08 21:42:20 +00:00
Alexander V. Chernikov 91f2c69ec2 Fix unused-function waring when compiling with FIB_ALGO.
MFC after:	3 days
2021-01-30 23:25:56 +00:00
Gleb Smirnoff 3f43ada98c Catch up with 6edfd179c8: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG.
Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address.  Then, this was extended to array of
such pointers.  Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change.  The new name should be generic enough to avoid further
renaming.
2021-01-29 11:46:24 -08:00
Randall Stewart 24a8f6d369 When we are about to send down to the driver layer
we need to make sure that the m_nextpkt field is NULL
else the lower layers may do unwanted things.

Reviewed By:  gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D28377
2021-01-27 13:52:44 -05:00
Alexander V. Chernikov f9e0752e35 Create new in6_purgeifaddr() which purges bound ifa prefix if
it gets unused.

Currently if_purgeifaddrs() uses in6_purgeaddr() to remove IPv6
 ifaddrs. in6_purgeaddr() does not trrigger prefix removal if
 number of linked ifas goes to 0, as this is a low-level function.
 As a result, if_purgeifaddrs() purges all IPv4/IPv6 addresses but
 keeps corresponding IPv6 prefixes.

Fix this by creating higher-level wrapper which handles unused
 prefix usecase and use it in if_purgeifaddrs().

Differential revision:	https://reviews.freebsd.org/D28128
2021-01-17 20:32:25 +00:00
Alexander V. Chernikov 81728a538d Split rtinit() into multiple functions.
rtinit[1]() is a function used to add or remove interface address prefix routes,
  similar to ifa_maintain_loopback_route().
It was intended to be family-agnostic. There is a problem with this approach
 in reality.

1) IPv6 code does not use it for the ifa routes. There is a separate layer,
  nd6_prelist_(), providing interface for maintaining interface routes. Its part,
  responsible for the actual route table interaction, mimics rtenty() code.

2) rtinit tries to combine multiple actions in the same function: constructing
  proper route attributes and handling iterations over multiple fibs, for the
  non-zero net.add_addr_allfibs use case. It notably increases the code complexity.

3) dstaddr handling. flags parameter re-uses RTF_ flags. As there is no special flag
 for p2p connections, host routes and p2p routes are handled in the same way.
 Additionally, mapping IFA flags to RTF flags makes the interface pretty messy.
 It make rtinit() to clash with ifa_mainain_loopback_route() for IPV4 interface
 aliases.

4) rtinit() is the last customer passing non-masked prefixes to rib_action(),
 complicating rib_action() implementation.

5) rtinit() coupled ifa announce/withdrawal notifications, producing "false positive"
 ifa messages in certain corner cases.

To address all these points, the following has been done:

* rtinit() has been split into multiple functions:
- Route attribute construction were moved to the per-address-family functions,
 dealing with (2), (3) and (4).
- funnction providing net.add_addr_allfibs handling and route rtsock notificaions
 is the new routing table inteface.
- rtsock ifa notificaion has been moved out as well. resulting set of funcion are only
 responsible for the actual route notifications.

Side effects:
* /32 alias does not result in interface routes (/32 route and "host" route)
* RTF_PINNED is now set for IPv6 prefixes corresponding to the interface addresses

Differential revision:	https://reviews.freebsd.org/D28186
2021-01-16 22:42:41 +00:00
Alexander V. Chernikov e58c8da068 Map IPv6 link-local prefix to the link-local ifa.
Currently we create link-local route by creating an always-on IPv6 prefix
 in the prefix list. This prefix is not tied to the link-local ifa.

This leads to the following problems:

First, when flushing interface addresses we skip on-link route, leaving
 fe80::/64 prefix on the interface without any IPv6 addresses.
Second, when creating and removing link-local alias we lose fe80::/64 prefix
 from the routing table.

Fix this by attaching link-local prefix to the ifa at the initial creation.

Reviewed by:	hrs
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D28129
2021-01-13 10:03:15 +00:00
Alexander V. Chernikov 2defbe9f0e Use rn_match instead of doing indirect calls in fib_algo.
Relevant inet/inet6 code has the control over deciding what
 the RIB lookup function currently is. With that in mind,
 explicitly set it to the current value (rn_match) in the
 datapath lookups. This avoids cost on indirect call.

Differential Revision: https://reviews.freebsd.org/D28066
2021-01-11 23:30:35 +00:00
Alexander V. Chernikov 0da3f8c98d Bump amount of queued packets in for unresolved ARP/NDP entries to 16.
Currently default behaviour is to keep only 1 packet per unresolved entry.
Ability to queue more than one packet was added 10 years ago, in r215207,
 though the default value was kep intact.

Things have changed since that time. Systems tend to initiate multiple
 connections at once for a variety of reasons.
For example, recent kern/252278 bug report describe happy-eyeball DNS
 behaviour sending multiple requests to the DNS server.

The primary driver for upper value for the queue length determination is
 memory consumption. Remote actors should not be able to easily exhaust
 local memory by sending packets to unresolved arp/ND entries.

For now, bump value to 16 packets, to match Darwin implementation.

The proper approach would be to switch the limit to calculate memory
 consumption instead of packet count and limit based on memory.

We should MFC this with a variation of D22447.

Reviewers: #manpages, #network, bz, emaste

Reviewed By: emaste, gbe(doc), jilles(doc)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D28068
2021-01-11 19:51:11 +00:00
Alexander V. Chernikov d68cf57b7f Refactor rt_addrmsg() and rt_routemsg().
Summary:
* Refactor rt_addrmsg(): make V_rt_add_addr_allfibs decision locally.
* Fix rt_routemsg() and multipath by accepting nexthop instead of interface pointer.
* Refactor rtsock_routemsg(): avoid accessing rtentry fields directly.
* Simplify in_addprefix() by moving prefix search to a separate  function.

Reviewers: #network

Subscribers: imp, ae, bz

Differential Revision: https://reviews.freebsd.org/D28011
2021-01-07 19:38:19 +00:00
Alexander V. Chernikov f5baf8bb12 Add modular fib lookup framework.
This change introduces framework that allows to dynamically
 attach or detach longest prefix match (lpm) lookup algorithms
 to speed up datapath route tables lookups.

Framework takes care of handling initial synchronisation,
 route subscription, nhop/nhop groups reference and indexing,
 dataplane attachments and fib instance algorithm setup/teardown.
Framework features automatic algorithm selection, allowing for
 picking the best matching algorithm on-the-fly based on the
 amount of routes in the routing table.

Currently framework code is guarded under FIB_ALGO config option.
An idea is to enable it by default in the next couple of weeks.

The following algorithms are provided by default:
IPv4:
* bsearch4 (lockless binary search in a special IP array), tailored for
  small-fib (<16 routes)
* radix4_lockless (lockless immutable radix, re-created on every rtable change),
  tailored for small-fib (<1000 routes)
* radix4 (base system radix backend)
* dpdk_lpm4 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
  for large-fib (D27412)
IPv6:
* radix6_lockless (lockless immutable radix, re-created on every rtable change),
  tailed for small-fib (<1000 routes)
* radix6 (base system radix backend)
* dpdk_lpm6 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
  for large-fib (D27412)

Performance changes:
Micro benchmarks (I7-7660U, single-core lookups, 2048k dst, code in D27604):
IPv4:
8 routes:
  radix4: ~20mpps
  radix4_lockless: ~24.8mpps
  bsearch4: ~69mpps
  dpdk_lpm4: ~67 mpps
700k routes:
  radix4_lockless: 3.3mpps
  dpdk_lpm4: 46mpps

IPv6:
8 routes:
  radix6_lockless: ~20mpps
  dpdk_lpm6: ~70mpps
100k routes:
  radix6_lockless: 13.9mpps
  dpdk_lpm6: 57mpps

Forwarding benchmarks:
+ 10-15% IPv4 forwarding performance (small-fib, bsearch4)
+ 25% IPv4 forwarding performance (full-view, dpdk_lpm4)
+ 20% IPv6 forwarding performance (full-view, dpdk_lpm6)

Control:
Framwork adds the following runtime sysctls:

List algos
* net.route.algo.inet.algo_list: bsearch4, radix4_lockless, radix4
* net.route.algo.inet6.algo_list: radix6_lockless, radix6, dpdk_lpm6
Debug level (7=LOG_DEBUG, per-route)
net.route.algo.debug_level: 5
Algo selection (currently only for fib 0):
net.route.algo.inet.algo: bsearch4
net.route.algo.inet6.algo: radix6_lockless

Support for manually changing algos in non-default fib will be added
soon. Some sysctl names will be changed in the near future.

Differential Revision: https://reviews.freebsd.org/D27401
2020-12-25 11:33:17 +00:00
Andrew Gallatin a034518ac8 Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.

This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.

This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.

Reviewed by:	jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by:	Netfix
Differential Revision:	https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
Hans Petter Selasky ac4dd4cd95 Expose nonstandard IPv6 kernel definitions to standalone builds.
No functional change.

Reviewed by:	bz@
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2020-12-04 21:51:47 +00:00
Alexander V. Chernikov d1d941c5b9 Remove RADIX_MPATH config option.
ROUTE_MPATH is the new config option controlling new multipath routing
 implementation. Remove the last pieces of RADIX_MPATH-related code and
 the config option.

Reviewed by:	glebius
Differential Revision:	https://reviews.freebsd.org/D27244
2020-11-29 19:43:33 +00:00
Alexander V. Chernikov b712e3e343 Refactor fib4/fib6 functions.
No functional changes.

* Make lookup path of fib<4|6>_lookup_debugnet() separate functions
 (fib<46>_lookup_rt()). These will be used in the control plane code
 requiring unlocked radix operations and actual prefix pointer.
* Make lookup part of fib<4|6>_check_urpf() separate functions.
 This change simplifies the switch to alternative lookup implementations,
 which helps algorithmic lookups introduction.
* While here, use static initializers for IPv4/IPv6 keys

Differential Revision:	https://reviews.freebsd.org/D27405
2020-11-29 13:41:49 +00:00
Bjoern A. Zeeb dd4d5a5ffb IPv6: set ifdisabled in the kernel rather than in rc
Enable ND6_IFF_IFDISABLED when the interface is created in the
kernel before return to user space.

This avoids a race when an interface is create by a program which
also calls ifconfig IF inet6 -ifdisabled and races with the
devd -> /etc/pccard_ether -> .. netif start IF -> ifdisabled
calls (the devd/rc framework disabling IPv6 again after the program
had enabled it already).

In case the global net.inet6.ip6.accept_rtadv was turned on,
we also default to enabling IPv6 on the interfaces, rather than
disabling them.

PR:		248172
Reported by:	Gert Doering (gert greenie.muc.de)
Reviewed by:	glebius (, phk)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D27324
2020-11-25 20:58:01 +00:00
Alexander V. Chernikov 7511a63825 Refactor rib iterator functions.
* Make rib_walk() order of arguments consistent with the rest of RIB api
* Add rib_walk_ext() allowing to exec callback before/after iteration.
* Rename rt_foreach_fib_walk_del -> rib_foreach_table_walk_del
* Rename rt_forach_fib_walk -> rib_foreach_table_walk
* Move rib_foreach_table_walk{_del} to route/route_helpers.c
* Slightly refactor rib_foreach_table_walk{_del} to make the implementation
 consistent and prepare for upcoming iterator optimizations.

Differential Revision:	https://reviews.freebsd.org/D27219
2020-11-22 20:21:10 +00:00
Jonathan T. Looney 440598dd9e Fix implicit automatic local port selection for IPv6 during connect calls.
When a user creates a TCP socket and tries to connect to the socket without
explicitly binding the socket to a local address, the connect call
implicitly chooses an appropriate local port. When evaluating candidate
local ports, the algorithm checks for conflicts with existing ports by
doing a lookup in the connection hash table.

In this circumstance, both the IPv4 and IPv6 code look for exact matches
in the hash table. However, the IPv4 code goes a step further and checks
whether the proposed 4-tuple will match wildcard (e.g. TCP "listen")
entries. The IPv6 code has no such check.

The missing wildcard check can cause problems when connecting to a local
server. It is possible that the algorithm will choose the same value for
the local port as the foreign port uses. This results in a connection with
identical source and destination addresses and ports. Changing the IPv6
code to align with the IPv4 code's behavior fixes this problem.

Reviewed by:	tuexen
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27164
2020-11-14 14:50:34 +00:00
Alexander V. Chernikov d9999ae9ca Fix use-after-free in icmp6_notify_error().
Reported by:	Maxime Villard <max at m00nbsd.net>
Reviewed by:	markj
MFC after:	3 days
2020-10-28 20:22:20 +00:00
Mark Johnston 4caea9b169 icmp6: Count packets dropped due to an invalid hop limit
Pad the icmp6stat structure so that we can add more counters in the
future without breaking compatibility again, last done in r358620.
Annotate the rarely executed error paths with __predict_false while
here.

Reviewed by:	bz, melifaro
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26578
2020-10-19 17:07:19 +00:00
Alexander V. Chernikov 0c325f53f1 Implement flowid calculation for outbound connections to balance
connections over multiple paths.

Multipath routing relies on mbuf flowid data for both transit
 and outbound traffic. Current code fills mbuf flowid from inp_flowid
 for connection-oriented sockets. However, inp_flowid is currently
 not calculated for outbound connections.

This change creates simple hashing functions and starts calculating hashes
 for TCP,UDP/UDP-Lite and raw IP if multipath routes are present in the
 system.

Reviewed by:	glebius (previous version),ae
Differential Revision:	https://reviews.freebsd.org/D26523
2020-10-18 17:15:47 +00:00
Richard Scheffenegger 868aabb470 Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow.
This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt)
option IP(V6)_VLAN_PCP, which can be set to -1 (interface
default), or explicitly to any priority between 0 and 7.

Note that for untagged traffic, explicitly adding a
priority will insert a special 801.1Q vlan header with
vlan ID = 0 to carry the priority setting

Reviewed by:	gallatin, rrs
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26409
2020-10-09 12:06:43 +00:00
Alexander V. Chernikov fedeb08b6a Introduce scalable route multipath.
This change is based on the nexthop objects landed in D24232.

The change introduces the concept of nexthop groups.
Each group contains the collection of nexthops with their
 relative weights and a dataplane-optimized structure to enable
 efficient nexthop selection.

Simular to the nexthops, nexthop groups are immutable. Dataplane part
 gets compiled during group creation and is basically an array of
 nexthop pointers, compiled w.r.t their weights.

With this change, `rt_nhop` field of `struct rtentry` contains either
 nexthop or nexthop group. They are distinguished by the presense of
 NHF_MULTIPATH flag.
All dataplane lookup functions returns pointer to the nexthop object,
leaving nexhop groups details inside routing subsystem.

User-visible changes:

The change is intended to be backward-compatible: all non-mpath operations
 should work as before with ROUTE_MPATH and net.route.multipath=1.

All routes now comes with weight, default weight is 1, maximum is 2^24-1.

Current maximum multipath group width is statically set to 64.
 This will become sysctl-tunable in the followup changes.

Using functionality:
* Recompile kernel with ROUTE_MPATH
* set net.route.multipath to 1

route add -6 2001:db8::/32 2001:db8::2 -weight 10
route add -6 2001:db8::/32 2001:db8::3 -weight 20

netstat -6On

Nexthop groups data

Internet6:
GrpIdx  NhIdx     Weight   Slots                                 Gateway     Netif  Refcnt
1         ------- ------- ------- --------------------------------------- ---------       1
              13      10       1                             2001:db8::2     vlan2
              14      20       2                             2001:db8::3     vlan2

Next steps:
* Land outbound hashing for locally-originated routes ( D26523 ).
* Fix net/bird multipath (net/frr seems to work fine)
* Add ROUTE_MPATH to GENERIC
* Set net.route.multipath=1 by default

Tested by:	olivier
Reviewed by:	glebius
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D26449
2020-10-03 10:47:17 +00:00
Alexander V. Chernikov 2259a03020 Rework part of routing code to reduce difference to D26449.
* Split rt_setmetrics into get_info_weight() and rt_set_expire_info(),
 as these two can be applied at different entities and at different times.
* Start filling route weight in route change notifications
* Pass flowid to UDP/raw IP route lookups
* Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact
 that rtentry can contain multiple nexthops.

Differential Revision:	https://reviews.freebsd.org/D26497
2020-09-21 20:02:26 +00:00
Alexander V. Chernikov 1440f62266 Remove unused nhop_ref_any() function.
Remove "opt_mpath.h" header where not needed.

No functional changes.
2020-09-20 21:32:52 +00:00
Navdeep Parhar b092fd6c97 if_vxlan(4): add support for hardware assisted checksumming, TSO, and RSS.
This lets a VXLAN pseudo-interface take advantage of hardware checksumming (tx
and rx), TSO, and RSS if the NIC is capable of performing these operations on
inner VXLAN traffic.

A VXLAN interface inherits the capabilities of its vxlandev interface if one is
specified or of the interface that hosts the vxlanlocal address. If other
interfaces will carry traffic for that VXLAN then they must have the same
hardware capabilities.

On transmit, if_vxlan verifies that the outbound interface has the required
capabilities and then translates the CSUM_ flags to their inner equivalents.
This tells the hardware ifnet that it needs to operate on the inner frame and
not the outer VXLAN headers.

An event is generated when a VXLAN ifnet starts. This allows hardware drivers to
configure their devices to expect VXLAN traffic on the specified incoming port.

On receive, the hardware does RSS and checksum verification on the inner frame.
if_vxlan now does a direct netisr dispatch to take full advantage of RSS. It is
not very clear why it didn't do this already.

Future work:
Rx: it should be possible to avoid the first trip up the protocol stack to get
the frame to if_vxlan just so it can decapsulate and requeue for a second trip
up the stack. The hardware NIC driver could directly call an if_vxlan receive
routine for VXLAN traffic instead.

Rx: LRO. depends on what happens with the previous item. There will have to to
be a mechanism to indicate that it's time for if_vxlan to flush its LRO state.

Reviewed by:	kib@
Relnotes:	Yes
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D25873
2020-09-18 02:37:57 +00:00
Navdeep Parhar 72cc43df17 Add a knob to allow zero UDP checksums for UDP/IPv6 traffic on the given UDP port.
This will be used by some upcoming changes to if_vxlan(4).  RFC 7348 (VXLAN)
says that the UDP checksum "SHOULD be transmitted as zero.  When a packet is
received with a UDP checksum of zero, it MUST be accepted for decapsulation."
But the original IPv6 RFCs did not allow zero UDP checksum.  RFC 6935 attempts
to resolve this.

Reviewed by:	kib@
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D25873
2020-09-18 02:21:15 +00:00
Mateusz Guzik 662c13053f net: clean up empty lines in .c and .h files 2020-09-01 21:19:14 +00:00
Kyle Evans 1e9b8db9b2 ipv6: quit dropping packets looping back on p2p interfaces
To paraphrase the below-referenced PR:

This logic originated in the KAME project, and was even controversial when
it was enabled there by default in 2001. No such equivalent logic exists in
the IPv4 stack, and it turns out that this leads to us dropping valid
traffic when the "point to point" interface is actually a 1:many tun
interface, e.g. with the wireguard userland stack.

Even in the case of true point-to-point links, this logic only avoids
transient looping of packets sent by misconfigured applications or
attackers, which can be subverted by proper route configuration rather than
hardcoded logic in the kernel to drop packets.

In the review, melifaro goes on to note that the kernel can't fix it, so it
perhaps shouldn't try to be 'smart' about it. Additionally, that TTL will
still kick in even with incorrect route configuration.

PR:		247718
Reviewed by:	melifaro, rgrimes
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D25567
2020-08-31 01:45:48 +00:00
Alexander V. Chernikov a624ca3dff Move net/route/shared.h definitions to net/route/route_var.h.
No functional changes.

net/route/shared.h was created in the inital phases of nexthop conversion.
It was intended to serve the same purpose as route_var.h - share definitions
 of functions and structures between the routing subsystem components. At
 that time route_var.h was included by many files external to the routing
 subsystem, which largerly defeats its purpose.

As currently this is not the case anymore and amount of route_var.h includes
 is roughly the same as shared.h, retire the latter in favour of the former.
2020-08-28 22:50:20 +00:00
Alexander V. Chernikov bec053ffe0 Make net.inet6.ip6.deembed_scopeid behaviour default & remove sysctl.
Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision:	https://reviews.freebsd.org/D25637
2020-08-15 11:37:44 +00:00
Alexander V. Chernikov 2f23f45b20 Simplify dom_<rtattach|rtdetach>.
Remove unused arguments from dom_rtattach/dom_rtdetach functions and make
  them return/accept 'struct rib_head' instead of 'void **'.
Declare inet/inet6 implementations in the relevant _var.h headers similar
  to domifattach / domifdetach.
Add rib_subscribe_internal() function to accept subscriptions to the rnh
  directly.

Differential Revision:	https://reviews.freebsd.org/D26053
2020-08-14 21:29:56 +00:00
Hans Petter Selasky b453d3d239 Use a static initializer for the multicast free tasks.
This makes the SYSINIT() function updated in r364072 superfluous.

Suggested by:	glebius@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2020-08-11 08:31:40 +00:00
Alexander V. Chernikov 9a00f6d067 Fix rib_subscribe() waitok flag by performing allocation outside epoch.
Make in6_inithead() use rib_subscribe with waitok to achieve reliable
 subscription allocation.

Reviewed by:	glebius
2020-08-11 07:05:30 +00:00
Bjoern A. Zeeb f9461246a2 MC: add a note with reference to the discussion and history as-to why we
are where we are now.  The main thing is to try to get rid of the delayed
freeing to avoid blocking on the taskq when shutting down vnets.

X-Timeout:	if you still see this before 14-RELEASE remove it.
2020-08-10 10:58:43 +00:00
Hans Petter Selasky 3689652c65 Make sure the multicast release tasks are properly drained when
destroying a VNET or a network interface.

Else the inm release tasks, both IPv4 and IPv6 may cause a panic
accessing a freed VNET or network interface.

Reviewed by:		jmg@
Discussed with:		bz@
Differential Revision:	https://reviews.freebsd.org/D24914
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2020-08-10 10:46:08 +00:00
Hans Petter Selasky a95ef9d38d Use proper prototype for SYSINIT() functions.
Mark the unused argument using the __unused macro.

Discussed with:		kib@
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2020-08-10 10:40:19 +00:00
Bjoern A. Zeeb a9839c4aee IPV6_PKTINFO support for v4-mapped IPv6 sockets
When using v4-mapped IPv6 sockets with IPV6_PKTINFO we do not
respect the given v4-mapped src address on the IPv4 socket.
Implement the needed functionality. This allows single-socket
UDP applications (such as OpenVPN) to work better on FreeBSD.

Requested by:	Gert Doering (gert greenie.net), pfsense
Tested by:	Gert Doering (gert greenie.net)
Reviewed by:	melifaro
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D24135
2020-08-07 15:13:53 +00:00
Andrey V. Elsukov ce8875b6c4 Fix typo.
Submitted by:	Evgeniy Khramtsov <evgeniy at khramtsov org>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D25932
2020-08-05 10:27:11 +00:00
Mark Johnston cfae6a92ac Remove an incorrect assertion from in6p_lookup_mcast_ifp().
The socket may be bound to an IPv4-mapped IPv6 address.  However, the
inp address is not relevant to the JOIN_GROUP or LEAVE_GROUP operations.

While here remove an unnecessary check for inp == NULL.

Reported by:	syzbot+d01ab3d5e6c1516a393c@syzkaller.appspotmail.com
Reviewed by:	hselasky
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25888
2020-08-04 15:00:02 +00:00
Mark Johnston 19afc65a7e ip6_output(): Check the return value of in6_getlinkifnet().
If the destination address has an embedded scope ID, make sure that it
corresponds to a valid ifnet before proceeding.  Otherwise a sendto()
with a bogus link-local address can trigger a NULL pointer dereference.

Reported by:	syzkaller
Reviewed by:	ae
Fixes:		r358572
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25887
2020-07-30 17:43:23 +00:00
Alexander V. Chernikov e1c05fd290 Transition from rtrequest1_fib() to rib_action().
Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
 in6_rtrequest, rtrequest_fib> and their uses and switch to
 to rib_action(). This is part of the new routing KPI.

Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25546
2020-07-21 19:56:13 +00:00
Alexander V. Chernikov 725871230d Temporarly revert r363319 to unbreak the build.
Reported by:	CI
Pointy hat to: melifaro
2020-07-19 10:53:15 +00:00
Alexander V. Chernikov 8cee15d9e4 Transition from rtrequest1_fib() to rib_action().
Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
 in6_rtrequest, rtrequest_fib> and their uses and switch to
to rib_action(). This is part of the new routing KPI.

Submitted by:	Neel Chauhan <neel AT neelc DOT org>
Differential Revision:	https://reviews.freebsd.org/D25546
2020-07-19 09:29:27 +00:00
Alexander V. Chernikov 4c7ba83f9d Switch inet6 default route subscription to the new rib subscription api.
Old subscription model allowed only single customer.

Switch inet6 to the new subscription api and eliminate the old model.

Differential Revision:	https://reviews.freebsd.org/D25615
2020-07-12 11:24:23 +00:00
Alexander V. Chernikov eddfb2e86f Fix IPv6 regression introduced by r362900.
PR:		kern/247729
2020-07-03 08:06:26 +00:00
Alexander V. Chernikov 6ad7446c6f Complete conversions from fib<4|6>_lookup_nh_<basic|ext> to fib<4|6>_lookup().
fib[46]_lookup_nh_ represents pre-epoch generation of fib api, providing less guarantees
 over pointer validness and requiring on-stack data copying.

With no callers remaining, remove fib[46]_lookup_nh_ functions.

Submitted by:	Neel Chauhan <neel AT neelc DOT org>
Differential Revision:	https://reviews.freebsd.org/D25445
2020-07-02 21:04:08 +00:00
Mark Johnston 95033af923 Add the SCTP_SUPPORT kernel option.
This is in preparation for enabling a loadable SCTP stack.  Analogous to
IPSEC/IPSEC_SUPPORT, the SCTP_SUPPORT kernel option must be configured
in order to support a loadable SCTP implementation.

Discussed with:	tuexen
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-06-18 19:32:34 +00:00
Michael Tuexen 70486b27ae Retire SCTP_SO_LOCK_TESTING.
This was intended to test the locking used in the MacOS X kernel on a
FreeBSD system, to make use of WITNESS and other debugging infrastructure.
This hasn't been used for ages, to take it out to reduce the #ifdef
complexity.

MFC after:		1 week
2020-06-07 14:39:20 +00:00
Ryan Moeller 78a3645fd2 Fix typo in previous commit
Applied the wrong patch

Reported by:	Michael Butler <imb@protected-networks.net>
Approved by:	mav (mentor)
Sponsored by:	iXsystems.com
2020-06-03 17:26:00 +00:00
Ryan Moeller f057d56c6c scope6: Check for NULL afdata before dereferencing
Narrows the race window with if_detach.

Approved by:	mav (mentor)
MFC after:	3 days
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D25017
2020-06-03 16:57:30 +00:00
Alexander V. Chernikov da187ddb3d * Add rib_<add|del|change>_route() functions to manipulate the routing table.
The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
 returned in case of successful operation result. There are two problems with
 this appoach. First is that it doesn't provide enough information for the
 upcoming multipath changes, where rtentry refers to a new nexthop group,
 and there is no way of guessing which paths were added during the change.
 Second is that some rtentry fields can change during notification and
 protecting from it by requiring customers to unlock rtentry is not desired.

Additionally, as the consumers such as rtsock do know which operation they
 request in advance, making explicit add/change/del versions of the functions
 makes sense, especially given the functions don't share a lot of code.

With that in mind, introduce rib_cmd_info notification structure and
 rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
 It will be used in upcoming generalized notifications.

* Move definitions of the new functions and some other functions/structures
 used for the routing table manipulation to a separate header file,
 net/route/route_ctl.h. net/route.h is a frequently used file included in
 ~140 places in kernel, and 90% of the users don't need these definitions.

Reviewed by:		ae
Differential Revision:	https://reviews.freebsd.org/D25067
2020-06-01 20:49:42 +00:00
Alexander V. Chernikov e7403d0230 Revert r361704, it accidentally committed merged D25067 and D25070. 2020-06-01 20:40:40 +00:00
Alexander V. Chernikov 79674562b8 * Add rib_<add|del|change>_route() functions to manipulate the routing table.
The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
 returned in case of successful operation result. There are two problems with
 this appoach. First is that it doesn't provide enough information for the
 upcoming multipath changes, where rtentry refers to a new nexthop group,
 and there is no way of guessing which paths were added during the change.
 Second is that some rtentry fields can change during notification and
 protecting from it by requiring customers to unlock rtentry is not desired.

Additionally, as the consumers such as rtsock do know which operation they
 request in advance, making explicit add/change/del versions of the functions
 makes sense, especially given the functions don't share a lot of code.

With that in mind, introduce rib_cmd_info notification structure and
 rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
 It will be used in upcoming generalized notifications.

* Move definitions of the new functions and some other functions/structures
 used for the routing table manipulation to a separate header file,
 net/route/route_ctl.h. net/route.h is a frequently used file included in
 ~140 places in kernel, and 90% of the users don't need these definitions.

Reviewed by:	ae
Differential Revision: https://reviews.freebsd.org/D25067
2020-06-01 20:32:02 +00:00
Alexander V. Chernikov a37a5246ca Use fib[46]_lookup() in mtu calculations.
fib[46]_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.

Conversion is straight-forwarded, as the only 2 differences are
requirement of running in network epoch and the need to handle
RTF_GATEWAY case in the caller code.

Differential Revision:	https://reviews.freebsd.org/D24974
2020-05-28 08:00:08 +00:00
Alexander V. Chernikov 1483c1c508 Replace ip6_ouput fib6_lookup_nh_<ext|basic> calls with fib6_lookup().
fib6_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.

Conversion is straight-forwarded, as the only 2 differences are
requirement of running in network epoch and the need to handle
RTF_GATEWAY case in the caller code.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24973
2020-05-28 07:29:44 +00:00
Alexander V. Chernikov 7bfc98af12 Switch gif(4) path verification to fib[46]_check_urfp().
fibX_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.
Use specialized fib[46]_check_urpf() from newer KPI instead,
to allow removal of older KPI.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24978
2020-05-28 07:26:18 +00:00
Alexander V. Chernikov 4d2c2509f2 Move <add|del|change>_route() functions to route_ctl.c in preparation of
multipath control plane changed described in D24141.

Currently route.c contains core routing init/teardown functions, route table
 manipulation functions and various helper functions, resulting in >2KLOC
 file in total. This change moves most of the route table manipulation parts
 to a dedicated file, simplifying planned multipath changes and making
 route.c more manageable.

Differential Revision:	https://reviews.freebsd.org/D24870
2020-05-23 19:06:57 +00:00
Alexander V. Chernikov 2bbab0af6d Use epoch(9) for rtentries to simplify control plane operations.
Currently the only reason of refcounting rtentries is the need to report
 the rtable operation details immediately after the execution.
Delaying rtentry reclamation allows to stop refcounting and simplify the code.
Additionally, this change allows to reimplement rib_lookup_info(), which
 is used by some of the customers to get the matching prefix along
 with nexthops, in more efficient way.

The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to
 nhop_priv to be able to reliably set curvnet even during vnet teardown.
Rest of the reference counting code will be removed in the D24867 .

Differential Revision:	https://reviews.freebsd.org/D24866
2020-05-23 10:21:02 +00:00
Mike Karels 2510235150 Allow TCP to reuse local port with different destinations
Previously, tcp_connect() would bind a local port before connecting,
forcing the local port to be unique across all outgoing TCP connections
for the address family. Instead, choose a local port after selecting
the destination and the local address, requiring only that the tuple
is unique and does not match a wildcard binding.

Reviewed by:	tuexen (rscheff, rrs previous version)
MFC after:	1 month
Sponsored by:	Forcepoint LLC
Differential Revision:	https://reviews.freebsd.org/D24781
2020-05-18 22:53:12 +00:00
Andrew Gallatin bc74b81991 IPv6: Fix a panic in the nd6 code with unmapped mbufs.
If the neighbor entry for an IPv6 TCP session using unmapped
mbufs times out, IPv6 will send an icmp6 dest. unreachable
message. In doing this, it will try to do a software checksum
on the reflected packet. If this is a TCP session using unmapped
mbufs, then there will be a kernel panic.

To fix this, just free packets with unmapped mbufs, rather
than sending the icmp.

Reviewed by:	np, rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D24821
2020-05-12 17:18:44 +00:00
Andrew Gallatin d7452d89ad IPv6: sync IP_NO_SND_TAG_RL support from IPv4
The IP_NO_SND_TAG_RL flag to ip{,6}_output() means that the packets
being sent should bypass hardware rate limiting. This is typically used
by modern TCP stacks for rexmits.

This support was added to IPv4 in r352657, but never added to IPv6, even
though rack and bbr call ip6_output() with this flag.

Reviewed by:	rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D24822
2020-05-12 14:01:12 +00:00
Andrew Gallatin 84af4cc153 Fix the build
Back out the IPv6 portion of r360903, as the stamp_tag param
is apparently not supported in upstream FreeBSD.

Sponsored by:	Netflix
Pointy hat to: gallatin
2020-05-11 21:23:22 +00:00
Andrew Gallatin 6043ac201a Ktls: never skip stamping tags for NIC TLS
The newer RACK and BBR TCP stacks have added a mechanism
to disable hardware packet pacing for TCP retransmits.
This mechanism works by skipping the send-tag stamp
on rate-limited connections when the TCP stack calls
ip_output() with the IP_NO_SND_TAG_RL flag set.

When doing NIC TLS, we must ignore this flag, as
NIC TLS packets must always be stamped.  Failure
to stamp a NIC TLS packet will result in crypto
issues.

Reviewed by:	hselasky, rrs
Sponsored by:	Netflix, Mellanox
2020-05-11 19:17:33 +00:00
Alexander V. Chernikov 9e02229580 Remove now-unused rt_ifp,rt_ifa,rt_gateway,rt_mtu rte fields.
After converting routing subsystem customers to use nexthop objects
 defined in r359823, some fields in struct rtentry became unused.

This commit removes rt_ifp, rt_ifa, rt_gateway and rt_mtu from struct rtentry
 along with the code initializing and updating these fields.

Cleanup of the remaining fields will be addressed by D24669.

This commit also changes the implementation of the RTM_CHANGE handling.
Old implementation tried to perform the whole operation under radix WLOCK,
 resulting in slow performance and hacks like using RTF_RNH_LOCKED flag.
New implementation looks up the route nexthop under radix RLOCK, creates new
 nexthop and tries to update rte nhop pointer. Only last part is done under
 WLOCK.
In the hypothetical scenarious where multiple rtsock clients
 repeatedly issue RTM_CHANGE requests for the same route, route may get
 updated between read and update operation. This is addressed by retrying
 the operation multiple (3) times before returning failure back to the
 caller.

Differential Revision:	https://reviews.freebsd.org/D24666
2020-05-04 14:31:45 +00:00
Gleb Smirnoff 7b6c99d08d Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf
within m_epg namespace.
All edits except the 'struct mbuf' declaration and mb_dupcl() were done
mechanically with sed:

s/->m_ext_pgs.nrdy/->m_epg_nrdy/g
s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g
s/->m_ext_pgs.trail_len/->m_epg_trllen/g
s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g
s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g
s/->m_ext_pgs.flags/->m_epg_flags/g
s/->m_ext_pgs.record_type/->m_epg_record_type/g
s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g
s/->m_ext_pgs.tls/->m_epg_tls/g
s/->m_ext_pgs.so/->m_epg_so/g
s/->m_ext_pgs.seqno/->m_epg_seqno/g
s/->m_ext_pgs.stailq/->m_epg_stailq/g

Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D24598
2020-05-03 00:12:56 +00:00
Alexander V. Chernikov 74787ef47b Add nhop to the ifa_rtrequest() callback.
With the upcoming multipath changes described in D24141,
 rt->rt_nhop can potentially point to a nexthop group instead of
 an individual nhop.
To simplify caller handling of such cases, change ifa_rtrequest() callback
 to pass changed nhop directly.

Differential Revision:	https://reviews.freebsd.org/D24604
2020-04-29 19:28:56 +00:00
Alexander V. Chernikov e7d8af4f65 Move route_temporal.c and route_var.h to net/route.
Nexthop objects implementation, defined in r359823,
 introduced sys/net/route directory intended to hold all
 routing-related code. Move recently-introduced route_temporal.c and
 private route_var.h header there.

Differential Revision:	https://reviews.freebsd.org/D24597
2020-04-28 19:14:09 +00:00
Alexander V. Chernikov fe6da72759 Move struct rtentry definition to nhop_var.h.
One of the goals of the new routing KPI defined in r359823
 is to entirely hide`struct rtentry` from the consumers.
It will allow to improve routing subsystem internals and deliver
 features much faster.

This is one of the last changes, effectively moving struct rtentry
 definition to a net/route_var.h header, internal to the routing subsystem.

Differential Revision:	https://reviews.freebsd.org/D24580
2020-04-28 18:42:30 +00:00
Alexander V. Chernikov 1b0051bada Eliminate now-unused parts of old routing KPI.
r360292 switched most of the remaining routing customers to a new KPI,
 leaving a bunch of wrappers for old routing lookup functions unused.

Remove them from the tree as a part of routing cleanup.

Differential Revision:	https://reviews.freebsd.org/D24569
2020-04-28 07:25:34 +00:00
Alexander V. Chernikov 55f57ca9ac Convert debugnet to the new routing KPI.
Introduce new fib[46]_lookup_debugnet() functions serving as a
special interface for the crash-time operations. Underlying
implementation will try to return lookup result if
datastructures are not corrupted, avoding locking.

Convert debugnet to use fib4_lookup_debugnet() and switch it
to use nexthops instead of rtentries.

Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D24555
2020-04-26 18:42:38 +00:00
Alexander V. Chernikov 49c9f84f54 Fix IPv6 link-local operations with RADIX_MPATH.
It was broken by r360292 as fib6_lookup() assumes de-embedded addresses
 while rtalloc_mpath_fib() requires sockaddr with embedded ones.

New fib6_lookup() transparently supports multipath, hence
 remove old RADIX_MPATH condition.
2020-04-26 18:07:35 +00:00
Alexander V. Chernikov 983066f05b Convert route caching to nexthop caching.
This change is build on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
 to perform packet forwarding such as gateway interface and mtu. Nexthops
 are shared among the routes, providing more pre-computed cache-efficient
 data while requiring less memory. Splitting the LPM code and the attached
 data solves multiple long-standing problems in the routing layer,
 drastically reduces the coupling with outher parts of the stack and allows
 to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
 for notably better performance for large TCP senders. Caching works by
 acquiring rtentry reference, which is protected by per-rtentry mutex.
 If the routing table is changed (checked by comparing the rtable generation id)
 or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
 cached object, which is mostly mechanical. Other moving parts like cache
 cleanup on rtable change remains the same.

Differential Revision:	https://reviews.freebsd.org/D24340
2020-04-25 09:06:11 +00:00
Alexander V. Chernikov aaad3c4fca Convert rtentry field accesses into nhop field accesses.
One of the goals of the new routing KPI defined in r359823 is to entirely
 hide`struct rtentry` from the consumers. It will allow to improve routing
 subsystem internals and deliver more features much faster.

This commit is mostly mechanical change to eliminate direct struct rtentry
 field accesses.

The only notable difference is AF_LINK gateway encoding.

AF_LINK gw is used in routing stack for operations with interface routes
 and host loopback routes.
In the former case it indicates _some_ non-NULL gateway, as the interface
 is the same as in rt_ifp in kernel and rtm_ifindex in rtsock reporting.
In the latter case the interface index inside gateway was used by the IPv6
 datapath to verify address scope for link-local interfaces.

Kernel uses struct sockaddr_dl for this type of gateway. This structure
 allows for specifying rich interface data, such as mac address and interface
 name. However, this results in relatively large structure size - 52 bytes.
Routing stack fils in only 2 fields - sdl_index and sdl_type, which reside
 in the first 8 bytes of the structure.

In the new KPI, struct nhop_object tries to be cache-efficient, hence
 embodies gateway address inside the structure. In the AF_LINK case it
 stores stortened version of the structure - struct sockaddr_dl_short,
 which occupies 16 bytes. After D24340 changes, the data inside AF_LINK
 gateway will not be used in the kernel at all, leaving rtsock as the only
 potential concern.

The difference in rtsock reporting:

(old)
got message of size 240 on Thu Apr 16 03:12:13 2020
RTM_ADD: Add Route: len 240, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks:  inits:
sockaddrs: <DST,GATEWAY,NETMASK>
 10.0.0.0 link#5 255.255.255.0

(new)
got message of size 200 on Sun Apr 19 09:46:32 2020
RTM_ADD: Add Route: len 200, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks:  inits:
sockaddrs: <DST,GATEWAY,NETMASK>
 10.0.0.0 link#5 255.255.255.0

Note 40 bytes different (52-16 + alignment).
However, gateway is still a valid AF_LINK gateway with proper data filled in.

It is worth noting that these particular messages (interface routes) are mostly
 ignored by routing daemons:
* bird/quagga/frr uses RTM_NEWADDR and ignores prefix route addition messages.
* quagga/frr ignores routes without gateway

More detailed overview on how rtsock messages are used by the
 routing daemons to reconstruct the kernel view, can be found in D22974.

Differential Revision:	https://reviews.freebsd.org/D24519
2020-04-23 08:04:20 +00:00
Alexander V. Chernikov d98351e13c Fix lookup key generation in fib6_check_urpf().
The version introduced in r359823 assumed D23051
 had been in tree already. As this is not the case yet,
 revert to sockaddr.
2020-04-19 07:27:12 +00:00
Jonathan T. Looney 5d6e356cb0 Avoid calling protocol drain routines more than once per reclamation event.
mb_reclaim() calls the protocol drain routines for each protocol in each
domain. Some protocols exist in more than one domain and share drain
routines. In the case of SCTP, it also uses the same drain routine for
its SOCK_SEQPACKET and SOCK_STREAM entries in the same domain.

On systems with INET, INET6, and SCTP all defined, mb_reclaim() calls
sctp_drain() four times. On systems with INET and INET6 defined,
mb_reclaim() calls tcp_drain() twice. mb_reclaim() is the only in-tree
caller of the pr_drain protocol entry.

Eliminate this duplication by ensuring that each pr_drain routine is only
specified for one protocol entry in one domain.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24418
2020-04-16 20:17:24 +00:00
Alexander V. Chernikov 539642a29d Add nhop parameter to rti_filter callback.
One of the goals of the new routing KPI defined in r359823 is to
 entirely hide`struct rtentry` from the consumers. It will allow to
 improve routing subsystem internals and deliver more features much faster.
This change is one of the ongoing changes to eliminate direct
 struct rtentry field accesses.

Additionally, with the followup multipath changes, single rtentry can point
 to multiple nexthops.

With that in mind, convert rti_filter callback used when traversing the
 routing table to accept pair (rt, nhop) instead of nexthop.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24440
2020-04-16 17:20:18 +00:00
Alexander V. Chernikov 53a4886d5d Convert ip6_forward() to the new routing KPI.
Update ip6_forward() internals to use deembedded IPv6 addresses
 to simplify calls to the new KPI and prepare for the future
 scope-embedding cleanup.

Add in6_get_unicast_scopeid() and in6_set_unicast_scopeid() scopeid
 operation functions tailored for unicast processing.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24334
2020-04-15 12:56:05 +00:00
Alexander V. Chernikov 9ac7c6cfed Convert IP/IPv6 forwarding, ICMP processing and IP PCB laddr selection to
the new routing KPI.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24245
2020-04-14 23:06:25 +00:00
Alexander V. Chernikov dd4776f0cc Reorganise nd6 notification code to avoid direct rtentry field access.
One of the goals of the new routing KPI defined in r359823 is to entirely hide
 `struct rtentry` from the consumers. Doing so will allow to improve routing
 subsystem internals and deliver features more easily. This change is one of
  the ongoing changes to eliminate direct struct rtentry field accesses.

It introduces rtfree_func() wrapper around RTFREE() and reorganises nd6 notification
 code to avoid accessing most of the rtentry fields.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24404
2020-04-14 22:48:33 +00:00
Andrew Gallatin 23feb56348 KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.
While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.

This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.

In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.

This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,

This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.

In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.

Many thanks to glebius for helping to make this better in
the Netflix tree.

Reviewed by:	hselasky, jhb, rrs, glebius (early version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D24213
2020-04-14 14:46:06 +00:00
Alexander V. Chernikov 6722086045 Plug netmask NULL check during route addition causing kernel panic.
This bug was introduced by the r359823.

Reported by:	hselasky
2020-04-14 13:12:22 +00:00
Alexander V. Chernikov 3133002560 Remove tcp_rtlookup6() function signature.
The function itself was removed in r122922 16 years ago.
2020-04-13 08:26:11 +00:00
Alexander V. Chernikov a666325282 Introduce nexthop objects and new routing KPI.
This is the foundational change for the routing subsytem rearchitecture.
 More details and goals are available in https://reviews.freebsd.org/D24141 .

This patch introduces concept of nexthop objects and new nexthop-based
 routing KPI.

Nexthops are objects, containing all necessary information for performing
 the packet output decision. Output interface, mtu, flags, gw address goes
 there. For most of the cases, these objects will serve the same role as
 the struct rtentry is currently serving.
Typically there will be low tens of such objects for the router even with
 multiple BGP full-views, as these objects will be shared between routing
 entries. This allows to store more information in the nexthop.

New KPI:

struct nhop_object *fib4_lookup(uint32_t fibnum, struct in_addr dst,
  uint32_t scopeid, uint32_t flags, uint32_t flowid);
struct nhop_object *fib6_lookup(uint32_t fibnum, const struct in6_addr *dst6,
  uint32_t scopeid, uint32_t flags, uint32_t flowid);

These 2 function are intended to replace all all flavours of
 <in_|in6_>rtalloc[1]<_ign><_fib>, mpath functions  and the previous
 fib[46]-generation functions.

Upon successful lookup, they return nexthop object which is guaranteed to
 exist within current NET_EPOCH. If longer lifetime is desired, one can
 specify NHR_REF as a flag and get a referenced version of the nexthop.
 Reference semantic closely resembles rtentry one, allowing sed-style conversion.

Additionally, another 2 functions are introduced to support uRPF functionality
 inside variety of our firewalls. Their primary goal is to hide the multipath
 implementation details inside the routing subsystem, greatly simplifying
 firewalls implementation:

int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid,
  uint32_t flags, const struct ifnet *src_if);
int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid,
  uint32_t flags, const struct ifnet *src_if);

All functions have a separate scopeid argument, paving way to eliminating IPv6 scope
 embedding and allowing to support IPv4 link-locals in the future.

Structure changes:
 * rtentry gets new 'rt_nhop' pointer, slightly growing the overall size.
 * rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz.

Old KPI:
During the transition state old and new KPI will coexists. As there are another 4-5
 decent-sized conversion patches, it will probably take a couple of weeks.
To support both KPIs, fields not required by the new KPI (most of rtentry) has to be
 kept, resulting in the temporary size increase.
Once conversion is finished, rtentry will notably shrink.

More details:
* architectural overview: https://reviews.freebsd.org/D24141
* list of the next changes: https://reviews.freebsd.org/D24232

Reviewed by:	ae,glebius(initial version)
Differential Revision:	https://reviews.freebsd.org/D24232
2020-04-12 14:30:00 +00:00
Alexander V. Chernikov c80b717f71 Remove RADIX_MPATH headers, they were unused since r293159.
MFC after:	2 weeks
2020-04-11 07:56:11 +00:00
Alexander V. Chernikov 4684d3cbcb Remove per-AF radix_mpath initializtion functions.
Split their functionality by moving random seed allocation
 to SYSINIT and calling (new) generic multipath function from
 standard IPv4/IPv5 RIB init handlers.

Differential Revision:	https://reviews.freebsd.org/D24356
2020-04-11 07:37:08 +00:00
Andrey V. Elsukov cfad769689 Ignore ND6 neighbor advertisement received for static link-layer entries.
Previously such NA could override manually created LLE.

Reported by:	Martin Beran <martin at mber cz>
Reviewed by:	melifaro
MFC after:	10 days
2020-04-01 02:13:01 +00:00
Mark Johnston 431f2b8712 Use a dedicated taskqueue thread for in6m_release_task().
Interfaces may be detached from a taskqueue_thread task, for example by
prison_complete(), so after r359438, when draining the queue we may end
up deadlocking.

Reported by:	Jenkins via lwhsu
MFC with:	r359438
2020-03-31 02:25:53 +00:00
Mark Johnston 9b1d850be8 Remove the "config" taskqgroup and its KPIs.
Equivalent functionality is already provided by taskqueue(9), just use
that instead.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-03-30 14:24:03 +00:00
Mark Johnston e02582d1ae Fix synchronization in the IPV6_2292PKTOPTIONS set handler.
The inpcb needs to be locked when we update output packet options.
Otherwise it is possible for the IPV6_2292PKTOPTIONS handler to free
packet option structures while another thread is reading or updating
them.

Note that the option handler is still kind of broken.  For instance it
frees all options before performing privilege checks for individual
options.  However, this can be fixed separately.

Reported by:	syzbot+52eb0fd4ddc119787f9d@syzkaller.appspotmail.com
Reviewed by:	bz, tuexen
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24125
2020-03-19 21:38:52 +00:00
Bjoern A. Zeeb 8483fce695 ip6: retire in6_selectroute_fib() as promised 8 years ago
In r231852 I added in6_selectroute_fib() as a compat function with the
fibnum as an extra argument compared to in6_selectroute() to keep the
KPI stable.
Way too late retire this function again and add the fib to in6_selectroute()
which also only has a single consumer now and was an orphan function before.
2020-03-03 13:48:12 +00:00
Bjoern A. Zeeb 000c42faf3 ip6_output: use new routing KPI when not passed a cached route
Implement the equivalent of r347375 (IPv4) for the IPv6 output path.
In IPv6 we get passed a cached route (and inp) by udp6_output()
depending on whether we acquired a write lock on the INP.
In case we neither bind nor connect a first UDP packet would come in
with a cached route (wlocked) and all further packets would not.
In case we bind and do not connect we never write-lock the inp.

When we do not pass in a cached route, rather than providing the
storage for a route locally and pass it over the old lookup code
and down the stack, use the new route lookup KPI and acquire all
details we need to send the packet.

Compared to the IPv4 code the IPv6 code has a couple of possible
complications: given an option with a routing hdr/caching route there,
and path mtu (ro_pmtu) case which now equally has to deal with the
possibility of having a route which is NULL passed in, and the
fwd_tag in case a firewall changes the next hop (something to
factor out in the future).

Sponsored by:	Netflix
Reviewed by:	glebius
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D23886
2020-03-03 11:32:47 +00:00
Bjoern A. Zeeb 5f3e375ed8 in6_fib: return nh_ia in the ext interface as we do for IPv4
Like for IPv4 add nh_ia to the ext interface and return rt_ifa
in order to be used for, e.g., packet/octets accounting purposes.

Reviewed by:	melifaro
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D23873
2020-03-03 09:50:33 +00:00
Bjoern A. Zeeb f6428cdb1f fib6_rte_to_nh_*: return a link-local gw address with scope embedded
In fib6_rte_to_nh_* when returning a link-local gateway address
currently we do clear the scope. That could be recovered using
the ifp returned as well, but the code in general seems to
expect a link-local address with scope embeedded as otherwise
the "dst" (gw) passed to the output routines will not include
scope and not send the packet out (the right interface).

Do not clear the scope when returning a link-local address and
allow packets to go out (the right interface).

Remove the (now) extra scope recovery in the IPv6 fast-fwd code.

Sponsored by:	Netflix
Reviewed by:	melifaro, ae
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D23872
2020-03-03 09:45:16 +00:00
Bjoern A. Zeeb f1db666a61 mld6: initialize oifp to avoid bogus results/panics in edge cases
In certain cases (probably not during normal operation but observed in
the lab during development) ip6_ouput() could return without error
and ifpp (&oifp) not updated.
Given oifp was never initialized we would take the later branch
as oifp was not NULL, and when calling icmp6_ifstat_inc() we would
panic dereferencing a garbage pointer.
For code stability initialize oifp to NULL before first use to always
have a deterministic value and not rely on a called function to behave
and always and for ever do the work for us as we hope for.

MFC after:	3 days
Sponsored by:	Netflix
2020-02-28 11:16:41 +00:00
Pawel Biernacki 7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Bjoern A. Zeeb 3db6053160 ip6_output: fix regression introduced in r358167 for ipv6 fragmentation
When moving the calculations for the optlen into the if (opt) block
which deals with possible extension headers I failed to initialise
unfragpartlen to the ipv6 header length if there were no extension
headers present.  Correct that mistake to make IPv6 fragment length
calculcations work again.

Reported by:	hselasky, kp
OKed by:	hselasky, kp
MFC after:	3 days
X-MFC with:	r358167
PR:		244393
2020-02-25 15:03:41 +00:00
Bjoern A. Zeeb 3459050c9a Fix IPv6 checksums when exthdrs are present.
In two places in ip6_output we are doing (delayed) checksum calculations.
The initial logic came from SCTP in r205075,205104 and later I copied
and adjusted it for the TCP|UDP case in r235958.
The problem was that the original SCTP offsets were already wrong for any
case with extension headers present given IPv6 extension headers are not
part of the pseudo checksum calculations.
The later changes do not help in case there is checksum offloading as for
extension headers (incl. fragments) we do currrently never offload as we
have no infrastructure to know whether the NIC can handle these cases.

Correct the offsets for delayed checksum calculations and properly handle
mbuf flags.  In addition harmonize the almost identical duplicate code.

While here eliminate the now unneeded variable hlen and add an always
missing mtod() call in the 1-b and 3 cases after the introduction of
the mb_unmapped_to_ext() calls.

Reported by:	Francis Dupont (fdupont isc.org)
PR:		243675
MFC after:	6 days
Reviewed by:	markj (earlier version), gallatin
Differential Revision:	https://reviews.freebsd.org/D23760
2020-02-24 19:12:20 +00:00
Pawel Biernacki 295a18d184 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (14 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Approved by:	kib (mentor, blanket)
Differential Revision:	https://reviews.freebsd.org/D23639
2020-02-24 10:47:18 +00:00
Bjoern A. Zeeb a1a6c01e41 ip6_output: improve extension header handling
Move IPv6 source address checks from after extension header heandling
to the top of the function. If we do not pass these checks there is
no reason to do a lot of work upfront.

Fold extension header preparations and length calculations together into
a single branch and macro rather than doing them sequentially.
Likewise move extension header concatination into a single branch block
only doing it if we recorded any extension header length length.

Reviewed by:	melifaro (earlier version), markj, gallatin
Sponsored by:	Netflix (partially, originally)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D23740
2020-02-20 10:56:12 +00:00