Commit graph

7247 commits

Author SHA1 Message Date
Richard Scheffenegger 7994ef3c39 Revert "tcp: move ECN handling code to a common file"
This reverts commit 0c424c90ea.
2022-02-05 01:07:51 +01:00
Richard Scheffenegger 0c424c90ea tcp: move ECN handling code to a common file
Reduce the burden to maintain correct and
extensible ECN related code across multiple
stacks and codepaths.

Formally no functional change.

Incidentally this establishes correct
ECN operation in one instance.

Reviewed By: rrs, #transport
Sponsored by:        NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34162
2022-02-04 22:54:41 +01:00
Sylvian Meygret cd7306bb1f ip_mroute: split mrouter interface deactivation and if_free
Move if_free outside MRW_LOCK. This silences the LOR message
that might appear during deinitialization.
2022-02-04 10:25:07 +01:00
Richard Scheffenegger fd723975ec tcp: fix typo in commit f026275e26
missed one bitmask inversion while committing D34148

Differential Revision: https://reviews.freebsd.org/D34148
Differential Revision: https://reviews.freebsd.org/D34160
2022-02-03 21:05:09 +01:00
Richard Scheffenegger 3b0ee68050 tcp: Prevent setting of ECN bits with setsockopt()
setsockopt() grants full access to the deprecated
TOS byte. For TCP, mask out the ECN codepoint, so that
only the DSCP portion can be adjusted.

Reviewed By: tuexen, hselasky, #manpages, #transport, debdrup
Sponsored by:        NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34154
2022-02-03 20:06:42 +01:00
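The masking described in the commit above can be illustrated with a minimal sketch. It assumes the usual layout of the former TOS byte (upper six bits DSCP, lower two bits ECN). The names sketch_mask_user_tos and the SKETCH_ constants are hypothetical and are not taken from the FreeBSD sources; the actual change may also handle the previously stored ECN bits differently.

```c
#include <stdint.h>

/* Assumed layout of the former TOS byte: DSCP in bits 7..2, ECN in bits 1..0. */
#define SKETCH_ECN_MASK		0x03
#define SKETCH_DSCP_MASK	0xfc

/*
 * Hypothetical helper: drop the ECN codepoint from a user-supplied
 * TOS value so that only the DSCP portion is taken from the caller.
 */
static inline uint8_t
sketch_mask_user_tos(uint8_t user_tos)
{
	return ((uint8_t)(user_tos & SKETCH_DSCP_MASK));
}
```

With this helper, a caller asking for 0xff would end up with 0xfc, leaving the two ECN bits for the TCP stack to manage.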
Richard Scheffenegger f026275e26 tcp: set IP ECN header codepoint properly
TCP RACK can cache the IP header while preparing
a new TCP packet for transmission. Thus all the
IP ECN codepoint bits need to be assigned, without
assuming a clear field beforehand.

Reviewed By: tuexen, kbowling, #transport
MFC after:   3 days
Sponsored by:        NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34148
2022-02-03 16:53:41 +01:00
Richard Scheffenegger 1ebf460758 tcp: Access all 12 TCP header flags via inline function
In order to consistently provide access to all
(including reserved) TCP header flag bits,
use an accessor function tcp_get_flags and
tcp_set_flags. Also expand any flag variable from
uint8_t / char to uint16_t.

Reviewed By: hselasky, tuexen, glebius, #transport
Sponsored by:        NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34130
2022-02-03 16:21:58 +01:00
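As a rough sketch of what accessors for all 12 flag bits might look like: the struct below is a simplified stand-in, not the real struct tcphdr, and assumes the four reserved bits are kept in a separate field from the eight classic flags. All sketch_ names are hypothetical.

```c
#include <stdint.h>

/* Simplified stand-in for the flag-carrying part of a TCP header. */
struct sketch_tcphdr {
	uint8_t	x2;	/* low 4 bits: reserved flag bits */
	uint8_t	flags;	/* the 8 classic flag bits */
};

/* Return all 12 flag bits as one 16-bit value. */
static inline uint16_t
sketch_get_flags(const struct sketch_tcphdr *th)
{
	return ((uint16_t)(((th->x2 & 0x0f) << 8) | th->flags));
}

/* Store a 12-bit flag value; bits above bit 11 are discarded. */
static inline void
sketch_set_flags(struct sketch_tcphdr *th, uint16_t flags)
{
	th->x2 = (uint8_t)((flags >> 8) & 0x0f);
	th->flags = (uint8_t)(flags & 0xff);
}
```

Widening flag variables to uint16_t, as the commit describes, is what lets callers carry the upper four bits around without truncation.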
Michael Tuexen d51c80351f rack: fix compilation and small cleanup
Fix a function prototype missed in the last commit and whitespace
change.
Sponsored by:	Netflix, Inc.
2022-02-02 09:41:40 +01:00
Michael Tuexen 3b3c08c135 tcp: cleanup functions related to socket option handling
Consistently pass only the inp and the sopt around. Don't pass the
so around, since in an upcoming commit tcp_ctloutput_set() will be
called from a context different from setsockopt(). Also expect
the inp to be locked when calling tcp_ctloutput_[gs]et(). This is
also required for the upcoming use by tcpsso, a command line tool
to set socket options.
Reviewed by:		glebius, rscheff
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D34151
2022-02-02 09:27:59 +01:00
Wojciech Macek 77223d98b6 ip_mroute: refactor epoch-based locking
Remove duplicated epoch_enter and epoch_exit in IP inp/outp routines.
Remove unnecessary macros as well.

Obtained from:		Semihalf
Sponsored by:		Stormshield
Reviewed by:		glebius
Differential revision:	https://reviews.freebsd.org/D34030
2022-02-02 06:48:05 +01:00
Richard Scheffenegger 93e28d6e89 tcp: LRO code to deal with all 12 TCP header flags
TCP per RFC793 has 4 reserved flag bits for future use. One
of those bits may be used for Accurate ECN.
This patch is to include these bits in the LRO code to ease
the extensibility if/when these bits are used.

Reviewed By: hselasky, rrs, #transport
Sponsored by:        NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34127
2022-02-01 18:41:36 +01:00
John Baldwin d782385e9b tcp_ratelimit: Handle some edge cases with TLS + RL send tags.
- After a connection has fallen back from NIC TLS to SW TLS, any
  pacing rate changes should modify the inpcb send tag even though
  SB_TLS_IFNET is set.

- If a connection tries to modify the pacing rate before the send
  tag has been converted from plain TLS to TLS + RL, don't fail
  the rate request set but let it fall through to setting the rate
  on the non-TLS inpcb RL tag.

Reviewed by:	gallatin, rrs, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34085
2022-01-31 16:40:04 -08:00
Gordon Bergling 4bd030b369 sctp(4): Fix a typo in an INVARIANTS panic message
- s/failes/fails/

MFC after:	1 week
2022-01-28 13:20:52 +01:00
Richard Scheffenegger 4531b3450b tcp: Tidying up the conditionals for unwinding a spurious RTO
- Use the semantically correct TSTMP_xx macro when comparing
  timestamps. (No functional change)
- check for bad retransmits only when TSopt is present in ACK
  (don't assume there will be a valid TSopt in the TCP options struct)
- exclude tsecr == 0, since that most likely indicates an
  invalid ts echo return (tsecr) value.

Reviewed By: tuexen, #transport
MFC after:   3 days
Sponsored by:        NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34062
2022-01-27 18:59:55 +01:00
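The three conditions listed in the commit above can be sketched as follows. The comparison macro uses wraparound-safe signed subtraction, which is the general idea behind TSTMP-style macros. The function name, its parameters, and the notion of a "bad retransmit window" timestamp are assumptions made for illustration, not the actual tcp_input code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is earlier than b" for 32-bit timestamps. */
#define SKETCH_TSTMP_LT(a, b) \
	((int32_t)((uint32_t)(a) - (uint32_t)(b)) < 0)

/*
 * Hypothetical check: treat a retransmit as spurious only if the ACK
 * carried a timestamp option, the echoed value is non-zero, and the
 * echoed value predates the retransmission.
 */
static inline bool
sketch_is_spurious_rto(bool has_tsopt, uint32_t tsecr, uint32_t rexmit_ts)
{
	if (!has_tsopt)
		return (false);		/* no TSopt in this ACK */
	if (tsecr == 0)
		return (false);		/* most likely an invalid echo */
	return (SKETCH_TSTMP_LT(tsecr, rexmit_ts));
}
```

A plain `<` comparison would give the wrong answer once the 32-bit timestamp clock wraps, which is why a dedicated macro is the semantically correct choice.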
Richard Scheffenegger 68e623c3f0 tcp: Rewind erroneous RTO only while performing RTO retransmissions
Under rare circumstances, a spurious retransmission is
incorrectly detected and rewound, messing up various tcpcb values,
which can lead to a panic when SACK is in use.

Reviewed By: tuexen, chengc_netapp.com, #transport
MFC after:   3 days
Sponsored by:        NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D33979
2022-01-27 18:49:42 +01:00
Andrew Gallatin 8a7404b2ae tcp: fix leaks in tcp_chg_pacing_rate error paths
tcp_chg_pacing_rate() is expected to release the hw rate limit table,
but failed to do so in several error cases, leading to ever
increasing counts of flows using the rate.

This patch was mostly done by rrs

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34058
Reviewed by: hselasky, rrs, jhb (initial version, outside of Differential)
2022-01-27 10:35:03 -05:00
Andrew Gallatin 9ba117960e Fix a memory leak when ip_output_send() returns EAGAIN due to send tag issues
When ip_output_send() returns EAGAIN due to issues with send tags (route
change, lagg failover, etc), it must free the mbuf. This is because
ip_output_send() was written as a wrapper/replacement for a direct
call to if_output(), and the contract with if_output() has
historically been that it owns the mbufs once called. When
ip_output_send() failed to free mbufs, it violated this assumption
and led to leaked mbufs.

This was noticed when using NIC TLS in combination with hardware
rate-limited connections. When lots of NIC output drops triggered
ratelimit send tag changes, we noticed we were leaking
ktls_sessions, send tags and mbufs. This was due to ip_output_send()
leaking mbufs which held references to ktls_sessions, which in
turn held references to send tags.

Many thanks to jhb, rrs, hselasky and markj for their help in
debugging this.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34054
Reviewed by: hselasky, jhb, rrs
MFC after: 2 weeks
2022-01-27 10:34:34 -05:00
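The ownership contract described in the commit above (the callee owns the buffer once called, on success and on failure alike) can be illustrated with a small userland sketch. It uses malloc/free in place of mbufs. Every name here is hypothetical, and the free counter exists only to make the behaviour observable.

```c
#include <errno.h>
#include <stdlib.h>

struct sketch_buf {
	void	*data;
};

static int sketch_frees;	/* number of buffers released so far */

static struct sketch_buf *
sketch_buf_alloc(void)
{
	struct sketch_buf *b = malloc(sizeof(*b));

	if (b != NULL)
		b->data = NULL;
	return (b);
}

static void
sketch_buf_free(struct sketch_buf *b)
{
	free(b->data);
	free(b);
	sketch_frees++;
}

/*
 * Hypothetical send wrapper: it consumes the buffer on every path,
 * so the caller must not touch or free it after the call returns.
 */
static int
sketch_output_send(struct sketch_buf *b, int tag_ok)
{
	if (!tag_ok) {
		sketch_buf_free(b);	/* error path must free as well */
		return (EAGAIN);
	}
	sketch_buf_free(b);		/* stand-in for the driver consuming it */
	return (0);
}
```

The leak the commit fixes corresponds to an error path that returns EAGAIN without the free.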
Gordon Bergling 9e58cca3e8 extra_tcp_stacks: Fix two typos in source code comments
- s/differnt/different/

MFC after:	3 days
2022-01-26 18:02:55 +01:00
Gordon Bergling b3df222eae extra_tcp_stacks: Fix a few common typos
TCP_BBR:
- Fix a typo introduced in 1b90dfa5d2, which was reported by tuexen@

TCP_RACK:
- Correct two sysctl descriptions: s/corret/correct/

tcp_bbr(4): Also fix s/measurment/measurement/ in the man page

MFC after:	1 week
2022-01-26 10:35:17 +01:00
Wojciech Macek 0daa28057c ip_mroute: add unlock in early-exit
Add missing unlock if V_ip_mrouter is not set

Obtained from:		Semihalf
2022-01-22 14:48:47 +01:00
Wojciech Macek 889c60500d ip_mroute: release epoch lock if mrouter is not configured
Add missing "else" branch to release a lock if mrouter is not
configured.

Obtained from:		Semihalf
Sponsored by:		Stormshield
2022-01-22 11:48:30 +01:00
Wojciech Macek 9ce46cbc95 ip_mroute: move ip_mrouter_done outside lock
X_ip_mrouter_done might sleep, which triggers INVARIANTS to
print additional errors on the screen.
Move it outside the lock, but provide some basic synchronization
to avoid a race condition during module uninit/unload.

Obtained from:		Semihalf
Sponsored by:		Stormshield
2022-01-21 06:17:19 +01:00
Wojciech Macek 58630bdd13 Revert "ip_mroute: do not call epoch_wait when lock is taken"
This reverts commit 2e72208b6c.
2022-01-21 06:17:19 +01:00
Randall Stewart aac52f94ea tcp: Warning cleanup from new compiler.
The clang compiler recently got an update that generates warnings for
variables that are set but never used. This revision goes through
the tcp stack and cleans all of those up.

Reviewed by: Michael Tuexen, Gleb Smirnoff
Sponsored by: Netflix Inc.
Differential Revision:
2022-01-18 07:41:18 -05:00
Marko Zec e7abe200c2 fib_algo: shift / mask by constants in dxr_lookup()
Since trie configuration remains invariant during each DXR instance
lifetime, instead of shifting and masking lookup keys by values
computed at runtime, compile upfront several dxr_lookup()
configurations with hardcoded shift / mask constants, and choose the
appropriate lookup function version after each DXR instance rebuild.

In synthetic tests this yields a small but measurable (5-10%) lookup
throughput improvement, depending on FIB size and prefix patterns.

MFC after:	3 days
2022-01-17 00:13:47 +01:00
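The general technique in the commit above (generate several variants with hardcoded constants, then pick one by function pointer after a rebuild) can be sketched as follows. This is not the DXR code: the index computation is reduced to a single shift, and the names and the supported bit widths (16 and 20) are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*sketch_index_fn)(uint32_t key);

/* Generic version: the shift is computed from a runtime value (1..32). */
static uint32_t
sketch_index_runtime(uint32_t key, unsigned direct_bits)
{
	return (key >> (32 - direct_bits));
}

/* Specialized versions: the shift is a compile-time constant. */
#define SKETCH_DEFINE_INDEX(bits)				\
static uint32_t							\
sketch_index_##bits(uint32_t key)				\
{								\
	return (key >> (32 - (bits)));				\
}

SKETCH_DEFINE_INDEX(16)
SKETCH_DEFINE_INDEX(20)

/* Choose the variant once per rebuild, not once per lookup. */
static sketch_index_fn
sketch_pick_index_fn(unsigned direct_bits)
{
	switch (direct_bits) {
	case 16:
		return (sketch_index_16);
	case 20:
		return (sketch_index_20);
	default:
		return (NULL);	/* no precompiled variant for this width */
	}
}
```

Because the configuration stays fixed for the lifetime of an instance, the selection cost is paid once per rebuild while every lookup benefits from the constant shift.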
Gleb Smirnoff 1d41a49404 tcp_usr_connect: report actual error code when stack requests drop 2022-01-13 10:32:41 -08:00
Ryan Stone 3284f4925f LRO: Don't merge ACK and non-ACK packets together
LRO was willing to merge ACK and non-ACK packets together.  This
can cause incorrect th_ack values to be reported up the stack.
While non-ACKs are quite unlikely to appear in practice, LRO's
behaviour is against the spec.  Make LRO unwilling to merge
packets with different TH_ACK flag values in order to fix the
issue.

Found by: Sysunit test
Differential Revision:	https://reviews.freebsd.org/D33775
Reviewed by: rrs
2022-01-13 11:17:58 -05:00
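The merge rule from the commit above amounts to comparing one flag bit. A minimal sketch, assuming the flags are available as plain integers; the function name is hypothetical and a real LRO path checks many other conditions besides this one.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_TH_ACK	0x10	/* ACK bit in the TCP flags field */

/*
 * Two segments are candidates for merging only if both have the
 * ACK flag set or both have it clear.
 */
static inline bool
sketch_lro_ack_compatible(uint16_t held_flags, uint16_t new_flags)
{
	return (((held_flags ^ new_flags) & SKETCH_TH_ACK) == 0);
}
```

XOR isolates the bits that differ between the two flag sets, so masking with the ACK bit tests exactly whether the ACK flag differs.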
Ryan Stone 24fe6643da LRO: Fix lost packets when merging 1 payload with an ACK
To check if it needed to regenerate a packet's header before
sending it up the stack, LRO was checking if more than one payload
had been merged into the packet.  This failed in the case where
a single payload was merged with one or more pure ACKs.  This
results in lost ACKs.

Fix this by precisely tracking whether header regeneration is
required instead of using an incorrect heuristic.

Found with: Sysunit test
Differential Revision:	https://reviews.freebsd.org/D33774
Reviewed by: rrs
2022-01-13 11:17:48 -05:00
Wojciech Macek 776c34f646 ip_mroute: remove unused variables
Sponsored by:	Stormshield
Obtained from:	Semihalf
2022-01-11 13:06:22 +01:00
Wojciech Macek 2e72208b6c ip_mroute: do not call epoch_wait when lock is taken
mrouter_done is called with the RAW IP lock taken. Some annoying
printfs are visible on the console if the INVARIANTS option is enabled.

Provide an atomic-based mechanism which counts entries to and exits from
the critical section in ip_input and ip_output.
Before de-initialization of function pointers ensure (with busy-wait)
that mrouter de-initialization is visible to all readers and that we don't
remove pointers (like ip_mforward etc.) in the middle of packet processing.
2022-01-11 11:19:32 +01:00
Wojciech Macek 68f28dd1cc ip_mroute: do not sleep when lock is taken
Kthread initialization calls uma_alloc which can sleep.
Modify the code to use deferred work instead.
2022-01-11 11:19:32 +01:00
Robert Wing eb18708ec8 syncache: accept packet with no SA when TCP_MD5SIG is set
When TCP_MD5SIG is set on a socket, all packets are dropped that don't
contain an MD5 signature. Relax this behavior to accept a non-signed
packet when a security association doesn't exist with the peer.

This is useful when a listen socket set with TCP_MD5SIG wants to handle
connections protected with and without MD5 signatures.

Reviewed by:	bz (previous version)
Sponsored by:   nepustil.net
Sponsored by:   Klara Inc.
Differential Revision:	https://reviews.freebsd.org/D33227
2022-01-08 16:32:14 -09:00
Michael Tuexen f87818eacf sctp: minor change due to upstreaming 2022-01-03 23:03:06 +01:00
Gleb Smirnoff afad340a14 inpcb: garbage collect INP_LOCK_INIT(), used only once in sctp
Reviewed by:		tuexen
Differential revision:	https://reviews.freebsd.org/D33543
2022-01-03 10:20:30 -08:00
Gleb Smirnoff fec8a8c7cb inpcb: use global UMA zones for protocols
Provide structure inpcbstorage, which holds zones and lock names for
a protocol.  Initialize it with global protocol init using macro
INPCBSTORAGE_DEFINE().  Then, at VNET protocol init supply it as
the main argument to in_pcbinfo_init().  Each VNET pcbinfo uses
its private hash, but they all use the same zone to allocate and the
same SMR section to synchronize.

Note: there is the kern.ipc.maxsockets sysctl, which controls the UMA
limit on the socket zone, which was always global.  Historically, the
same maxsockets value has also been applied to every PCB zone.
Important fact: you can't create a pcb without a socket!  A pcb may
outlive its socket, however.  Given that there are multiple protocols,
and only one socket zone, the per pcb zone limits seem to have little
value.  Under very special conditions a pcb zone limit may trigger a
little earlier than the socket zone limit, but in most setups the socket
zone limit will be triggered earlier.  When VIMAGE was added to the
kernel, PCB zones became per-VNET.  This magnified the existing
imbalance further: now we have multiple pcb zones in multiple vnets
limited to maxsockets, but every pcb requires a socket allocated from
the global zone also limited by maxsockets.  IMHO, this per pcb zone
limit doesn't bring any value, so this patch drops it.  If anybody
explains the value of this limit, it can be restored very easily - just
a 2-line change to in_pcbstorage_init().

Differential revision:	https://reviews.freebsd.org/D33542
2022-01-03 10:17:46 -08:00
Gleb Smirnoff 644ca0846d domains: make domain_init() initialize only global state
Now that each module handles its global and VNET initialization
itself, there is no VNET related stuff left to do in domain_init().

Differential revision:	https://reviews.freebsd.org/D33541
2022-01-03 10:15:22 -08:00
Gleb Smirnoff 89128ff3e4 protocols: init with standard SYSINIT(9) or VNET_SYSINIT
The historical BSD network stack loop that rolls over domains and
over protocols has no advantages over more modern SYSINIT(9).
While doing the sweep, split global and per-VNET initializers.

Getting rid of pr_init achieves several things:
o Get rid of ifdef's that protect against double foo_init() when
  both INET and INET6 are compiled in.
o Isolate initializers statically to the module they init.
o Make the code easier to understand and maintain.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D33537
2022-01-03 10:15:21 -08:00
Kristof Provost 80871aeb0f udp_var.h: other headers already include types.h
Pointed out by:	imp
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-01-03 18:35:02 +01:00
Kristof Provost aa70361d86 headers: make a few more headers self-contained
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-01-03 10:12:30 +01:00
Gordon Bergling 1b90dfa5d2 tcp_bbr(4): Fix a few typos in sysctl descriptions
- s/measurment/measurement/

MFC after:	3 days
2022-01-02 18:03:10 +01:00
Michael Tuexen 502d5e8500 sctp: improve counting of incoming chunks
MFC after:	3 days
2022-01-01 20:59:47 +01:00
Michael Tuexen 4760956e9a udp: use appropriate pcbinfo when signalling EHOSTDOWN
MFC after:	3 days
Sponsored by:	Netflix, Inc.
2022-01-01 19:17:17 +01:00
Michael Tuexen 430df2abee in_pcb: improve inp_next()
If there is no inp to check, exit the loop iterating through them.

Reported by:	syzbot+403406a9cbf082b36ea4@syzkaller.appspotmail.com
Reviewed by:	glebius
Sponsored by:	Netflix, Inc.
2022-01-01 19:04:10 +01:00
Michael Tuexen 1adb91e521 sctp: retire sctp_mtu_size_reset()
Thanks to Timo Voelker for making me aware that sctp_mtu_size_reset()
is very similar to sctp_pathmtu_adjustment().

MFC after:	3 days
2021-12-30 15:30:11 +01:00
Michael Tuexen 2de2ae331b sctp: improve sctp_pathmtu_adjustment()
Allow the resending of DATA chunks to be controlled by the caller,
which allows retiring sctp_mtu_size_reset() in a separate commit.
Also improve the computation of the overhead and use 32-bit integers
consistently.
Thanks to Timo Voelker for pointing me to the code.

MFC after:	3 days
2021-12-30 15:16:05 +01:00
Alexander V. Chernikov ff3a85d324 [lltable] Add per-family lltable getters.
Introduce a new function, lltable_get(), to retrieve lltable pointer
 for the specified interface and family.
Use it to avoid all-iftable list traversal when adding or deleting
 ARP/ND records.

Differential Revision: https://reviews.freebsd.org/D33660
MFC after:	2 weeks
2021-12-29 20:57:15 +00:00
Gleb Smirnoff 4287aa5619 tcp_usr_shutdown: don't cast inp_ppcb to tcpcb before checking inp_flags
While here, move one more erroneous condition out of the epoch and
common return.  The only functional change is that if we send control
on a shut down socket we would get EINVAL instead of ECONNRESET.

Reviewed by:	tuexen
Reported by:	syzbot+8388cf7f401a7b6bece6@syzkaller.appspotmail.com
Fixes:		f64dc2ab5b
2021-12-28 08:50:02 -08:00
Michael Tuexen a7ba00a438 sctp: minor improvements in sctp_get_frag_point
MFC after:	3 days
2021-12-28 10:23:31 +01:00
Michael Tuexen ca0dd19f09 sctp: check that the computed frag point is a multiple of 4
Reported by:	syzbot+5da189fc1fe80b31f5bd@syzkaller.appspotmail.com
MFC after:	3 days
2021-12-28 09:40:52 +01:00
Gleb Smirnoff 0af4ce4547 tcp_usr_shutdown: don't cast inp_ppcb to tcpcb before checking inp_flags
Fixes:	f64dc2ab5b
2021-12-27 16:58:09 -08:00