Commit graph

7890 commits

Author SHA1 Message Date
Richard Scheffenegger fcea1cc971 tcp: fix RTO ssthresh for non-6675 pipe calculation
Follow up on D43768 to properly deal with the non-default
pipe calculation. When CC_RTO is processed, the timeout
will have already pulled back snd_nxt. Further, snd_fack
is not pulled along with snd_una.

Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43876
2024-02-14 14:51:53 +01:00
Richard Scheffenegger 57e27ff07a tcp: partially undo D43792
At the destruction of the tcpcb, no timers are supposed to
be running. However, it turns out that stopping them in the
close() / shutdown() call does not have the desired effect
under all circumstances.

This partially reverts 62d47d73b7 to reduce the nuisance
caused.

PR:			277009
Reported-by:		syzbot+9a9aa434a14a2b35c3ba@syzkaller.appspotmail.com
Reported-by:		syzbot+e82856782410e895bae7@syzkaller.appspotmail.com
Reviewed By:		glebius, tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43855
2024-02-12 22:38:11 +01:00
Richard Scheffenegger 62d47d73b7 tcp: stop timers and clean scoreboard in tcp_close()
Stop timers when in tcp_close() instead of doing that in tcp_discardcb().
A connection in CLOSED state shall not need any timers. Assert that no
timer is rescheduled after that in tcp_timer_activate() and verify that
this is also the expected state in tcp_discardcb().

PR:			276761
Reviewed By:		glebius, tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43792
2024-02-10 10:30:00 +01:00
Richard Scheffenegger a8e817cf5c tcp: stop doing superfluous work after sending RST
When sending a RST control segment in tcp_output() it
means we are in TCPS_CLOSED state, called from tcp_drop().
Once the RST is sent, don't call tcp_timer_activate() or
update anything in tcpcb, since that will go away shortly.

PR:			276761
Provided by:		glebius
Reviewed By:		glebius, tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43808
2024-02-10 10:25:02 +01:00
Richard Scheffenegger 3eeb22cb81 tcp: clean scoreboard when releasing the socket buffer
The SACK scoreboard is conceptually an extension of the socket
buffer. Remove it when the socket buffer goes away with
soisdisconnected(). Verify that this is also the expected
state in tcp_discardcb().

PR:			276761
Reviewed by:		glebius, tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43805
2024-02-10 10:20:00 +01:00
Richard Scheffenegger 23c4f23247 tcp: ensure tcp_sack_partialack does not inflate cwnd after RTO
The implicit assumption of snd_nxt always being larger than
snd_recover is not true after RTO. In that case, cwnd
would get inflated to ssthresh, which may be much larger
than the current pipe (data in flight).

Reviewed By:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43653
2024-02-08 20:40:25 +01:00
Richard Scheffenegger 32a6df57df tcp: calculate ssthresh on RTO according to RFC5681
Per RFC5681, only adjust ssthresh on the initial
retransmission timeout. Since RTO often happens
during loss recovery, while cwnd no longer tracks
all data in flight, calculate pipe properly.

Reviewed By:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43768
2024-02-08 19:18:26 +01:00
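To illustrate the rule referenced by commit 32a6df57df, here is a minimal sketch of RFC 5681 equation (4), ssthresh = max(FlightSize / 2, 2 * SMSS), applied only on the first timeout. The function name, its parameters, and the rto_count convention are hypothetical and are not taken from the FreeBSD sources. The RFC states an upper bound; this sketch simply uses that bound.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the FreeBSD implementation.
 * pipe      - estimate of the data currently in flight, in bytes
 * smss      - sender maximum segment size, in bytes
 * rto_count - 1 for the first timeout of a segment, >1 for repeats
 */
static uint32_t
rto_ssthresh(uint32_t ssthresh, uint32_t pipe, uint32_t smss, int rto_count)
{
	uint32_t half_flight, min_thresh;

	/* Repeated timeouts for the same segment leave ssthresh alone. */
	if (rto_count > 1)
		return (ssthresh);

	/* RFC 5681, equation (4): max(FlightSize / 2, 2 * SMSS). */
	half_flight = pipe / 2;
	min_thresh = 2 * smss;
	return (half_flight > min_thresh ? half_flight : min_thresh);
}
```

Passing the pipe estimate rather than cwnd reflects the point made in the commit message: during loss recovery cwnd no longer tracks all data in flight.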
Richard Scheffenegger 1adab814e8 tcp: use tcp_fixed_maxseg instead of tcp_maxseg in cc modules
tcp_fixed_maxseg() is the streamlined calculation of typical
tcp options and more suitable for heavy use in the congestion
control modules on every received packet.

No external functional change.

Reviewed By:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43779
2024-02-08 18:36:59 +01:00
Gleb Smirnoff ce69e37369 Revert "sockets: retire sorflush()"
Provide a comment in sorflush() why the socket I/O sx(9) lock is actually
important.

This reverts commit 507f87a799.
2024-02-03 13:08:41 -08:00
Gleb Smirnoff f79a8585bb sockets: garbage collect SS_ISCONFIRMING
Fixes:	8df32b19de
2024-01-30 10:38:33 -08:00
Michael Tuexen f30c7d5654 TCP LRO: convert TCP header fields to host byte order earlier
This is a preparation for adding dtrace hooks in a follow-up commit,
which are missing in the code path, where packets are directly queued
to the tcpcb. The dtrace hooks expect the fields to be in host byte
order. This only applies when TCP HPTS is used.
No functional change intended.

Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43594
2024-01-29 18:52:17 +01:00
Kristof Provost ffeab76b68 pfil: PFIL_PASS never frees the mbuf
pfil hooks (i.e. firewalls) may pass, modify or free the mbuf passed
to them. (E.g. when rejecting a packet, or when gathering up packets
for reassembly).

If the hook returns PFIL_PASS the mbuf must still be present. Assert
this in pfil_mem_common() and ensure that ipfilter follows this
convention. pf and ipfw already did.
Similarly, if the hook returns PFIL_DROPPED or PFIL_CONSUMED the mbuf
must have been freed (or now be owned by the firewall for further
processing, like packet scheduling or reassembly).

This allows us to remove a few extraneous NULL checks.

Suggested by:	tuexen
Reviewed by:	tuexen, zlei
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D43617
2024-01-29 14:10:19 +01:00
Richard Scheffenegger 0b3f9e435f tcp: move cc_post_recovery past snd_una update
The RFC6675 pipe calculation (sack.revised, enabled
by default since D28702) uses outdated information,
while the previous default calculated it correctly
with up-to-date information from the incoming ACK.

This difference can become as large as the receive
window (not the congestion window previously),
potentially triggering a massive burst of new packets.

MFC after:             1 week
Reviewed By:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43520
2024-01-28 00:18:51 +01:00
Mark Johnston bbf86c65d0 netinet: Remove stale references to Giant from comments
MFC after:	1 week
2024-01-27 13:51:13 -05:00
Richard Scheffenegger 2d05a1c81b tcp: commonize check for more data to send, style changes
Use SEQ_SUB instead of a plain subtraction, for an implicit
type conversion and prevention of a possible overflow.
Use curly brackets in stacked if statements throughout.
Use the ? operator to enhance readability when clearing
the FIN flag in tcp_output().

None of the above changes the functionality.

Reviewed By:           tuexen, cc, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43539
2024-01-26 01:20:35 +01:00
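The SEQ_SUB point in commit 2d05a1c81b can be sketched as follows. The macro below is a local stand-in assumed to have the same shape as the kernel's SEQ_SUB(); the helper names are hypothetical. It shows why a signed 32-bit result matters once sequence numbers wrap. The unsigned-to-signed cast relies on the two's-complement wrap that common compilers provide.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Local stand-in, assumed to mirror the kernel macro of the same name. */
#define SEQ_SUB(a, b)	((int32_t)((a) - (b)))

/* Signed distance between two sequence numbers, safe across wraparound. */
static int32_t
seq_distance(tcp_seq a, tcp_seq b)
{
	return (SEQ_SUB(a, b));
}

/* Plain subtraction in a wider type: misleading once the space wraps. */
static int64_t
plain_distance(tcp_seq a, tcp_seq b)
{
	return ((int64_t)a - (int64_t)b);
}
```

With a = 0x00000010 and b = 0xfffffff0, the sequence space has wrapped and a is 32 bytes ahead of b; only the SEQ_SUB form reports that.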
Richard Scheffenegger fc262fd3dc tcp: AccECN access ACE field by shifting bits
Shifting bits is quicker than checking header flag bits
one by one. Also improve readability by the use of switch
statements.

No change in behaviour.

Reviewed By:           glebius, tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43560
2024-01-26 00:16:22 +01:00
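A minimal sketch of the idea in commit fc262fd3dc: the three-bit ACE counter occupies the AE, CWR and ECE positions of the 12-bit TCP flags field, so it can be read with one mask and one shift instead of three separate tests. The flag values follow the TCP header layout; ACE_MASK, ACE_SHIFT and both function names are hypothetical and not copied from the FreeBSD sources.

```c
#include <stdint.h>

#define TH_ECE		0x040
#define TH_CWR		0x080
#define TH_AE		0x100

/* Hypothetical names for this sketch. */
#define ACE_MASK	(TH_AE | TH_CWR | TH_ECE)
#define ACE_SHIFT	6

/* Read the ACE counter with a single mask and shift. */
static unsigned
ace_by_shift(uint16_t flags)
{
	return ((flags & ACE_MASK) >> ACE_SHIFT);
}

/* Equivalent bit-by-bit version, shown for comparison. */
static unsigned
ace_by_tests(uint16_t flags)
{
	unsigned ace = 0;

	if (flags & TH_ECE)
		ace |= 1;
	if (flags & TH_CWR)
		ace |= 2;
	if (flags & TH_AE)
		ace |= 4;
	return (ace);
}
```

The resulting value in the range 0 to 7 is a natural operand for a switch statement, which is the readability gain the commit message mentions.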
Richard Scheffenegger 0932fb565a tcp: fix TCPSTAT accounting for SACK
Account for SACK retransmitted bytes once the actual length
is known. This prevents a call to tcp_maxseg() and prepares
for TSO support when transmitting from the SACK scoreboard.

Reviewed By:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43447
2024-01-25 22:58:33 +01:00
Richard Scheffenegger c7c325d01d tcp: pass maxseg around instead of calculating locally
Improve slowpath processing (reordering, retransmissions)
slightly by calculating maxseg only once. This typically
saves one of two calls to tcp_maxseg().

Reviewed By:           glebius, tuexen, cc, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43536
2024-01-24 16:43:29 +01:00
Gleb Smirnoff 90ad2dc287 tcp: remove 20+ year old disabled code from d912c694ee
2024-01-23 13:16:34 -08:00
Gleb Smirnoff c809435b18 tcp: clear outdated comment mentioning T/TCP
2024-01-23 12:59:21 -08:00
Gleb Smirnoff e21c668719 tcp: pass positive errno to tcp_drop()
Fixes:	446ccdd08e
2024-01-23 12:59:21 -08:00
Gordon Bergling 9b035689f1 tcp_fastopen: Fix a typo in a source code comment
- s/posession/possession/

MFC after:	3 days
2024-01-22 21:49:47 +01:00
Gleb Smirnoff 7f3184ba79 tcp: remove outdated comment
This paragraph should have been removed in 446ccdd08e.
2024-01-22 12:42:21 -08:00
Gordon Bergling ef0ac0a1ad tcp_hpts: Fix a typo of a function name in a comment
- s/tcp_ouput/tcp_output/

MFC after:	3 days
2024-01-20 17:29:28 +01:00
Richard Scheffenegger dfe30e4196 tcp: remove unused tcp_sack_output_debug() function
This debugging code has been lingering for years with
no known use.

No functional change.

Reviewed by:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43511
2024-01-19 14:48:32 +01:00
Gleb Smirnoff a079c891c0 sctp: restore missing inpcb lock
Fixes:	5bba272807
Reported-by: syzbot+b8636c973dc20fea4a9b@syzkaller.appspotmail.com
Reported-by: syzbot+d76a18ee8bbe6f7d3056@syzkaller.appspotmail.com
2024-01-16 23:11:27 -08:00
Xavier Beaudouin 80044c785c Add UDP encapsulation of ESP in IPv6
This patch provides UDP encapsulation of ESP packets over IPv6.
Ports the IPv4 code to IPv6 and adds support for IPv6 in udpencap.c
As required by the RFC and unlike in IPv4 encapsulation,
UDP checksums are calculated.

Co-authored-by:	Aurelien Cazuc <aurelien.cazuc.external@stormshield.eu>
Sponsored-by:	Stormshield
Sponsored-by:	Wiktel
Sponsored-by:	Klara, Inc.
2024-01-16 20:44:34 +00:00
Gleb Smirnoff 507f87a799 sockets: retire sorflush()
With removal of dom_dispose method the function boils down to two
meaningful function calls: socantrcvmore() and sbrelease().  The latter is
only relevant for protocols that use generic socket buffers.

The socket I/O sx(9) lock acquisition in sorflush() is not relevant for
shutdown(2) operation as it doesn't do any I/O that may interleave with
read(2) or write(2).  The socket buffer mutex acquisition inside
sbrelease() is what guarantees thread safety.  This sx(9) acquisition in
soshutdown() can be tracked down to 4.4BSD times, where it used to be
sblock(), and it was carried over through the years evolving together with
sockets with no reconsideration of why we carry it over.  I can't tell
if that sblock() made sense back then, but it doesn't make any today.

Reviewed by:		tuexen
Differential Revision:	https://reviews.freebsd.org/D43415
2024-01-16 10:30:49 -08:00
Gleb Smirnoff 5bba272807 sockets: make pr_shutdown fully protocol specific method
Disassemble a one-for-all soshutdown() into protocol specific methods.
This creates a small amount of copy & paste, but makes code a lot more
self documented, as protocol specific method would execute only the code
that is relevant to that protocol and nothing else.  This also fixes a
couple recent regressions and reduces risk of future regressions.  The
extended KPI for the new pr_shutdown removes need for the extra pr_flush
which was added for the sake of SCTP which could not perform its shutdown
properly with the old one.  Particularly for SCTP this change streamlines
a lot of code.

Some notes on why certain parts of code were copied or were not to certain
protocols:
* The (SS_ISCONNECTED | SS_ISCONNECTING | SS_ISDISCONNECTING) check is
  needed only for those protocols that may be connected or disconnected.
* The above reduces into only SS_ISCONNECTED for those protocols that
  always connect instantly.
* The ENOTCONN and continue processing hack is left only for datagram
  protocols.
* The SOLISTENING(so) block is copied to those protocols that listen(2).
* sorflush() on SHUT_RD is copied almost to every protocol, but that
  will be refactored later.
* wakeup(&so->so_timeo) is copied to protocols that can make a non-instant
  connect(2), can SO_LINGER or can accept(2).

There are three protocols (netgraph(4), Bluetooth, SDP) that did not have
pr_shutdown, but old soshutdown() would still perform sorflush() on
SHUT_RD for them and also wakeup(9).  Those protocols partially supported
shutdown(2), returning EOPNOTSUPP for SHUT_WR/SHUT_RDWR; now they have fully lost
shutdown(2) support.  I'm pretty sure netgraph(4) and Bluetooth are okay
about that and SDP is almost abandoned anyway.

Reviewed by:		tuexen
Differential Revision:	https://reviews.freebsd.org/D43413
2024-01-16 10:30:37 -08:00
Gleb Smirnoff d4033ebd05 divert: just return EOPNOTSUPP on shutdown(2)
Before this change we would always return ENOTCONN.  There is no
legitimate use of shutdown(2) on divert(4).
2024-01-12 02:04:04 -08:00
Michael Tuexen 13720136fb tcpsso: fix when used without -i option
Since fdb987bebd it is not possible anymore to use inp_next
iterator for bound, but unconnected sockets. This applies
to TCP listening sockets. Therefore the mentioned commit broke
tcpsso on listening sockets if the -i option was not used.
Fix this by iterating through all endpoints instead of only
through the bound, but unconnected ones.

Reviewed by:		markj
Fixes:			fdb987bebd ("inpcb: Split PCB hash tables")
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43353
2024-01-10 08:33:09 +01:00
John Baldwin 8cb9b68f58 sys: Use mbufq_empty instead of comparing mbufq_len against 0
Reviewed by:	bz, emaste
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D43338
2024-01-09 11:00:46 -08:00
Richard Scheffenegger 429f14f83a tcp: clean PRR state after ECN congestion recovery.
PRR state was not properly reset on subsequent ECN CE
events. Clean up after local transmission failures too.

Reviewed by:           tuexen, cc, #transport
MFC after:             3 days
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43170
2024-01-08 10:53:04 +01:00
Richard Scheffenegger f4574e2dc5 tcp: prevent spurious empty segments and fix uncommon panic
Only try sending more data on pure ACKs when there is
more data available in the send buffer.

In the case of a retransmitted SYN not being sent due to
an internal error, the snd_una/snd_nxt accounting could
be off, leading to a panic. Pulling snd_nxt up to snd_una
prevents this from happening.

Reported by:           fengdreamer@126.com
Reviewed by:           cc, tuexen, #transport
MFC after:             1 week
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43343
2024-01-08 10:52:49 +01:00
Richard Scheffenegger 30409ecdb6 tcp: do not purge SACK scoreboard on first RTO
Keeping the SACK scoreboard intact after the first RTO
and retransmitting all data anew only on subsequent RTOs
allows a more timely and efficient loss recovery under
many adverse circumstances.

Reviewed By:           tuexen, #transport
MFC after:             10 weeks
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D42906
2024-01-06 20:25:38 +01:00
Richard Scheffenegger 893ed42eca tcp: Make use of enum for sack_changed
No functional change.

Reviewed By:           tuexen, #transport
MFC after:             3 days
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43346
2024-01-06 20:23:52 +01:00
Michael Tuexen aa1223ac3a tcp: limit visibility of symbols
Put most symbols under __BSD_VISIBLE and limit the namespace of
tcp_[gs]et_flags.

Reviewed by:		kib, karels, rscheff
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43245
2024-01-06 12:00:38 +01:00
Jose Luis Duran b0e13f785b netinet: Define IPv6 ECN mask
Define a mask for the code point used for ECN in the Traffic Class field
(2 bits) of an IPv6 header.

     BE:    0       0       3       0       0       0       0       0
    Bit: 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |Version| Traffic Class |           Flow Label                  |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                              ...                              |

For BE (Big Endian), or network-byte order, this corresponds to 0x00300000.
For Little Endian, it corresponds to 0x00003000.

Reviewed by:	imp, markj
MFC after:	1 week
Pull Request:	https://github.com/freebsd/freebsd-src/pull/879
2024-01-03 12:56:28 -05:00
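The two mask values in commit b0e13f785b can be checked with a small sketch. It reads the first four bytes of an IPv6 header once as a big-endian word and once as a little-endian word, then applies the corresponding mask from the commit message. The function names and the shift amounts used to normalise the result are assumptions made for this illustration.

```c
#include <stdint.h>

/* Assemble a 32-bit value from four bytes, most significant first. */
static uint32_t
load_be32(const uint8_t *p)
{
	return (((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) | (uint32_t)p[3]);
}

/* Assemble a 32-bit value from four bytes, least significant first. */
static uint32_t
load_le32(const uint8_t *p)
{
	return (((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
	    ((uint32_t)p[1] << 8) | (uint32_t)p[0]);
}

/* ECN code point using the big-endian (network order) mask. */
static unsigned
ipv6_ecn_be(const uint8_t *hdr)
{
	return ((load_be32(hdr) & 0x00300000u) >> 20);
}

/* Same code point using the little-endian mask on the same bytes. */
static unsigned
ipv6_ecn_le(const uint8_t *hdr)
{
	return ((load_le32(hdr) & 0x00003000u) >> 12);
}
```

For header bytes 0x60 0x30 0x00 0x00 (version 6, Traffic Class 0x03), both functions yield 3, the CE code point.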
Richard Kümmel 7df9da47e8 Fix udp IPv4-mapped address
Do not use the cached route if the destination isn't the same.
This fixes a problem where a UDP packet will be sent via the wrong route
and interface if a previous one was sent via them.

PR:	275774
Reviewed by:	glebius, tuexen
Sponsored by:	Beckhoff Automation GmbH & Co. KG
2024-01-02 07:49:12 +01:00
Michael Tuexen 642ac6015b tcp: fix ports
inline is only supported in C99 and newer. To also support C89, use
__inline instead as suggested by dim.

Reported by:		eduardo
Reviewed by:		rscheff, markj, dim, imp
Tested by:		eduardo
Fixes:			a8b70cf260 ("netpfil: Use accessor functions and named constants for all tcphdr flags")
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43231
2023-12-30 03:28:13 +01:00
John Baldwin f7d5900aa0 sys: Style fix for M_EXT | M_EXTPG
Add a space around the | operator in places testing for either M_EXT
or M_EXTPG.

Reviewed by:	imp, glebius
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D43216
2023-12-28 11:17:59 -08:00
Gleb Smirnoff 4a0c6403b0 inpcb: poison several inpcb pointers in in_pcbfree()
There are a few subsystems that reference inpcb and allow it to outlive
in_pcbfree().  There are no known bugs where they dereference the
options pointers of a freed inpcb.  Enforce this so that such bugs
don't appear in the future.

Reviewed by:		markj
Differential Revision:	https://reviews.freebsd.org/D43134
2023-12-27 08:34:37 -08:00
Gleb Smirnoff a13039e270 inpcb: reorder inpcb destruction
First, merge in_pcbdetach() with in_pcbfree().  The comment for
in_pcbdetach() was no longer correct.  Then, make sure we remove
the inpcb from the hash before we commit any destructive actions
on it.  There are a couple of functions that rely on the hash lock
skipping SMR + inpcb lock to lookup an inpcb.  Although there are
no known functions that similarly rely on the global inpcb list
lock, also do list removal before destructive actions.

PR:			273890
Reviewed by:		markj
Differential Revision:	https://reviews.freebsd.org/D43122
2023-12-27 08:34:37 -08:00
Gordon Bergling 7b0b448ba9 tcp_stacks: Fix two typos in source code comments
- s/recieved/received/

MFC after:	3 days
2023-12-27 09:36:30 +01:00
Richard Scheffenegger a8b70cf260 netpfil: Use accessor functions and named constants for all tcphdr flags
Update all remaining references to the struct tcphdr th_x2 field.
This completes the compatibility of various aspects with AccECN
(TH_AE), after the internal ipfw "re-checksum required" was moved
to use the TH_RES1 flag.

No functional change.

Reviewed By:           tuexen, #transport, glebius
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43172
2023-12-25 13:18:01 +01:00
Gleb Smirnoff 08c33cd94d hpts: avoid duplicate call to tcp_output()
Obtained from:	rrs
2023-12-26 13:09:09 -08:00
Richard Scheffenegger 8717c306bd tcp: allow userspace use of tcp header flags accessor functions
Provide accessor functions to all 12 possible TCP header
flags for userspace too.

Reviewed By:           zlei
MFC after:             2 weeks
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43152
2023-12-22 02:20:29 +01:00
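As a rough illustration of what accessor functions for 12 header flags (commit 8717c306bd) might look like: a sketch in which the low eight flag bits and the upper four bits are stored in separate fields and are combined on read. The struct layout and the names mini_tcphdr, get_flags and set_flags are hypothetical simplifications, not the real struct tcphdr or the functions added by the commit.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for the flag-carrying header fields. */
struct mini_tcphdr {
	uint8_t	x2;	/* upper four flag bits (only the low nibble is used) */
	uint8_t	flags;	/* lower eight flag bits */
};

/* Combine both fields into one 12-bit flags value. */
static uint16_t
get_flags(const struct mini_tcphdr *th)
{
	return ((uint16_t)(((th->x2 & 0x0f) << 8) | th->flags));
}

/* Split a 12-bit flags value back into the two fields. */
static void
set_flags(struct mini_tcphdr *th, uint16_t flags)
{
	th->x2 = (flags >> 8) & 0x0f;
	th->flags = flags & 0xff;
}
```

Callers that go through accessors of this kind never need to know which field holds a given flag, which is what lets a ninth flag such as AE be handled uniformly.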
Gleb Smirnoff 513f2e2e71 tcp: always set tcp_tun_port to a correct value
The tcp_tun_port field that is used to pass the port value between UDP
and TCP in case of tunneling is a generic field that is used to pass
data between network layers.  It can be contaminated on entry, e.g.
by a VLAN tag set by a NIC driver.  Explicitly set it, so that it
is zeroed out in normal, non-tunneled TCP.  If it contains garbage,
tcp_twcheck() can later enter the wrong block of code and treat the packet
as an incorrectly tunneled one.  On main and stable/14 that will end up
with sending incorrect responses, but on stable/13 with ipfw(8) and
pcb-matching rules it may end up in a panic.

This is a minimal conservative patch to be merged to stable branches.
Later we may redesign this.

PR:			275169
Reviewed by:		tuexen
Differential Revision:	https://reviews.freebsd.org/D43065
2023-12-19 11:24:17 -08:00
Gleb Smirnoff 48b55a7c7b tcp_hpts: make the module unloadable
Although the HPTS subsystem wasn't initially designed as a loadable
module, now it is so.  Make it possible to also unload it, but for
safety reasons hide that under 'kldunload -f'.

Reviewed by:		tuexen
Differential Revision:	https://reviews.freebsd.org/D43092
2023-12-19 10:21:56 -08:00
Gleb Smirnoff 175d4d6988 tcp_hpts: use tcp_pace.cts_last_ran for last ran table
Remove the global cts_last_ran and use already existing unused field of
struct tcp_hptsi, which seems originally planned to hold this table.  This
makes it consistent with other malloc-ed tables, like main array of HPTS
entities and CPU groups.

Reviewed by:		tuexen
Differential Revision:	https://reviews.freebsd.org/D43091
2023-12-19 10:21:56 -08:00