system/freebsd-src

mirror of https://github.com/freebsd/freebsd-src synced 2024-10-16 13:23:36 +00:00

Author	SHA1	Message	Date
Gleb Smirnoff	0f617ae48a	Add in_pcb_var.h for KPIs that are private to in_pcb.c and in6_pcb.c.	2021-10-18 10:19:57 -07:00
Gleb Smirnoff	744a64bd92	in_pcb: garbage collect in_pcbrele()	2021-10-18 10:07:16 -07:00
Gleb Smirnoff	5a78df20ce	in_pcb: garbage collect unused structure in_pcblist	2021-10-18 10:06:39 -07:00
Maxim Sobolev	461e6f23db	Fix fragmented UDP packets handling since rev.360967. Consider IP_MF flag when checking length of the UDP packet to match the declared value. Sponsored by: Sippy Software, Inc. Differential Revision: https://reviews.freebsd.org/D32363 MFC after: 2 weeks	2021-10-15 16:48:12 -07:00
Gleb Smirnoff	2144431c11	Remove in_ifaddr_lock acquisition to access in_ifaddrhead. An IPv4 address is embedded into an ifaddr which is freed via epoch. And the in_ifaddrhead is already a CK list. Use the network epoch to protect against use after free. Next step would be to CK-ify the in_addr hash and get rid of the... Reviewed by: melifaro Differential Revision: https://reviews.freebsd.org/D32434	2021-10-13 10:04:46 -07:00
Marko Zec	bc8b8e106b	[fib_algo][dxr] Retire counters which are no longer used The number of chunks can still be tracked via vmstat -z|fgrep dxr. MFC after: 3 days	2021-10-09 13:47:10 +02:00
Marko Zec	1549575f22	[fib_algo][dxr] Improve incremental updating strategy Tracking the number of unused holes in the trie and the range table was a bad metric based on which full trie and / or range rebuilds were triggered, which would happen in vain by far too frequently, particularly with live BGP feeds. Instead, track the total unused space inside the trie and range table structures, and trigger rebuilds if the percentage of unused space exceeds a sysctl-tunable threshold. MFC after: 3 days PR: 257965	2021-10-09 13:22:27 +02:00
Michael Tuexen	bd19202c92	sctp: improve KASSERT messages MFC after: 1 week	2021-10-08 11:33:56 +02:00
Michael Tuexen	3ff3733991	sctp: don't keep being locked on a stream which is removed Reported by: syzbot+f5f551e8a3a0302a4914@syzkaller.appspotmail.com MFC after: 1 week	2021-10-02 00:48:01 +02:00
Randall Stewart	a36230f75e	tcp: Make dsack stats available in netstat and also make sure it's aware of TLPs. DSACK accounting has been for quite some time under a NETFLIX_STATS ifdef. Statistics on DSACKs however are very useful in figuring out how many bad retransmissions you are doing. This is further complicated, however, by stacks that do TLP. A TLP when discovering a lost ack in the reverse path will cause the generation of a DSACK. For this situation we introduce a new dsack-tlp-bytes as well as the more traditional dsack-bytes and dsack-packets. These will now all display in netstat -p tcp -s. This also updates all stacks that are currently built to keep track of these stats. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32158	2021-10-01 10:36:27 -04:00
Michael Tuexen	28ea947078	sctp: provide a specific stream scheduler function for FCFS A KASSERT in the generic routine does not apply and triggers incorrectly. Reported by: syzbot+8435af157238c6a11430@syzkaller.appspotmail.com MFC after: 1 week	2021-09-29 02:08:37 +02:00
Michael Tuexen	fa947a3687	sctp: cleanup and adding KASSERT()s, no functional change MFC after: 1 week	2021-09-28 20:31:12 +02:00
Michael Tuexen	5b53e749a9	sctp: fix usage of stream scheduler functions sctp_ss_scheduled() should only be called for streams that are scheduled. So call sctp_ss_remove_from_stream() before it. This bug was uncovered by the earlier cleanup. Reported by: syzbot+bbf739922346659df4b2@syzkaller.appspotmail.com Reported by: syzbot+0a0857458f4a7b0507c8@syzkaller.appspotmail.com Reported by: syzbot+a0b62c6107b34a04e54d@syzkaller.appspotmail.com Reported by: syzbot+0aa0d676429ebcd53299@syzkaller.appspotmail.com Reported by: syzbot+104cc0c1d3ccf2921c1d@syzkaller.appspotmail.com MFC after: 1 week	2021-09-28 05:25:58 +02:00
Michael Tuexen	171633765c	sctp: avoid locking an already locked mutex Reported by: syzbot+f048680690f2e8d7ddad@syzkaller.appspotmail.com Reported by: syzbot+0725c712ba89d123c2e9@syzkaller.appspotmail.com MFC after: 1 week	2021-09-28 05:17:03 +02:00
Gordon Bergling	d2e616147d	sctp: Fix a typo in a comment - s/assue/assume/ MFC after: 3 days	2021-09-26 15:15:39 +02:00
Marko Zec	43880c511c	[fib_algo][dxr] Split unused range chunk list in multiple buckets Traversing a single list of unused range chunks in search for a block of optimal size was suboptimal. The experience with real-world BGP workloads has shown that on average unused range chunks are tiny, mostly in length from 1 to 4 or 5, when DXR is configured with K = 20 which is the current default (D16X4R). Therefore, introduce a limited amount of buckets to accommodate descriptors of empty blocks of fixed (small) size, so that those can be found in O(1) time. If no empty chunks of the requested size can be found in fixed-size buckets, the search continues in an unsorted list of empty chunks of variable lengths, which should only happen infrequently. This change should permit us to manage significantly more empty range chunks without sacrificing the speed of incremental range table updating. MFC after: 3 days	2021-09-25 06:29:48 +02:00
Randall Stewart	1ca931a540	tcp: Rack compressed ack path updates the recv window too easily The compressed ack path of rack is not following proper procedures in updating the peers window. It should be checking the seq and ack values before updating and instead it is blindly updating the values. This could in theory get the wrong window in the connection for some length of time. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32082	2021-09-23 11:43:29 -04:00
Randall Stewart	fd69939e79	tcp: Two bugs in rack one of which can lead to a panic. In extensive testing in NF we have found two issues inside the rack stack. 1) An incorrect offset is being generated by the fast send path when a fast send is initiated on the end of the socket buffer and before the fast send runs, the sb_compress macro adds data to the trailing socket. This fools the fast send code into thinking the sb offset changed and it miscalculates an "updated offset". It should only do that when the mbuf in question got smaller, i.e. an ack was processed. This can lead to a panic deref'ing a NULL mbuf if that packet is ever retransmitted. At the best case it leads to invalid data being sent to the client which usually terminates the connection. The fix is to have the proper logic (that is in the rsm fast path) to make sure we only update the offset when the mbuf shrinks. 2) The other issue is more bothersome. The timestamp check in rack needs to use the msec timestamp when comparing the timestamp echo to now. It was using a microsecond timestamp which ends up giving error prone results but causes only small harm in trying to identify which send to use in RTT calculations if it's a retransmit. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32062	2021-09-23 10:54:23 -04:00
Michael Tuexen	414499b3f9	sctp: Cleanup stream schedulers. No functional change intended. MFC after: 1 week	2021-09-23 14:16:56 +02:00
Michael Tuexen	762ae0ec8d	sctp: Simplify stream scheduler usage Callers are getting the stcb send lock, so just KASSERT that. No need to signal this when calling stream scheduler functions. No functional change intended. MFC after: 1 week	2021-09-21 17:13:57 +02:00
Michael Tuexen	0b79a76f84	sctp: improve consistency when calling stream scheduler Hold always the stcb send lock when calling sctp_ss_init() and sctp_ss_remove_from_stream(). MFC after: 1 week	2021-09-21 00:54:13 +02:00
Michael Tuexen	34b1efcea1	sctp: use a valid outstream when adding it to the scheduler Without holding the stcb send lock, the outstreams might get reallocated if the number of streams are increased. Reported by: syzbot+4a5431d7caa666f2c19c@syzkaller.appspotmail.com Reported by: syzbot+aa2e3b013a48870e193d@syzkaller.appspotmail.com Reported by: syzbot+e4368c3bde07cd2fb29f@syzkaller.appspotmail.com Reported by: syzbot+fe2f110e34811ea91690@syzkaller.appspotmail.com Reported by: syzbot+ed6e8de942351d0309f4@syzkaller.appspotmail.com MFC after: 1 week	2021-09-20 15:52:10 +02:00
Marko Zec	2ac039f7be	[fib_algo][dxr] Merge adjacent empty range table chunks. MFC after: 3 days	2021-09-20 06:30:45 +02:00
Michael Tuexen	e19d93b19d	sctp: fix FCFS stream scheduler Reported by: syzbot+c6793f0f0ce698bce230@syzkaller.appspotmail.com MFC after: 1 week	2021-09-19 11:56:26 +02:00
Mark Johnston	bf25678226	ktls: Fix error/mode confusion in TCP_*TLS_MODE getsockopt handlers ktls_get_(rx|tx)_mode() can return an errno value or a TLS mode, so errors are effectively hidden. Fix this by using a separate output parameter. Convert to the new socket buffer locking macros while here. Note that the socket buffer lock is not needed to synchronize the SOLISTENING check here, we can rely on the PCB lock. Reviewed by: jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31977	2021-09-17 14:19:05 -04:00
Mike Karels	fd0765933c	Change lowest address on subnet (host 0) not to broadcast by default. The address with a host part of all zeros was used as a broadcast long ago, but the default has been all ones since 4.3BSD and RFC1122. Until now, we would broadcast the host zero address as well as the configured address. Change to not broadcasting that address by default, but add a sysctl (net.inet.ip.broadcast_lowest) to re-enable it. Note that the correct way to use the zero address for broadcast would be to configure it as the broadcast address for the network. See https://datatracker.ietf.org/doc/draft-schoen-intarea-lowest-address/ and the discussion in https://reviews.freebsd.org/D19316. Note, Linux now implements this. Reviewed by: rgrimes, tuexen; melifaro (previous version) MFC after: 1 month Relnotes: yes Differential Revision: https://reviews.freebsd.org/D31861	2021-09-16 19:42:20 -05:00
Marko Zec	eb3148cc4d	[fib algo][dxr] Fix division by zero. A division by zero would occur if DXR would be activated on a vnet with no IP addresses configured on any interfaces. PR: 257965 MFC after: 3 days Reported by: Raul Munoz	2021-09-16 16:34:05 +02:00
Marko Zec	b51f8bae57	[fib algo][dxr] Optimize trie updating. Don't rebuild in vain trie parts unaffected by accumulated incremental RIB updates. PR: 257965 Tested by: Konrad Kreciwilk MFC after: 3 days	2021-09-15 22:42:49 +02:00
Marko Zec	442c8a245e	[fib algo][dxr] Fix undefined behavior. The result of shifting uint32_t by 32 (or more) is undefined: fix it.	2021-09-15 22:42:48 +02:00
Hans Petter Selasky	e3e7d95332	tcp: Avoid division by zero when KERN_TLS is enabled in tcp_account_for_send(). If the "len" variable is non-zero, we can assume that the sum of "tp->t_snd_rxt_bytes + tp->t_sndbytes" is also non-zero. It is also assumed that the 64-bit byte counters will never wrap around. Differential Revision: https://reviews.freebsd.org/D31959 Reviewed by: gallatin, rrs and tuexen Found by: "I told you so", also called hselasky MFC after: 1 week Sponsored by: NVIDIA Networking	2021-09-15 18:05:31 +02:00
Michael Tuexen	4542164685	sctp: cleanup, no functional change intended MFC after: 1 week	2021-09-15 10:18:11 +02:00
John Baldwin	c782ea8bb5	Add a switch structure for send tags. Move the type and function pointers for operations on existing send tags (modify, query, next, free) out of 'struct ifnet' and into a new 'struct if_snd_tag_sw'. A pointer to this structure is added to the generic part of send tags and is initialized by m_snd_tag_init() (which now accepts a switch structure as a new argument in place of the type). Previously, device driver ifnet methods switched on the type to call type-specific functions. Now, those type-specific functions are saved in the switch structure and invoked directly. In addition, this more gracefully permits multiple implementations of the same tag within a driver. In particular, NIC TLS for future Chelsio adapters will use a different implementation than the existing NIC TLS support for T6 adapters. Reviewed by: gallatin, hselasky, kib (older version) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D31572	2021-09-14 11:43:41 -07:00
Mark Johnston	e6c19aa94d	sctp: Allow blocking on I/O locks even with non-blocking sockets There are two flags to request a non-blocking receive on a socket: MSG_NBIO and MSG_DONTWAIT. They are handled a bit differently in that soreceive_generic() and soreceive_stream() will block on the socket I/O lock when MSG_NBIO is set, but not if MSG_DONTWAIT is set. In general, MSG_NBIO seems to mean, "don't block if there is no data to receive" and MSG_DONTWAIT means "don't go to sleep for any reason". SCTP's soreceive implementation did not allow blocking on the I/O lock if either flag is set, but this violates an assumption in aio_process_sb(), which specifies MSG_NBIO but nonetheless expects to make progress if data is available to read. Change sctp_sorecvmsg() to block on the I/O lock only if MSG_DONTWAIT is not set. Reported by: syzbot+c7d22dbbb9aef509421d@syzkaller.appspotmail.com Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31915	2021-09-14 09:02:05 -04:00
Michael Tuexen	29545986bd	sctp: avoid LOR Don't lock the inp-info lock while holding an stcb lock. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D31921	2021-09-12 21:11:14 +02:00
Michael Tuexen	4181fa2a20	sctp: minor cleanup, no functional change MFC after: 1 week	2021-09-12 19:21:15 +02:00
Mark Johnston	2d5c48eccd	sctp: Tighten up locking around sctp_aloc_assoc() All callers of sctp_aloc_assoc() mark the PCB as connected after a successful call (for one-to-one-style sockets). In all cases this is done without the PCB lock, so the PCB's flags can be corrupted. We also do not atomically check whether a one-to-one-style socket is a listening socket, which violates various assumptions in solisten_proto(). We need to hold the PCB lock across all of sctp_aloc_assoc() to fix this. In order to do that without introducing lock order reversals, we have to hold the global info lock as well. So: - Convert sctp_aloc_assoc() so that the inp and info locks are consistently held. It returns with the association lock held, as before. - Fix an apparent bug where we failed to remove an association from a global hash if sctp_add_remote_addr() fails. - sctp_select_a_tag() is called when initializing an association, and it acquires the global info lock. To avoid lock recursion, push locking into its callers. - Introduce sctp_aloc_assoc_connected(), which atomically checks for a listening socket and sets SCTP_PCB_FLAGS_CONNECTED. There is still one edge case in sctp_process_cookie_new() where we do not update PCB/socket state correctly. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31908	2021-09-11 10:15:21 -04:00
orange30	f5777c123a	net: Fix memory leaks upon arp_fillheader() failures Free memory before return from arprequest_internal(). In in_arpinput(), if arp_fillheader() fails, it should use goto drop. Reviewed by: melifaro, imp, markj MFC after: 1 week Pull Request: https://github.com/freebsd/freebsd-src/pull/534	2021-09-10 09:45:26 -04:00
Michael Tuexen	3ea2cdd45e	sctp: add explicit cast, no functional change intended MFC after: 3 days	2021-09-09 19:13:47 +02:00
Michael Tuexen	0c1a20beb4	sctp: use appropriate argument when freeing association Reported by: syzbot+7fe26e26911344e7211d@syzkaller.appspotmail.com MFC after: 3 days	2021-09-09 18:01:35 +02:00
Mark Johnston	4250aa1188	sctp: Clear assoc socket references when freeing a PCB This restores behaviour present in the first import of SCTP. Commit `ceaad40ae7` commented this out and commit `62fb761ff2` removed it. However, once sctp_inpcb_free() returns, the socket reference is gone no matter what, so we need to clear it. Reported by: syzbot+30dd69297fcbc5f0e10a@syzkaller.appspotmail.com Reported by: syzbot+7b2f9d4bcac1c9569291@syzkaller.appspotmail.com Reported by: syzbot+ed3e651f7d040af480a6@syzkaller.appspotmail.com Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31886	2021-09-09 08:33:26 -04:00
Michael Tuexen	58a7bf124c	sctp: cleanup timewait handling for vtags MFC after: 1 week	2021-09-09 01:18:58 +02:00
Mark Johnston	ee4731179c	sctp: Fix a lock order reversal in sctp_swap_inpcb_for_listen() When port reuse is enabled in a one-to-one-style socket, sctp_listen() may call sctp_swap_inpcb_for_listen() to move the PCB out of the "TCP pool". In so doing it will drop the PCB lock, yielding an LOR since we now hold several socket locks. Reorder sctp_listen() so that it performs this operation before beginning the conversion to a listening socket. Also modify sctp_swap_inpcb_for_listen() to return with PCB write-locked, since that's what sctp_listen() expects now. Reviewed by: tuexen Fixes: `bd4a39cc93` ("socket: Properly interlock when transitioning to a listening socket") MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31879	2021-09-08 11:41:19 -04:00
Mark Johnston	6e3af6321b	sctp: Fix lock recursion in sctp_swap_inpcb_for_listen() After commit `bd4a39cc93` we now hold the global inp info lock across the call to sctp_swap_inpcb_for_listen(), which attempts to acquire it again. Since sctp_swap_inpcb_for_listen()'s sole caller is sctp_listen(), we can simply change it to not try to acquire the lock. Reported by: syzbot+a76b19ea2f8e1190c451@syzkaller.appspotmail.com Reported by: syzbot+a1b6cef257ad145b7187@syzkaller.appspotmail.com Reviewed by: tuexen Fixes: `bd4a39cc93` ("socket: Properly interlock when transitioning to a listening socket") MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31878	2021-09-08 11:41:18 -04:00
Michael Tuexen	aab1d593b2	sctp: minor cleanups, no functional change intended	2021-09-08 15:13:49 +02:00
Alexander V. Chernikov	4b631fc832	routing: fix source address selection rules for IPv4 over IPv6. Current logic always selects an IFA of the same family from the outgoing interfaces. In IPv4 over IPv6 setup there can be just single non-127.0.0.1 ifa, attached to the loopback interface. Create a separate rt_getifa_family() to handle entire ifa selection for the IPv4 over IPv6. Differential Revision: https://reviews.freebsd.org/D31868 MFC after: 1 week	2021-09-07 21:41:05 +00:00
Mark Johnston	c4b44adcf0	sctp: Remove special handling for a listen(2) backlog of 0 ... when applied to one-to-one-style sockets. sctp_listen() cannot be used to toggle the listening state of such a socket. See RFC 6458's description of expected listen(2) semantics for one-to-one- and one-to-many-style sockets. Reviewed by: tuexen MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31774	2021-09-07 17:12:09 -04:00
Mark Johnston	bd4a39cc93	socket: Properly interlock when transitioning to a listening socket Currently, most protocols implement pru_listen with something like the following: SOCK_LOCK(so); error = solisten_proto_check(so); if (error) { SOCK_UNLOCK(so); return (error); } solisten_proto(so); SOCK_UNLOCK(so); solisten_proto_check() fails if the socket is connected or connecting. However, the socket lock is not used during I/O, so this pattern is racy. The change modifies solisten_proto_check() to additionally acquire socket buffer locks, and the calling thread holds them until solisten_proto() or solisten_proto_abort() is called. Now that the socket buffer locks are preserved across a listen(2), this change allows socket I/O paths to properly interlock with listen(2). This fixes a large number of syzbot reports, only one is listed below and the rest will be dup'ed to it. Reported by: syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com Reviewed by: tuexen, gallatin MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31659	2021-09-07 17:11:43 -04:00
Mark Johnston	f94acf52a4	socket: Rename sb(un)lock() and interlock with listen(2) In preparation for moving sockbuf locks into the containing socket, provide alternative macros for the sockbuf I/O locks: SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK(). These operate on a socket rather than a socket buffer. Note that these locks are used only to prevent concurrent readers and writers from interleaving I/O. When locking for I/O, return an error if the socket is a listening socket. Currently the check is racy since the sockbuf sx locks are destroyed during the transition to a listening socket, but that will no longer be true after some follow-up changes. Modify a few places to check for errors from sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before. In particular, add checks to sendfile() and sorflush(). Reviewed by: tuexen, gallatin MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31657	2021-09-07 15:06:48 -04:00
Mark Johnston	173a7a4ee4	sctp: Fix iterator synchronization in sctp_sendall() - The SCTP_PCB_FLAGS_SND_ITERATOR_UP check was racy, since two threads could observe that the flag is not set and then both set it. I'm not sure if this is actually a problem in practice, i.e., maybe there's no problem having multiple sends for a single PCB in the iterator list? - sctp_sendall() was modifying sctp_flags without the inp lock held. The change simply acquires the PCB write lock before toggling the flag, fixing both problems. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31813	2021-09-07 11:19:29 -04:00
Mark Johnston	e8e23ec127	sctp: Remove an unused sctp_inpcb field This appears to be unused in usrsctp as well. No functional change intended. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31812	2021-09-07 11:19:29 -04:00
Mark Johnston	c17b531bed	sctp: Fix races around sctp_inpcb_free() sctp_close() and sctp_abort() disassociate the PCB from its socket. As a part of this, they attempt to free the PCB, which may end up lingering. Fix some bugs in this area: - For some reason, sctp_close() and sctp_abort() set SCTP_PCB_FLAGS_SOCKET_GONE using an atomic compare-and-set without the PCB lock held. This is racy since sctp_flags is normally updated without atomics, using the PCB lock to synchronize. So, the update can be lost, which can cause all sorts of races with other SCTP components which look for the _GONE flag. Fix the problem simply by acquiring the PCB lock in order to set the flag. Note that we have to drop and re-acquire the lock again in sctp_inpcb_free(), but I don't see a good way around that for now. If it's a real problem, the _GONE flag could be split out of sctp_flags and into a dedicated sctp_inpcb field. - In sctp_inpcb_free(), load sctp_socket after acquiring the PCB lock, to avoid possible races with parallel sctp_inpcb_free() calls. - Add an assertion to sctp_inpcb_free() to verify that _ALLGONE is not set. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31811	2021-09-07 11:19:29 -04:00
Alexander V. Chernikov	936f4a42fa	lltable: do not require prefix lookup when checking lle allocation rules. With the new FIB_ALGO infrastructure, nearly all subsystems use fib[46]_lookup() functions, which provide lockless lookups. A number of places remain that use old-style lookup functions, which still require the RIB read lock to return the result. One such place is the arp processing code. FIB_ALGO implementation makes some tradeoffs, resulting in (relatively) prolonged periods of holding RIB_WLOCK. If the lock is held and datapath competes for it, the RX ring may get blocked, ending in traffic delays and losses. As currently arp processing is performed directly in the interrupt handler, handling ARP replies triggers the problem described above when the amount of ARP replies is high. To be more specific, prior to creating new ARP entry, routing lookup for the entry address in interface fib is executed. The following conditions are then verified: 1. If lookup returns an empty result, or the resulting prefix is non-directly-reachable, failure is returned. The only exception are host routes w/ gateway==address. 2. If the routing lookup returns different interface and non-host route, we want to support the use case of having multiple interfaces with the same prefix. In fact, the current code just checks if the returned prefix covers target address (always true) and effectively allows allocating ARP entries for any directly-reachable prefix, regardless of its interface. Change the code to perform the following: 1) use fib4_lookup() to get the nexthop, instead of requesting exact prefix. 2) Rewrite first condition check using nexthop flags (1:1 match) 3) Rewrite second condition to check for interface addresses matching target address on the input interface. Differential Revision: https://reviews.freebsd.org/D31824 Reviewed by: ae MFC after: 1 week PR: 257965	2021-09-06 21:03:22 +00:00
Gordon Bergling	631504fb34	Fix a common typo in source code comments - s/existant/existent/ MFC after: 3 days	2021-09-04 12:56:57 +02:00
Mark Johnston	c98bf2a45e	sctp: Always check for a vanishing inpcb when processing COOKIE-ECHO We previously did this only in the normal case where no association exists yet. However, it is not safe to process COOKIE-ECHO even if an association exists, as sctp_process_cookie_existing() may dereference the socket pointer. See also commit `0c7dc84076`. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31755	2021-09-01 10:28:17 -04:00
Mark Johnston	d35be50f57	sctp: Hold association locks across socket wakeups when freeing At this point we do not hold the inpcb lock, so the only thing holding the socket reference live is the TCB lock, which needs to be acquired by sctp_inpcb_free() in order to destroy associations. Defer the unlock until after we dereference the socket reference. Reported by: syzbot+1d0f2c4675de76a4cf1e@syzkaller.appspotmail.com Reported by: syzbot+fabee77954fe69d3a5ad@syzkaller.appspotmail.com Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31754	2021-09-01 10:27:51 -04:00
Mark Johnston	65f30a39e1	sctp: Release the socket reference when detaching an association Later in sctp_free_assoc(), when we clean up chunk lists, sctp_free_spbufspace() is used to reset the byte count in the socket send buffer. However, if the PCB is going away, the socket may already have been detached from the PCB, in which case this becomes a use-after free. Clear the socket reference from the association before detaching it from the PCB, if the PCB has already lost its socket reference. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31753	2021-09-01 10:27:31 -04:00
Mark Johnston	457abbb857	sctp: Implement sctp_inpcb_bind_locked() This will be used by sctp_listen() to avoid dropping locks when performing an implicit bind. No functional change intended. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31757	2021-09-01 10:06:18 -04:00
Mark Johnston	be8ee77e9e	sctp: Add macros to assert on inp info lock state Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31756	2021-09-01 10:06:18 -04:00
Mark Johnston	4a36122b1d	sctp: Fix racy UNBOUND flag check in sctp_inpcb_bind() SCTP needs to avoid binding a given socket twice. The check used to avoid this is racy since neither the inpcb lock nor the global info lock is held. Fix it by synchronizing using the global info lock. In particular, sctp_inpcb_bind() may drop the inpcb lock in some cases, but the info lock is sufficient to prevent double insertion into PCB hash tables. Reported by: syzbot+548a8560d959669d0e12@syzkaller.appspotmail.com Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31734	2021-08-31 07:44:42 -04:00
Mark Johnston	2496d812a9	sctp: Simplify the free port search in sctp_inpcb_bind() Eliminate a flag variable and reduce indentation. No functional change intended. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31733	2021-08-31 07:43:39 -04:00
Mark Johnston	93908fce72	sctp: Avoid unnecessary refcount bumps in sctp_inpcb_bind() We only drop the inp lock when binding to a specific port. So, only acquire an extra reference when required. This simplifies error handling a bit. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31732	2021-08-31 07:43:27 -04:00
Mark Johnston	0d29e4bc01	sctp: Remove always-false checks in sctp_inpcb_bind() No functional change intended. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31731	2021-08-31 07:43:13 -04:00
Gordon Bergling	586c9dc374	inet(3): Fix a few common typos in source code comments - s/funtion/function/ MFC after: 3 days	2021-08-28 18:53:02 +02:00
Mark Johnston	d174534a27	tcp: Remove unused v6 state definitions These are supposedly for compatibility with KAME, but they are completely unused in our tree and don't exist in OpenBSD or NetBSD. Reviewed by: kbowling, bz, gnn Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31700	2021-08-27 08:31:32 -04:00
Artem Khramov	620cf65c2b	netinet: prevent NULL pointer dereference in in_aifaddr_ioctl() It appears that maliciously crafted ifaliasreq can lead to NULL pointer dereference in in_aifaddr_ioctl(). In order to replicate that, one needs to 1. Ensure that carp(4) is not loaded 2. Issue SIOCAIFADDR call setting ifra_vhid field of the request to a negative value. A repro code would look like this. int main() { struct ifaliasreq req; struct sockaddr_in sin, mask; int fd, error; bzero(&sin, sizeof(struct sockaddr_in)); bzero(&mask, sizeof(struct sockaddr_in)); sin.sin_len = sizeof(struct sockaddr_in); sin.sin_family = AF_INET; sin.sin_addr.s_addr = inet_addr("192.168.88.2"); mask.sin_len = sizeof(struct sockaddr_in); mask.sin_family = AF_INET; mask.sin_addr.s_addr = inet_addr("255.255.255.0"); fd = socket(AF_INET, SOCK_DGRAM, 0); if (fd < 0) return (-1); memset(&req, 0, sizeof(struct ifaliasreq)); strlcpy(req.ifra_name, "lo0", sizeof(req.ifra_name)); memcpy(&req.ifra_addr, &sin, sin.sin_len); memcpy(&req.ifra_mask, &mask, mask.sin_len); req.ifra_vhid = -1; return ioctl(fd, SIOCAIFADDR, (char *)&req); } To fix, discard both positive and negative vhid values in in_aifaddr_ioctl, if carp(4) is not loaded. This prevents NULL pointer dereference and kernel panic. Reviewed by: imp@ Pull Request: https://github.com/freebsd/freebsd-src/pull/530	2021-08-26 12:08:03 -06:00
Michael Tuexen	dc6ab77d66	tcp: make network epoch expectations of LRO explicit Reviewed by: gallatin, hselasky MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D31648	2021-08-25 17:12:36 +02:00
Zhenlei Huang	62e1a437f3	routing: Allow using IPv6 next-hops for IPv4 routes (RFC 5549). Implement kernel support for RFC 5549/8950. * Relax control plane restrictions and allow specifying IPv6 gateways for IPv4 routes. This behavior is controlled by the net.route.rib_route_ipv6_nexthop sysctl (on by default). * Always pass final destination in ro->ro_dst in ip_forward(). * Use ro->ro_dst to exract packet family inside if_output() routines. Consistently use RO_GET_FAMILY() macro to handle ro=NULL case. * Pass extracted family to nd6_resolve() to get the LLE with proper encap. It leverages recent lltable changes committed in `c541bd368f`. Presence of the functionality can be checked using ipv4_rfc5549_support feature(3). Example usage: route add -net 192.0.0.0/24 -inet6 fe80::5054:ff:fe14:e319%vtnet0 Differential Revision: https://reviews.freebsd.org/D30398 MFC after: 2 weeks	2021-08-22 22:56:08 +00:00
Alexander V. Chernikov	c541bd368f	lltable: Add support for "child" LLEs holding encap for IPv4oIPv6 entries. Currently we use pre-calculated headers inside LLE entries as prepend data for `if_output` functions. Using these headers allows saving some CPU cycles/memory accesses on the fast path. However, this approach makes adding L2 header for IPv4 traffic with IPv6 nexthops more complex, as it is not possible to store multiple pre-calculated headers inside lle. Additionally, the solution space is limited by the fact that PCB caching saves LLEs in addition to the nexthop. Thus, add support for creating special "child" LLEs for the purpose of holding custom family encaps and store mbufs pending resolution. To simplify handling of those LLEs, store them in a linked-list inside a "parent" (e.g. normal) LLE. Such LLEs are not visible when iterating LLE table. Their lifecycle is bound to the "parent" LLE - it is not possible to delete "child" when parent is alive. Furthermore, "child" LLEs are static (RTF_STATIC), avoiding complex state machine used by the standard LLEs. nd6_lookup() and nd6_resolve() now accept an additional argument, family, allowing them to return such child LLEs. This change uses `LLE_SF()` macro which packs family and flags in a single int field. This is done to simplify merging back to stable/. Once this code lands, most of the cases will be converted to use a dedicated `family` parameter. Differential Revision: https://reviews.freebsd.org/D31379 MFC after: 2 weeks	2021-08-21 17:34:35 +00:00
Michael Tuexen	a3665770d7	sctp: improve handling of illegal parameters of INIT-ACK chunks MFC after: 3 days	2021-08-20 14:06:41 +02:00
Luiz Otavio O Souza	20ffd88ed5	ipfw: use unsigned int for dummynet bandwidth This allows the maximum value of 4294967295 (~4Gb/s) instead of previous value of 2147483647 (~2Gb/s). Reviewed by: np, scottl Obtained from: pfSense MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31582	2021-08-19 10:48:53 +02:00
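The limits quoted in the dummynet commit above are the maxima of a 32-bit signed and a 32-bit unsigned integer. A minimal sketch of the two ranges follows; the helper names are hypothetical and are not taken from the ipfw sources, and a 32-bit `int` is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical range checks for a bandwidth value in bit/s.
 * Assumption: the bandwidth field is 32 bits wide.
 */

/* True if the value fits the old, signed representation (max 2147483647). */
static bool
fits_signed_bw(uint64_t bps)
{
	return (bps <= INT32_MAX);
}

/* True if the value fits the new, unsigned representation (max 4294967295). */
static bool
fits_unsigned_bw(uint64_t bps)
{
	return (bps <= UINT32_MAX);
}
```

A value such as 3000000000 bit/s is rejected by the first check and accepted by the second.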
Michael Tuexen	eba8e643b1	sctp: improve handling of INIT chunks with invalid parameters MFC after: 3 days	2021-08-19 00:33:28 +02:00
Randall Stewart	5baf32c97a	tcp: Add support for DSACK based reordering window to rack. The rack stack, with respect to the rack bits in it, was originally built based on an early I-D of rack. In fact at that time the TLP bits were in a separate I-D. The dynamic reordering window based on DSACK events was not present in rack at that time. It is now part of the RFC and we need to update our stack to include these features. However we want to have a way to control the feature so that we can, if the admin decides, make it stay the same way system wide as well as via socket option. The new sysctl and socket option has the following meaning for setting: 00 (0) - Keep the old way, i.e. reordering window is 1 and do not use DSACK bytes to add to reorder window 01 (1) - Change the Reordering window to 1/4 of an RTT but do not use DSACK bytes to add to reorder window 10 (2) - Keep the reordering window as 1, but do use SACK bytes to add additional 1/4 RTT delay to the reorder window 11 (3) - reordering window is 1/4 of an RTT and add additional DSACK bytes to increase the reordering window (RFC behavior) The default currently in the sysctl is 3 so we get standards based behavior. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31506	2021-08-17 16:29:22 -04:00
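The four settings listed in the rack commit above read as a two-bit mask. The sketch below decodes it under that reading; the struct and function names are hypothetical (the commit gives only the numeric values), and treating the "SACK bytes" of setting 2 as DSACK bytes is an assumption.

```c
#include <stdbool.h>

/* Hypothetical decoded form of the 0..3 setting described in the commit. */
struct rack_reorder_cfg {
	bool	quarter_rtt_window;	/* bit 0: window is 1/4 RTT instead of 1 */
	bool	add_dsack_bytes;	/* bit 1: DSACK bytes extend the window */
};

/* Split the setting into its two independent switches. */
static struct rack_reorder_cfg
rack_decode_reorder_mode(unsigned int mode)
{
	struct rack_reorder_cfg cfg;

	cfg.quarter_rtt_window = (mode & 0x1) != 0;
	cfg.add_dsack_bytes = (mode & 0x2) != 0;
	return (cfg);
}
```

With the default of 3 both switches are on, which the commit describes as the RFC behavior.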
Mateusz Guzik	3be3cbe06d	ip_reass: do less work in ipreass_slowtimo if possible ipreass_slowtimo avoidably uses CPU on otherwise idle boxes Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31526	2021-08-14 18:50:12 +02:00
Mateusz Guzik	d2b95af1c2	ip_reass: drop the volatile keyword from nfrags and mark with __exclusive_cache_line The keyword adds nothing as all operations on the var are performed through atomic_* Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31526	2021-08-14 18:49:30 +02:00
Wojciech Macek	f61cb12aac	mroute: fix locking issues In some cases the code may fall into deadlock. Avoid calling epoch_wait when W-lock is taken. Sponsored by: Stormshield Obtained from: Semihalf	2021-08-13 11:06:17 +02:00
Eric van Gyzen	13a58148de	netdump: send key before dump, in case dump fails Previously, if an encrypted netdump failed, such as due to a timeout or network failure, the key was not saved, so a partial dump was completely useless. Send the key first, so the partial dump can be decrypted, because even a partial dump can be useful. Reviewed by: bdrewery, markj MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D31453	2021-08-11 10:54:56 -05:00
Michael Tuexen	3808ab732e	sctp: remove some set, but unused variables Thanks to pkasting for submitting the patch for the userland stack. MFC after: 3 days	2021-08-09 15:58:46 +02:00
Wojciech Macek	d9d59bb1af	ipsec: Handle ICMP NEEDFRAG message. It will be needed for upcoming PMTU implementation in ipsec. For now simply create/update an entry in tcp hostcache when needed. The code is based on https://people.freebsd.org/~ae/ipsec_transport_mode_ctlinput.diff Authored by: Kornel Duleba <mindal@semihalf.com> Differential revision: https://reviews.freebsd.org/D30992 Reviewed by: tuxen Sponsored by: Stormshield Obtained from: Semihalf	2021-08-09 12:01:46 +02:00
Alexander V. Chernikov	9748eb7427	Simplify nhop operations in ip_output(). Consistently use `nh` instead of always dereferencing ro->ro_nh inside the if block. Always use nexthop mtu, as it provides guarantee that mtu is accurate. Pass `nh` pointer to rt_update_ro_flags() to allow upcoming uses of updating ro flags based on different nexthop. Differential Revision: https://reviews.freebsd.org/D31451 Reviewed by: kp MFC after: 2 weeks	2021-08-08 09:19:27 +00:00
Gordon Bergling	04389c855e	Fix some common typos in comments - s/configuraiton/configuration/ - s/specifed/specified/ - s/compatiblity/compatibility/ MFC after: 5 days	2021-08-08 10:16:06 +02:00
Michael Tuexen	112899c6af	sctp: improve input validation of mapped addresses in sctp_connectx() MFC after: 3 days	2021-08-07 15:12:09 +02:00
Michael Tuexen	b732091a76	sctp: improve input validation of mapped addresses in send() Reported by: syzbot+35528f275f2eea6317cc@syzkaller.appspotmail.com Reported by: syzbot+ac29916d5f16d241553d@syzkaller.appspotmail.com MFC after: 3 days	2021-08-07 14:50:40 +02:00
Hans Petter Selasky	bb5cd80e8b	Update the TCP LRO code to handle both encrypted and un-encrypted traffic. Encrypted and un-encrypted traffic needs to be coalesced separately. Split the 16-bit lro_type field in the address information into two 8-bit fields, and then use the last 8-bit field for flags, which among other indicate if the received mbuf is encrypted or un-encrypted. Differential Revision: https://reviews.freebsd.org/D31377 Reviewed by: gallatin MFC after: 1 week Sponsored by: NVIDIA Networking	2021-08-06 11:28:44 +02:00
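The field split described in the LRO commit above might look roughly like the following. This is only an illustration: the struct name, the flag name, and the flag value are hypothetical, and the real layout in tcp_lro.h may differ.

```c
#include <stdbool.h>
#include <stdint.h>

#define	LRO_FLAG_ENCRYPTED	0x01	/* hypothetical flag bit */

/* The former 16-bit type field, split into two 8-bit halves. */
struct lro_key {
	uint8_t	lro_type;	/* first 8 bits: packet type */
	uint8_t	lro_flags;	/* last 8 bits: flags, e.g. encrypted */
};

/*
 * Two packets may be coalesced only if they have the same type and agree
 * on whether they are encrypted.
 */
static bool
lro_may_coalesce(const struct lro_key *a, const struct lro_key *b)
{
	if (a->lro_type != b->lro_type)
		return (false);
	return (((a->lro_flags ^ b->lro_flags) & LRO_FLAG_ENCRYPTED) == 0);
}
```

Keeping the flag inside the lookup key means encrypted and un-encrypted traffic of the same flow never lands in the same coalescing entry.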
Andrew Gallatin	739de953ec	ktls: Move KERN_TLS ifdef to tcp_var.h This allows us to remove stubs in ktls.h and allows us to sort the function prototypes. Reviewed by: jhb Sponsored by: Netflix	2021-08-05 19:17:35 -04:00
Alexander V. Chernikov	8482aa7748	Use lltable calculated header when sending lle holdchain after successful lle resolution. Subscribers: imp, ae, bz Differential Revision: https://reviews.freebsd.org/D31391	2021-08-05 20:44:36 +00:00
Michael Tuexen	3f1f6b6ef7	tcp, udp: improve input validation in handling bind() Reported by: syzbot+24fcfd8057e9bc339295@syzkaller.appspotmail.com Reported by: syzbot+6e90ceb5c89285b2655b@syzkaller.appspotmail.com Reviewed by: markj, rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D31422	2021-08-05 13:48:44 +02:00
Alexander V. Chernikov	f3a3b06121	[lltable] Unify datapath feedback mechanism. Use newly created llentry_request_feedback(), llentry_mark_used() and llentry_get_hittime() to request datapath usage check and fetch the results in the same fashion both in IPv4 and IPv6. While here, simplify llentry_provide_feedback() wrapper by eliminating 1 condition check. MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D31390	2021-08-04 22:52:43 +00:00
Konstantin Kukushkin	a61c24ddb7	udp: Fix soroverflow SOCKBUF unlocking We hold the SOCKBUF_LOCK so use soroverflow_locked here. This bug may manifest as a non-killable process stuck in [*so_rcv]. Approved by: scottl Reviewed by: Roy Marples <roy@marples.name> Fixes: `7045b1603b` MFC after: 10 days Differential Revision: https://reviews.freebsd.org/D31374	2021-08-01 08:07:33 -07:00
Roy Marples	7045b1603b	socket: Implement SO_RERROR SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we lose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports. Reviewed by: philip (network), kbowling (transport), gbe (manpages) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D26652	2021-07-28 09:35:09 -07:00
Bryan Drewery	a573243370	netdump: Fix leaking debugnet state on errors. Reviewed by: cem, markj Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D31319	2021-07-27 09:06:23 -07:00
Mark Johnston	ba21825202	rip: Add missing minimum length validation in rip_output() If the socket is configured such that the sender is expected to supply the IP header, then we need to verify that it actually did so. Reported by: syzkaller+KMSAN Reviewed by: donner MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31302	2021-07-26 16:39:37 -04:00
Kristof Provost	8e1864ed07	pf: syncookie support Import OpenBSD's syncookie support for pf. This feature help pf resist TCP SYN floods by only creating states once the remote host completes the TCP handshake rather than when the initial SYN packet is received. This is accomplished by using the initial sequence numbers to encode a cookie (hence the name) in the SYN+ACK response and verifying this on receipt of the client ACK. Reviewed by: kbowling Obtained from: OpenBSD MFC after: 1 week Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D31138	2021-07-20 10:36:13 +02:00
Michael Tuexen	a730d82378	tcp: fix RACK and BBR when using VIMAGE enabled kernel Fix a bug in VNET handling, which occurs when using specific NICs. PR: 257195 Reviewed by: rrs MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D31212	2021-07-20 00:29:18 +02:00
Randall Stewart	db4d2d7222	tcp: When rack or bbr get a pullup failure in the common code, don't free the NULL mbuf. There is a bug in the error path where rack_bbr_common does a m_pullup() and the pullup fails. There is a stray mfree(m) after m is set to NULL. This is not a good idea :-) Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31194	2021-07-16 13:59:57 -04:00
Randall Stewart	1d171e5ab9	tcp: Lro needs to validate that it does not go beyond the end of the mbuf as it parses. Currently the LRO parser, if given a packet that, say, has an ETH+IP header but the TCP header is in the next mbuf (split), would walk garbage. Let's make sure we keep track of the length as we parse and return NULL anytime we exceed the length of the mbuf. Reviewed by: tuexen, hselasky Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31195	2021-07-16 06:07:13 -04:00
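The length tracking described in the LRO commit above can be illustrated with a generic bounds-checked accessor. This is a sketch over a flat byte buffer, not the tcp_lro.c parser; the function name and parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return a pointer to a header of hdr_len bytes starting at offset off,
 * or NULL if any part of it would lie beyond the buf_len bytes available
 * in this buffer (for example because the header sits in the next mbuf).
 */
static const uint8_t *
hdr_at(const uint8_t *buf, size_t buf_len, size_t off, size_t hdr_len)
{
	/* Written so that neither comparison can overflow. */
	if (off > buf_len || hdr_len > buf_len - off)
		return (NULL);
	return (buf + off);
}
```

A caller parsing Ethernet, then IP, then TCP would advance `off` by each header's length and stop at the first NULL instead of reading past the end of the buffer.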
Randall Stewart	ca1a7e1021	tcp: TCP_LRO getting bad checksums and sending it in to TCP incorrectly. In reviewing tcp_lro.c we have a possibility that some drivers may send a mbuf into LRO without making sure that the checksum passes. Some drivers actually are aware of this and do not call lro when the csum failed, others do not do this and thus could end up sending data up that we think has a checksum passing when it does not. This change will fix that situation by properly verifying that the mbuf has the correct markings (CSUM VALID bits as well as csum in mbuf header is set to 0xffff). Reviewed by: tuexen, hselasky, gallatin Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31155	2021-07-13 12:45:15 -04:00
Stefan Eßer	58080fbca0	libalias: fix divide by zero causing panic The packet_limit can fall to 0, leading to a divide by zero abort in the "packets % packet_limit". A possible solution would be to apply a lower limit of 1 after the calculation of packet_limit, but since any number modulo 1 gives 0, the more efficient solution is to skip the modulo operation for packet_limit <= 1. Since this is a fix for a panic observed in stable/12, merging this fix to stable/12 and stable/13 before expiry of the 3 day waiting period might be justified, if it works for the reporter of the issue. Reported by: Karl Denninger <karl@denninger.net> MFC after: 3 days	2021-07-10 13:08:18 +02:00
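The guard described in the libalias commit above can be sketched as follows. The function name is hypothetical, and it is an assumption that a remainder of 0 is what triggers the periodic work; only the `packet_limit <= 1` shortcut is taken from the commit message.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether periodic work is due for this packet.
 * For packet_limit <= 1 the modulo is skipped: modulo 0 would trap, and
 * any number modulo 1 is 0, so the answer is "due" either way.
 */
static bool
housekeeping_due(int packets, int packet_limit)
{
	if (packet_limit <= 1)
		return (true);
	return (packets % packet_limit == 0);
}
```

Compared with clamping packet_limit to 1, this saves the division in exactly the case where its result is already known.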
Michael Tuexen	105b68b42d	sctp: Fix errno in case of association setup failures Do not report always ETIMEDOUT, but only when appropriate. In other cases report ECONNABORTED. MFC after: 3 days	2021-07-09 23:19:25 +02:00
Michael Tuexen	ce64352a70	sctp: provide consistent stream information in case of early errors While there, make sure the function is called correctly. MFC after: 3 days	2021-07-09 14:16:59 +02:00
Michael Tuexen	84992a3251	sctp: provide sac_error also for ABORT chunk being sent Thanks to Florent Castelli for bringing this issue up for the userland stack and providing an initial patch. MFC: 3 days	2021-07-09 13:46:27 +02:00
Randall Stewart	7312e4e5cf	tcp: Fix 32 bit platform breakage This fixes the incorrect use of a sysctl add to u64. It was for a useconds time, but on 32 bit platforms it's not a u64. Instead use the long directive. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31107	2021-07-08 08:16:45 -04:00
Andrew Gallatin	b1e806c0ed	tcp: fix alternate stack build with LINT-NO{INET,INET6,IP} When fixing another bug, I noticed that the alternate TCP stacks do not build when various combinations of ipv4 and ipv6 are disabled. Reviewed by: rrs, tuexen Differential Revision: https://reviews.freebsd.org/D31094 Sponsored by: Netflix	2021-07-07 13:02:08 -04:00
Randall Stewart	d7955cc0ff	tcp: HPTS performance enhancements HPTS drives both rack and bbr, and yet there have been many complaints about performance. This bit of work restructures hpts to help reduce CPU overhead. Instead of relying on the timer/callout to drive it, hpts is now driven by user return from a system call as well as by lro flushes. The timer becomes a backstop that dynamically adjusts based on how "late" we are. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31083	2021-07-07 07:22:35 -04:00
Randall Stewart	e834f9a44a	tcp: Address goodput and TLP edge cases. There are several cases where we make a goodput measurement and we are running out of data when we decide to make the measurement. In reality we should not make such a measurement if there is no chance we can have "enough" data. There are also some corner-case TLPs that end up not registering as a TLP like they should; we fix this by pushing the doing_tlp setup to the actual timeout that knows it did a TLP. This makes it so we always have the appropriate flag on the sendmap indicating a TLP being done as well as count correctly so we make no more than two TLPs. In addressing the goodput let's also add a "quality" metric that can be viewed via blackbox logs so that a casual observer does not have to figure out how good of a measurement it is. This is needed due to the fact that we may still make a measurement that is of a poorer quality as we run out of data but still have a minimal amount of data to make a measurement. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31076	2021-07-06 15:26:37 -04:00
Andrew Gallatin	28d0a740dd	ktls: auto-disable ifnet (inline hw) kTLS Ifnet (inline) hw kTLS NICs typically keep state within a TLS record, so that when transmitting in-order, they can continue encryption on each segment sent without DMA'ing extra state from the host. This breaks down when transmits are out of order (eg, TCP retransmits). In this case, the NIC must re-DMA the entire TLS record up to and including the segment being retransmitted. This means that when re-transmitting the last 1448 byte segment of a TLS record, the NIC will have to re-DMA the entire 16KB TLS record. This can lead to the NIC running out of PCIe bus bandwidth well before it saturates the network link if a lot of TCP connections have a high retransmit rate. This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct), where TCP connections with higher retransmit rate will be switched to SW kTLS so as to conserve PCIe bandwidth. Reviewed by: hselasky, markj, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30908	2021-07-06 10:28:32 -04:00
Lutz Donnerhacke	4060e77f49	libalias: Remove a stray directive Removal of a preprocessor line was missed during development. Do it now and MFC it together with the other patches. MFC after: 2 days	2021-07-04 17:54:45 +02:00
Lutz Donnerhacke	2f4d91f9cb	libalias: Rewrite HISTORY Fix the history entry (wrong year) and add the missing recent work. MFC together with the other patches. MFC after: 2 days	2021-07-04 17:46:47 +02:00
Lutz Donnerhacke	f284553444	libalias: Fix API bug on initialization The kernel part of ipfw(8) does initialize LibAlias unconditionally with a zeroized port range (allowed ports from 0 to 0). During restructuring of libalias, port ranges are used every time and are therefore initialized with different values than zero. The secondary initialization from ipfw (and probably others) overrides the new default values and leaves the instance in a non-functional state. The obvious solution is to detect such reinitializations and use the new default value instead. MFC after: 3 days	2021-07-03 23:03:07 +02:00
Lutz Donnerhacke	b50a4dce18	libalias: Avoid uninitialized expiration The expiration time of direct address mappings is explicitly uninitialized. Expire times are always compared during housekeeping. Although the uninitialized value does no harm, it's simpler to just set it to a reasonable default. This was detected while running the test suite under valgrind. MFC after: 3 days	2021-07-03 01:09:18 +02:00
Lutz Donnerhacke	25392fac94	libalias: Fix splay comparison bug Comparing elements in a tree requires transitivity. If a < b and b < c then a must be smaller than c. This way the tree elements are always pairwise comparable. Tristate comparison functions returning values lower, equal, or greater than zero, are usually implemented by a simple subtraction of the operands. If the size of the operands is equal to the size of the result, integer modular arithmetic kicks in and violates the transitivity. Example: Working on byte with 0, 120, and 240. Now computing the differences: 120 - 0 = 120 240 - 120 = 120 240 - 0 = -16 MFC after: 3 days	2021-07-03 00:31:53 +02:00
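The byte example in the commit above can be reproduced directly. The sketch below contrasts a subtraction-based comparator with an overflow-free one; both function names are hypothetical and are not taken from libalias, and the narrowing cast is assumed to wrap in the usual two's-complement way.

```c
#include <stdint.h>

/*
 * Subtraction-based comparison with an 8-bit result. The difference wraps
 * modulo 256, so its sign can be wrong for operands that are far apart.
 * Assumption: the conversion to int8_t wraps (two's complement).
 */
static int8_t
cmp_by_subtraction(uint8_t a, uint8_t b)
{
	return ((int8_t)(uint8_t)(a - b));
}

/* Overflow-free tristate comparison: returns -1, 0, or 1. */
static int
cmp_tristate(uint8_t a, uint8_t b)
{
	return ((a > b) - (a < b));
}
```

With the first comparator, 120 compares greater than 0 and 240 greater than 120, yet 240 compares less than 0 (the result is -16), so the order is not transitive. The second comparator orders all three values consistently.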
Michael Tuexen	c7f048ab35	sctp: initialize sequence numbers for ECN correctly MFC after: 3 days Reported by: Junseok Yang (for the userland stack)	2021-06-27 20:14:48 +02:00
Michael Tuexen	6587a2bd1e	sctp: Fix length check for ECNE chunks MFC after: 3 days	2021-06-27 16:10:39 +02:00
Michael Tuexen	870af3f4dc	tcp: tolerate missing timestamps Some TCP stacks negotiate TS support, but do not send TS at all or not for keep-alive segments. Since this includes modern widely deployed stacks, tolerate the violation of RFC 7323 per default. Reviewed by: rgrimes, rrs, rscheff MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D30740 Sponsored by: Netflix, Inc.	2021-06-27 16:03:57 +02:00
Randall Stewart	9e4d9e4c4d	tcp: Preparation for allowing hardware TLS to be able to kick a tcp connection that is retransmitting too much out of hardware and back to software. Hardware TLS is now supported in some interface cards and it works well. Except that when we have connections that retransmit a lot we get into trouble with all the retransmits. This prep step makes way for change that Drew will be making so that we can "kick out" a session from hardware TLS. Reviewed by: mtuexen, gallatin Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30895	2021-06-25 09:30:54 -04:00
Randall Stewart	66aec14a53	tcp: Rack not being very friendly with V6:4 socket and having a connection from V4 There were two bugs that prevented V4 sockets from connecting to a rack server running a V4/V6 socket. As well as a bug that stops the mapped v4 in V6 address from working. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30885	2021-06-24 14:42:21 -04:00
Wojciech Macek	17ac6d94db	ip_mroute: initialize vif ifnet properly Use if_alloc to ensure all fields of ifnet are allocated properly Reported by: Damien Deville Sponsored by: Stormshield Obtained from: Semihalf Reviewed by: mw Differential revision: https://reviews.freebsd.org/D30608	2021-06-23 10:13:52 +02:00
Lutz Donnerhacke	f70c98a2f5	libalias: Fix compile time warning about unused functions Compiling libalias results in warnings about unused functions. Those warnings are caused by clang's heuristic to consider an inline function as in use, iff the declaration is in a .c file. Declarations in .h files do not emit those warnings. Hence the declarations must be moved to an extra *.h file. MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D30844	2021-06-23 10:06:04 +02:00
John Baldwin	a7f6c6fd94	toe: Read-lock the inp in toe_4tuple_check(). tcp_twcheck now expects a read lock on the inp for the SYN case instead of a write lock. Reviewed by: np Fixes: `1db08fbe3f` tcp_input: always request read-locking of PCB for any pure SYN segment. Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D30782	2021-06-22 16:31:01 -07:00
Gleb Smirnoff	c4804b6b0b	Unbreak TFO, that was broken with `8d5719aa74`. These two assignments are unnecessary and used to be there before TFO as an invariant. With TFO and after `8d5719aa74` the "so" value is still needed. Reported & tested by: tuexen Fixes: `8d5719aa74`	2021-06-22 16:03:44 -07:00
Lutz Donnerhacke	d261e57dea	libalias: Switch to efficient data structure for incoming traffic Current data structure is using a hash of unordered lists. Those unordered lists are quite efficient, because the least recently inserted entries are most likely to be used again. In order to avoid long search times in other cases, the lists are hashed into many buckets. Unfortunately a search for a miss needs an exhaustive inspection and a careful definition of the hash. Splay trees offer a similar feature: Almost O(1) for access of the least recently used entries, and amortized O(ln(n)) for almost all other cases. Get rid of the hash. Now the data structure should be able to quickly react to external packets without eating CPU cycles for breakfast, preventing a DoS. PR: 192888 Discussed with: Dimitry Luhtionov MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30536	2021-06-19 22:12:28 +02:00
Lutz Donnerhacke	935fc93af1	libalias: Switch to efficient data structure for outgoing traffic Current data structure is using a hash of unordered lists. Those unordered lists are quite efficient, because the least recently inserted entries are most likely to be used again. In order to avoid long search times in other cases, the lists are hashed into many buckets. Unfortunately a search for a miss needs an exhaustive inspection and a careful definition of the hash. Splay trees offer a similar feature - almost O(1) for access of the least recently used entries, and amortized O(ln(n)) for almost all other cases. Get rid of the hash. Discussed with: Dimitry Luhtionov MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30516	2021-06-19 22:09:44 +02:00
Lutz Donnerhacke	d989935b5b	libalias: Restructure - Finalize Note, that the restructuring is done. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30582	2021-06-19 21:58:56 +02:00
Lutz Donnerhacke	fe83900f9f	libalias: Restructure - Remove temporary state deleteAllLinks from global struct The entry deleteAllLinks in the struct libalias is only used to signal a state between internal calls. It's not used between API calls. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30604	2021-06-19 21:55:11 +02:00
Lutz Donnerhacke	9efcad61d8	libalias: Restructure - Use AliasRange instead of PORT_BASE Get rid of PORT_BASE, replace by AliasRange. Simplify code. Factor out the search for a new port. Improves the performance a bit. Discussed with: Dimitry Luhtionov MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30581	2021-06-19 21:40:09 +02:00
Lutz Donnerhacke	1178dda53d	libalias: Restructure - Table for PPTP Let PPTP use its own data structure. Regroup and rename other lists, which are not PPTP. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30580	2021-06-19 21:26:31 +02:00
Lutz Donnerhacke	7b44ff4c52	libalias: Restructure - Group expire handling entries Reorder the internal structure semantically. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30575	2021-06-19 21:12:27 +02:00
Lutz Donnerhacke	492d3b7109	libalias: Restructure - Group incoming links Reorder incoming links by grouping of common search terms. Significant performance improvement for incoming (missing) flows. Remove LSNAT from outgoing search. Slight speedup due to fewer comparisons in the loop. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30574	2021-06-19 21:03:47 +02:00
Lutz Donnerhacke	d4ab07d2ae	libalias: Restructure - Cleanup and Use for links Factor out a common idiom to return found links. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30573	2021-06-19 20:28:53 +02:00
Lutz Donnerhacke	d541903438	libalias: Restructure - Outgoing search Factor out the outgoing search function. Preparation for a new data structure. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30572	2021-06-19 20:25:08 +02:00
Lutz Donnerhacke	19dcc4f225	libalias: Restructure - Cleanup _FindLinkIn Simplify program flow in function _FindLinkIn. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30571	2021-06-19 20:19:16 +02:00
Lutz Donnerhacke	cac129e603	libalias: Restructure - Table for partially links Separate the partially specified links into a separate data structure. This would cause a major performance impact, if there are many of them. Use a (smaller) hash table to speed up the partially link access. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30570	2021-06-19 20:03:08 +02:00
Richard Scheffenegger	74d7fc8753	tcp: Add PRR cwnd reduction for non-SACK loss This completes PRR cwnd reduction in all circumstances for the base TCP stack (SACK loss recovery, ECN window reduction, non-SACK loss recovery), preventing the arriving ACKs to clock out new data at the old, too high rate. This reduces the chance to induce additional losses while recovering from loss (during congested network conditions). For non-SACK loss recovery, each ACK is assumed to have one MSS delivered. In order to prevent ACK-split attacks, only one window worth of ACKs is considered to actually have delivered new data. MFC after: 6 weeks Reviewed By: rrs, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29441	2021-06-19 19:25:22 +02:00
Lutz Donnerhacke	32f9c2ceb3	libalias: Restructure - Separate fully qualified search Search fully specified links first. Some performance loss due to need to revisit the db twice, if not found. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30569	2021-06-19 19:21:05 +02:00
Lutz Donnerhacke	d41044ddfd	libalias: Restructure - Common search terms Factor out the common Out and In filter Slightly better performance due to eager skip of search loop MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30568	2021-06-19 18:58:52 +02:00
Lutz Donnerhacke	ef828d39be	libalias: Promote per instance global variable timeStamp Summary: - Use LibAliasTime as a real global variable for central timekeeping. - Reduce number of syscalls in user space considerably. - Dynamically adjust the packet counters to match the second resolution. - Only check the first few packets after a time increase for expiry. Discussed with: hselasky MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30566	2021-06-19 18:25:44 +02:00
Lutz Donnerhacke	3fd20a79e7	libalias: Stats are unsigned Stats counters are used as unsigned valued (i.e. printf("%u")) but are defined as signed int. This causes trouble later, so fix it early. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30587	2021-06-19 18:21:17 +02:00
Mark Johnston	a100217489	Consistently use the SOCKBUF_MTX() and SOCK_MTX() macros This makes it easier to change the socket locking protocols. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-06-14 17:32:32 -04:00
Mark Johnston	f4bb1869dd	Consistently use the SOLISTENING() macro Some code was using it already, but in many places we were testing SO_ACCEPTCONN directly. As a small step towards fixing some bugs involving synchronization with listen(2), make the kernel consistently use SOLISTENING(). No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-06-14 17:32:27 -04:00
Michael Tuexen	f1536bb538	tcp: remove debug output from RACK Reported by: iron.udjin@gmail.com, Marek Zarychta Reviewed by: rrs PR: 256538 MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D30723 Sponsored by: Netflix, Inc.	2021-06-11 20:23:39 +02:00
Randall Stewart	ba1b3e48f5	tcp: Missing mfree in rack and bbr Recently (Nov) we added logic that protects against a peer negotiating a timestamp, and then not including a timestamp. This involved in the input path doing a goto done_with_input label. Now I suspect the code was cribbed from one in Rack that has to do with the SYN. This had a bug, i.e. it should have a m_freem(m) before going to the label (bbr had this missing m_freem() but rack did not). This then caused the missing m_freem to show up in both BBR and Rack. Also looking at the code referencing m->m_pkthdr.lro_nsegs later (after processing) is not a good idea, even though it's only for logging. Best to copy that off before any frees can take place. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30727	2021-06-11 11:38:08 -04:00
Michael Tuexen	fa3746be42	tcp: fix two bugs in new reno * Completely initialise the CC module specific data * Use beta_ecn in case of an ECN event whenever ABE is enabled or it is requested by the stack. Reviewed by: rscheff, rrs MFC after: 3 days Sponsored by: Netflix, Inc.	2021-06-11 15:40:34 +02:00
Michael Tuexen	224cf7b35b	tcp: fix compilation of IPv4-only builds PR: 256538 Reported by: iron.udjin@gmail.com MFC after: 3 days Sponsored by: Netflix, Inc.	2021-06-11 09:50:46 +02:00
Lutz Donnerhacke	294799c6b0	libalias: tidy up housekeeping Replace current expensive, but sparsely called housekeeping by a single, repetitive action. This is part of a larger restructure of libalias in order to switch to more efficient data structures. The whole restructure process is split into 15 reviews to ease reviewing. All those steps will be squashed into a single commit for MFC in order to hide the intermediate states from production systems. Reviewed by: hselasky MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30277	2021-06-10 23:30:10 +02:00
Randall Stewart	67e892819b	tcp: Mbuf leak while holding a socket buffer lock. When running at NF the current Rack and BBR changes with the recent commits from Richard that cause the socket buffer lock to be held over the ip_output() call and then finally culminating in a call to tcp_handle_wakeup() we get a lot of leaked mbufs. I don't think that this leak is actually caused by holding the lock or what Richard has done, but is exposing some other bug that has probably been lying dormant for a long time. I will continue to look (using his changes) at what is going on to try to root cause out the issue. In the meantime I can't leave the leaks out for everyone else. So this commit will revert all of Richards changes and move both Rack and BBR back to just doing the old sorwakeup_locked() calls after messing with the so_rcv buffer. We may want to look at adding back in Richards changes after I have pinpointed the root cause of the mbuf leak and fixed it. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30704	2021-06-10 08:33:57 -04:00
Randall Stewart	b45daaea95	tcp: LRO timestamps have lost their previous precision Recently we had a rewrite to tcp_lro.c that was tested but one subtle change was the move to a less precise timestamp. This causes all kinds of chaos in tcp's that do pacing and needs to be fixed to use the more precise time that was there before. Reviewed by: mtuexen, gallatin, hselasky Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30695	2021-06-09 13:58:54 -04:00
Randall Stewart	4747500dea	tcp: A better fix for the previously attempted fix of the ack-war issue with tcp. So it turns out that my fix before was not correct. It ended with us failing some of the "improved" SYN tests, since we are not in the correct states. With more digging I have figured out the root of the problem is that when we receive a SYN\|FIN the reassembly code made it so we create a segq entry to hold the FIN. In the established state where we were not in order this would be correct i.e. a 0 len with a FIN would need to be accepted. But if you are in a front state we need to strip the FIN so we correctly handle the ACK but ignore the FIN. This gets us into the proper states and avoids the previous ack war. I back out some of the previous changes but then add a new change here in tcp_reass() that fixes the root cause of the issue. We still leave the rack panic fixes in place however. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30627	2021-06-04 05:26:43 -04:00
Mark Johnston	f96603b56f	tcp, udp: Permit binding with AF_UNSPEC if the address is INADDR_ANY Prior to commit `f161d294b` we only checked the sockaddr length, but now we verify the address family as well. This breaks at least ttcp. Relax the check to avoid breaking compatibility too much: permit AF_UNSPEC if the address is INADDR_ANY. Fixes: `f161d294b` Reported by: Bakul Shah <bakul@iitbombay.org> Reviewed by: tuexen MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30539	2021-05-31 18:53:34 -04:00
Lutz Donnerhacke	bec0a5dca7	libalias: Remove LibAliasCheckNewLink Finally drop the function in 14-CURRENT. Discussed with: kp Differential Revision: https://reviews.freebsd.org/D30275	2021-05-31 13:04:11 +02:00
Lutz Donnerhacke	bfd41ba1fe	libalias: Remove unused function LibAliasCheckNewLink The functionality to detect a newly created link after processing a single packet is decoupled from the packet processing. Every new packet is processed asynchronously and will reset the indicator, hence the function is unusable. I made a Google search for third party code, which uses the function, and failed to find one. That's why the function should be removed: It is unusable and unused. A much simplified API/ABI will remain in anything below 14. Discussed with: kp Reviewed by: manpages (bcr) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30275	2021-05-31 12:53:57 +02:00
Wojciech Macek	d40cd26a86	ip_mroute: rework ip_mroute Approved by: mw Obtained from: Semihalf Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D30354 Changes: 1. add spinlock to bw_meter If two contexts read and modify bw_meter values it might happen that these are corrupted. Guard only code fragments which do read-and-modify. Contexts which only do "reads" are not placed inside the spinlock block. The only side effect that can happen is a 1-packet outdated value reported back to userspace. 2. replace all locks with a single RWLOCK Multiple locks caused a performance issue in routing hot path, when two of them had to be taken. All locks were replaced with single RWLOCK which makes the hot path able to take only shared access to lock most of the time. All configuration routines have to take exclusive lock (as it was done before) but these operations are very rare compared to packet routing. 3. redesign MFC expire and UPCALL expire Use generic kthread and cv_wait/cv_signal for deferring work. Previously, upcalls could be sent from two contexts which complicated the design. All upcall sending is now done in a kthread which allows hot path to work more efficiently in some rare cases. 4. replace mutex-guarded linked list with lock free buf_ring All messages and data are now passed over lockless buf_ring. This allowed removing some heavy locking that was needed when linked lists were used.	2021-05-31 05:48:15 +02:00
Lutz Donnerhacke	b03a41befe	libalias: Fix naming and initialization of a constant The commit `189f8eea` contains a refactorisation of a constant. During later review D30283 the naming of the constant was improved and the initialization became explicit. Put this into the tree, in order to MFC the correct naming.	2021-05-30 15:47:29 +02:00
Randall Stewart	8c69d988a8	tcp: When we have an out-of-order FIN we do want to strip off the FIN bit. The last set of commits fixed both a panic (in rack) and an ACK-war (in freebsd and bbr). However there was a missing case, i.e. where we get an out-of-order FIN by itself. In such a case we don't want to leave the FIN bit set, otherwise we will do the wrong thing and ack the FIN incorrectly. Instead we need to go through the tcp_reass() code and that way the FIN will be stripped and all will be well. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30497	2021-05-27 10:50:32 -04:00
Richard Scheffenegger	c358f1857f	tcp: Use local CC data only in the correct context Most CC algos do use local data, and when calling newreno_cong_signal from there, the latter misinterprets the data as its own struct, leading to incorrect behavior. Reported by: chengc_netapp.com Reviewed By: chengc_netapp.com, tuexen, #transport MFC after: 3 days Sponsored By: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D30470	2021-05-26 20:15:53 +02:00
Randall Stewart	4f3addd94b	tcp: Add a socket option to rack so we can test various changes to the slop value in timers. Timer_slop, in TCP, has been 200ms for a long time. This value dates back a long time when delayed ack timers were longer and links were slower. A 200ms timer slop allows 1 MSS to be sent over a 60kbps link. It's possible that lowering this value to something more in line with today's delayed ack values (40ms) might improve TCP. This bit of code makes it so rack can, via a socket option, adjust the timer slop. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30249	2021-05-26 06:43:30 -04:00
Andrew Gallatin	086a35562f	tcp: enter network epoch when calling tfb_tcp_fb_fini We need to enter the network epoch when calling into tfb_tcp_fb_fini. I noticed this when I hit an assert running the latest rack Differential Revision: https://reviews.freebsd.org/D30407 Reviewed by: rrs, tuexen Sponsored by: Netflix	2021-05-25 13:45:37 -04:00
Randall Stewart	13c0e198ca	tcp: Fix bugs related to the PUSH bit and rack and an ack war Michaels testing with UDP tunneling found an issue with the push bit, which was only partly fixed in the last commit. The problem is the left edge gets transmitted before the adjustments are done to the send_map, this means that right edge bits must be considered to be added only if the entire RSM is being retransmitted. Now syzkaller also continued to find a crash, which Michael sent me the reproducer for. Turns out that the reproducer on default (freebsd) stack made the stack get into an ack-war with itself. After fixing the reference issues in rack the same ack-war was found in rack (and bbr). Basically what happens is we go into the reassembly code and lose the FIN bit. The trick here is we should not be going into the reassembly code if tlen == 0 i.e. the peer never sent you anything. That then gets the proper action on the FIN bit but then you end up in LAST_ACK with no timers running. This is because the usrclosed function gets called and the FIN's and such have already been exchanged. So when we should be entering FIN_WAIT2 (or even FIN_WAIT1) we get stuck in LAST_ACK. Fixing this means tweaking the usrclosed function so that we properly recognize the condition and drop into FIN_WAIT2 where a timer will allow at least TP_MAXIDLE before closing (to allow time for the peer to retransmit its FIN if the ack is lost). Setting the fast_finwait2 timer can speed this up in testing. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30451	2021-05-25 13:23:31 -04:00
Michael Tuexen	9bbd1a8fcb	tcp: fix a RACK socket buffer lock issue Fix a missing socket buffer unlocking of the socket receive buffer. Reviewed by: gallatin, rrs MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D30402	2021-05-24 20:31:23 +02:00
Randall Stewart	631449d5d0	tcp: Fix an issue with the PUSH bit as well as fill in the missing mtu change for fsb's The push bit itself was also not actually being properly moved to the right edge. The FIN bit was incorrectly on the left edge. We fix these two issues as well as plumb in the mtu_change for alternate stacks. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30413	2021-05-24 14:42:15 -04:00
Zhenlei Huang	03b0505b8f	ip_forward: Restore RFC reference Add RFC reference lost in `3d846e4822` PR: 255388 Reviewed By: rgrimes, donner, karels, marcus, emaste MFC after: 27 days Differential Revision: https://reviews.freebsd.org/D30374	2021-05-23 00:01:37 +02:00
Michael Tuexen	8923ce6304	tcp: Handle stack switch while processing socket options Handle the case where during socket option processing, the user switches a stack such that processing the stack specific socket option does not make sense anymore. Return an error in this case. MFC after: 1 week Reviewed by: markj Reported by: syzbot+a6e1d91f240ad5d72cd1@syzkaller.appspotmail.com Sponsored by: Netflix, Inc. Differential revision: https://reviews.freebsd.org/D30395	2021-05-22 14:39:36 +02:00
Richard Scheffenegger	3975688563	rack: honor prior socket buffer lock when doing the upcall While partially reverting D24237 with D29690, due to introducing some unintended effects for in-kernel TCP consumers, the preexisting lock on the socket send buffer was not considered properly. Found by: markj MFC after: 2 weeks Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D30390	2021-05-22 00:09:59 +02:00
Mark Johnston	7d2608a5d2	tcp: Make error handling in tcp_usr_send() more consistent - Free the input mbuf in a single place instead of in every error path. - Handle PRUS_NOTREADY consistently. - Flush the socket's send buffer if an implicit connect fails. At that point the mbuf has already been enqueued but we don't want to keep it in the send buffer. Reviewed by: gallatin, tuexen Discussed with: jhb MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30349	2021-05-21 17:45:18 -04:00
Richard Scheffenegger	032bf749fd	[tcp] Keep socket buffer locked until upcall r367492 would unlock the socket buffer before eventually calling the upcall. This leads to problematic interaction with NFS kernel server/client components (MP threads) accessing the socket buffer with potentially not correctly updated state. Reported by: rmacklem Reviewed By: tuexen, #transport Tested by: rmacklem, otis MFC after: 2 weeks Sponsored By: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29690	2021-05-21 11:07:51 +02:00
Michael Tuexen	500eb6dd80	tcp: Fix sending of TCP segments with IP level options When bringing in TCP over UDP support in https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605, the length of IP level options was considered when locating the transport header. This was incorrect and is fixed by this patch. X-MFC with: https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605 MFC after: 3 days Reviewed by: markj, rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D30358	2021-05-21 09:49:45 +02:00
Wojciech Macek	eedbbec3fd	ip_mroute: remove unused declarations fix build for non-x86 targets	2021-05-21 08:01:26 +02:00
Wojciech Macek	741afc6233	ip_mroute: refactor bw_meter API API should work as follows: - periodically report Lower-or-EQual bandwidth (LEQ) connections over kernel socket, if user application registered for such per-flow notifications - report Greater-or-EQual (GEQ) bandwidth as soon as it reaches specified value in configured time window Custom implementation of callouts was removed. There is no point of doing callout-wheel here as generic callouts are doing exactly the same. The performance is not critical for such reporting, so the biggest concern should be to have a code which can be easily maintained. This is a preparation for locking rework which is highly inefficient. Approved by: mw Sponsored by: Stormshield Obtained from: Semihalf Differential Revision: https://reviews.freebsd.org/D30210	2021-05-21 06:43:41 +02:00
Wojciech Macek	787845c0e8	Revert "ip_mroute: refactor bw_meter API" This reverts commit `d1cd99b147`.	2021-05-20 12:14:58 +02:00
Wojciech Macek	d1cd99b147	ip_mroute: refactor bw_meter API API should work as follows: - periodically report Lower-or-EQual bandwidth (LEQ) connections over kernel socket, if user application registered for such per-flow notifications - report Greater-or-EQual (GEQ) bandwidth as soon as it reaches specified value in configured time window Custom implementation of callouts was removed. There is no point of doing callout-wheel here as generic callouts are doing exactly the same. The performance is not critical for such reporting, so the biggest concern should be to have a code which can be easily maintained. This is a preparation for locking rework which is highly inefficient. Approved by: mw Sponsored by: Stormshield Obtained from: Semihalf Differential Revision: https://reviews.freebsd.org/D30210	2021-05-20 10:13:55 +02:00
Zhenlei Huang	3d846e4822	Do not forward datagrams originated by link-local addresses The current implementation of ip_input() rejects packets destined for 169.254.0.0/16, but not those originating from 169.254.0.0/16 link-local addresses. Fix to fully respect RFC 3927 section 2.7. PR: 255388 Reviewed by: donner, rgrimes, karels MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D29968	2021-05-18 22:59:46 +02:00
Lutz Donnerhacke	2e6b07866f	libalias: Ensure ASSERT behind variable declarations At some places the ASSERT was inserted before variable declarations are finished. This is fixed now. Reported by: kib Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30282	2021-05-16 02:28:36 +02:00
Lutz Donnerhacke	189f8eea13	libalias: replace placeholder with static constant The field nullAddress in struct libalias is never set and never used. It exists as a placeholder for an unused argument only. Reviewed by: hselasky MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D30253	2021-05-15 09:05:30 +02:00
Lutz Donnerhacke	effc8e57fb	libalias: Style cleanup libalias is a mixture of various coding styles modified by a series of different editors enforcing interesting conventions on spacing and comments. This patch is a baseline to start with a performance rework of libalias. Upcoming patches should focus on the code, not on the style. That's why most annoying style errors should be fixed beforehand. Reviewed by: hselasky Discussed by: emaste MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D30259	2021-05-15 08:57:55 +02:00
Randall Stewart	02cffbc250	tcp: Incorrect KASSERT causes a panic in rack Skyzall found an interesting panic in rack. When a SYN and FIN are both sent together a KASSERT gets tripped where it is validating that a mbuf pointer is in the sendmap. But a SYN and FIN often will not have a mbuf pointer. So the fix is two fold a) make sure that the SYN and FIN split the right way when cloning an RSM SYN on left edge and FIN on right. And also make sure the KASSERT properly accounts for the case that we have a SYN or FIN so we don't panic. Reviewed by: mtuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D30241	2021-05-13 07:36:04 -04:00
Michael Tuexen	eec6aed5b8	sctp: fix another locking bug in COOKIE handling Thanks to Tolya Korniltsev for reporting the issue for the userland stack and testing the fix. MFC after: 3 days	2021-05-12 23:05:28 +02:00
Mark Johnston	d8acd2681b	Fix mbuf leaks in various pru_send implementations The various protocol implementations are not very consistent about freeing mbufs in error paths. In general, all protocols must free both "m" and "control" upon an error, except if PRUS_NOTREADY is specified (this is only implemented by TCP and unix(4) and requires further work not handled in this diff), in which case "control" still must be freed. This diff plugs various leaks in the pru_send implementations. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30151	2021-05-12 13:00:09 -04:00
Michael Tuexen	251842c639	tcp rack: improve initialisation of retransmit timeout When the TCP is in the front states, don't take the slop variable into account. This improves consistency with the base stack. Reviewed by: rrs@ Differential Revision: https://reviews.freebsd.org/D30230 MFC after: 1 week Sponsored by: Netflix, Inc.	2021-05-12 18:02:21 +02:00
Michael Tuexen	12dda000ed	sctp: fix locking in case of error handling during a restart Thanks to Taylor Brandstetter for finding the issue and providing a patch for the userland stack. MFC after: 3 days	2021-05-12 15:29:06 +02:00
Randall Stewart	4b86a24a76	tcp: In rack, we must only convert restored rtt when the hostcache does restore them. Rack now after the previous commit is very careful to translate any value in the hostcache for srtt/rttvar into its proper format. However there is a snafu here in that if tp->srtt is 0 is the only time that the HC will actually restore the srtt. We need to then only convert the srtt restored when it is actually restored. We do this by making sure it was zero before the call to cc_conn_init and it is non-zero afterwards. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30213	2021-05-11 08:15:05 -04:00
Wojciech Macek	0b103f7237	mrouter: do not loopback packets unconditionally Looping back router multicast traffic significantly stresses network stack. Add possibility to disable or enable loopback based on sysctl value. Reported by: Daniel Deville Reviewed by: mw Differential Revision: https://reviews.freebsd.org/D29947	2021-05-11 12:36:07 +02:00
Wojciech Macek	65634ae748	mroute: fix race condition during mrouter shutting down There is a race condition between V_ip_mrouter de-init and ip_mforward handling. It might happen that mrouted is cleaned up after V_ip_mrouter check and before processing packet in ip_mforward. Use epoch call approach, similar to IPSec which also handles such case. Reported by: Damien Deville Obtained from: Stormshield Reviewed by: mw Differential Revision: https://reviews.freebsd.org/D29946	2021-05-11 12:34:20 +02:00
Richard Scheffenegger	0471a8c734	tcp: SACK Lost Retransmission Detection (LRD) Recover from excessive losses without reverting to a retransmission timeout (RTO). Disabled by default, enable with sysctl net.inet.tcp.do_lrd=1 Reviewed By: #transport, rrs, tuexen, #manpages Sponsored by: Netapp, Inc. Differential Revision: https://reviews.freebsd.org/D28931	2021-05-10 19:06:20 +02:00
Randall Stewart	9867224bab	tcp: Host cache and rack ending up with incorrect values. The hostcache up to now has been updated in the discard callback but without checking if we are all done (the race where there is more than one call and the counter has not yet reached zero). This means that when the race occurs, we end up calling the hc_update more than once. Also alternate stacks can keep their srtt/rttvar in different formats (example rack keeps its values in microseconds). Since we call the hc_update before the stack fini() then the values will be in the wrong format. Rack on the other hand, needs to convert items pulled from the hostcache into its internal format else it may end up with very much incorrect values from the hostcache. In the process let's commonize the update mechanism for srtt/rttvar since we now have more than one place that needs to call it. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30172	2021-05-10 11:25:51 -04:00
Randall Stewart	5a4333a537	This takes Warner's suggested approach to making it so that platforms that for whatever reason cannot include the RATELIMIT option can still work with rack. It adds two dummy functions that rack will call and find out that the highest hw supported b/w is 0 (which kinda makes sense and rack is already prepared to handle). Reviewed by: Michael Tuexen, Warner Losh Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30163	2021-05-07 17:32:32 -04:00
Mark Johnston	a1fadf7de2	divert: Fix mbuf ownership confusion in div_output() div_output_outbound() and div_output_inbound() relied on the caller to free the mbuf if an error occurred. However, this is contrary to the semantics of their callees, ip_output(), ip6_output() and netisr_queue_src(), which always consume the mbuf. So, if one of these functions returned an error, that would get propagated up to div_output(), resulting in a double free. Fix the problem by making div_output_outbound() and div_output_inbound() responsible for freeing the mbuf in all cases. Reported by: Michael Schmiedgen <schmiedgen@gmx.net> Tested by: Michael Schmiedgen Reviewed by: donner MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30129	2021-05-07 14:31:08 -04:00
Randall Stewart	a16cee0218	Fix a UDP tunneling issue with rack. Basically there are two issues. A) Not enough hdrlen was being calculated when a UDP tunnel is in place, and B) Not enough memory is allocated in rack's fsb. We need to overbook the fsb to include a udphdr just in case. Submitted by: Peter Lei Reviewed by: Michael Tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30157	2021-05-07 14:06:43 -04:00
Gleb Smirnoff	be578b67b5	tcp_twcheck(): use correct unlock macro. This crept in due to a conflict between the last two commits `1db08fbe3f` and `9e644c2300`. Submitted by: Peter Lei	2021-05-06 10:19:21 -07:00
Randall Stewart	5d8fd932e4	This brings into sync FreeBSD with the netflix versions of rack and bbr. This fixes several breakages (panics) since the tcp_lro code was committed that have been reported. Quite a few new features are now in rack (perfecting of DGP -- Dynamic Goodput Pacing among the largest). There is also support for ack-war prevention. Documents coming soon on rack. Sponsored by: Netflix Reviewed by: rscheff, mtuexen Differential Revision: https://reviews.freebsd.org/D30036	2021-05-06 11:22:26 -04:00
Michael Tuexen	d1cb8d11b0	sctp: improve consistency when handling chunks of wrong size MFC after: 3 days	2021-05-06 01:02:41 +02:00
Mark Johnston	6c34dde83e	igmp: Avoid an out-of-bounds access when zeroing counters When verifying, byte-by-byte, that the user-supplied counters are zero-filled, sysctl_igmp_stat() would check for zero before checking the loop bound. Perform the checks in the correct order. Reported by: KASAN MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-05-05 17:12:51 -04:00
Marko Zec	2aca58e16f	Introduce DXR as an IPv4 longest prefix matching / FIB module DXR maintains compressed lookup structures with a trivial search procedure. A two-stage trie is indexed by the more significant bits of the search key (IPv4 address), while the remaining bits are used for finding the next hop in a sorted array. The tradeoff between memory footprint and search speed depends on the split between the trie and the remaining binary search. The default of 20 bits of the key being used for trie indexing yields good performance (see below) with footprints of around 2.5 Bytes per prefix with current BGP snapshots. Rebuilding lookup structures takes some time, which is compensated for by batching several RIB change requests into a single FIB update, i.e. FIB synchronization with the RIB may be delayed for a fraction of a second. RIB to FIB synchronization, next-hop table housekeeping, and lockless lookup capability are provided by the FIB_ALGO infrastructure. DXR works well on modern CPUs with several MBytes of caches, especially in VMs, where it outperforms other currently available IPv4 FIB algorithms by a large margin. Synthetic single-thread LPM throughput test method: kldload test_lookup; kldload dpdk_lpm4; kldload fib_dxr sysctl net.route.test.run_lps_rnd=N sysctl net.route.test.run_lps_seq=N where N is the number of randomly generated keys (IPv4 addresses) which should be chosen so that each test iteration runs for several seconds.
Each reported score represents the best of three runs, in million lookups per second (MLPS), for two benchmarks (RND & SEQ) with two FIBs:
host: single interface address, local subnet route + default route
BGP: snapshot from linx.routeviews.org, 887957 prefixes, 496 next hops

Bhyve VM on an Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60 GHz:
inet.algo         host, RND   host, SEQ   BGP, RND   BGP, SEQ
bsearch4               40.6        20.2        N/A        N/A
radix4                  7.8         3.8        1.2        0.6
radix4_lockless        18.0         9.0        1.6        0.8
dpdk_lpm4              14.4         5.0       14.6        5.0
dxr                    70.3        34.7       43.0       19.5

Intel(R) Core(TM) i5-5300U CPU @ 2.30 GHz:
inet.algo         host, RND   host, SEQ   BGP, RND   BGP, SEQ
bsearch4               47.0        23.1        N/A        N/A
radix4                  8.5         4.2        1.9        1.0
radix4_lockless        19.2         9.5        2.5        1.2
dpdk_lpm4              31.2         9.4       31.6        9.3
dxr                    84.9        41.4       51.7       23.6

Intel(R) Core(TM) i7-4771 CPU @ 3.50 GHz:
inet.algo         host, RND   host, SEQ   BGP, RND   BGP, SEQ
bsearch4               59.5        29.4        N/A        N/A
radix4                 10.8         5.5        2.5        1.3
radix4_lockless        24.7        12.0        3.1        1.6
dpdk_lpm4              29.1         9.0       30.2        9.1
dxr                   101.3        49.9       69.8       32.5

AMD Ryzen 7 3700X 8-Core Processor @ 3.60 GHz:
inet.algo         host, RND   host, SEQ   BGP, RND   BGP, SEQ
bsearch4               70.8        35.4        N/A        N/A
radix4                 14.4         7.2        2.8        1.4
radix4_lockless        30.2        15.1        3.7        1.8
dpdk_lpm4              29.9         9.0       30.0        8.9
dxr                   163.3        81.5       99.5       44.4

AMD Ryzen 5 5600X 6-Core Processor @ 3.70 GHz:
inet.algo         host, RND   host, SEQ   BGP, RND   BGP, SEQ
bsearch4               93.6        46.7        N/A        N/A
radix4                 18.9         9.3        4.3        2.1
radix4_lockless        37.2        18.6        5.3        2.7
dpdk_lpm4              51.8        15.1       51.6       14.9
dxr                   218.2       103.3      114.0       49.0

Reviewed by: melifaro MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29821	2021-05-05 13:45:52 +02:00
Michael Tuexen	b621fbb1bf	sctp: drop packet with SHUTDOWN-ACK chunks with wrong vtags MFC after: 3 days	2021-05-04 18:43:31 +02:00
Mark Johnston	f161d294b9	Add missing sockaddr length and family validation to various protocols Several protocol methods take a sockaddr as input. In some cases the sockaddr lengths were not being validated, or were validated after some out-of-bounds accesses could occur. Add requisite checking to various protocol entry points, and convert some existing checks to assertions where appropriate. Reported by: syzkaller+KASAN Reviewed by: tuexen, melifaro MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29519	2021-05-03 13:35:19 -04:00
Michael Tuexen	8b3d0f6439	sctp: improve address list scanning If the alternate address has to be removed, force the stack to find a new one, if it is still needed. MFC after: 3 days	2021-05-03 02:50:05 +02:00
Michael Tuexen	a89481d328	sctp: improve restart handling This fixes in particular a possible use after free bug reported by Anatoly Korniltsev and Taylor Brandstetter for the userland stack. MFC after: 3 days	2021-05-03 02:20:24 +02:00
Alexander Motin	655c200cc8	Fix build after `5f2e183505`.	2021-05-02 20:07:38 -04:00
Michael Tuexen	5f2e183505	sctp: improve error handling in INIT/INIT-ACK processing When processing INIT and INIT-ACK information, also during COOKIE processing, delete the current association, when it would end up in an inconsistent state. MFC after: 3 days	2021-05-02 22:41:35 +02:00
Michael Tuexen	e010d20032	sctp: update the vtag for INIT and INIT-ACK chunks This is needed in case of responding with an ABORT to an INIT-ACK.	2021-04-30 13:33:16 +02:00
Michael Tuexen	eb79855920	sctp: fix SCTP_PEER_ADDR_PARAMS socket option Ignore spp_pathmtu if it is 0, when setting the IPPROTO_SCTP level socket option SCTP_PEER_ADDR_PARAMS as required by RFC 6458. MFC after: 1 week	2021-04-30 12:31:09 +02:00
Michael Tuexen	eecdf5220b	sctp: use RTO.Initial of 1 second as specified in RFC 4960bis	2021-04-30 00:45:56 +02:00
Michael Tuexen	9de7354bb8	sctp: improve consistency in handling chunks with wrong size Just skip the chunk, if no other handling is required by the specification.	2021-04-28 18:11:06 +02:00
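
Note on 6c34dde83e (igmp) above: the message describes a byte-by-byte zero check that tested the byte before testing the loop bound. The following is a minimal sketch of the corrected ordering, not the kernel's actual code; the helper name `all_zero` and its signature are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not taken from sysctl_igmp_stat()): returns 1 if
 * the first len bytes of buf are all zero, 0 otherwise.
 */
static int
all_zero(const unsigned char *buf, size_t len)
{
	size_t i;

	assert(buf != NULL);
	/*
	 * Test the bound first, then dereference. With the operands
	 * swapped (buf[i] == 0 && i < len), a buffer that is entirely
	 * zero causes a read of buf[len], one byte past the end, before
	 * the bound is ever consulted.
	 */
	for (i = 0; i < len && buf[i] == 0; i++)
		continue;
	return (i == len);
}
```

Because `&&` short-circuits left to right, placing `i < len` first guarantees that `buf[i]` is only read for valid indices. This is the kind of off-by-one read that KASAN, credited in the commit, is designed to catch.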

7236 commits