Commit graph

74647 commits

Author SHA1 Message Date
felix bfca5fb4e9 SUNRPC: Fix RPC client cleaned up the freed pipefs dentries
RPC client pipefs dentries cleanup is in separated rpc_remove_pipedir()
workqueue,which takes care about pipefs superblock locking.
In some special scenarios, when kernel frees the pipefs sb of the
current client and immediately alloctes a new pipefs sb,
rpc_remove_pipedir function would misjudge the existence of pipefs
sb which is not the one it used to hold. As a result,
the rpc_remove_pipedir would clean the released freed pipefs dentries.

To fix this issue, rpc_remove_pipedir should check whether the
current pipefs sb is consistent with the original pipefs sb.

This error can be catched by KASAN:
=========================================================
[  250.497700] BUG: KASAN: slab-use-after-free in dget_parent+0x195/0x200
[  250.498315] Read of size 4 at addr ffff88800a2ab804 by task kworker/0:18/106503
[  250.500549] Workqueue: events rpc_free_client_work
[  250.501001] Call Trace:
[  250.502880]  kasan_report+0xb6/0xf0
[  250.503209]  ? dget_parent+0x195/0x200
[  250.503561]  dget_parent+0x195/0x200
[  250.503897]  ? __pfx_rpc_clntdir_depopulate+0x10/0x10
[  250.504384]  rpc_rmdir_depopulate+0x1b/0x90
[  250.504781]  rpc_remove_client_dir+0xf5/0x150
[  250.505195]  rpc_free_client_work+0xe4/0x230
[  250.505598]  process_one_work+0x8ee/0x13b0
...
[   22.039056] Allocated by task 244:
[   22.039390]  kasan_save_stack+0x22/0x50
[   22.039758]  kasan_set_track+0x25/0x30
[   22.040109]  __kasan_slab_alloc+0x59/0x70
[   22.040487]  kmem_cache_alloc_lru+0xf0/0x240
[   22.040889]  __d_alloc+0x31/0x8e0
[   22.041207]  d_alloc+0x44/0x1f0
[   22.041514]  __rpc_lookup_create_exclusive+0x11c/0x140
[   22.041987]  rpc_mkdir_populate.constprop.0+0x5f/0x110
[   22.042459]  rpc_create_client_dir+0x34/0x150
[   22.042874]  rpc_setup_pipedir_sb+0x102/0x1c0
[   22.043284]  rpc_client_register+0x136/0x4e0
[   22.043689]  rpc_new_client+0x911/0x1020
[   22.044057]  rpc_create_xprt+0xcb/0x370
[   22.044417]  rpc_create+0x36b/0x6c0
...
[   22.049524] Freed by task 0:
[   22.049803]  kasan_save_stack+0x22/0x50
[   22.050165]  kasan_set_track+0x25/0x30
[   22.050520]  kasan_save_free_info+0x2b/0x50
[   22.050921]  __kasan_slab_free+0x10e/0x1a0
[   22.051306]  kmem_cache_free+0xa5/0x390
[   22.051667]  rcu_core+0x62c/0x1930
[   22.051995]  __do_softirq+0x165/0x52a
[   22.052347]
[   22.052503] Last potentially related work creation:
[   22.052952]  kasan_save_stack+0x22/0x50
[   22.053313]  __kasan_record_aux_stack+0x8e/0xa0
[   22.053739]  __call_rcu_common.constprop.0+0x6b/0x8b0
[   22.054209]  dentry_free+0xb2/0x140
[   22.054540]  __dentry_kill+0x3be/0x540
[   22.054900]  shrink_dentry_list+0x199/0x510
[   22.055293]  shrink_dcache_parent+0x190/0x240
[   22.055703]  do_one_tree+0x11/0x40
[   22.056028]  shrink_dcache_for_umount+0x61/0x140
[   22.056461]  generic_shutdown_super+0x70/0x590
[   22.056879]  kill_anon_super+0x3a/0x60
[   22.057234]  rpc_kill_sb+0x121/0x200

Fixes: 0157d021d2 ("SUNRPC: handle RPC client pipefs dentries by network namespace aware routines")
Signed-off-by: felix <fuzhen5@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-11-01 15:40:44 -04:00
Dan Carpenter 4f3ed83718 SUNRPC: Add an IS_ERR() check back to where it was
This IS_ERR() check was deleted during in a cleanup because, at the time,
the rpcb_call_async() function could not return an error pointer.  That
changed in commit 25cf32ad5d ("SUNRPC: Handle allocation failure in
rpc_new_task()") and now it can return an error pointer.  Put the check
back.

A related revert was done in commit 13bd901418 ("Revert "SUNRPC:
Remove unreachable error condition"").

Fixes: 037e910b52 ("SUNRPC: Remove unreachable error condition in rpcb_getport_async()")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-11-01 15:40:44 -04:00
Trond Myklebust 59464b262f SUNRPC: SOFTCONN tasks should time out when on the sending list
SOFTCONN tasks need to periodically check if the transport is still
connected, so that they can time out if that is not the case.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-10-22 19:47:57 -04:00
Trond Myklebust f663507e29 SUNRPC: Force close the socket when a hard error is reported
Fix up xs_wake_error() to close the socket when a hard error is being
reported. Usually, that means an ECONNRESET was received on a connection
attempt.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-10-22 19:47:57 -04:00
Trond Myklebust caa388f7e5 SUNRPC: Don't skip timeout checks in call_connect_status()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-10-22 19:47:56 -04:00
Trond Myklebust 4b09ca1508 SUNRPC: ECONNRESET might require a rebind
If connect() is returning ECONNRESET, it usually means that nothing is
listening on that port. If so, a rebind might be required in order to
obtain the new port on which the RPC service is listening.

Fixes: fd01b25979 ("SUNRPC: ECONNREFUSED should cause a rebind.")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-10-22 19:47:56 -04:00
Linus Torvalds ce55c22ec8 Including fixes from bluetooth, netfilter, WiFi.
Feels like an up-tick in regression fixes, mostly for older releases.
 The hfsc fix, tcp_disconnect() and Intel WWAN fixes stand out as fairly
 clear-cut user reported regressions. The mlx5 DMA bug was causing strife
 for 390x folks. The fixes themselves are not particularly scary, tho.
 No open investigations / outstanding reports at the time of writing.
 
 Current release - regressions:
 
  - eth: mlx5: perform DMA operations in the right locations,
    make devices usable on s390x, again
 
  - sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes a inner curve,
    previous fix of rejecting invalid config broke some scripts
 
  - rfkill: reduce data->mtx scope in rfkill_fop_open, avoid deadlock
 
  - revert "ethtool: Fix mod state of verbose no_mask bitset",
    needs more work
 
 Current release - new code bugs:
 
  - tcp: fix listen() warning with v4-mapped-v6 address
 
 Previous releases - regressions:
 
  - tcp: allow tcp_disconnect() again when threads are waiting,
    it was denied to plug a constant source of bugs but turns out
    .NET depends on it
 
  - eth: mlx5: fix double-free if buffer refill fails under OOM
 
  - revert "net: wwan: iosm: enable runtime pm support for 7560",
    it's causing regressions and the WWAN team at Intel disappeared
 
  - tcp: tsq: relax tcp_small_queue_check() when rtx queue contains
    a single skb, fix single-stream perf regression on some devices
 
 Previous releases - always broken:
 
  - Bluetooth:
    - fix issues in legacy BR/EDR PIN code pairing
    - correctly bounds check and pad HCI_MON_NEW_INDEX name
 
  - netfilter:
    - more fixes / follow ups for the large "commit protocol" rework,
      which went in as a fix to 6.5
    - fix null-derefs on netlink attrs which user may not pass in
 
  - tcp: fix excessive TLP and RACK timeouts from HZ rounding
    (bless Debian for keeping HZ=250 alive)
 
  - net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation, prevent
    letting frankenstein UDP super-frames from getting into the stack
 
  - net: fix interface altnames when ifc moves to a new namespace
 
  - eth: qed: fix the size of the RX buffers
 
  - mptcp: avoid sending RST when closing the initial subflow
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmUxawUACgkQMUZtbf5S
 Irt75w/+MC2ilnFQhfoQqpCxL2uHWOKzhenUz2PoNKR60OKgLLmTq8YBYcpyCVAQ
 wBQaNzNu1/yjF5BG7aNS/j0suGeJsOfhfoahQvXlLaat9NuDqxTpaeoT5FZ7eNQw
 RZ8+CJug3BRDV0TkaJH9UDVC/nfJTsnsGWNIhNXYGPsuveqAUun+xrnN8ZbvZIrn
 6D9rMF+u9SdVO+ANCquXBC7+CWEWiJS1ljUrU7BRNiv/9FSlnPQtjOdpuKleeBO8
 4usMS7TezHNgRdiAKC8GjSGUiIkIIMJT4y4wuczBEQAD4Pkki9UpBrui97ozOj7h
 W4N7UOuPlUBIardvKNoYz9rZyiFXBcPPm0GruHiuCqpxyqmzoFgv2XJsb/6KfzNn
 Dyro+lvh8smtbFHvFqiwaNu5y8ucfClaowvR4gjSe2KcB7hIpwNkh6vWC6OMGJK3
 hiKHnDrnXBQMbnP1YfiJ4feLmm3UYCG8eFdv/ULZT0a9TzZ7fKfzAfywUwD+/O8Y
 +S28Hr9srdDCHO7ih/gF3Wq9wtnnLy8QEkpt7cpXjRDj0uWH8JkHU3YEIGF2814Y
 LNVGmX9y6RcgrHNp03K1PmUcgAzhTTuV9QRoQKEucuBT5AK9ALDQ8YomnWDWDgrp
 UOdJPi1RUTsmqslADF15wZ9W5Ki/cDnUJsE4HU/MtnM2w95C49Q=
 =z1F6
 -----END PGP SIGNATURE-----

Merge tag 'net-6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth, netfilter, WiFi.

  Feels like an up-tick in regression fixes, mostly for older releases.
  The hfsc fix, tcp_disconnect() and Intel WWAN fixes stand out as
  fairly clear-cut user reported regressions. The mlx5 DMA bug was
  causing strife for 390x folks. The fixes themselves are not
  particularly scary, tho. No open investigations / outstanding reports
  at the time of writing.

  Current release - regressions:

   - eth: mlx5: perform DMA operations in the right locations, make
     devices usable on s390x, again

   - sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes a inner
     curve, previous fix of rejecting invalid config broke some scripts

   - rfkill: reduce data->mtx scope in rfkill_fop_open, avoid deadlock

   - revert "ethtool: Fix mod state of verbose no_mask bitset", needs
     more work

  Current release - new code bugs:

   - tcp: fix listen() warning with v4-mapped-v6 address

  Previous releases - regressions:

   - tcp: allow tcp_disconnect() again when threads are waiting, it was
     denied to plug a constant source of bugs but turns out .NET depends
     on it

   - eth: mlx5: fix double-free if buffer refill fails under OOM

   - revert "net: wwan: iosm: enable runtime pm support for 7560", it's
     causing regressions and the WWAN team at Intel disappeared

   - tcp: tsq: relax tcp_small_queue_check() when rtx queue contains a
     single skb, fix single-stream perf regression on some devices

  Previous releases - always broken:

   - Bluetooth:
      - fix issues in legacy BR/EDR PIN code pairing
      - correctly bounds check and pad HCI_MON_NEW_INDEX name

   - netfilter:
      - more fixes / follow ups for the large "commit protocol" rework,
        which went in as a fix to 6.5
      - fix null-derefs on netlink attrs which user may not pass in

   - tcp: fix excessive TLP and RACK timeouts from HZ rounding (bless
     Debian for keeping HZ=250 alive)

   - net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation, prevent
     letting frankenstein UDP super-frames from getting into the stack

   - net: fix interface altnames when ifc moves to a new namespace

   - eth: qed: fix the size of the RX buffers

   - mptcp: avoid sending RST when closing the initial subflow"

* tag 'net-6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
  Revert "ethtool: Fix mod state of verbose no_mask bitset"
  selftests: mptcp: join: no RST when rm subflow/addr
  mptcp: avoid sending RST when closing the initial subflow
  mptcp: more conservative check for zero probes
  tcp: check mptcp-level constraints for backlog coalescing
  selftests: mptcp: join: correctly check for no RST
  net: ti: icssg-prueth: Fix r30 CMDs bitmasks
  selftests: net: add very basic test for netdev names and namespaces
  net: move altnames together with the netdevice
  net: avoid UAF on deleted altname
  net: check for altname conflicts when changing netdev's netns
  net: fix ifname in netlink ntf during netns move
  net: ethernet: ti: Fix mixed module-builtin object
  net: phy: bcm7xxx: Add missing 16nm EPHY statistics
  ipv4: fib: annotate races around nh->nh_saddr_genid and nh->nh_saddr
  tcp_bpf: properly release resources on error paths
  net/sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes a inner curve
  net: mdio-mux: fix C45 access returning -EIO after API change
  tcp: tsq: relax tcp_small_queue_check() when rtx queue contains a single skb
  octeon_ep: update BQL sent bytes before ringing doorbell
  ...
2023-10-19 12:08:18 -07:00
Kory Maincent 524515020f Revert "ethtool: Fix mod state of verbose no_mask bitset"
This reverts commit 108a36d07c.

It was reported that this fix breaks the possibility to remove existing WoL
flags. For example:
~$ ethtool lan2
...
        Supports Wake-on: pg
        Wake-on: d
...
~$ ethtool -s lan2 wol gp
~$ ethtool lan2
...
        Wake-on: pg
...
~$ ethtool -s lan2 wol d
~$ ethtool lan2
...
        Wake-on: pg
...

This worked correctly before this commit because we were always updating
a zero bitmap (since commit 6699170376 ("ethtool: fix application of
verbose no_mask bitset"), that is) so that the rest was left zero
naturally. But now the 1->0 change (old_val is true, bit not present in
netlink nest) no longer works.

Reported-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Closes: https://lore.kernel.org/netdev/20231019095140.l6fffnszraeb6iiw@lion.mk-sys.cz/
Cc: stable@vger.kernel.org
Fixes: 108a36d07c ("ethtool: Fix mod state of verbose no_mask bitset")
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/r/20231019-feature_ptp_bitset_fix-v1-1-70f3c429a221@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-19 09:27:12 -07:00
Geliang Tang 14c56686a6 mptcp: avoid sending RST when closing the initial subflow
When closing the first subflow, the MPTCP protocol unconditionally
calls tcp_disconnect(), which in turn generates a reset if the subflow
is established.

That is unexpected and different from what MPTCP does with MPJ
subflows, where resets are generated only on FASTCLOSE and other edge
scenarios.

We can't reuse for the first subflow the same code in place for MPJ
subflows, as MPTCP clean them up completely via a tcp_close() call,
while must keep the first subflow socket alive for later re-usage, due
to implementation constraints.

This patch adds a new helper __mptcp_subflow_disconnect() that
encapsulates, a logic similar to tcp_close, issuing a reset only when
the MPTCP_CF_FASTCLOSE flag is set, and performing a clean shutdown
otherwise.

Fixes: c2b2ae3925 ("mptcp: handle correctly disconnect() failures")
Cc: stable@vger.kernel.org
Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231018-send-net-20231018-v1-4-17ecb002e41d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-19 09:10:00 -07:00
Paolo Abeni 72377ab2d6 mptcp: more conservative check for zero probes
Christoph reported that the MPTCP protocol can find the subflow-level
write queue unexpectedly not empty while crafting a zero-window probe,
hitting a warning:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 188 at net/mptcp/protocol.c:1312 mptcp_sendmsg_frag+0xc06/0xe70
Modules linked in:
CPU: 0 PID: 188 Comm: kworker/0:2 Not tainted 6.6.0-rc2-g1176aa719d7a #47
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
Workqueue: events mptcp_worker
RIP: 0010:mptcp_sendmsg_frag+0xc06/0xe70 net/mptcp/protocol.c:1312
RAX: 47d0530de347ff6a RBX: 47d0530de347ff6b RCX: ffff8881015d3c00
RDX: ffff8881015d3c00 RSI: 47d0530de347ff6b RDI: 47d0530de347ff6b
RBP: 47d0530de347ff6b R08: ffffffff8243c6a8 R09: ffffffff82042d9c
R10: 0000000000000002 R11: ffffffff82056850 R12: ffff88812a13d580
R13: 0000000000000001 R14: ffff88812b375e50 R15: ffff88812bbf3200
FS:  0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000695118 CR3: 0000000115dfc001 CR4: 0000000000170ef0
Call Trace:
 <TASK>
 __subflow_push_pending+0xa4/0x420 net/mptcp/protocol.c:1545
 __mptcp_push_pending+0x128/0x3b0 net/mptcp/protocol.c:1614
 mptcp_release_cb+0x218/0x5b0 net/mptcp/protocol.c:3391
 release_sock+0xf6/0x100 net/core/sock.c:3521
 mptcp_worker+0x6e8/0x8f0 net/mptcp/protocol.c:2746
 process_scheduled_works+0x341/0x690 kernel/workqueue.c:2630
 worker_thread+0x3a7/0x610 kernel/workqueue.c:2784
 kthread+0x143/0x180 kernel/kthread.c:388
 ret_from_fork+0x4d/0x60 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:304
 </TASK>

The root cause of the issue is that expectations are wrong: e.g. due
to MPTCP-level re-injection we can hit the critical condition.

Explicitly avoid the zero-window probe when the subflow write queue
is not empty and drop the related warnings.

Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/444
Fixes: f70cad1085 ("mptcp: stop relying on tcp_tx_skb_cache")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231018-send-net-20231018-v1-3-17ecb002e41d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-19 09:10:00 -07:00
Paolo Abeni 6db8a37dfc tcp: check mptcp-level constraints for backlog coalescing
The MPTCP protocol can acquire the subflow-level socket lock and
cause the tcp backlog usage. When inserting new skbs into the
backlog, the stack will try to coalesce them.

Currently, we have no check in place to ensure that such coalescing
will respect the MPTCP-level DSS, and that may cause data stream
corruption, as reported by Christoph.

Address the issue by adding the relevant admission check for coalescing
in tcp_add_backlog().

Note the issue is not easy to reproduce, as the MPTCP protocol tries
hard to avoid acquiring the subflow-level socket lock.

Fixes: 648ef4b886 ("mptcp: Implement MPTCP receive path")
Cc: stable@vger.kernel.org
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/420
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231018-send-net-20231018-v1-2-17ecb002e41d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-19 09:10:00 -07:00
Jakub Kicinski 8e15aee621 net: move altnames together with the netdevice
The altname nodes are currently not moved to the new netns
when netdevice itself moves:

  [ ~]# ip netns add test
  [ ~]# ip -netns test link add name eth0 type dummy
  [ ~]# ip -netns test link property add dev eth0 altname some-name
  [ ~]# ip -netns test link show dev some-name
  2: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
      link/ether 1e:67:ed:19:3d:24 brd ff:ff:ff:ff:ff:ff
      altname some-name
  [ ~]# ip -netns test link set dev eth0 netns 1
  [ ~]# ip link
  ...
  3: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
      link/ether 02:40:88:62:ec:b8 brd ff:ff:ff:ff:ff:ff
      altname some-name
  [ ~]# ip li show dev some-name
  Device "some-name" does not exist.

Remove them from the hash table when device is unlisted
and add back when listed again.

Fixes: 36fbf1e52b ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-19 15:51:16 +02:00
Jakub Kicinski 1a83f4a7c1 net: avoid UAF on deleted altname
Altnames are accessed under RCU (dev_get_by_name_rcu())
but freed by kfree() with no synchronization point.

Each node has one or two allocations (node and a variable-size
name, sometimes the name is netdev->name). Adding rcu_heads
here is a bit tedious. Besides most code which unlists the names
already has rcu barriers - so take the simpler approach of adding
synchronize_rcu(). Note that the one on the unregistration path
(which matters more) is removed by the next fix.

Fixes: ff92741270 ("net: introduce name_node struct to be used in hashlist")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-19 15:51:16 +02:00
Jakub Kicinski 7663d52209 net: check for altname conflicts when changing netdev's netns
It's currently possible to create an altname conflicting
with an altname or real name of another device by creating
it in another netns and moving it over:

 [ ~]$ ip link add dev eth0 type dummy

 [ ~]$ ip netns add test
 [ ~]$ ip -netns test link add dev ethX netns test type dummy
 [ ~]$ ip -netns test link property add dev ethX altname eth0
 [ ~]$ ip -netns test link set dev ethX netns 1

 [ ~]$ ip link
 ...
 3: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
     link/ether 02:40:88:62:ec:b8 brd ff:ff:ff:ff:ff:ff
 ...
 5: ethX: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
     link/ether 26:b7:28:78:38:0f brd ff:ff:ff:ff:ff:ff
     altname eth0

Create a macro for walking the altnames, this hopefully makes
it clearer that the list we walk contains only altnames.
Which is otherwise not entirely intuitive.

Fixes: 36fbf1e52b ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-19 15:51:16 +02:00
Jakub Kicinski 311cca4066 net: fix ifname in netlink ntf during netns move
dev_get_valid_name() overwrites the netdev's name on success.
This makes it hard to use in prepare-commit-like fashion,
where we do validation first, and "commit" to the change
later.

Factor out a helper which lets us save the new name to a buffer.
Use it to fix the problem of notification on netns move having
incorrect name:

 5: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
     link/ether be:4d:58:f9:d5:40 brd ff:ff:ff:ff:ff:ff
 6: eth1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
     link/ether 1e:4a:34:36:e3:cd brd ff:ff:ff:ff:ff:ff

 [ ~]# ip link set dev eth0 netns 1 name eth1

ip monitor inside netns:
 Deleted inet eth0
 Deleted inet6 eth0
 Deleted 5: eth1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
     link/ether be:4d:58:f9:d5:40 brd ff:ff:ff:ff:ff:ff new-netnsid 0 new-ifindex 7

Name is reported as eth1 in old netns for ifindex 5, already renamed.

Fixes: d90310243f ("net: device name allocation cleanups")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-19 15:51:16 +02:00
Jakub Kicinski 9b9ac46c6c netfilter pr 2023-18-10
-----BEGIN PGP SIGNATURE-----
 
 iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmUvzlINHGZ3QHN0cmxl
 bi5kZQAKCRBwkajZrV/2APelD/9b9FP1jNIX11OqPqmOL0bgGczU5OW3iI6VxLOx
 1qygw4pXHjtLGfNBEA7QtckbkFgXIxou/JvgavYBXY5mGDHDEJuV3JnX/HqIxKy9
 czAahqkdQ94/a0mI9wFuPY5ZeLTF/yGAe1iLKKWf4f//QwTc7dCfDsGjdKeKLHwL
 ye0f0MRwIwojTjBCf+szlacYrVE+Jgog+GFKX6F3eUtA+JOhpspwWNCPt5W/IQg/
 Q9Bl0dRQeqbIUzaB2YxvIFxB80NG4dg31ThYfgFW8IBz6bTYm3Vml4iorGSkYAqD
 c4PjKY/46/cpCAzkLd+5c+WE0YLMlq2u52KJXThUd5Lpzy3QG8KG5AMR1vXDVw2s
 7fLNeJGH+4zHDY3nXVTqCA/IV2x2sNwgfF6VZHvvkqpKhL5euaKJGu3a7/KGOhoI
 JO1Q6CBaTpZ3HuBNJoTI/uMeZJySlfv7080pHIypy7mYtTyChQC7/IvYpaI9yFwU
 Ka7wIR5MOJ/npaj6ZpVkbNQLntt+SbJINyTiJax5vO63FJ5kSrdgPZiR1PwtMHpZ
 br1e+2eFz8foxaEf6ADL/5QwfIvfglT4fVk/flsukJdyrsPxI2H7kYm/6G8Fcbu3
 TOZQmsNvqnqUpnQLXYe/YbtKoIvTAZOoGy/rcj+IIl2ym/2DiwbWx2nye6uwufIH
 UlP5gA==
 =M/4J
 -----END PGP SIGNATURE-----

Merge tag 'nf-23-10-18' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Florian Westphal says:

====================
netfilter: updates for net

First patch, from Phil Sutter, reduces number of audit notifications
when userspace requests to re-set stateful objects.
This change also comes with a selftest update.

Second patch, also from Phil, moves the nftables audit selftest
to its own netns to avoid interference with the init netns.

Third patch, from Pablo Neira, fixes an inconsistency with the "rbtree"
set backend: When set element X has expired, a request to delete element
X should fail (like with all other backends).

Finally, patch four, also from Pablo, reverts a recent attempt to speed
up abort of a large pending update with the "pipapo" set backend.

It could cause stray references to remain in the set, which then
results in a double-free.

* tag 'nf-23-10-18' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: revert do not remove elements if set backend implements .abort
  netfilter: nft_set_rbtree: .deactivate fails if element has expired
  selftests: netfilter: Run nft_audit.sh in its own netns
  netfilter: nf_tables: audit log object reset once per table
====================

Link: https://lore.kernel.org/r/20231018125605.27299-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18 18:17:51 -07:00
Jakub Kicinski 88343fbe5a A few more fixes:
* prevent value bounce/glitch in rfkill GPIO probe
  * fix lockdep report in rfkill
  * fix error path leak in mac80211 key handling
  * use system_unbound_wq for wiphy work since it
    can take longer
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmUvhKQACgkQ10qiO8sP
 aAB+3A//eNEP5W9ZigsZ5pPYuxKyXoctyV3Qp8yChYWHEE46JbbnQsvvvGa6SYFu
 TzWS3dShLqXU83pWHrvNSuWo/AySaoT4kDp7c8TK7GIed8MHxguU9QXaPVqQiNEG
 tmAVwPz5y8Y4NzjytBU0MjuAfbWo/HTIs5nKxBhNTJp3imqlezCnhYz4pMjc0rcN
 CnWKH8bYW+ubqnJVi0l01GDwtGs4xROf+43is6sA8zu0zA3gx+V0BJdx3J7t1Jrd
 qCBmi8ZSFcUZ/GLqYE04MdlHGRrzGR5p+Rq+/MsF3AUoobIrbkVIXXP0VoPpLAqf
 xZwccVhWXI1ZGM3T5L91hrhONcwoCA67m79qhMcr2KZZ0flhBjNLVqCXXRAXn0Hs
 /gynjrvGVaZMYlJegAd0iOIAOn9FjYcdcYgYhDRFdbRCt7eCRSj3YMuO9rFhX6kL
 tbmIII6vMoPunzSwUeNI1CXUmJoTJMnzwWVPC0EZqT2gymhBAeS/GK8ELC3lG1dz
 aXoxX2dCHMk2rv86B3g/Pk2j0smxCNd+hXjFVfJIJ+t1xp8TvuEI8AGrae1nNzck
 c2cPn34u/5PLkO5bJqu2JgF9qAHgArqfPIFIKLZNcpAOvqCNWqSozc5KtNQ4ASMt
 f/cnPGQUxcHekAFsJWB9aqwW3oPAUkWR6tJ8jYHxZC47oGs/qnY=
 =Sth1
 -----END PGP SIGNATURE-----

Merge tag 'wireless-2023-10-18' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
A few more fixes:
 * prevent value bounce/glitch in rfkill GPIO probe
 * fix lockdep report in rfkill
 * fix error path leak in mac80211 key handling
 * use system_unbound_wq for wiphy work since it
   can take longer

* tag 'wireless-2023-10-18' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  net: rfkill: reduce data->mtx scope in rfkill_fop_open
  net: rfkill: gpio: prevent value glitch during probe
  wifi: mac80211: fix error path key leak
  wifi: cfg80211: use system_unbound_wq for wiphy work
====================

Link: https://lore.kernel.org/r/20231018071041.8175-2-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18 18:14:25 -07:00
Eric Dumazet 195374d893 ipv4: fib: annotate races around nh->nh_saddr_genid and nh->nh_saddr
syzbot reported a data-race while accessing nh->nh_saddr_genid [1]

Add annotations, but leave the code lazy as intended.

[1]
BUG: KCSAN: data-race in fib_select_path / fib_select_path

write to 0xffff8881387166f0 of 4 bytes by task 6778 on cpu 1:
fib_info_update_nhc_saddr net/ipv4/fib_semantics.c:1334 [inline]
fib_result_prefsrc net/ipv4/fib_semantics.c:1354 [inline]
fib_select_path+0x292/0x330 net/ipv4/fib_semantics.c:2269
ip_route_output_key_hash_rcu+0x659/0x12c0 net/ipv4/route.c:2810
ip_route_output_key_hash net/ipv4/route.c:2644 [inline]
__ip_route_output_key include/net/route.h:134 [inline]
ip_route_output_flow+0xa6/0x150 net/ipv4/route.c:2872
send4+0x1f5/0x520 drivers/net/wireguard/socket.c:61
wg_socket_send_skb_to_peer+0x94/0x130 drivers/net/wireguard/socket.c:175
wg_socket_send_buffer_to_peer+0xd6/0x100 drivers/net/wireguard/socket.c:200
wg_packet_send_handshake_initiation drivers/net/wireguard/send.c:40 [inline]
wg_packet_handshake_send_worker+0x10c/0x150 drivers/net/wireguard/send.c:51
process_one_work kernel/workqueue.c:2630 [inline]
process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2703
worker_thread+0x525/0x730 kernel/workqueue.c:2784
kthread+0x1d7/0x210 kernel/kthread.c:388
ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304

read to 0xffff8881387166f0 of 4 bytes by task 6759 on cpu 0:
fib_result_prefsrc net/ipv4/fib_semantics.c:1350 [inline]
fib_select_path+0x1cb/0x330 net/ipv4/fib_semantics.c:2269
ip_route_output_key_hash_rcu+0x659/0x12c0 net/ipv4/route.c:2810
ip_route_output_key_hash net/ipv4/route.c:2644 [inline]
__ip_route_output_key include/net/route.h:134 [inline]
ip_route_output_flow+0xa6/0x150 net/ipv4/route.c:2872
send4+0x1f5/0x520 drivers/net/wireguard/socket.c:61
wg_socket_send_skb_to_peer+0x94/0x130 drivers/net/wireguard/socket.c:175
wg_socket_send_buffer_to_peer+0xd6/0x100 drivers/net/wireguard/socket.c:200
wg_packet_send_handshake_initiation drivers/net/wireguard/send.c:40 [inline]
wg_packet_handshake_send_worker+0x10c/0x150 drivers/net/wireguard/send.c:51
process_one_work kernel/workqueue.c:2630 [inline]
process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2703
worker_thread+0x525/0x730 kernel/workqueue.c:2784
kthread+0x1d7/0x210 kernel/kthread.c:388
ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304

value changed: 0x959d3217 -> 0x959d3218

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 6759 Comm: kworker/u4:15 Not tainted 6.6.0-rc4-syzkaller-00029-gcbf3a2cb156a #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/06/2023
Workqueue: wg-kex-wg1 wg_packet_handshake_send_worker

Fixes: 436c3b66ec ("ipv4: Invalidate nexthop cache nh_saddr more correctly.")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231017192304.82626-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18 18:11:31 -07:00
Paolo Abeni 68b54aeff8 tcp_bpf: properly release resources on error paths
In the blamed commit below, I completely forgot to release the acquired
resources before erroring out in the TCP BPF code, as reported by Dan.

Address the issues by replacing the bogus return with a jump to the
relevant cleanup code.

Fixes: 419ce133ab ("tcp: allow again tcp_disconnect() when threads are waiting")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/8f99194c698bcef12666f0a9a999c58f8b1cb52c.1697557782.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18 18:09:31 -07:00
Pedro Tammela a13b67c9a0 net/sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes a inner curve
Christian Theune says:
   I upgraded from 6.1.38 to 6.1.55 this morning and it broke my traffic shaping script,
   leaving me with a non-functional uplink on a remote router.

A 'rt' curve cannot be used as a inner curve (parent class), but we were
allowing such configurations since the qdisc was introduced. Such
configurations would trigger a UAF as Budimir explains:
   The parent will have vttree_insert() called on it in init_vf(),
   but will not have vttree_remove() called on it in update_vf()
   because it does not have the HFSC_FSC flag set.

The qdisc always assumes that inner classes have the HFSC_FSC flag set.
This is by design as it doesn't make sense 'qdisc wise' for an 'rt'
curve to be an inner curve.

Budimir's original patch disallows users to add classes with a 'rt'
parent, but this is too strict as it breaks users that have been using
'rt' as a inner class. Another approach, taken by this patch, is to
upgrade the inner 'rt' into a 'sc', warning the user in the process.
It avoids the UAF reported by Budimir while also being more permissive
to bad scripts/users/code using 'rt' as a inner class.

Users checking the `tc class ls [...]` or `tc class get [...]` dumps would
observe the curve change and are potentially breaking with this change.

v1->v2: https://lore.kernel.org/all/20231013151057.2611860-1-pctammela@mojatatu.com/
- Correct 'Fixes' tag and merge with revert (Jakub)

Cc: Christian Theune <ct@flyingcircus.io>
Cc: Budimir Markovic <markovicbudimir@gmail.com>
Fixes: b3d26c5702 ("net/sched: sch_hfsc: Ensure inner classes have fsc curve")
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20231017143602.3191556-1-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18 18:08:28 -07:00
Eric Dumazet f921a4a5bf tcp: tsq: relax tcp_small_queue_check() when rtx queue contains a single skb
In commit 75eefc6c59 ("tcp: tsq: add a shortcut in tcp_small_queue_check()")
we allowed to send an skb regardless of TSQ limits being hit if rtx queue
was empty or had a single skb, in order to better fill the pipe
when/if TX completions were slow.

Then later, commit 75c119afe1 ("tcp: implement rb-tree based
retransmit queue") accidentally removed the special case for
one skb in rtx queue.

Stefan Wahren reported a regression in single TCP flow throughput
using a 100Mbit fec link, starting from commit 65466904b0 ("tcp: adjust
TSO packet sizes based on min_rtt"). This last commit only made the
regression more visible, because it locked the TCP flow on a particular
behavior where TSQ prevented two skbs being pushed downstream,
adding silences on the wire between each TSO packet.

Many thanks to Stefan for his invaluable help !

Fixes: 75c119afe1 ("tcp: implement rb-tree based retransmit queue")
Link: https://lore.kernel.org/netdev/7f31ddc8-9971-495e-a1f6-819df542e0af@gmx.net/
Reported-by: Stefan Wahren <wahrenst@gmx.net>
Tested-by: Stefan Wahren <wahrenst@gmx.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20231017124526.4060202-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18 18:06:36 -07:00
Pablo Neira Ayuso f86fb94011 netfilter: nf_tables: revert do not remove elements if set backend implements .abort
nf_tables_abort_release() path calls nft_set_elem_destroy() for
NFT_MSG_NEWSETELEM which releases the element, however, a reference to
the element still remains in the working copy.

Fixes: ebd032fa88 ("netfilter: nf_tables: do not remove elements if set backend implements .abort")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 13:47:32 +02:00
Pablo Neira Ayuso d111692a59 netfilter: nft_set_rbtree: .deactivate fails if element has expired
This allows to remove an expired element which is not possible in other
existing set backends, this is more noticeable if gc-interval is high so
expired elements remain in the tree. On-demand gc also does not help in
this case, because this is delete element path. Return NULL if element
has expired.

Fixes: 8d8540c4f5 ("netfilter: nft_set_rbtree: add timeout support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 13:47:32 +02:00
Phil Sutter 1baf0152f7 netfilter: nf_tables: audit log object reset once per table
When resetting multiple objects at once (via dump request), emit a log
message per table (or filled skb) and resurrect the 'entries' parameter
to contain the number of objects being logged for.

To test the skb exhaustion path, perform some bulk counter and quota
adds in the kselftest.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Paul Moore <paul@paul-moore.com> (Audit)
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 13:43:40 +02:00
Gavrilov Ilia 1d30162f35 net: pktgen: Fix interface flags printing
Device flags are displayed incorrectly:
1) The comparison (i == F_FLOW_SEQ) is always false, because F_FLOW_SEQ
is equal to (1 << FLOW_SEQ_SHIFT) == 2048, and the maximum value
of the 'i' variable is (NR_PKT_FLAG - 1) == 17. It should be compared
with FLOW_SEQ_SHIFT.

2) Similarly to the F_IPSEC flag.

3) Also add spaces to the print end of the string literal "spi:%u"
to prevent the output from merging with the flag that follows.

Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.

Fixes: 99c6d3d20d ("pktgen: Remove brute-force printing of flags")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 10:12:30 +01:00
Jakub Kicinski f6c7b42243 ipsec-2023-10-17
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmUuRZ4ACgkQrB3Eaf9P
 W7e7eg//fgQux/sJoq2Qf7T8cvVt3NSpOLXB43NRsIb9nUyMNN/cYTtypYM/+0FM
 f0zxmwtUkj9Mx/IL2nNzK/h8lMtJeKfy14tt8lAX2ux5oJltCcTiLgdp3C4HWxOh
 9DrXMGIryr0apedkcStSzhRoJBd8giFViAQZE8NwO5EKG8WLJNVvw0YzlNgM0WOk
 5y/sN1uXeUmUCYDkOGcdt6yIyV+GIo/nEg/XOY8LkaQDzKveDypydj0dPGL/Dj8U
 MhczTCf1WdbTHh0dmITIdX/yZ8/bPNfV3EzAtAaYgceVh1DSq8/F9buRXz2oxxvh
 r8Igv+640+SoWEtIsVoXHx7KIZ7LckasrJv1IoKRXGVV25LjnR+bxGVQNyUjdd6+
 Cb11Q/m66fp/k7B+++b6eFVlKvr9O4jBogb3kfGpqMiwm6LNM+CJicdAfM8/GEgM
 ejnVNSEaChhdgGvrXVbKhI2ACatwbPrt6d4PN0d9fpUbZJnuBZl4QHElLNSBznjO
 xJUF8LvPjFGx6vj3YFnBg5f3uNH35nR2jQxPsx+yWdBEFqbxBt7yKr/pftwP1gxD
 ZLHF8JbOT4J+96ClDRFnlWFdY+Phq0fg5S5WhPhNGQ4rj9rsjAS8ZMcHWZeA7iZ2
 0w5P/DlFmmnw6CC+Xr+wQJlgB1tZ2sdC8nAK6gxQJ3eFVv4wnhk=
 =yHtO
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-2023-10-17' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2023-10-17

1) Fix a slab-use-after-free in xfrm_policy_inexact_list_reinsert.
   From Dong Chenchen.

2) Fix data-races in the xfrm interfaces dev->stats fields.
   From Eric Dumazet.

3) Fix a data-race in xfrm_gen_index.
   From Eric Dumazet.

4) Fix an inet6_dev refcount underflow.
   From Zhang Changzhong.

5) Check the return value of pskb_trim in esp_remove_trailer
   for esp4 and esp6. From Ma Ke.

6) Fix a data-race in xfrm_lookup_with_ifid.
   From Eric Dumazet.

* tag 'ipsec-2023-10-17' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: fix a data-race in xfrm_lookup_with_ifid()
  net: ipv4: fix return value check in esp_remove_trailer
  net: ipv6: fix return value check in esp_remove_trailer
  xfrm6: fix inet6_dev refcount underflow problem
  xfrm: fix a data-race in xfrm_gen_index()
  xfrm: interface: use DEV_STATS_INC()
  net: xfrm: skip policies marked as dead while reinserting policies
====================

Link: https://lore.kernel.org/r/20231017083723.1364940-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 18:21:13 -07:00
Neal Cardwell 1c2709cfff tcp: fix excessive TLP and RACK timeouts from HZ rounding
We discovered from packet traces of slow loss recovery on kernels with
the default HZ=250 setting (and min_rtt < 1ms) that after reordering,
when receiving a SACKed sequence range, the RACK reordering timer was
firing after about 16ms rather than the desired value of roughly
min_rtt/4 + 2ms. The problem is largely due to the RACK reorder timer
calculation adding in TCP_TIMEOUT_MIN, which is 2 jiffies. On kernels
with HZ=250, this is 2*4ms = 8ms. The TLP timer calculation has the
exact same issue.

This commit fixes the TLP transmit timer and RACK reordering timer
floor calculation to more closely match the intended 2ms floor even on
kernels with HZ=250. It does this by adding in a new
TCP_TIMEOUT_MIN_US floor of 2000 us and then converting to jiffies,
instead of the current approach of converting to jiffies and then
adding th TCP_TIMEOUT_MIN value of 2 jiffies.

Our testing has verified that on kernels with HZ=1000, as expected,
this does not produce significant changes in behavior, but on kernels
with the default HZ=250 the latency improvement can be large. For
example, our tests show that for HZ=250 kernels at low RTTs this fix
roughly halves the latency for the RACK reorder timer: instead of
mostly firing at 16ms it mostly fires at 8ms.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: bb4d991a28 ("tcp: adjust tail loss probe timeout")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231015174700.2206872-1-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:25:42 -07:00
Jakub Kicinski 2b10740ce7 bluetooth pull request for net:
- Fix race when opening vhci device
  - Avoid memcmp() out of bounds warning
  - Correctly bounds check and pad HCI_MON_NEW_INDEX name
  - Fix using memcmp when comparing keys
  - Ignore error return for hci_devcd_register() in btrtl
  - Always check if connection is alive before deleting
  - Fix a refcnt underflow problem for hci_conn
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmUqBjIZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKeVmD/95NHJd61l0JpQw9yYNjDc+
 8C2tAa4rDrbooqfhgS3PPy4SQ9X07KO9VRKaQVEjnVahnfIsbE++sptBWjMSQz94
 8cRde2NJs2uUnl0Eovv9hcE/dgscXu5Lgt/fCCOyY6YwbsyFW5FmYcNUSzOLu83i
 6KuPigt2n/bFC7pnB/VROmO3WusNVFX1J8unZa/MvMK5wn6DQkJeZgJlFlgJSfeg
 Tou4SwI8ormUsDAjV3+dVl7vZVNpyX4mv5cR6RlwHEtvcnkFIj6+ZE96M1BwOBrm
 zbc1OyzrXt5jXuSORlOQEAfjZUab+ygrv1tCgpVkQf+vnIiREsN/57NS6J/kmbc9
 tbOqfJwYokRwjKtXsQ2bc0Jm5kvA0j/9Mrep4E3UZpTv2LwjGTaN/wXGNu1qW8+d
 LTgzqwmgx/PwqC4oiAFrT7guPDwoHzFqSwnvdhkdr04dH19zk7IjIm8ZrTruPDDk
 3wiMOijCtx31RBlRPmfb2njostkS+ysrW1bw3vQmfs04djvcgar0MqKdauXIuBvE
 QMB+dyvPwCFQ8OYHQwfuwYfdHSgG7v+f35IEeflB4OISbZI4aFWkQnjvcloFpnmP
 m/vo+y86ORlxJiHlJypgwqxZ42nAxpGUCwxxhiMDeiqp4GzxUYcwK5RteOcDkDTb
 /ukSEQz/suO3Fya2kTz4mw==
 =SY+z
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2023-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - Fix race when opening vhci device
 - Avoid memcmp() out of bounds warning
 - Correctly bounds check and pad HCI_MON_NEW_INDEX name
 - Fix using memcmp when comparing keys
 - Ignore error return for hci_devcd_register() in btrtl
 - Always check if connection is alive before deleting
 - Fix a refcnt underflow problem for hci_conn

* tag 'for-net-2023-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: hci_sock: Correctly bounds check and pad HCI_MON_NEW_INDEX name
  Bluetooth: avoid memcmp() out of bounds warning
  Bluetooth: hci_sock: fix slab oob read in create_monitor_event
  Bluetooth: btrtl: Ignore error return for hci_devcd_register()
  Bluetooth: hci_event: Fix coding style
  Bluetooth: hci_event: Fix using memcmp when comparing keys
  Bluetooth: Fix a refcnt underflow problem for hci_conn
  Bluetooth: hci_sync: always check if connection is alive before deleting
  Bluetooth: Reject connection with the device which has same BD_ADDR
  Bluetooth: hci_event: Ignore NULL link key
  Bluetooth: ISO: Fix invalid context error
  Bluetooth: vhci: Fix race when opening vhci device
====================

Link: https://lore.kernel.org/r/20231014031336.1664558-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 18:02:20 -07:00
Krzysztof Kozlowski 7937609cd3 nfc: nci: fix possible NULL pointer dereference in send_acknowledge()
Handle memory allocation failure from nci_skb_alloc() (calling
alloc_skb()) to avoid possible NULL pointer dereference.

Reported-by: 黄思聪 <huangsicong@iie.ac.cn>
Fixes: 391d8a2da7 ("NFC: Add NCI over SPI receive")
Cc: <stable@vger.kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231013184129.18738-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 17:34:53 -07:00
Christoph Paasch 503930f8e1 netlink: Correct offload_xstats size
rtnl_offload_xstats_get_size_hw_s_info_one() conditionalizes the
size-computation for IFLA_OFFLOAD_XSTATS_HW_S_INFO_USED based on whether
or not the device has offload_xstats enabled.

However, rtnl_offload_xstats_fill_hw_s_info_one() is adding the u8 for
that field uncondtionally.

syzkaller triggered a WARNING in rtnl_stats_get due to this:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 754 at net/core/rtnetlink.c:5982 rtnl_stats_get+0x2f4/0x300
Modules linked in:
CPU: 0 PID: 754 Comm: syz-executor148 Not tainted 6.6.0-rc2-g331b78eb12af #45
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
RIP: 0010:rtnl_stats_get+0x2f4/0x300 net/core/rtnetlink.c:5982
Code: ff ff 89 ee e8 7d 72 50 ff 83 fd a6 74 17 e8 33 6e 50 ff 4c 89 ef be 02 00 00 00 e8 86 00 fa ff e9 7b fe ff ff e8 1c 6e 50 ff <0f> 0b eb e5 e8 73 79 7b 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90
RSP: 0018:ffffc900006837c0 EFLAGS: 00010293
RAX: ffffffff81cf7f24 RBX: ffff8881015d9000 RCX: ffff888101815a00
RDX: 0000000000000000 RSI: 00000000ffffffa6 RDI: 00000000ffffffa6
RBP: 00000000ffffffa6 R08: ffffffff81cf7f03 R09: 0000000000000001
R10: ffff888101ba47b9 R11: ffff888101815a00 R12: ffff8881017dae00
R13: ffff8881017dad00 R14: ffffc90000683ab8 R15: ffffffff83c1f740
FS:  00007fbc22dbc740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000046 CR3: 000000010264e003 CR4: 0000000000170ef0
Call Trace:
 <TASK>
 rtnetlink_rcv_msg+0x677/0x710 net/core/rtnetlink.c:6480
 netlink_rcv_skb+0xea/0x1c0 net/netlink/af_netlink.c:2545
 netlink_unicast+0x430/0x500 net/netlink/af_netlink.c:1342
 netlink_sendmsg+0x4fc/0x620 net/netlink/af_netlink.c:1910
 sock_sendmsg+0xa8/0xd0 net/socket.c:730
 ____sys_sendmsg+0x22a/0x320 net/socket.c:2541
 ___sys_sendmsg+0x143/0x190 net/socket.c:2595
 __x64_sys_sendmsg+0xd8/0x150 net/socket.c:2624
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x47/0xa0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7fbc22e8d6a9
Code: 5c c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4f 37 0d 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc4320e778 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004007d0 RCX: 00007fbc22e8d6a9
RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000003
RBP: 0000000000000001 R08: 0000000000000000 R09: 00000000004007d0
R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffc4320e898
R13: 00007ffc4320e8a8 R14: 00000000004004a0 R15: 00007fbc22fa5a80
 </TASK>
---[ end trace 0000000000000000 ]---

Which didn't happen prior to commit bf9f1baa27 ("net: add dedicated
kmem_cache for typical/small skb->head") as the skb always was large
enough.

Fixes: 0e7788fd76 ("net: rtnetlink: Add UAPI for obtaining L3 offload xstats")
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/20231013041448.8229-1-cpaasch@apple.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 17:32:59 -07:00
Dust Li 4abbd2e3c1 net/smc: return the right falback reason when prefix checks fail
In the smc_listen_work(), if smc_listen_prfx_check() failed,
the real reason: SMC_CLC_DECL_DIFFPREFIX was dropped, and
SMC_CLC_DECL_NOSMCDEV was returned.

Althrough this is also kind of SMC_CLC_DECL_NOSMCDEV, but return
the real reason is much friendly for debugging.

Fixes: e49300a6bf ("net/smc: add listen processing for SMC-Rv2")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Link: https://lore.kernel.org/r/20231012123729.29307-1-dust.li@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 16:37:40 -07:00
Kees Cook cb3871b1cd Bluetooth: hci_sock: Correctly bounds check and pad HCI_MON_NEW_INDEX name
The code pattern of memcpy(dst, src, strlen(src)) is almost always
wrong. In this case it is wrong because it leaves memory uninitialized
if it is less than sizeof(ni->name), and overflows ni->name when longer.

Normally strtomem_pad() could be used here, but since ni->name is a
trailing array in struct hci_mon_new_index, compilers that don't support
-fstrict-flex-arrays=3 can't tell how large this array is via
__builtin_object_size(). Instead, open-code the helper and use sizeof()
since it will work correctly.

Additionally mark ni->name as __nonstring since it appears to not be a
%NUL terminated C string.

Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Cc: Edward AD <twuufnxlz@gmail.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: linux-bluetooth@vger.kernel.org
Cc: netdev@vger.kernel.org
Fixes: 18f547f3fc ("Bluetooth: hci_sock: fix slab oob read in create_monitor_event")
Link: https://lore.kernel.org/lkml/202310110908.F2639D3276@keescook/
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-10-13 20:06:33 -07:00
Arnd Bergmann 9d1a3c7474 Bluetooth: avoid memcmp() out of bounds warning
bacmp() is a wrapper around memcpy(), which contain compile-time
checks for buffer overflow. Since the hci_conn_request_evt() also calls
bt_dev_dbg() with an implicit NULL pointer check, the compiler is now
aware of a case where 'hdev' is NULL and treats this as meaning that
zero bytes are available:

In file included from net/bluetooth/hci_event.c:32:
In function 'bacmp',
    inlined from 'hci_conn_request_evt' at net/bluetooth/hci_event.c:3276:7:
include/net/bluetooth/bluetooth.h:364:16: error: 'memcmp' specified bound 6 exceeds source size 0 [-Werror=stringop-overread]
  364 |         return memcmp(ba1, ba2, sizeof(bdaddr_t));
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Add another NULL pointer check before the bacmp() to ensure the compiler
understands the code flow enough to not warn about it.  Since the patch
that introduced the warning is marked for stable backports, this one
should also go that way to avoid introducing build regressions.

Fixes: 1ffc6f8cc3 ("Bluetooth: Reject connection with the device which has same BD_ADDR")
Cc: Kees Cook <keescook@chromium.org>
Cc: "Lee, Chun-Yi" <jlee@suse.com>
Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-10-13 20:04:45 -07:00
Edward AD 18f547f3fc Bluetooth: hci_sock: fix slab oob read in create_monitor_event
When accessing hdev->name, the actual string length should prevail

Reported-by: syzbot+c90849c50ed209d77689@syzkaller.appspotmail.com
Fixes: dcda165706 ("Bluetooth: hci_core: Fix build warnings")
Signed-off-by: Edward AD <twuufnxlz@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-10-13 20:03:04 -07:00
Luiz Augusto von Dentz 35d91d95a0 Bluetooth: hci_event: Fix coding style
This fixes the following code style problem:

ERROR: that open brace { should be on the previous line
+	if (!bacmp(&hdev->bdaddr, &ev->bdaddr))
+	{

Fixes: 1ffc6f8cc3 ("Bluetooth: Reject connection with the device which has same BD_ADDR")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-10-13 20:02:35 -07:00
Luiz Augusto von Dentz b541260615 Bluetooth: hci_event: Fix using memcmp when comparing keys
memcmp is not consider safe to use with cryptographic secrets:

 'Do  not  use memcmp() to compare security critical data, such as
 cryptographic secrets, because the required CPU time depends on the
 number of equal bytes.'

While usage of memcmp for ZERO_KEY may not be considered a security
critical data, it can lead to more usage of memcmp with pairing keys
which could introduce more security problems.

Fixes: 455c2ff0a5 ("Bluetooth: Fix BR/EDR out-of-band pairing with only initiator data")
Fixes: 33155c4aae ("Bluetooth: hci_event: Ignore NULL link key")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-10-13 20:00:25 -07:00
Jakub Kicinski f50ee3a003 nf pull request 2023-10-12
-----BEGIN PGP SIGNATURE-----
 
 iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmUnsR0NHGZ3QHN0cmxl
 bi5kZQAKCRBwkajZrV/2AJaDD/9WqsWX8KK+QaXMNdhuY2TYicRUrzAObeCOkPzU
 8M0S0pZiSsfZO4Tzlo30krSoLERnBYsedngEIzLHHlQ/V+65H8h9jY1p1oGZFKIg
 5ui9ZQbcvWLqKWTOaCneZuEjsGxodWvVokVubSAWw9XExuYqbfo8WOOKBZzAX4mt
 J7KznGlShGHithvssWrk8n2FQr8+ZqzRrJjpoNfQcKU1a+GEIiQvIIfm4f3P6baj
 yiuP2dXWsUXuDK63JmMUK2k3w1e/XwjPpCZ+uyoNxPCggu0X0u//MpM71CNpGAZ4
 g9V5dn+p6WsLh1gQ4QvWGgJqXPm8xU1mRux5WLnp5dHVwm+8bgGGy0O8QC+s9H63
 15pZmeaGcTd6zp3FxbnCWcWFbcW1jsYoscLrMyqjtgDmGIOBxQfia2oCyaN3Z37M
 vjRYtNFQbU6TbKZc0tKlh79geUykZL+ibtVuBiHkniZkAMOQXqUBkh3K1yfJ1j76
 w/GjIm+wFKn0KKij27mzebwTJNLbXdXWem5bzUg6lhC0X57Mcf3+L8VbNWPv220x
 drsI9uzw+YPE7neVbmXQ8hHyJKelG5cF0xgfykLT+6DnadiCpdrlrxwiMCP/POQ5
 XIPkjgyi1T3k/5Sv4kab7cZxbq1A14URJHASidT/JN2NCYlMV1A1CaeE1vhrZ5Z4
 BHgDRw==
 =mZQY
 -----END PGP SIGNATURE-----

Merge tag 'nf-23-10-12' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Florian Westphal says:

====================
netfilter updates for net

Patch 1, from Pablo Neira Ayuso, fixes a performance regression
(since 6.4) when a large pending set update has to be canceled towards
the end of the transaction.

Patch 2 from myself, silences an incorrect compiler warning reported
with a few (older) compiler toolchains.

Patch 3, from Kees Cook, adds __counted_by annotation to
nft_pipapo set backend type.  I took this for net instead of -next
given infra is already in place and no actual code change is made.

Patch 4, from Pablo Neira Ayso, disables timeout resets on
stateful element reset.  The rest should only affect internal object
state, e.g. reset a quota or counter, but not affect a pending timeout.

Patches 5 and 6 fix NULL dereferences in 'inner header' match,
control plane doesn't test for netlink attribute presence before
accessing them. Broken since feature was added in 6.2, fixes from
Xingyuan Mo.

Last patch, from myself, fixes a bogus rule match when skb has
a 0-length mac header, in this case we'd fetch data from network
header instead of canceling rule evaluation.  This is a day 0 bug,
present since nftables was merged in 3.13.

* tag 'nf-23-10-12' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_payload: fix wrong mac header matching
  nf_tables: fix NULL pointer dereference in nft_expr_inner_parse()
  nf_tables: fix NULL pointer dereference in nft_inner_init()
  netfilter: nf_tables: do not refresh timeout when resetting element
  netfilter: nf_tables: Annotate struct nft_pipapo_match with __counted_by
  netfilter: nfnetlink_log: silence bogus compiler warning
  netfilter: nf_tables: do not remove elements if set backend implements .abort
====================

Link: https://lore.kernel.org/r/20231012085724.15155-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-13 17:50:58 -07:00
Albert Huang c68681ae46 net/smc: fix smc clc failed issue when netdevice not in init_net
If the netdevice is within a container and communicates externally
through network technologies such as VxLAN, we won't be able to find
routing information in the init_net namespace. To address this issue,
we need to add a struct net parameter to the smc_ib_find_route function.
This allow us to locate the routing information within the corresponding
net namespace, ensuring the correct completion of the SMC CLC interaction.

Fixes: e5c4744cfb ("net/smc: add SMC-Rv2 connection establishment")
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Link: https://lore.kernel.org/r/20231011074851.95280-1-huangjie.albert@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-13 16:52:02 -07:00
Paolo Abeni 419ce133ab tcp: allow again tcp_disconnect() when threads are waiting
As reported by Tom, .NET and applications build on top of it rely
on connect(AF_UNSPEC) to async cancel pending I/O operations on TCP
socket.

The blamed commit below caused a regression, as such cancellation
can now fail.

As suggested by Eric, this change addresses the problem explicitly
causing blocking I/O operation to terminate immediately (with an error)
when a concurrent disconnect() is executed.

Instead of tracking the number of threads blocked on a given socket,
track the number of disconnect() issued on such socket. If such counter
changes after a blocking operation releasing and re-acquiring the socket
lock, error out the current operation.

Fixes: 4faeee0cf8 ("tcp: deny tcp_disconnect() when threads are waiting")
Reported-by: Tom Deseyn <tdeseyn@redhat.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1886305
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/f3b95e47e3dbed840960548aebaa8d954372db41.1697008693.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-13 16:49:32 -07:00
Linus Torvalds a1ef447dee Fixes for an overreaching WARN_ON, two error paths and a switch to
kernel_connect() which recently grown protection against someone using
 BPF to rewrite the address.  All but one marked for stable.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmUpYoQTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHziwvmCACK13dkAaupcHyteYPloBgtJLNixR3X
 6++nHCOXGtE7cK6n1snobFQgp/d5BqSKAeymyLqDjJOJVqG/5n8FR1gwcuY/Ogdj
 Aju2Mkt7R/R/V6kvmCbqGSwiOvxZP1gBFkKcluRNQkFNP3boKw4vmJGq29Rabbyr
 66NfZSETsR/H4JhAWvUyVmffrvxIx11THnvmrAnprGVKoK72HwOQFx0H4KLD2Hio
 aN/yiiR15yDS6hL8dfilt+QGO6o+ZWgvRl1GOzqjwISgfeUUYVknMJXrEf+20kHs
 kX3tWaLWVZsU1FyVFRO5HPYLAREjALXstFO4K1OHPERM7EipWNinIHoS
 =wyYP
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Fixes for an overreaching WARN_ON, two error paths and a switch to
  kernel_connect() which recently grown protection against someone using
  BPF to rewrite the address.

  All but one marked for stable"

* tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-client:
  ceph: fix type promotion bug on 32bit systems
  libceph: use kernel_connect()
  ceph: remove unnecessary IS_ERR() check in ceph_fname_to_usr()
  ceph: fix incorrect revoked caps assert in ceph_fill_file_size()
2023-10-13 11:27:31 -07:00
Kuniyuki Iwashima 8702cf12e6 tcp: Fix listen() warning with v4-mapped-v6 address.
syzbot reported a warning [0] introduced by commit c48ef9c4ae ("tcp: Fix
bind() regression for v4-mapped-v6 non-wildcard address.").

After the cited commit, a v4 socket's address matches the corresponding
v4-mapped-v6 tb2 in inet_bind2_bucket_match_addr(), not vice versa.

During X.X.X.X -> ::ffff:X.X.X.X order bind()s, the second bind() uses
bhash and conflicts properly without checking bhash2 so that we need not
check if a v4-mapped-v6 sk matches the corresponding v4 address tb2 in
inet_bind2_bucket_match_addr().  However, the repro shows that we need
to check that in a no-conflict case.

The repro bind()s two sockets to the 2-tuples using SO_REUSEPORT and calls
listen() for the first socket:

  from socket import *

  s1 = socket()
  s1.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
  s1.bind(('127.0.0.1', 0))

  s2 = socket(AF_INET6)
  s2.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
  s2.bind(('::ffff:127.0.0.1', s1.getsockname()[1]))

  s1.listen()

The second socket should belong to the first socket's tb2, but the second
bind() creates another tb2 bucket because inet_bind2_bucket_find() returns
NULL in inet_csk_get_port() as the v4-mapped-v6 sk does not match the
corresponding v4 address tb2.

  bhash2[] -> tb2(::ffff:X.X.X.X) -> tb2(X.X.X.X)

Then, listen() for the first socket calls inet_csk_get_port(), where the
v4 address matches the v4-mapped-v6 tb2 and WARN_ON() is triggered.

To avoid that, we need to check if v4-mapped-v6 sk address matches with
the corresponding v4 address tb2 in inet_bind2_bucket_match().

The same checks are needed in inet_bind2_bucket_addr_match() too, so we
can move all checks there and call it from inet_bind2_bucket_match().

Note that now tb->family is just an address family of tb->(v6_)?rcv_saddr
and not of sockets in the bucket.  This could be refactored later by
defining tb->rcv_saddr as tb->v6_rcv_saddr.s6_addr32[3] and prepending
::ffff: when creating v4 tb2.

[0]:
WARNING: CPU: 0 PID: 5049 at net/ipv4/inet_connection_sock.c:587 inet_csk_get_port+0xf96/0x2350 net/ipv4/inet_connection_sock.c:587
Modules linked in:
CPU: 0 PID: 5049 Comm: syz-executor288 Not tainted 6.6.0-rc2-syzkaller-00018-g2cf0f7156238 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/04/2023
RIP: 0010:inet_csk_get_port+0xf96/0x2350 net/ipv4/inet_connection_sock.c:587
Code: 7c 24 08 e8 4c b6 8a 01 31 d2 be 88 01 00 00 48 c7 c7 e0 94 ae 8b e8 59 2e a3 f8 2e 2e 2e 31 c0 e9 04 fe ff ff e8 ca 88 d0 f8 <0f> 0b e9 0f f9 ff ff e8 be 88 d0 f8 49 8d 7e 48 e8 65 ca 5a 00 31
RSP: 0018:ffffc90003abfbf0 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff888026429100 RCX: 0000000000000000
RDX: ffff88807edcbb80 RSI: ffffffff88b73d66 RDI: ffff888026c49f38
RBP: ffff888026c49f30 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff9260f200
R13: ffff888026c49880 R14: 0000000000000000 R15: ffff888026429100
FS:  00005555557d5380(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000045ad50 CR3: 0000000025754000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 inet_csk_listen_start+0x155/0x360 net/ipv4/inet_connection_sock.c:1256
 __inet_listen_sk+0x1b8/0x5c0 net/ipv4/af_inet.c:217
 inet_listen+0x93/0xd0 net/ipv4/af_inet.c:239
 __sys_listen+0x194/0x270 net/socket.c:1866
 __do_sys_listen net/socket.c:1875 [inline]
 __se_sys_listen net/socket.c:1873 [inline]
 __x64_sys_listen+0x53/0x80 net/socket.c:1873
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f3a5bce3af9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc1a1c79e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000032
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a5bce3af9
RDX: 00007f3a5bce3af9 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00007f3a5bd565f0 R08: 0000000000000006 R09: 0000000000000006
R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000001
R13: 431bde82d7b634db R14: 0000000000000001 R15: 0000000000000001
 </TASK>

Fixes: c48ef9c4ae ("tcp: Fix bind() regression for v4-mapped-v6 non-wildcard address.")
Reported-by: syzbot+71e724675ba3958edb31@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=71e724675ba3958edb31
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231010013814.70571-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-13 10:01:49 -07:00
Eric Dumazet de5724ca38 xfrm: fix a data-race in xfrm_lookup_with_ifid()
syzbot complains about a race in xfrm_lookup_with_ifid() [1]

When preparing commit 0a9e5794b2 ("xfrm: annotate data-race
around use_time") I thought xfrm_lookup_with_ifid() was modifying
a still private structure.

[1]
BUG: KCSAN: data-race in xfrm_lookup_with_ifid / xfrm_lookup_with_ifid

write to 0xffff88813ea41108 of 8 bytes by task 8150 on cpu 1:
xfrm_lookup_with_ifid+0xce7/0x12d0 net/xfrm/xfrm_policy.c:3218
xfrm_lookup net/xfrm/xfrm_policy.c:3270 [inline]
xfrm_lookup_route+0x3b/0x100 net/xfrm/xfrm_policy.c:3281
ip6_dst_lookup_flow+0x98/0xc0 net/ipv6/ip6_output.c:1246
send6+0x241/0x3c0 drivers/net/wireguard/socket.c:139
wg_socket_send_skb_to_peer+0xbd/0x130 drivers/net/wireguard/socket.c:178
wg_socket_send_buffer_to_peer+0xd6/0x100 drivers/net/wireguard/socket.c:200
wg_packet_send_handshake_initiation drivers/net/wireguard/send.c:40 [inline]
wg_packet_handshake_send_worker+0x10c/0x150 drivers/net/wireguard/send.c:51
process_one_work kernel/workqueue.c:2630 [inline]
process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2703
worker_thread+0x525/0x730 kernel/workqueue.c:2784
kthread+0x1d7/0x210 kernel/kthread.c:388
ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304

write to 0xffff88813ea41108 of 8 bytes by task 15867 on cpu 0:
xfrm_lookup_with_ifid+0xce7/0x12d0 net/xfrm/xfrm_policy.c:3218
xfrm_lookup net/xfrm/xfrm_policy.c:3270 [inline]
xfrm_lookup_route+0x3b/0x100 net/xfrm/xfrm_policy.c:3281
ip6_dst_lookup_flow+0x98/0xc0 net/ipv6/ip6_output.c:1246
send6+0x241/0x3c0 drivers/net/wireguard/socket.c:139
wg_socket_send_skb_to_peer+0xbd/0x130 drivers/net/wireguard/socket.c:178
wg_socket_send_buffer_to_peer+0xd6/0x100 drivers/net/wireguard/socket.c:200
wg_packet_send_handshake_initiation drivers/net/wireguard/send.c:40 [inline]
wg_packet_handshake_send_worker+0x10c/0x150 drivers/net/wireguard/send.c:51
process_one_work kernel/workqueue.c:2630 [inline]
process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2703
worker_thread+0x525/0x730 kernel/workqueue.c:2784
kthread+0x1d7/0x210 kernel/kthread.c:388
ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304

value changed: 0x00000000651cd9d1 -> 0x00000000651cd9d2

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 15867 Comm: kworker/u4:58 Not tainted 6.6.0-rc4-syzkaller-00016-g5e62ed3b1c8a #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/06/2023
Workqueue: wg-kex-wg2 wg_packet_handshake_send_worker

Fixes: 0a9e5794b2 ("xfrm: annotate data-race around use_time")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-10-13 07:57:27 +02:00
Florian Westphal d351c1ea2d netfilter: nft_payload: fix wrong mac header matching
mcast packets get looped back to the local machine.
Such packets have a 0-length mac header, we should treat
this like "mac header not set" and abort rule evaluation.

As-is, we just copy data from the network header instead.

Fixes: 96518518cc ("netfilter: add nftables")
Reported-by: Blažej Krajňák <krajnak@levonet.sk>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-12 10:28:45 +02:00
Xingyuan Mo 505ce0630a nf_tables: fix NULL pointer dereference in nft_expr_inner_parse()
We should check whether the NFTA_EXPR_NAME netlink attribute is present
before accessing it, otherwise a null pointer deference error will occur.

Call Trace:
 <TASK>
 dump_stack_lvl+0x4f/0x90
 print_report+0x3f0/0x620
 kasan_report+0xcd/0x110
 __asan_load2+0x7d/0xa0
 nla_strcmp+0x2f/0x90
 __nft_expr_type_get+0x41/0xb0
 nft_expr_inner_parse+0xe3/0x200
 nft_inner_init+0x1be/0x2e0
 nf_tables_newrule+0x813/0x1230
 nfnetlink_rcv_batch+0xec3/0x1170
 nfnetlink_rcv+0x1e4/0x220
 netlink_unicast+0x34e/0x4b0
 netlink_sendmsg+0x45c/0x7e0
 __sys_sendto+0x355/0x370
 __x64_sys_sendto+0x84/0xa0
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Fixes: 3a07327d10 ("netfilter: nft_inner: support for inner tunnel header matching")
Signed-off-by: Xingyuan Mo <hdthky0@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-12 10:28:45 +02:00
Xingyuan Mo 52177bbf19 nf_tables: fix NULL pointer dereference in nft_inner_init()
We should check whether the NFTA_INNER_NUM netlink attribute is present
before accessing it, otherwise a null pointer deference error will occur.

Call Trace:
 dump_stack_lvl+0x4f/0x90
 print_report+0x3f0/0x620
 kasan_report+0xcd/0x110
 __asan_load4+0x84/0xa0
 nft_inner_init+0x128/0x2e0
 nf_tables_newrule+0x813/0x1230
 nfnetlink_rcv_batch+0xec3/0x1170
 nfnetlink_rcv+0x1e4/0x220
 netlink_unicast+0x34e/0x4b0
 netlink_sendmsg+0x45c/0x7e0
 __sys_sendto+0x355/0x370
 __x64_sys_sendto+0x84/0xa0
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Fixes: 3a07327d10 ("netfilter: nft_inner: support for inner tunnel header matching")
Signed-off-by: Xingyuan Mo <hdthky0@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-12 10:28:45 +02:00
Pablo Neira Ayuso 4c90bba60c netfilter: nf_tables: do not refresh timeout when resetting element
The dump and reset command should not refresh the timeout, this command
is intended to allow users to list existing stateful objects and reset
them, element expiration should be refresh via transaction instead with
a specific command to achieve this, otherwise this is entering combo
semantics that will be hard to be undone later (eg. a user asking to
retrieve counters but _not_ requiring to refresh expiration).

Fixes: 079cd63321 ("netfilter: nf_tables: Introduce NFT_MSG_GETSETELEM_RESET")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-12 10:28:45 +02:00
Kees Cook d51c42cdef netfilter: nf_tables: Annotate struct nft_pipapo_match with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

As found with Coccinelle[1], add __counted_by for struct nft_pipapo_match.

Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Cc: netdev@vger.kernel.org
Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-12 10:28:45 +02:00
Florian Westphal 2e1d175410 netfilter: nfnetlink_log: silence bogus compiler warning
net/netfilter/nfnetlink_log.c:800:18: warning: variable 'ctinfo' is uninitialized

The warning is bogus, the variable is only used if ct is non-NULL and
always initialised in that case.  Init to 0 too to silence this.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202309100514.ndBFebXN-lkp@intel.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-12 10:28:45 +02:00
Pablo Neira Ayuso ebd032fa88 netfilter: nf_tables: do not remove elements if set backend implements .abort
pipapo set backend maintains two copies of the datastructure, removing
the elements from the copy that is going to be discarded slows down
the abort path significantly, from several minutes to few seconds after
this patch.

Fixes: 212ed75dc5 ("netfilter: nf_tables: integrate pipapo into commit protocol")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-12 10:28:45 +02:00
Jeremy Cline 354a6e707e nfc: nci: assert requested protocol is valid
The protocol is used in a bit mask to determine if the protocol is
supported. Assert the provided protocol is less than the maximum
defined so it doesn't potentially perform a shift-out-of-bounds and
provide a clearer error for undefined protocols vs unsupported ones.

Fixes: 6a2968aaf5 ("NFC: basic NCI protocol implementation")
Reported-and-tested-by: syzbot+0839b78e119aae1fec78@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0839b78e119aae1fec78
Signed-off-by: Jeremy Cline <jeremy@jcline.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231009200054.82557-1-jeremy@jcline.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-12 09:32:10 +02:00