Commit graph

67456 commits

Author SHA1 Message Date
Kuniyuki Iwashima 33d60fbd21 sock: Use sock_owned_by_user_nocheck() instead of sk_lock.owned.
This patch moves sock_release_ownership() down in include/net/sock.h and
replaces some sk_lock.owned tests with sock_owned_by_user_nocheck().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/r/20211208062158.54132-1-kuniyu@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-10 19:43:00 -08:00
Jakub Kicinski be3158290d Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Andrii Nakryiko says:

====================
bpf-next 2021-12-10 v2

We've added 115 non-merge commits during the last 26 day(s) which contain
a total of 182 files changed, 5747 insertions(+), 2564 deletions(-).

The main changes are:

1) Various samples fixes, from Alexander Lobakin.

2) BPF CO-RE support in kernel and light skeleton, from Alexei Starovoitov.

3) A batch of new unified APIs for libbpf, logging improvements, version
   querying, etc. Also a batch of old deprecations for old APIs and various
   bug fixes, in preparation for libbpf 1.0, from Andrii Nakryiko.

4) BPF documentation reorganization and improvements, from Christoph Hellwig
   and Dave Tucker.

5) Support for declarative initialization of BPF_MAP_TYPE_PROG_ARRAY in
   libbpf, from Hengqi Chen.

6) Verifier log fixes, from Hou Tao.

7) Runtime-bounded loops support with bpf_loop() helper, from Joanne Koong.

8) Extend branch record capturing to all platforms that support it,
   from Kajol Jain.

9) Light skeleton codegen improvements, from Kumar Kartikeya Dwivedi.

10) bpftool doc-generating script improvements, from Quentin Monnet.

11) Two libbpf v0.6 bug fixes, from Shuyi Cheng and Vincent Minet.

12) Deprecation warning fix for perf/bpf_counter, from Song Liu.

13) MAX_TAIL_CALL_CNT unification and MIPS build fix for libbpf,
    from Tiezhu Yang.

14) BTF_KING_TYPE_TAG follow-up fixes, from Yonghong Song.

15) Selftests fixes and improvements, from Ilya Leoshkevich, Jean-Philippe
    Brucker, Jiri Olsa, Maxim Mikityanskiy, Tirthendu Sarkar, Yucong Sun,
    and others.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (115 commits)
  libbpf: Add "bool skipped" to struct bpf_map
  libbpf: Fix typo in btf__dedup@LIBBPF_0.0.2 definition
  bpftool: Switch bpf_object__load_xattr() to bpf_object__load()
  selftests/bpf: Remove the only use of deprecated bpf_object__load_xattr()
  selftests/bpf: Add test for libbpf's custom log_buf behavior
  selftests/bpf: Replace all uses of bpf_load_btf() with bpf_btf_load()
  libbpf: Deprecate bpf_object__load_xattr()
  libbpf: Add per-program log buffer setter and getter
  libbpf: Preserve kernel error code and remove kprobe prog type guessing
  libbpf: Improve logging around BPF program loading
  libbpf: Allow passing user log setting through bpf_object_open_opts
  libbpf: Allow passing preallocated log_buf when loading BTF into kernel
  libbpf: Add OPTS-based bpf_btf_load() API
  libbpf: Fix bpf_prog_load() log_buf logic for log_level 0
  samples/bpf: Remove unneeded variable
  bpf: Remove redundant assignment to pointer t
  selftests/bpf: Fix a compilation warning
  perf/bpf_counter: Use bpf_map_create instead of bpf_create_map
  samples: bpf: Fix 'unknown warning group' build warning on Clang
  samples: bpf: Fix xdp_sample_user.o linking with Clang
  ...
====================

Link: https://lore.kernel.org/r/20211210234746.2100561-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-10 15:56:13 -08:00
Eric Dumazet e1b539bd73 xfrm: add net device refcount tracker to struct xfrm_state_offload
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Link: https://lore.kernel.org/r/20211209154451.4184050-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-10 08:16:33 -08:00
Eric Dumazet 285ec2fef4 l2tp: add netns refcount tracker to l2tp_dfs_seq_data
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-10 06:38:27 -08:00
Eric Dumazet ffa84b5ffb net: add netns refcount tracker to struct sock
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-10 06:38:26 -08:00
Eric Dumazet 9ba74e6c9e net: add networking namespace refcount tracker
We have 100+ syzbot reports about netns being dismantled too soon,
still unresolved as of today.

We think a missing get_net() or an extra put_net() is the root cause.

In order to find the bug(s), and be able to spot future ones,
this patch adds CONFIG_NET_NS_REFCNT_TRACKER and new helpers
to precisely pair all put_net() with corresponding get_net().

To use these helpers, each data structure owning a refcount
should also use a "netns_tracker" to pair the get and put.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-10 06:38:26 -08:00
Jean Sacren 9745177c94 net: x25: drop harmless check of !more
'more' is checked first.  When !more is checked immediately after that,
it is always true.  We should drop this check.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Link: https://lore.kernel.org/r/20211208024732.142541-5-sakiwit@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-09 18:35:11 -08:00
Jakub Kicinski 3150a73366 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-09 13:23:02 -08:00
Eric Dumazet 4177e49605 xfrm: use net device refcount tracker helpers
xfrm4_fill_dst() and xfrm6_fill_dst() build dst,
getting a device reference that will likely be released
by standard dst_release() code.

We have to track these references or risk a warning if
CONFIG_NET_DEV_REFCNT_TRACKER=y

Note to XFRM maintainers :

Error path in xfrm6_fill_dst() releases the reference,
but does not clear xdst->u.dst.dev, so I wonder
if this could lead to double dev_put() in some cases,
where a dst_release() _is_ called by the callers in their
error path.

This extra dev_put() was added in commit 84c4a9dfbf ("xfrm6:
release dev before returning error")

Fixes: 9038c32000 ("net: dst: add net device refcount tracking to dst_entry")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <amwang@redhat.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Link: https://lore.kernel.org/r/20211207193203.2706158-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-09 11:51:45 -08:00
Russell King (Oracle) 0a9f0794d9 net: dsa: mark DSA phylink as legacy_pre_march2020
The majority of DSA drivers do not make use of the PCS support, and
thus operate in legacy mode. In order to preserve this behaviour in
future, we need to set the legacy_pre_march2020 flag so phylink knows
this may require the legacy calls.

There are some DSA drivers that do make use of PCS support, and these
will continue operating as before - legacy_pre_march2020 will not
prevent split-PCS support enabling the newer phylink behaviour.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-09 11:21:03 -08:00
Eric Dumazet 61c2402665 net/sched: fq_pie: prevent dismantle issue
For some reason, fq_pie_destroy() did not copy
working code from pie_destroy() and other qdiscs,
thus causing elusive bug.

Before calling del_timer_sync(&q->adapt_timer),
we need to ensure timer will not rearm itself.

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu:    0-....: (4416 ticks this GP) idle=60d/1/0x4000000000000000 softirq=10433/10434 fqs=2579
        (t=10501 jiffies g=13085 q=3989)
NMI backtrace for cpu 0
CPU: 0 PID: 13 Comm: ksoftirqd/0 Not tainted 5.16.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 nmi_cpu_backtrace.cold+0x47/0x144 lib/nmi_backtrace.c:111
 nmi_trigger_cpumask_backtrace+0x1b3/0x230 lib/nmi_backtrace.c:62
 trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline]
 rcu_dump_cpu_stacks+0x25e/0x3f0 kernel/rcu/tree_stall.h:343
 print_cpu_stall kernel/rcu/tree_stall.h:627 [inline]
 check_cpu_stall kernel/rcu/tree_stall.h:711 [inline]
 rcu_pending kernel/rcu/tree.c:3878 [inline]
 rcu_sched_clock_irq.cold+0x9d/0x746 kernel/rcu/tree.c:2597
 update_process_times+0x16d/0x200 kernel/time/timer.c:1785
 tick_sched_handle+0x9b/0x180 kernel/time/tick-sched.c:226
 tick_sched_timer+0x1b0/0x2d0 kernel/time/tick-sched.c:1428
 __run_hrtimer kernel/time/hrtimer.c:1685 [inline]
 __hrtimer_run_queues+0x1c0/0xe50 kernel/time/hrtimer.c:1749
 hrtimer_interrupt+0x31c/0x790 kernel/time/hrtimer.c:1811
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline]
 __sysvec_apic_timer_interrupt+0x146/0x530 arch/x86/kernel/apic/apic.c:1103
 sysvec_apic_timer_interrupt+0x8e/0xc0 arch/x86/kernel/apic/apic.c:1097
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
RIP: 0010:write_comp_data kernel/kcov.c:221 [inline]
RIP: 0010:__sanitizer_cov_trace_const_cmp1+0x1d/0x80 kernel/kcov.c:273
Code: 54 c8 20 48 89 10 c3 66 0f 1f 44 00 00 53 41 89 fb 41 89 f1 bf 03 00 00 00 65 48 8b 0c 25 40 70 02 00 48 89 ce 4c 8b 54 24 08 <e8> 4e f7 ff ff 84 c0 74 51 48 8b 81 88 15 00 00 44 8b 81 84 15 00
RSP: 0018:ffffc90000d27b28 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff888064bf1bf0 RCX: ffff888011928000
RDX: ffff888011928000 RSI: ffff888011928000 RDI: 0000000000000003
RBP: ffff888064bf1c28 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff875d8295 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8880783dd300 R14: 0000000000000000 R15: 0000000000000000
 pie_calculate_probability+0x405/0x7c0 net/sched/sch_pie.c:418
 fq_pie_timer+0x170/0x2a0 net/sched/sch_fq_pie.c:383
 call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421
 expire_timers kernel/time/timer.c:1466 [inline]
 __run_timers.part.0+0x675/0xa20 kernel/time/timer.c:1734
 __run_timers kernel/time/timer.c:1715 [inline]
 run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1747
 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
 run_ksoftirqd kernel/softirq.c:921 [inline]
 run_ksoftirqd+0x2d/0x60 kernel/softirq.c:913
 smpboot_thread_fn+0x645/0x9c0 kernel/smpboot.c:164
 kthread+0x405/0x4f0 kernel/kthread.c:327
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
 </TASK>

Fixes: ec97ecf1eb ("net: sched: add Flow Queue PIE packet scheduler")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Cc: Sachin D. Patil <sdp.sachin@gmail.com>
Cc: V. Saicharan <vsaicharan1998@gmail.com>
Cc: Mohit Bhasi <mohitbhasi1998@gmail.com>
Cc: Leslie Monis <lesliemonis@gmail.com>
Cc: Gautam Ramakrishnan <gautamramk@gmail.com>
Link: https://lore.kernel.org/r/20211209084937.3500020-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-09 08:01:00 -08:00
Andrea Mayer ae68d93354 seg6: fix the iif in the IPv6 socket control block
When an IPv4 packet is received, the ip_rcv_core(...) sets the receiving
interface index into the IPv4 socket control block (v5.16-rc4,
net/ipv4/ip_input.c line 510):

    IPCB(skb)->iif = skb->skb_iif;

If that IPv4 packet is meant to be encapsulated in an outer IPv6+SRH
header, the seg6_do_srh_encap(...) performs the required encapsulation.
In this case, the seg6_do_srh_encap function clears the IPv6 socket control
block (v5.16-rc4 net/ipv6/seg6_iptunnel.c line 163):

    memset(IP6CB(skb), 0, sizeof(*IP6CB(skb)));

The memset(...) was introduced in commit ef489749aa ("ipv6: sr: clear
IP6CB(skb) on SRH ip4ip6 encapsulation") a long time ago (2019-01-29).

Since the IPv6 socket control block and the IPv4 socket control block share
the same memory area (skb->cb), the receiving interface index info is lost
(IP6CB(skb)->iif is set to zero).

As a side effect, that condition triggers a NULL pointer dereference if
commit 0857d6f8c7 ("ipv6: When forwarding count rx stats on the orig
netdev") is applied.

To fix that issue, we set the IP6CB(skb)->iif with the index of the
receiving interface once again.

Fixes: ef489749aa ("ipv6: sr: clear IP6CB(skb) on SRH ip4ip6 encapsulation")
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20211208195409.12169-1-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-09 07:55:42 -08:00
Krzysztof Kozlowski 4cd8371a23 nfc: fix potential NULL pointer deref in nfc_genl_dump_ses_done
The done() netlink callback nfc_genl_dump_ses_done() should check if
received argument is non-NULL, because its allocation could fail earlier
in dumpit() (nfc_genl_dump_ses()).

Fixes: ac22ac466a ("NFC: Add a GET_SE netlink API")
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Link: https://lore.kernel.org/r/20211209081307.57337-1-krzysztof.kozlowski@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-09 07:50:32 -08:00
Tadeusz Struk fd79a0cbf0 nfc: fix segfault in nfc_genl_dump_devices_done
When kmalloc in nfc_genl_dump_devices() fails then
nfc_genl_dump_devices_done() segfaults as below

KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 PID: 25 Comm: kworker/0:1 Not tainted 5.16.0-rc4-01180-g2a987e65025e-dirty #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-6.fc35 04/01/2014
Workqueue: events netlink_sock_destruct_work
RIP: 0010:klist_iter_exit+0x26/0x80
Call Trace:
<TASK>
class_dev_iter_exit+0x15/0x20
nfc_genl_dump_devices_done+0x3b/0x50
genl_lock_done+0x84/0xd0
netlink_sock_destruct+0x8f/0x270
__sk_destruct+0x64/0x3b0
sk_destruct+0xa8/0xd0
__sk_free+0x2e8/0x3d0
sk_free+0x51/0x90
netlink_sock_destruct_work+0x1c/0x20
process_one_work+0x411/0x710
worker_thread+0x6fd/0xa80

Link: https://syzkaller.appspot.com/bug?id=fc0fa5a53db9edd261d56e74325419faf18bd0df
Reported-by: syzbot+f9f76f4a0766420b4a02@syzkaller.appspotmail.com
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Link: https://lore.kernel.org/r/20211208182742.340542-1-tadeusz.struk@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-09 07:50:23 -08:00
Jianguo Wu 158390e456 udp: using datalen to cap max gso segments
The max number of UDP gso segments is intended to cap to UDP_MAX_SEGMENTS,
this is checked in udp_send_skb():

    if (skb->len > cork->gso_size * UDP_MAX_SEGMENTS) {
        kfree_skb(skb);
        return -EINVAL;
    }

skb->len contains network and transport header len here, we should use
only data len instead.

Fixes: bec1f6f697 ("udp: generate gso with UDP_SEGMENT")
Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/900742e5-81fb-30dc-6e0b-375c6cdd7982@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-09 07:48:07 -08:00
Antoine Tenart 5f1c802ca6 net-sysfs: warn if new queue objects are being created during device unregistration
Calling netdev_queue_update_kobjects is allowed during device
unregistration since commit 5c56580b74 ("net: Adjust TX queue kobjects
if number of queues changes during unregister"). But this is solely to
allow queue unregistrations. Any path attempting to add new queues after
a device started its unregistration should be fixed.

This patch adds a warning to detect such illegal use.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08 18:36:32 -08:00
Antoine Tenart d7dac08341 net-sysfs: update the queue counts in the unregistration path
When updating Rx and Tx queue kobjects, the queue count should always be
updated to match the queue kobjects count. This was not done in the net
device unregistration path, fix it. Tracking all queue count updates
will allow in a following up patch to detect illegal updates.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08 18:36:32 -08:00
Eric Dumazet e195e9b5de net, neigh: clear whole pneigh_entry at alloc time
Commit 2c611ad97a ("net, neigh: Extend neigh->flags to 32 bit
to allow for extensions") enables a new KMSAM warning [1]

I think the bug is actually older, because the following intruction
only occurred if ndm->ndm_flags had NTF_PROXY set.

	pn->flags = ndm->ndm_flags;

Let's clear all pneigh_entry fields at alloc time.

[1]
BUG: KMSAN: uninit-value in pneigh_fill_info+0x986/0xb30 net/core/neighbour.c:2593
 pneigh_fill_info+0x986/0xb30 net/core/neighbour.c:2593
 pneigh_dump_table net/core/neighbour.c:2715 [inline]
 neigh_dump_info+0x1e3f/0x2c60 net/core/neighbour.c:2832
 netlink_dump+0xaca/0x16a0 net/netlink/af_netlink.c:2265
 __netlink_dump_start+0xd1c/0xee0 net/netlink/af_netlink.c:2370
 netlink_dump_start include/linux/netlink.h:254 [inline]
 rtnetlink_rcv_msg+0x181b/0x18c0 net/core/rtnetlink.c:5534
 netlink_rcv_skb+0x447/0x800 net/netlink/af_netlink.c:2491
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5589
 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
 netlink_unicast+0x1095/0x1360 net/netlink/af_netlink.c:1345
 netlink_sendmsg+0x16f3/0x1870 net/netlink/af_netlink.c:1916
 sock_sendmsg_nosec net/socket.c:704 [inline]
 sock_sendmsg net/socket.c:724 [inline]
 sock_write_iter+0x594/0x690 net/socket.c:1057
 call_write_iter include/linux/fs.h:2162 [inline]
 new_sync_write fs/read_write.c:503 [inline]
 vfs_write+0x1318/0x2030 fs/read_write.c:590
 ksys_write+0x28c/0x520 fs/read_write.c:643
 __do_sys_write fs/read_write.c:655 [inline]
 __se_sys_write fs/read_write.c:652 [inline]
 __x64_sys_write+0xdb/0x120 fs/read_write.c:652
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Uninit was created at:
 slab_post_alloc_hook mm/slab.h:524 [inline]
 slab_alloc_node mm/slub.c:3251 [inline]
 slab_alloc mm/slub.c:3259 [inline]
 __kmalloc+0xc3c/0x12d0 mm/slub.c:4437
 kmalloc include/linux/slab.h:595 [inline]
 pneigh_lookup+0x60f/0xd70 net/core/neighbour.c:766
 arp_req_set_public net/ipv4/arp.c:1016 [inline]
 arp_req_set+0x430/0x10a0 net/ipv4/arp.c:1032
 arp_ioctl+0x8d4/0xb60 net/ipv4/arp.c:1232
 inet_ioctl+0x4ef/0x820 net/ipv4/af_inet.c:947
 sock_do_ioctl net/socket.c:1118 [inline]
 sock_ioctl+0xa3f/0x13e0 net/socket.c:1235
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:874 [inline]
 __se_sys_ioctl+0x2df/0x4a0 fs/ioctl.c:860
 __x64_sys_ioctl+0xd8/0x110 fs/ioctl.c:860
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x44/0xae

CPU: 1 PID: 20001 Comm: syz-executor.0 Not tainted 5.16.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 62dd93181a ("[IPV6] NDISC: Set per-entry is_router flag in Proxy NA.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Roopa Prabhu <roopa@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20211206165329.1049835-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08 17:41:44 -08:00
Jakub Kicinski fd31cb0c6a Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Fix bogus compilter warning in nfnetlink_queue, from Florian Westphal.

2) Don't run conntrack on vrf with !dflt qdisc, from Nicolas Dichtel.

3) Fix nft_pipapo bucket load in AVX2 lookup routine for six 8-bit
   groups, from Stefano Brivio.

4) Break rule evaluation on malformed TCP options.

5) Use socat instead of nc in selftests/netfilter/nft_zones_many.sh,
   also from Florian

6) Fix KCSAN data-race in conntrack timeout updates, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
  netfilter: conntrack: annotate data-races around ct->timeout
  selftests: netfilter: switch zone stress to socat
  netfilter: nft_exthdr: break evaluation if setting TCP option fails
  selftests: netfilter: Add correctness test for mac,net set type
  nft_set_pipapo: Fix bucket load in AVX2 lookup routine for six 8-bit groups
  vrf: don't run conntrack on vrf with !dflt qdisc
  netfilter: nfnetlink_queue: silence bogus compiler warning
====================

Link: https://lore.kernel.org/r/20211209000847.102598-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08 17:02:35 -08:00
Jakub Kicinski 6efcdadc15 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
bpf 2021-12-08

We've added 12 non-merge commits during the last 22 day(s) which contain
a total of 29 files changed, 659 insertions(+), 80 deletions(-).

The main changes are:

1) Fix an off-by-two error in packet range markings and also add a batch of
   new tests for coverage of these corner cases, from Maxim Mikityanskiy.

2) Fix a compilation issue on MIPS JIT for R10000 CPUs, from Johan Almbladh.

3) Fix two functional regressions and a build warning related to BTF kfunc
   for modules, from Kumar Kartikeya Dwivedi.

4) Fix outdated code and docs regarding BPF's migrate_disable() use on non-
   PREEMPT_RT kernels, from Sebastian Andrzej Siewior.

5) Add missing includes in order to be able to detangle cgroup vs bpf header
   dependencies, from Jakub Kicinski.

6) Fix regression in BPF sockmap tests caused by missing detachment of progs
   from sockets when they are removed from the map, from John Fastabend.

7) Fix a missing "no previous prototype" warning in x86 JIT caused by BPF
   dispatcher, from Björn Töpel.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Add selftests to cover packet access corner cases
  bpf: Fix the off-by-two error in range markings
  treewide: Add missing includes masked by cgroup -> bpf dependency
  tools/resolve_btfids: Skip unresolved symbol warning for empty BTF sets
  bpf: Fix bpf_check_mod_kfunc_call for built-in modules
  bpf: Make CONFIG_DEBUG_INFO_BTF depend upon CONFIG_BPF_SYSCALL
  mips, bpf: Fix reference to non-existing Kconfig symbol
  bpf: Make sure bpf_disable_instrumentation() is safe vs preemption.
  Documentation/locking/locktypes: Update migrate_disable() bits.
  bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap
  bpf, sockmap: Attach map progs to psock early for feature probes
  bpf, x86: Fix "no previous prototype" warning
====================

Link: https://lore.kernel.org/r/20211208155125.11826-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08 16:06:44 -08:00
Vladimir Oltean 857fdd74fb net: dsa: eliminate dsa_switch_ops :: port_bridge_tx_fwd_{,un}offload
We don't really need new switch API for these, and with new switches
which intend to add support for this feature, it will become cumbersome
to maintain.

The change consists in restructuring the two drivers that implement this
offload (sja1105 and mv88e6xxx) such that the offload is enabled and
disabled from the ->port_bridge_{join,leave} methods instead of the old
->port_bridge_tx_fwd_{,un}offload.

The only non-trivial change is that mv88e6xxx_map_virtual_bridge_to_pvt()
has been moved to avoid a forward declaration, and the
mv88e6xxx_reg_lock() calls from inside it have been removed, since
locking is now done from mv88e6xxx_port_bridge_{join,leave}.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08 14:31:16 -08:00
Vladimir Oltean b079922ba2 net: dsa: add a "tx_fwd_offload" argument to ->port_bridge_join
This is a preparation patch for the removal of the DSA switch methods
->port_bridge_tx_fwd_offload() and ->port_bridge_tx_fwd_unoffload().
The plan is for the switch to report whether it offloads TX forwarding
directly as a response to the ->port_bridge_join() method.

This change deals with the noisy portion of converting all existing
function prototypes to take this new boolean pointer argument.
The bool is placed in the cross-chip notifier structure for bridge join,
and a reference to it is provided to drivers. In the next change, DSA
will then actually look at this value instead of calling
->port_bridge_tx_fwd_offload().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08 14:31:16 -08:00
Vladimir Oltean d3eed0e57d net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.

For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.

Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.

We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.

During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.

When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.

Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08 14:31:16 -08:00
Vladimir Oltean 6a43cba303 net: dsa: export bridging offload helpers to drivers
Move the static inline helpers from net/dsa/dsa_priv.h to
include/net/dsa.h, so that drivers can call functions such as
dsa_port_offloads_bridge_dev(), which will be necessary after the
transition to a more complex bridge structure.

More functions than are needed right now are being moved, but this is
done for uniformity.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08 14:31:16 -08:00
Vladimir Oltean 936db8a2db net: dsa: rename dsa_port_offloads_bridge to dsa_port_offloads_bridge_dev
Currently the majority of dsa_port_bridge_dev_get() calls in drivers is
just to check whether a port is under the bridge device provided as
argument by the DSA API.

We'd like to change that DSA API so that a more complex structure is
provided as argument. To keep things more generic, and considering that
the new complex structure will be provided by value and not by
reference, direct comparisons between dp->bridge and the provided bridge
will be broken. The generic way to do the checking would simply be to
do something like dsa_port_offloads_bridge(dp, &bridge).

But there's a problem, we already have a function named that way, which
actually takes a bridge_dev net_device as argument. Rename it so that we
can use dsa_port_offloads_bridge for something else.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08 14:31:15 -08:00
Vladimir Oltean 36cbf39b56 net: dsa: hide dp->bridge_dev and dp->bridge_num in the core behind helpers
The location of the bridge device pointer and number is going to change.
It is not going to be kept individually per port, but in a common
structure allocated dynamically and which will have lockdep validation.

Create helpers to access these elements so that we have a migration path
to the new organization.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08 14:31:15 -08:00
Vladimir Oltean 947c8746e2 net: dsa: assign a bridge number even without TX forwarding offload
The service where DSA assigns a unique bridge number for each forwarding
domain is useful even for drivers which do not implement the TX
forwarding offload feature.

For example, drivers might use the dp->bridge_num for FDB isolation.

So rename ds->num_fwd_offloading_bridges to ds->max_num_bridges, and
calculate a unique bridge_num for all drivers that set this value.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08 14:31:14 -08:00
Vladimir Oltean 3f9bb0301d net: dsa: make dp->bridge_num one-based
I have seen too many bugs already due to the fact that we must encode an
invalid dp->bridge_num as a negative value, because the natural tendency
is to check that invalid value using (!dp->bridge_num). Latest example
can be seen in commit 1bec0f0506 ("net: dsa: fix bridge_num not
getting cleared after ports leaving the bridge").

Convert the existing users to assume that dp->bridge_num == 0 is the
encoding for invalid, and valid bridge numbers start from 1.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08 14:31:14 -08:00
Kees Cook c0e084e342 hv_sock: Extract hvs_send_data() helper that takes only header
When building under -Warray-bounds, the compiler is especially
conservative when faced with casts from a smaller object to a larger
object. While this has found many real bugs, there are some cases that
are currently false positives (like here). With this as one of the last
few instances of the warning in the kernel before -Warray-bounds can be
enabled globally, rearrange the functions so that there is a header-only
version of hvs_send_data(). Silences this warning:

net/vmw_vsock/hyperv_transport.c: In function 'hvs_shutdown_lock_held.constprop':
net/vmw_vsock/hyperv_transport.c:231:32: warning: array subscript 'struct hvs_send_buf[0]' is partly outside array bounds of 'struct vmpipe_proto_header[1]' [-Warray-bounds]
  231 |         send_buf->hdr.pkt_type = 1;
      |         ~~~~~~~~~~~~~~~~~~~~~~~^~~
net/vmw_vsock/hyperv_transport.c:465:36: note: while referencing 'hdr'
  465 |         struct vmpipe_proto_header hdr;
      |                                    ^~~

This change results in no executable instruction differences.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20211207063217.2591451-1-keescook@chromium.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 21:49:19 -08:00
Eric Dumazet ada066b2e0 net: sched: act_mirred: add net device refcount tracker
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 20:45:00 -08:00
Eric Dumazet e7c8ab8419 openvswitch: add net device refcount tracker to struct vport
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 20:45:00 -08:00
Eric Dumazet e4b8954074 netlink: add net device refcount tracker to struct ethnl_req_info
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 20:44:59 -08:00
Eric Dumazet b60645248a net/smc: add net device tracker to struct smc_pnetentry
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 20:44:59 -08:00
Eric Dumazet 035f1f2b96 pktgen add net device refcount tracker
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 20:44:59 -08:00
Eric Dumazet 615d069dcf llc: add net device refcount tracker
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 20:44:59 -08:00
Eric Dumazet 66ce07f780 ax25: add net device refcount tracker
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 20:44:59 -08:00
Eric Dumazet e44b14ebae inet: add net device refcount tracker to struct fib_nh_common
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 20:44:59 -08:00
Eric Dumazet 4fc003fe03 net: switchdev: add net device refcount tracker
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 20:44:58 -08:00
Eric Dumazet f12bf6f3f9 net: watchdog: add net device refcount tracker
Add a netdevice_tracker inside struct net_device, to track
the self reference when a device has an active watchdog timer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 20:44:58 -08:00
Eric Dumazet b2dcdc7f73 net: bridge: add net device refcount tracker
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 20:44:58 -08:00
Eric Dumazet 19c9ebf6ed vlan: add net device refcount tracker
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 20:44:58 -08:00
Eric Dumazet 802a7dc5cf netfilter: conntrack: annotate data-races around ct->timeout
(struct nf_conn)->timeout can be read/written locklessly,
add READ_ONCE()/WRITE_ONCE() to prevent load/store tearing.

BUG: KCSAN: data-race in __nf_conntrack_alloc / __nf_conntrack_find_get

write to 0xffff888132e78c08 of 4 bytes by task 6029 on cpu 0:
 __nf_conntrack_alloc+0x158/0x280 net/netfilter/nf_conntrack_core.c:1563
 init_conntrack+0x1da/0xb30 net/netfilter/nf_conntrack_core.c:1635
 resolve_normal_ct+0x502/0x610 net/netfilter/nf_conntrack_core.c:1746
 nf_conntrack_in+0x1c5/0x88f net/netfilter/nf_conntrack_core.c:1901
 ipv6_conntrack_local+0x19/0x20 net/netfilter/nf_conntrack_proto.c:414
 nf_hook_entry_hookfn include/linux/netfilter.h:142 [inline]
 nf_hook_slow+0x72/0x170 net/netfilter/core.c:619
 nf_hook include/linux/netfilter.h:262 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ip6_xmit+0xa3a/0xa60 net/ipv6/ip6_output.c:324
 inet6_csk_xmit+0x1a2/0x1e0 net/ipv6/inet6_connection_sock.c:135
 __tcp_transmit_skb+0x132a/0x1840 net/ipv4/tcp_output.c:1402
 tcp_transmit_skb net/ipv4/tcp_output.c:1420 [inline]
 tcp_write_xmit+0x1450/0x4460 net/ipv4/tcp_output.c:2680
 __tcp_push_pending_frames+0x68/0x1c0 net/ipv4/tcp_output.c:2864
 tcp_push_pending_frames include/net/tcp.h:1897 [inline]
 tcp_data_snd_check+0x62/0x2e0 net/ipv4/tcp_input.c:5452
 tcp_rcv_established+0x880/0x10e0 net/ipv4/tcp_input.c:5947
 tcp_v6_do_rcv+0x36e/0xa50 net/ipv6/tcp_ipv6.c:1521
 sk_backlog_rcv include/net/sock.h:1030 [inline]
 __release_sock+0xf2/0x270 net/core/sock.c:2768
 release_sock+0x40/0x110 net/core/sock.c:3300
 sk_stream_wait_memory+0x435/0x700 net/core/stream.c:145
 tcp_sendmsg_locked+0xb85/0x25a0 net/ipv4/tcp.c:1402
 tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1440
 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:644
 sock_sendmsg_nosec net/socket.c:704 [inline]
 sock_sendmsg net/socket.c:724 [inline]
 __sys_sendto+0x21e/0x2c0 net/socket.c:2036
 __do_sys_sendto net/socket.c:2048 [inline]
 __se_sys_sendto net/socket.c:2044 [inline]
 __x64_sys_sendto+0x74/0x90 net/socket.c:2044
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff888132e78c08 of 4 bytes by task 17446 on cpu 1:
 nf_ct_is_expired include/net/netfilter/nf_conntrack.h:286 [inline]
 ____nf_conntrack_find net/netfilter/nf_conntrack_core.c:776 [inline]
 __nf_conntrack_find_get+0x1c7/0xac0 net/netfilter/nf_conntrack_core.c:807
 resolve_normal_ct+0x273/0x610 net/netfilter/nf_conntrack_core.c:1734
 nf_conntrack_in+0x1c5/0x88f net/netfilter/nf_conntrack_core.c:1901
 ipv6_conntrack_local+0x19/0x20 net/netfilter/nf_conntrack_proto.c:414
 nf_hook_entry_hookfn include/linux/netfilter.h:142 [inline]
 nf_hook_slow+0x72/0x170 net/netfilter/core.c:619
 nf_hook include/linux/netfilter.h:262 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ip6_xmit+0xa3a/0xa60 net/ipv6/ip6_output.c:324
 inet6_csk_xmit+0x1a2/0x1e0 net/ipv6/inet6_connection_sock.c:135
 __tcp_transmit_skb+0x132a/0x1840 net/ipv4/tcp_output.c:1402
 __tcp_send_ack+0x1fd/0x300 net/ipv4/tcp_output.c:3956
 tcp_send_ack+0x23/0x30 net/ipv4/tcp_output.c:3962
 __tcp_ack_snd_check+0x2d8/0x510 net/ipv4/tcp_input.c:5478
 tcp_ack_snd_check net/ipv4/tcp_input.c:5523 [inline]
 tcp_rcv_established+0x8c2/0x10e0 net/ipv4/tcp_input.c:5948
 tcp_v6_do_rcv+0x36e/0xa50 net/ipv6/tcp_ipv6.c:1521
 sk_backlog_rcv include/net/sock.h:1030 [inline]
 __release_sock+0xf2/0x270 net/core/sock.c:2768
 release_sock+0x40/0x110 net/core/sock.c:3300
 tcp_sendpage+0x94/0xb0 net/ipv4/tcp.c:1114
 inet_sendpage+0x7f/0xc0 net/ipv4/af_inet.c:833
 rds_tcp_xmit+0x376/0x5f0 net/rds/tcp_send.c:118
 rds_send_xmit+0xbed/0x1500 net/rds/send.c:367
 rds_send_worker+0x43/0x200 net/rds/threads.c:200
 process_one_work+0x3fc/0x980 kernel/workqueue.c:2298
 worker_thread+0x616/0xa70 kernel/workqueue.c:2445
 kthread+0x2c7/0x2e0 kernel/kthread.c:327
 ret_from_fork+0x1f/0x30

value changed: 0x00027cc2 -> 0x00000000

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 17446 Comm: kworker/u4:5 Tainted: G        W         5.16.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: krdsd rds_send_worker

Note: I chose an arbitrary commit for the Fixes: tag,
because I do not think we need to backport this fix to very old kernels.

Fixes: e37542ba11 ("netfilter: conntrack: avoid possible false sharing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-08 01:29:15 +01:00
Pablo Neira Ayuso 962e5a4035 netfilter: nft_exthdr: break evaluation if setting TCP option fails
Break rule evaluation on malformed TCP options.

Fixes: 99d1712bc4 ("netfilter: exthdr: tcp option set support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-08 01:05:55 +01:00
Stefano Brivio b7e945e228 nft_set_pipapo: Fix bucket load in AVX2 lookup routine for six 8-bit groups
The sixth byte of packet data has to be looked up in the sixth group,
not in the seventh one, even if we load the bucket data into ymm6
(and not ymm5, for convenience of tracking stalls).

Without this fix, matching on a MAC address as first field of a set,
if 8-bit groups are selected (due to a small set size) would fail,
that is, the given MAC address would never match.

Reported-by: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com>
Cc: <stable@vger.kernel.org> # 5.6.x
Fixes: 7400b06396 ("nft_set_pipapo: Introduce AVX2-based lookup implementation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-By: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-08 01:05:55 +01:00
Maxim Galaganov 4f6e14bd19 mptcp: support TCP_CORK and TCP_NODELAY
First, add cork and nodelay fields to the mptcp_sock structure
so they can be used in sync_socket_options(), and fill them on setsockopt
while holding the msk socket lock.

Then, on setsockopt set proper tcp_sk(ssk)->nonagle values for subflows
by calling __tcp_sock_set_cork() or __tcp_sock_set_nodelay() on the ssk
while holding the ssk socket lock.

tcp_push_pending_frames() will be invoked on the ssk if a cork was cleared
or nodelay was set. Also set MPTCP_PUSH_PENDING bit by calling
mptcp_check_and_set_pending(). This will lead to __mptcp_push_pending()
being called inside mptcp_release_cb() with new tcp_sk(ssk)->nonagle.

Also add getsockopt support for TCP_CORK and TCP_NODELAY.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Maxim Galaganov <max@internet.ru>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 11:36:31 -08:00
Maxim Galaganov 8b38217a2a mptcp: expose mptcp_check_and_set_pending
Expose the mptcp_check_and_set_pending() function for use inside MPTCP
sockopt code. The next patch will call it when TCP_CORK is cleared or
TCP_NODELAY is set on the MPTCP socket in order to push pending data
from mptcp_release_cb().

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Maxim Galaganov <max@internet.ru>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 11:36:31 -08:00
Maxim Galaganov 6fadaa5658 tcp: expose __tcp_sock_set_cork and __tcp_sock_set_nodelay
Expose __tcp_sock_set_cork() and __tcp_sock_set_nodelay() for use in
MPTCP setsockopt code -- namely for syncing MPTCP socket options with
subflows inside sync_socket_options() while already holding the subflow
socket lock.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Maxim Galaganov <max@internet.ru>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 11:36:30 -08:00
Florian Westphal 3b1e21eb60 mptcp: getsockopt: add support for IP_TOS
earlier patch added IP_TOS setsockopt support, this allows to get
the value set by earlier setsockopt.

Extends mptcp_put_int_option to handle u8 input/output by
adding required cast.

Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 11:36:30 -08:00
Davide Caratti 602837e847 mptcp: allow changing the "backup" bit by endpoint id
a non-zero 'id' is sufficient to identify MPTCP endpoints: allow changing
the value of 'backup' bit by simply specifying the endpoint id.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/158
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 11:36:30 -08:00
Florian Westphal 644807e3e4 mptcp: add SIOCINQ, OUTQ and OUTQNSD ioctls
Allows to query in-sequence data ready for read(), total bytes in
write queue and total bytes in write queue that have not yet been sent.

v2: remove unneeded READ_ONCE() (Paolo Abeni)
v3: check for new data unconditionally in SIOCINQ ioctl (Mat Martineau)

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07 11:36:29 -08:00