Commit graph

68768 commits

Author SHA1 Message Date
Jakub Kicinski 625e8cb8e0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Fix cgroupv2 from the input path, from Florian Westphal.

2) Fix incorrect return value of nft_parse_register(), from Antoine Tenart.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: nft_parse_register can return a negative value
  netfilter: nft_socket: make cgroup match work in input too
====================

Link: https://lore.kernel.org/r/20220412094246.448055-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-12 09:48:53 -07:00
Antoine Tenart 6c6f9f31ec netfilter: nf_tables: nft_parse_register can return a negative value
Since commit 6e1acfa387 ("netfilter: nf_tables: validate registers
coming from userspace.") nft_parse_register can return a negative value,
but the function prototype is still returning an unsigned int.

Fixes: 6e1acfa387 ("netfilter: nf_tables: validate registers coming from userspace.")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-12 11:36:37 +02:00
Petr Malat 8467dda0c2 sctp: Initialize daddr on peeled off socket
Function sctp_do_peeloff() wrongly initializes daddr of the original
socket instead of the peeled off socket, which makes getpeername()
return zeroes instead of the primary address. Initialize the new socket
instead.

Fixes: d570ee490f ("[SCTP]: Correctly set daddr for IPv6 sockets during peeloff")
Signed-off-by: Petr Malat <oss@malat.biz>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/20220409063611.673193-1-oss@malat.biz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-11 20:33:10 -07:00
Karsten Graul 49b7d376ab net/smc: Fix af_ops of child socket pointing to released memory
Child sockets may inherit the af_ops from the parent listen socket.
When the listen socket is released then the af_ops of the child socket
points to released memory.
Solve that by restoring the original af_ops for child sockets which
inherited the parent af_ops. And clear any inherited user_data of the
parent socket.

Fixes: 8270d9c210 ("net/smc: Limit backlog connections")
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-11 18:28:03 -07:00
Karsten Graul d22f4f9772 net/smc: Fix NULL pointer dereference in smc_pnet_find_ib()
dev_name() was called with dev.parent as argument but without to
NULL-check it before.
Solve this by checking the pointer before the call to dev_name().

Fixes: af5f60c7e3 ("net/smc: allow PCI IDs as ib device names in the pnet table")
Reported-by: syzbot+03e3e228510223dabd34@syzkaller.appspotmail.com
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-11 18:28:03 -07:00
Karsten Graul b1871fd48e net/smc: use memcpy instead of snprintf to avoid out of bounds read
Using snprintf() to convert not null-terminated strings to null
terminated strings may cause out of bounds read in the source string.
Therefore use memcpy() and terminate the target string with a null
afterwards.

Fixes: fa08666255 ("net/smc: add support for user defined EIDs")
Fixes: 3c572145c2 ("net/smc: add generic netlink support for system EID")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-11 18:28:02 -07:00
Nicolas Dichtel e3fa461d8b ipv6: fix panic when forwarding a pkt with no in6 dev
kongweibin reported a kernel panic in ip6_forward() when input interface
has no in6 dev associated.

The following tc commands were used to reproduce this panic:
tc qdisc del dev vxlan100 root
tc qdisc add dev vxlan100 root netem corrupt 5%

CC: stable@vger.kernel.org
Fixes: ccd27f05ae ("ipv6: fix 'disable_policy' for fwd packets")
Reported-by: kongweibin <kongweibin2@huawei.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-11 12:28:43 +01:00
Florian Westphal 05ae2fba82 netfilter: nft_socket: make cgroup match work in input too
cgroupv2 helper function ignores the already-looked up sk
and uses skb->sk instead.

Just pass sk from the calling function instead; this will
make cgroup matching work for udp and tcp in input even when
edemux did not set skb->sk already.

Fixes: e0bb96db96 ("netfilter: nft_socket: add support for cgroupsv2")
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Topi Miettinen <toiwoton@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-11 12:12:46 +02:00
Benedikt Spranger e8a64bbaaa net/sched: taprio: Check if socket flags are valid
A user may set the SO_TXTIME socket option to ensure a packet is send
at a given time. The taprio scheduler has to confirm, that it is allowed
to send a packet at that given time, by a check against the packet time
schedule. The scheduler drop the packet, if the gates are closed at the
given send time.

The check, if SO_TXTIME is set, may fail since sk_flags are part of an
union and the union is used otherwise. This happen, if a socket is not
a full socket, like a request socket for example.

Add a check to verify, if the union is used for sk_flags.

Fixes: 4cfd5779bd ("taprio: Add support for txtime-assist mode")
Signed-off-by: Benedikt Spranger <b.spranger@linutronix.de>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-11 10:51:00 +01:00
Marcelo Ricardo Leitner e65812fd22 net/sched: fix initialization order when updating chain 0 head
Currently, when inserting a new filter that needs to sit at the head
of chain 0, it will first update the heads pointer on all devices using
the (shared) block, and only then complete the initialization of the new
element so that it has a "next" element.

This can lead to a situation that the chain 0 head is propagated to
another CPU before the "next" initialization is done. When this race
condition is triggered, packets being matched on that CPU will simply
miss all other filters, and will flow through the stack as if there were
no other filters installed. If the system is using OVS + TC, such
packets will get handled by vswitchd via upcall, which results in much
higher latency and reordering. For other applications it may result in
packet drops.

This is reproducible with a tc only setup, but it varies from system to
system. It could be reproduced with a shared block amongst 10 veth
tunnels, and an ingress filter mirroring packets to another veth.
That's because using the last added veth tunnel to the shared block to
do the actual traffic, it makes the race window bigger and easier to
trigger.

The fix is rather simple, to just initialize the next pointer of the new
filter instance (tp) before propagating the head change.

The fixes tag is pointing to the original code though this issue should
only be observed when using it unlocked.

Fixes: 2190d1d094 ("net: sched: introduce helpers to work with filter chains")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/b97d5f4eaffeeb9d058155bcab63347527261abf.1649341369.git.marcelo.leitner@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-08 14:45:43 -07:00
Xin Long e2d88f9ce6 sctp: use the correct skb for security_sctp_assoc_request
Yi Chen reported an unexpected sctp connection abort, and it occurred when
COOKIE_ECHO is bundled with DATA Fragment by SCTP HW GSO. As the IP header
is included in chunk->head_skb instead of chunk->skb, it failed to check
IP header version in security_sctp_assoc_request().

According to Ondrej, SELinux only looks at IP header (address and IPsec
options) and XFRM state data, and these are all included in head_skb for
SCTP HW GSO packets. So fix it by using head_skb when calling
security_sctp_assoc_request() in processing COOKIE_ECHO.

v1->v2:
  - As Ondrej noticed, chunk->head_skb should also be used for
    security_sctp_assoc_established() in sctp_sf_do_5_1E_ca().

Fixes: e215dab1c4 ("security: call security_sctp_assoc_request in sctp_sf_do_5_1D_ce")
Reported-by: Yi Chen <yiche@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/71becb489e51284edf0c11fc15246f4ed4cef5b6.1649337862.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-08 14:38:58 -07:00
Jakub Kicinski 1b808993e1 flow_dissector: fix false-positive __read_overflow2_field() warning
Bounds checking is unhappy that we try to copy both Ethernet
addresses but pass pointer to the first one. Luckily destination
address is the first field so pass the pointer to the entire header,
whatever.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-08 12:16:59 +01:00
Vlad Buslov 2105f700b5 net/sched: flower: fix parsing of ethertype following VLAN header
A tc flower filter matching TCA_FLOWER_KEY_VLAN_ETH_TYPE is expected to
match the L2 ethertype following the first VLAN header, as confirmed by
linked discussion with the maintainer. However, such rule also matches
packets that have additional second VLAN header, even though filter has
both eth_type and vlan_ethtype set to "ipv4". Looking at the code this
seems to be mostly an artifact of the way flower uses flow dissector.
First, even though looking at the uAPI eth_type and vlan_ethtype appear
like a distinct fields, in flower they are all mapped to the same
key->basic.n_proto. Second, flow dissector skips following VLAN header as
no keys for FLOW_DISSECTOR_KEY_CVLAN are set and eventually assigns the
value of n_proto to last parsed header. With these, such filters ignore any
headers present between first VLAN header and first "non magic"
header (ipv4 in this case) that doesn't result
FLOW_DISSECT_RET_PROTO_AGAIN.

Fix the issue by extending flow dissector VLAN key structure with new
'vlan_eth_type' field that matches first ethertype following previously
parsed VLAN header. Modify flower classifier to set the new
flow_dissector_key_vlan->vlan_eth_type with value obtained from
TCA_FLOWER_KEY_VLAN_ETH_TYPE/TCA_FLOWER_KEY_CVLAN_ETH_TYPE uAPIs.

Link: https://lore.kernel.org/all/Yjhgi48BpTGh6dig@nanopsycho/
Fixes: 9399ae9a6c ("net_sched: flower: Add vlan support")
Fixes: d64efd0926 ("net/sched: flower: Add supprt for matching on QinQ vlan headers")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-08 12:07:37 +01:00
Jakub Kicinski 8e9d0d7a76 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2022-04-06

We've added 8 non-merge commits during the last 8 day(s) which contain
a total of 9 files changed, 139 insertions(+), 36 deletions(-).

The main changes are:

1) rethook related fixes, from Jiri and Masami.

2) Fix the case when tracing bpf prog is attached to struct_ops, from Martin.

3) Support dual-stack sockets in bpf_tcp_check_syncookie, from Maxim.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Adjust bpf_tcp_check_syncookie selftest to test dual-stack sockets
  bpf: Support dual-stack sockets in bpf_tcp_check_syncookie
  bpf: selftests: Test fentry tracing a struct_ops program
  bpf: Resolve to prog->aux->dst_prog->type only for BPF_PROG_TYPE_EXT
  rethook: Fix to use WRITE_ONCE() for rethook:: Handler
  selftests/bpf: Fix warning comparing pointer to 0
  bpf: Fix sparse warnings in kprobe_multi_resolve_syms
  bpftool: Explicit errno handling in skeletons
====================

Link: https://lore.kernel.org/r/20220407031245.73026-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-06 21:58:50 -07:00
Maxim Mikityanskiy 2e8702cc0c bpf: Support dual-stack sockets in bpf_tcp_check_syncookie
bpf_tcp_gen_syncookie looks at the IP version in the IP header and
validates the address family of the socket. It supports IPv4 packets in
AF_INET6 dual-stack sockets.

On the other hand, bpf_tcp_check_syncookie looks only at the address
family of the socket, ignoring the real IP version in headers, and
validates only the packet size. This implementation has some drawbacks:

1. Packets are not validated properly, allowing a BPF program to trick
   bpf_tcp_check_syncookie into handling an IPv6 packet on an IPv4
   socket.

2. Dual-stack sockets fail the checks on IPv4 packets. IPv4 clients end
   up receiving a SYNACK with the cookie, but the following ACK gets
   dropped.

This patch fixes these issues by changing the checks in
bpf_tcp_check_syncookie to match the ones in bpf_tcp_gen_syncookie. IP
version from the header is taken into account, and it is validated
properly with address family.

Fixes: 3990408470 ("bpf: add helper to check for a valid SYN cookie")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Arthur Fabre <afabre@cloudflare.com>
Link: https://lore.kernel.org/bpf/20220406124113.2795730-1-maximmi@nvidia.com
2022-04-06 09:44:45 -07:00
Florian Westphal a3ebe92a0f net: ipv6mr: fix unused variable warning with CONFIG_IPV6_PIMSM_V2=n
net/ipv6/ip6mr.c:1656:14: warning: unused variable 'do_wrmifwhole'

Move it to the CONFIG_IPV6_PIMSM_V2 scope where its used.

Fixes: 4b340a5a72 ("net: ip6mr: add support for passing full packet on wrong mif")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-06 15:14:30 +01:00
Eric Dumazet 1946014ca3 rxrpc: fix a race in rxrpc_exit_net()
Current code can lead to the following race:

CPU0                                                 CPU1

rxrpc_exit_net()
                                                     rxrpc_peer_keepalive_worker()
                                                       if (rxnet->live)

  rxnet->live = false;
  del_timer_sync(&rxnet->peer_keepalive_timer);

                                                             timer_reduce(&rxnet->peer_keepalive_timer, jiffies + delay);

  cancel_work_sync(&rxnet->peer_keepalive_work);

rxrpc_exit_net() exits while peer_keepalive_timer is still armed,
leading to use-after-free.

syzbot report was:

ODEBUG: free active (active state 0) object type: timer_list hint: rxrpc_peer_keepalive_timeout+0x0/0xb0
WARNING: CPU: 0 PID: 3660 at lib/debugobjects.c:505 debug_print_object+0x16e/0x250 lib/debugobjects.c:505
Modules linked in:
CPU: 0 PID: 3660 Comm: kworker/u4:6 Not tainted 5.17.0-syzkaller-13993-g88e6c0207623 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: netns cleanup_net
RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:505
Code: ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 af 00 00 00 48 8b 14 dd 00 1c 26 8a 4c 89 ee 48 c7 c7 00 10 26 8a e8 b1 e7 28 05 <0f> 0b 83 05 15 eb c5 09 01 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e c3
RSP: 0018:ffffc9000353fb00 EFLAGS: 00010082
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: ffff888029196140 RSI: ffffffff815efad8 RDI: fffff520006a7f52
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff815ea4ae R11: 0000000000000000 R12: ffffffff89ce23e0
R13: ffffffff8a2614e0 R14: ffffffff816628c0 R15: dffffc0000000000
FS:  0000000000000000(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe1f2908924 CR3: 0000000043720000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 __debug_check_no_obj_freed lib/debugobjects.c:992 [inline]
 debug_check_no_obj_freed+0x301/0x420 lib/debugobjects.c:1023
 kfree+0xd6/0x310 mm/slab.c:3809
 ops_free_list.part.0+0x119/0x370 net/core/net_namespace.c:176
 ops_free_list net/core/net_namespace.c:174 [inline]
 cleanup_net+0x591/0xb00 net/core/net_namespace.c:598
 process_one_work+0x996/0x1610 kernel/workqueue.c:2289
 worker_thread+0x665/0x1080 kernel/workqueue.c:2436
 kthread+0x2e9/0x3a0 kernel/kthread.c:376
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:298
 </TASK>

Fixes: ace45bec6d ("rxrpc: Fix firewall route keepalive")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: linux-afs@lists.infradead.org
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-06 13:48:51 +01:00
Ilya Maximets 1f30fb9166 net: openvswitch: fix leak of nested actions
While parsing user-provided actions, openvswitch module may dynamically
allocate memory and store pointers in the internal copy of the actions.
So this memory has to be freed while destroying the actions.

Currently there are only two such actions: ct() and set().  However,
there are many actions that can hold nested lists of actions and
ovs_nla_free_flow_actions() just jumps over them leaking the memory.

For example, removal of the flow with the following actions will lead
to a leak of the memory allocated by nf_ct_tmpl_alloc():

  actions:clone(ct(commit),0)

Non-freed set() action may also leak the 'dst' structure for the
tunnel info including device references.

Under certain conditions with a high rate of flow rotation that may
cause significant memory leak problem (2MB per second in reporter's
case).  The problem is also hard to mitigate, because the user doesn't
have direct control over the datapath flows generated by OVS.

Fix that by iterating over all the nested actions and freeing
everything that needs to be freed recursively.

New build time assertion should protect us from this problem if new
actions will be added in the future.

Unfortunately, openvswitch module doesn't use NLA_F_NESTED, so all
attributes has to be explicitly checked.  sample() and clone() actions
are mixing extra attributes into the user-provided action list.  That
prevents some code generalization too.

Fixes: 34ae932a40 ("openvswitch: Make tunnel set action attach a metadata dst")
Link: https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392922.html
Reported-by: Stéphane Graber <stgraber@ubuntu.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-06 13:36:50 +01:00
Ilya Maximets 3f2a3050b4 net: openvswitch: don't send internal clone attribute to the userspace.
'OVS_CLONE_ATTR_EXEC' is an internal attribute that is used for
performance optimization inside the kernel.  It's added by the kernel
while parsing user-provided actions and should not be sent during the
flow dump as it's not part of the uAPI.

The issue doesn't cause any significant problems to the ovs-vswitchd
process, because reported actions are not really used in the
application lifecycle and only supposed to be shown to a human via
ovs-dpctl flow dump.  However, the action list is still incorrect
and causes the following error if the user wants to look at the
datapath flows:

  # ovs-dpctl add-dp system@ovs-system
  # ovs-dpctl add-flow "<flow match>" "clone(ct(commit),0)"
  # ovs-dpctl dump-flows
  <flow match>, packets:0, bytes:0, used:never,
    actions:clone(bad length 4, expected -1 for: action0(01 00 00 00),
                  ct(commit),0)

With the fix:

  # ovs-dpctl dump-flows
  <flow match>, packets:0, bytes:0, used:never,
    actions:clone(ct(commit),0)

Additionally fixed an incorrect attribute name in the comment.

Fixes: b233504033 ("openvswitch: kernel datapath clone action")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/20220404104150.2865736-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-05 17:57:54 -07:00
Jakub Kicinski 03eb7daec5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Incorrect comparison in bitmask .reduce, from Jeremy Sowden.

2) Missing GFP_KERNEL_ACCOUNT for dynamically allocated objects,
   from Vasily Averin.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: memcg accounting for dynamically allocated objects
  netfilter: bitwise: fix reduce comparisons
====================

Link: https://lore.kernel.org/r/20220405100923.7231-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-05 13:04:04 -07:00
David Ahern 1158f79f82 ipv6: Fix stats accounting in ip6_pkt_drop
VRF devices are the loopbacks for VRFs, and a loopback can not be
assigned to a VRF. Accordingly, the condition in ip6_pkt_drop should
be '||' not '&&'.

Fixes: 1d3fd8a10b ("vrf: Use orig netdev to count Ip6InNoRoutes and a fresh route lookup when sending dest unreach")
Reported-by: Pudak, Filip <Filip.Pudak@windriver.com>
Reported-by: Xiao, Jiguang <Jiguang.Xiao@windriver.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220404150908.2937-1-dsahern@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-05 13:41:27 +02:00
Vasily Averin 42193ffd79 netfilter: nf_tables: memcg accounting for dynamically allocated objects
nft_*.c files whose NFT_EXPR_STATEFUL flag is set on need to
use __GFP_ACCOUNT flag for objects that are dynamically
allocated from the packet path.

Such objects are allocated inside nft_expr_ops->init() callbacks
executed in task context while processing netlink messages.

In addition, this patch adds accounting to nft_set_elem_expr_clone()
used for the same purposes.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-05 11:55:46 +02:00
Jamie Bainbridge e3d37210df sctp: count singleton chunks in assoc user stats
Singleton chunks (INIT, HEARTBEAT PMTU probes, and SHUTDOWN-
COMPLETE) are not counted in SCTP_GET_ASOC_STATS "sas_octrlchunks"
counter available to the assoc owner.

These are all control chunks so they should be counted as such.

Add counting of singleton chunks so they are properly accounted for.

Fixes: 196d675934 ("sctp: Add support to per-association statistics via a new SCTP_GET_ASSOC_STATS call")
Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/c9ba8785789880cf07923b8a5051e174442ea9ee.1649029663.git.jamie.bainbridge@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-05 09:51:12 +02:00
Nikolay Aleksandrov 6bf92d70e6 net: ipv4: fix route with nexthop object delete warning
FRR folks have hit a kernel warning[1] while deleting routes[2] which is
caused by trying to delete a route pointing to a nexthop id without
specifying nhid but matching on an interface. That is, a route is found
but we hit a warning while matching it. The warning is from
fib_info_nh() in include/net/nexthop.h because we run it on a fib_info
with nexthop object. The call chain is:
 inet_rtm_delroute -> fib_table_delete -> fib_nh_match (called with a
nexthop fib_info and also with fc_oif set thus calling fib_info_nh on
the fib_info and triggering the warning). The fix is to not do any
matching in that branch if the fi has a nexthop object because those are
managed separately. I.e. we should match when deleting without nh spec and
should fail when deleting a nexthop route with old-style nh spec because
nexthop objects are managed separately, e.g.:
 $ ip r show 1.2.3.4/32
 1.2.3.4 nhid 12 via 192.168.11.2 dev dummy0

 $ ip r del 1.2.3.4/32
 $ ip r del 1.2.3.4/32 nhid 12
 <both should work>

 $ ip r del 1.2.3.4/32 dev dummy0
 <should fail with ESRCH>

[1]
 [  523.462226] ------------[ cut here ]------------
 [  523.462230] WARNING: CPU: 14 PID: 22893 at include/net/nexthop.h:468 fib_nh_match+0x210/0x460
 [  523.462236] Modules linked in: dummy rpcsec_gss_krb5 xt_socket nf_socket_ipv4 nf_socket_ipv6 ip6table_raw iptable_raw bpf_preload xt_statistic ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_mark nf_tables xt_nat veth nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter overlay dm_crypt nfsv3 nfs fscache netfs vhost_net vhost vhost_iotlb tap tun xt_CHECKSUM xt_MASQUERADE xt_conntrack 8021q garp mrp ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bridge stp llc rfcomm snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core ip6table_filter xt_comment ip6_tables vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) qrtr bnep binfmt_misc xfs vfat fat squashfs loop nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(POE) nvidia(POE) intel_rapl_msr intel_rapl_common snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi btusb btrtl iwlmvm uvcvideo btbcm snd_hda_intel edac_mce_amd
 [  523.462274]  videobuf2_vmalloc videobuf2_memops btintel snd_intel_dspcfg videobuf2_v4l2 snd_intel_sdw_acpi bluetooth snd_usb_audio snd_hda_codec mac80211 snd_usbmidi_lib joydev snd_hda_core videobuf2_common kvm_amd snd_rawmidi snd_hwdep snd_seq videodev ccp snd_seq_device libarc4 ecdh_generic mc snd_pcm kvm iwlwifi snd_timer drm_kms_helper snd cfg80211 cec soundcore irqbypass rapl wmi_bmof i2c_piix4 rfkill k10temp pcspkr acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc drm zram ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel nvme sp5100_tco r8169 nvme_core wmi ipmi_devintf ipmi_msghandler fuse
 [  523.462300] CPU: 14 PID: 22893 Comm: ip Tainted: P           OE     5.16.18-200.fc35.x86_64 #1
 [  523.462302] Hardware name: Micro-Star International Co., Ltd. MS-7C37/MPG X570 GAMING EDGE WIFI (MS-7C37), BIOS 1.C0 10/29/2020
 [  523.462303] RIP: 0010:fib_nh_match+0x210/0x460
 [  523.462304] Code: 7c 24 20 48 8b b5 90 00 00 00 e8 bb ee f4 ff 48 8b 7c 24 20 41 89 c4 e8 ee eb f4 ff 45 85 e4 0f 85 2e fe ff ff e9 4c ff ff ff <0f> 0b e9 17 ff ff ff 3c 0a 0f 85 61 fe ff ff 48 8b b5 98 00 00 00
 [  523.462306] RSP: 0018:ffffaa53d4d87928 EFLAGS: 00010286
 [  523.462307] RAX: 0000000000000000 RBX: ffffaa53d4d87a90 RCX: ffffaa53d4d87bb0
 [  523.462308] RDX: ffff9e3d2ee6be80 RSI: ffffaa53d4d87a90 RDI: ffffffff920ed380
 [  523.462309] RBP: ffff9e3d2ee6be80 R08: 0000000000000064 R09: 0000000000000000
 [  523.462310] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000031
 [  523.462310] R13: 0000000000000020 R14: 0000000000000000 R15: ffff9e3d331054e0
 [  523.462311] FS:  00007f245517c1c0(0000) GS:ffff9e492ed80000(0000) knlGS:0000000000000000
 [  523.462313] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [  523.462313] CR2: 000055e5dfdd8268 CR3: 00000003ef488000 CR4: 0000000000350ee0
 [  523.462315] Call Trace:
 [  523.462316]  <TASK>
 [  523.462320]  fib_table_delete+0x1a9/0x310
 [  523.462323]  inet_rtm_delroute+0x93/0x110
 [  523.462325]  rtnetlink_rcv_msg+0x133/0x370
 [  523.462327]  ? _copy_to_iter+0xb5/0x6f0
 [  523.462330]  ? rtnl_calcit.isra.0+0x110/0x110
 [  523.462331]  netlink_rcv_skb+0x50/0xf0
 [  523.462334]  netlink_unicast+0x211/0x330
 [  523.462336]  netlink_sendmsg+0x23f/0x480
 [  523.462338]  sock_sendmsg+0x5e/0x60
 [  523.462340]  ____sys_sendmsg+0x22c/0x270
 [  523.462341]  ? import_iovec+0x17/0x20
 [  523.462343]  ? sendmsg_copy_msghdr+0x59/0x90
 [  523.462344]  ? __mod_lruvec_page_state+0x85/0x110
 [  523.462348]  ___sys_sendmsg+0x81/0xc0
 [  523.462350]  ? netlink_seq_start+0x70/0x70
 [  523.462352]  ? __dentry_kill+0x13a/0x180
 [  523.462354]  ? __fput+0xff/0x250
 [  523.462356]  __sys_sendmsg+0x49/0x80
 [  523.462358]  do_syscall_64+0x3b/0x90
 [  523.462361]  entry_SYSCALL_64_after_hwframe+0x44/0xae
 [  523.462364] RIP: 0033:0x7f24552aa337
 [  523.462365] Code: 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
 [  523.462366] RSP: 002b:00007fff7f05a838 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 [  523.462368] RAX: ffffffffffffffda RBX: 000000006245bf91 RCX: 00007f24552aa337
 [  523.462368] RDX: 0000000000000000 RSI: 00007fff7f05a8a0 RDI: 0000000000000003
 [  523.462369] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
 [  523.462370] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000001
 [  523.462370] R13: 00007fff7f05ce08 R14: 0000000000000000 R15: 000055e5dfdd1040
 [  523.462373]  </TASK>
 [  523.462374] ---[ end trace ba537bc16f6bf4ed ]---

[2] https://github.com/FRRouting/frr/issues/6412

Fixes: 4c7e8084fd ("ipv4: Plumb support for nexthop object in a fib_info")
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-01 12:09:17 +01:00
Matt Johnston 4a9dda1c1d mctp: Use output netdev to allocate skb headroom
Previously the skb was allocated with headroom MCTP_HEADER_MAXLEN,
but that isn't sufficient if we are using devs that are not MCTP
specific.

This also adds a check that the smctp_halen provided to sendmsg for
extended addressing is the correct size for the netdev.

Fixes: 833ef3b91d ("mctp: Populate socket implementation")
Reported-by: Matthew Rinaldi <mjrinal@g.clemson.edu>
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-01 12:04:15 +01:00
Matt Johnston 60be976ac4 mctp: Fix check for dev_hard_header() result
dev_hard_header() returns the length of the header, so
we need to test for negative errors rather than non-zero.

Fixes: 889b7da23a ("mctp: Add initial routing framework")
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-01 12:04:15 +01:00
Vladimir Oltean 066dfc4290 Revert "net: dsa: stop updating master MTU from master.c"
This reverts commit a1ff94c297.

Switch drivers that don't implement ->port_change_mtu() will cause the
DSA master to remain with an MTU of 1500, since we've deleted the other
code path. In turn, this causes a regression for those systems, where
MTU-sized traffic can no longer be terminated.

Revert the change taking into account the fact that rtnl_lock() is now
taken top-level from the callers of dsa_master_setup() and
dsa_master_teardown(). Also add a comment in order for it to be
absolutely clear why it is still needed.

Fixes: a1ff94c297 ("net: dsa: stop updating master MTU from master.c")
Reported-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-01 11:59:01 +01:00
Jean-Philippe Brucker 1effe8ca4e skbuff: fix coalescing for page_pool fragment recycling
Fix a use-after-free when using page_pool with page fragments. We
encountered this problem during normal RX in the hns3 driver:

(1) Initially we have three descriptors in the RX queue. The first one
    allocates PAGE1 through page_pool, and the other two allocate one
    half of PAGE2 each. Page references look like this:

                RX_BD1 _______ PAGE1
                RX_BD2 _______ PAGE2
                RX_BD3 _________/

(2) Handle RX on the first descriptor. Allocate SKB1, eventually added
    to the receive queue by tcp_queue_rcv().

(3) Handle RX on the second descriptor. Allocate SKB2 and pass it to
    netif_receive_skb():

    netif_receive_skb(SKB2)
      ip_rcv(SKB2)
        SKB3 = skb_clone(SKB2)

    SKB2 and SKB3 share a reference to PAGE2 through
    skb_shinfo()->dataref. The other ref to PAGE2 is still held by
    RX_BD3:

                      SKB2 ---+- PAGE2
                      SKB3 __/   /
                RX_BD3 _________/

 (3b) Now while handling TCP, coalesce SKB3 with SKB1:

      tcp_v4_rcv(SKB3)
        tcp_try_coalesce(to=SKB1, from=SKB3)    // succeeds
        kfree_skb_partial(SKB3)
          skb_release_data(SKB3)                // drops one dataref

                      SKB1 _____ PAGE1
                           \____
                      SKB2 _____ PAGE2
                                 /
                RX_BD3 _________/

    In skb_try_coalesce(), __skb_frag_ref() takes a page reference to
    PAGE2, where it should instead have increased the page_pool frag
    reference, pp_frag_count. Without coalescing, when releasing both
    SKB2 and SKB3, a single reference to PAGE2 would be dropped. Now
    when releasing SKB1 and SKB2, two references to PAGE2 will be
    dropped, resulting in underflow.

 (3c) Drop SKB2:

      af_packet_rcv(SKB2)
        consume_skb(SKB2)
          skb_release_data(SKB2)                // drops second dataref
            page_pool_return_skb_page(PAGE2)    // drops one pp_frag_count

                      SKB1 _____ PAGE1
                           \____
                                 PAGE2
                                 /
                RX_BD3 _________/

(4) Userspace calls recvmsg()
    Copies SKB1 and releases it. Since SKB3 was coalesced with SKB1, we
    release the SKB3 page as well:

    tcp_eat_recv_skb(SKB1)
      skb_release_data(SKB1)
        page_pool_return_skb_page(PAGE1)
        page_pool_return_skb_page(PAGE2)        // drops second pp_frag_count

(5) PAGE2 is freed, but the third RX descriptor was still using it!
    In our case this causes IOMMU faults, but it would silently corrupt
    memory if the IOMMU was disabled.

Change the logic that checks whether pp_recycle SKBs can be coalesced.
We still reject differing pp_recycle between 'from' and 'to' SKBs, but
in order to avoid the situation described above, we also reject
coalescing when both 'from' and 'to' are pp_recycled and 'from' is
cloned.

The new logic allows coalescing a cloned pp_recycle SKB into a page
refcounted one, because in this case the release (4) will drop the right
reference, the one taken by skb_try_coalesce().

Fixes: 53e0961da1 ("page_pool: add frag page recycling support in page pool")
Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-01 11:57:58 +01:00
Ziyang Xuan 9381fe8c84 net/tls: fix slab-out-of-bounds bug in decrypt_internal
The memory size of tls_ctx->rx.iv for AES128-CCM is 12 setting in
tls_set_sw_offload(). The return value of crypto_aead_ivsize()
for "ccm(aes)" is 16. So memcpy() require 16 bytes from 12 bytes
memory space will trigger slab-out-of-bounds bug as following:

==================================================================
BUG: KASAN: slab-out-of-bounds in decrypt_internal+0x385/0xc40 [tls]
Read of size 16 at addr ffff888114e84e60 by task tls/10911

Call Trace:
 <TASK>
 dump_stack_lvl+0x34/0x44
 print_report.cold+0x5e/0x5db
 ? decrypt_internal+0x385/0xc40 [tls]
 kasan_report+0xab/0x120
 ? decrypt_internal+0x385/0xc40 [tls]
 kasan_check_range+0xf9/0x1e0
 memcpy+0x20/0x60
 decrypt_internal+0x385/0xc40 [tls]
 ? tls_get_rec+0x2e0/0x2e0 [tls]
 ? process_rx_list+0x1a5/0x420 [tls]
 ? tls_setup_from_iter.constprop.0+0x2e0/0x2e0 [tls]
 decrypt_skb_update+0x9d/0x400 [tls]
 tls_sw_recvmsg+0x3c8/0xb50 [tls]

Allocated by task 10911:
 kasan_save_stack+0x1e/0x40
 __kasan_kmalloc+0x81/0xa0
 tls_set_sw_offload+0x2eb/0xa20 [tls]
 tls_setsockopt+0x68c/0x700 [tls]
 __sys_setsockopt+0xfe/0x1b0

Replace the crypto_aead_ivsize() with prot->iv_size + prot->salt_size
when memcpy() iv value in TLS_1_3_VERSION scenario.

Fixes: f295b3ae9f ("net/tls: Add support of AES128-CCM based ciphers")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-01 11:53:35 +01:00
Linus Torvalds 2975dbdc39 Networking fixes for 5.18-rc1 and rethook patches.
Features:
 
  - kprobes: rethook: x86: replace kretprobe trampoline with rethook
 
 Current release - regressions:
 
  - sfc: avoid null-deref on systems without NUMA awareness
    in the new queue sizing code
 
 Current release - new code bugs:
 
  - vxlan: do not feed vxlan_vnifilter_dump_dev with non-vxlan devices
 
  - eth: lan966x: fix null-deref on PHY pointer in timestamp ioctl
    when interface is down
 
 Previous releases - always broken:
 
  - openvswitch: correct neighbor discovery target mask field
    in the flow dump
 
  - wireguard: ignore v6 endpoints when ipv6 is disabled and fix a leak
 
  - rxrpc: fix call timer start racing with call destruction
 
  - rxrpc: fix null-deref when security type is rxrpc_no_security
 
  - can: fix UAF bugs around echo skbs in multiple drivers
 
 Misc:
 
  - docs: move netdev-FAQ to the "process" section of the documentation
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmJF3S0ACgkQMUZtbf5S
 IruIvA/+NZx+c+fBBrbjOh63avRL7kYIqIDREf+v6lh4ZXmbrp22xalcjIdxgWeK
 vAiYfYmzZblWAGkilcvPG3blCBc+9b+YE+pPJXFe60Huv3eYpjKfgTKwQOg/lIeM
 8MfPP7eBwcJ/ltSTRtySRl9LYgyVcouP9rAVJavFVYrvuDYunwhfChswVfGCYon8
 42O4nRwrtkTE1MjHD8HS3YxvwGlo+iIyhsxgG/gWx8F2xeIG22H6adzjDXcCQph8
 air/awrJ4enYkVMRokGNfNppK9Z3vjJDX5xha3CREpvXNPe0F24cAE/L8XqyH7+r
 /bXP5y9VC9mmEO7x4Le3VmDhOJGbCOtR89gTlevftDRdSIrbNHffZhbPW48tR7o8
 NJFlhiSJb4HEMN0q7BmxnWaKlbZUlvLEXLuU5ytZE/G7i+nETULlunfZrCD4eNYH
 gBGYhiob2I/XotJA9QzG/RDyaFwDaC/VARsyv37PSeBAl/yrEGAeP7DsKkKX/ayg
 LM9ItveqHXK30J0xr3QJA8s49EkIYejjYR3l0hQ9esf9QvGK99dE/fo44Apf3C3A
 Lz6XpnRc5Xd7tZ9Aopwb3FqOH6WR9Hq9Qlbk0qifsL/2sRbatpuZbbDK6L3CR3Ir
 WFNcOoNbbqv85kCKFXFjj0jdpoNa9Yej8XFkMkVSkM3sHImYmYQ=
 =5Bvy
 -----END PGP SIGNATURE-----

Merge tag 'net-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull more networking updates from Jakub Kicinski:
 "Networking fixes and rethook patches.

  Features:

   - kprobes: rethook: x86: replace kretprobe trampoline with rethook

  Current release - regressions:

   - sfc: avoid null-deref on systems without NUMA awareness in the new
     queue sizing code

  Current release - new code bugs:

   - vxlan: do not feed vxlan_vnifilter_dump_dev with non-vxlan devices

   - eth: lan966x: fix null-deref on PHY pointer in timestamp ioctl when
     interface is down

  Previous releases - always broken:

   - openvswitch: correct neighbor discovery target mask field in the
     flow dump

   - wireguard: ignore v6 endpoints when ipv6 is disabled and fix a leak

   - rxrpc: fix call timer start racing with call destruction

   - rxrpc: fix null-deref when security type is rxrpc_no_security

   - can: fix UAF bugs around echo skbs in multiple drivers

  Misc:

   - docs: move netdev-FAQ to the 'process' section of the
     documentation"

* tag 'net-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (57 commits)
  vxlan: do not feed vxlan_vnifilter_dump_dev with non vxlan devices
  openvswitch: Add recirc_id to recirc warning
  rxrpc: fix some null-ptr-deref bugs in server_key.c
  rxrpc: Fix call timer start racing with call destruction
  net: hns3: fix software vlan talbe of vlan 0 inconsistent with hardware
  net: hns3: fix the concurrency between functions reading debugfs
  docs: netdev: move the netdev-FAQ to the process pages
  docs: netdev: broaden the new vs old code formatting guidelines
  docs: netdev: call out the merge window in tag checking
  docs: netdev: add missing back ticks
  docs: netdev: make the testing requirement more stringent
  docs: netdev: add a question about re-posting frequency
  docs: netdev: rephrase the 'should I update patchwork' question
  docs: netdev: rephrase the 'Under review' question
  docs: netdev: shorten the name and mention msgid for patch status
  docs: netdev: note that RFC postings are allowed any time
  docs: netdev: turn the net-next closed into a Warning
  docs: netdev: move the patch marking section up
  docs: netdev: minor reword
  docs: netdev: replace references to old archives
  ...
2022-03-31 11:23:31 -07:00
Stéphane Graber ea07af2e71 openvswitch: Add recirc_id to recirc warning
When hitting the recirculation limit, the kernel would currently log
something like this:

[   58.586597] openvswitch: ovs-system: deferred action limit reached, drop recirc action

Which isn't all that useful to debug as we only have the interface name
to go on but can't track it down to a specific flow.

With this change, we now instead get:

[   58.586597] openvswitch: ovs-system: deferred action limit reached, drop recirc action (recirc_id=0x9e)

Which can now be correlated with the flow entries from OVS.

Suggested-by: Frode Nordahl <frode.nordahl@canonical.com>
Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Tested-by: Stephane Graber <stgraber@ubuntu.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/20220330194244.3476544-1-stgraber@ubuntu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-31 08:52:48 -07:00
Jakub Kicinski 46b556205d linux-can-fixes-for-5.18-20220331
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEBsvAIBsPu6mG7thcrX5LkNig010FAmJFXq0THG1rbEBwZW5n
 dXRyb25peC5kZQAKCRCtfkuQ2KDTXc+GB/9hPO+0SN80Fz93rDE8M6wI1rY6famJ
 mqcaNmSDGuX9jMxC3CLugw8rckfXKb0kicjq8xLqAU7PX67h4nShnZCmHq+4RnJW
 AZIV4sUOgeMcDohMPY0b8BIyS1JDwjR2L9vrFjeJm/BlDC2mkbRC2pXr6Pq9Xg4Y
 aBaNA94v9UMQvTdPK8UmHOvl3M6gPV3ggBiCRM5xxnYGkrQFApgtCsXMvizbBUGS
 tNLEOGYD/YAZKAiZfEnaTKqklUCGPdm4vuiP+4F9+2ovBK38CiiJSx/fF6aLkZy+
 RdLpjrsqjva1cPSxaE3tAm45KOF5vm3hrBGWgslJz+0CN+tBWy/Q2G4s
 =yO1B
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-5.18-20220331' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2022-03-31

The first patch is by Oliver Hartkopp and fixes MSG_PEEK feature in
the CAN ISOTP protocol (broken in net-next for v5.18 only).

Tom Rix's patch for the mcp251xfd driver fixes the propagation of an
error value in case of an error.

A patch by me for the m_can driver fixes a use-after-free in the xmit
handler for m_can IP cores v3.0.x.

Hangyu Hua contributes 3 patches fixing the same double free in the
error path of the xmit handler in the ems_usb, usb_8dev and mcba_usb
USB CAN driver.

Pavel Skripkin contributes a patch for the mcba_usb driver to properly
check the endpoint type.

The last patch is by me and fixes a mem leak in the gs_usb, which was
introduced in net-next for v5.18.

* tag 'linux-can-fixes-for-5.18-20220331' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: gs_usb: gs_make_candev(): fix memory leak for devices with extended bit timing configuration
  can: mcba_usb: properly check endpoint type
  can: mcba_usb: mcba_usb_start_xmit(): fix double dev_kfree_skb in error path
  can: usb_8dev: usb_8dev_start_xmit(): fix double dev_kfree_skb() in error path
  can: ems_usb: ems_usb_start_xmit(): fix double dev_kfree_skb() in error path
  can: m_can: m_can_tx_handler(): fix use after free of skb
  can: mcp251xfd: mcp251xfd_register_get_dev_id(): fix return of error value
  can: isotp: restore accidentally removed MSG_PEEK feature
====================

Link: https://lore.kernel.org/r/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-31 08:36:18 -07:00
Xiaolong Huang ff8376ade4 rxrpc: fix some null-ptr-deref bugs in server_key.c
Some function calls are not implemented in rxrpc_no_security, there are
preparse_server_key, free_preparse_server_key and destroy_server_key.
When rxrpc security type is rxrpc_no_security, user can easily trigger a
null-ptr-deref bug via ioctl. So judgment should be added to prevent it

The crash log:
user@syzkaller:~$ ./rxrpc_preparse_s
[   37.956878][T15626] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   37.957645][T15626] #PF: supervisor instruction fetch in kernel mode
[   37.958229][T15626] #PF: error_code(0x0010) - not-present page
[   37.958762][T15626] PGD 4aadf067 P4D 4aadf067 PUD 4aade067 PMD 0
[   37.959321][T15626] Oops: 0010 [#1] PREEMPT SMP
[   37.959739][T15626] CPU: 0 PID: 15626 Comm: rxrpc_preparse_ Not tainted 5.17.0-01442-gb47d5a4f6b8d #43
[   37.960588][T15626] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[   37.961474][T15626] RIP: 0010:0x0
[   37.961787][T15626] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[   37.962480][T15626] RSP: 0018:ffffc9000d9abdc0 EFLAGS: 00010286
[   37.963018][T15626] RAX: ffffffff84335200 RBX: ffff888012a1ce80 RCX: 0000000000000000
[   37.963727][T15626] RDX: 0000000000000000 RSI: ffffffff84a736dc RDI: ffffc9000d9abe48
[   37.964425][T15626] RBP: ffffc9000d9abe48 R08: 0000000000000000 R09: 0000000000000002
[   37.965118][T15626] R10: 000000000000000a R11: f000000000000000 R12: ffff888013145680
[   37.965836][T15626] R13: 0000000000000000 R14: ffffffffffffffec R15: ffff8880432aba80
[   37.966441][T15626] FS:  00007f2177907700(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
[   37.966979][T15626] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   37.967384][T15626] CR2: ffffffffffffffd6 CR3: 000000004aaf1000 CR4: 00000000000006f0
[   37.967864][T15626] Call Trace:
[   37.968062][T15626]  <TASK>
[   37.968240][T15626]  rxrpc_preparse_s+0x59/0x90
[   37.968541][T15626]  key_create_or_update+0x174/0x510
[   37.968863][T15626]  __x64_sys_add_key+0x139/0x1d0
[   37.969165][T15626]  do_syscall_64+0x35/0xb0
[   37.969451][T15626]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   37.969824][T15626] RIP: 0033:0x43a1f9

Signed-off-by: Xiaolong Huang <butterflyhuangxx@gmail.com>
Tested-by: Xiaolong Huang <butterflyhuangxx@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2022-March/005069.html
Fixes: 12da59fcab ("rxrpc: Hand server key parsing off to the security class")
Link: https://lore.kernel.org/r/164865013439.2941502.8966285221215590921.stgit@warthog.procyon.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-03-31 15:21:31 +02:00
David Howells 4a7f62f919 rxrpc: Fix call timer start racing with call destruction
The rxrpc_call struct has a timer used to handle various timed events
relating to a call.  This timer can get started from the packet input
routines that are run in softirq mode with just the RCU read lock held.
Unfortunately, because only the RCU read lock is held - and neither ref or
other lock is taken - the call can start getting destroyed at the same time
a packet comes in addressed to that call.  This causes the timer - which
was already stopped - to get restarted.  Later, the timer dispatch code may
then oops if the timer got deallocated first.

Fix this by trying to take a ref on the rxrpc_call struct and, if
successful, passing that ref along to the timer.  If the timer was already
running, the ref is discarded.

The timer completion routine can then pass the ref along to the call's work
item when it queues it.  If the timer or work item where already
queued/running, the extra ref is discarded.

Fixes: a158bdd324 ("rxrpc: Fix call timeouts")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2022-March/005073.html
Link: https://lore.kernel.org/r/164865115696.2943015.11097991776647323586.stgit@warthog.procyon.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-03-31 12:25:25 +02:00
Oliver Hartkopp e382fea8ae can: isotp: restore accidentally removed MSG_PEEK feature
In commit 42bf50a179 ("can: isotp: support MSG_TRUNC flag when
reading from socket") a new check for recvmsg flags has been
introduced that only checked for the flags that are handled in
isotp_recvmsg() itself.

This accidentally removed the MSG_PEEK feature flag which is processed
later in the call chain in __skb_try_recv_from_queue().

Add MSG_PEEK to the set of valid flags to restore the feature.

Fixes: 42bf50a179 ("can: isotp: support MSG_TRUNC flag when reading from socket")
Link: https://github.com/linux-can/can-utils/issues/347#issuecomment-1079554254
Link: https://lore.kernel.org/all/20220328113611.3691-1-socketcan@hartkopp.net
Reported-by: Derek Will <derekrobertwill@gmail.com>
Suggested-by: Derek Will <derekrobertwill@gmail.com>
Tested-by: Derek Will <derekrobertwill@gmail.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-03-31 09:46:58 +02:00
Jakub Kicinski 77c9387c0c Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2022-03-29

We've added 16 non-merge commits during the last 1 day(s) which contain
a total of 24 files changed, 354 insertions(+), 187 deletions(-).

The main changes are:

1) x86 specific bits of fprobe/rethook, from Masami and Peter.

2) ice/xsk fixes, from Maciej and Magnus.

3) Various small fixes, from Andrii, Yonghong, Geliang and others.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Fix clang compilation errors
  ice: xsk: Fix indexing in ice_tx_xsk_pool()
  ice: xsk: Stop Rx processing when ntc catches ntu
  ice: xsk: Eliminate unnecessary loop iteration
  xsk: Do not write NULL in SW ring at allocation failure
  x86,kprobes: Fix optprobe trampoline to generate complete pt_regs
  x86,rethook: Fix arch_rethook_trampoline() to generate a complete pt_regs
  x86,rethook,kprobes: Replace kretprobe with rethook on x86
  kprobes: Use rethook for kretprobe if possible
  bpftool: Fix generated code in codegen_asserts
  selftests/bpf: fix selftest after random: Urandom_read tracepoint removal
  bpf: Fix maximum permitted number of arguments check
  bpf: Sync comments for bpf_get_stack
  fprobe: Fix sparse warning for acccessing __rcu ftrace_hash
  fprobe: Fix smatch type mismatch warning
  bpf/bpftool: Add unprivileged_bpf_disabled check against value of 2
====================

Link: https://lore.kernel.org/r/20220329234924.39053-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-29 18:59:15 -07:00
Linus Torvalds 965181d7ef NFS client updates for Linux 5.18
Highlights include:
 
 Features:
 - Switch NFS to use readahead instead of the obsolete readpages.
 - Readdir fixes to improve cacheability of large directories when there
   are multiple readers and writers.
 - Readdir performance improvements when doing a seekdir() immediately
   after opening the directory (common when re-exporting NFS).
 - NFS swap improvements from Neil Brown.
 - Loosen up memory allocation to permit direct reclaim and write back
   in cases where there is no danger of deadlocking the writeback code or
   NFS swap.
 - Avoid sillyrename when the NFSv4 server claims to support the
   necessary features to recover the unlinked but open file after reboot.
 
 Bugfixes:
 - Patch from Olga to add a mount option to control NFSv4.1 session
   trunking discovery, and default it to being off.
 - Fix a lockup in nfs_do_recoalesce().
 - Two fixes for list iterator variables being used when pointing to the
   list head.
 - Fix a kernel memory scribble when reading from a non-socket transport
   in /sys/kernel/sunrpc.
 - Fix a race where reconnecting to a server could leave the TCP socket
   stuck forever in the connecting state.
 - Patch from Neil to fix a shutdown race which can leave the SUNRPC
   transport timer primed after we free the struct xprt itself.
 - Patch from Xin Xiong to fix reference count leaks in the NFSv4.2 copy
   offload.
 - Sunrpc patch from Olga to avoid resending a task on an offlined
   transport.
 
 Cleanups:
 - Patches from Dave Wysochanski to clean up the fscache code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmJDLLUACgkQZwvnipYK
 APJZ5Q/9FH5S38RcUOxpFS+XaVdmxltJTMQv9Bz5mcyljq6PwRjTn73Qs7MfBQfD
 k90O9Pn/EXgQpFWNW4o/saQGLAO1MbVaesdNNF92Le9NziMz7lr95fsFXOqpg9EQ
 rEVOImWinv/N0+P0yv5tgjx6Fheu70L4dnwsxepnmtv2R1VjXOVchK7WjsV+9qey
 Xvn15OzhtVCGhRvvKfmGuGW+2I0D29tRNn2X0Pi+LjxAEQ0We/rcsJeGDMGYtbeC
 2U4ADk7KB0kJq04WYCH8k8tTrljL3PUONzO6QP1lltmAZ0FOwXf/AjyaD3SML2Xu
 A1yc2bEn3/S6wCQ3BiMxoE7UCSewg/ZyQD7B9GjZ/nWYNXREhLKvtVZt+ULTCA6o
 UKAI2vlEHljzQI2YfbVVZFyL8KCZqi+BFcJ/H/dtFKNazIkiefkaF+703Pflazis
 HjhA5G6IT0Pk6qZ6OBp+Z/4kCDYZYXXi9ysiNuMPWCiYEfZzt7ocUjm44uRAi+vR
 Qpf0V01Rg4iplTj3YFXqb9TwxrudzC7AcQ7hbLJia0kD/8djG+aGZs3SJWeO+VuE
 Xa41oxAx9jqOSRvSKPvj7nbexrwn8PtQI2NtDRqjPXlg6+Jbk23u8lIEVhsCH877
 xoMUHtTKLyqECskzRN4nh7qoBD3QfiLuI74aDcPtGEuXzU1+Ox8=
 =jeGt
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Features:

   - Switch NFS to use readahead instead of the obsolete readpages.

   - Readdir fixes to improve cacheability of large directories when
     there are multiple readers and writers.

   - Readdir performance improvements when doing a seekdir() immediately
     after opening the directory (common when re-exporting NFS).

   - NFS swap improvements from Neil Brown.

   - Loosen up memory allocation to permit direct reclaim and write back
     in cases where there is no danger of deadlocking the writeback code
     or NFS swap.

   - Avoid sillyrename when the NFSv4 server claims to support the
     necessary features to recover the unlinked but open file after
     reboot.

  Bugfixes:

   - Patch from Olga to add a mount option to control NFSv4.1 session
     trunking discovery, and default it to being off.

   - Fix a lockup in nfs_do_recoalesce().

   - Two fixes for list iterator variables being used when pointing to
     the list head.

   - Fix a kernel memory scribble when reading from a non-socket
     transport in /sys/kernel/sunrpc.

   - Fix a race where reconnecting to a server could leave the TCP
     socket stuck forever in the connecting state.

   - Patch from Neil to fix a shutdown race which can leave the SUNRPC
     transport timer primed after we free the struct xprt itself.

   - Patch from Xin Xiong to fix reference count leaks in the NFSv4.2
     copy offload.

   - Sunrpc patch from Olga to avoid resending a task on an offlined
     transport.

  Cleanups:

   - Patches from Dave Wysochanski to clean up the fscache code"

* tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (91 commits)
  NFSv4/pNFS: Fix another issue with a list iterator pointing to the head
  NFS: Don't loop forever in nfs_do_recoalesce()
  SUNRPC: Don't return error values in sysfs read of closed files
  SUNRPC: Do not dereference non-socket transports in sysfs
  NFSv4.1: don't retry BIND_CONN_TO_SESSION on session error
  SUNRPC don't resend a task on an offlined transport
  NFS: replace usage of found with dedicated list iterator variable
  SUNRPC: avoid race between mod_timer() and del_timer_sync()
  pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod
  pNFS/flexfiles: Ensure pNFS allocation modes are consistent with nfsiod
  NFSv4/pnfs: Ensure pNFS allocation modes are consistent with nfsiod
  NFS: Avoid writeback threads getting stuck in mempool_alloc()
  NFS: nfsiod should not block forever in mempool_alloc()
  SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent
  SUNRPC: Fix unx_lookup_cred() allocation
  NFS: Fix memory allocation in rpc_alloc_task()
  NFS: Fix memory allocation in rpc_malloc()
  SUNRPC: Improve accuracy of socket ENOBUFS determination
  SUNRPC: Replace internal use of SOCKWQ_ASYNC_NOSPACE
  SUNRPC: Fix socket waits for write buffer space
  ...
2022-03-29 18:55:37 -07:00
Jeremy Sowden 3181821317 netfilter: bitwise: fix reduce comparisons
The `nft_bitwise_reduce` and `nft_bitwise_fast_reduce` functions should
compare the bitwise operation in `expr` with the tracked operation
associated with the destination register of `expr`.  However, instead of
being called on `expr` and `track->regs[priv->dreg].selector`,
`nft_expr_priv` is called on `expr` twice, so both reduce functions
return true even when the operations differ.

Fixes: be5650f8f4 ("netfilter: nft_bitwise: track register operations")
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-29 23:33:46 +02:00
Duoming Zhou 82e31755e5 ax25: Fix UAF bugs in ax25 timers
There are race conditions that may lead to UAF bugs in
ax25_heartbeat_expiry(), ax25_t1timer_expiry(), ax25_t2timer_expiry(),
ax25_t3timer_expiry() and ax25_idletimer_expiry(), when we call
ax25_release() to deallocate ax25_dev.

One of the UAF bugs caused by ax25_release() is shown below:

      (Thread 1)                    |      (Thread 2)
ax25_dev_device_up() //(1)          |
...                                 | ax25_kill_by_device()
ax25_bind()          //(2)          |
ax25_connect()                      | ...
 ax25_std_establish_data_link()     |
  ax25_start_t1timer()              | ax25_dev_device_down() //(3)
   mod_timer(&ax25->t1timer,..)     |
                                    | ax25_release()
   (wait a time)                    |  ...
                                    |  ax25_dev_put(ax25_dev) //(4)FREE
   ax25_t1timer_expiry()            |
    ax25->ax25_dev->values[..] //USE|  ...
     ...                            |

We increase the refcount of ax25_dev in position (1) and (2), and
decrease the refcount of ax25_dev in position (3) and (4).
The ax25_dev will be freed in position (4) and be used in
ax25_t1timer_expiry().

The fail log is shown below:
==============================================================

[  106.116942] BUG: KASAN: use-after-free in ax25_t1timer_expiry+0x1c/0x60
[  106.116942] Read of size 8 at addr ffff88800bda9028 by task swapper/0/0
[  106.116942] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.17.0-06123-g0905eec574
[  106.116942] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-14
[  106.116942] Call Trace:
...
[  106.116942]  ax25_t1timer_expiry+0x1c/0x60
[  106.116942]  call_timer_fn+0x122/0x3d0
[  106.116942]  __run_timers.part.0+0x3f6/0x520
[  106.116942]  run_timer_softirq+0x4f/0xb0
[  106.116942]  __do_softirq+0x1c2/0x651
...

This patch adds del_timer_sync() in ax25_release(), which could ensure
that all timers stop before we deallocate ax25_dev.

Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-03-29 10:24:34 +02:00
Duoming Zhou 5352a76130 ax25: fix UAF bug in ax25_send_control()
There are UAF bugs in ax25_send_control(), when we call ax25_release()
to deallocate ax25_dev. The possible race condition is shown below:

      (Thread 1)              |     (Thread 2)
ax25_dev_device_up() //(1)    |
                              | ax25_kill_by_device()
ax25_bind()          //(2)    |
ax25_connect()                | ...
 ax25->state = AX25_STATE_1   |
 ...                          | ax25_dev_device_down() //(3)

      (Thread 3)
ax25_release()                |
 ax25_dev_put()  //(4) FREE   |
 case AX25_STATE_1:           |
  ax25_send_control()         |
   alloc_skb()       //USE    |

The refcount of ax25_dev increases in position (1) and (2), and
decreases in position (3) and (4). The ax25_dev will be freed
before dereference sites in ax25_send_control().

The following is part of the report:

[  102.297448] BUG: KASAN: use-after-free in ax25_send_control+0x33/0x210
[  102.297448] Read of size 8 at addr ffff888009e6e408 by task ax25_close/602
[  102.297448] Call Trace:
[  102.303751]  ax25_send_control+0x33/0x210
[  102.303751]  ax25_release+0x356/0x450
[  102.305431]  __sock_release+0x6d/0x120
[  102.305431]  sock_close+0xf/0x20
[  102.305431]  __fput+0x11f/0x420
[  102.305431]  task_work_run+0x86/0xd0
[  102.307130]  get_signal+0x1075/0x1220
[  102.308253]  arch_do_signal_or_restart+0x1df/0xc00
[  102.308253]  exit_to_user_mode_prepare+0x150/0x1e0
[  102.308253]  syscall_exit_to_user_mode+0x19/0x50
[  102.308253]  do_syscall_64+0x48/0x90
[  102.308253]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  102.308253] RIP: 0033:0x405ae7

This patch defers the free operation of ax25_dev and net_device after
all corresponding dereference sites in ax25_release() to avoid UAF.

Fixes: 9fd75b66b8 ("ax25: Fix refcount leaks caused by ax25_cb_del()")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-03-29 10:24:34 +02:00
Martin Varghese f19c44452b openvswitch: Fixed nd target mask field in the flow dump.
IPv6 nd target mask was not getting populated in flow dump.

In the function __ovs_nla_put_key the icmp code mask field was checked
instead of icmp code key field to classify the flow as neighbour discovery.

ufid:bdfbe3e5-60c2-43b0-a5ff-dfcac1c37328, recirc_id(0),dp_hash(0/0),
skb_priority(0/0),in_port(ovs-nm1),skb_mark(0/0),ct_state(0/0),
ct_zone(0/0),ct_mark(0/0),ct_label(0/0),
eth(src=00:00:00:00:00:00/00:00:00:00:00:00,
dst=00:00:00:00:00:00/00:00:00:00:00:00),
eth_type(0x86dd),
ipv6(src=::/::,dst=::/::,label=0/0,proto=58,tclass=0/0,hlimit=0/0,frag=no),
icmpv6(type=135,code=0),
nd(target=2001::2/::,
sll=00:00:00:00:00:00/00:00:00:00:00:00,
tll=00:00:00:00:00:00/00:00:00:00:00:00),
packets:10, bytes:860, used:0.504s, dp:ovs, actions:ovs-nm2

Fixes: e64457191a (openvswitch: Restructure datapath.c and flow.c)
Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Link: https://lore.kernel.org/r/20220328054148.3057-1-martinvarghesenokia@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-03-29 09:47:41 +02:00
Magnus Karlsson a95a4d9b39 xsk: Do not write NULL in SW ring at allocation failure
For the case when xp_alloc_batch() is used but the batched allocation
cannot be used, there is a slow path that uses the non-batched
xp_alloc(). When it fails to allocate an entry, it returns NULL. The
current code wrote this NULL into the entry of the provided results
array (pointer to the driver SW ring usually) and returned. This might
not be what the driver expects and to make things simpler, just write
successfully allocated xdp_buffs into the SW ring,. The driver might
have information in there that is still important after an allocation
failure.

Note that at this point in time, there are no drivers using
xp_alloc_batch() that could trigger this slow path. But one might get
added.

Fixes: 47e4075df3 ("xsk: Batched buffer allocation for the pool")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220328142123.170157-2-maciej.fijalkowski@intel.com
2022-03-28 19:56:28 -07:00
Linus Torvalds d717e4cae0 Networking fixes, including fixes from netfilter.
Current release - regressions:
 
  - llc: only change llc->dev when bind() succeeds, fix null-deref
 
 Current release - new code bugs:
 
  - smc: fix a memory leak in smc_sysctl_net_exit()
 
  - dsa: realtek: make interface drivers depend on OF
 
 Previous releases - regressions:
 
  - sched: act_ct: fix ref leak when switching zones
 
 Previous releases - always broken:
 
  - netfilter: egress: report interface as outgoing
 
  - vsock/virtio: enable VQs early on probe and finish the setup
    before using them
 
 Misc:
 
  - memcg: enable accounting for nft objects
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmJCSMMACgkQMUZtbf5S
 Irt4eQ//fTAC/7mBmT8uoUJMZlrRckSDnJ/Y1ukgOQrjbwcgeRi0PK1cy2oGmU4w
 mRZ8zhskVpmzodPuduCIdmsdE2PaWTCFoVRC52QH1HffCRbj1mRK9vrf94q0TP9+
 jqzaIOhKyWKGMgYQGObIFbojnF4H1wm+tIXcEVWzxivS/2yY4W/3hdBIblBO++r5
 c9vxO//qzGH1kGDCWfahuJSTvZBpQ3HTmjGLC1F8xTh8RkR7MGQyGCQ984j+DClb
 PJJQXeV/Zoyxvrzv14MU5Ms9+lsgH2pyBdVzvN8p2QSwSaU8CsbbM05I4lB5mT/b
 tGBYNreMmuQbXRxNVoxaZOTgQqEtTgH+AKJ9L0f2Es6Ftp5TTrFZZA97lO0/qzMj
 NGbxa0p7tlNyOGKDxyUw6SB1+kqqgR2a6skk4XnQ6CAH7AvxSFOxvt63mjeJfCY7
 +j5Lxtm+a/RpVt6Djsvpwq12lKiootcbEyMoUKxKeQ+4I08z6W6hoS1zjUDeMDM6
 q8eDXsxpZgGF6k7x3eKkOWKLMVeQ1cv0CjGaCoTXCqtGZTixRll3v6I6/Oh405Gw
 18fZkIC4TjRdXyfA23n7MzyukjOjmbzn5Kx01lfiMYFeiS/tMwFt/W+ka836j0R6
 gzUdEHLEZdPN699WP4fRrxmIjGlGpEpl02WDEFrP+LDdFCYHCzY=
 =sOIu
 -----END PGP SIGNATURE-----

Merge tag 'net-5.18-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter.

  Current release - regressions:

   - llc: only change llc->dev when bind() succeeds, fix null-deref

  Current release - new code bugs:

   - smc: fix a memory leak in smc_sysctl_net_exit()

   - dsa: realtek: make interface drivers depend on OF

  Previous releases - regressions:

   - sched: act_ct: fix ref leak when switching zones

  Previous releases - always broken:

   - netfilter: egress: report interface as outgoing

   - vsock/virtio: enable VQs early on probe and finish the setup before
     using them

  Misc:

   - memcg: enable accounting for nft objects"

* tag 'net-5.18-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (39 commits)
  Revert "selftests: net: Add tls config dependency for tls selftests"
  net/smc: Send out the remaining data in sndbuf before close
  net: move net_unlink_todo() out of the header
  net: dsa: bcm_sf2_cfp: fix an incorrect NULL check on list iterator
  net: bnxt_ptp: fix compilation error
  selftests: net: Add tls config dependency for tls selftests
  memcg: enable accounting for nft objects
  net/sched: act_ct: fix ref leak when switching zones
  net/smc: fix a memory leak in smc_sysctl_net_exit()
  selftests: tls: skip cmsg_to_pipe tests with TLS=n
  octeontx2-af: initialize action variable
  net: sparx5: switchdev: fix possible NULL pointer dereference
  net/x25: Fix null-ptr-deref caused by x25_disconnect
  qlcnic: dcb: default to returning -EOPNOTSUPP
  net: sparx5: depends on PTP_1588_CLOCK_OPTIONAL
  net: hns3: fix phy can not link up when autoneg off and reset
  net: hns3: add NULL pointer check for hns3_set/get_ringparam()
  net: hns3: add netdev reset check for hns3_set_tunable()
  net: hns3: clean residual vf config after disable sriov
  net: hns3: add max order judgement for tx spare buffer
  ...
2022-03-28 17:02:04 -07:00
Wen Gu 906b3d6491 net/smc: Send out the remaining data in sndbuf before close
The current autocork algorithms will delay the data transmission
in BH context to smc_release_cb() when sock_lock is hold by user.

So there is a possibility that when connection is being actively
closed (sock_lock is hold by user now), some corked data still
remains in sndbuf, waiting to be sent by smc_release_cb(). This
will cause:

- smc_close_stream_wait(), which is called under the sock_lock,
  has a high probability of timeout because data transmission is
  delayed until sock_lock is released.

- Unexpected data sends may happen after connction closed and use
  the rtoken which has been deleted by remote peer through
  LLC_DELETE_RKEY messages.

So this patch will try to send out the remaining corked data in
sndbuf before active close process, to ensure data integrity and
avoid unexpected data transmission after close.

Reported-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Fixes: 6b88af839d ("net/smc: don't send in the BH context if sock_owned_by_user")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/1648447836-111521-1-git-send-email-guwen@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-28 16:06:27 -07:00
Johannes Berg f32404ae1b net: move net_unlink_todo() out of the header
There's no reason for this to be in netdevice.h, it's all
just used in dev.c. Also make it no longer inline and let
the compiler decide to do that by itself.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20220325225023.f49b9056fe1c.I6b901a2df00000837a9bd251a8dd259bd23f5ded@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-28 14:40:08 -07:00
Linus Torvalds a701f370b5 xen: branch for v5.18-rc1
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYkF9UwAKCRCAXGG7T9hj
 vsXpAPwKXI4WIQcvnVCdULQfuXpA1TbD5XZuS9OuiN/OxWHbzAEA1VHWTmS+tpZ1
 ptOyoGhAWhTGeplToobDSGz5qTXEPAI=
 =FaKX
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from Juergen Gross:

 - A bunch of minor cleanups

 - A fix for kexec in Xen dom0 when executed on a high cpu number

 - A fix for resuming after suspend of a Xen guest with assigned PCI
   devices

 - A fix for a crash due to not disabled preemption when resuming as Xen
   dom0

* tag 'for-linus-5.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: fix is_xen_pmu()
  xen: don't hang when resuming PCI device
  arch:x86:xen: Remove unnecessary assignment in xen_apic_read()
  xen/grant-table: remove readonly parameter from functions
  xen/grant-table: remove gnttab_*transfer*() functions
  drivers/xen: use helper macro __ATTR_RW
  x86/xen: Fix kerneldoc warning
  xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32
  xen: use time_is_before_eq_jiffies() instead of open coding it
2022-03-28 14:32:39 -07:00
David S. Miller 2aa2f88c97 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Incorrect output device in nf_egress hook, from Phill Sutter.

2) Preserve liberal flag in TCP conntrack state, reported by Sven Auhagen.

3) Use GFP_KERNEL_ACCOUNT flag for nf_tables objects, from Vasily Averin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-28 08:57:10 -07:00
Vasily Averin 33758c8914 memcg: enable accounting for nft objects
nftables replaces iptables, but it lacks memcg accounting.

This patch account most of the memory allocation associated with nft
and should protect the host from misusing nft inside a memcg restricted
container.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-28 10:11:23 +02:00
Marcelo Ricardo Leitner bcb74e132a net/sched: act_ct: fix ref leak when switching zones
When switching zones or network namespaces without doing a ct clear in
between, it is now leaking a reference to the old ct entry. That's
because tcf_ct_skb_nfct_cached() returns false and
tcf_ct_flow_table_lookup() may simply overwrite it.

The fix is to, as the ct entry is not reusable, free it already at
tcf_ct_skb_nfct_cached().

Reported-by: Florian Westphal <fw@strlen.de>
Fixes: 2f131de361 ("net/sched: act_ct: Fix flow table lookup after ct clear or switching zones")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-26 17:00:51 -07:00
Eric Dumazet 5ae6acf1d0 net/smc: fix a memory leak in smc_sysctl_net_exit()
Recently added smc_sysctl_net_exit() forgot to free
the memory allocated from smc_sysctl_net_init()
for non initial network namespace.

Fixes: 462791bbfa ("net/smc: add sysctl interface for SMC")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Tony Lu <tonylu@linux.alibaba.com>
Cc: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-26 14:17:01 -07:00