Commit graph

67075 commits

Author SHA1 Message Date
Jakub Kicinski c52ef04d59 devlink: make all symbols GPL-only
devlink_alloc() and devlink_register() are both GPL.
A non-GPL module won't get far, so for consistency
we can make all symbols GPL without risking any real
life breakage.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29 13:46:26 +01:00
Ivan Vecera 829e050eea net: bridge: fix uninitialized variables when BRIDGE_CFM is disabled
Function br_get_link_af_size_filtered() calls br_cfm_{,peer}_mep_count()
that return a count. When BRIDGE_CFM is not enabled these functions
simply return -EOPNOTSUPP but do not modify count parameter and
calling function then works with uninitialized variables.
Modify these inline functions to return zero in count parameter.

Fixes: b6d0425b81 ("bridge: cfm: Netlink Notifications.")
Cc: Henrik Bjoernlund <henrik.bjoernlund@microchip.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29 13:31:32 +01:00
Jeremy Kerr 67737c4572 mctp: Pass flow data & flow release events to drivers
Now that we have an extension for MCTP data in skbs, populate the flow
when a key has been created for the packet, and add a device driver
operation to inform of flow destruction.

Includes a fix for a warning with test builds:
Reported-by: kernel test robot <lkp@intel.com>

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29 13:23:51 +01:00
Jeremy Kerr 78476d315e mctp: Add flow extension to skb
This change adds a new skb extension for MCTP, to represent a
request/response flow.

The intention is to use this in a later change to allow i2c controllers
to correctly configure a multiplexer over a flow.

Since we have a cleanup function in the core path (if an extension is
present), we'll need to make CONFIG_MCTP a bool, rather than a tristate.

Includes a fix for a build warning with clang:
Reported-by: kernel test robot <lkp@intel.com>

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29 13:23:51 +01:00
Jeremy Kerr 212c10c3c6 mctp: Return new key from mctp_alloc_local_tag
In a future change, we will want the key available for future use after
allocating a new tag. This change returns the key from
mctp_alloc_local_tag, rather than just key->tag.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29 13:23:50 +01:00
Xin Long 75cf662c64 sctp: return true only for pathmtu update in sctp_transport_pl_toobig
sctp_transport_pl_toobig() supposes to return true only if there's
pathmtu update, so that in sctp_icmp_frag_needed() it would call
sctp_assoc_sync_pmtu() and sctp_retransmit(). This patch is to fix
these return places in sctp_transport_pl_toobig().

Fixes: 8369640831 ("sctp: do state transition when receiving an icmp TOOBIG packet")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29 12:21:23 +01:00
Xin Long 40171248bb sctp: allow IP fragmentation when PLPMTUD enters Error state
Currently when PLPMTUD enters Error state, transport pathmtu will be set
to MIN_PLPMTU(512) while probe is continuing with BASE_PLPMTU(1200). It
will cause pathmtu to stay in a very small value, even if the real pmtu
is some value like 1000.

RFC8899 doesn't clearly say how to set the value in Error state. But one
possibility could be keep using BASE_PLPMTU for the real pmtu, but allow
to do IP fragmentation when it's in Error state.

As it says in rfc8899#section-5.4:

   Some paths could be unable to sustain packets of the BASE_PLPMTU
   size.  The Error State could be implemented to provide robustness to
   such paths.  This allows fallback to a smaller than desired PLPMTU
   rather than suffer connectivity failure.  This could utilize methods
   such as endpoint IP fragmentation to enable the PL sender to
   communicate using packets smaller than the BASE_PLPMTU.

This patch is to set pmtu to BASE_PLPMTU instead of MIN_PLPMTU for Error
state in sctp_transport_pl_send/toobig(), and set packet ipfragok for
non-probe packets when it's in Error state.

Fixes: 1dc68c1945 ("sctp: do state transition when PROBE_COUNT == MAX_PROBES on HB send path")
Reported-by: Ying Xu <yinxu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29 12:21:23 +01:00
Vladimir Oltean 326b212e9c net: bridge: switchdev: consistent function naming
Rename all recently imported functions in br_switchdev.c to start with a
br_switchdev_* prefix.

br_fdb_replay_one() -> br_switchdev_fdb_replay_one()
br_fdb_replay() -> br_switchdev_fdb_replay()
br_vlan_replay_one() -> br_switchdev_vlan_replay_one()
br_vlan_replay() -> br_switchdev_vlan_replay()
struct br_mdb_complete_info -> struct br_switchdev_mdb_complete_info
br_mdb_complete() -> br_switchdev_mdb_complete()
br_mdb_switchdev_host_port() -> br_switchdev_host_mdb_one()
br_mdb_switchdev_host() -> br_switchdev_host_mdb()
br_mdb_replay_one() -> br_switchdev_mdb_replay_one()
br_mdb_replay() -> br_switchdev_mdb_replay()
br_mdb_queue_one() -> br_switchdev_mdb_queue_one()

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28 20:05:57 -07:00
Vladimir Oltean 9776457c78 net: bridge: mdb: move all switchdev logic to br_switchdev.c
The following functions:

br_mdb_complete
br_switchdev_mdb_populate
br_mdb_replay_one
br_mdb_queue_one
br_mdb_replay
br_mdb_switchdev_host_port
br_mdb_switchdev_host
br_switchdev_mdb_notify

are only accessible from code paths where CONFIG_NET_SWITCHDEV is
enabled. So move them to br_switchdev.c, in order for that code to be
compiled out if that config option is disabled.

Note that br_switchdev.c gets build regardless of whether
CONFIG_BRIDGE_IGMP_SNOOPING is enabled or not, whereas br_mdb.c only got
built when CONFIG_BRIDGE_IGMP_SNOOPING was enabled. So to preserve
correct compilation with CONFIG_BRIDGE_IGMP_SNOOPING being disabled, we
must now place an #ifdef around these functions in br_switchdev.c.
The offending bridge data structures that need this are
br->multicast_lock and br->mdb_list, these are also compiled out of
struct net_bridge when CONFIG_BRIDGE_IGMP_SNOOPING is turned off.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28 20:05:57 -07:00
Vladimir Oltean 9ae9ff994b net: bridge: split out the switchdev portion of br_mdb_notify
Similar to fdb_notify() and br_switchdev_fdb_notify(), split the
switchdev specific logic from br_mdb_notify() into a different function.
This will be moved later in br_switchdev.c.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28 20:05:57 -07:00
Vladimir Oltean 4a6849e461 net: bridge: move br_vlan_replay to br_switchdev.c
br_vlan_replay() is relevant only if CONFIG_NET_SWITCHDEV is enabled, so
move it to br_switchdev.c.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28 20:05:57 -07:00
Vladimir Oltean c5f6e5ebc2 net: bridge: provide shim definition for br_vlan_flags
br_vlan_replay() needs this, and we're preparing to move it to
br_switchdev.c, which will be compiled regardless of whether or not
CONFIG_BRIDGE_VLAN_FILTERING is enabled.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28 20:05:57 -07:00
Jakub Kicinski 7df621a3ee Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/net/sock.h
  7b50ecfcc6 ("net: Rename ->stream_memory_read to ->sock_is_readable")
  4c1e34c0db ("vsock: Enable y2038 safe timeval for timeout")

drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c
  0daa55d033 ("octeontx2-af: cn10k: debugfs for dumping LMTST map table")
  e77bcdd1f6 ("octeontx2-af: Display all enabled PF VF rsrc_alloc entries.")

Adjacent code addition in both cases, keep both.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28 10:43:58 -07:00
Davide Caratti f7cc8890f3 mptcp: fix corrupt receiver key in MPC + data + checksum
using packetdrill it's possible to observe that the receiver key contains
random values when clients transmit MP_CAPABLE with data and checksum (as
specified in RFC8684 §3.1). Fix the layout of mptcp_out_options, to avoid
using the skb extension copy when writing the MP_CAPABLE sub-option.

Fixes: d7b2690837 ("mptcp: shrink mptcp_out_options struct")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/233
Reported-by: Poorva Sonparote <psonparo@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Link: https://lore.kernel.org/r/20211027203855.264600-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28 08:19:06 -07:00
Leon Romanovsky ee775b5695 devlink: Simplify internal devlink params implementation
Reduce extra indirection from devlink_params_*() API. Such change
makes it clear that we can drop devlink->lock from these flows, because
everything is executed when the devlink is not registered yet.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 14:55:17 +01:00
Daniel Jordan 1d9d6fd21a net/tls: Fix flipped sign in async_wait.err assignment
sk->sk_err contains a positive number, yet async_wait.err wants the
opposite.  Fix the missed sign flip, which Jakub caught by inspection.

Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 14:41:20 +01:00
Daniel Jordan da353fac65 net/tls: Fix flipped sign in tls_err_abort() calls
sk->sk_err appears to expect a positive value, a convention that ktls
doesn't always follow and that leads to memory corruption in other code.
For instance,

    [kworker]
    tls_encrypt_done(..., err=<negative error from crypto request>)
      tls_err_abort(.., err)
        sk->sk_err = err;

    [task]
    splice_from_pipe_feed
      ...
        tls_sw_do_sendpage
          if (sk->sk_err) {
            ret = -sk->sk_err;  // ret is positive

    splice_from_pipe_feed (continued)
      ret = actor(...)  // ret is still positive and interpreted as bytes
                        // written, resulting in underflow of buf->len and
                        // sd->len, leading to huge buf->offset and bogus
                        // addresses computed in later calls to actor()

Fix all tls_err_abort() callers to pass a negative error code
consistently and centralize the error-prone sign flip there, throwing in
a warning to catch future misuse and uninlining the function so it
really does only warn once.

Cc: stable@vger.kernel.org
Fixes: c46234ebb4 ("tls: RX path for ktls")
Reported-by: syzbot+b187b77c8474f9648fae@syzkaller.appspotmail.com
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 14:41:20 +01:00
Maxime Chevallier ee046d9a22 net: ipconfig: Release the rtnl_lock while waiting for carrier
While waiting for a carrier to come on one of the netdevices, some
devices will require to take the rtnl lock at some point to fully
initialize all parts of the link.

That's the case for SFP, where the rtnl is taken when a module gets
detected. This prevents mounting an NFS rootfs over an SFP link.

This means that while ipconfig waits for carriers to be detected, no SFP
modules can be detected in the meantime, it's only detected after
ipconfig times out.

This commit releases the rtnl_lock while waiting for the carrier to come
up, and re-takes it to check the for the init device and carrier status.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 14:36:41 +01:00
Maxim Mikityanskiy 648a991cf3 sch_htb: Add extack messages for EOPNOTSUPP errors
In order to make the "Operation not supported" message clearer to the
user, add extack messages explaining why exactly adding offloaded HTB
could be not supported in each case.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 14:34:03 +01:00
Wen Gu f3a3a0fe0b net/smc: Correct spelling mistake to TCPF_SYN_RECV
There should use TCPF_SYN_RECV instead of TCP_SYN_RECV.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 13:04:28 +01:00
Tony Lu c4a146c7cf net/smc: Fix smc_link->llc_testlink_time overflow
The value of llc_testlink_time is set to the value stored in
net->ipv4.sysctl_tcp_keepalive_time when linkgroup init. The value of
sysctl_tcp_keepalive_time is already jiffies, so we don't need to
multiply by HZ, which would cause smc_link->llc_testlink_time overflow,
and test_link send flood.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 13:04:28 +01:00
Alexander Kuznetsov 06e6c88fba ipv6: enable net.ipv6.route.max_size sysctl in network namespace
We want to increase route cache size in network namespace
created with user namespace. Currently ipv6 route settings
are disabled for non-initial network namespaces.
We can allow this sysctl and it will be safe since
commit <6126891c6d4f> because route cache account to kmem,
that is why users from user namespace can not DOS system.

Signed-off-by: Alexander Kuznetsov <wwfq@yandex-team.ru>
Acked-by: Dmitry Yakunin <zeil@yandex-team.ru>
Acked-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 12:53:39 +01:00
Eric Dumazet 8b7d8c2bdb tcp: do not clear TCP_SKB_CB(skb)->sacked if already zero
Freshly allocated skbs have zero in skb->cb[] already.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 12:44:39 +01:00
Eric Dumazet 4f2266748e tcp: do not clear skb->csum if already zero
Freshly allocated skbs have their csum field cleared already.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 12:44:39 +01:00
Eric Dumazet a52fe46ef1 tcp: factorize ip_summed setting
Setting skb->ip_summed to CHECKSUM_PARTIAL can be centralized
in tcp_stream_alloc_skb() and __mptcp_do_alloc_tx_skb()
instead of being done multiple times.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 12:44:39 +01:00
Eric Dumazet f401da475f tcp: no longer set skb->reserved_tailroom
TCP/MPTCP sendmsg() no longer puts payload in skb->head,
we can remove not needed code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 12:44:38 +01:00
Eric Dumazet bd44631471 tcp: remove dead code from tcp_collapse_retrans()
TCP sendmsg() no longer puts payload in skb->head,
remove some dead code from tcp_collapse_retrans().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 12:44:38 +01:00
Eric Dumazet 27728ba80f tcp: cleanup tcp_remove_empty_skb() use
All tcp_remove_empty_skb() callers now use tcp_write_queue_tail()
for the skb argument, we can therefore factorize code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 12:44:38 +01:00
Eric Dumazet 3ded97bc41 tcp: remove dead code from tcp_sendmsg_locked()
TCP sendmsg() no longer puts payload in skb head, we can remove
dead code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 12:44:38 +01:00
luo penghao ad57dae8a6 xfrm: Remove redundant fields and related parentheses
The variable err is not necessary in such places. It should be revmoved
for the simplicity of the code. This will cause the double parentheses
to be redundant, and the inner parentheses should be deleted.

The clang_analyzer complains as follows:

net/xfrm/xfrm_input.c:533: warning:
net/xfrm/xfrm_input.c:563: warning:

Although the value stored to 'err' is used in the enclosing expression,
the value is never actually read from 'err'.

Changes in v2:

Modify the title, because v2 removes the brackets.
Remove extra parentheses.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-10-28 07:56:02 +02:00
Geliang Tang b8e0def397 mptcp: drop unused sk in mptcp_push_release
Since mptcp_set_timeout() had removed from mptcp_push_release() in
commit 33d41c9cd7 ("mptcp: more accurate timeout"), the argument
sk in mptcp_push_release() became useless. Let's drop it.

Fixes: 33d41c9cd7 ("mptcp: more accurate timeout")
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27 18:20:30 -07:00
Paolo Abeni 6511882cdd mptcp: allocate fwd memory separately on the rx and tx path
All the mptcp receive path is protected by the msk socket
spinlock. As consequences, the tx path has to play a few tricks to
allocate the forward memory without acquiring the spinlock multiple
times, making the overall TX path quite complex.

This patch tries to clean-up a bit the tx path, using completely
separated fwd memory allocation, for the rx and the tx path.

The forward memory allocated in the rx path is now accounted in
msk->rmem_fwd_alloc and is (still) protected by the msk socket spinlock.

To cope with the above we provide a few MPTCP-specific variants for
the helpers to charge, uncharge, reclaim and free the forward memory
in the receive path.

msk->sk_forward_alloc now accounts only the forward memory for the tx
path, we can use the plain core sock helper to manipulate it and drop
quite a bit of complexity.

On memory pressure, both rx and tx fwd memories are reclaimed.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27 18:20:29 -07:00
Paolo Abeni 292e6077b0 net: introduce sk_forward_alloc_get()
A later patch will change the MPTCP memory accounting schema
in such a way that MPTCP sockets will encode the total amount of
forward allocated memory in two separate fields (one for tx and
one for rx).

MPTCP sockets will use their own helper to provide the accurate
amount of fwd allocated memory.

To allow the above, this patch adds a new, optional, sk method to
fetch the fwd memory, wrap the call in a new helper and use it
where it is appropriate.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27 18:20:29 -07:00
Eric Dumazet 9dfc685e02 inet: remove races in inet{6}_getname()
syzbot reported data-races in inet_getname() multiple times,
it is time we fix this instead of pretending applications
should not trigger them.

getsockname() and getpeername() are not really considered fast path.

v2: added the missing BPF_CGROUP_RUN_SA_PROG() declaration
    needed when CONFIG_CGROUP_BPF=n, as reported by
    kernel test robot <lkp@intel.com>

syzbot typical report:

BUG: KCSAN: data-race in __inet_hash_connect / inet_getname

write to 0xffff888136d66cf8 of 2 bytes by task 14374 on cpu 1:
 __inet_hash_connect+0x7ec/0x950 net/ipv4/inet_hashtables.c:831
 inet_hash_connect+0x85/0x90 net/ipv4/inet_hashtables.c:853
 tcp_v4_connect+0x782/0xbb0 net/ipv4/tcp_ipv4.c:275
 __inet_stream_connect+0x156/0x6e0 net/ipv4/af_inet.c:664
 inet_stream_connect+0x44/0x70 net/ipv4/af_inet.c:728
 __sys_connect_file net/socket.c:1896 [inline]
 __sys_connect+0x254/0x290 net/socket.c:1913
 __do_sys_connect net/socket.c:1923 [inline]
 __se_sys_connect net/socket.c:1920 [inline]
 __x64_sys_connect+0x3d/0x50 net/socket.c:1920
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff888136d66cf8 of 2 bytes by task 14408 on cpu 0:
 inet_getname+0x11f/0x170 net/ipv4/af_inet.c:790
 __sys_getsockname+0x11d/0x1b0 net/socket.c:1946
 __do_sys_getsockname net/socket.c:1961 [inline]
 __se_sys_getsockname net/socket.c:1958 [inline]
 __x64_sys_getsockname+0x3e/0x50 net/socket.c:1958
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

value changed: 0x0000 -> 0xdee0

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 14408 Comm: syz-executor.3 Not tainted 5.15.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20211026213014.3026708-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27 18:20:21 -07:00
Yajun Deng b859a360d8 xdp: Remove redundant warning
There is a warning in xdp_rxq_info_unreg_mem_model() when reg_state isn't
equal to REG_STATE_REGISTERED, so the warning in xdp_rxq_info_unreg() is
redundant.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/r/20211027013856.1866-1-yajun.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27 18:13:57 -07:00
Seth Forshee 85c0c3eb9a net: sch: simplify condtion for selecting mini_Qdisc_pair buffer
The only valid values for a miniq pointer are NULL or a pointer to
miniq1 or miniq2, so testing for miniq_old != &miniq1 is functionally
equivalent to testing that it is NULL or equal to &miniq2.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Seth Forshee <sforshee@digitalocean.com>
Link: https://lore.kernel.org/r/20211026183721.137930-1-seth@forshee.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27 17:09:26 -07:00
Seth Forshee 267463823a net: sch: eliminate unnecessary RCU waits in mini_qdisc_pair_swap()
Currently rcu_barrier() is used to ensure that no readers of the
inactive mini_Qdisc buffer remain before it is reused. This waits for
any pending RCU callbacks to complete, when all that is actually
required is to wait for one RCU grace period to elapse after the buffer
was made inactive. This means that using rcu_barrier() may result in
unnecessary waits.

To improve this, store the current RCU state when a buffer is made
inactive and use poll_state_synchronize_rcu() to check whether a full
grace period has elapsed before reusing it. If a full grace period has
not elapsed, wait for a grace period to elapse, and in the non-RT case
use synchronize_rcu_expedited() to hasten it.

Since this approach eliminates the RCU callback it is no longer
necessary to synchronize_rcu() in the tp_head==NULL case. However, the
RCU state should still be saved for the previously active buffer.

Before this change I would typically see mini_qdisc_pair_swap() take
tens of milliseconds to complete. After this change it typcially
finishes in less than 1 ms, and often it takes just a few microseconds.

Thanks to Paul for walking me through the options for improving this.

Cc: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Seth Forshee <sforshee@digitalocean.com>
Link: https://lore.kernel.org/r/20211026130700.121189-1-seth@forshee.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27 17:09:26 -07:00
Arnd Bergmann f25c0515c5 net: sched: gred: dynamically allocate tc_gred_qopt_offload
The tc_gred_qopt_offload structure has grown too big to be on the
stack for 32-bit architectures after recent changes.

net/sched/sch_gred.c:903:13: error: stack frame size (1180) exceeds limit (1024) in 'gred_destroy' [-Werror,-Wframe-larger-than]
net/sched/sch_gred.c:310:13: error: stack frame size (1212) exceeds limit (1024) in 'gred_offload' [-Werror,-Wframe-larger-than]

Use dynamic allocation per qdisc to avoid this.

Fixes: 50dc9a8572 ("net: sched: Merge Qdisc::bstats and Qdisc::cpu_bstats data types")
Fixes: 67c9e6270f ("net: sched: Protect Qdisc::bstats with u64_stats")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20211026100711.nalhttf6mbe6sudx@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27 12:06:52 -07:00
Leon Romanovsky c5e0321e43 Revert "devlink: Remove not-executed trap policer notifications"
This reverts commit 22849b5ea5 as it
revealed that mlxsw and netdevsim (copy/paste from mlxsw) reregisters
devlink objects during another devlink user triggered command.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27 11:58:07 -07:00
Leon Romanovsky fb9d19c2d8 Revert "devlink: Remove not-executed trap group notifications"
This reverts commit 8bbeed4858 as it
revealed that mlxsw and netdevsim (copy/paste from mlxsw) reregisters
devlink objects during another devlink user triggered command.

Fixes: 22849b5ea5 ("devlink: Remove not-executed trap policer notifications")
Reported-by: syzbot+93d5accfaefceedf43c1@syzkaller.appspotmail.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27 11:58:07 -07:00
Jakub Kicinski afe8ca110c Two fixes:
* bridge vs. 4-addr mode check was wrong
  * management frame registrations locking was
    wrong, causing list corruption/crashes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmF5Y8AACgkQB8qZga/f
 l8SEGQ/+NJOZHSSQWcgfBpXzZ2ZN1Mk1Sgi/ZLSE/FbfbN3y/WQWgWR06RZ9djtE
 6NzrIqMX7kuynRcTTFEYUT4ezl2XTb8CneLXqxHYKR8PXyRhyzo76bcxSCekp7Mr
 4N4Rfaj8VAJ7Q+/skiWAevnFyhGBD9uG4KV78ytsBiCURQBW9MCHm+UObBImVxcH
 0azqLZ+99++rXtraQhDSZZneY+/EKUIdpqM5iNVvx6X/Q+n8+Fb/0XcYmGE46TWH
 ZuHj+Vp8jJyNS8fsykL2O77sdiw5mnEVvYGumh8FfsP4fxndH8XSjRyMRfv1f4SU
 sKvsz2wcyFmIKisdZ9EA7uqnDXEDjodTO9OUCgg2VhI3TFJQ7A0ycU0yan1EnhPZ
 kVt/NkNyZgFFmedhN+XYaVXXda17qXHeeh6yWC1shO55H2e80Fj8//ETps1Jau1c
 u+xmf5DIKBnOsG+GnyYUJ3V9wLRedB0HI4KOIYjt2Nu/Vwp39Y/yK0HQ06uxLiGV
 QSOHSCjIhgVFmeez967UjIwDZesUxhVWA6M263qqkxWC75WmcDswuqkcWqgAcTat
 GEIfQi4W9FtBatLeny+uKcai++25cVJc+RZxQdKSDSfyocLPtvjZRpypBIYtMYgN
 PPp8xqMWFOd40TzK+R7+mu6RDop4UnlwAHAPwIbBGUX7n5IW410=
 =iErV
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2021-10-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Two fixes:
 * bridge vs. 4-addr mode check was wrong
 * management frame registrations locking was
   wrong, causing list corruption/crashes
====================

Link: https://lore.kernel.org/r/20211027143756.91711-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-27 08:13:15 -07:00
Vladimir Oltean 716a30a97a net: switchdev: merge switchdev_handle_fdb_{add,del}_to_device
To reduce code churn, the same patch makes multiple changes, since they
all touch the same lines:

1. The implementations for these two are identical, just with different
   function pointers. Reduce duplications and name the function pointers
   "mod_cb" instead of "add_cb" and "del_cb". Pass the event as argument.

2. Drop the "const" attribute from "orig_dev". If the driver needs to
   check whether orig_dev belongs to itself and then
   call_switchdev_notifiers(orig_dev, SWITCHDEV_FDB_OFFLOADED), it
   can't, because call_switchdev_notifiers takes a non-const struct
   net_device *.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-27 14:54:02 +01:00
Vladimir Oltean fab9eca884 net: bridge: create a common function for populating switchdev FDB entries
There are two places where a switchdev FDB entry is constructed, one is
br_switchdev_fdb_notify() and the other is br_fdb_replay(). One uses a
struct initializer, and the other declares the structure as
uninitialized and populates the elements one by one.

One problem when introducing new members of struct
switchdev_notifier_fdb_info is that there is a risk for one of these
functions to run with an uninitialized value.

So centralize the logic of populating such structure into a dedicated
function. Being the primary location where these structures are created,
using an uninitialized variable and populating the members one by one
should be fine, since this one function is supposed to assign values to
all its members.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-27 14:54:02 +01:00
Vladimir Oltean 5cda5272a4 net: bridge: move br_fdb_replay inside br_switchdev.c
br_fdb_replay is only called from switchdev code paths, so it makes
sense to be disabled if switchdev is not enabled in the first place.

As opposed to br_mdb_replay and br_vlan_replay which might be turned off
depending on bridge support for multicast and VLANs, FDB support is
always on. So moving br_mdb_replay and br_vlan_replay inside
br_switchdev.c would mean adding some #ifdef's in br_switchdev.c, so we
keep those where they are.

The reason for the movement is that in future changes there will be some
code reuse between br_switchdev_fdb_notify and br_fdb_replay.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-27 14:54:02 +01:00
Vladimir Oltean 9574fb5580 net: bridge: reduce indentation level in fdb_create
We can express the same logic without an "if" condition as big as the
function, just return early if the kmem_cache_alloc() call fails.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-27 14:54:02 +01:00
Vladimir Oltean f6814fdcfe net: bridge: rename br_fdb_insert to br_fdb_add_local
br_fdb_insert() is a wrapper over fdb_insert() that also takes the
bridge hash_lock.

With fdb_insert() being renamed to fdb_add_local(), rename
br_fdb_insert() to br_fdb_add_local().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-27 14:54:02 +01:00
Vladimir Oltean 4731b6d6b2 net: bridge: rename fdb_insert to fdb_add_local
fdb_insert() is not a descriptive name for this function, and also easy
to confuse with __br_fdb_add(), fdb_add_entry(), br_fdb_update().
Even more confusingly, it is not even related in any way with those
functions, neither one calls the other.

Since fdb_insert() basically deals with the creation of a BR_FDB_LOCAL
entry and is called only from functions where that is the intention:

- br_fdb_changeaddr
- br_fdb_change_mac_address
- br_fdb_insert

then rename it to fdb_add_local(), because its removal counterpart is
called fdb_delete_local().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-27 14:54:02 +01:00
Vladimir Oltean 5f94a5e276 net: bridge: remove fdb_insert forward declaration
fdb_insert() has a forward declaration because its first caller,
br_fdb_changeaddr(), is declared before fdb_create(), a function which
fdb_insert() needs.

This patch moves the 2 functions above br_fdb_changeaddr() and deletes
the forward declaration for fdb_insert().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-27 14:54:02 +01:00
Vladimir Oltean 4682048af0 net: bridge: remove fdb_notify forward declaration
fdb_notify() has a forward declaration because its first caller,
fdb_delete(), is declared before 3 functions that fdb_notify() needs:
fdb_to_nud(), fdb_fill_info() and fdb_nlmsg_size().

This patch moves the aforementioned 4 functions above fdb_delete() and
deletes the forward declaration.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-27 14:54:02 +01:00
Ben Ben-ishay 54b2b3ecca net: Prevent HW-GRO and LRO features operate together
LRO and HW-GRO are mutually exclusive, this commit adds this restriction
in netdev_fix_feature. HW-GRO is preferred, that means in case both
HW-GRO and LRO features are requested, LRO is cleared.

Signed-off-by: Ben Ben-ishay <benishay@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-26 19:30:38 -07:00
Jakub Kicinski 440ffcdd9d Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-10-26

We've added 12 non-merge commits during the last 7 day(s) which contain
a total of 23 files changed, 118 insertions(+), 98 deletions(-).

The main changes are:

1) Fix potential race window in BPF tail call compatibility check, from Toke Høiland-Jørgensen.

2) Fix memory leak in cgroup fs due to missing cgroup_bpf_offline(), from Quanyang Wang.

3) Fix file descriptor reference counting in generic_map_update_batch(), from Xu Kuohai.

4) Fix bpf_jit_limit knob to the max supported limit by the arch's JIT, from Lorenz Bauer.

5) Fix BPF sockmap ->poll callbacks for UDP and AF_UNIX sockets, from Cong Wang and Yucong Sun.

6) Fix BPF sockmap concurrency issue in TCP on non-blocking sendmsg calls, from Liu Jian.

7) Fix build failure of INODE_STORAGE and TASK_STORAGE maps on !CONFIG_NET, from Tejun Heo.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Fix potential race in tail call compatibility check
  bpf: Move BPF_MAP_TYPE for INODE_STORAGE and TASK_STORAGE outside of CONFIG_NET
  selftests/bpf: Use recv_timeout() instead of retries
  net: Implement ->sock_is_readable() for UDP and AF_UNIX
  skmsg: Extract and reuse sk_msg_is_readable()
  net: Rename ->stream_memory_read to ->sock_is_readable
  tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function
  cgroup: Fix memory leak caused by missing cgroup_bpf_offline
  bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch()
  bpf: Prevent increasing bpf_jit_limit above max
  bpf: Define bpf_jit_alloc_exec_limit for arm64 JIT
  bpf: Define bpf_jit_alloc_exec_limit for riscv JIT
====================

Link: https://lore.kernel.org/r/20211026201920.11296-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-26 14:38:55 -07:00
Cong Wang af49338895 net: Implement ->sock_is_readable() for UDP and AF_UNIX
Yucong noticed we can't poll() sockets in sockmap even
when they are the destination sockets of redirections.
This is because we never poll any psock queues in ->poll(),
except for TCP. With ->sock_is_readable() now we can
overwrite >sock_is_readable(), invoke and implement it for
both UDP and AF_UNIX sockets.

Reported-by: Yucong Sun <sunyucong@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211008203306.37525-4-xiyou.wangcong@gmail.com
2021-10-26 12:29:33 -07:00
Cong Wang fb4e0a5e73 skmsg: Extract and reuse sk_msg_is_readable()
tcp_bpf_sock_is_readable() is pretty much generic,
we can extract it and reuse it for non-TCP sockets.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211008203306.37525-3-xiyou.wangcong@gmail.com
2021-10-26 12:29:33 -07:00
Cong Wang 7b50ecfcc6 net: Rename ->stream_memory_read to ->sock_is_readable
The proto ops ->stream_memory_read() is currently only used
by TCP to check whether psock queue is empty or not. We need
to rename it before reusing it for non-TCP protocols, and
adjust the exsiting users accordingly.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211008203306.37525-2-xiyou.wangcong@gmail.com
2021-10-26 12:29:33 -07:00
Liu Jian cd9733f5d7 tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function
With two Msgs, msgA and msgB and a user doing nonblocking sendmsg calls (or
multiple cores) on a single socket 'sk' we could get the following flow.

 msgA, sk                               msgB, sk
 -----------                            ---------------
 tcp_bpf_sendmsg()
 lock(sk)
 psock = sk->psock
                                        tcp_bpf_sendmsg()
                                        lock(sk) ... blocking
tcp_bpf_send_verdict
if (psock->eval == NONE)
   psock->eval = sk_psock_msg_verdict
 ..
 < handle SK_REDIRECT case >
   release_sock(sk)                     < lock dropped so grab here >
   ret = tcp_bpf_sendmsg_redir
                                        psock = sk->psock
                                        tcp_bpf_send_verdict
 lock_sock(sk) ... blocking on B
                                        if (psock->eval == NONE) <- boom.
                                         psock->eval will have msgA state

The problem here is we dropped the lock on msgA and grabbed it with msgB.
Now we have old state in psock and importantly psock->eval has not been
cleared. So msgB will run whatever action was done on A and the verdict
program may never see it.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20211012052019.184398-1-liujian56@huawei.com
2021-10-26 12:25:55 -07:00
Vladimir Oltean 425d19cede net: dsa: stop calling dev_hold in dsa_slave_fdb_event
Now that we guarantee that SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE events have
finished executing by the time we leave our bridge upper interface,
we've established a stronger boundary condition for how long the
dsa_slave_switchdev_event_work() might run.

As such, it is no longer possible for DSA slave interfaces to become
unregistered, since they are still bridge ports.

So delete the unnecessary dev_hold() and dev_put().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26 15:07:35 +01:00
Vladimir Oltean d7d0d423db net: dsa: flush switchdev workqueue when leaving the bridge
DSA is preparing to offer switch drivers an API through which they can
associate each FDB entry with a struct net_device *bridge_dev. This can
be used to perform FDB isolation (the FDB lookup performed on the
ingress of a standalone, or bridged port, should not find an FDB entry
that is present in the FDB of another bridge).

In preparation of that work, DSA needs to ensure that by the time we
call the switch .port_fdb_add and .port_fdb_del methods, the
dp->bridge_dev pointer is still valid, i.e. the port is still a bridge
port.

This is not guaranteed because the SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE API
requires drivers that must have sleepable context to handle those events
to schedule the deferred work themselves. DSA does this through the
dsa_owq.

It can happen that a port leaves a bridge, del_nbp() flushes the FDB on
that port, SWITCHDEV_FDB_DEL_TO_DEVICE is notified in atomic context,
DSA schedules its deferred work, but del_nbp() finishes unlinking the
bridge as a master from the port before DSA's deferred work is run.

Fundamentally, the port must not be unlinked from the bridge until all
FDB deletion deferred work items have been flushed. The bridge must wait
for the completion of these hardware accesses.

An attempt has been made to address this issue centrally in switchdev by
making SWITCHDEV_FDB_DEL_TO_DEVICE deferred (=> blocking) at the switchdev
level, which would offer implicit synchronization with del_nbp:

https://patchwork.kernel.org/project/netdevbpf/cover/20210820115746.3701811-1-vladimir.oltean@nxp.com/

but it seems that any attempt to modify switchdev's behavior and make
the events blocking there would introduce undesirable side effects in
other switchdev consumers.

The most undesirable behavior seems to be that
switchdev_deferred_process_work() takes the rtnl_mutex itself, which
would be worse off than having the rtnl_mutex taken individually from
drivers which is what we have now (except DSA which has removed that
lock since commit 0faf890fc5 ("net: dsa: drop rtnl_lock from
dsa_slave_switchdev_event_work")).

So to offer the needed guarantee to DSA switch drivers, I have come up
with a compromise solution that does not require switchdev rework:
we already have a hook at the last moment in time when the bridge is
still an upper of ours: the NETDEV_PRECHANGEUPPER handler. We can flush
the dsa_owq manually from there, which makes all FDB deletions
synchronous.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26 15:07:35 +01:00
Jeremy Kerr 99ce45d5e7 mctp: Implement extended addressing
This change allows an extended address struct - struct sockaddr_mctp_ext
- to be passed to sendmsg/recvmsg. This allows userspace to specify
output ifindex and physical address information (for sendmsg) or receive
the input ifindex/physaddr for incoming messages (for recvmsg). This is
typically used by userspace for MCTP address discovery and assignment
operations.

The extended addressing facility is conditional on a new sockopt:
MCTP_OPT_ADDR_EXT; userspace must explicitly enable addressing before
the kernel will consume/populate the extended address data.

Includes a fix for an uninitialised var:
Reported-by: kernel test robot <lkp@intel.com>

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26 14:58:45 +01:00
Andreas Oetken eafaa88b3e net: hsr: Add support for redbox supervision frames
added support for the redbox supervision frames
as defined in the IEC-62439-3:2018.

Signed-off-by: Andreas Oetken <andreas.oetken@siemens-energy.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26 14:52:17 +01:00
Pavel Skripkin 6f68cd6348 net: batman-adv: fix error handling
Syzbot reported ODEBUG warning in batadv_nc_mesh_free(). The problem was
in wrong error handling in batadv_mesh_init().

Before this patch batadv_mesh_init() was calling batadv_mesh_free() in case
of any batadv_*_init() calls failure. This approach may work well, when
there is some kind of indicator, which can tell which parts of batadv are
initialized; but there isn't any.

All written above lead to cleaning up uninitialized fields. Even if we hide
ODEBUG warning by initializing bat_priv->nc.work, syzbot was able to hit
GPF in batadv_nc_purge_paths(), because hash pointer in still NULL. [1]

To fix these bugs we can unwind batadv_*_init() calls one by one.
It is good approach for 2 reasons: 1) It fixes bugs on error handling
path 2) It improves the performance, since we won't call unneeded
batadv_*_free() functions.

So, this patch makes all batadv_*_init() clean up all allocated memory
before returning with an error to no call correspoing batadv_*_free()
and open-codes batadv_mesh_free() with proper order to avoid touching
uninitialized fields.

Link: https://lore.kernel.org/netdev/000000000000c87fbd05cef6bcb0@google.com/ [1]
Reported-and-tested-by: syzbot+28b0702ada0bf7381f58@syzkaller.appspotmail.com
Fixes: c6c8fea297 ("net: Add batman-adv meshing protocol")
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Acked-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26 14:47:12 +01:00
Eric Dumazet c4322884ed tcp: remove unneeded code from tcp_stream_alloc_skb()
Aligning @size argument to 4 bytes is not needed.

The header alignment has nothing to do with @size.

It really depends on skb->head alignment and MAX_TCP_HEADER.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26 14:45:12 +01:00
Eric Dumazet 8a794df693 tcp: use MAX_TCP_HEADER in tcp_stream_alloc_skb
Both IPv4 and IPv6 uses same reserve, no need risking
cache line misses to fetch its value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26 14:45:12 +01:00
Eric Dumazet f8dd3b8d70 tcp: rename sk_stream_alloc_skb
sk_stream_alloc_skb() is only used by TCP.

Rename it to make this clear, and move its declaration
to include/net/tcp.h

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26 14:45:11 +01:00
Max VA fa40d9734a tipc: fix size validations for the MSG_CRYPTO type
The function tipc_crypto_key_rcv is used to parse MSG_CRYPTO messages
to receive keys from other nodes in the cluster in order to decrypt any
further messages from them.
This patch verifies that any supplied sizes in the message body are
valid for the received message.

Fixes: 1ef6f7c939 ("tipc: add automatic session key exchange")
Signed-off-by: Max VA <maxv@sentinelone.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26 13:43:07 +01:00
Florian Westphal 8e0538d8ee netfilter: conntrack: skip confirmation and nat hooks in postrouting for vrf
The VRF driver invokes netfilter for output+postrouting hooks so that users
can create rules that check for 'oif $vrf' rather than lower device name.

Afterwards, ip stack calls those hooks again.

This is a problem when conntrack is used with IP masquerading.
masquerading has an internal check that re-validates the output
interface to account for route changes.

This check will trigger in the vrf case.

If the -j MASQUERADE rule matched on the first iteration, then round 2
finds state->out->ifindex != nat->masq_index: the latter is the vrf
index, but out->ifindex is the lower device.

The packet gets dropped and the conntrack entry is invalidated.

This change makes conntrack postrouting skip the nat hooks.
Also skip confirmation.  This allows the second round
(postrouting invocation from ipv4/ipv6) to create nat bindings.

This also prevents the second round from seeing packets that had their
source address changed by the nat hook.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26 13:21:09 +01:00
Jon Maxwell cf12e6f912 tcp: don't free a FIN sk_buff in tcp_remove_empty_skb()
v1: Implement a more general statement as recommended by Eric Dumazet. The
sequence number will be advanced, so this check will fix the FIN case and
other cases.

A customer reported sockets stuck in the CLOSING state. A Vmcore revealed that
the write_queue was not empty as determined by tcp_write_queue_empty() but the
sk_buff containing the FIN flag had been freed and the socket was zombied in
that state. Corresponding pcaps show no FIN from the Linux kernel on the wire.

Some instrumentation was added to the kernel and it was found that there is a
timing window where tcp_sendmsg() can run after tcp_send_fin().

tcp_sendmsg() will hit an error, for example:

1269 ▹       if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))↩
1270 ▹       ▹       goto do_error;↩

tcp_remove_empty_skb() will then free the FIN sk_buff as "skb->len == 0". The
TCP socket is now wedged in the FIN-WAIT-1 state because the FIN is never sent.

If the other side sends a FIN packet the socket will transition to CLOSING and
remain that way until the system is rebooted.

Fix this by checking for the FIN flag in the sk_buff and don't free it if that
is the case. Testing confirmed that fixed the issue.

Fixes: fdfc5c8594 ("tcp: remove empty skb from write queue in error cases")
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Reported-by: Monir Zouaoui <Monir.Zouaoui@mail.schwarz>
Reported-by: Simon Stier <simon.stier@mail.schwarz>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26 13:10:04 +01:00
Cyril Strejc 9122a70a63 net: multicast: calculate csum of looped-back and forwarded packets
During a testing of an user-space application which transmits UDP
multicast datagrams and utilizes multicast routing to send the UDP
datagrams out of defined network interfaces, I've found a multicast
router does not fill-in UDP checksum into locally produced, looped-back
and forwarded UDP datagrams, if an original output NIC the datagrams
are sent to has UDP TX checksum offload enabled.

The datagrams are sent malformed out of the NIC the datagrams have been
forwarded to.

It is because:

1. If TX checksum offload is enabled on the output NIC, UDP checksum
   is not calculated by kernel and is not filled into skb data.

2. dev_loopback_xmit(), which is called solely by
   ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY
   unconditionally.

3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except
   CHECKSUM_COMPLETE"), the ip_summed value is preserved during
   forwarding.

4. If ip_summed != CHECKSUM_PARTIAL, checksum is not calculated during
   a packet egress.

The minimum fix in dev_loopback_xmit():

1. Preserves skb->ip_summed CHECKSUM_PARTIAL. This is the
   case when the original output NIC has TX checksum offload enabled.
   The effects are:

     a) If the forwarding destination interface supports TX checksum
        offloading, the NIC driver is responsible to fill-in the
        checksum.

     b) If the forwarding destination interface does NOT support TX
        checksum offloading, checksums are filled-in by kernel before
        skb is submitted to the NIC driver.

     c) For local delivery, checksum validation is skipped as in the
        case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary().

2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. It
   means, for CHECKSUM_NONE, the behavior is unmodified and is there
   to skip a looped-back packet local delivery checksum validation.

Signed-off-by: Cyril Strejc <cyril.strejc@skoda.cz>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26 13:09:22 +01:00
Eric Dumazet 12c8691de3 ipv6/tcp: small drop monitor changes
Two kfree_skb() calls must be replaced by consume_skb()
for skbs that are not technically dropped.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-25 18:02:14 -07:00
Eric Dumazet 020e71a3cf ipv4: guard IP_MINTTL with a static key
RFC 5082 IP_MINTTL option is rarely used on hosts.

Add a static key to remove from TCP fast path useless code,
and potential cache line miss to fetch inet_sk(sk)->min_ttl

Note that once ip4_min_ttl static key has been enabled,
it stays enabled until next boot.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-25 18:02:14 -07:00
Eric Dumazet 14834c4f4e ipv4: annotate data races arount inet->min_ttl
No report yet from KCSAN, yet worth documenting the races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-25 18:02:13 -07:00
Eric Dumazet 790eb67374 ipv6: guard IPV6_MINHOPCOUNT with a static key
RFC 5082 IPV6_MINHOPCOUNT is rarely used on hosts.

Add a static key to remove from TCP fast path useless code,
and potential cache line miss to fetch tcp_inet6_sk(sk)->min_hopcount

Note that once ip6_min_hopcount static key has been enabled,
it stays enabled until next boot.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-25 18:02:13 -07:00
Eric Dumazet cc17c3c8e8 ipv6: annotate data races around np->min_hopcount
No report yet from KCSAN, yet worth documenting the races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-25 18:02:13 -07:00
Eric Dumazet ef57c1610d ipv6: move inet6_sk(sk)->rx_dst_cookie to sk->sk_rx_dst_cookie
Increase cache locality by moving rx_dst_coookie next to sk->sk_rx_dst

This removes one or two cache line misses in IPv6 early demux (TCP/UDP)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-25 18:02:12 -07:00
Eric Dumazet 0c0a5ef809 tcp: move inet->rx_dst_ifindex to sk->sk_rx_dst_ifindex
Increase cache locality by moving rx_dst_ifindex next to sk->sk_rx_dst

This is part of an effort to reduce cache line misses in TCP fast path.

This removes one cache line miss in early demux.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-25 18:02:12 -07:00
Jakub Kicinski a1916d3446 bluetooth: use dev_addr_set()
Commit 406f42fa0d ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. To maintain netdev->dev_addr in this tree we need to make all
the writes to it go through appropriate helpers.

Reviewed-by: Marcel Holtmann <marcel@holtmann.org>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-25 11:01:29 -07:00
Jakub Kicinski 08c181f052 bluetooth: use eth_hw_addr_set()
Commit 406f42fa0d ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. To maintain netdev->dev_addr in this tree we need to make all
the writes to it go through appropriate helpers.

Convert bluetooth from memcpy(... ETH_ADDR) to eth_hw_addr_set():

  @@
  expression dev, np;
  @@
  - memcpy(dev->dev_addr, np, ETH_ALEN)
  + eth_hw_addr_set(dev, np)

Reviewed-by: Marcel Holtmann <marcel@holtmann.org>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-25 11:01:24 -07:00
Xin Long f7a1e76d0f net-sysfs: initialize uid and gid before calling net_ns_get_ownership
Currently in net_ns_get_ownership() it may not be able to set uid or gid
if make_kuid or make_kgid returns an invalid value, and an uninit-value
issue can be triggered by this.

This patch is to fix it by initializing the uid and gid before calling
net_ns_get_ownership(), as it does in kobject_get_ownership()

Fixes: e6dee9f389 ("net-sysfs: add netdev_change_owner()")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-25 16:17:32 +01:00
Michael Chan 0c57eeecc5 net: Prevent infinite while loop in skb_tx_hash()
Drivers call netdev_set_num_tc() and then netdev_set_tc_queue()
to set the queue count and offset for each TC.  So the queue count
and offset for the TCs may be zero for a short period after dev->num_tc
has been set.  If a TX packet is being transmitted at this time in the
code path netdev_pick_tx() -> skb_tx_hash(), skb_tx_hash() may see
nonzero dev->num_tc but zero qcount for the TC.  The while loop that
keeps looping while hash >= qcount will not end.

Fix it by checking the TC's qcount to be nonzero before using it.

Fixes: eadec877ce ("net: Add support for subordinate traffic classes to netdev_pick_tx")
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-25 15:58:01 +01:00
Tianjia Zhang 3fb59a5de5 net/tls: getsockopt supports complete algorithm list
AES_CCM_128 and CHACHA20_POLY1305 are already supported by tls,
similar to setsockopt, getsockopt also needs to support these
two algorithms.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-25 15:55:30 +01:00
Janusz Dziedzic 689a0a9f50 cfg80211: correct bridge/4addr mode check
Without the patch we fail:

$ sudo brctl addbr br0
$ sudo brctl addif br0 wlp1s0
$ sudo iw wlp1s0 set 4addr on
command failed: Device or resource busy (-16)

Last command failed but iface was already in 4addr mode.

Fixes: ad4bb6f888 ("cfg80211: disallow bridging managed/adhoc interfaces")
Signed-off-by: Janusz Dziedzic <janusz.dziedzic@gmail.com>
Link: https://lore.kernel.org/r/20211024201546.614379-1-janusz.dziedzic@gmail.com
[add fixes tag, fix indentation, edit commit log]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-10-25 15:23:20 +02:00
Johannes Berg 09b1d5dc6c cfg80211: fix management registrations locking
The management registrations locking was broken, the list was
locked for each wdev, but cfg80211_mgmt_registrations_update()
iterated it without holding all the correct spinlocks, causing
list corruption.

Rather than trying to fix it with fine-grained locking, just
move the lock to the wiphy/rdev (still need the list on each
wdev), we already need to hold the wdev lock to change it, so
there's no contention on the lock in any case. This trivially
fixes the bug since we hold one wdev's lock already, and now
will hold the lock that protects all lists.

Cc: stable@vger.kernel.org
Reported-by: Jouni Malinen <j@w1.fi>
Fixes: 6cd536fe62 ("cfg80211: change internal management frame registration API")
Link: https://lore.kernel.org/r/20211025133111.5cf733eab0f4.I7b0abb0494ab712f74e2efcd24bb31ac33f7eee9@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-10-25 15:20:22 +02:00
Vladimir Oltean 0faf890fc5 net: dsa: drop rtnl_lock from dsa_slave_switchdev_event_work
After talking with Ido Schimmel, it became clear that rtnl_lock is not
actually required for anything that is done inside the
SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE deferred work handlers.

The reason why it was probably added by Arkadi Sharshevsky in commit
c9eb3e0f87 ("net: dsa: Add support for learning FDB through
notification") was to offer the same locking/serialization guarantees as
.ndo_fdb_{add,del} and avoid reworking any drivers.

DSA has implemented .ndo_fdb_add and .ndo_fdb_del until commit
b117e1e8a8 ("net: dsa: delete dsa_legacy_fdb_add and
dsa_legacy_fdb_del") - that is to say, until fairly recently.

But those methods have been deleted, so now we are free to drop the
rtnl_lock as well.

Note that exposing DSA switch drivers to an unlocked method which was
previously serialized by the rtnl_mutex is a potentially dangerous
affair. Driver writers couldn't ensure that their internal locking
scheme does the right thing even if they wanted.

We could err on the side of paranoia and introduce a switch-wide lock
inside the DSA framework, but that seems way overreaching. Instead, we
could check as many drivers for regressions as we can, fix those first,
then let this change go in once it is assumed to be fairly safe.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-25 12:59:42 +01:00
Vladimir Oltean 338a3a4745 net: dsa: introduce locking for the address lists on CPU and DSA ports
Now that the rtnl_mutex is going away for dsa_port_{host_,}fdb_{add,del},
no one is serializing access to the address lists that DSA keeps for the
purpose of reference counting on shared ports (CPU and cascade ports).

It can happen for one dsa_switch_do_fdb_del to do list_del on a dp->fdbs
element while another dsa_switch_do_fdb_{add,del} is traversing dp->fdbs.
We need to avoid that.

Currently dp->mdbs is not at risk, because dsa_switch_do_mdb_{add,del}
still runs under the rtnl_mutex. But it would be nice if it would not
depend on that being the case. So let's introduce a mutex per port (the
address lists are per port too) and share it between dp->mdbs and
dp->fdbs.

The place where we put the locking is interesting. It could be tempting
to put a DSA-level lock which still serializes calls to
.port_fdb_{add,del}, but it would still not avoid concurrency with other
driver code paths that are currently under rtnl_mutex (.port_fdb_dump,
.port_fast_age). So it would add a very false sense of security (and
adding a global switch-wide lock in DSA to resynchronize with the
rtnl_lock is also counterproductive and hard).

So the locking is intentionally done only where the dp->fdbs and dp->mdbs
lists are traversed. That means, from a driver perspective, that
.port_fdb_add will be called with the dp->addr_lists_lock mutex held on
the CPU port, but not held on user ports. This is done so that driver
writers are not encouraged to rely on any guarantee offered by
dp->addr_lists_lock.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-25 12:59:42 +01:00
Vladimir Oltean 232deb3f95 net: dsa: avoid refcount warnings when ->port_{fdb,mdb}_del returns error
At present, when either of ds->ops->port_fdb_del() or ds->ops->port_mdb_del()
return a non-zero error code, we attempt to save the day and keep the
data structure associated with that switchdev object, as the deletion
procedure did not complete.

However, the way in which we do this is suspicious to the checker in
lib/refcount.c, who thinks it is buggy to increment a refcount that
became zero, and that this is indicative of a use-after-free.

Fixes: 161ca59d39 ("net: dsa: reference count the MDB entries at the cross-chip notifier level")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-25 12:59:41 +01:00
David S. Miller 2d7e73f09f Revert "Merge branch 'dsa-rtnl'"
This reverts commit 965e6b262f, reversing
changes made to 4d98bb0d7e.
2021-10-25 12:59:25 +01:00
David S. Miller 12f241f264 linux-can-next-for-5.16-20211024
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAmF1wXcTHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRCpyVqK+u3vqYLYB/wL73REeL8jRXXOwwcOHYksqJAGkzh4
 H+3IYOAqjx6mipGVfNUP9EmWWtkx8VgXUknGiEEOFocqLhGviTaElbSEmqt1T9ZM
 sqq8rbtRk/WazQoV04/Gk32LYzWkEuI+NBl4J0x96GsUMV50p8rjA5/Qhi5kFevC
 Euev02aOr6vU2Hat/mupxqrqVKbFYq1ehN7SzmmafbZjiuvFoSyJAVksiUtKm2J+
 52j+A//wzVuPNayQePmLIxM+VK4UqZbZ9cMYjJHZTETb6EhRas20zbNjCZCdMu93
 QyfsGdz5Y3XHToYoPqp4bXwIHrwbExi2/BI9MmrehFyKxVQ5/2LaHoFE
 =ts6r
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-next-for-5.16-20211024' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2021-10-24

this is a pull request of 15 patches for net-next/master.

The first patch is by Thomas Gleixner and makes use of
hrtimer_forward_now() in the CAN broad cast manager (bcm).

The next patch is by me and changes the type of the variables used in
the CAN bit timing calculation can_fixup_bittiming() to unsigned int.

Vincent Mailhol provides 6 patches targeting the CAN device
infrastructure. The CAN-FD specific Transmitter Delay Compensation
(TDC) is updated and configuration via the CAN netlink interface is
added.

Qing Wang's patch updates the at91 and janz-ican3 drivers to use
sysfs_emit() instead of snprintf() in the sysfs show functions.

Geert Uytterhoeven's patch drops the unneeded ARM dependency from the
rar Kconfig.

Cai Huoqing's patch converts the mscan driver to make use of the
dev_err_probe() helper function.

A patch by me against the gsusb driver changes the printf format
strings to use %u to print unsigned values.

Stephane Grosjean's patch updates the peak_usb CAN-FD driver to use
the 64 bit timestamps provided by the hardware.

The last 2 patches target the xilinx_can driver. Michal Simek provides
a patch that removes repeated word from the kernel-doc and Dongliang
Mu's patch removes a redundant netif_napi_del() from the xcan_remove()
function.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-25 12:48:38 +01:00
Thomas Gleixner 9b44a927e1 can: bcm: Use hrtimer_forward_now()
hrtimer_forward_now() provides the same functionality as the open coded
hrimer_forward() invocation. Prepares for removal of hrtimer_forward() from
the public interfaces.

Link: https://lore.kernel.org/all/20210923153339.684546907@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: linux-can@vger.kernel.org
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: netdev@vger.kernel.org
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-10-24 16:24:28 +02:00
Jakub Kicinski d6b3daf24e net: atm: use address setting helpers
Get it ready for constant netdev->dev_addr.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-24 13:59:45 +01:00
Jakub Kicinski 5520fb42a0 net: caif: get ready for const netdev->dev_addr
Get it ready for constant netdev->dev_addr.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-24 13:59:45 +01:00
Jakub Kicinski 39c19fb9b4 net: hsr: get ready for const netdev->dev_addr
hsr_create_self_node() may get netdev->dev_addr
passed as argument, netdev->dev_addr will be
const soon.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-24 13:59:44 +01:00
Jakub Kicinski efd38f75bb net: rtnetlink: use __dev_addr_set()
Get it ready for constant netdev->dev_addr.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-24 13:59:44 +01:00
Jakub Kicinski 5fd348a050 net: core: constify mac addrs in selftests
Get it ready for constant netdev->dev_addr.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-24 13:59:44 +01:00
Sean Anderson 4973056cce net: convert users of bitmap_foo() to linkmode_foo()
This converts instances of
	bitmap_foo(args..., __ETHTOOL_LINK_MODE_MASK_NBITS)
to
	linkmode_foo(args...)

I manually fixed up some lines to prevent them from being excessively
long. Otherwise, this change was generated with the following semantic
patch:

// Generated with
// echo linux/linkmode.h > includes
// git grep -Flf includes include/ | cut -f 2- -d / | cat includes - \
// | sort | uniq | tee new_includes | wc -l && mv new_includes includes
// and repeating until the number stopped going up
@i@
@@

(
 #include <linux/acpi_mdio.h>
|
 #include <linux/brcmphy.h>
|
 #include <linux/dsa/loop.h>
|
 #include <linux/dsa/sja1105.h>
|
 #include <linux/ethtool.h>
|
 #include <linux/ethtool_netlink.h>
|
 #include <linux/fec.h>
|
 #include <linux/fs_enet_pd.h>
|
 #include <linux/fsl/enetc_mdio.h>
|
 #include <linux/fwnode_mdio.h>
|
 #include <linux/linkmode.h>
|
 #include <linux/lsm_audit.h>
|
 #include <linux/mdio-bitbang.h>
|
 #include <linux/mdio.h>
|
 #include <linux/mdio-mux.h>
|
 #include <linux/mii.h>
|
 #include <linux/mii_timestamper.h>
|
 #include <linux/mlx5/accel.h>
|
 #include <linux/mlx5/cq.h>
|
 #include <linux/mlx5/device.h>
|
 #include <linux/mlx5/driver.h>
|
 #include <linux/mlx5/eswitch.h>
|
 #include <linux/mlx5/fs.h>
|
 #include <linux/mlx5/port.h>
|
 #include <linux/mlx5/qp.h>
|
 #include <linux/mlx5/rsc_dump.h>
|
 #include <linux/mlx5/transobj.h>
|
 #include <linux/mlx5/vport.h>
|
 #include <linux/of_mdio.h>
|
 #include <linux/of_net.h>
|
 #include <linux/pcs-lynx.h>
|
 #include <linux/pcs/pcs-xpcs.h>
|
 #include <linux/phy.h>
|
 #include <linux/phy_led_triggers.h>
|
 #include <linux/phylink.h>
|
 #include <linux/platform_data/bcmgenet.h>
|
 #include <linux/platform_data/xilinx-ll-temac.h>
|
 #include <linux/pxa168_eth.h>
|
 #include <linux/qed/qed_eth_if.h>
|
 #include <linux/qed/qed_fcoe_if.h>
|
 #include <linux/qed/qed_if.h>
|
 #include <linux/qed/qed_iov_if.h>
|
 #include <linux/qed/qed_iscsi_if.h>
|
 #include <linux/qed/qed_ll2_if.h>
|
 #include <linux/qed/qed_nvmetcp_if.h>
|
 #include <linux/qed/qed_rdma_if.h>
|
 #include <linux/sfp.h>
|
 #include <linux/sh_eth.h>
|
 #include <linux/smsc911x.h>
|
 #include <linux/soc/nxp/lpc32xx-misc.h>
|
 #include <linux/stmmac.h>
|
 #include <linux/sunrpc/svc_rdma.h>
|
 #include <linux/sxgbe_platform.h>
|
 #include <net/cfg80211.h>
|
 #include <net/dsa.h>
|
 #include <net/mac80211.h>
|
 #include <net/selftests.h>
|
 #include <rdma/ib_addr.h>
|
 #include <rdma/ib_cache.h>
|
 #include <rdma/ib_cm.h>
|
 #include <rdma/ib_hdrs.h>
|
 #include <rdma/ib_mad.h>
|
 #include <rdma/ib_marshall.h>
|
 #include <rdma/ib_pack.h>
|
 #include <rdma/ib_pma.h>
|
 #include <rdma/ib_sa.h>
|
 #include <rdma/ib_smi.h>
|
 #include <rdma/ib_umem.h>
|
 #include <rdma/ib_umem_odp.h>
|
 #include <rdma/ib_verbs.h>
|
 #include <rdma/iw_cm.h>
|
 #include <rdma/mr_pool.h>
|
 #include <rdma/opa_addr.h>
|
 #include <rdma/opa_port_info.h>
|
 #include <rdma/opa_smi.h>
|
 #include <rdma/opa_vnic.h>
|
 #include <rdma/rdma_cm.h>
|
 #include <rdma/rdma_cm_ib.h>
|
 #include <rdma/rdmavt_cq.h>
|
 #include <rdma/rdma_vt.h>
|
 #include <rdma/rdmavt_qp.h>
|
 #include <rdma/rw.h>
|
 #include <rdma/tid_rdma_defs.h>
|
 #include <rdma/uverbs_ioctl.h>
|
 #include <rdma/uverbs_named_ioctl.h>
|
 #include <rdma/uverbs_std_types.h>
|
 #include <rdma/uverbs_types.h>
|
 #include <soc/mscc/ocelot.h>
|
 #include <soc/mscc/ocelot_ptp.h>
|
 #include <soc/mscc/ocelot_vcap.h>
|
 #include <trace/events/ib_mad.h>
|
 #include <trace/events/rdma_core.h>
|
 #include <trace/events/rdma.h>
|
 #include <trace/events/rpcrdma.h>
|
 #include <uapi/linux/ethtool.h>
|
 #include <uapi/linux/ethtool_netlink.h>
|
 #include <uapi/linux/mdio.h>
|
 #include <uapi/linux/mii.h>
)

@depends on i@
expression list args;
@@

(
- bitmap_zero(args, __ETHTOOL_LINK_MODE_MASK_NBITS)
+ linkmode_zero(args)
|
- bitmap_copy(args, __ETHTOOL_LINK_MODE_MASK_NBITS)
+ linkmode_copy(args)
|
- bitmap_and(args, __ETHTOOL_LINK_MODE_MASK_NBITS)
+ linkmode_and(args)
|
- bitmap_or(args, __ETHTOOL_LINK_MODE_MASK_NBITS)
+ linkmode_or(args)
|
- bitmap_empty(args, ETHTOOL_LINK_MODE_MASK_NBITS)
+ linkmode_empty(args)
|
- bitmap_andnot(args, __ETHTOOL_LINK_MODE_MASK_NBITS)
+ linkmode_andnot(args)
|
- bitmap_equal(args, __ETHTOOL_LINK_MODE_MASK_NBITS)
+ linkmode_equal(args)
|
- bitmap_intersects(args, __ETHTOOL_LINK_MODE_MASK_NBITS)
+ linkmode_intersects(args)
|
- bitmap_subset(args, __ETHTOOL_LINK_MODE_MASK_NBITS)
+ linkmode_subset(args)
)

Add missing linux/mii.h include to mellanox. -DaveM

Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-24 13:58:52 +01:00
Vladimir Oltean 5cdfde49a0 net: dsa: drop rtnl_lock from dsa_slave_switchdev_event_work
After talking with Ido Schimmel, it became clear that rtnl_lock is not
actually required for anything that is done inside the
SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE deferred work handlers.

The reason why it was probably added by Arkadi Sharshevsky in commit
c9eb3e0f87 ("net: dsa: Add support for learning FDB through
notification") was to offer the same locking/serialization guarantees as
.ndo_fdb_{add,del} and avoid reworking any drivers.

DSA has implemented .ndo_fdb_add and .ndo_fdb_del until commit
b117e1e8a8 ("net: dsa: delete dsa_legacy_fdb_add and
dsa_legacy_fdb_del") - that is to say, until fairly recently.

But those methods have been deleted, so now we are free to drop the
rtnl_lock as well.

Note that exposing DSA switch drivers to an unlocked method which was
previously serialized by the rtnl_mutex is a potentially dangerous
affair. Driver writers couldn't ensure that their internal locking
scheme does the right thing even if they wanted.

We could err on the side of paranoia and introduce a switch-wide lock
inside the DSA framework, but that seems way overreaching. Instead, we
could check as many drivers for regressions as we can, fix those first,
then let this change go in once it is assumed to be fairly safe.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-24 13:47:44 +01:00
Vladimir Oltean d3bd892437 net: dsa: introduce locking for the address lists on CPU and DSA ports
Now that the rtnl_mutex is going away for dsa_port_{host_,}fdb_{add,del},
no one is serializing access to the address lists that DSA keeps for the
purpose of reference counting on shared ports (CPU and cascade ports).

It can happen for one dsa_switch_do_fdb_del to do list_del on a dp->fdbs
element while another dsa_switch_do_fdb_{add,del} is traversing dp->fdbs.
We need to avoid that.

Currently dp->mdbs is not at risk, because dsa_switch_do_mdb_{add,del}
still runs under the rtnl_mutex. But it would be nice if it would not
depend on that being the case. So let's introduce a mutex per port (the
address lists are per port too) and share it between dp->mdbs and
dp->fdbs.

The place where we put the locking is interesting. It could be tempting
to put a DSA-level lock which still serializes calls to
.port_fdb_{add,del}, but it would still not avoid concurrency with other
driver code paths that are currently under rtnl_mutex (.port_fdb_dump,
.port_fast_age). So it would add a very false sense of security (and
adding a global switch-wide lock in DSA to resynchronize with the
rtnl_lock is also counterproductive and hard).

So the locking is intentionally done only where the dp->fdbs and dp->mdbs
lists are traversed. That means, from a driver perspective, that
.port_fdb_add will be called with the dp->addr_lists_lock mutex held on
the CPU port, but not held on user ports. This is done so that driver
writers are not encouraged to rely on any guarantee offered by
dp->addr_lists_lock.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-24 13:47:44 +01:00
Lorenz Bauer fadb7ff1a6 bpf: Prevent increasing bpf_jit_limit above max
Restrict bpf_jit_limit to the maximum supported by the arch's JIT.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211014142554.53120-4-lmb@cloudflare.com
2021-10-22 17:23:53 -07:00
Leon Romanovsky 7a690ad499 devlink: Clean not-executed param notifications
The parameters are registered before devlink_register() and all the
notifications are delayed. This patch removes not-possible parameters
notifications along with addition of code annotation logic.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-22 16:15:42 -07:00
Leon Romanovsky 8bbeed4858 devlink: Remove not-executed trap group notifications
The trap logic is registered before devlink_register() and all the
notifications are delayed. This patch removes not-possible trap group
notifications along with addition of code annotation logic.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-22 16:15:41 -07:00
Leon Romanovsky 22849b5ea5 devlink: Remove not-executed trap policer notifications
The trap policer logic is registered before devlink_register() and all the
notifications are delayed. This patch removes not-possible notifications
along with addition of code annotation logic.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-22 16:15:41 -07:00
Leon Romanovsky 99ad92eff7 devlink: Delete obsolete parameters publish API
The change of devlink_register() to be last devlink command together
with delayed notification logic made the publish API to be obsolete.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-22 16:15:41 -07:00