Commit graph

755 commits

Author SHA1 Message Date
Vasily Averin 54970a2fbb net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed packet
syzbot reproduces BUG_ON in skb_checksum_help():
tun creates (bogus) skb with huge partial-checksummed area and
small ip packet inside. Then ip_rcv trims the skb based on size
of internal ip packet, after that csum offset points beyond of
trimmed skb. Then checksum_tg() called via netfilter hook
triggers BUG_ON:

        offset = skb_checksum_start_offset(skb);
        BUG_ON(offset >= skb_headlen(skb));

To work around the problem this patch forces pskb_trim_rcsum_slow()
to return -EINVAL in described scenario. It allows its callers to
drop such kind of packets.

Link: https://syzkaller.appspot.com/bug?id=b419a5ca95062664fe1a60b764621eb4526e2cd0
Reported-by: syzbot+7010af67ced6105e5ab6@syzkaller.appspotmail.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/1b2494af-2c56-8ee2-7bc0-923fcad1cdf8@virtuozzo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 18:41:01 -08:00
Jakub Kicinski 55fd59b003 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:
	drivers/net/ethernet/ibm/ibmvnic.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 15:44:09 -08:00
Davide Caratti 13de4ed9e3 net: skbuff: ensure LSE is pullable before decrementing the MPLS ttl
skb_mpls_dec_ttl() reads the LSE without ensuring that it is contained in
the skb "linear" area. Fix this calling pskb_may_pull() before reading the
current ttl.

Found by code inspection.

Fixes: 2a2ea50870 ("net: sched: add mpls manipulation actions to TC")
Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/53659f28be8bc336c113b5254dc637cc76bbae91.1606987074.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 11:13:21 -08:00
Marco Elver fa69ee5aa4 net: switch to storing KCOV handle directly in sk_buff
It turns out that usage of skb extensions can cause memory leaks. Ido
Schimmel reported: "[...] there are instances that blindly overwrite
'skb->extensions' by invoking skb_copy_header() after __alloc_skb()."

Therefore, give up on using skb extensions for KCOV handle, and instead
directly store kcov_handle in sk_buff.

Fixes: 6370cc3bbd ("net: add kcov handle to skb extensions")
Fixes: 85ce50d337 ("net: kcov: don't select SKB_EXTENSIONS when there is no NET")
Fixes: 97f53a08cb ("net: linux/skbuff.h: combine SKB_EXTENSIONS + KCOV handling")
Link: https://lore.kernel.org/linux-wireless/20201121160941.GA485907@shredder.lan/
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20201125224840.2014773-1-elver@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 11:26:19 -08:00
Jakub Kicinski 5c39f26e67 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflict in CAN, keep the net-next + the byteswap wrapper.

Conflicts:
	drivers/net/can/usb/gs_usb.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 18:25:27 -08:00
Willem de Bruijn 985f733742 sock: set sk_err to ee_errno on dequeue from errq
When setting sk_err, set it to ee_errno, not ee_origin.

Commit f5f99309fa ("sock: do not set sk_err in
sock_dequeue_err_skb") disabled updating sk_err on errq dequeue,
which is correct for most error types (origins):

  -       sk->sk_err = err;

Commit 38b257938a ("sock: reset sk_err when the error queue is
empty") reenabled the behavior for IMCP origins, which do require it:

  +       if (icmp_next)
  +               sk->sk_err = SKB_EXT_ERR(skb_next)->ee.ee_origin;

But read from ee_errno.

Fixes: 38b257938a ("sock: reset sk_err when the error queue is empty")
Reported-by: Ayush Ranjan <ayushranjan@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20201126151220.2819322-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 11:07:06 -08:00
Yunsheng Lin 6454eca81e net: Use lockdep_assert_in_softirq() in napi_consume_skb()
Use napi_consume_skb() to assert the case when it is not called
in a atomic softirq context.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 15:08:38 -08:00
Mauro Carvalho Chehab c1639be98b net: datagram: fix some kernel-doc markups
Some identifiers have different names between their prototypes
and the kernel-doc markup.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17 14:15:03 -08:00
Alexander Lobakin 8be33ecfc1 net: skb_vlan_untag(): don't reset transport offset if set by GRO layer
Similar to commit fda55eca5a
("net: introduce skb_transport_header_was_set()"), avoid resetting
transport offsets that were already set by GRO layer. This not only
mirrors the behavior of __netif_receive_skb_core(), but also makes
sense when it comes to UDP GSO fraglists forwarding: transport offset
of such skbs is set only once by GRO receive callback and remains
untouched and correct up to the xmitting driver in 1:1 case, but
becomes junk after untagging in ingress VLAN case and breaks UDP
GSO offload. This does not happen after this change, and all types
of forwarding of UDP GSO fraglists work as expected.

Since v1 [1]:
 - keep the code 1:1 with __netif_receive_skb_core() (Jakub).

[1] https://lore.kernel.org/netdev/zYurwsZRN7BkqSoikWQLVqHyxz18h4LhHU4NFa2Vw@cp4-web-038.plabs.ch

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Link: https://lore.kernel.org/r/7JgIkgEztzt0W6ZtC9V9Cnk5qfkrUFYcpN871syCi8@cp4-web-040.plabs.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-09 20:03:55 -08:00
Aleksandr Nogikh 6370cc3bbd net: add kcov handle to skb extensions
Remote KCOV coverage collection enables coverage-guided fuzzing of the
code that is not reachable during normal system call execution. It is
especially helpful for fuzzing networking subsystems, where it is
common to perform packet handling in separate work queues even for the
packets that originated directly from the user space.

Enable coverage-guided frame injection by adding kcov remote handle to
skb extensions. Default initialization in __alloc_skb and
__build_skb_around ensures that no socket buffer that was generated
during a system call will be missed.

Code that is of interest and that performs packet processing should be
annotated with kcov_remote_start()/kcov_remote_stop().

An alternative approach is to determine kcov_handle solely on the
basis of the device/interface that received the specific socket
buffer. However, in this case it would be impossible to distinguish
between packets that originated during normal background network
processes or were intentionally injected from the user space.

Signed-off-by: Aleksandr Nogikh <nogikh@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-02 18:01:34 -08:00
Linus Torvalds 9ff9b0d392 networking changes for the 5.10 merge window
Add redirect_neigh() BPF packet redirect helper, allowing to limit stack
 traversal in common container configs and improving TCP back-pressure.
 Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.
 
 Expand netlink policy support and improve policy export to user space.
 (Ge)netlink core performs request validation according to declared
 policies. Expand the expressiveness of those policies (min/max length
 and bitmasks). Allow dumping policies for particular commands.
 This is used for feature discovery by user space (instead of kernel
 version parsing or trial and error).
 
 Support IGMPv3/MLDv2 multicast listener discovery protocols in bridge.
 
 Allow more than 255 IPv4 multicast interfaces.
 
 Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
 packets of TCPv6.
 
 In Multi-patch TCP (MPTCP) support concurrent transmission of data
 on multiple subflows in a load balancing scenario. Enhance advertising
 addresses via the RM_ADDR/ADD_ADDR options.
 
 Support SMC-Dv2 version of SMC, which enables multi-subnet deployments.
 
 Allow more calls to same peer in RxRPC.
 
 Support two new Controller Area Network (CAN) protocols -
 CAN-FD and ISO 15765-2:2016.
 
 Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
 kernel problem.
 
 Add TC actions for implementing MPLS L2 VPNs.
 
 Improve nexthop code - e.g. handle various corner cases when nexthop
 objects are removed from groups better, skip unnecessary notifications
 and make it easier to offload nexthops into HW by converting
 to a blocking notifier.
 
 Support adding and consuming TCP header options by BPF programs,
 opening the doors for easy experimental and deployment-specific
 TCP option use.
 
 Reorganize TCP congestion control (CC) initialization to simplify life
 of TCP CC implemented in BPF.
 
 Add support for shipping BPF programs with the kernel and loading them
 early on boot via the User Mode Driver mechanism, hence reusing all the
 user space infra we have.
 
 Support sleepable BPF programs, initially targeting LSM and tracing.
 
 Add bpf_d_path() helper for returning full path for given 'struct path'.
 
 Make bpf_tail_call compatible with bpf-to-bpf calls.
 
 Allow BPF programs to call map_update_elem on sockmaps.
 
 Add BPF Type Format (BTF) support for type and enum discovery, as
 well as support for using BTF within the kernel itself (current use
 is for pretty printing structures).
 
 Support listing and getting information about bpf_links via the bpf
 syscall.
 
 Enhance kernel interfaces around NIC firmware update. Allow specifying
 overwrite mask to control if settings etc. are reset during update;
 report expected max time operation may take to users; support firmware
 activation without machine reboot incl. limits of how much impact
 reset may have (e.g. dropping link or not).
 
 Extend ethtool configuration interface to report IEEE-standard
 counters, to limit the need for per-vendor logic in user space.
 
 Adopt or extend devlink use for debug, monitoring, fw update
 in many drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw,
 mv88e6xxx, dpaa2-eth).
 
 In mlxsw expose critical and emergency SFP module temperature alarms.
 Refactor port buffer handling to make the defaults more suitable and
 support setting these values explicitly via the DCBNL interface.
 
 Add XDP support for Intel's igb driver.
 
 Support offloading TC flower classification and filtering rules to
 mscc_ocelot switches.
 
 Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
 fixed interval period pulse generator and one-step timestamping in
 dpaa-eth.
 
 Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
 offload.
 
 Add Lynx PHY/PCS MDIO module, and convert various drivers which have
 this HW to use it. Convert mvpp2 to split PCS.
 
 Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
 7-port Mediatek MT7531 IP.
 
 Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
 and wcn3680 support in wcn36xx.
 
 Improve performance for packets which don't require much offloads
 on recent Mellanox NICs by 20% by making multiple packets share
 a descriptor entry.
 
 Move chelsio inline crypto drivers (for TLS and IPsec) from the crypto
 subtree to drivers/net. Move MDIO drivers out of the phy directory.
 
 Clean up a lot of W=1 warnings, reportedly the actively developed
 subsections of networking drivers should now build W=1 warning free.
 
 Make sure drivers don't use in_interrupt() to dynamically adapt their
 code. Convert tasklets to use new tasklet_setup API (sadly this
 conversion is not yet complete).
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+ItRwACgkQMUZtbf5S
 IrtTMg//UxpdR/MirT1DatBU0K/UGAZY82hV7F/UC8tPgjfHZeHvWlDFxfi3YP81
 PtPKbhRZ7DhwBXefUp6nY3UdvjftrJK2lJm8prJUPSsZRye8Wlcb7y65q7/P2y2U
 Efucyopg6RUrmrM0DUsIGYGJgylQLHnMYUl/keCsD4t5Bp4ksyi9R2t5eitGoWzh
 r3QGdbSa0AuWx4iu0i+tqp6Tj0ekMBMXLVb35dtU1t0joj2KTNEnSgABN3prOa8E
 iWYf2erOau68Ogp3yU3miCy0ZU4p/7qGHTtzbcp677692P/ekak6+zmfHLT9/Pjy
 2Stq2z6GoKuVxdktr91D9pA3jxG4LxSJmr0TImcGnXbvkMP3Ez3g9RrpV5fn8j6F
 mZCH8TKZAoD5aJrAJAMkhZmLYE1pvDa7KolSk8WogXrbCnTEb5Nv8FHTS1Qnk3yl
 wSKXuvutFVNLMEHCnWQLtODbTST9DI/aOi6EctPpuOA/ZyL1v3pl+gfp37S+LUTe
 owMnT/7TdvKaTD0+gIyU53M6rAWTtr5YyRQorX9awIu/4Ha0F0gYD7BJZQUGtegp
 HzKt59NiSrFdbSH7UdyemdBF4LuCgIhS7rgfeoUXMXmuPHq7eHXyHZt5dzPPa/xP
 81P0MAvdpFVwg8ij2yp2sHS7sISIRKq17fd1tIewUabxQbjXqPc=
 =bc1U
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:

 - Add redirect_neigh() BPF packet redirect helper, allowing to limit
   stack traversal in common container configs and improving TCP
   back-pressure.

   Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

 - Expand netlink policy support and improve policy export to user
   space. (Ge)netlink core performs request validation according to
   declared policies. Expand the expressiveness of those policies
   (min/max length and bitmasks). Allow dumping policies for particular
   commands. This is used for feature discovery by user space (instead
   of kernel version parsing or trial and error).

 - Support IGMPv3/MLDv2 multicast listener discovery protocols in
   bridge.

 - Allow more than 255 IPv4 multicast interfaces.

 - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
   packets of TCPv6.

 - In Multi-patch TCP (MPTCP) support concurrent transmission of data on
   multiple subflows in a load balancing scenario. Enhance advertising
   addresses via the RM_ADDR/ADD_ADDR options.

 - Support SMC-Dv2 version of SMC, which enables multi-subnet
   deployments.

 - Allow more calls to same peer in RxRPC.

 - Support two new Controller Area Network (CAN) protocols - CAN-FD and
   ISO 15765-2:2016.

 - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
   kernel problem.

 - Add TC actions for implementing MPLS L2 VPNs.

 - Improve nexthop code - e.g. handle various corner cases when nexthop
   objects are removed from groups better, skip unnecessary
   notifications and make it easier to offload nexthops into HW by
   converting to a blocking notifier.

 - Support adding and consuming TCP header options by BPF programs,
   opening the doors for easy experimental and deployment-specific TCP
   option use.

 - Reorganize TCP congestion control (CC) initialization to simplify
   life of TCP CC implemented in BPF.

 - Add support for shipping BPF programs with the kernel and loading
   them early on boot via the User Mode Driver mechanism, hence reusing
   all the user space infra we have.

 - Support sleepable BPF programs, initially targeting LSM and tracing.

 - Add bpf_d_path() helper for returning full path for given 'struct
   path'.

 - Make bpf_tail_call compatible with bpf-to-bpf calls.

 - Allow BPF programs to call map_update_elem on sockmaps.

 - Add BPF Type Format (BTF) support for type and enum discovery, as
   well as support for using BTF within the kernel itself (current use
   is for pretty printing structures).

 - Support listing and getting information about bpf_links via the bpf
   syscall.

 - Enhance kernel interfaces around NIC firmware update. Allow
   specifying overwrite mask to control if settings etc. are reset
   during update; report expected max time operation may take to users;
   support firmware activation without machine reboot incl. limits of
   how much impact reset may have (e.g. dropping link or not).

 - Extend ethtool configuration interface to report IEEE-standard
   counters, to limit the need for per-vendor logic in user space.

 - Adopt or extend devlink use for debug, monitoring, fw update in many
   drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
   dpaa2-eth).

 - In mlxsw expose critical and emergency SFP module temperature alarms.
   Refactor port buffer handling to make the defaults more suitable and
   support setting these values explicitly via the DCBNL interface.

 - Add XDP support for Intel's igb driver.

 - Support offloading TC flower classification and filtering rules to
   mscc_ocelot switches.

 - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
   fixed interval period pulse generator and one-step timestamping in
   dpaa-eth.

 - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
   offload.

 - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
   this HW to use it. Convert mvpp2 to split PCS.

 - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
   7-port Mediatek MT7531 IP.

 - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
   and wcn3680 support in wcn36xx.

 - Improve performance for packets which don't require much offloads on
   recent Mellanox NICs by 20% by making multiple packets share a
   descriptor entry.

 - Move chelsio inline crypto drivers (for TLS and IPsec) from the
   crypto subtree to drivers/net. Move MDIO drivers out of the phy
   directory.

 - Clean up a lot of W=1 warnings, reportedly the actively developed
   subsections of networking drivers should now build W=1 warning free.

 - Make sure drivers don't use in_interrupt() to dynamically adapt their
   code. Convert tasklets to use new tasklet_setup API (sadly this
   conversion is not yet complete).

* tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
  Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
  net, sockmap: Don't call bpf_prog_put() on NULL pointer
  bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
  bpf, sockmap: Add locking annotations to iterator
  netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
  net: fix pos incrementment in ipv6_route_seq_next
  net/smc: fix invalid return code in smcd_new_buf_create()
  net/smc: fix valid DMBE buffer sizes
  net/smc: fix use-after-free of delayed events
  bpfilter: Fix build error with CONFIG_BPFILTER_UMH
  cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
  net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
  bpf: Fix register equivalence tracking.
  rxrpc: Fix loss of final ack on shutdown
  rxrpc: Fix bundle counting for exclusive connections
  netfilter: restore NF_INET_NUMHOOKS
  ibmveth: Identify ingress large send packets.
  ibmveth: Switch order of ibmveth_helper calls.
  cxgb4: handle 4-tuple PEDIT to NAT mode translation
  selftests: Add VRF route leaking tests
  ...
2020-10-15 18:42:13 -07:00
Linus Torvalds c90578360c Merge branch 'work.csum_and_copy' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull copy_and_csum cleanups from Al Viro:
 "Saner calling conventions for csum_and_copy_..._user() and friends"

[ Removing 800+ lines of code and cleaning stuff up is good  - Linus ]

* 'work.csum_and_copy' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ppc: propagate the calling conventions change down to csum_partial_copy_generic()
  amd64: switch csum_partial_copy_generic() to new calling conventions
  sparc64: propagate the calling convention changes down to __csum_partial_copy_...()
  xtensa: propagate the calling conventions change down into csum_partial_copy_generic()
  mips: propagate the calling convention change down into __csum_partial_copy_..._user()
  mips: __csum_partial_copy_kernel() has no users left
  mips: csum_and_copy_{to,from}_user() are never called under KERNEL_DS
  sparc32: propagate the calling conventions change down to __csum_partial_copy_sparc_generic()
  i386: propagate the calling conventions change down to csum_partial_copy_generic()
  sh: propage the calling conventions change down to csum_partial_copy_generic()
  m68k: get rid of zeroing destination on error in csum_and_copy_from_user()
  arm: propagate the calling convention changes down to csum_partial_copy_from_user()
  alpha: propagate the calling convention changes down to csum_partial_copy.c helpers
  saner calling conventions for csum_and_copy_..._user()
  csum_and_copy_..._user(): pass 0xffffffff instead of 0 as initial sum
  csum_partial_copy_nocheck(): drop the last argument
  unify generic instances of csum_partial_copy_nocheck()
  icmp_push_reply(): reorder adding the checksum up
  skb_copy_and_csum_bits(): don't bother with the last argument
2020-10-12 16:24:13 -07:00
Vladimir Oltean 302af7c604 net: always dump full packets with skb_dump
Currently skb_dump has a restriction to only dump full packet for the
first 5 socket buffers, then only headers will be printed. Remove this
arbitrary and confusing restriction, which is only documented vaguely
("up to") in the comments above the prototype.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-06 06:14:02 -07:00
David S. Miller 8b0308fe31 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Rejecting non-native endian BTF overlapped with the addition
of support for it.

The rest were more simple overlapping changes, except the
renesas ravb binding update, which had to follow a file
move as well as a YAML conversion.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-05 18:40:01 -07:00
Guillaume Nault 4296adc3e3 net/core: check length before updating Ethertype in skb_mpls_{push,pop}
Openvswitch allows to drop a packet's Ethernet header, therefore
skb_mpls_push() and skb_mpls_pop() might be called with ethernet=true
and mac_len=0. In that case the pointer passed to skb_mod_eth_type()
doesn't point to an Ethernet header and the new Ethertype is written at
unexpected locations.

Fix this by verifying that mac_len is big enough to contain an Ethernet
header.

Fixes: fa4e0f8855 ("net/sched: fix corrupted L2 header with MPLS 'push' and 'pop' actions")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-04 15:09:26 -07:00
Guillaume Nault 19fbcb36a3 net/sched: act_vlan: Add {POP,PUSH}_ETH actions
Implement TCA_VLAN_ACT_POP_ETH and TCA_VLAN_ACT_PUSH_ETH, to
respectively pop and push a base Ethernet header at the beginning of a
frame.

POP_ETH is just a matter of pulling ETH_HLEN bytes. VLAN tags, if any,
must be stripped before calling POP_ETH.

PUSH_ETH is restricted to skbs with no mac_header, and only the MAC
addresses can be configured. The Ethertype is automatically set from
skb->protocol. These restrictions ensure that all skb's fields remain
consistent, so that this action can't confuse other part of the
networking stack (like GSO).

Since openvswitch already had these actions, consolidate the code in
skbuff.c (like for vlan and mpls push/pop).

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-03 17:28:45 -07:00
Yunsheng Lin 1f14bd99e6 net: remove unnecessary NULL checking in napi_consume_skb()
When budget is non-zero, skb_unref() has already handled the
NULL checking.

When budget is zero, the dev_consume_skb_any() has handled NULL
checking in __dev_kfree_skb_irq(), or dev_kfree_skb() which also
ultimately call skb_unref().

So remove the unnecessary checking in napi_consume_skb().

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-20 14:17:20 -07:00
Jakub Kicinski 44a8c4f33c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
We got slightly different patches removing a double word
in a comment in net/ipv4/raw.c - picked the version from net.

Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached
values instead of VNIC login response buffer (following what
commit 507ebe6444 ("ibmvnic: Fix use-after-free of VNIC login
response buffer") did).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-04 21:28:59 -07:00
Herbert Xu be769db2f9 net: Get rid of consume_skb when tracing is off
The function consume_skb is only meaningful when tracing is enabled.
This patch makes it conditional on CONFIG_TRACEPOINTS.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 16:12:26 -07:00
David S. Miller 7611cbb900 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-08-23 11:48:27 -07:00
Al Viro cc44c17baf csum_partial_copy_nocheck(): drop the last argument
It's always 0.  Note that we theoretically could use ~0U as well -
result will be the same modulo 0xffff, _if_ the damn thing did the
right thing for any value of initial sum; later we'll make use of
that when convenient.

However, unlike csum_and_copy_..._user(), there are instances that
did not work for arbitrary initial sums; c6x is one such.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-20 15:45:14 -04:00
Al Viro 8d5930dfb7 skb_copy_and_csum_bits(): don't bother with the last argument
it's always 0

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-20 15:45:13 -04:00
Miaohe Lin eabe861881 net: handle the return value of pskb_carve_frag_list() correctly
pskb_carve_frag_list() may return -ENOMEM in pskb_carve_inside_nonlinear().
we should handle this correctly or we would get wrong sk_buff.

Fixes: 6fa01ccd88 ("skbuff: Add pskb_extract() helper function")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-18 15:56:56 -07:00
Miaohe Lin e3ec1e8ca0 net: eliminate meaningless memcpy to data in pskb_carve_inside_nonlinear()
The frags of skb_shared_info of the data is assigned in following loop. It
is meaningless to do a memcpy of frags here.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-18 15:55:24 -07:00
Miaohe Lin 55eff0eb74 net: Fix potential wrong skb->protocol in skb_vlan_untag()
We may access the two bytes after vlan_hdr in vlan_set_encap_proto(). So
we should pull VLAN_HLEN + sizeof(unsigned short) in skb_vlan_untag() or
we may access the wrong data.

Fixes: 0d5501c1c8 ("net: Always untag vlan-tagged traffic on input.")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 15:33:49 -07:00
Miaohe Lin 11f920d2aa net: Use helper function ip_is_fragment()
Use helper function ip_is_fragment() to check ip fragment.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-08 14:24:53 -07:00
Miaohe Lin 2f631133c4 net: Pass NULL to skb_network_protocol() when we don't care about vlan depth
When we don't care about vlan depth, we could pass NULL instead of the
address of a unused local variable to skb_network_protocol() as a param.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-03 15:38:31 -07:00
Miaohe Lin c15fc199b3 net: Use __skb_pagelen() directly in skb_cow_data()
In fact, skb_pagelen() - skb_headlen() is equal to __skb_pagelen(), use it
directly to avoid unnecessary skb_headlen() call.

Also fix the CHECK note of checkpatch.pl:
    Comparison to NULL could be written "!__pskb_pull_tail"

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-03 15:38:31 -07:00
Yousuk Seung 48040793fa tcp: add earliest departure time to SCM_TIMESTAMPING_OPT_STATS
This change adds TCP_NLA_EDT to SCM_TIMESTAMPING_OPT_STATS that reports
the earliest departure time(EDT) of the timestamped skb. By tracking EDT
values of the skb from different timestamps, we can observe when and how
much the value changed. This allows to measure the precise delay
injected on the sender host e.g. by a bpf-base throttler.

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-31 17:00:44 -07:00
Eric Dumazet 4f65e2f483 net: unexport skb_gro_receive()
skb_gro_receive() used to be used by SCTP, it is no longer the case.

skb_gro_receive_list() is in the same category : never used from modules.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 19:55:36 -07:00
Florian Westphal 4930f4831b net: allow __skb_ext_alloc to sleep
mptcp calls this from the transmit side, from process context.
Allow a sleeping allocation instead of unconditional GFP_ATOMIC.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:35:34 -07:00
Jesper Dangaard Brouer 41a46913be net: fix skb_panic to output real address
In skb_panic() the real pointer values are really needed to diagnose
issues, e.g. data and head are related (to calculate headroom). The
hashed versions of the addresses doesn't make much sense here. The
patch use the printk specifier %px to print the actual address.

The printk documentation on %px:
https://www.kernel.org/doc/html/latest/core-api/printk-formats.html#unmodified-addresses

Fixes: ad67b74d24 ("printk: hash addresses printed with %p")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 15:19:59 -07:00
David S. Miller 5a470b1a63 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-03-30 20:48:43 -07:00
Florian Westphal cf673ed0e0 net: fix fraglist segmentation reference count leak
Xin Long says:
 On udp rx path udp_rcv_segment() may do segment where the frag skbs
 will get the header copied from the head skb in skb_segment_list()
 by calling __copy_skb_header(), which could overwrite the frag skbs'
 extensions by __skb_ext_copy() and cause a leak.

 This issue was found after loading esp_offload where a sec path ext
 is set in the skb.

Fix this by discarding head state of the fraglist skb before replacing
its contents.

Fixes: 3a1296a38d ("net: Support GRO/GSO fraglist chaining.")
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Reported-by: Xiumei Mu <xmu@redhat.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 10:41:58 -07:00
Yadu Kishore 1454c9fa91 net: Make skb_segment not to compute checksum if network controller supports checksumming
Problem:
TCP checksum in the output path is not being offloaded during GSO
in the following case:
The network driver does not support scatter-gather but supports
checksum offload with NETIF_F_HW_CSUM.

Cause:
skb_segment calls skb_copy_and_csum_bits if the network driver
does not announce NETIF_F_SG. It does not check if the driver
supports NETIF_F_HW_CSUM.
So for devices which might want to offload checksum but do not support SG
there is currently no way to do so if GSO is enabled.

Solution:
In skb_segment check if the network controller does checksum and if so
call skb_copy_bits instead of skb_copy_and_csum_bits.

Testing:
Without the patch, ran iperf TCP traffic with NETIF_F_HW_CSUM enabled
in the network driver. Observed the TCP checksum offload is not happening
since the skbs received by the driver in the output path have
skb->ip_summed set to CHECKSUM_NONE.

With the patch ran iperf TCP traffic and observed that TCP checksum
is being offloaded with skb->ip_summed set to CHECKSUM_PARTIAL.
Also tested with the patch by disabling NETIF_F_HW_CSUM in the driver
to cover the newly introduced if-else code path in skb_segment.

Link: https://lore.kernel.org/netdev/CA+FuTSeYGYr3Umij+Mezk9CUcaxYwqEe5sPSuXF8jPE2yMFJAw@mail.gmail.com
Signed-off-by: Yadu Kishore <kyk.segfault@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 12:59:53 -07:00
Kees Cook 161d179261 net: core: Distribute switch variables for initialization
Variables declared in a switch statement before any case statements
cannot be automatically initialized with compiler instrumentation (as
they are not part of any execution flow). With GCC's proposed automatic
stack variable initialization feature, this triggers a warning (and they
don't get initialized). Clang's automatic stack variable initialization
(via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
doesn't initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase,
so even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.

To avoid these problems, move such variables into the "case" where
they're used or lift them up into the main function body.

net/core/skbuff.c: In function ‘skb_checksum_setup_ip’:
net/core/skbuff.c:4809:7: warning: statement will never be executed [-Wswitch-unreachable]
 4809 |   int err;
      |       ^~~

[1] https://bugs.llvm.org/show_bug.cgi?id=44916

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:00:19 -08:00
Randy Dunlap 8955b4357d skbuff: remove stale bit mask comments
Remove stale comments since this flag is no longer a bit mask
but is a bit field.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:49:51 -08:00
Steffen Klassert 3a1296a38d net: Support GRO/GSO fraglist chaining.
This patch adds the core functions to chain/unchain
GSO skbs at the frag_list pointer. This also adds
a new GSO type SKB_GSO_FRAGLIST and a is_flist
flag to napi_gro_cb which indicates that this
flow will be GROed by fraglist chaining.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:00:21 +01:00
Paolo Abeni 8b69a80381 skb: add helpers to allocate ext independently from sk_buff
Currently we can allocate the extension only after the skb,
this change allows the user to do the opposite, will simplify
allocation failure handling from MPTCP.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:42 -08:00
Mat Martineau 3ee17bc78e mptcp: Add MPTCP to skb extensions
Add enum value for MPTCP and update config dependencies

v5 -> v6:
 - fixed '__unused' field size

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:41 -08:00
Martin Varghese 76f99f987f net: Rephrased comments section of skb_mpls_pop()
Rephrased comments section of skb_mpls_pop() to align it with
comments section of skb_mpls_push().

Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 22:24:45 -08:00
Martin Varghese e7dbfed1ad net: skb_mpls_push() modified to allow MPLS header push at start of packet.
The existing skb_mpls_push() implementation always inserts mpls header
after the mac header. L2 VPN use cases requires MPLS header to be
inserted before the ethernet header as the ethernet packet gets tunnelled
inside MPLS header in those cases.

Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 22:24:45 -08:00
Martin Varghese d04ac224b1 net: Fixed updating of ethertype in skb_mpls_push()
The skb_mpls_push was not updating ethertype of an ethernet packet if
the packet was originally received from a non ARPHRD_ETHER device.

In the below OVS data path flow, since the device corresponding to
port 7 is an l3 device (ARPHRD_NONE) the skb_mpls_push function does
not update the ethertype of the packet even though the previous
push_eth action had added an ethernet header to the packet.

recirc_id(0),in_port(7),eth_type(0x0800),ipv4(tos=0/0xfc,ttl=64,frag=no),
actions:push_eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00),
push_mpls(label=13,tc=0,ttl=64,bos=1,eth_type=0x8847),4

Fixes: 8822e270d6 ("net: core: move push MPLS functionality from OvS to core helper")
Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-04 17:11:25 -08:00
Martin Varghese 040b5cfbce Fixed updating of ethertype in function skb_mpls_pop
The skb_mpls_pop was not updating ethertype of an ethernet packet if the
packet was originally received from a non ARPHRD_ETHER device.

In the below OVS data path flow, since the device corresponding to port 7
is an l3 device (ARPHRD_NONE) the skb_mpls_pop function does not update
the ethertype of the packet even though the previous push_eth action had
added an ethernet header to the packet.

recirc_id(0),in_port(7),eth_type(0x8847),
mpls(label=12/0xfffff,tc=0/0,ttl=0/0x0,bos=1/1),
actions:push_eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00),
pop_mpls(eth_type=0x800),4

Fixes: ed246cee09 ("net: core: move pop MPLS functionality from OvS to core helper")
Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-02 13:03:50 -08:00
Davide Caratti fa4e0f8855 net/sched: fix corrupted L2 header with MPLS 'push' and 'pop' actions
the following script:

 # tc qdisc add dev eth0 clsact
 # tc filter add dev eth0 egress protocol ip matchall \
 > action mpls push protocol mpls_uc label 0x355aa bos 1

causes corruption of all IP packets transmitted by eth0. On TC egress, we
can't rely on the value of skb->mac_len, because it's 0 and a MPLS 'push'
operation will result in an overwrite of the first 4 octets in the packet
L2 header (e.g. the Destination Address if eth0 is an Ethernet); the same
error pattern is present also in the MPLS 'pop' operation. Fix this error
in act_mpls data plane, computing 'mac_len' as the difference between the
network header and the mac header (when not at TC ingress), and use it in
MPLS 'push'/'pop' core functions.

v2: unbreak 'make htmldocs' because of missing documentation of 'mac_len'
    in skb_mpls_pop(), reported by kbuild test robot

CC: Lorenzo Bianconi <lorenzo@kernel.org>
Fixes: 2a2ea50870 ("net: sched: add mpls manipulation actions to TC")
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-15 17:14:48 -07:00
Davide Caratti dedc5a08da net: avoid errors when trying to pop MLPS header on non-MPLS packets
the following script:

 # tc qdisc add dev eth0 clsact
 # tc filter add dev eth0 egress matchall action mpls pop

implicitly makes the kernel drop all packets transmitted by eth0, if they
don't have a MPLS header. This behavior is uncommon: other encapsulations
(like VLAN) just let the packet pass unmodified. Since the result of MPLS
'pop' operation would be the same regardless of the presence / absence of
MPLS header(s) in the original packet, we can let skb_mpls_pop() return 0
when dealing with non-MPLS packets.

For the OVS use-case, this is acceptable because __ovs_nla_copy_actions()
already ensures that MPLS 'pop' operation only occurs with packets having
an MPLS Ethernet type (and there are no other callers in current code, so
the semantic change should be ok).

v2: better documentation of use-cases for skb_mpls_pop(), thanks to Simon
    Horman

Fixes: 2a2ea50870 ("net: sched: add mpls manipulation actions to TC")
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-15 17:14:48 -07:00
Eric Dumazet ebb3b78db7 tcp: annotate sk->sk_rcvbuf lockless reads
For the sake of tcp_poll(), there are few places where we fetch
sk->sk_rcvbuf while this field can change from IRQ or other cpu.

We need to add READ_ONCE() annotations, and also make sure write
sides use corresponding WRITE_ONCE() to avoid store-tearing.

Note that other transports probably need similar fixes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 10:13:08 -07:00
Florian Westphal 895b5c9f20 netfilter: drop bridge nf reset from nf_reset
commit 174e23810c
("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
recycle always drop skb extensions.  The additional skb_ext_del() that is
performed via nf_reset on napi skb recycle is not needed anymore.

Most nf_reset() calls in the stack are there so queued skb won't block
'rmmod nf_conntrack' indefinitely.

This removes the skb_ext_del from nf_reset, and renames it to a more
fitting nf_reset_ct().

In a few selected places, add a call to skb_ext_reset to make sure that
no active extensions remain.

I am submitting this for "net", because we're still early in the release
cycle.  The patch applies to net-next too, but I think the rename causes
needless divergence between those trees.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-01 18:42:15 +02:00
Florian Westphal 174e23810c sk_buff: drop all skb extensions on free and skb scrubbing
Now that we have a 3rd extension, add a new helper that drops the
extension space and use it when we need to scrub an sk_buff.

At this time, scrubbing clears secpath and bridge netfilter data, but
retains the tc skb extension, after this patch all three get cleared.

NAPI reuse/free assumes we can only have a secpath attached to skb, but
it seems better to clear all extensions there as well.

v2: add unlikely hint (Eric Dumazet)

Fixes: 95a7233c45 ("net: openvswitch: Set OvS recirc_id from tc chain index")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 20:40:19 +02:00
David S. Miller aa2eaa8c27 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor overlapping changes in the btusb and ixgbe drivers.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-15 14:17:27 +02:00