Commit graph

564 commits

Author SHA1 Message Date
Paolo Abeni c248b27cfc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-26 10:17:46 +02:00
Pablo Neira Ayuso 2cdaa3eefe netfilter: conntrack: restore IPS_CONFIRMED out of nf_conntrack_hash_check_insert()
e6d57e9ff0 ("netfilter: conntrack: fix rmmod double-free race")
consolidates IPS_CONFIRMED bit set in nf_conntrack_hash_check_insert().
However, this breaks ctnetlink:

 # conntrack -I -p tcp --timeout 123 --src 1.2.3.4 --dst 5.6.7.8 --state ESTABLISHED --sport 1 --dport 4 -u SEEN_REPLY
 conntrack v1.4.6 (conntrack-tools): Operation failed: Device or resource busy

This is a partial revert of the aforementioned commit to restore
IPS_CONFIRMED.

Fixes: e6d57e9ff0 ("netfilter: conntrack: fix rmmod double-free race")
Reported-by: Stéphane Graber <stgraber@stgraber.org>
Tested-by: Stéphane Graber <stgraber@stgraber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-19 10:07:59 +02:00
Jakub Kicinski d0928c1c5b Merge branch 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:

====================
Netfilter updates for net-next

1. nf_tables 'brouting' support, from Sriram Yagnaraman.

2. Update bridge netfilter and ovs conntrack helpers to handle
   IPv6 Jumbo packets properly, i.e. fetch the packet length
   from hop-by-hop extension header, from Xin Long.

   This comes with a test BIG TCP test case, added to
   tools/testing/selftests/net/.

3. Fix spelling and indentation in conntrack, from Jeremy Sowden.

* 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nat: fix indentation of function arguments
  netfilter: conntrack: fix typo
  selftests: add a selftest for big tcp
  netfilter: use nf_ip6_check_hbh_len in nf_ct_skb_network_trim
  netfilter: move br_nf_check_hbh_len to utils
  netfilter: bridge: move pskb_trim_rcsum out of br_nf_check_hbh_len
  netfilter: bridge: check len before accessing more nh data
  netfilter: bridge: call pskb_may_pull in br_nf_check_hbh_len
  netfilter: bridge: introduce broute meta statement
====================

Link: https://lore.kernel.org/r/20230308193033.13965-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09 23:28:53 -08:00
Jeremy Sowden e5d015a114 netfilter: conntrack: fix typo
There's a spelling mistake in a comment.  Fix it.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-08 14:25:43 +01:00
Eric Dumazet c77737b736 netfilter: conntrack: adopt safer max chain length
Customers using GKE 1.25 and 1.26 are facing conntrack issues
root caused to commit c9c3b6811f ("netfilter: conntrack: make
max chain length random").

Even if we assume Uniform Hashing, a bucket often reachs 8 chained
items while the load factor of the hash table is smaller than 0.5

With a limit of 16, we reach load factors of 3.
With a limit of 32, we reach load factors of 11.
With a limit of 40, we reach load factors of 15.
With a limit of 50, we reach load factors of 24.

This patch changes MIN_CHAINLEN to 50, to minimize risks.

Ideally, we could in the future add a cushion based on expected
load factor (2 * nf_conntrack_max / nf_conntrack_buckets),
because some setups might expect unusual values.

Fixes: c9c3b6811f ("netfilter: conntrack: make max chain length random")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-03-07 10:58:06 +01:00
Jakub Kicinski fd2a55e74a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Fix broken listing of set elements when table has an owner.

2) Fix conntrack refcount leak in ctnetlink with related conntrack
   entries, from Hangyu Hua.

3) Fix use-after-free/double-free in ctnetlink conntrack insert path,
   from Florian Westphal.

4) Fix ip6t_rpfilter with VRF, from Phil Sutter.

5) Fix use-after-free in ebtables reported by syzbot, also from Florian.

6) Use skb->len in xt_length to deal with IPv6 jumbo packets,
   from Xin Long.

7) Fix NETLINK_LISTEN_ALL_NSID with ctnetlink, from Florian Westphal.

8) Fix memleak in {ip_,ip6_,arp_}tables in ENOMEM error case,
   from Pavel Tikhomirov.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: x_tables: fix percpu counter block leak on error path when creating new netns
  netfilter: ctnetlink: make event listener tracking global
  netfilter: xt_length: use skb len to match in length_mt6
  netfilter: ebtables: fix table blob use-after-free
  netfilter: ip6t_rpfilter: Fix regression with VRF interfaces
  netfilter: conntrack: fix rmmod double-free race
  netfilter: ctnetlink: fix possible refcount leak in ctnetlink_create_conntrack()
  netfilter: nf_tables: allow to fetch set elements when table has an owner
====================

Link: https://lore.kernel.org/r/20230222092137.88637-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-22 21:25:23 -08:00
Florian Westphal e6d57e9ff0 netfilter: conntrack: fix rmmod double-free race
nf_conntrack_hash_check_insert() callers free the ct entry directly, via
nf_conntrack_free.

This isn't safe anymore because
nf_conntrack_hash_check_insert() might place the entry into the conntrack
table and then delteted the entry again because it found that a conntrack
extension has been removed at the same time.

In this case, the just-added entry is removed again and an error is
returned to the caller.

Problem is that another cpu might have picked up this entry and
incremented its reference count.

This results in a use-after-free/double-free, once by the other cpu and
once by the caller of nf_conntrack_hash_check_insert().

Fix this by making nf_conntrack_hash_check_insert() not fail anymore
after the insertion, just like before the 'Fixes' commit.

This is safe because a racing nf_ct_iterate() has to wait for us
to release the conntrack hash spinlocks.

While at it, make the function return -EAGAIN in the rmmod (genid
changed) case, this makes nfnetlink replay the command (suggested
by Pablo Neira).

Fixes: c56716c69c ("netfilter: extensions: introduce extension genid count")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-02-22 00:19:38 +01:00
David S. Miller 1155a2281d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Add safeguard to check for NULL tupe in objects updates via
   NFT_MSG_NEWOBJ, this should not ever happen. From Alok Tiwari.

2) Incorrect pointer check in the new destroy rule command,
   from Yang Yingliang.

3) Incorrect status bitcheck in nf_conntrack_udp_packet(),
   from Florian Westphal.

4) Simplify seq_print_acct(), from Ilia Gavrilov.

5) Use 2-arg optimal variant of kfree_rcu() in IPVS,
   from Julian Anastasov.

6) TCP connection enters CLOSE state in conntrack for locally
   originated TCP reset packet from the reject target,
   from Florian Westphal.

The fixes #2 and #3 in this series address issues from the previous pull
nf-next request in this net-next cycle.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20 10:53:56 +00:00
Florian Westphal 2954fe60e3 netfilter: let reset rules clean out conntrack entries
iptables/nftables support responding to tcp packets with tcp resets.

The generated tcp reset packet passes through both output and postrouting
netfilter hooks, but conntrack will never see them because the generated
skb has its ->nfct pointer copied over from the packet that triggered the
reset rule.

If the reset rule is used for established connections, this
may result in the conntrack entry to be around for a very long
time (default timeout is 5 days).

One way to avoid this would be to not copy the nf_conn pointer
so that the rest packet passes through conntrack too.

Problem is that output rules might not have the same conntrack
zone setup as the prerouting ones, so its possible that the
reset skb won't find the correct entry.  Generating a template
entry for the skb seems error prone as well.

Add an explicit "closing" function that switches a confirmed
conntrack entry to closed state and wire this up for tcp.

If the entry isn't confirmed, no action is needed because
the conntrack entry will never be committed to the table.

Reported-by: Russel King <linux@armlinux.org.uk>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-02-17 13:04:56 +01:00
Vlad Buslov df25455e5a netfilter: nf_conntrack: allow early drop of offloaded UDP conns
Both synchronous early drop algorithm and asynchronous gc worker completely
ignore connections with IPS_OFFLOAD_BIT status bit set. With new
functionality that enabled UDP NEW connection offload in action CT
malicious user can flood the conntrack table with offloaded UDP connections
by just sending a single packet per 5tuple because such connections can no
longer be deleted by early drop algorithm.

To mitigate the issue allow both early drop and gc to consider offloaded
UDP connections for deletion.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03 09:31:24 +00:00
Florian Westphal 2a2fa2efc6 netfilter: conntrack: move rcu read lock to nf_conntrack_find_get
Move rcu_read_lock/unlock to nf_conntrack_find_get(), this avoids
nested rcu_read_lock call from resolve_normal_ct().

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-01-18 13:05:24 +01:00
Florian Westphal 4883ec512c netfilter: conntrack: avoid reload of ct->status
Compiler can't merge the two test_bit() calls, so load ct->status
once and use non-atomic accesses.

This is fine because IPS_EXPECTED or NAT_CLASH are either set at ct
creation time or not at all, but compiler can't know that.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-01-18 13:05:24 +01:00
Florian Westphal 50bfbb8957 netfilter: conntrack: remove pr_debug calls
Those are all useless or dubious.
getorigdst() is called via setsockopt, so return value/errno will
already indicate an appropriate error.

For other pr_debug calls there are better replacements, such as
slab/slub debugging or 'conntrack -E' (ctnetlink events).

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-01-18 13:05:24 +01:00
Linus Torvalds 7e68dd7d07 Networking changes for 6.2.
Core
 ----
  - Allow live renaming when an interface is up
 
  - Add retpoline wrappers for tc, improving considerably the
    performances of complex queue discipline configurations.
 
  - Add inet drop monitor support.
 
  - A few GRO performance improvements.
 
  - Add infrastructure for atomic dev stats, addressing long standing
    data races.
 
  - De-duplicate common code between OVS and conntrack offloading
    infrastructure.
 
  - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements.
 
  - Netfilter: introduce packet parser for tunneled packets
 
  - Replace IPVS timer-based estimators with kthreads to scale up
    the workload with the number of available CPUs.
 
  - Add the helper support for connection-tracking OVS offload.
 
 BPF
 ---
  - Support for user defined BPF objects: the use case is to allocate
    own objects, build own object hierarchies and use the building
    blocks to build own data structures flexibly, for example, linked
    lists in BPF.
 
  - Make cgroup local storage available to non-cgroup attached BPF
    programs.
 
  - Avoid unnecessary deadlock detection and failures wrt BPF task
    storage helpers.
 
  - A relevant bunch of BPF verifier fixes and improvements.
 
  - Veristat tool improvements to support custom filtering, sorting,
    and replay of results.
 
  - Add LLVM disassembler as default library for dumping JITed code.
 
  - Lots of new BPF documentation for various BPF maps.
 
  - Add bpf_rcu_read_{,un}lock() support for sleepable programs.
 
  - Add RCU grace period chaining to BPF to wait for the completion
    of access from both sleepable and non-sleepable BPF programs.
 
  - Add support storing struct task_struct objects as kptrs in maps.
 
  - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
    values.
 
  - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions.
 
 Protocols
 ---------
  - TCP: implement Protective Load Balancing across switch links.
 
  - TCP: allow dynamically disabling TCP-MD5 static key, reverting
    back to fast[er]-path.
 
  - UDP: Introduce optional per-netns hash lookup table.
 
  - IPv6: simplify and cleanup sockets disposal.
 
  - Netlink: support different type policies for each generic
    netlink operation.
 
  - MPTCP: add MSG_FASTOPEN and FastOpen listener side support.
 
  - MPTCP: add netlink notification support for listener sockets
    events.
 
  - SCTP: add VRF support, allowing sctp sockets binding to VRF
    devices.
 
  - Add bridging MAC Authentication Bypass (MAB) support.
 
  - Extensions for Ethernet VPN bridging implementation to better
    support multicast scenarios.
 
  - More work for Wi-Fi 7 support, comprising conversion of all
    the existing drivers to internal TX queue usage.
 
  - IPSec: introduce a new offload type (packet offload) allowing
    complete header processing and crypto offloading.
 
  - IPSec: extended ack support for more descriptive XFRM error
    reporting.
 
  - RXRPC: increase SACK table size and move processing into a
    per-local endpoint kernel thread, reducing considerably the
    required locking.
 
  - IEEE 802154: synchronous send frame and extended filtering
    support, initial support for scanning available 15.4 networks.
 
  - Tun: bump the link speed from 10Mbps to 10Gbps.
 
  - Tun/VirtioNet: implement UDP segmentation offload support.
 
 Driver API
 ----------
 
  - PHY/SFP: improve power level switching between standard
    level 1 and the higher power levels.
 
  - New API for netdev <-> devlink_port linkage.
 
  - PTP: convert existing drivers to new frequency adjustment
    implementation.
 
  - DSA: add support for rx offloading.
 
  - Autoload DSA tagging driver when dynamically changing protocol.
 
  - Add new PCP and APPTRUST attributes to Data Center Bridging.
 
  - Add configuration support for 800Gbps link speed.
 
  - Add devlink port function attribute to enable/disable RoCE and
    migratable.
 
  - Extend devlink-rate to support strict prioriry and weighted fair
    queuing.
 
  - Add devlink support to directly reading from region memory.
 
  - New device tree helper to fetch MAC address from nvmem.
 
  - New big TCP helper to simplify temporary header stripping.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Marvel Octeon CNF95N and CN10KB Ethernet Switches.
    - Marvel Prestera AC5X Ethernet Switch.
    - WangXun 10 Gigabit NIC.
    - Motorcomm yt8521 Gigabit Ethernet.
    - Microchip ksz9563 Gigabit Ethernet Switch.
    - Microsoft Azure Network Adapter.
    - Linux Automation 10Base-T1L adapter.
 
  - PHY:
    - Aquantia AQR112 and AQR412.
    - Motorcomm YT8531S.
 
  - PTP:
    - Orolia ART-CARD.
 
  - WiFi:
    - MediaTek Wi-Fi 7 (802.11be) devices.
    - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
      devices.
 
  - Bluetooth:
    - Broadcom BCM4377/4378/4387 Bluetooth chipsets.
    - Realtek RTL8852BE and RTL8723DS.
    - Cypress.CYW4373A0 WiFi + Bluetooth combo device.
 
 Drivers
 -------
  - CAN:
    - gs_usb: bus error reporting support.
    - kvaser_usb: listen only and bus error reporting support.
 
  - Ethernet NICs:
    - Intel (100G):
      - extend action skbedit to RX queue mapping.
      - implement devlink-rate support.
      - support direct read from memory.
    - nVidia/Mellanox (mlx5):
      - SW steering improvements, increasing rules update rate.
      - Support for enhanced events compression.
      - extend H/W offload packet manipulation capabilities.
      - implement IPSec packet offload mode.
    - nVidia/Mellanox (mlx4):
      - better big TCP support.
    - Netronome Ethernet NICs (nfp):
      - IPsec offload support.
      - add support for multicast filter.
    - Broadcom:
      - RSS and PTP support improvements.
    - AMD/SolarFlare:
      - netlink extened ack improvements.
      - add basic flower matches to offload, and related stats.
    - Virtual NICs:
      - ibmvnic: introduce affinity hint support.
    - small / embedded:
      - FreeScale fec: add initial XDP support.
      - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood.
      - TI am65-cpsw: add suspend/resume support.
      - Mediatek MT7986: add RX wireless wthernet dispatch support.
      - Realtek 8169: enable GRO software interrupt coalescing per
        default.
 
  - Ethernet high-speed switches:
    - Microchip (sparx5):
      - add support for Sparx5 TC/flower H/W offload via VCAP.
    - Mellanox mlxsw:
      - add 802.1X and MAC Authentication Bypass offload support.
      - add ip6gre support.
 
  - Embedded Ethernet switches:
    - Mediatek (mtk_eth_soc):
      - improve PCS implementation, add DSA untag support.
      - enable flow offload support.
    - Renesas:
      - add rswitch R-Car Gen4 gPTP support.
    - Microchip (lan966x):
      - add full XDP support.
      - add TC H/W offload via VCAP.
      - enable PTP on bridge interfaces.
    - Microchip (ksz8):
      - add MTU support for KSZ8 series.
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - support configuring channel dwell time during scan.
 
  - MediaTek WiFi (mt76):
    - enable Wireless Ethernet Dispatch (WED) offload support.
    - add ack signal support.
    - enable coredump support.
    - remain_on_channel support.
 
  - Intel WiFi (iwlwifi):
    - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities.
    - 320 MHz channels support.
 
  - RealTek WiFi (rtw89):
    - new dynamic header firmware format support.
    - wake-over-WLAN support.
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmOYXUcSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOk8zQP/R7BZtbJMTPiWkRnSoKHnAyupDVwrz5U
 ktukLkwPsCyJuEbAjgxrxf4EEEQ9uq2FFlxNSYuKiiQMqIpFxV6KED7LCUygn4Tc
 kxtkp0Q+5XiqisWlQmtfExf2OjuuPqcjV9tWCDBI6GebKUbfNwY/eI44RcMu4BSv
 DzIlW5GkX/kZAPqnnuqaLsN3FudDTJHGEAD7NbA++7wJ076RWYSLXlFv0Z+SCSPS
 H8/PEG0/ZK/65rIWMAFRClJ9BNIDwGVgp0GrsIvs1gqbRUOlA1hl1rDM21TqtNFf
 5QPQT7sIfTcCE/nerxKJD5JE3JyP+XRlRn96PaRw3rt4MgI6I/EOj/HOKQ5tMCNc
 oPiqb7N70+hkLZyr42qX+vN9eDPjp2koEQm7EO2Zs+/534/zWDs24Zfk/Aa1ps0I
 Fa82oGjAgkBhGe/FZ6i5cYoLcyxqRqZV1Ws9XQMl72qRC7/BwvNbIW6beLpCRyeM
 yYIU+0e9dEm+wHQEdh2niJuVtR63hy8tvmPx56lyh+6u0+pondkwbfSiC5aD3kAC
 ikKsN5DyEsdXyiBAlytCEBxnaOjQy4RAz+3YXSiS0eBNacXp03UUrNGx4Pzpu/D0
 QLFJhBnMFFCgy5to8/DvKnrTPgZdSURwqbIUcZdvU21f1HLR8tUTpaQnYffc/Whm
 V8gnt1EL+0cc
 =CbJC
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Allow live renaming when an interface is up

   - Add retpoline wrappers for tc, improving considerably the
     performances of complex queue discipline configurations

   - Add inet drop monitor support

   - A few GRO performance improvements

   - Add infrastructure for atomic dev stats, addressing long standing
     data races

   - De-duplicate common code between OVS and conntrack offloading
     infrastructure

   - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements

   - Netfilter: introduce packet parser for tunneled packets

   - Replace IPVS timer-based estimators with kthreads to scale up the
     workload with the number of available CPUs

   - Add the helper support for connection-tracking OVS offload

  BPF:

   - Support for user defined BPF objects: the use case is to allocate
     own objects, build own object hierarchies and use the building
     blocks to build own data structures flexibly, for example, linked
     lists in BPF

   - Make cgroup local storage available to non-cgroup attached BPF
     programs

   - Avoid unnecessary deadlock detection and failures wrt BPF task
     storage helpers

   - A relevant bunch of BPF verifier fixes and improvements

   - Veristat tool improvements to support custom filtering, sorting,
     and replay of results

   - Add LLVM disassembler as default library for dumping JITed code

   - Lots of new BPF documentation for various BPF maps

   - Add bpf_rcu_read_{,un}lock() support for sleepable programs

   - Add RCU grace period chaining to BPF to wait for the completion of
     access from both sleepable and non-sleepable BPF programs

   - Add support storing struct task_struct objects as kptrs in maps

   - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
     values

   - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions

  Protocols:

   - TCP: implement Protective Load Balancing across switch links

   - TCP: allow dynamically disabling TCP-MD5 static key, reverting back
     to fast[er]-path

   - UDP: Introduce optional per-netns hash lookup table

   - IPv6: simplify and cleanup sockets disposal

   - Netlink: support different type policies for each generic netlink
     operation

   - MPTCP: add MSG_FASTOPEN and FastOpen listener side support

   - MPTCP: add netlink notification support for listener sockets events

   - SCTP: add VRF support, allowing sctp sockets binding to VRF devices

   - Add bridging MAC Authentication Bypass (MAB) support

   - Extensions for Ethernet VPN bridging implementation to better
     support multicast scenarios

   - More work for Wi-Fi 7 support, comprising conversion of all the
     existing drivers to internal TX queue usage

   - IPSec: introduce a new offload type (packet offload) allowing
     complete header processing and crypto offloading

   - IPSec: extended ack support for more descriptive XFRM error
     reporting

   - RXRPC: increase SACK table size and move processing into a
     per-local endpoint kernel thread, reducing considerably the
     required locking

   - IEEE 802154: synchronous send frame and extended filtering support,
     initial support for scanning available 15.4 networks

   - Tun: bump the link speed from 10Mbps to 10Gbps

   - Tun/VirtioNet: implement UDP segmentation offload support

  Driver API:

   - PHY/SFP: improve power level switching between standard level 1 and
     the higher power levels

   - New API for netdev <-> devlink_port linkage

   - PTP: convert existing drivers to new frequency adjustment
     implementation

   - DSA: add support for rx offloading

   - Autoload DSA tagging driver when dynamically changing protocol

   - Add new PCP and APPTRUST attributes to Data Center Bridging

   - Add configuration support for 800Gbps link speed

   - Add devlink port function attribute to enable/disable RoCE and
     migratable

   - Extend devlink-rate to support strict prioriry and weighted fair
     queuing

   - Add devlink support to directly reading from region memory

   - New device tree helper to fetch MAC address from nvmem

   - New big TCP helper to simplify temporary header stripping

  New hardware / drivers:

   - Ethernet:
      - Marvel Octeon CNF95N and CN10KB Ethernet Switches
      - Marvel Prestera AC5X Ethernet Switch
      - WangXun 10 Gigabit NIC
      - Motorcomm yt8521 Gigabit Ethernet
      - Microchip ksz9563 Gigabit Ethernet Switch
      - Microsoft Azure Network Adapter
      - Linux Automation 10Base-T1L adapter

   - PHY:
      - Aquantia AQR112 and AQR412
      - Motorcomm YT8531S

   - PTP:
      - Orolia ART-CARD

   - WiFi:
      - MediaTek Wi-Fi 7 (802.11be) devices
      - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
        devices

   - Bluetooth:
      - Broadcom BCM4377/4378/4387 Bluetooth chipsets
      - Realtek RTL8852BE and RTL8723DS
      - Cypress.CYW4373A0 WiFi + Bluetooth combo device

  Drivers:

   - CAN:
      - gs_usb: bus error reporting support
      - kvaser_usb: listen only and bus error reporting support

   - Ethernet NICs:
      - Intel (100G):
         - extend action skbedit to RX queue mapping
         - implement devlink-rate support
         - support direct read from memory
      - nVidia/Mellanox (mlx5):
         - SW steering improvements, increasing rules update rate
         - Support for enhanced events compression
         - extend H/W offload packet manipulation capabilities
         - implement IPSec packet offload mode
      - nVidia/Mellanox (mlx4):
         - better big TCP support
      - Netronome Ethernet NICs (nfp):
         - IPsec offload support
         - add support for multicast filter
      - Broadcom:
         - RSS and PTP support improvements
      - AMD/SolarFlare:
         - netlink extened ack improvements
         - add basic flower matches to offload, and related stats
      - Virtual NICs:
         - ibmvnic: introduce affinity hint support
      - small / embedded:
         - FreeScale fec: add initial XDP support
         - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood
         - TI am65-cpsw: add suspend/resume support
         - Mediatek MT7986: add RX wireless wthernet dispatch support
         - Realtek 8169: enable GRO software interrupt coalescing per
           default

   - Ethernet high-speed switches:
      - Microchip (sparx5):
         - add support for Sparx5 TC/flower H/W offload via VCAP
      - Mellanox mlxsw:
         - add 802.1X and MAC Authentication Bypass offload support
         - add ip6gre support

   - Embedded Ethernet switches:
      - Mediatek (mtk_eth_soc):
         - improve PCS implementation, add DSA untag support
         - enable flow offload support
      - Renesas:
         - add rswitch R-Car Gen4 gPTP support
      - Microchip (lan966x):
         - add full XDP support
         - add TC H/W offload via VCAP
         - enable PTP on bridge interfaces
      - Microchip (ksz8):
         - add MTU support for KSZ8 series

   - Qualcomm 802.11ax WiFi (ath11k):
      - support configuring channel dwell time during scan

   - MediaTek WiFi (mt76):
      - enable Wireless Ethernet Dispatch (WED) offload support
      - add ack signal support
      - enable coredump support
      - remain_on_channel support

   - Intel WiFi (iwlwifi):
      - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities
      - 320 MHz channels support

   - RealTek WiFi (rtw89):
      - new dynamic header firmware format support
      - wake-over-WLAN support"

* tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits)
  ipvs: fix type warning in do_div() on 32 bit
  net: lan966x: Remove a useless test in lan966x_ptp_add_trap()
  net: ipa: add IPA v4.7 support
  dt-bindings: net: qcom,ipa: Add SM6350 compatible
  bnxt: Use generic HBH removal helper in tx path
  IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver
  selftests: forwarding: Add bridge MDB test
  selftests: forwarding: Rename bridge_mdb test
  bridge: mcast: Support replacement of MDB port group entries
  bridge: mcast: Allow user space to specify MDB entry routing protocol
  bridge: mcast: Allow user space to add (*, G) with a source list and filter mode
  bridge: mcast: Add support for (*, G) with a source list and filter mode
  bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source
  bridge: mcast: Add a flag for user installed source entries
  bridge: mcast: Expose __br_multicast_del_group_src()
  bridge: mcast: Expose br_multicast_new_group_src()
  bridge: mcast: Add a centralized error path
  bridge: mcast: Place netlink policy before validation functions
  bridge: mcast: Split (*, G) and (S, G) addition into different functions
  bridge: mcast: Do not derive entry type from its filter mode
  ...
2022-12-13 15:47:48 -08:00
Linus Torvalds 268325bda5 Random number generator updates for Linux 6.2-rc1.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmOU+U8ACgkQSfxwEqXe
 A67NnQ//Y5DltmvibyPd7r1TFT2gUYv+Rx3sUV9ZE1NYptd/SWhhcL8c5FZ70Fuw
 bSKCa1uiWjOxosjXT1kGrWq3de7q7oUpAPSOGxgxzoaNURIt58N/ajItCX/4Au8I
 RlGAScHy5e5t41/26a498kB6qJ441fBEqCYKQpPLINMBAhe8TQ+NVp0rlpUwNHFX
 WrUGg4oKWxdBIW3HkDirQjJWDkkAiklRTifQh/Al4b6QDbOnRUGGCeckNOhixsvS
 waHWTld+Td8jRrA4b82tUb2uVZ2/b8dEvj/A8CuTv4yC0lywoyMgBWmJAGOC+UmT
 ZVNdGW02Jc2T+Iap8ZdsEmeLHNqbli4+IcbY5xNlov+tHJ2oz41H9TZoYKbudlr6
 /ReAUPSn7i50PhbQlEruj3eg+M2gjOeh8OF8UKwwRK8PghvyWQ1ScW0l3kUhPIhI
 PdIG6j4+D2mJc1FIj2rTVB+Bg933x6S+qx4zDxGlNp62AARUFYf6EgyD6aXFQVuX
 RxcKb6cjRuFkzFiKc8zkqg5edZH+IJcPNuIBmABqTGBOxbZWURXzIQvK/iULqZa4
 CdGAFIs6FuOh8pFHLI3R4YoHBopbHup/xKDEeAO9KZGyeVIuOSERDxxo5f/ITzcq
 APvT77DFOEuyvanr8RMqqh0yUjzcddXqw9+ieufsAyDwjD9DTuE=
 =QRhK
 -----END PGP SIGNATURE-----

Merge tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random

Pull random number generator updates from Jason Donenfeld:

 - Replace prandom_u32_max() and various open-coded variants of it,
   there is now a new family of functions that uses fast rejection
   sampling to choose properly uniformly random numbers within an
   interval:

       get_random_u32_below(ceil) - [0, ceil)
       get_random_u32_above(floor) - (floor, U32_MAX]
       get_random_u32_inclusive(floor, ceil) - [floor, ceil]

   Coccinelle was used to convert all current users of
   prandom_u32_max(), as well as many open-coded patterns, resulting in
   improvements throughout the tree.

   I'll have a "late" 6.1-rc1 pull for you that removes the now unused
   prandom_u32_max() function, just in case any other trees add a new
   use case of it that needs to converted. According to linux-next,
   there may be two trivial cases of prandom_u32_max() reintroductions
   that are fixable with a 's/.../.../'. So I'll have for you a final
   conversion patch doing that alongside the removal patch during the
   second week.

   This is a treewide change that touches many files throughout.

 - More consistent use of get_random_canary().

 - Updates to comments, documentation, tests, headers, and
   simplification in configuration.

 - The arch_get_random*_early() abstraction was only used by arm64 and
   wasn't entirely useful, so this has been replaced by code that works
   in all relevant contexts.

 - The kernel will use and manage random seeds in non-volatile EFI
   variables, refreshing a variable with a fresh seed when the RNG is
   initialized. The RNG GUID namespace is then hidden from efivarfs to
   prevent accidental leakage.

   These changes are split into random.c infrastructure code used in the
   EFI subsystem, in this pull request, and related support inside of
   EFISTUB, in Ard's EFI tree. These are co-dependent for full
   functionality, but the order of merging doesn't matter.

 - Part of the infrastructure added for the EFI support is also used for
   an improvement to the way vsprintf initializes its siphash key,
   replacing an sleep loop wart.

 - The hardware RNG framework now always calls its correct random.c
   input function, add_hwgenerator_randomness(), rather than sometimes
   going through helpers better suited for other cases.

 - The add_latent_entropy() function has long been called from the fork
   handler, but is a no-op when the latent entropy gcc plugin isn't
   used, which is fine for the purposes of latent entropy.

   But it was missing out on the cycle counter that was also being mixed
   in beside the latent entropy variable. So now, if the latent entropy
   gcc plugin isn't enabled, add_latent_entropy() will expand to a call
   to add_device_randomness(NULL, 0), which adds a cycle counter,
   without the absent latent entropy variable.

 - The RNG is now reseeded from a delayed worker, rather than on demand
   when used. Always running from a worker allows it to make use of the
   CPU RNG on platforms like S390x, whose instructions are too slow to
   do so from interrupts. It also has the effect of adding in new inputs
   more frequently with more regularity, amounting to a long term
   transcript of random values. Plus, it helps a bit with the upcoming
   vDSO implementation (which isn't yet ready for 6.2).

 - The jitter entropy algorithm now tries to execute on many different
   CPUs, round-robining, in hopes of hitting even more memory latencies
   and other unpredictable effects. It also will mix in a cycle counter
   when the entropy timer fires, in addition to being mixed in from the
   main loop, to account more explicitly for fluctuations in that timer
   firing. And the state it touches is now kept within the same cache
   line, so that it's assured that the different execution contexts will
   cause latencies.

* tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits)
  random: include <linux/once.h> in the right header
  random: align entropy_timer_state to cache line
  random: mix in cycle counter when jitter timer fires
  random: spread out jitter callback to different CPUs
  random: remove extraneous period and add a missing one in comments
  efi: random: refresh non-volatile random seed when RNG is initialized
  vsprintf: initialize siphash key using notifier
  random: add back async readiness notifier
  random: reseed in delayed work rather than on-demand
  random: always mix cycle counter in add_latent_entropy()
  hw_random: use add_hwgenerator_randomness() for early entropy
  random: modernize documentation comment on get_random_bytes()
  random: adjust comment to account for removed function
  random: remove early archrandom abstraction
  random: use random.trust_{bootloader,cpu} command line option only
  stackprotector: actually use get_random_canary()
  stackprotector: move get_random_canary() into stackprotector.h
  treewide: use get_random_u32_inclusive() when possible
  treewide: use get_random_u32_{above,below}() instead of manual loop
  treewide: use get_random_u32_below() instead of deprecated function
  ...
2022-12-12 16:22:22 -08:00
Jakub Kicinski 837e8ac871 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-08 18:19:59 -08:00
Xin Long 9464d0b68f netfilter: conntrack: fix using __this_cpu_add in preemptible
Currently in nf_conntrack_hash_check_insert(), when it fails in
nf_ct_ext_valid_pre/post(), NF_CT_STAT_INC() will be called in the
preemptible context, a call trace can be triggered:

   BUG: using __this_cpu_add() in preemptible [00000000] code: conntrack/1636
   caller is nf_conntrack_hash_check_insert+0x45/0x430 [nf_conntrack]
   Call Trace:
    <TASK>
    dump_stack_lvl+0x33/0x46
    check_preemption_disabled+0xc3/0xf0
    nf_conntrack_hash_check_insert+0x45/0x430 [nf_conntrack]
    ctnetlink_create_conntrack+0x3cd/0x4e0 [nf_conntrack_netlink]
    ctnetlink_new_conntrack+0x1c0/0x450 [nf_conntrack_netlink]
    nfnetlink_rcv_msg+0x277/0x2f0 [nfnetlink]
    netlink_rcv_skb+0x50/0x100
    nfnetlink_rcv+0x65/0x144 [nfnetlink]
    netlink_unicast+0x1ae/0x290
    netlink_sendmsg+0x257/0x4f0
    sock_sendmsg+0x5f/0x70

This patch is to fix it by changing to use NF_CT_STAT_INC_ATOMIC() for
nf_ct_ext_valid_pre/post() check in nf_conntrack_hash_check_insert(),
as well as nf_ct_ext_valid_post() in __nf_conntrack_confirm().

Note that nf_ct_ext_valid_pre() check in __nf_conntrack_confirm() is
safe to use NF_CT_STAT_INC(), as it's under local_bh_disable().

Fixes: c56716c69c ("netfilter: extensions: introduce extension genid count")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-30 13:08:49 +01:00
Jakub Kicinski f2bb566f5c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
tools/lib/bpf/ringbuf.c
  927cbb478a ("libbpf: Handle size overflow for ringbuf mmap")
  b486d19a0a ("libbpf: checkpatch: Fixed code alignments in ringbuf.c")
https://lore.kernel.org/all/20221121122707.44d1446a@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-29 13:04:52 -08:00
Daniel Xu 52d1aa8b82 netfilter: conntrack: Fix data-races around ct mark
nf_conn:mark can be read from and written to in parallel. Use
READ_ONCE()/WRITE_ONCE() for reads and writes to prevent unwanted
compiler optimizations.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-18 15:21:00 +01:00
Jason A. Donenfeld 8032bf1233 treewide: use get_random_u32_below() instead of deprecated function
This is a simple mechanical transformation done by:

@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
  (E)

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-11-18 02:15:15 +01:00
Florian Westphal d2c806abcf netfilter: conntrack: use siphash_4u64
This function is used for every packet, siphash_4u64 is noticeably faster
than using local buffer + siphash:

Before:
  1.23%  kpktgend_0       [kernel.vmlinux]     [k] __siphash_unaligned
  0.14%  kpktgend_0       [nf_conntrack]       [k] hash_conntrack_raw
After:
  0.79%  kpktgend_0       [kernel.vmlinux]     [k] siphash_4u64
  0.15%  kpktgend_0       [nf_conntrack]       [k] hash_conntrack_raw

In the pktgen test this gives about ~2.4% performance improvement.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-15 10:53:19 +01:00
Jakub Kicinski a08d97a193 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2022-10-03

We've added 143 non-merge commits during the last 27 day(s) which contain
a total of 151 files changed, 8321 insertions(+), 1402 deletions(-).

The main changes are:

1) Add kfuncs for PKCS#7 signature verification from BPF programs, from Roberto Sassu.

2) Add support for struct-based arguments for trampoline based BPF programs,
   from Yonghong Song.

3) Fix entry IP for kprobe-multi and trampoline probes under IBT enabled, from Jiri Olsa.

4) Batch of improvements to veristat selftest tool in particular to add CSV output,
   a comparison mode for CSV outputs and filtering, from Andrii Nakryiko.

5) Add preparatory changes needed for the BPF core for upcoming BPF HID support,
   from Benjamin Tissoires.

6) Support for direct writes to nf_conn's mark field from tc and XDP BPF program
   types, from Daniel Xu.

7) Initial batch of documentation improvements for BPF insn set spec, from Dave Thaler.

8) Add a new BPF_MAP_TYPE_USER_RINGBUF map which provides single-user-space-producer /
   single-kernel-consumer semantics for BPF ring buffer, from David Vernet.

9) Follow-up fixes to BPF allocator under RT to always use raw spinlock for the BPF
   hashtab's bucket lock, from Hou Tao.

10) Allow creating an iterator that loops through only the resources of one
    task/thread instead of all, from Kui-Feng Lee.

11) Add support for kptrs in the per-CPU arraymap, from Kumar Kartikeya Dwivedi.

12) Add a new kfunc helper for nf to set src/dst NAT IP/port in a newly allocated CT
    entry which is not yet inserted, from Lorenzo Bianconi.

13) Remove invalid recursion check for struct_ops for TCP congestion control BPF
    programs, from Martin KaFai Lau.

14) Fix W^X issue with BPF trampoline and BPF dispatcher, from Song Liu.

15) Fix percpu_counter leakage in BPF hashtab allocation error path, from Tetsuo Handa.

16) Various cleanups in BPF selftests to use preferred ASSERT_* macros, from Wang Yufen.

17) Add invocation for cgroup/connect{4,6} BPF programs for ICMP pings, from YiFei Zhu.

18) Lift blinding decision under bpf_jit_harden = 1 to bpf_capable(), from Yauheni Kaliuta.

19) Various libbpf fixes and cleanups including a libbpf NULL pointer deref, from Xin Liu.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (143 commits)
  net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c
  Documentation: bpf: Add implementation notes documentations to table of contents
  bpf, docs: Delete misformatted table.
  selftests/xsk: Fix double free
  bpftool: Fix error message of strerror
  libbpf: Fix overrun in netlink attribute iteration
  selftests/bpf: Fix spelling mistake "unpriviledged" -> "unprivileged"
  samples/bpf: Fix typo in xdp_router_ipv4 sample
  bpftool: Remove unused struct event_ring_info
  bpftool: Remove unused struct btf_attach_point
  bpf, docs: Add TOC and fix formatting.
  bpf, docs: Add Clang note about BPF_ALU
  bpf, docs: Move Clang notes to a separate file
  bpf, docs: Linux byteswap note
  bpf, docs: Move legacy packet instructions to a separate file
  selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION)
  bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
  bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function
  bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt()
  bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline
  ...
====================

Link: https://lore.kernel.org/r/20221003194915.11847-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-03 13:02:49 -07:00
Antoine Tenart 2aa1927570 netfilter: conntrack: revisit the gc initial rescheduling bias
The previous commit changed the way the rescheduling delay is computed
which has a side effect: the bias is now represented as much as the
other entries in the rescheduling delay which makes the logic to kick in
only with very large sets, as the initial interval is very large
(INT_MAX).

Revisit the GC initial bias to allow more frequent GC for smaller sets
while still avoiding wakeups when a machine is mostly idle. We're moving
from a large initial value to pretending we have 100 entries expiring at
the upper bound. This way only a few entries having a small timeout
won't impact much the rescheduling delay and non-idle machines will have
enough entries to lower the delay when needed. This also improves
readability as the initial bias is now linked to what is computed
instead of being an arbitrary large value.

Fixes: 2cfadb761d ("netfilter: conntrack: revisit gc autotuning")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2022-09-21 10:44:56 +02:00
Antoine Tenart 95eabdd207 netfilter: conntrack: fix the gc rescheduling delay
Commit 2cfadb761d ("netfilter: conntrack: revisit gc autotuning")
changed the eviction rescheduling to the use average expiry of scanned
entries (within 1-60s) by doing:

  for (...) {
      expires = clamp(nf_ct_expires(tmp), ...);
      next_run += expires;
      next_run /= 2;
  }

The issue is the above will make the average ('next_run' here) more
dependent on the last expiration values than the firsts (for sets > 2).
Depending on the expiration values used to compute the average, the
result can be quite different than what's expected. To fix this we can
do the following:

  for (...) {
      expires = clamp(nf_ct_expires(tmp), ...);
      next_run += (expires - next_run) / ++count;
  }

Fixes: 2cfadb761d ("netfilter: conntrack: revisit gc autotuning")
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2022-09-21 10:44:56 +02:00
Daniel Xu 864b656f82 bpf: Add support for writing to nf_conn:mark
Support direct writes to nf_conn:mark from TC and XDP prog types. This
is useful when applications want to store per-connection metadata. This
is also particularly useful for applications that run both bpf and
iptables/nftables because the latter can trivially access this metadata.

One example use case would be if a bpf prog is responsible for advanced
packet classification and iptables/nftables is later used for routing
due to pre-existing/legacy code.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/ebca06dea366e3e7e861c12f375a548cc4c61108.1662568410.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-10 17:27:32 -07:00
Paolo Abeni 9f8f1933dc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/freescale/fec.h
  7d650df99d ("net: fec: add pm_qos support on imx6q platform")
  40c79ce13b ("net: fec: add stop mode support for imx8 platform")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-08 18:38:30 +02:00
Pablo Neira Ayuso b118509076 netfilter: remove nf_conntrack_helper sysctl and modparam toggles
__nf_ct_try_assign_helper() remains in place but it now requires a
template to configure the helper.

A toggle to disable automatic helper assignment was added by:

  a900689264 ("netfilter: nf_ct_helper: allow to disable automatic helper assignment")

in 2012 to address the issues described in "Secure use of iptables and
connection tracking helpers". Automatic conntrack helper assignment was
disabled by:

  3bb398d925 ("netfilter: nf_ct_helper: disable automatic helper assignment")

back in 2016.

This patch removes the sysctl and modparam toggles, users now have to
rely on explicit conntrack helper configuration via ruleset.

Update tools/testing/selftests/netfilter/nft_conntrack_helper.sh to
check that auto-assignment does not happen anymore.

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-08-31 12:12:32 +02:00
Kumar Kartikeya Dwivedi 6e116280b4 net: netfilter: Remove ifdefs for code shared by BPF and ctnetlink
The current ifdefry for code shared by the BPF and ctnetlink side looks
ugly. As per Pablo's request, simplify this by unconditionally compiling
in the code. This can be revisited when the shared code between the two
grows further.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220725085130.11553-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-09 09:41:54 -07:00
Jakub Kicinski b3fce974d4 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
bpf-next 2022-07-22

We've added 73 non-merge commits during the last 12 day(s) which contain
a total of 88 files changed, 3458 insertions(+), 860 deletions(-).

The main changes are:

1) Implement BPF trampoline for arm64 JIT, from Xu Kuohai.

2) Add ksyscall/kretsyscall section support to libbpf to simplify tracing kernel
   syscalls through kprobe mechanism, from Andrii Nakryiko.

3) Allow for livepatch (KLP) and BPF trampolines to attach to the same kernel
   function, from Song Liu & Jiri Olsa.

4) Add new kfunc infrastructure for netfilter's CT e.g. to insert and change
   entries, from Kumar Kartikeya Dwivedi & Lorenzo Bianconi.

5) Add a ksym BPF iterator to allow for more flexible and efficient interactions
   with kernel symbols, from Alan Maguire.

6) Bug fixes in libbpf e.g. for uprobe binary path resolution, from Dan Carpenter.

7) Fix BPF subprog function names in stack traces, from Alexei Starovoitov.

8) libbpf support for writing custom perf event readers, from Jon Doron.

9) Switch to use SPDX tag for BPF helper man page, from Alejandro Colomar.

10) Fix xsk send-only sockets when in busy poll mode, from Maciej Fijalkowski.

11) Reparent BPF maps and their charging on memcg offlining, from Roman Gushchin.

12) Multiple follow-up fixes around BPF lsm cgroup infra, from Stanislav Fomichev.

13) Use bootstrap version of bpftool where possible to speed up builds, from Pu Lehui.

14) Cleanup BPF verifier's check_func_arg() handling, from Joanne Koong.

15) Make non-prealloced BPF map allocations low priority to play better with
    memcg limits, from Yafang Shao.

16) Fix BPF test runner to reject zero-length data for skbs, from Zhengchao Shao.

17) Various smaller cleanups and improvements all over the place.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (73 commits)
  bpf: Simplify bpf_prog_pack_[size|mask]
  bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)
  bpf, x64: Allow to use caller address from stack
  ftrace: Allow IPMODIFY and DIRECT ops on the same function
  ftrace: Add modify_ftrace_direct_multi_nolock
  bpf/selftests: Fix couldn't retrieve pinned program in xdp veth test
  bpf: Fix build error in case of !CONFIG_DEBUG_INFO_BTF
  selftests/bpf: Fix test_verifier failed test in unprivileged mode
  selftests/bpf: Add negative tests for new nf_conntrack kfuncs
  selftests/bpf: Add tests for new nf_conntrack kfuncs
  selftests/bpf: Add verifier tests for trusted kfunc args
  net: netfilter: Add kfuncs to set and change CT status
  net: netfilter: Add kfuncs to set and change CT timeout
  net: netfilter: Add kfuncs to allocate and insert CT
  net: netfilter: Deduplicate code in bpf_{xdp,skb}_ct_lookup
  bpf: Add documentation for kfuncs
  bpf: Add support for forcing kfunc args to be trusted
  bpf: Switch to new kfunc flags infrastructure
  tools/resolve_btfids: Add support for 8-byte BTF sets
  bpf: Introduce 8-byte BTF set
  ...
====================

Link: https://lore.kernel.org/r/20220722221218.29943-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-22 16:55:44 -07:00
Lorenzo Bianconi ef69aa3a98 net: netfilter: Add kfuncs to set and change CT status
Introduce bpf_ct_set_status and bpf_ct_change_status kfunc helpers in
order to set nf_conn field of allocated entry or update nf_conn status
field of existing inserted entry. Use nf_ct_change_status_common to
share the permitted status field changes between netlink and BPF side
by refactoring ctnetlink_change_status.

It is required to introduce two kfuncs taking nf_conn___init and nf_conn
instead of sharing one because KF_TRUSTED_ARGS flag causes strict type
checking. This would disallow passing nf_conn___init to kfunc taking
nf_conn, and vice versa. We cannot remove the KF_TRUSTED_ARGS flag as we
only want to accept refcounted pointers and not e.g. ct->master.

Hence, bpf_ct_set_* kfuncs are meant to be used on allocated CT, and
bpf_ct_change_* kfuncs are meant to be used on inserted or looked up
CT entry.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220721134245.2450-10-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-21 21:03:16 -07:00
Kumar Kartikeya Dwivedi 0b38923644 net: netfilter: Add kfuncs to set and change CT timeout
Introduce bpf_ct_set_timeout and bpf_ct_change_timeout kfunc helpers in
order to change nf_conn timeout. This is same as ctnetlink_change_timeout,
hence code is shared between both by extracting it out to
__nf_ct_change_timeout. It is also updated to return an error when it
sees IPS_FIXED_TIMEOUT_BIT bit in ct->status, as that check was missing.

It is required to introduce two kfuncs taking nf_conn___init and nf_conn
instead of sharing one because KF_TRUSTED_ARGS flag causes strict type
checking. This would disallow passing nf_conn___init to kfunc taking
nf_conn, and vice versa. We cannot remove the KF_TRUSTED_ARGS flag as we
only want to accept refcounted pointers and not e.g. ct->master.

Apart from this, bpf_ct_set_timeout is only called for newly allocated
CT so it doesn't need to inspect the status field just yet. Sharing the
helpers even if it was possible would make timeout setting helper
sensitive to order of setting status and timeout after allocation.

Hence, bpf_ct_set_* kfuncs are meant to be used on allocated CT, and
bpf_ct_change_* kfuncs are meant to be used on inserted or looked up
CT entry.

Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220721134245.2450-9-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-21 21:03:16 -07:00
Jakub Kicinski 602ae008ab Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next:

1) Simplify nf_ct_get_tuple(), from Jackie Liu.

2) Add format to request_module() call, from Bill Wendling.

3) Add /proc/net/stats/nf_flowtable to monitor in-flight pending
   hardware offload objects to be processed, from Vlad Buslov.

4) Missing rcu annotation and accessors in the netfilter tree,
   from Florian Westphal.

5) Merge h323 conntrack helper nat hooks into single object,
   also from Florian.

6) A batch of update to fix sparse warnings treewide,
   from Florian Westphal.

7) Move nft_cmp_fast_mask() where it used, from Florian.

8) Missing const in nf_nat_initialized(), from James Yonan.

9) Use bitmap API for Maglev IPVS scheduler, from Christophe Jaillet.

10) Use refcount_inc instead of _inc_not_zero in flowtable,
    from Florian Westphal.

11) Remove pr_debug in xt_TPROXY, from Nathan Cancellor.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: xt_TPROXY: remove pr_debug invocations
  netfilter: flowtable: prefer refcount_inc
  netfilter: ipvs: Use the bitmap API to allocate bitmaps
  netfilter: nf_nat: in nf_nat_initialized(), use const struct nf_conn *
  netfilter: nf_tables: move nft_cmp_fast_mask to where its used
  netfilter: nf_tables: use correct integer types
  netfilter: nf_tables: add and use BE register load-store helpers
  netfilter: nf_tables: use the correct get/put helpers
  netfilter: x_tables: use correct integer types
  netfilter: nfnetlink: add missing __be16 cast
  netfilter: nft_set_bitmap: Fix spelling mistake
  netfilter: h323: merge nat hook pointers into one
  netfilter: nf_conntrack: use rcu accessors where needed
  netfilter: nf_conntrack: add missing __rcu annotations
  netfilter: nf_flow_table: count pending offload workqueue tasks
  net/sched: act_ct: set 'net' pointer when creating new nf_flow_table
  netfilter: conntrack: use correct format characters
  netfilter: conntrack: use fallthrough to cleanup
====================

Link: https://lore.kernel.org/r/20220720230754.209053-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-20 18:05:51 -07:00
Jackie Liu 6be7915612 netfilter: conntrack: use fallthrough to cleanup
These cases all use the same function. we can simplify the code through
fallthrough.

$ size net/netfilter/nf_conntrack_core.o

        text	   data	    bss	    dec	    hex	filename
before  81601	  81430	    768	 163799	  27fd7	net/netfilter/nf_conntrack_core.o
after   80361	  81430	    768	 162559	  27aff	net/netfilter/nf_conntrack_core.o

Arch: aarch64
Gcc : gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-07-11 16:25:13 +02:00
Florian Westphal 0ed8f619b4 netfilter: conntrack: fix crash due to confirmed bit load reordering
Kajetan Puchalski reports crash on ARM, with backtrace of:

__nf_ct_delete_from_lists
nf_ct_delete
early_drop
__nf_conntrack_alloc

Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
allocated object is still in use on another CPU:

CPU1						CPU2
						encounter 'ct' during hlist walk
 delete_from_lists
 refcount drops to 0
 kmem_cache_free(ct);
 __nf_conntrack_alloc() // returns same object
						refcount_inc_not_zero(ct); /* might fail */

						/* If set, ct is public/in the hash table */
						test_bit(IPS_CONFIRMED_BIT, &ct->status);

In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
will succeed.

The expected possibilities for a CPU that obtained the object 'ct'
(but no reference so far) are:

1. refcount_inc_not_zero() fails.  CPU2 ignores the object and moves to
   the next entry in the list.  This happens for objects that are about
   to be free'd, that have been free'd, or that have been reallocated
   by __nf_conntrack_alloc(), but where the refcount has not been
   increased back to 1 yet.

2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
   in ct->status.  If set, the object is public/in the table.

   If not, the object must be skipped; CPU2 calls nf_ct_put() to
   un-do the refcount increment and moves to the next object.

Parallel deletion from the hlists is prevented by a
'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
cpu will do the unlink, the other one will only drop its reference count.

Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
delete an object that is not on any list:

1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
2. CONFIRMED test also successful (load was reordered or zeroing
   of ct->status not yet visible)
3. delete_from_lists unlinks entry not on the hlist, because
   IPS_DYING_BIT is 0 (already cleared).

2) is already wrong: CPU2 will handle a partially initited object
that is supposed to be private to CPU1.

Add needed barriers when refcount_inc_not_zero() is successful.

It also inserts a smp_wmb() before the refcount is set to 1 during
allocation.

Because other CPU might still see the object, refcount_set(1)
"resurrects" it, so we need to make sure that other CPUs will also observe
the right content.  In particular, the CONFIRMED bit test must only pass
once the object is fully initialised and either in the hash or about to be
inserted (with locks held to delay possible unlink from early_drop or
gc worker).

I did not change flow_offload_alloc(), as far as I can see it should call
refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
the skb so its refcount should be >= 1 in all cases.

v2: prefer smp_acquire__after_ctrl_dep to smp_rmb (Will Deacon).
v3: keep smp_acquire__after_ctrl_dep close to refcount_inc_not_zero call
    add comment in nf_conntrack_netlink, no control dependency there
    due to locks.

Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/all/Yr7WTfd6AVTQkLjI@e126311.manchester.arm.com/
Reported-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Diagnosed-by: Will Deacon <will@kernel.org>
Fixes: 7197743776 ("netfilter: conntrack: convert to refcount_t api")
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Will Deacon <will@kernel.org>
2022-07-07 20:55:18 +02:00
Florian Westphal 90d1daa458 netfilter: conntrack: add nf_conntrack_events autodetect mode
This adds the new nf_conntrack_events=2 mode and makes it the
default.

This leverages the earlier flag in struct net to allow to avoid
the event extension as long as no event listener is active in
the namespace.

This avoids, for most cases, allocation of ct->ext area.
A followup patch will take further advantage of this by avoiding
calls down into the event framework if the extension isn't present.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:28 +02:00
Florian Westphal b0a7ab4a77 netfilter: conntrack: un-inline nf_ct_ecache_ext_add
Only called when new ct is allocated or the extension isn't present.
This function will be extended, place this in the conntrack module
instead of inlining.

The callers already depend on nf_conntrack module.
Return value is changed to bool, noone used the returned pointer.

Make sure that the core drops the newly allocated conntrack
if the extension is requested but can't be added.
This makes it necessary to ifdef the section, as the stub
always returns false we'd drop every new conntrack if the
the ecache extension is disabled in kconfig.

Add from data path (xt_CT, nft_ct) is unchanged.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:28 +02:00
Pablo Neira Ayuso 8169ff5840 netfilter: conntrack: add nf_ct_iter_data object for nf_ct_iterate_cleanup*()
This patch adds a structure to collect all the context data that is
passed to the cleanup iterator.

 struct nf_ct_iter_data {
       struct net *net;
       void *data;
       u32 portid;
       int report;
 };

There is a netns field that allows to clean up conntrack entries
specifically owned by the specified netns.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:27 +02:00
Florian Westphal 0bcfbafbcd netfilter: conntrack: avoid unconditional local_bh_disable
Now that the conntrack entry isn't placed on the pcpu list anymore the
bh only needs to be disabled in the 'expectation present' case.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:27 +02:00
Florian Westphal 8a75a2c174 netfilter: conntrack: remove unconfirmed list
It has no function anymore and can be removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:53:27 +02:00
Florian Westphal ace53fdc26 netfilter: conntrack: remove __nf_ct_unconfirmed_destroy
Its not needed anymore:

A. If entry is totally new, then the rcu-protected resource
must already have been removed from global visibility before call
to nf_ct_iterate_destroy.

B. If entry was allocated before, but is not yet in the hash table
   (uncofirmed case), genid gets incremented and synchronize_rcu() call
   makes sure access has completed.

C. Next attempt to peek at extension area will fail for unconfirmed
  conntracks, because ext->genid != genid.

D. Conntracks in the hash are iterated as before.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:52:17 +02:00
Florian Westphal c56716c69c netfilter: extensions: introduce extension genid count
Multiple netfilter extensions store pointers to external data
in their extension area struct.

Examples:
1. Timeout policies
2. Connection tracking helpers.

No references are taken for these.

When a helper or timeout policy is removed, the conntrack table gets
traversed and affected extensions are cleared.

Conntrack entries not yet in the hashtable are referenced via a special
list, the unconfirmed list.

On removal of a policy or connection tracking helper, the unconfirmed
list gets traversed an all entries are marked as dying, this prevents
them from getting committed to the table at insertion time: core checks
for dying bit, if set, the conntrack entry gets destroyed at confirm
time.

The disadvantage is that each new conntrack has to be added to the percpu
unconfirmed list, and each insertion needs to remove it from this list.
The list is only ever needed when a policy or helper is removed -- a rare
occurrence.

Add a generation ID count: Instead of adding to the list and then
traversing that list on policy/helper removal, increment a counter
that is stored in the extension area.

For unconfirmed conntracks, the extension has the genid valid at ct
allocation time.

Removal of a helper/policy etc. increments the counter.
At confirmation time, validate that ext->genid == global_id.

If the stored number is not the same, do not allow the conntrack
insertion, just like as if a confirmed-list traversal would have flagged
the entry as dying.

After insertion, the genid is no longer relevant (conntrack entries
are now reachable via the conntrack table iterators and is set to 0.

This allows removal of the percpu unconfirmed list.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:52:16 +02:00
Florian Westphal 17438b42ce netfilter: remove nf_ct_unconfirmed_destroy helper
This helper tags connections not yet in the conntrack table as
dying.  These nf_conn entries will be dropped instead when the
core attempts to insert them from the input or postrouting
'confirm' hook.

After the previous change, the entries get unlinked from the
list earlier, so that by the time the actual exit hook runs,
new connections no longer have a timeout policy assigned.

Its enough to walk the hashtable instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:52:16 +02:00
Florian Westphal 1397af5bfd netfilter: conntrack: remove the percpu dying list
Its no longer needed. Entries that need event redelivery are placed
on the new pernet dying list.

The advantage is that there is no need to take additional spinlock on
conntrack removal unless event redelivery failed or the conntrack entry
was never added to the table in the first place (confirmed bit not set).

The IPS_CONFIRMED bit now needs to be set as soon as the entry has been
unlinked from the unconfirmed list, else the destroy function may
attempt to unlink it a second time.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:52:16 +02:00
Florian Westphal 2ed3bf188b netfilter: ecache: use dedicated list for event redelivery
This disentangles event redelivery and the percpu dying list.

Because entries are now stored on a dedicated list, all
entries are in NFCT_ECACHE_DESTROY_FAIL state and all entries
still have confirmed bit set -- the reference count is at least 1.

The 'struct net' back-pointer can be removed as well.

The pcpu dying list will be removed eventually, it has no functionality.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:51:28 +02:00
Florian Westphal 2cfadb761d netfilter: conntrack: revisit gc autotuning
as of commit 4608fdfc07
("netfilter: conntrack: collect all entries in one cycle")
conntrack gc was changed to run every 2 minutes.

On systems where conntrack hash table is set to large value, most evictions
happen from gc worker rather than the packet path due to hash table
distribution.

This causes netlink event overflows when events are collected.

This change collects average expiry of scanned entries and
reschedules to the average remaining value, within 1 to 60 second interval.

To avoid event overflows, reschedule after each bucket and add a
limit for both run time and number of evictions per run.

If more entries have to be evicted, reschedule and restart 1 jiffy
into the future.

Reported-by: Karel Rericha <karel@maxtel.cz>
Cc: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-19 23:11:34 +01:00
Jakub Kicinski e243f39685 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-17 13:56:58 -07:00
Florian Westphal ee0a4dc9f3 Revert "netfilter: conntrack: tag conntracks picked up in local out hook"
This was a prerequisite for the ill-fated
"netfilter: nat: force port remap to prevent shadowing well-known ports".

As this has been reverted, this change can be backed out too.

Signed-off-by: Florian Westphal <fw@strlen.de>
2022-03-08 17:28:38 +01:00
Florian Westphal 1015c3de23 netfilter: conntrack: remove extension register api
These no longer register/unregister a meaningful structure so remove it.

Cc: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04 06:30:28 +01:00
Florian Westphal 1bc91a5ddf netfilter: conntrack: handle ->destroy hook via nat_ops instead
The nat module already exposes a few functions to the conntrack core.
Move the nat extension destroy hook to it.

After this, no conntrack extension needs a destroy hook.
'struct nf_ct_ext_type' and the register/unregister api can be removed
in a followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04 06:30:28 +01:00
Florian Westphal 5f31edc067 netfilter: conntrack: move extension sizes into core
No need to specify this in the registration modules, we already
collect all sizes for build-time checks on the maximum combined size.

After this change, all extensions except nat have no meaningful content
in their nf_ct_ext_type struct definition.

Next patch handles nat, this will then allow to remove the dynamic
register api completely.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04 06:30:28 +01:00