Commit graph

1248981 commits

Author SHA1 Message Date
Maciej Fijalkowski 0cbb08707c i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue
Now that i40e driver correctly sets up frag_size in xdp_rxq_info, let us
make it work for ZC multi-buffer as well. i40e_ring::rx_buf_len for ZC
is being set via xsk_pool_get_rx_frame_size() and this needs to be
propagated up to xdp_rxq_info.

Fixes: 1c9ba9c146 ("i40e: xsk: add RX multi-buffer support")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-12-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24 16:24:07 -08:00
Maciej Fijalkowski a045d2f2d0 i40e: set xdp_rxq_info::frag_size
i40e support XDP multi-buffer so it is supposed to use
__xdp_rxq_info_reg() instead of xdp_rxq_info_reg() and set the
frag_size. It can not be simply converted at existing callsite because
rx_buf_len could be un-initialized, so let us register xdp_rxq_info
within i40e_configure_rx_ring(), which happen to be called with already
initialized rx_buf_len value.

Commit 5180ff1364 ("i40e: use int for i40e_status") converted 'err' to
int, so two variables to deal with return codes are not needed within
i40e_configure_rx_ring(). Remove 'ret' and use 'err' to handle status
from xdp_rxq_info registration.

Fixes: e213ced19b ("i40e: add support for XDP multi-buffer Rx")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-11-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24 16:24:07 -08:00
Maciej Fijalkowski fbadd83a61 xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL
XSK ZC Rx path calculates the size of data that will be posted to XSK Rx
queue via subtracting xdp_buff::data_end from xdp_buff::data.

In bpf_xdp_frags_increase_tail(), when underlying memory type of
xdp_rxq_info is MEM_TYPE_XSK_BUFF_POOL, add offset to data_end in tail
fragment, so that later on user space will be able to take into account
the amount of bytes added by XDP program.

Fixes: 24ea50127e ("xsk: support mbuf on ZC RX")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-10-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24 16:24:07 -08:00
Maciej Fijalkowski 3de38c8717 ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue
Now that ice driver correctly sets up frag_size in xdp_rxq_info, let us
make it work for ZC multi-buffer as well. ice_rx_ring::rx_buf_len for ZC
is being set via xsk_pool_get_rx_frame_size() and this needs to be
propagated up to xdp_rxq_info.

Use a bigger hammer and instead of unregistering only xdp_rxq_info's
memory model, unregister it altogether and register it again and have
xdp_rxq_info with correct frag_size value.

Fixes: 1bbc04de60 ("ice: xsk: add RX multi-buffer support")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-9-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24 16:24:07 -08:00
Maciej Fijalkowski 290779905d intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers
Ice and i40e ZC drivers currently set offset of a frag within
skb_shared_info to 0, which is incorrect. xdp_buffs that come from
xsk_buff_pool always have 256 bytes of a headroom, so they need to be
taken into account to retrieve xdp_buff::data via skb_frag_address().
Otherwise, bpf_xdp_frags_increase_tail() would be starting its job from
xdp_buff::data_hard_start which would result in overwriting existing
payload.

Fixes: 1c9ba9c146 ("i40e: xsk: add RX multi-buffer support")
Fixes: 1bbc04de60 ("ice: xsk: add RX multi-buffer support")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-8-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24 16:24:07 -08:00
Maciej Fijalkowski 2ee788c064 ice: remove redundant xdp_rxq_info registration
xdp_rxq_info struct can be registered by drivers via two functions -
xdp_rxq_info_reg() and __xdp_rxq_info_reg(). The latter one allows
drivers that support XDP multi-buffer to set up xdp_rxq_info::frag_size
which in turn will make it possible to grow the packet via
bpf_xdp_adjust_tail() BPF helper.

Currently, ice registers xdp_rxq_info in two spots:
1) ice_setup_rx_ring() // via xdp_rxq_info_reg(), BUG
2) ice_vsi_cfg_rxq()   // via __xdp_rxq_info_reg(), OK

Cited commit under fixes tag took care of setting up frag_size and
updated registration scheme in 2) but it did not help as
1) is called before 2) and as shown above it uses old registration
function. This means that 2) sees that xdp_rxq_info is already
registered and never calls __xdp_rxq_info_reg() which leaves us with
xdp_rxq_info::frag_size being set to 0.

To fix this misbehavior, simply remove xdp_rxq_info_reg() call from
ice_setup_rx_ring().

Fixes: 2fba7dc515 ("ice: Add support for XDP multi-buffer on Rx side")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-7-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24 16:24:06 -08:00
Tirthendu Sarkar 83014323c6 i40e: handle multi-buffer packets that are shrunk by xdp prog
XDP programs can shrink packets by calling the bpf_xdp_adjust_tail()
helper function. For multi-buffer packets this may lead to reduction of
frag count stored in skb_shared_info area of the xdp_buff struct. This
results in issues with the current handling of XDP_PASS and XDP_DROP
cases.

For XDP_PASS, currently skb is being built using frag count of
xdp_buffer before it was processed by XDP prog and thus will result in
an inconsistent skb when frag count gets reduced by XDP prog. To fix
this, get correct frag count while building the skb instead of using
pre-obtained frag count.

For XDP_DROP, current page recycling logic will not reuse the page but
instead will adjust the pagecnt_bias so that the page can be freed. This
again results in inconsistent behavior as the page refcnt has already
been changed by the helper while freeing the frag(s) as part of
shrinking the packet. To fix this, only adjust pagecnt_bias for buffers
that are stillpart of the packet post-xdp prog run.

Fixes: e213ced19b ("i40e: add support for XDP multi-buffer Rx")
Reported-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-6-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24 16:24:06 -08:00
Maciej Fijalkowski ad2047cf5d ice: work on pre-XDP prog frag count
Fix an OOM panic in XDP_DRV mode when a XDP program shrinks a
multi-buffer packet by 4k bytes and then redirects it to an AF_XDP
socket.

Since support for handling multi-buffer frames was added to XDP, usage
of bpf_xdp_adjust_tail() helper within XDP program can free the page
that given fragment occupies and in turn decrease the fragment count
within skb_shared_info that is embedded in xdp_buff struct. In current
ice driver codebase, it can become problematic when page recycling logic
decides not to reuse the page. In such case, __page_frag_cache_drain()
is used with ice_rx_buf::pagecnt_bias that was not adjusted after
refcount of page was changed by XDP prog which in turn does not drain
the refcount to 0 and page is never freed.

To address this, let us store the count of frags before the XDP program
was executed on Rx ring struct. This will be used to compare with
current frag count from skb_shared_info embedded in xdp_buff. A smaller
value in the latter indicates that XDP prog freed frag(s). Then, for
given delta decrement pagecnt_bias for XDP_DROP verdict.

While at it, let us also handle the EOP frag within
ice_set_rx_bufs_act() to make our life easier, so all of the adjustments
needed to be applied against freed frags are performed in the single
place.

Fixes: 2fba7dc515 ("ice: Add support for XDP multi-buffer on Rx side")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-5-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24 16:24:06 -08:00
Maciej Fijalkowski c5114710c8 xsk: fix usage of multi-buffer BPF helpers for ZC XDP
Currently when packet is shrunk via bpf_xdp_adjust_tail() and memory
type is set to MEM_TYPE_XSK_BUFF_POOL, null ptr dereference happens:

[1136314.192256] BUG: kernel NULL pointer dereference, address:
0000000000000034
[1136314.203943] #PF: supervisor read access in kernel mode
[1136314.213768] #PF: error_code(0x0000) - not-present page
[1136314.223550] PGD 0 P4D 0
[1136314.230684] Oops: 0000 [#1] PREEMPT SMP NOPTI
[1136314.239621] CPU: 8 PID: 54203 Comm: xdpsock Not tainted 6.6.0+ #257
[1136314.250469] Hardware name: Intel Corporation S2600WFT/S2600WFT,
BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
[1136314.265615] RIP: 0010:__xdp_return+0x6c/0x210
[1136314.274653] Code: ad 00 48 8b 47 08 49 89 f8 a8 01 0f 85 9b 01 00 00 0f 1f 44 00 00 f0 41 ff 48 34 75 32 4c 89 c7 e9 79 cd 80 ff 83 fe 03 75 17 <f6> 41 34 01 0f 85 02 01 00 00 48 89 cf e9 22 cc 1e 00 e9 3d d2 86
[1136314.302907] RSP: 0018:ffffc900089f8db0 EFLAGS: 00010246
[1136314.312967] RAX: ffffc9003168aed0 RBX: ffff8881c3300000 RCX:
0000000000000000
[1136314.324953] RDX: 0000000000000000 RSI: 0000000000000003 RDI:
ffffc9003168c000
[1136314.336929] RBP: 0000000000000ae0 R08: 0000000000000002 R09:
0000000000010000
[1136314.348844] R10: ffffc9000e495000 R11: 0000000000000040 R12:
0000000000000001
[1136314.360706] R13: 0000000000000524 R14: ffffc9003168aec0 R15:
0000000000000001
[1136314.373298] FS:  00007f8df8bbcb80(0000) GS:ffff8897e0e00000(0000)
knlGS:0000000000000000
[1136314.386105] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1136314.396532] CR2: 0000000000000034 CR3: 00000001aa912002 CR4:
00000000007706f0
[1136314.408377] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[1136314.420173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[1136314.431890] PKRU: 55555554
[1136314.439143] Call Trace:
[1136314.446058]  <IRQ>
[1136314.452465]  ? __die+0x20/0x70
[1136314.459881]  ? page_fault_oops+0x15b/0x440
[1136314.468305]  ? exc_page_fault+0x6a/0x150
[1136314.476491]  ? asm_exc_page_fault+0x22/0x30
[1136314.484927]  ? __xdp_return+0x6c/0x210
[1136314.492863]  bpf_xdp_adjust_tail+0x155/0x1d0
[1136314.501269]  bpf_prog_ccc47ae29d3b6570_xdp_sock_prog+0x15/0x60
[1136314.511263]  ice_clean_rx_irq_zc+0x206/0xc60 [ice]
[1136314.520222]  ? ice_xmit_zc+0x6e/0x150 [ice]
[1136314.528506]  ice_napi_poll+0x467/0x670 [ice]
[1136314.536858]  ? ttwu_do_activate.constprop.0+0x8f/0x1a0
[1136314.546010]  __napi_poll+0x29/0x1b0
[1136314.553462]  net_rx_action+0x133/0x270
[1136314.561619]  __do_softirq+0xbe/0x28e
[1136314.569303]  do_softirq+0x3f/0x60

This comes from __xdp_return() call with xdp_buff argument passed as
NULL which is supposed to be consumed by xsk_buff_free() call.

To address this properly, in ZC case, a node that represents the frag
being removed has to be pulled out of xskb_list. Introduce
appropriate xsk helpers to do such node operation and use them
accordingly within bpf_xdp_adjust_tail().

Fixes: 24ea50127e ("xsk: support mbuf on ZC RX")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> # For the xsk header part
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-4-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24 16:24:06 -08:00
Maciej Fijalkowski f7f6aa8e24 xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags
XDP multi-buffer support introduced XDP_FLAGS_HAS_FRAGS flag that is
used by drivers to notify data path whether xdp_buff contains fragments
or not. Data path looks up mentioned flag on first buffer that occupies
the linear part of xdp_buff, so drivers only modify it there. This is
sufficient for SKB and XDP_DRV modes as usually xdp_buff is allocated on
stack or it resides within struct representing driver's queue and
fragments are carried via skb_frag_t structs. IOW, we are dealing with
only one xdp_buff.

ZC mode though relies on list of xdp_buff structs that is carried via
xsk_buff_pool::xskb_list, so ZC data path has to make sure that
fragments do *not* have XDP_FLAGS_HAS_FRAGS set. Otherwise,
xsk_buff_free() could misbehave if it would be executed against xdp_buff
that carries a frag with XDP_FLAGS_HAS_FRAGS flag set. Such scenario can
take place when within supplied XDP program bpf_xdp_adjust_tail() is
used with negative offset that would in turn release the tail fragment
from multi-buffer frame.

Calling xsk_buff_free() on tail fragment with XDP_FLAGS_HAS_FRAGS would
result in releasing all the nodes from xskb_list that were produced by
driver before XDP program execution, which is not what is intended -
only tail fragment should be deleted from xskb_list and then it should
be put onto xsk_buff_pool::free_list. Such multi-buffer frame will never
make it up to user space, so from AF_XDP application POV there would be
no traffic running, however due to free_list getting constantly new
nodes, driver will be able to feed HW Rx queue with recycled buffers.
Bottom line is that instead of traffic being redirected to user space,
it would be continuously dropped.

To fix this, let us clear the mentioned flag on xsk_buff_pool side
during xdp_buff initialization, which is what should have been done
right from the start of XSK multi-buffer support.

Fixes: 1bbc04de60 ("ice: xsk: add RX multi-buffer support")
Fixes: 1c9ba9c146 ("i40e: xsk: add RX multi-buffer support")
Fixes: 24ea50127e ("xsk: support mbuf on ZC RX")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-3-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24 16:24:06 -08:00
Maciej Fijalkowski 2690098931 xsk: recycle buffer in case Rx queue was full
Add missing xsk_buff_free() call when __xsk_rcv_zc() failed to produce
descriptor to XSK Rx queue.

Fixes: 24ea50127e ("xsk: support mbuf on ZC RX")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-2-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24 16:24:06 -08:00
Jakub Kicinski 77be22473b Merge branch 'fix-module_description-for-net-p2'
Breno Leitao says:

====================
Fix MODULE_DESCRIPTION() for net (p2)

There are hundreds of network modules that misses MODULE_DESCRIPTION(),
causing a warnning when compiling with W=1. Example:

        WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/arcnet/com90io.o
        WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/arcnet/arc-rimi.o
        WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/arcnet/com20020.o

This part2 of the patchset focus on the drivers/net/ethernet drivers.
There are still some missing warnings in drivers/net/ethernet that will
be fixed in an upcoming patchset.

v1: https://lore.kernel.org/all/20240122184543.2501493-2-leitao@debian.org/
====================

Link: https://lore.kernel.org/r/20240123190332.677489-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24 15:12:56 -08:00
Breno Leitao bdc6734115 net: fill in MODULE_DESCRIPTION()s for rvu_mbox
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
Add descriptions to the Marvel RVU mbox driver.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240123190332.677489-11-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24 15:12:21 -08:00
Breno Leitao 07d1e0ce87 net: fill in MODULE_DESCRIPTION()s for litex
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
Add descriptions to the LiteX Liteeth Ethernet device.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Gabriel Somlo <gsomlo@gmail.com>
Link: https://lore.kernel.org/r/20240123190332.677489-10-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24 15:12:21 -08:00
Breno Leitao 8183c470c1 net: fill in MODULE_DESCRIPTION()s for fsl_pq_mdio
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
Add descriptions to the Freescale PQ MDIO driver.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240123190332.677489-9-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24 15:12:21 -08:00
Breno Leitao 2e87576488 net: fill in MODULE_DESCRIPTION()s for fec
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
Add descriptions to the FEC (MPC8xx) Ethernet controller.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Wei Fang <wei.fang@nxp.com>
Link: https://lore.kernel.org/r/20240123190332.677489-8-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24 15:12:20 -08:00
Breno Leitao 07c42d2375 net: fill in MODULE_DESCRIPTION()s for enetc
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
Add descriptions to the NXP ENETC Ethernet driver.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240123190332.677489-7-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24 15:12:20 -08:00
Breno Leitao 27881ca8c8 net: fill in MODULE_DESCRIPTION()s for nps_enet
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
Add descriptions to the EZchip NPS ethernet driver.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240123190332.677489-6-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24 15:12:20 -08:00
Breno Leitao 53c83e2d36 net: fill in MODULE_DESCRIPTION()s for ep93xxx_eth
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
Add descriptions to the Cirrus EP93xx ethernet driver.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240123190332.677489-5-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24 15:12:20 -08:00
Breno Leitao bb567fbbbb net: fill in MODULE_DESCRIPTION()s for liquidio
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
Add descriptions to the Cavium Liquidio.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240123190332.677489-4-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24 15:12:20 -08:00
Breno Leitao 39535d7ff6 net: fill in MODULE_DESCRIPTION()s for Broadcom bgmac
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
Add descriptions to the Broadcom iProc GBit driver.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/20240123190332.677489-3-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24 15:12:20 -08:00
Breno Leitao f5e414167b net: fill in MODULE_DESCRIPTION()s for 8390
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
Add descriptions to all the good old 8390 modules and drivers.

Signed-off-by: Breno Leitao <leitao@debian.org>
CC: geert@linux-m68k.org
Link: https://lore.kernel.org/r/20240123190332.677489-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24 15:12:20 -08:00
Jakub Kicinski 0879020a78 selftests: netdevsim: fix the udp_tunnel_nic test
This test is missing a whole bunch of checks for interface
renaming and one ifup. Presumably it was only used on a system
with renaming disabled and NetworkManager running.

Fixes: 91f430b2c4 ("selftests: net: add a test for UDP tunnel info infra")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240123060529.1033912-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24 15:11:10 -08:00
James Clark ac668d529f perf test: Skip test_arm_callgraph_fp.sh if unwinding isn't built in
Even though this is a frame pointer unwind test, it's testing that a
frame pointer stack can be augmented correctly with a partial
Dwarf unwind. So add a feature check so that this test skips instead of
fails if Dwarf unwinding isn't present.

Signed-off-by: James Clark <james.clark@arm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Spoorthy S <spoorts2@in.ibm.com>
Cc: Kajol Jain <kjain@linux.ibm.com>
Link: https://lore.kernel.org/r/20240123163903.350306-3-james.clark@arm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24 14:13:48 -08:00
James Clark b58ab8ac75 perf version: Display availability of HAVE_DWARF_UNWIND_SUPPORT
Even though unwinding depends on either HAVE_DWARF_SUPPORT or
HAVE_LIBUNWIND, scripts testing unwinding can't just look for the
existence of either of those flags. This is because
NO_LIBDW_DWARF_UNWIND=1 can disable unwinding with libdw, but libdw will
still be linked leaving HAVE_DWARF_SUPPORT turned on. Presumably because
it is used for things other than unwinding, so I don't think this needs
to be fixed.

HAVE_DWARF_UNWIND_SUPPORT already takes the combination of all those
things into account, and is used to gate the built in tests like "Test
dwarf unwind", so add it to the feature list output so that it can be
used by the script tests too.

Signed-off-by: James Clark <james.clark@arm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Spoorthy S <spoorts2@in.ibm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Link: https://lore.kernel.org/r/20240123163903.350306-2-james.clark@arm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24 14:13:48 -08:00
James Clark 63f209b6fa perf evlist: Fix evlist__new_default() for > 1 core PMU
The 'Session topology' test currently fails with this message when
evlist__new_default() opens more than one event:

  32: Session topology                                                :
  --- start ---
  templ file: /tmp/perf-test-vv5YzZ
  Using CPUID 0x00000000410fd070
  Opening: unknown-hardware:HG
  ------------------------------------------------------------
  perf_event_attr:
    type                             0 (PERF_TYPE_HARDWARE)
    config                           0xb00000000
    disabled                         1
  ------------------------------------------------------------
  sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8 = 4
  Opening: unknown-hardware:HG
  ------------------------------------------------------------
  perf_event_attr:
    type                             0 (PERF_TYPE_HARDWARE)
    config                           0xa00000000
    disabled                         1
  ------------------------------------------------------------
  sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8 = 5
  non matching sample_type
  FAILED tests/topology.c:73 can't get session
  ---- end ----
  Session topology: FAILED!

This is because when re-opening the file and parsing the header, Perf
expects that any file that has more than one event has the sample ID
flag set. Perf record already sets the flag in a similar way when there
is more than one event, so add the same logic to evlist__new_default().

evlist__new_default() is only currently used in tests, so I don't
expect this change to have any other side effects. The other tests that
use it don't save and re-open the file so don't hit this issue.

The session topology test has been failing on Arm big.LITTLE platforms
since commit 251aa04024 ("perf parse-events: Wildcard most
"numeric" events") when evlist__new_default() started opening multiple
events for 'cycles'.

Fixes: 251aa04024 ("perf parse-events: Wildcard most "numeric" events")
Closes: https://lore.kernel.org/lkml/CAP-5=fWVQ-7ijjK3-w1q+k2WYVNHbAcejb-xY0ptbjRw476VKA@mail.gmail.com/
Tested-by: Ian Rogers <irogers@google.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Tested-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: James Clark <james.clark@arm.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: Yang Jihong <yangjihong1@huawei.com>
Link: https://lore.kernel.org/r/20240124094358.489372-1-james.clark@arm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24 14:13:48 -08:00
Kan Liang 821aca20be perf mem: Clean up perf_pmus__num_mem_pmus()
The number of mem PMUs can be calculated by searching the
perf_pmus__scan_mem().

Remove the ARCH specific perf_pmus__num_mem_pmus()

Tested-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: ravi.bangoria@amd.com
Cc: james.clark@arm.com
Cc: will@kernel.org
Cc: mike.leach@linaro.org
Cc: renyu.zj@linux.alibaba.com
Cc: yuhaixin.yhx@linux.alibaba.com
Cc: tmricht@linux.ibm.com
Cc: atrajeev@linux.vnet.ibm.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: john.g.garry@oracle.com
Link: https://lore.kernel.org/r/20240123185036.3461837-8-kan.liang@linux.intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24 14:05:22 -08:00
Kan Liang 70f4b20d07 perf mem: Clean up perf_mem_events__record_args()
The current code iterates all memory PMUs. It doesn't matter if the
system has only one memory PMU or multiple PMUs. The check of
perf_pmus__num_mem_pmus() is not required anymore.

The rec_tmp is not used in c2c and mem. Removing them as well.

Suggested-by: Leo Yan <leo.yan@linaro.org>
Tested-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: ravi.bangoria@amd.com
Cc: james.clark@arm.com
Cc: will@kernel.org
Cc: mike.leach@linaro.org
Cc: renyu.zj@linux.alibaba.com
Cc: yuhaixin.yhx@linux.alibaba.com
Cc: tmricht@linux.ibm.com
Cc: atrajeev@linux.vnet.ibm.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: john.g.garry@oracle.com
Link: https://lore.kernel.org/r/20240123185036.3461837-7-kan.liang@linux.intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24 14:05:11 -08:00
Kan Liang 8ea9dfb916 perf mem: Clean up is_mem_loads_aux_event()
The aux_event can be retrieved from the perf_pmu now. Implement a
generic support.

Reviewed-by: Ian Rogers <irogers@google.com>
Tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Tested-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Cc: james.clark@arm.com
Cc: will@kernel.org
Cc: mike.leach@linaro.org
Cc: renyu.zj@linux.alibaba.com
Cc: yuhaixin.yhx@linux.alibaba.com
Cc: tmricht@linux.ibm.com
Cc: atrajeev@linux.vnet.ibm.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: john.g.garry@oracle.com
Link: https://lore.kernel.org/r/20240123185036.3461837-6-kan.liang@linux.intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24 14:05:00 -08:00
Kan Liang db95c2ce9b perf mem: Clean up perf_mem_event__supported()
For some ARCHs, e.g., ARM and AMD, to get the availability of the
mem-events, perf checks the existence of a specific PMU. For the other
ARCHs, e.g., Intel and Power, perf has to check the existence of some
specific events.

The current perf only iterates the mem-events-supported PMUs. It's not
required to check the existence of a specific PMU anymore.

Rename sysfs_name to event_name, which stores the specific mem-events.
Perf only needs to check those events for the availability of the
mem-events.

Rename perf_mem_event__supported to perf_pmu__mem_events_supported.

Reviewed-by: Ian Rogers <irogers@google.com>
Tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Tested-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Cc: james.clark@arm.com
Cc: will@kernel.org
Cc: mike.leach@linaro.org
Cc: renyu.zj@linux.alibaba.com
Cc: yuhaixin.yhx@linux.alibaba.com
Cc: tmricht@linux.ibm.com
Cc: atrajeev@linux.vnet.ibm.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: john.g.garry@oracle.com
Link: https://lore.kernel.org/r/20240123185036.3461837-5-kan.liang@linux.intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24 14:04:46 -08:00
Kan Liang abbdd79b78 perf mem: Clean up perf_mem_events__name()
Introduce a generic perf_mem_events__name(). Remove the ARCH-specific
one.

The mem_load events may have a different format. Add ldlat and aux_event
in the struct perf_mem_event to indicate the format and the extra aux
event.

Add perf_mem_events_intel_aux[] to support the extra mem_load_aux event.

Rename perf_mem_events__name to perf_pmu__mem_events_name.

Reviewed-by: Ian Rogers <irogers@google.com>
Tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Tested-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Cc: james.clark@arm.com
Cc: will@kernel.org
Cc: mike.leach@linaro.org
Cc: renyu.zj@linux.alibaba.com
Cc: yuhaixin.yhx@linux.alibaba.com
Cc: tmricht@linux.ibm.com
Cc: atrajeev@linux.vnet.ibm.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: john.g.garry@oracle.com
Link: https://lore.kernel.org/r/20240123185036.3461837-4-kan.liang@linux.intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24 14:04:28 -08:00
Kan Liang a30450e6a4 perf mem: Clean up perf_mem_events__ptr()
The mem_events can be retrieved from the struct perf_pmu now. An ARCH
specific perf_mem_events__ptr() is not required anymore. Remove all of
them.

The Intel hybrid has multiple mem-events-supported PMUs. But they share
the same mem_events. Other ARCHs only support one mem-events-supported
PMU. In the configuration, it's good enough to only configure the
mem_events for one PMU. Add perf_mem_events_find_pmu() which returns the
first mem-events-supported PMU.

In the perf_mem_events__init(), the perf_pmus__scan() is not required
anymore. It avoids checking the sysfs for every PMU on the system.

Make the perf_mem_events__record_args() more generic. Remove the
perf_mem_events__print_unsupport_hybrid().

Since pmu is added as a new parameter, rename perf_mem_events__ptr() to
perf_pmu__mem_events_ptr(). Several other functions also do a similar
rename.

Reviewed-by: Ian Rogers <irogers@google.com>
Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
Tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Tested-by: Kajol jain <kjain@linux.ibm.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Cc: james.clark@arm.com
Cc: will@kernel.org
Cc: leo.yan@linaro.org
Cc: mike.leach@linaro.org
Cc: renyu.zj@linux.alibaba.com
Cc: yuhaixin.yhx@linux.alibaba.com
Cc: tmricht@linux.ibm.com
Cc: atrajeev@linux.vnet.ibm.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: john.g.garry@oracle.com
Link: https://lore.kernel.org/r/20240123185036.3461837-3-kan.liang@linux.intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24 14:04:02 -08:00
Kan Liang bb65acdc83 perf mem: Add mem_events into the supported perf_pmu
With the mem_events, perf doesn't need to read sysfs for each PMU to
find the mem-events-supported PMU. The patch also makes it possible to
clean up the related __weak functions later.

The patch is only to add the mem_events into the perf_pmu for all ARCHs.
It will be used in the later cleanup patches.

Reviewed-by: Ian Rogers <irogers@google.com>
Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
Tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Tested-by: Leo Yan <leo.yan@linaro.org>
Tested-by: Kajol Jain <kjain@linux.ibm.com>
Suggested-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Cc: will@kernel.org
Cc: mike.leach@linaro.org
Cc: renyu.zj@linux.alibaba.com
Cc: yuhaixin.yhx@linux.alibaba.com
Cc: tmricht@linux.ibm.com
Cc: atrajeev@linux.vnet.ibm.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: john.g.garry@oracle.com
Link: https://lore.kernel.org/r/20240123185036.3461837-2-kan.liang@linux.intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24 14:03:33 -08:00
Jakub Kicinski 0719b5338a selftests: net: fix rps_default_mask with >32 CPUs
If there is more than 32 cpus the bitmask will start to contain
commas, leading to:

./rps_default_mask.sh: line 36: [: 00000000,00000000: integer expression expected

Remove the commas, bash doesn't interpret leading zeroes as oct
so that should be good enough. Switch to bash, Simon reports that
not all shells support this type of substitution.

Fixes: c12e0d5f26 ("self-tests: introduce self-tests for RPS default mask")
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240122195815.638997-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24 13:55:19 -08:00
Linus Torvalds cf10015a24 execve fixes for v6.8-rc2
- Fix error handling in begin_new_exec() (Bernd Edlinger)
 
 - MAINTAINERS: specifically mention ELF (Alexey Dobriyan)
 
 - Various cleanups related to earlier open() (Askar Safin, Kees Cook)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmWxbGsWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJiQZD/9Lxd6ntRORthvCGk07g12fGZhQ
 OstFdbHyk5/Z+6/uKxSMvkoPZwJkXF2n3D/AvlfMFgyDBvLCFUu08jZOV31YFbeQ
 OFXVGcbY7nexkAmC6eN2k3SX8E+jzLdbcHeVk/iJomNUYBNTpExXhGMEyqZ53Pzo
 fo1uaRNGreCdSP04aHU1LE0vx7p16553oBeBZFT+iLd4glLte+E1TOZh4cIaSZbK
 5h0e+vG1XSBd9uP3fbYEyf+1JzKuhmm1RrVVaDkds1CLgJzUxh0cE1U9otKfnrwf
 xyBu556wTb001vYAIIcLlOJq+ROdiuA12RSyyHbKZmYAWTkQnBgKPV8BGDbshtzN
 zykJEsbRnWV3vN1n6+UzCEknE/xjvywEEdJgghZh46zk2NjnbtULOonLq8aMw7SA
 O+kcr4rqPLuRnxnkBw7QqA1y09QD9+M/iRQdgahsBIaDM3mMXGQsqeJAo9tFxO2M
 oJ1gJ9A7IdeULMBQ7zKVxTvC5c5fF2/CA5jpHUjASiUOTqcfHkPRYX2GINE62Heb
 xfsc3c1RhDrknMA/O01c8ziEBzZqhHUq4vGgWn0VjwIspYyfOOJYneeIx6/pJyTY
 OXbgaK+NetDCOKcv91Jjj0xfxrP0WogzvDbT9j2NuViqX24aQR1oZrredWPCTt5S
 wKouTaLVsM10EwR/Rw==
 =oOcx
 -----END PGP SIGNATURE-----

Merge tag 'execve-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve fixes from Kees Cook:

 - Fix error handling in begin_new_exec() (Bernd Edlinger)

 - MAINTAINERS: specifically mention ELF (Alexey Dobriyan)

 - Various cleanups related to earlier open() (Askar Safin, Kees Cook)

* tag 'execve-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  exec: Distinguish in_execve from in_exec
  exec: Fix error handling in begin_new_exec()
  exec: Add do_close_execat() helper
  exec: remove useless comment
  ELF, MAINTAINERS: specifically mention ELF
2024-01-24 13:32:29 -08:00
Linus Torvalds 3eab830189 uselib: remove use of __FMODE_EXEC
Jann Horn points out that uselib() really shouldn't trigger the new
FMODE_EXEC logic introduced by commit 4759ff71f2 ("exec: __FMODE_EXEC
instead of in_execve for LSMs").

In fact, it shouldn't even have ever triggered the old pre-existing
logic for __FMODE_EXEC (like the NFS code that makes executables not
need read permissions).  Unlike a real execve(), that can work even with
files that are purely executable by the user (not readable), uselib()
has that MAY_READ requirement becasue it's really just a convenience
wrapper around mmap() for legacy shared libraries.

The whole FMODE_EXEC bit was originally introduced by commit
b500531e6f ("[PATCH] Introduce FMODE_EXEC file flag"), primarily to
give ETXTBUSY error returns for distributed filesystems.

It has since grown a few other warts (like that NFS thing), but there
really isn't any reason to use it for uselib(), and now that we are
trying to use it to replace the horrid 'tsk->in_execve' flag, it's
actively wrong.

Of course, as Jann Horn also points out, nobody should be enabling
CONFIG_USELIB in the first place in this day and age, but that's a
different discussion entirely.

Reported-by: Jann Horn <jannh@google.com>
Fixes: 4759ff71f2 ("exec: __FMODE_EXEC instead of in_execve for LSMs")
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-01-24 13:12:20 -08:00
Mimi Zohar 1ed4b56310 Revert "KEYS: encrypted: Add check for strsep"
This reverts commit b4af096b5d.

New encrypted keys are created either from kernel-generated random
numbers or user-provided decrypted data.  Revert the change requiring
user-provided decrypted data.

Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2024-01-24 16:11:59 -05:00
Jenishkumar Maheshbhai Patel 9f538b415d net: mvpp2: clear BM pool before initialization
Register value persist after booting the kernel using
kexec which results in kernel panic. Thus clear the
BM pool registers before initialisation to fix the issue.

Fixes: 3f518509de ("ethernet: Add new driver for Marvell Armada 375 network unit")
Signed-off-by: Jenishkumar Maheshbhai Patel <jpatel2@marvell.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://lore.kernel.org/r/20240119035914.2595665-1-jpatel2@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24 12:27:33 -08:00
Bernd Edlinger a5f5eee282 net: stmmac: Wait a bit for the reset to take effect
otherwise the synopsys_id value may be read out wrong,
because the GMAC_VERSION register might still be in reset
state, for at least 1 us after the reset is de-asserted.

Add a wait for 10 us before continuing to be on the safe side.

> From what have you got that delay value?

Just try and error, with very old linux versions and old gcc versions
the synopsys_id was read out correctly most of the time (but not always),
with recent linux versions and recnet gcc versions it was read out
wrongly most of the time, but again not always.
I don't have access to the VHDL code in question, so I cannot
tell why it takes so long to get the correct values, I also do not
have more than a few hardware samples, so I cannot tell how long
this timeout must be in worst case.
Experimentally I can tell that the register is read several times
as zero immediately after the reset is de-asserted, also adding several
no-ops is not enough, adding a printk is enough, also udelay(1) seems to
be enough but I tried that not very often, and I have not access to many
hardware samples to be 100% sure about the necessary delay.
And since the udelay here is only executed once per device instance,
it seems acceptable to delay the boot for 10 us.

BTW: my hardware's synopsys id is 0x37.

Fixes: c5e4ddbdfa ("net: stmmac: Add support for optional reset control")
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Link: https://lore.kernel.org/r/AS8P193MB1285A810BD78C111E7F6AA34E4752@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24 12:19:59 -08:00
Linus Torvalds 443b349019 samples/cgroup: add .gitignore file for generated samples
Make 'git status' quietly happy again after a full allmodconfig build.

Fixes: 60433a9d03 ("samples: introduce new samples subdir for cgroup")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-01-24 11:52:40 -08:00
Kees Cook 90383cc078 exec: Distinguish in_execve from in_exec
Just to help distinguish the fs->in_exec flag from the current->in_execve
flag, add comments in check_unsafe_exec() and copy_fs() for more
context. Also note that in_execve is only used by TOMOYO now.

Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-01-24 11:48:52 -08:00
Kees Cook 4759ff71f2 exec: Check __FMODE_EXEC instead of in_execve for LSMs
After commit 978ffcbf00 ("execve: open the executable file before
doing anything else"), current->in_execve was no longer in sync with the
open(). This broke AppArmor and TOMOYO which depend on this flag to
distinguish "open" operations from being "exec" operations.

Instead of moving around in_execve, switch to using __FMODE_EXEC, which
is where the "is this an exec?" intent is stored. Note that TOMOYO still
uses in_execve around cred handling.

Reported-by: Kevin Locke <kevin@kevinlocke.name>
Closes: https://lore.kernel.org/all/ZbE4qn9_h14OqADK@kevinlocke.name
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 978ffcbf00 ("execve: open the executable file before doing anything else")
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: James Morris <jmorris@namei.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc:  <linux-fsdevel@vger.kernel.org>
Cc:  <linux-mm@kvack.org>
Cc:  <apparmor@lists.ubuntu.com>
Cc:  <linux-security-module@vger.kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-01-24 11:38:58 -08:00
Pablo Neira Ayuso d0009effa8 netfilter: nf_tables: validate NFPROTO_* family
Several expressions explicitly refer to NF_INET_* hook definitions
from expr->ops->validate, however, family is not validated.

Bail out with EOPNOTSUPP in case they are used from unsupported
families.

Fixes: 0ca743a559 ("netfilter: nf_tables: add compatibility layer for x_tables")
Fixes: a3c90f7a23 ("netfilter: nf_tables: flow offload expression")
Fixes: 2fa841938c ("netfilter: nf_tables: introduce routing expression")
Fixes: 554ced0a6e ("netfilter: nf_tables: add support for native socket matching")
Fixes: ad49d86e07 ("netfilter: nf_tables: Add synproxy support")
Fixes: 4ed8eb6570 ("netfilter: nf_tables: Add native tproxy support")
Fixes: 6c47260250 ("netfilter: nf_tables: add xfrm expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-24 20:02:40 +01:00
Florian Westphal f342de4e2f netfilter: nf_tables: reject QUEUE/DROP verdict parameters
This reverts commit e0abdadcc6.

core.c:nf_hook_slow assumes that the upper 16 bits of NF_DROP
verdicts contain a valid errno, i.e. -EPERM, -EHOSTUNREACH or similar,
or 0.

Due to the reverted commit, its possible to provide a positive
value, e.g. NF_ACCEPT (1), which results in use-after-free.

Its not clear to me why this commit was made.

NF_QUEUE is not used by nftables; "queue" rules in nftables
will result in use of "nft_queue" expression.

If we later need to allow specifiying errno values from userspace
(do not know why), this has to call NF_DROP_GETERR and check that
"err <= 0" holds true.

Fixes: e0abdadcc6 ("netfilter: nf_tables: accept QUEUE/DROP verdict parameters")
Cc: stable@vger.kernel.org
Reported-by: Notselwyn <notselwyn@pwning.tech>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-24 20:02:39 +01:00
Florian Westphal b462579b2b netfilter: nf_tables: restrict anonymous set and map names to 16 bytes
nftables has two types of sets/maps, one where userspace defines the
name, and anonymous sets/maps, where userspace defines a template name.

For the latter, kernel requires presence of exactly one "%d".
nftables uses "__set%d" and "__map%d" for this.  The kernel will
expand the format specifier and replaces it with the smallest unused
number.

As-is, userspace could define a template name that allows to move
the set name past the 256 bytes upperlimit (post-expansion).

I don't see how this could be a problem, but I would prefer if userspace
cannot do this, so add a limit of 16 bytes for the '%d' template name.

16 bytes is the old total upper limit for set names that existed when
nf_tables was merged initially.

Fixes: 387454901b ("netfilter: nf_tables: Allow set names of up to 255 chars")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-24 20:02:30 +01:00
Florian Westphal c9d9eb9c53 netfilter: nft_limit: reject configurations that cause integer overflow
Reject bogus configs where internal token counter wraps around.
This only occurs with very very large requests, such as 17gbyte/s.

Its better to reject this rather than having incorrect ratelimit.

Fixes: d2168e849e ("netfilter: nft_limit: add per-byte limiting")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-24 20:01:16 +01:00
Pablo Neira Ayuso 01acb2e866 netfilter: nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress basechain
Remove netdevice from inet/ingress basechain in case NETDEV_UNREGISTER
event is reported, otherwise a stale reference to netdevice remains in
the hook list.

Fixes: 60a3815da7 ("netfilter: add inet ingress support")
Cc: stable@vger.kernel.org
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-24 19:50:21 +01:00
George Guo b253d87fd7 netfilter: nf_tables: cleanup documentation
- Correct comments for nlpid, family, udlen and udata in struct nft_table,
  and afinfo is no longer a member of enum nft_set_class.

- Add comment for data in struct nft_set_elem.

- Add comment for flags in struct nft_ctx.

- Add comments for timeout in struct nft_set_iter, and flags is not a
  member of struct nft_set_iter, remove the comment for it.

- Add comments for commit, abort, estimate and gc_init in struct
  nft_set_ops.

- Add comments for pending_update, num_exprs, exprs and catchall_list
  in struct nft_set.

- Add comment for ext_len in struct nft_set_ext_tmpl.

- Add comment for inner_ops in struct nft_expr_type.

- Add comments for clone, destroy_clone, reduce, gc, offload,
  offload_action, offload_stats in struct nft_expr_ops.

- Add comments for blob_gen_0, blob_gen_1, bound, genmask, udlen, udata,
  blob_next in struct nft_chain.

- Add comment for flags in struct nft_base_chain.

- Add comments for udlen, udata in struct nft_object.

- Add comment for type in struct nft_object_ops.

- Add comment for hook_list in struct nft_flowtable, and remove comments
  for dev_name and ops which are not members of struct nft_flowtable.

Signed-off-by: George Guo <guodongtai@kylinos.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-24 19:50:20 +01:00
Frederic Weisbecker e787644caf rcu: Defer RCU kthreads wakeup when CPU is dying
When the CPU goes idle for the last time during the CPU down hotplug
process, RCU reports a final quiescent state for the current CPU. If
this quiescent state propagates up to the top, some tasks may then be
woken up to complete the grace period: the main grace period kthread
and/or the expedited main workqueue (or kworker).

If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwith timer to the local offline CPU. Since this happens
after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
timer gets ignored. Therefore if the RCU kthreads are waiting for RT
bandwidth to be available, they may never be actually scheduled.

This triggers TREE03 rcutorture hangs:

	 rcu: INFO: rcu_preempt self-detected stall on CPU
	 rcu:     4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
	 rcu:     (t=21035 jiffies g=938281 q=40787 ncpus=6)
	 rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
	 rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
	 rcu: RCU grace-period kthread stack dump:
	 task:rcu_preempt     state:R  running task     stack:14896 pid:14    tgid:14    ppid:2      flags:0x00004000
	 Call Trace:
	  <TASK>
	  __schedule+0x2eb/0xa80
	  schedule+0x1f/0x90
	  schedule_timeout+0x163/0x270
	  ? __pfx_process_timeout+0x10/0x10
	  rcu_gp_fqs_loop+0x37c/0x5b0
	  ? __pfx_rcu_gp_kthread+0x10/0x10
	  rcu_gp_kthread+0x17c/0x200
	  kthread+0xde/0x110
	  ? __pfx_kthread+0x10/0x10
	  ret_from_fork+0x2b/0x40
	  ? __pfx_kthread+0x10/0x10
	  ret_from_fork_asm+0x1b/0x30
	  </TASK>

The situation can't be solved with just unpinning the timer. The hrtimer
infrastructure and the nohz heuristics involved in finding the best
remote target for an unpinned timer would then also need to handle
enqueues from an offline CPU in the most horrendous way.

So fix this on the RCU side instead and defer the wake up to an online
CPU if it's too late for the local one.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2024-01-24 22:46:17 +05:30
Linus Torvalds 1110ebe058 fbdev fixes and cleanups for 6.8-rc2:
- stifb: Fix crash in stifb_blank()
 - savage/sis: Error out if pixclock equals zero
 - minor trivial cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQS86RI+GtKfB8BJu973ErUQojoPXwUCZbEw6wAKCRD3ErUQojoP
 X/8UAQCt7qn3lty18BTgChgYboMNquc0NVTj9cU0+EkwBa4LnAD+ML4QJPIa5HN2
 LetnHIXp03dBM6JAR16+H6HIWBDePwA=
 =whwV
 -----END PGP SIGNATURE-----

Merge tag 'fbdev-for-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev

Pull fbdev fixes and cleanups from Helge Deller:
 "A crash fix in stifb which was missed to be included in the drm-misc
  tree, two checks to prevent wrong userspace input in sisfb and
  savagefb and two trivial printk cleanups:

   - stifb: Fix crash in stifb_blank()

   - savage/sis: Error out if pixclock equals zero

   - minor trivial cleanups"

* tag 'fbdev-for-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
  fbdev: stifb: Fix crash in stifb_blank()
  fbcon: Fix incorrect printed function name in fbcon_prepare_logo()
  fbdev: sis: Error out if pixclock equals zero
  fbdev: savage: Error out if pixclock equals zero
  fbdev: vt8500lcdfb: Remove unnecessary print function dev_err()
2024-01-24 08:55:51 -08:00