Commit graph

Pablo Neira Ayuso 3a07327d10 netfilter: nft_inner: support for inner tunnel header matching
This new expression allows you to match on the inner headers that are
encapsulated by any of the existing tunneling protocols.

This expression parses the inner packet to set the link, network and
transport offsets, so the existing expressions (with a few updates) can
be reused to match on the inner headers.

The inner expression supports different tunnel combinations, such as:

- ethernet frame over IPv4/IPv6 packet, eg. VxLAN.
- IPv4/IPv6 packet over IPv4/IPv6 packet, eg. IPIP.
- IPv4/IPv6 packet over IPv4/IPv6 + transport header, eg. GRE.
- transport header (ESP or SCTP) over transport header (usually UDP)

The following fields are used to describe the tunnel protocol:

- flags, which describe how to parse the inner headers:

  NFT_PAYLOAD_CTX_INNER_TUN, the tunnel provides its own header.
  NFT_PAYLOAD_CTX_INNER_ETHER, the ethernet frame is available as inner header.
  NFT_PAYLOAD_CTX_INNER_NH, the network header is available as inner header.
  NFT_PAYLOAD_CTX_INNER_TH, the transport header is available as inner header.

For example, VxLAN sets all of these flags, while GRE sets only
NFT_PAYLOAD_CTX_INNER_NH and NFT_PAYLOAD_CTX_INNER_TH, and ESP over
UDP sets only NFT_PAYLOAD_CTX_INNER_TH.
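
As an illustration of these combinations (flag names as above; the bit
values here are only an assumption for this sketch, not the kernel's
definitions):

  #define NFT_PAYLOAD_CTX_INNER_TUN   (1 << 0)  /* assumed value */
  #define NFT_PAYLOAD_CTX_INNER_ETHER (1 << 1)  /* assumed value */
  #define NFT_PAYLOAD_CTX_INNER_NH    (1 << 2)  /* assumed value */
  #define NFT_PAYLOAD_CTX_INNER_TH    (1 << 3)  /* assumed value */

  /* VxLAN: own tunnel header plus inner ethernet, network, transport. */
  #define VXLAN_INNER_CTX (NFT_PAYLOAD_CTX_INNER_TUN | \
                           NFT_PAYLOAD_CTX_INNER_ETHER | \
                           NFT_PAYLOAD_CTX_INNER_NH | \
                           NFT_PAYLOAD_CTX_INNER_TH)
  /* GRE: inner network and transport headers only. */
  #define GRE_INNER_CTX   (NFT_PAYLOAD_CTX_INNER_NH | NFT_PAYLOAD_CTX_INNER_TH)
  /* ESP over UDP: inner transport header only. */
  #define ESP_UDP_INNER_CTX NFT_PAYLOAD_CTX_INNER_TH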

The tunnel description is composed of the following attributes:

- header size: in case the tunnel comes with its own header, eg. VxLAN.

- type: this provides a hint to userspace on how to delinearize the rule.
  This is useful for VxLAN and Geneve because they run over UDP, so the
  transport header alone does not identify the tunnel. It is also useful
  in case hardware offload is ever supported. The type is not currently
  interpreted by the kernel.

- expression: currently only the payload expression is supported. A
  follow-up patch also adds inner meta support, which is required by
  autogenerated dependencies. The exthdr expression should be supported
  at some point too. There is a new inner_ops operation that needs to be
  set to allow an existing expression to be used from the inner expression.

This patch adds a new NFT_PAYLOAD_TUN_HEADER base which allows matching
on tunnel header fields, e.g. the VxLAN vni.

The payload expression is embedded into the nft_inner private area, and
this private data area is passed to the payload inner eval function via
a direct call.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25 13:48:42 +02:00
Pablo Neira Ayuso 3927ce8850 netfilter: nft_payload: access ipip payload for inner offset
ipip is a special case: the transport and inner header offsets are set
to the same offset so that the upcoming inner expression can be used for
matching on inner tunnel headers.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25 13:48:42 +02:00
Pablo Neira Ayuso c247897d7c netfilter: nft_payload: access GRE payload via inner offset
Parse GRE v0 packets to properly set up the inner offset; this allows
matching on inner headers.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25 13:48:41 +02:00
Florian Westphal d037abc241 netfilter: nft_objref: make it builtin
nft_objref is needed to reference named objects; it makes
no sense to disable it.

Before:
   text	   data	    bss	    dec	 filename
  4014	    424	      0	   4438	 nft_objref.o
  4174	   1128	      0	   5302	 nft_objref.ko
359351	  15276	    864	 375491	 nf_tables.ko
After:
  text	   data	    bss	    dec	 filename
  3815	    408	      0	   4223	 nft_objref.o
363161	  15692	    864	 379717	 nf_tables.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25 13:48:35 +02:00
Florian Westphal e7a1caa67c netfilter: nf_tables: reduce nft_pktinfo by 8 bytes
The structure is reduced from 32 to 24 bytes.  While at it, also check
that iphdrlen is sane; this is guaranteed for NFPROTO_IPV4 but not
for ingress or bridge, so add checks for those cases.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25 13:44:14 +02:00
Pablo Neira Ayuso ac1f8c0493 netfilter: nft_payload: move struct nft_payload_set definition where it belongs
There is no need to expose this structure definition in nf_tables_core.h;
move it to where it is used, i.e. nft_payload.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25 13:44:14 +02:00
Kees Cook d6dd508080 bnx2: Use kmalloc_size_roundup() to match ksize() usage
Round up allocations with kmalloc_size_roundup() so that build_skb()'s
use of ksize() is always accurate and no special handling of the memory
is needed by KASAN, UBSAN_BOUNDS, nor FORTIFY_SOURCE.
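
A minimal sketch of the pattern (illustrative helper name, not the
driver's actual code):

  #include <linux/slab.h>

  /* Round the request up to the bucket size kmalloc() will actually use,
   * and report that size back, so later ksize()-style use of the spare
   * tail space matches what the sanitizers consider allocated. */
  static void *rx_buf_alloc(unsigned int len, gfp_t gfp, unsigned int *alloc_len)
  {
          size_t size = kmalloc_size_roundup(len);

          *alloc_len = size;
          return kmalloc(size, gfp);
  }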

Cc: Rasesh Mody <rmody@marvell.com>
Cc: GR-Linux-NIC-Dev@marvell.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221022021004.gonna.489-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 12:59:04 +02:00
Paolo Abeni 6459838af0 Merge branch 'mptcp-socket-option-updates'
Mat Martineau says:

====================
mptcp: Socket option updates

Patches 1 and 3 refactor a recent socket option helper function for more
generic use, and make use of it in a couple of places.

Patch 2 adds TCP_FASTOPEN_NO_COOKIE functionality to MPTCP sockets,
similar to TCP_FASTOPEN_CONNECT support recently added in v6.1
====================

Link: https://lore.kernel.org/r/20221022004505.160988-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 12:33:00 +02:00
Matthieu Baerts caea64675d mptcp: sockopt: use new helper for TCP_DEFER_ACCEPT
mptcp_setsockopt_sol_tcp_defer() was doing the same thing as
mptcp_setsockopt_first_sf_only() except for the returned code in case of
error.

Ignoring the error is needed to mimic how TCP_DEFER_ACCEPT is handled
when used with "plain" TCP sockets.

The specific function for TCP_DEFER_ACCEPT can be replaced by the new
mptcp_setsockopt_first_sf_only() helper and errors can be ignored to
stay compatible with TCP. A bit of cleanup.

Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 12:32:54 +02:00
Matthieu Baerts e64d4deb4d mptcp: add TCP_FASTOPEN_NO_COOKIE support
The goal of this socket option is to configure MPTCP + TFO without
cookie per socket.

It was already possible to enable TFO without a cookie per netns by
setting the net.ipv4.tcp_fastopen sysctl knob to the right value, and per
route by setting the 'fastopen_no_cookie' option. This patch adds
per-socket support, like what is possible with TCP, thanks to the
TCP_FASTOPEN_NO_COOKIE socket option.

The only thing to do here is to relay the request to the first subflow,
as is already done for TCP_FASTOPEN_CONNECT.
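
A minimal userspace sketch of what this enables (assumes IPPROTO_MPTCP
and TCP_FASTOPEN_NO_COOKIE; fallback defines added for older headers):

  #include <stdio.h>
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>

  #ifndef IPPROTO_MPTCP
  #define IPPROTO_MPTCP 262
  #endif
  #ifndef TCP_FASTOPEN_NO_COOKIE
  #define TCP_FASTOPEN_NO_COOKIE 43
  #endif

  int main(void)
  {
          int one = 1;
          /* Same setsockopt() as for plain TCP, but on an MPTCP socket. */
          int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

          if (fd < 0 ||
              setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_NO_COOKIE,
                         &one, sizeof(one)) < 0)
                  perror("TCP_FASTOPEN_NO_COOKIE on MPTCP socket");
          return 0;
  }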

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 12:32:54 +02:00
Matthieu Baerts d3d429047c mptcp: sockopt: make 'tcp_fastopen_connect' generic
There are other socket options that need to act only on the first
subflow, e.g. all TCP_FASTOPEN* socket options.

This is similar to the getsockopt version.

In the next commit, this new mptcp_setsockopt_first_sf_only() helper is
used by another option.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 12:32:54 +02:00
Paolo Abeni 818a26048a Merge branch 'soreuseport-fix-broken-so_incoming_cpu'
Kuniyuki Iwashima says:

====================
soreuseport: Fix broken SO_INCOMING_CPU.

setsockopt(SO_INCOMING_CPU) for UDP/TCP has been broken since 4.5/4.6 due to
these commits:

  * e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
  * c125e80b88 ("soreuseport: fast reuseport TCP socket selection")

These commits introduced the O(1) socket selection algorithm and removed
the O(n) iteration over the list, but they ignore the score calculated by
compute_score().  As a result, they caused two misbehaviours:

  * Unconnected sockets receive packets sent to connected sockets
  * SO_INCOMING_CPU does not work

The former is fixed by commit acdcecc612 ("udp: correct reuseport
selection with connected sockets").  This series fixes the latter and
adds some tests for SO_INCOMING_CPU.
====================

Link: https://lore.kernel.org/r/20221021204435.4259-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:35:19 +02:00
Kuniyuki Iwashima 6df96146b2 selftest: Add test for SO_INCOMING_CPU.
Some highly optimised applications use SO_INCOMING_CPU to make themselves
efficient, but they do not verify with getsockopt() that it is working
correctly, to avoid slowing down.  As a result, no one noticed it had
been broken for years, so it's a good time to add a test to catch future
regressions.

The test does

  1) Create $(nproc) TCP listeners associated with each CPU.

  2) Create 32 child sockets for each listener by calling
     sched_setaffinity() for each CPU.

  3) Check if accept()ed sockets' sk_incoming_cpu matches
     the listener's.

If we see -EAGAIN, SO_INCOMING_CPU is broken.  However, we might not see
any error even if it is broken; the kernel could miraculously distribute
all SYNs to the correct listeners.  To keep that from happening, we must
increase the number of clients and CPUs to some extent, so the test
requires $(nproc) >= 2 and creates at least 64 sockets.
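
A minimal sketch of the core check in step 3 (not the selftest's actual
code):

  #include <unistd.h>
  #include <sys/socket.h>

  #ifndef SO_INCOMING_CPU
  #define SO_INCOMING_CPU 49          /* asm-generic value */
  #endif

  /* Returns 0 when the accept()ed socket reports the CPU its listener was
   * associated with, -1 otherwise (i.e. SO_INCOMING_CPU looks broken). */
  static int check_incoming_cpu(int listener_fd, int expected_cpu)
  {
          int fd = accept(listener_fd, NULL, NULL);
          int cpu, ret = -1;
          socklen_t len = sizeof(cpu);

          if (fd < 0)
                  return -1;
          if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) == 0)
                  ret = (cpu == expected_cpu) ? 0 : -1;
          close(fd);
          return ret;
  }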

Test:
  $ nproc
  96
  $ ./so_incoming_cpu

Before the previous patch:

  # Starting 12 tests from 5 test cases.
  #  RUN           so_incoming_cpu.before_reuseport.test1 ...
  # so_incoming_cpu.c:191:test1:Expected cpu (5) == i (0)
  # test1: Test terminated by assertion
  #          FAIL  so_incoming_cpu.before_reuseport.test1
  not ok 1 so_incoming_cpu.before_reuseport.test1
  ...
  # FAILED: 0 / 12 tests passed.
  # Totals: pass:0 fail:12 xfail:0 xpass:0 skip:0 error:0

After:

  # Starting 12 tests from 5 test cases.
  #  RUN           so_incoming_cpu.before_reuseport.test1 ...
  # so_incoming_cpu.c:199:test1:SO_INCOMING_CPU is very likely to be working correctly with 3072 sockets.
  #            OK  so_incoming_cpu.before_reuseport.test1
  ok 1 so_incoming_cpu.before_reuseport.test1
  ...
  # PASSED: 12 / 12 tests passed.
  # Totals: pass:12 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:35:16 +02:00
Kuniyuki Iwashima b261eda84e soreuseport: Fix socket selection for SO_INCOMING_CPU.
Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) has not worked
with setsockopt(SO_REUSEPORT) since v4.6.

With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could
build a highly efficient server application.

setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener
or UDP socket, and then incoming packets processed on that CPU will
likely be distributed to the socket.  Technically, a socket could
even receive packets handled on another CPU if no socket in the
reuseport group is associated with the CPU handling the flow.
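
A minimal userspace sketch of that server pattern (error handling
omitted; SO_INCOMING_CPU guarded for older headers):

  #include <netinet/in.h>
  #include <sys/socket.h>

  #ifndef SO_INCOMING_CPU
  #define SO_INCOMING_CPU 49          /* asm-generic value */
  #endif

  /* One SO_REUSEPORT listener per CPU, pinned with SO_INCOMING_CPU so
   * flows processed on that CPU are handed to its listener. */
  static int make_listener(int cpu, unsigned short port)
  {
          struct sockaddr_in addr = {
                  .sin_family      = AF_INET,
                  .sin_port        = htons(port),
                  .sin_addr.s_addr = htonl(INADDR_ANY),
          };
          int one = 1;
          int fd = socket(AF_INET, SOCK_STREAM, 0);

          setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
          setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu));
          bind(fd, (struct sockaddr *)&addr, sizeof(addr));
          listen(fd, SOMAXCONN);
          return fd;
  }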

The logic exists in compute_score() so that a socket will get a higher
score if it is associated with the same CPU that handles the flow.
However, the score gets ignored after the two blamed commits, which
introduced a faster socket selection algorithm for SO_REUSEPORT.

This patch introduces a counter of sockets with SO_INCOMING_CPU in
a reuseport group to check if we should iterate all sockets to find
a proper one.  We increment the counter when

  * calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT

  * enabling SO_INCOMING_CPU if the socket is in a reuseport group

Also, we decrement it when

  * detaching a socket out of the group to apply SO_INCOMING_CPU to
    migrated TCP requests

  * disabling SO_INCOMING_CPU if the socket is in a reuseport group

When the counter reaches 0, we can get back to the O(1) selection
algorithm.

The overall changes are negligible for the non-SO_INCOMING_CPU case,
and the only notable thing is that we have to update sk_incoming_cpu
under reuseport_lock.  Otherwise, the race below prevents transitioning
to the O(n) algorithm and results in the wrong socket selection.

 cpu1 (setsockopt)               cpu2 (listen)
+-----------------+             +-------------+

lock_sock(sk1)                  lock_sock(sk2)

reuseport_update_incoming_cpu(sk1, val)
.
|  /* set CPU as 0 */
|- WRITE_ONCE(sk1->incoming_cpu, val)
|
|                               spin_lock_bh(&reuseport_lock)
|                               reuseport_grow(sk2, reuse)
|                               .
|                               |- more_socks_size = reuse->max_socks * 2U;
|                               |- if (more_socks_size > U16_MAX &&
|                               |       reuse->num_closed_socks)
|                               |  .
|                               |  |- RCU_INIT_POINTER(sk1->sk_reuseport_cb, NULL);
|                               |  `- __reuseport_detach_closed_sock(sk1, reuse)
|                               |     .
|                               |     `- reuseport_put_incoming_cpu(sk1, reuse)
|                               |        .
|                               |        |  /* Read shutdown()ed sk1's sk_incoming_cpu
|                               |        |   * without lock_sock().
|                               |        |   */
|                               |        `- if (sk1->sk_incoming_cpu >= 0)
|                               |           .
|                               |           |  /* decrement not-yet-incremented
|                               |           |   * count, which is never incremented.
|                               |           |   */
|                               |           `- __reuseport_put_incoming_cpu(reuse);
|                               |
|                               `- spin_lock_bh(&reuseport_lock)
|
|- spin_lock_bh(&reuseport_lock)
|
|- reuse = rcu_dereference_protected(sk1->sk_reuseport_cb, ...)
|- if (!reuse)
|  .
|  |  /* Cannot increment reuse->incoming_cpu. */
|  `- goto out;
|
`- spin_unlock_bh(&reuseport_lock)

Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Fixes: c125e80b88 ("soreuseport: fast reuseport TCP socket selection")
Reported-by: Kazuho Oku <kazuhooku@gmail.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:35:16 +02:00
Paolo Abeni 71920a773e Merge branch 'net-ipa-validation-cleanup'
Alex Elder says:

====================
net: ipa: validation cleanup

This series gathers a set of IPA driver cleanups, mostly involving
code that ensures certain things are known to be correct *early*
(either at build or initialization time), so they can be assumed good
during normal operation.

The first removes three constant symbols, by making a (reasonable)
assumption that a routing table consists of entries for the modem
followed by entries for the AP, with no unused entries between them.

The second removes two checks that are redundant (they verify the
sizes of two memory regions are in range, which will have been done
earlier for all regions).

The third adds some new checks to routing and filter tables that
can be done at "init time" (without requiring any access to IPA
hardware).

The fourth moves a check that routing and filter table addresses can
be encoded within certain IPA immediate commands, so it's performed
earlier; the checks can be done without touching IPA hardware.  The
fifth moves some other command-related checks earlier, for the same
reason.

The sixth removes the definition of ipa_table_valid(), because what it
does has become redundant.  Finally, the last patch moves two more
validation calls so they're done very early in the probe process.
This will be required by some upcoming patches, which will record
the size of the routing and filter tables at this time so they're
available for subsequent initialization.
====================

Link: https://lore.kernel.org/r/20221021191340.4187935-1-elder@linaro.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:21 +02:00
Alex Elder 73da9cac51 net: ipa: check table memory regions earlier
Verify that the sizes of the routing and filter table memory regions
are valid as part of memory initialization, rather than waiting for
table initialization.  The main reason to do this is that upcoming
patches use these memory region sizes to determine the number of
entries in these tables, and we'll want to know these sizes are good
sooner.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:19 +02:00
Alex Elder 39ad815244 net: ipa: kill ipa_table_valid()
What ipa_table_valid() (and ipa_table_valid_one(), which it calls)
does is ensure that the memory regions that hold routing and filter
tables have reasonable size.  Specifically, it checks that the size
of a region is sufficient (or rather, exactly the right size) to
hold the maximum number of entries supported by the driver.  (There
is an additional check that's erroneous, but in practice it is never
reached.)

Recently ipa_table_mem_valid() was added, which is called by
ipa_table_init().  That function verifies that all table memory
regions are of sufficient size, and requires hashed tables to have
zero size if hashing is not supported.  It only ensures the filter
table is large enough to hold the number of endpoints that support
filtering, but that is adequate.

Therefore everything that ipa_table_valid() does is redundant, so
get rid of it.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:19 +02:00
Alex Elder 7fd10a2aca net: ipa: introduce ipa_cmd_init()
Currently, ipa_cmd_data_valid() is called by ipa_mem_config().
Nothing it does requires access to hardware though, so it can be
done during the init phase of IPA driver startup.

Create a new function ipa_cmd_init(), whose purpose is to do early
initialization related to IPA immediate commands.  It will call the
build-time validation function, then will make the two calls made
previously by ipa_cmd_data_valid().  This makes ipa_cmd_data_valid()
unnecessary, so get rid of it.

Rename ipa_cmd_header_valid() to be ipa_cmd_header_init_local_valid(),
so its name is clearer about which IPA immediate command it is
associated with.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:19 +02:00
Alex Elder 5444b0ea99 net: ipa: verify table sizes fit in commands early
We currently verify, in ipa_table_valid_one(), that the table size and
offset fit in the immediate command fields that must encode them.  We
can now make this check earlier, in ipa_table_mem_valid().

The non-hashed IPv4 filter and route tables will always exist, and
their sizes will match the IPv6 tables, as well as the hashed tables
(if supported).  So it's sufficient to verify the offset and size of
the IPv4 non-hashed tables fit into these fields.

Rename the function to ipa_cmd_table_init_valid(), to reinforce that
it is the TABLE_INIT immediate command fields we're checking.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:18 +02:00
Alex Elder cf13919654 net: ipa: validate IPA table memory earlier
Add checks in ipa_table_init() to ensure the memory regions defined
for IPA filter and routing tables are valid.

For routing tables, the checks ensure:
  - The non-hashed IPv4 and IPv6 routing tables are defined
  - The non-hashed IPv4 and IPv6 routing tables are the same size
  - The number of entries in the non-hashed IPv4 routing table is enough
    to hold the number of entries available to the modem, plus at least
    one usable by the AP.

For filter tables, the checks ensure:
  - The non-hashed IPv4 and IPv6 filter tables are defined
  - The non-hashed IPv4 and IPv6 filter tables are the same size
  - The number of entries in the non-hashed IPv4 filter table is enough
    to hold the endpoint bitmap, plus an entry for each defined
    endpoint that supports filtering.

In addition, for both routing and filter tables:
  - If hashing isn't supported (IPA v4.2), hashed tables are zero size
  - If hashing *is* supported, all hashed tables are the same size as
    their non-hashed counterparts.

When validating the size of routing tables, require the AP to have
at least one entry (in addition to those used by the modem).

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:18 +02:00
Alex Elder 2554322b31 net: ipa: remove two memory region checks
There's no need to ensure table memory regions fit within the
IPA-local memory range.  And there's no need to ensure the modem
header memory region is in range either.  These are verified for all
memory regions in ipa_mem_size_valid(), once we have settled on the
size of IPA memory.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:18 +02:00
Alex Elder fb4014ac76 net: ipa: kill two constant symbols
The entries in each IPA routing table are divided between the modem
and the AP.  The modem always gets some number of entries located at
the base of the table; the AP gets all those that follow.

There's no reason to think the modem will use anything different
from the first entries in a routing table, so:
  - Get rid of IPA_ROUTE_MODEM_MIN (just assume it's 0)
  - Get rid of IPA_ROUTE_AP_MIN (just assume it's IPA_ROUTE_MODEM_COUNT)
And finally:
  - Open-code IPA_ROUTE_AP_COUNT and remove its definition

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 11:15:18 +02:00
Paolo Abeni 34802d0662 Merge branch 'extend-action-skbedit-to-rx-queue-mapping'
Amritha Nambiar says:

====================
Extend action skbedit to RX queue mapping

Based on the discussion on
https://lore.kernel.org/netdev/166260012413.81018.8010396115034847972.stgit@anambiarhost.jf.intel.com/ ,
the following series extends skbedit tc action to RX queue mapping.
Currently, skbedit action in tc allows overriding of transmit queue.
Extending this ability of the skbedit action supports the selection of a
receive queue for incoming packets. On the receive side, this action
is supported only in hardware, so the skip_sw flag is enforced.

The ice driver is enabled to offload this type of filter into the
hardware, accepting packets into the requested device receive queue.
====================

Link: https://lore.kernel.org/r/166633888716.52141.3425659377117969638.stgit@anambiarhost.jf.intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 10:32:43 +02:00
Amritha Nambiar d5ae8ecf38 Documentation: networking: TC queue based filtering
Add tc-queue-filters.rst with notes on TC filters for
selecting a set of queues and/or a queue.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 10:32:40 +02:00
Amritha Nambiar 143b86f346 ice: Enable RX queue selection using skbedit action
This patch uses the TC skbedit queue_mapping action to support
forwarding packets to a device queue. Such forward-to-queue filters
will be the highest priority switch filters in HW.
Example:
$ tc filter add dev ens4f0 protocol ip ingress flower\
  dst_ip 192.168.1.12 ip_proto tcp dst_port 5001\
  action skbedit queue_mapping 5 skip_sw

The above command adds an ingress filter; incoming packets
matching the filter will be accepted into queue 5. The queue
number is in decimal format.

Refactored ice_add_tc_flower_adv_fltr() to consolidate code for the
FWD_TO_VSI and FWD_TO_QUEUE actions.

Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 10:32:40 +02:00
Amritha Nambiar 4a6a676f8c act_skbedit: skbedit queue mapping for receive queue
Add support for skbedit queue mapping action on receive
side. This is supported only in hardware, so the skip_sw
flag is enforced. This enables offloading filters for
receive queue selection in the hardware using the
skbedit action. Traffic arrives on the Rx queue requested
in the skbedit action parameter. A new tc action flag
TCA_ACT_FLAGS_AT_INGRESS is introduced to identify the
traffic direction the action queue_mapping is requested
on during filter addition. This is used to disallow
offloading the skbedit queue mapping action on transmit
side.

Example:
$ tc filter add dev $IFACE ingress protocol ip flower dst_ip $DST_IP\
 action skbedit queue_mapping $rxq_id skip_sw

Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25 10:32:40 +02:00
Jakub Kicinski 6143eca357 Merge branch 'net-sfp-improve-high-power-module-implementation'
Russell King says:

====================
net: sfp: improve high power module implementation

This series aims to improve the power level switching between standard
level 1 and the higher power levels.

The first patch updates the DT binding documentation to include the
minimum and default of 1W, which is the base level that every SFP cage
must support. Hence, it makes sense to document this in the binding.

The second patch enforces a minimum of 1W when parsing the firmware
description, and optimises the code for that case; there's no need to
check for SFF8472 compliance since we will not need to touch the
A2h registers.

Patch 3 validates that the module supports SFF-8472 rev 10.2 before
checking for power level 2 - rev 10.2 is where support for power
levels was introduced, so if the module doesn't support this revision,
it doesn't support power levels, and a set power level 2 declaration
bit is likely to be spurious.

Patch 4 does the same for power level 3, except this was introduced in
SFF-8472 rev 11.9. The revision code was never updated, so we use
rev 11.4 to signify this.

Patch 5 cleans up the code - rather than using BIT(0), we now use a
properly named value for the power level select bit.

Patch 6 introduces a read-modify-write helper.

Patch 7 gets rid of the DM7052 hack (which sets a power level
declaration bit but is not compatible with SFF-8472 rev 10.2, and
the module does not implement the A2h I2C address).

Series tested with my DM7052.

v2: update sff.sfp.yaml with Rob's feedback
====================

Andrew's review tags from v1.

Link: https://lore.kernel.org/r/Y0%2F7dAB8OU3jrbz6@shell.armlinux.org.uk
Link: https://lore.kernel.org/r/Y1K17UtfFopACIi2@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:06:30 -07:00
Russell King (Oracle) bd1432f68d net: sfp: get rid of DM7052 hack when enabling high power
Since we no longer mis-detect high-power mode with the DM7052 module,
we no longer need the hack in sfp_module_enable_high_power(), and can
now switch this to use sfp_modify_u8().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:06:24 -07:00
Russell King (Oracle) a3c536fc75 net: sfp: add sfp_modify_u8() helper
Add a helper to modify bits in a single byte in memory space, and use
it when updating the soft tx-disable flag in the module.
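
The underlying read-modify-write pattern, as a generic sketch (not the
driver's implementation, which goes through the SFP I2C accessors):

  #include <stdint.h>

  /* Clear the bits selected by mask, then set those bits from val. */
  static uint8_t modify_u8(uint8_t old, uint8_t mask, uint8_t val)
  {
          return (old & ~mask) | (val & mask);
  }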

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:06:20 -07:00
Russell King (Oracle) 3989004984 net: sfp: provide a definition for the power level select bit
Provide a named definition for the power level select bit in the
extended status register, rather than using BIT(0) in the code.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:06:16 -07:00
Russell King (Oracle) f8810ca758 net: sfp: ignore power level 3 prior to SFF-8472 Rev 11.4
Power level 3 was included in SFF-8472 revision 11.9, but this does
not have a compliance code. Use revision 11.4 as the minimum
compliance level instead.

This should avoid any spurious indication of 2W modules.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:06:12 -07:00
Russell King (Oracle) 18cc659e95 net: sfp: ignore power level 2 prior to SFF-8472 Rev 10.2
Power level 2 was introduced by SFF-8472 revision 10.2. Ignore
the power declaration bit for modules that are not compliant with
at least this revision.

This should remove any spurious indication of 1.5W modules.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:06:06 -07:00
Russell King (Oracle) 02eaf5a791 net: sfp: check firmware provided max power
Check that the firmware provided maximum power is at least 1W, which
is the minimum power level for any SFP module.

Now that we enforce the minimum of 1W, we can exit early from
sfp_module_parse_power() if the module power is 1W or less.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:05:54 -07:00
Russell King (Oracle) a272bcb9e5 dt-bindings: net: sff,sfp: update binding
Add a minimum and default for the maximum-power-milliwatt option;
module power levels were originally up to 1W, so this is the default
and the minimum power level we can have for a functional SFP cage.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 21:05:45 -07:00
Jakub Kicinski 1b3d6ecd41 Merge branch 'bnxt_en-driver-updates'
Michael Chan says:

====================
bnxt_en: Driver updates

This patchset adds .get_module_eeprom_by_page() support and adds
an NVRAM resize step to allow larger firmware images to be flashed
to older firmware.
====================

Link: https://lore.kernel.org/r/1666334243-23866-1-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 19:24:20 -07:00
Vikas Gupta 4503422462 bnxt_en: check and resize NVRAM UPDATE entry before flashing
A resize of the UPDATE entry is required if the image to
be flashed is larger than the available space. Add this step;
otherwise, flashing larger firmware images via ethtool or devlink
may fail.

Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 19:24:14 -07:00
Vikas Gupta 7ef3d3901b bnxt_en: add .get_module_eeprom_by_page() support
Add support for the .get_module_eeprom_by_page() callback, which
implements a generic solution for module EEPROM access.
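
For reference, the general shape of such an ethtool callback (a hedged
sketch, not bnxt's implementation):

  #include <linux/errno.h>
  #include <linux/ethtool.h>

  static int foo_get_module_eeprom_by_page(struct net_device *dev,
                          const struct ethtool_module_eeprom *page,
                          struct netlink_ext_ack *extack)
  {
          /* Read page->length bytes at page->offset from the module page
           * selected by page->page / page->bank / page->i2c_address into
           * page->data; return the number of bytes read or a negative
           * errno, setting an extack message on failure. */
          return -EOPNOTSUPP;
  }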

v3: Add bnxt_get_module_status() to get a more specific extack error
    string.
    Return -EINVAL from bnxt_get_module_eeprom_by_page() when we
    don't want to fallback to old method.
v2: Simplification suggested by Ido Schimmel

Link: https://lore.kernel.org/netdev/YzVJ%2FvKJugoz15yV@shredder/
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 19:24:14 -07:00
Michael Chan 84a911db83 bnxt_en: Update firmware interface to 1.10.2.118
The main changes are PTM timestamp support, CMIS EEPROM support, and
asymmetric CoS queues support.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 19:24:14 -07:00
Jakub Kicinski 96917bb3a3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/linux/net.h
  a5ef058dc4 ("net: introduce and use custom sockopt socket flag")
  e993ffe3da ("net: flag sockets supporting msghdr originated zerocopy")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 13:44:11 -07:00
Linus Torvalds 337a0a0b63 Including fixes from bpf.
Current release - regressions:
 
  - eth: fman: re-expose location of the MAC address to userspace,
    apparently some udev scripts depended on the exact value
 
 Current release - new code bugs:
 
  - bpf:
    - wait for busy refill_work when destroying bpf memory allocator
    - allow bpf_user_ringbuf_drain() callbacks to return 1
    - fix dispatcher patchable function entry to 5 bytes nop
 
 Previous releases - regressions:
 
  - net-memcg: avoid stalls when under memory pressure
 
  - tcp: fix indefinite deferral of RTO with SACK reneging
 
  - tipc: fix a null-ptr-deref in tipc_topsrv_accept
 
  - eth: macb: specify PHY PM management done by MAC
 
  - tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
 
 Previous releases - always broken:
 
  - eth: amd-xgbe: SFP fixes and compatibility improvements
 
 Misc:
 
  - docs: netdev: offer performance feedback to contributors
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf.

  The net-memcg fix stands out, the rest is very run-of-the-mill. Maybe
  I'm biased.

  Current release - regressions:

   - eth: fman: re-expose location of the MAC address to userspace,
     apparently some udev scripts depended on the exact value

  Current release - new code bugs:

   - bpf:
       - wait for busy refill_work when destroying bpf memory allocator
       - allow bpf_user_ringbuf_drain() callbacks to return 1
       - fix dispatcher patchable function entry to 5 bytes nop

  Previous releases - regressions:

   - net-memcg: avoid stalls when under memory pressure

   - tcp: fix indefinite deferral of RTO with SACK reneging

   - tipc: fix a null-ptr-deref in tipc_topsrv_accept

   - eth: macb: specify PHY PM management done by MAC

   - tcp: fix a signed-integer-overflow bug in tcp_add_backlog()

  Previous releases - always broken:

   - eth: amd-xgbe: SFP fixes and compatibility improvements

  Misc:

   - docs: netdev: offer performance feedback to contributors"

* tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
  net-memcg: avoid stalls when under memory pressure
  tcp: fix indefinite deferral of RTO with SACK reneging
  tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
  net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY
  net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed
  docs: netdev: offer performance feedback to contributors
  kcm: annotate data-races around kcm->rx_wait
  kcm: annotate data-races around kcm->rx_psock
  net: fman: Use physical address for userspace interfaces
  net/mlx5e: Cleanup MACsec uninitialization routine
  atlantic: fix deadlock at aq_nic_stop
  nfp: only clean `sp_indiff` when application firmware is unloaded
  amd-xgbe: add the bit rate quirk for Molex cables
  amd-xgbe: fix the SFP compliance codes check for DAC cables
  amd-xgbe: enable PLL_CTL for fixed PHY modes only
  amd-xgbe: use enums for mailbox cmd and sub_cmds
  amd-xgbe: Yellow carp devices do not need rrc
  bpf: Use __llist_del_all() whenever possbile during memory draining
  bpf: Wait for busy refill_work when destroying bpf memory allocator
  MAINTAINERS: add keyword match on PTP
  ...
2022-10-24 12:43:51 -07:00
Linus Torvalds f6602a97a1 Urgent RCU pull request for v6.1
This pull request contains a commit that fixes bf95b2bc3e ("rcu: Switch
 polled grace-period APIs to ->gp_seq_polled"), which could incorrectly
 leave interrupts enabled after an early-boot call to synchronize_rcu().
 Such synchronize_rcu() calls must acquire leaf rcu_node locks in order to
 properly interact with polled grace periods, but the code did not take
 into account the possibility of synchronize_rcu() being invoked from
 the portion of the boot sequence during which interrupts are disabled.
 This commit therefore switches the lock acquisition and release from
 irq to irqsave/irqrestore.

Merge tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU fix from Paul McKenney:
 "Fix a regression caused by commit bf95b2bc3e ("rcu: Switch polled
  grace-period APIs to ->gp_seq_polled"), which could incorrectly leave
  interrupts enabled after an early-boot call to synchronize_rcu().

  Such synchronize_rcu() calls must acquire leaf rcu_node locks in order
  to properly interact with polled grace periods, but the code did not
  take into account the possibility of synchronize_rcu() being invoked
  from the portion of the boot sequence during which interrupts are
  disabled.

  This commit therefore switches the lock acquisition and release from
  irq to irqsave/irqrestore"

* tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  rcu: Keep synchronize_rcu() from enabling irqs in early boot
2022-10-24 12:33:30 -07:00
Linus Torvalds 2a91e897c0 linux-kselftest-kunit-fixes-6.1-rc3
This KUnit fixes update for Linux 6.1-rc3 consists of one single fix
 to update alloc_string_stream() callers to check for IS_ERR() instead
 of NULL to be in sync with alloc_string_stream() returning IS_ERR().

Merge tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull KUnit fixes from Shuah Khan:
 "One single fix to update alloc_string_stream() callers to check for
  IS_ERR() instead of NULL to be in sync with alloc_string_stream()
  returning an ERR_PTR()"

* tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: update NULL vs IS_ERR() tests
2022-10-24 12:19:34 -07:00
Linus Torvalds 21c92498e9 linux-kselftest-fixes-6.1-rc3
This Kselftest fixes update for Linux 6.1-rc3 consists of:
 
 - futex, intel_pstate, kexec build fixes
 - ftrace dynamic_events dependency check fix
 - memory-hotplug fix to remove redundant warning from test report

Merge tag 'linux-kselftest-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull Kselftest fixes from Shuah Khan:

 - futex, intel_pstate, kexec build fixes

 - ftrace dynamic_events dependency check fix

 - memory-hotplug fix to remove redundant warning from test report

* tag 'linux-kselftest-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests/ftrace: fix dynamic_events dependency check
  selftests/memory-hotplug: Remove the redundant warning information
  selftests/kexec: fix build for ARCH=x86_64
  selftests/intel_pstate: fix build for ARCH=x86_64
  selftests/futex: fix build for clang
2022-10-24 12:10:55 -07:00
Linus Torvalds 74d5b415a5 Some pin control fixes for v6.1:
- Fix typos in UART1 and MMC in the Ingenic driver.
 
 - A really well researched glitch bug fix to the Qualcomm driver
   that was tracked down and fixed by Doug Anderson from
   Chromium. Hats off for this one!
 
 - Revert two patches on the Xilinx ZynqMP driver: this needs a
   proper solution making use of firmware version information to
   adapt to different firmware releases.
 
 - Fix interrupt triggers in the Ocelot driver.

Merge tag 'pinctrl-v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:

 - Fix typos in UART1 and MMC in the Ingenic driver

 - A really well researched glitch bug fix to the Qualcomm driver that
   was tracked down and fixed by Doug Anderson from Chromium. Hats off
   for this one!

 - Revert two patches on the Xilinx ZynqMP driver: this needs a proper
   solution making use of firmware version information to adapt to
   different firmware releases

 - Fix interrupt triggers in the Ocelot driver

* tag 'pinctrl-v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: ocelot: Fix incorrect trigger of the interrupt.
  Revert "dt-bindings: pinctrl-zynqmp: Add output-enable configuration"
  Revert "pinctrl: pinctrl-zynqmp: Add support for output-enable and bias-high-impedance"
  pinctrl: qcom: Avoid glitching lines when we first mux to output
  pinctrl: Ingenic: JZ4755 bug fixes
2022-10-24 11:48:30 -07:00
Jakub Kicinski 720ca52bce net-memcg: avoid stalls when under memory pressure
As Shakeel explains, the commit under Fixes had the unintended
side effect of no longer pre-loading the cached memory allowance.
Even though we previously dropped the first packet received when
over the memory limit, the consecutive ones would get through by
using the cache. The charging was happening in batches of 128kB, so
we'd let in 128kB (truesize) worth of packets per drop.

After the change we no longer force charge, so there are no
cache-filling side effects. This causes significant drops and
connection stalls for workloads which use a lot of page cache,
since we can't reclaim page cache under GFP_NOWAIT.

Some of the latency can be recovered by improving SACK reneg
handling, but nowhere near enough to get back to the pre-5.15
performance (the application I'm experimenting with still
sees 5-10x worse latency).

Apply the suggested workaround of using GFP_ATOMIC. We will now
be more permissive than previously as we'll drop _no_ packets
in softirq when under pressure. But I can't think of any good
and simple way to address that within networking.

Link: https://lore.kernel.org/all/20221012163300.795e7b86@kernel.org/
Suggested-by: Shakeel Butt <shakeelb@google.com>
Fixes: 4b1327be9f ("net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()")
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20221021160304.1362511-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 10:35:09 -07:00
Neal Cardwell 3d2af9cce3 tcp: fix indefinite deferral of RTO with SACK reneging
This commit fixes a bug that can cause a TCP data sender to repeatedly
defer RTOs when encountering SACK reneging.

The bug is that when we're in fast recovery in a scenario with SACK
reneging, every time we get an ACK we call tcp_check_sack_reneging()
and it can note the apparent SACK reneging and rearm the RTO timer for
srtt/2 into the future. In some SACK reneging scenarios that can
happen repeatedly until the receive window fills up, at which point
the sender can't send any more, the ACKs stop arriving, and the RTO
fires at srtt/2 after the last ACK. But that can take far too long
(O(10 secs)), since the connection is stuck in fast recovery with a
low cwnd that cannot grow beyond ssthresh, even if more bandwidth is
available.

This fix changes the logic in tcp_check_sack_reneging() to only rearm
the RTO timer if data is cumulatively ACKed, indicating forward
progress. This avoids this kind of nearly infinite loop of RTO timer
re-arming. In addition, this meets the goals of
tcp_check_sack_reneging() in handling Windows TCP behavior that looks
temporarily like SACK reneging but is not really.

Many thanks to Jakub Kicinski and Neil Spring, who reported this issue
and provided critical packet traces that enabled root-causing this
issue. Also, many thanks to Jakub Kicinski for testing this fix.

Fixes: 5ae344c949 ("tcp: reduce spurious retransmits due to transient SACK reneging")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: Neil Spring <ntspring@fb.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Tested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20221021170821.1093930-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 10:34:48 -07:00
Jakub Kicinski e28c44450b bpf-for-netdev

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Alexei Starovoitov says:

====================
pull-request: bpf 2022-10-23

We've added 7 non-merge commits during the last 18 day(s) which contain
a total of 8 files changed, 69 insertions(+), 5 deletions(-).

The main changes are:

1) Wait for busy refill_work when destroying bpf memory allocator, from Hou.

2) Allow bpf_user_ringbuf_drain() callbacks to return 1, from David.

3) Fix dispatcher patchable function entry to 5 bytes nop, from Jiri.

4) Prevent decl_tag from being referenced in func_proto, from Stanislav.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Use __llist_del_all() whenever possbile during memory draining
  bpf: Wait for busy refill_work when destroying bpf memory allocator
  bpf: Fix dispatcher patchable function entry to 5 bytes nop
  bpf: prevent decl_tag from being referenced in func_proto
  selftests/bpf: Add reproducer for decl_tag in func_proto return type
  selftests/bpf: Make bpf_user_ringbuf_drain() selftest callback return 1
  bpf: Allow bpf_user_ringbuf_drain() callbacks to return 1
====================

Link: https://lore.kernel.org/r/20221023192244.81137-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 10:32:01 -07:00
David S. Miller 86d6f77a3c Merge branch 'ptp-ocxp-Oroli-ART-CARD'
Vadim Fedorenko says:

====================
ptp: ocp: add support for Orolia ART-CARD

The Orolia company created an alternative open source TimeCard. The card's
hardware provides functions similar to the OCP card's, which is why support
is added to the current driver.

The first patch in the series changes the way information about serial
ports is stored and is mostly preparatory.

Patches 2 to 4 introduce the actual hardware support.

The last patch removes the fallback from the devlink flashing interface to
protect against flashing the wrong image. This has become relevant now that
two different boards are supported and the wrong image can easily ruin the
hardware.

v2:
  Address comments from Jonathan Lemon

v3:
  Fix issue reported by kernel test robot <lkp@intel.com>

v4:
  Fix clang build issue

v5:
  Fix warnings and per-patch build errors

v6:
  Fix more style issues
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 13:10:40 +01:00
Vadim Fedorenko c1fd463d57 ptp: ocp: remove flash image header check fallback
Previously there was a fallback mode to flash a firmware image without a
proper header. But now we support different vendors, and flashing the
wrong image could destroy the hardware. Remove the fallback mode and
enforce the header check. Both vendors have published firmware images
with headers.

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 13:10:40 +01:00
Vadim Fedorenko ee6439aaad ptp: ocp: expose config and temperature for ART card
The Orolia card has its disciplining configuration and temperature table
stored in EEPROM. This patch exposes them as binary attributes to
provide read and write access.

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Co-developed-by: Charles Parent <charles.parent@orolia2s.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 13:10:40 +01:00