Commit graph

663 commits

Author SHA1 Message Date
Erick Archer 14b526f55b RDMA/uverbs: Remove flexible arrays from struct *_filter
When a struct containing a flexible array is included in another struct,
and there is a member after the struct-with-flex-array, there is a
possibility of memory overlap. These cases must be audited [1]. See:

struct inner {
	...
	int flex[];
};

struct outer {
	...
	struct inner header;
	int overlap;
	...
};

This is the scenario for all the "struct *_filter" structures that are
included in the following "struct ib_flow_spec_*" structures:

struct ib_flow_spec_eth
struct ib_flow_spec_ib
struct ib_flow_spec_ipv4
struct ib_flow_spec_ipv6
struct ib_flow_spec_tcp_udp
struct ib_flow_spec_tunnel
struct ib_flow_spec_esp
struct ib_flow_spec_gre
struct ib_flow_spec_mpls

The pattern is like the one shown below:

struct *_filter {
	...
	u8 real_sz[];
};

struct ib_flow_spec_* {
	...
	struct *_filter val;
	struct *_filter mask;
};

In this case, the trailing flexible array "real_sz" is never allocated
and is only used to calculate the size of the structures. Here the use
of the "offsetof" helper can be changed by the "sizeof" operator because
the goal is to get the size of these structures. Therefore, the trailing
flexible arrays can also be removed.

However, due to the trailing padding that can be induced in structs it
is possible that the:

offsetof(struct *_filter, real_sz) != sizeof(struct *_filter)

This situation happens with the "struct ib_flow_ipv6_filter" and to
avoid it the "__packed" macro is used in this structure. But now, the
"sizeof(struct ib_flow_ipv6_filter)" has changed. This is not a problem
since this size is not used in the code.

The situation now is that "sizeof(struct ib_flow_spec_ipv6)" has also
changed (this struct contains the struct ib_flow_ipv6_filter). This is
also not a problem since it is only used to set the size of the "union
ib_flow_spec", which can store all the "ib_flow_spec_*" structures.

Link: https://lore.kernel.org/r/20240217142913.4285-1-erick.archer@gmx.com
Signed-off-by: Erick Archer <erick.archer@gmx.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-02-21 13:28:52 -04:00
Mike Marciniszyn 4fbc3a52cd RDMA/core: Fix umem iterator when PAGE_SIZE is greater then HCA pgsz
64k pages introduce the situation in this diagram when the HCA 4k page
size is being used:

 +-------------------------------------------+ <--- 64k aligned VA
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+
 |                   o                       |
 |                                           |
 |                   o                       |
 |                                           |
 |                   o                       |
 +-------------------------------------------+
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+ <--- Live HCA page
 |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO| <--- offset
 |                                           | <--- VA
 |                MR data                    |
 +-------------------------------------------+
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+
 |                   o                       |
 |                                           |
 |                   o                       |
 |                                           |
 |                   o                       |
 +-------------------------------------------+
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+

The VA addresses are coming from rdma-core in this diagram can be
arbitrary, but for 64k pages, the VA may be offset by some number of HCA
4k pages and followed by some number of HCA 4k pages.

The current iterator doesn't account for either the preceding 4k pages or
the following 4k pages.

Fix the issue by extending the ib_block_iter to contain the number of DMA
pages like comment [1] says and by using __sg_advance to start the
iterator at the first live HCA page.

The changes are contained in a parallel set of iterator start and next
functions that are umem aware and specific to umem since there is one user
of the rdma_for_each_block() without umem.

These two fixes prevents the extra pages before and after the user MR
data.

Fix the preceding pages by using the __sq_advance field to start at the
first 4k page containing MR data.

Fix the following pages by saving the number of pgsz blocks in the
iterator state and downcounting on each next.

This fix allows for the elimination of the small page crutch noted in the
Fixes.

Fixes: 10c75ccb54 ("RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()")
Link: https://lore.kernel.org/r/20231129202143.1434-2-shiraz.saleem@intel.com
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-12-04 20:02:41 -04:00
Chuck Lever 0aa44595d6 RDMA/core: Fix a couple of obvious typos in comments
Fix typos.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/169643338101.8035.6826446669479247727.stgit@manet.1015granger.net
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-10-04 21:55:44 +03:00
Kees Cook 4755dc6f29 RDMA: Annotate struct rdma_hw_stats with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

As found with Coccinelle[1], add __counted_by for struct rdma_hw_stats.

[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci

Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230929180431.3005464-1-keescook@chromium.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-10-02 14:44:54 +03:00
Or Har-Toov 703289ce43 IB/core: Add support for XDR link speed
Add new IBTA speed XDR, the new rate that was added to Infiniband spec
as part of XDR and supporting signaling rate of 200Gb.

In order to report that value to rdma-core, add new u32 field to
query_port response.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/9d235fc600a999e8274010f0e18b40fa60540e6c.1695204156.git.leon@kernel.org
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-09-26 12:38:39 +03:00
wenglianfa aebf8145e1 RDMA/core: Add support to dump SRQ resource in RAW format
Add support to dump SRQ resource in raw format. It enable drivers to
return the entire device specific SRQ context without setting each
field separately.

Example:
$ rdma res show srq -r
dev hns3 149000...

$ rdma res show srq -j -r
[{"ifindex":0,"ifname":"hns3","data":[149,0,0,...]}]

Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Link: https://lore.kernel.org/r/20230918131110.3987498-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-09-20 10:50:54 +03:00
wenglianfa 0e32d7d43b RDMA/core: Add dedicated SRQ resource tracker function
Add a dedicated callback function for SRQ resource tracker.

Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Link: https://lore.kernel.org/r/20230918131110.3987498-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-09-20 10:50:54 +03:00
Yue Haibing 40cc695d63 RDMA Remove unused function declarations
Commit c2261dd76b ("RDMA/device: Add ib_device_set_netdev() as an alternative to get_netdev")
declared but never implemented ib_device_netdev(), remove it.
Commit 922a8e9fb2 ("RDMA: iWARP Connection Manager.") declared but never implemented
iw_cm_unbind_qp() and iw_cm_get_qp().

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20230809142718.42316-1-yuehaibing@huawei.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-08-13 10:32:35 +03:00
Jason Gunthorpe 8d7c7c0eeb RDMA: Add ib_virt_dma_to_page()
Make it clearer what is going on by adding a function to go back from the
"virtual" dma_addr to a kva and another to a struct page. This is used in the
ib_uses_virt_dma() style drivers (siw, rxe, hfi, qib).

Call them instead of a naked casting and  virt_to_page() when working with dma_addr
values encoded by the various ib_map functions.

This also fixes the virt_to_page() casting problem Linus Walleij has been
chasing.

Cc: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v2-05ea785520ed+10-ib_virt_page_jgg@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-04-16 11:08:07 +03:00
Mark Zhang 312b8f79eb RDMA/mlx: Calling qp event handler in workqueue context
Move the call of qp event handler from atomic to workqueue context,
so that the handler is able to block. This is needed by following
patches.

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://lore.kernel.org/r/0cd17b8331e445f03942f4bb28d447f24ac5669d.1672821186.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-01-15 12:23:10 +02:00
Li Zhijian 208e3a134b RDMA: Extend RDMA kernel verbs ABI to support flush
This commit extends the RDMA kernel verbs ABI to support the flush
operation defined in IBA A19.4.1. These changes are
backward compatible with the existing RDMA kernel verbs ABI.

It makes device/HCA support new FLUSH attributes/capabilities, and it
also makes memory region support new FLUSH access flags.

Users can use ibv_reg_mr(3) to register flush access flags. Only the
access flags also supported by device's capabilities can be registered
successfully.

Once registered successfully, it means the MR is flushable. Similarly,
A flushable MR should also have one or both of GLOBAL_VISIBILITY and
PERSISTENT attributes/capabilities like device/HCA.

Link: https://lore.kernel.org/r/20221206130201.30986-3-lizhijian@fujitsu.com
Reviewed-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-09 19:36:01 -04:00
Xiao Yang 3ff81e827b RDMA: Extend RDMA kernel ABI to support atomic write
1) Define new atomic write request/completion in kernel.
2) Define new atomic write capability in kernel.
3) Define new atomic write opcode for RC service in packet.

Link: https://lore.kernel.org/r/1669905432-14-3-git-send-email-yangx.jy@fujitsu.com
Signed-off-by: Xiao Yang <yangx.jy@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01 19:51:09 -04:00
Jason Gunthorpe 09f530f0c6 RDMA: Add netdevice_tracker to ib_device_set_netdev()
This will cause an informative backtrace to print if the user of
ib_device_set_netdev() isn't careful about tearing down the ibdevice
before its the netdevice parent is destroyed. Such as like this:

  unregister_netdevice: waiting for vlan0 to become free. Usage count = 2
  leaked reference.
   ib_device_set_netdev+0x266/0x730
   siw_newlink+0x4e0/0xfd0
   nldev_newlink+0x35c/0x5c0
   rdma_nl_rcv_msg+0x36d/0x690
   rdma_nl_rcv+0x2ee/0x430
   netlink_unicast+0x543/0x7f0
   netlink_sendmsg+0x918/0xe20
   sock_sendmsg+0xcf/0x120
   ____sys_sendmsg+0x70d/0x8b0
   ___sys_sendmsg+0x11d/0x1b0
   __sys_sendmsg+0xfa/0x1d0
   do_syscall_64+0x35/0xb0
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

This will help debug the issues syzkaller is seeing.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v1-a7c81b3842ce+e5-netdev_tracker_jgg@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-11-28 11:58:19 +02:00
Li Zhijian 53c2d5b14a RDMA/core: return -EOPNOSUPP for ODP unsupported device
ib_reg_mr(3) which is used to register a MR with specific access flags
for specific HCA will set errno when something go wrong.
So, here we should return the specific -EOPNOTSUPP when the being
requested ODP access flag is unsupported by the HCA(such as RXE).

Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Link: https://lore.kernel.org/r/20221001020045.8324-1-lizhijian@fujitsu.com
Reviewed-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-10-19 10:02:18 +03:00
Linus Torvalds c993e07be0 dma-mapping updates
- convert arm32 to the common dma-direct code (Arnd Bergmann, Robin Murphy,
    Christoph Hellwig)
  - restructure the PCIe peer to peer mapping support (Logan Gunthorpe)
  - allow the IOMMU code to communicate an optional DMA mapping length
    and use that in scsi and libata (John Garry)
  - split the global swiotlb lock (Tianyu Lan)
  - various fixes and cleanup (Chao Gao, Dan Carpenter, Dongli Zhang,
    Lukas Bulwahn, Robin Murphy)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmLuIYULHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPS5A//Ty1ZNyXExmwZ6J6g7/oIvQlpAHilDr22mCd8tR8Y
 Ne7TgLa/X+usFvJTxJfkvg/LNMDjD7qx0J/mhDGm4reOFcEL4/PBy0rDSOgnmntV
 k/fPhgwnpuztiAQ+s+WkJ3pkrmG1HaEId7GGj2JaoYdas6RX2mGX7vL8uvUFepjw
 lYPAqWMtJHkOfsDK0PqqyQsr7dcC6lyFLqnn/wqvHtTJeKCfGs6W/SIrlWme2SZY
 3dNx84ZR1uPjaazAmtf2IWfjh/TBmd0ETRYycgUUKRP9iwsCkBQDBwsBGSIYXiWj
 BUKQ5oMvjAlUGRF0jYz9e77KuedE6GxWiXNQstitBmid142M37DHA5tvZRf65MPS
 THHcjTDmmoaO4YfFhhXOcFOrjG4/V8bF7fgHB6XkHDjhVVTcnIx8zuOAXIVBZvIV
 VAALmamBqEfIZZrCqgr7hzFssK2bip+TIMkdoD46Wcr+D7bAlujhuzWxubn9+ulT
 23v/pAvC80ut6LvKj6EA+GpRm/pejfOtEbjXPoO2hguNxvuUKvPQqNh9hy0q+v1e
 8n2Y/4lhy5bv02S7wKooNkfCoV753jBY1TIru45UmEYc3EkTQPii6okYe0DvW4QX
 VCnKgo156wSBfE+9eWdxCROv2SZqJFMV/wL3vw54dpJQMbDy7VkNsh4mGREdUkU1
 uek=
 =Bv19
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.20-2022-08-06' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - convert arm32 to the common dma-direct code (Arnd Bergmann, Robin
   Murphy, Christoph Hellwig)

 - restructure the PCIe peer to peer mapping support (Logan Gunthorpe)

 - allow the IOMMU code to communicate an optional DMA mapping length
   and use that in scsi and libata (John Garry)

 - split the global swiotlb lock (Tianyu Lan)

 - various fixes and cleanup (Chao Gao, Dan Carpenter, Dongli Zhang,
   Lukas Bulwahn, Robin Murphy)

* tag 'dma-mapping-5.20-2022-08-06' of git://git.infradead.org/users/hch/dma-mapping: (45 commits)
  swiotlb: fix passing local variable to debugfs_create_ulong()
  dma-mapping: reformat comment to suppress htmldoc warning
  PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()
  RDMA/rw: drop pci_p2pdma_[un]map_sg()
  RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
  nvme-pci: convert to using dma_map_sgtable()
  nvme-pci: check DMA ops when indicating support for PCI P2PDMA
  iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg
  iommu: Explicitly skip bus address marked segments in __iommu_map_sg()
  dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support
  dma-direct: support PCI P2PDMA pages in dma-direct map_sg
  dma-mapping: allow EREMOTEIO return code for P2PDMA transfers
  PCI/P2PDMA: Introduce helpers for dma_map_sg implementations
  PCI/P2PDMA: Attempt to set map_type if it has not been set
  lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
  swiotlb: clean up some coding style and minor issues
  dma-mapping: update comment after dmabounce removal
  scsi: sd: Add a comment about limiting max_sectors to shost optimal limit
  ata: libata-scsi: cap ata_device->max_sectors according to shost->max_sectors
  scsi: scsi_transport_sas: cap shost opt_sectors according to DMA optimal limit
  ...
2022-08-06 10:56:45 -07:00
Logan Gunthorpe 495758bb1a RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
Introduce the helper function ib_dma_pci_p2p_dma_supported() to check
if a given ib_device can be used in P2PDMA transfers. This ensures
the ib_device is not using virt_dma and also that the underlying
dma_device supports P2PDMA.

Use the new helper in nvme-rdma to replace the existing check for
ib_uses_virt_dma(). Adding the dma_pci_p2pdma_supported() check allows
switching away from pci_p2pdma_[un]map_sg().

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-07-26 07:28:07 -04:00
Xin Gao 8937e28eac RDMA: Fix comment typo
The double `get' is duplicated, remove one.

Link: https://lore.kernel.org/r/20220722021833.15669-1-gaoxin@cdjrlc.com
Signed-off-by: Xin Gao <gaoxin@cdjrlc.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-07-22 12:07:16 -03:00
Julia Lawall 83567cee04 RDMA/core: Fix typo in comment
Spelling mistake (triple letters) in comment.
Detected with the help of Coccinelle.

Link: https://lore.kernel.org/r/20220521111145.81697-86-Julia.Lawall@inria.fr
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-05-24 11:24:58 -03:00
Jason Gunthorpe 7bf5323b05 Merge branch 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Leon Romanovsky says:

====================
Mellanox shared branch that includes:

 * Removal of FPGA TLS code https://lore.kernel.org/all/cover.1649073691.git.leonro@nvidia.com

  Mellanox INNOVA TLS cards are EOL in May, 2018 [1]. As such, the code
  is unmaintained, untested and not in-use by any upstream/distro oriented
  customers. In order to reduce code complexity, drop the kernel code,
  clean build config options and delete useless kTLS vs. TLS separation.

  [1] https://network.nvidia.com/related-docs/eol/LCR-000286.pdf

 * Removal of FPGA IPsec code https://lore.kernel.org/all/cover.1649232994.git.leonro@nvidia.com

  Together with FPGA TLS, the IPsec went to EOL state in the November of
  2019 [1]. Exactly like FPGA TLS, no active customers exist for this
  upstream code and all the complexity around that area can be deleted.

  [2] https://network.nvidia.com/related-docs/eol/LCR-000535.pdf

 * Fix to undefined behavior from Borislav https://lore.kernel.org/all/20220405151517.29753-11-bp@alien8.de
====================

* 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: Remove not-implemented IPsec capabilities
  net/mlx5: Remove ipsec_ops function table
  net/mlx5: Reduce kconfig complexity while building crypto support
  net/mlx5: Move IPsec file to relevant directory
  net/mlx5: Remove not-needed IPsec config
  net/mlx5: Align flow steering allocation namespace to common style
  net/mlx5: Unify device IPsec capabilities check
  net/mlx5: Remove useless IPsec device checks
  net/mlx5: Remove ipsec vs. ipsec offload file separation
  RDMA/core: Delete IPsec flow action logic from the core
  RDMA/mlx5: Drop crypto flow steering API
  RDMA/mlx5: Delete never supported IPsec flow action
  net/mlx5: Remove FPGA ipsec specific statistics
  net/mlx5: Remove XFRM no_trailer flag
  net/mlx5: Remove not-used IDA field from IPsec struct
  net/mlx5: Delete metadata handling logic
  net/mlx5_fpga: Drop INNOVA IPsec support
  IB/mlx5: Fix undefined behavior due to shift overflowing the constant
  net/mlx5: Cleanup kTLS function names and their exposure
  net/mlx5: Remove tls vs. ktls separation as it is the same
  net/mlx5: Remove indirection in TLS build
  net/mlx5: Reliably return TLS device capabilities
  net/mlx5_fpga: Drop INNOVA TLS support

Link: https://lore.kernel.org/r/20220409055303.1223644-1-leon@kernel.org
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
2022-04-12 10:43:36 -03:00
Leon Romanovsky 32313c6ae6 RDMA/core: Delete IPsec flow action logic from the core
The removal of mlx5 flow steering logic, left the kernel without any RDMA
drivers that implements flow action callbacks supplied by RDMA/core. Any
user access to them caused to EOPNOTSUPP error, which can be achieved by
simply removing ioctl implementation.

Link: https://lore.kernel.org/r/a638e376314a2eb1c66f597c0bbeeab2e5de7faf.1649232994.git.leonro@nvidia.com
Reviewed-by: Raed Salem <raeds@nvidia.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2022-04-09 08:25:06 +03:00
Jason Gunthorpe e945c653c8 RDMA: Split kernel-only global device caps from uverbs device caps
Split out flags from ib_device::device_cap_flags that are only used
internally to the kernel into kernel_cap_flags that is not part of the
uapi. This limits the device_cap_flags to being the same bitmap that will
be copied to userspace.

This cleanly splits out the uverbs flags from the kernel flags to avoid
confusion in the flags bitmap.

Add some short comments describing which each of the kernel flags is
connected to. Remove unused kernel flags.

Link: https://lore.kernel.org/r/0-v2-22c19e565eef+139a-kern_caps_jgg@nvidia.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-04-06 15:02:13 -03:00
Xiao Yang f543a3e82b IB/uverbs: Move part of enum ib_device_cap_flags to uapi
1) Part of enum ib_device_cap_flags are used by ibv_query_device(3)
   or ibv_query_device_ex(3), so we define them in
   include/uapi/rdma/ib_user_verbs.h and only expose them to userspace.

2) Reformat enum ib_device_cap_flags by removing the indent before '='.

Link: https://lore.kernel.org/r/20220331032419.313904-2-yangx.jy@fujitsu.com
Signed-off-by: Xiao Yang <yangx.jy@fujitsu.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-04-04 10:58:37 -03:00
Xiao Yang 30ad63e784 IB/uverbs: Move enum ib_raw_packet_caps to uapi
This enum is used by ibv_query_device_ex(3) so it should be defined
in include/uapi/rdma/ib_user_verbs.h.

Link: https://lore.kernel.org/r/20220331032419.313904-1-yangx.jy@fujitsu.com
Signed-off-by: Xiao Yang <yangx.jy@fujitsu.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-04-04 10:58:36 -03:00
Zhu Yanjun 18451db82e RDMA/core: Calculate UDP source port based on flow label or lqpn/rqpn
Calculate and set UDP source port based on the flow label. If flow label
is not defined in GRH then calculate it based on lqpn/rqpn.

Link: https://lore.kernel.org/r/20220106180359.2915060-2-yanjun.zhu@linux.dev
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-01-07 19:34:01 -04:00
Chengchang Tang 6d202d9f70 RDMA/hns: Use the core code to manage the fixed mmap entries
Add a new implementation for mmap by using the new mmap entry API. This
makes way for further use of the dynamic mmap allocator in this driver.

Link: https://lore.kernel.org/r/20211028105640.1056-1-liangwenpeng@huawei.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-10-29 14:07:31 -03:00
Logan Gunthorpe ac0fffa085 RDMA/core: Set sgtable nents when using ib_dma_virt_map_sg()
ib_dma_map_sgtable_attrs() should be mapping the sgls and setting nents
but the ib_uses_virt_dma() path falls back to ib_dma_virt_map_sg() which
will not set the nents in the sgtable.

Check the return value (per the map_sg calling convention) and set
sgt->nents appropriately on success.

Fixes: 79fbd3e124 ("RDMA: Use the sg_table directly and remove the opencoded version from umem")
Link: https://lore.kernel.org/r/20211013165942.89806-1-logang@deltatee.com
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-10-13 15:26:41 -03:00
Aharon Landau a29b934ceb RDMA/mlx5: Add modify_op_stat() support
Add support for ib callback modify_op_stat() to add or remove an optional
counter. When adding, a steering flow table is created with a rule that
catches and counts all the matching packets. When removing, the table and
flow counter are destroyed.

Link: https://lore.kernel.org/r/20211008122439.166063-13-markzhang@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-10-12 12:48:06 -03:00
Aharon Landau 5e2ddd1e59 RDMA/counter: Add optional counter support
An optional counter is a driver-specific counter that may be dynamically
enabled/disabled.  This enhancement allows drivers to expose counters
which are, for example, mutually exclusive and cannot be enabled at the
same time, counters that might degrades performance, optional debug
counters, etc.

Optional counters are marked with IB_STAT_FLAG_OPTIONAL flag. They are not
exported in sysfs, and must be at the end of all stats, otherwise the
attr->show() in sysfs would get wrong indexes for hwcounters that are
behind optional counters.

Link: https://lore.kernel.org/r/20211008122439.166063-7-markzhang@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-10-12 12:48:05 -03:00
Aharon Landau 0dc8968460 RDMA/counter: Add an is_disabled field in struct rdma_hw_stats
Add a bitmap in rdma_hw_stat structure, with each bit indicates whether
the corresponding counter is currently disabled or not. By default
hwcounters are enabled.

Link: https://lore.kernel.org/r/20211008122439.166063-6-markzhang@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-10-12 12:48:05 -03:00
Mark Zhang 0a0800ce2a RDMA/core: Add a helper API rdma_free_hw_stats_struct
Add a new API rdma_free_hw_stats_struct to pair with
rdma_alloc_hw_stats_struct (which is also de-inlined).

This will be useful when there are more alloc/free works in following
patches.

Link: https://lore.kernel.org/r/20211008122439.166063-5-markzhang@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-10-12 12:48:04 -03:00
Aharon Landau 13f30b0fa0 RDMA/counter: Add a descriptor in struct rdma_hw_stats
Add a counter statistic descriptor structure in rdma_hw_stats. In addition
to the counter name, more meta-information will be added.  This code
extension is needed for optional-counter support in the following patches.

Link: https://lore.kernel.org/r/20211008122439.166063-4-markzhang@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-10-12 12:48:04 -03:00
Jason Gunthorpe 6a217437f9 Merge branch 'sg_nents' into rdma.git for-next
From Maor Gottlieb
====================

Fix the use of nents and orig_nents in the sg table append helpers. The
nents should be used by the DMA layer to store the number of DMA mapped
sges, the orig_nents is the number of CPU sges.

Since the sg append logic doesn't always create a SGL with exactly
orig_nents entries store a total_nents as well to allow the table to be
properly free'd and reorganize the freeing logic to share across all the
use cases.

====================

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

* 'sg_nents':
  RDMA: Use the sg_table directly and remove the opencoded version from umem
  lib/scatterlist: Fix wrong update of orig_nents
  lib/scatterlist: Provide a dedicated function to support table append
2021-08-30 09:49:59 -03:00
Maor Gottlieb 79fbd3e124 RDMA: Use the sg_table directly and remove the opencoded version from umem
This allows using the normal sg_table APIs and makes all the code
cleaner. Remove sgt, nents and nmapd from ib_umem.

Link: https://lore.kernel.org/r/20210824142531.3877007-4-maorg@nvidia.com
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-08-24 19:52:40 -03:00
Leon Romanovsky 8da9fe4e4f RDMA/core: Reorganize create QP low-level functions
The low-level create QP function grew to be larger than any sensible
inline function should be. The inline attribute is not really needed for
that function and can be implemented as exported symbol.

Link: https://lore.kernel.org/r/2c08709d86f876c3dfb77684357b2a939e570ca4.1628014762.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-08-03 15:26:18 -03:00
Leon Romanovsky 514aee660d RDMA: Globally allocate and release QP memory
Convert QP object to follow IB/core general allocation scheme.  That
change allows us to make sure that restrack properly kref the memory.

Link: https://lore.kernel.org/r/48e767124758aeecc433360ddd85eaa6325b34d9.1627040189.git.leonro@nvidia.com
Reviewed-by: Gal Pressman <galpress@amazon.com> #efa
Tested-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> #rdma and core
Tested-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Tested-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-08-03 13:44:27 -03:00
Anand Khoje 84dcd8c7ea IB/core: Shuffle locks in ib_port_data to save memory
pahole shows two 4-byte holes in struct ib_port_data after pkey_list_lock
and netdev_lock respectively.

Shuffling the netdev_lock to be after pkey_list_lock, this shaves off
eight bytes from the struct.

Link: https://lore.kernel.org/r/20210616154509.1047-3-anand.a.khoje@oracle.com
Suggested-by: Haakon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Anand Khoje <anand.a.khoje@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-21 20:49:32 -03:00
Avihai Horon 1477d44ce4 RDMA/mlx5: Enable Relaxed Ordering by default for kernel ULPs
Relaxed Ordering is a capability that can only benefit users that support
it. All kernel ULPs should support Relaxed Ordering, as they are designed
to read data only after observing the CQE and use the DMA API correctly.

Hence, implicitly enable Relaxed Ordering by default for MR transfers in
kernel ULPs.

Link: https://lore.kernel.org/r/b7e820aab7402b8efa63605f4ea465831b3b1e5e.1623236426.git.leonro@nvidia.com
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-21 12:33:08 -03:00
Jason Gunthorpe 915e4af59f RDMA: Remove rdma_set_device_sysfs_group()
The driver's device group can be specified as part of the ops structure
like the device's port group. No need for the complicated API.

Link: https://lore.kernel.org/r/8964785a34fd3a29ff5b6693493f575b717e594d.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:58:32 -03:00
Jason Gunthorpe d7407d1669 RDMA: Change ops->init_port to ops->port_groups
init_port was only being used to register sysfs attributes against the
port kobject. Now that all users are creating static attribute_group's we
can simply set the attribute_group list in the ops and the core code can
just handle it directly.

This makes all the sysfs management quite straightforward and prevents any
driver from abusing the naked port kobject in future because no driver
code can access it.

Link: https://lore.kernel.org/r/114f68f3d921460eafe14cea5a80ca65d81729c3.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:58:31 -03:00
Jason Gunthorpe b7066b32a1 RDMA/core: Create the device hw_counters through the normal groups mechanism
Instead of calling device_add_groups() add the group to the existing
groups array which is managed through device_add().

This requires setting up the hw_counters before device_add(), so it gets
split up from the already split port sysfs flow.

Move all the memory freeing to the release function.

Link: https://lore.kernel.org/r/666250d937b64f6fdf45da9e2dc0b6e5e4f7abd8.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:58:30 -03:00
Jason Gunthorpe 467f432a52 RDMA/core: Split port and device counter sysfs attributes
This code creates a 'struct hw_stats_attribute' for each sysfs entry that
contains a naked 'struct attribute' inside.

It then proceeds to attach this same structure to a 'struct device' kobj
and a 'struct ib_port' kobj. However, this violates the typing
requirements.  'struct device' requires the attribute to be a 'struct
device_attribute' and 'struct ib_port' requires the attribute to be
'struct port_attribute'.

This happens to work because the show/store function pointers in all three
structures happen to be at the same offset and happen to be nearly the
same signature. This means when container_of() was used to go between the
wrong two types it still managed to work.

However clang CFI detection notices that the function pointers have a
slightly different signature. As with show/store this was only working
because the device and port struct layouts happened to have the kobj at
the front.

Correct this by have two independent sets of data structures for the port
and device case. The two different attributes correctly include the
port/device_attribute struct and everything from there up is kept
split. The show/store function call chains start with device/port unique
functions that invoke a common show/store function pointer.

Link: https://lore.kernel.org/r/a8b3864b4e722aed3657512af6aa47dc3c5033be.1623427137.git.leonro@nvidia.com
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:58:29 -03:00
Jason Gunthorpe d8a5883814 RDMA/core: Replace the ib_port_data hw_stats pointers with a ib_port pointer
It is much saner to store a pointer to the kobject structure that contains
the cannonical stats pointer than to copy the stats pointers into a public
structure.

Future patches will require the sysfs pointer for other purposes.

Link: https://lore.kernel.org/r/f90551dfd296cde1cb507bbef27cca9891d19871.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:58:29 -03:00
Jason Gunthorpe 4b5f4d3fb4 RDMA: Split the alloc_hw_stats() ops to port and device variants
This is being used to implement both the port and device global stats,
which is causing some confusion in the drivers. For instance EFA and i40iw
both seem to be misusing the device stats.

Split it into two ops so drivers that don't support one or the other can
leave the op NULL'd, making the calling code a little simpler to
understand.

Link: https://lore.kernel.org/r/1955c154197b2a159adc2dc97266ddc74afe420c.1623427137.git.leonro@nvidia.com
Tested-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:58:29 -03:00
Wan Jiabing 7c6c2f5337 RDMA: Remove unnecessary struct declaration
The declaration of struct ib_grh is uncessary here, because it is defined
at line 766.

Link: https://lore.kernel.org/r/20210510062843.15707-1-wanjiabing@vivo.com
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-05-11 14:44:17 -03:00
Leon Romanovsky 16149eddd3 RDMA/core: Remove never used ib_modify_wq function call
The function ib_modify_wq() is not used, so remove it.

Link: https://lore.kernel.org/r/c5e48d517b9163fe4f9ffd224050b83fdb3571c6.1620552935.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-05-11 14:43:58 -03:00
Neta Ostrovsky 48f8a70e89 RDMA/restrack: Add support to get resource tracking for SRQ
In order to track SRQ resources, a new restrack object is initialized and
added to the resource tracking database.

Link: https://lore.kernel.org/r/0db71c409f24f2f6b019bf8797a8fed96fe7079c.1618753110.git.leonro@nvidia.com
Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-04-22 10:30:27 -03:00
Mike Marciniszyn 042a00f93a IB/{ipoib,hfi1}: Add a timeout handler for rdma_netdev
The current rdma_netdev handling in ipoib hooks the tx_timeout handler,
but prints out a totally useless message that prevents effective debugging
especially when multiple transmit queues are being used.

Add a tx_timeout rdma_netdev hook and implement the callback in the hfi1
to print additional information.

The existing non-helpful message is avoided when the driver has presented
a callback.

Link: https://lore.kernel.org/r/1617026056-50483-3-git-send-email-dennis.dalessandro@cornelisnetworks.com
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-04-07 20:19:00 -03:00
Mark Bloch 1fb7f8973f RDMA: Support more than 255 rdma ports
Current code uses many different types when dealing with a port of a RDMA
device: u8, unsigned int and u32. Switch to u32 to clean up the logic.

This allows us to make (at least) the core view consistent and use the
same type. Unfortunately not all places can be converted. Many uverbs
functions expect port to be u8 so keep those places in order not to break
UAPIs.  HW/Spec defined values must also not be changed.

With the switch to u32 we now can support devices with more than 255
ports. U32_MAX is reserved to make control logic a bit easier to deal
with. As a device with U32_MAX ports probably isn't going to happen any
time soon this seems like a non issue.

When a device with more than 255 ports is created uverbs will report the
RDMA device as having 255 ports as this is the max currently supported.

The verbs interface is not changed yet because the IBTA spec limits the
port size in too many places to be u8 and all applications that relies in
verbs won't be able to cope with this change. At this stage, we are
extending the interfaces that are using vendor channel solely

Once the limitation is lifted mlx5 in switchdev mode will be able to have
thousands of SFs created by the device. As the only instance of an RDMA
device that reports more than 255 ports will be a representor device and
it exposes itself as a RAW Ethernet only device CM/MAD/IPoIB and other
ULPs aren't effected by this change and their sysfs/interfaces that are
exposes to userspace can remain unchanged.

While here cleanup some alignment issues and remove unneeded sanity
checks (mainly in rdmavt),

Link: https://lore.kernel.org/r/20210301070420.439400-1-leon@kernel.org
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-03-26 09:31:21 -03:00
Gal Pressman f675ba125b RDMA/core: Remove unused req_ncomp_notif device operation
The request_ncomp_notif device operation and function are unused, remove
them.

Link: https://lore.kernel.org/r/20210311150921.23726-1-galpress@amazon.com
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-03-11 13:46:40 -04:00
Parav Pandit 7416790e22 RDMA/core: Introduce and use API to read port immutable data
Currently mlx5 driver caches port GID table length for 2 ports.  It is
also cached by IB core as port immutable data.

When mlx5 representor ports are present, which are usually more than 2,
invalid access to port_caps array can happen while validating the GID
table length which is only for 2 ports.

To avoid this, take help of the IB cores port immutable data by exposing
an API to read the port immutable fields.

Remove mlx5 driver's internal cache, thereby reduce code and data.

Link: https://lore.kernel.org/r/20210203130133.4057329-5-leon@kernel.org
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-02-05 12:06:01 -04:00