Merge branch 'pci/peer-to-peer'

- Add PCI support for peer-to-peer DMA (Logan Gunthorpe)

  - Add sysfs group for PCI peer-to-peer memory statistics (Logan
    Gunthorpe)

  - Add PCI peer-to-peer DMA scatterlist mapping interface (Logan
    Gunthorpe)

  - Add PCI configfs/sysfs helpers for use by peer-to-peer users (Logan
    Gunthorpe)

  - Add PCI peer-to-peer DMA driver writer's documentation (Logan
    Gunthorpe)

  - Add block layer flag to indicate driver support for PCI peer-to-peer
    DMA (Logan Gunthorpe)

  - Map Infiniband scatterlists for peer-to-peer DMA if they contain P2P
    memory (Logan Gunthorpe)

  - Register nvme-pci CMB buffer as PCI peer-to-peer memory (Logan
    Gunthorpe)

  - Add nvme-pci support for PCI peer-to-peer memory in requests (Logan
    Gunthorpe)

  - Use PCI peer-to-peer memory in the NVMe fabrics target (Stephen Bates,
    Steve Wise, Christoph Hellwig, Logan Gunthorpe)

* pci/peer-to-peer:
  nvmet: Optionally use PCI P2P memory
  nvmet: Introduce helper functions to allocate and free request SGLs
  nvme-pci: Add support for P2P memory in requests
  nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]()
  block: Add PCI P2P flag for request queue
  PCI/P2PDMA: Add P2P DMA driver writer's documentation
  docs-rst: Add a new directory for PCI documentation
  PCI/P2PDMA: Introduce configfs/sysfs enable attribute helpers
  PCI/P2PDMA: Add PCI p2pmem DMA mappings to adjust the bus offset
  PCI/P2PDMA: Add sysfs group to display p2pmem stats
  PCI/P2PDMA: Support peer-to-peer memory
Bjorn Helgaas 2018-10-20 11:45:33 -05:00
commit 1734715493
22 changed files with 1493 additions and 50 deletions


@ -323,3 +323,27 @@ Description:
This is similar to /sys/bus/pci/drivers_autoprobe, but
affects only the VFs associated with a specific PF.
What: /sys/bus/pci/devices/.../p2pmem/size
Date: November 2017
Contact: Logan Gunthorpe <logang@deltatee.com>
Description:
If the device has any Peer-to-Peer memory registered, this
file contains the total amount of memory that the device
provides (in decimal).
What: /sys/bus/pci/devices/.../p2pmem/available
Date: November 2017
Contact: Logan Gunthorpe <logang@deltatee.com>
Description:
If the device has any Peer-to-Peer memory registered, this
file contains the amount of memory that has not been
allocated (in decimal).
What: /sys/bus/pci/devices/.../p2pmem/published
Date: November 2017
Contact: Logan Gunthorpe <logang@deltatee.com>
Description:
If the device has any Peer-to-Peer memory registered, this
file contains a '1' if the memory has been published for
use outside the driver that owns the device.


@ -29,7 +29,7 @@ available subsections can be seen below.
iio/index
input
usb/index
pci
pci/index
spi
i2c
hsi


@ -0,0 +1,22 @@
.. SPDX-License-Identifier: GPL-2.0
============================================
The Linux PCI driver implementer's API guide
============================================
.. class:: toc-title
Table of contents
.. toctree::
:maxdepth: 2
pci
p2pdma
.. only:: subproject and html
Indices
=======
* :ref:`genindex`


@ -0,0 +1,145 @@
.. SPDX-License-Identifier: GPL-2.0
============================
PCI Peer-to-Peer DMA Support
============================
The PCI bus has pretty decent support for performing DMA transfers
between two devices on the bus. This type of transaction is henceforth
called Peer-to-Peer (or P2P). However, there are a number of issues that
make P2P transactions tricky to do in a perfectly safe way.
One of the biggest issues is that PCI doesn't require forwarding
transactions between hierarchy domains, and in PCIe, each Root Port
defines a separate hierarchy domain. To make things worse, there is no
simple way to determine if a given Root Complex supports this or not.
(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
only supports doing P2P when the endpoints involved are all behind the
same PCI bridge, as such devices are all in the same PCI hierarchy
domain, and the spec guarantees that all transactions within the
hierarchy will be routable, but it does not require routing
between hierarchies.
The second issue is that to make use of existing interfaces in Linux,
memory that is used for P2P transactions needs to be backed by struct
pages. However, PCI BARs are not typically cache coherent, so there are
a few corner-case gotchas with these pages and developers need to
be careful about what they do with them.
Driver Writer's Guide
=====================
In a given P2P implementation there may be three or more different
types of kernel drivers in play:
* Provider - A driver which provides or publishes P2P resources like
memory or doorbell registers to other drivers.
* Client - A driver which makes use of a resource by setting up a
DMA transaction to or from it.
* Orchestrator - A driver which orchestrates the flow of data between
clients and providers.
In many cases there could be overlap between these three types (i.e.,
it is common for a driver to be both a provider and a client).
For example, in the NVMe Target Copy Offload implementation:
* The NVMe PCI driver is a client, a provider and an orchestrator
in that it exposes any CMB (Controller Memory Buffer) as a P2P memory
resource (provider), it accepts P2P memory pages as buffers in requests
to be used directly (client), and it can also make use of the CMB as
submission queue entries (orchestrator).
* The RDMA driver is a client in this arrangement so that an RNIC
can DMA directly to the memory exposed by the NVMe device.
* The NVMe Target driver (nvmet) can orchestrate the data from the RNIC
to the P2P memory (CMB) and then to the NVMe device (and vice versa).
This is currently the only arrangement supported by the kernel but
one could imagine slight tweaks to this that would allow for the same
functionality. For example, if a specific RNIC added a BAR with some
memory behind it, its driver could add support as a P2P provider and
then the NVMe Target could use the RNIC's memory instead of the CMB
in cases where the NVMe cards in use do not have CMB support.
Provider Drivers
----------------
A provider simply needs to register a BAR (or a portion of a BAR)
as a P2P DMA resource using :c:func:`pci_p2pdma_add_resource()`.
This will register struct pages for all the specified memory.
After that it may optionally publish all of its resources as
P2P memory using :c:func:`pci_p2pmem_publish()`. This will allow
any orchestrator drivers to find and use the memory. When marked in
this way, the resource must be regular memory with no side effects.
For the time being this is fairly rudimentary in that all resources
are typically going to be P2P memory. Future work will likely expand
this to include other types of resources like doorbells.
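As a rough sketch (``MYDEV_P2P_BAR`` and the probe function are hypothetical
names and error handling is abbreviated), a provider's probe path might do::

    static int mydev_probe(struct pci_dev *pdev, const struct pci_device_id *id)
    {
            int rc;

            /* Register a whole BAR (size 0, offset 0) as P2P DMA memory */
            rc = pci_p2pdma_add_resource(pdev, MYDEV_P2P_BAR, 0, 0);
            if (rc)
                    return rc;

            /* Optionally allow orchestrators to find and allocate this memory */
            pci_p2pmem_publish(pdev, true);

            return 0;
    }

This mirrors what the nvme-pci driver in this series does with its CMB.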
Client Drivers
--------------
A client driver typically only has to conditionally change its DMA map
routine to use the mapping function :c:func:`pci_p2pdma_map_sg()` instead
of the usual :c:func:`dma_map_sg()` function. Memory mapped in this
way does not need to be unmapped.
The client may also, optionally, make use of
:c:func:`is_pci_p2pdma_page()` to determine when to use the P2P mapping
functions and when to use the regular mapping functions. In some
situations, it may be more appropriate to use a flag to indicate a
given request is P2P memory and map appropriately. It is important to
ensure that struct pages that back P2P memory stay out of code that
does not have support for them, as such code may treat the pages as
regular memory, which may not be appropriate.
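For illustration only (``dev``, ``sg``, ``nents`` and ``dir`` are assumed to
already exist in the driver's mapping path), the conditional mapping described
above looks roughly like::

    if (is_pci_p2pdma_page(sg_page(sg)))
            /* P2P pages map to PCI bus addresses and are never unmapped */
            nr_mapped = pci_p2pdma_map_sg(dev, sg, nents, dir);
    else
            nr_mapped = dma_map_sg(dev, sg, nents, dir);

    if (!nr_mapped)
            return -ENOMEM;

and the matching unmap path skips P2P scatterlists entirely::

    /* P2PDMA mappings do not need to be unmapped */
    if (!is_pci_p2pdma_page(sg_page(sg)))
            dma_unmap_sg(dev, sg, nents, dir);

This is the same pattern the rdma_rw and nvme-pci changes in this series follow.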
Orchestrator Drivers
--------------------
The first task an orchestrator driver must do is compile a list of
all client devices that will be involved in a given transaction. For
example, the NVMe Target driver creates a list including the namespace
block device and the RNIC in use. If the orchestrator has access to
a specific P2P provider to use, it may check compatibility using
:c:func:`pci_p2pdma_distance()`; otherwise, it may find a memory provider
that's compatible with all clients using :c:func:`pci_p2pmem_find()`.
If more than one provider is supported, the one nearest to all the clients will
be chosen first. If more than one provider is an equal distance away, the
one returned will be chosen at random (the choice is not arbitrary but
truly random). This function returns the PCI device to use for the provider
with a reference taken, so when the device is no longer needed it should be
returned with pci_dev_put().
Once a provider is selected, the orchestrator can then use
:c:func:`pci_alloc_p2pmem()` and :c:func:`pci_free_p2pmem()` to
allocate P2P memory from the provider. :c:func:`pci_p2pmem_alloc_sgl()`
and :c:func:`pci_p2pmem_free_sgl()` are convenience functions for
allocating scatter-gather lists with P2P memory.
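Putting this together, a minimal orchestrator sketch (``client_a``,
``client_b`` and ``length`` are hypothetical) might look like::

    struct device *clients[] = { client_a, client_b };
    struct pci_dev *p2p_dev;
    struct scatterlist *sgl;
    unsigned int nents;

    /* Pick the published provider closest to both clients */
    p2p_dev = pci_p2pmem_find_many(clients, ARRAY_SIZE(clients));
    if (!p2p_dev)
            return -ENODEV;         /* or fall back to regular memory */

    sgl = pci_p2pmem_alloc_sgl(p2p_dev, &nents, length);
    if (!sgl) {
            pci_dev_put(p2p_dev);
            return -ENOMEM;
    }

    /* ... hand sgl to the clients, which map it with pci_p2pdma_map_sg() ... */

    pci_p2pmem_free_sgl(p2p_dev, sgl);
    pci_dev_put(p2p_dev);           /* drop the reference from pci_p2pmem_find_many() */

Compare with the nvmet changes in this series, which treat the RNIC and the
namespace's block device as the two clients.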
Struct Page Caveats
-------------------
Driver writers should be very careful about not passing these special
struct pages to code that isn't prepared for them. At this time, the kernel
interfaces do not have any checks to ensure this. This obviously
precludes passing these pages to userspace.
P2P memory is also technically IO memory but should never have any side
effects behind it. Thus, the order of loads and stores should not be important
and ioreadX(), iowriteX() and friends should not be necessary.
However, as the memory is not cache coherent, if access ever needs to
be protected by a spinlock then :c:func:`mmiowb()` must be used before
unlocking the lock. (See ACQUIRES VS I/O ACCESSES in
Documentation/memory-barriers.txt)
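For example, under this rule a store to P2P memory done while holding a
spinlock would look roughly like the following (``p2p_buf`` is a hypothetical
pointer obtained from :c:func:`pci_alloc_p2pmem()`)::

    spin_lock(&mydev->lock);

    p2p_buf[idx] = value;   /* plain store to non-coherent P2P memory */

    mmiowb();               /* order the store before the unlock is visible */
    spin_unlock(&mydev->lock);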
P2P DMA Support Library
=======================
.. kernel-doc:: drivers/pci/p2pdma.c
:export:


@ -12,6 +12,7 @@
*/
#include <linux/moduleparam.h>
#include <linux/slab.h>
#include <linux/pci-p2pdma.h>
#include <rdma/mr_pool.h>
#include <rdma/rw.h>
@ -280,7 +281,11 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
struct ib_device *dev = qp->pd->device;
int ret;
ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
if (is_pci_p2pdma_page(sg_page(sg)))
ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
else
ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
if (!ret)
return -ENOMEM;
sg_cnt = ret;
@ -602,7 +607,9 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
break;
}
ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
/* P2PDMA contexts do not need to be unmapped */
if (!is_pci_p2pdma_page(sg_page(sg)))
ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
}
EXPORT_SYMBOL(rdma_rw_ctx_destroy);


@ -3051,7 +3051,11 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
ns->queue = blk_mq_init_queue(ctrl->tagset);
if (IS_ERR(ns->queue))
goto out_free_ns;
blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
ns->queue->queuedata = ns;
ns->ctrl = ctrl;


@ -343,6 +343,7 @@ struct nvme_ctrl_ops {
unsigned int flags;
#define NVME_F_FABRICS (1 << 0)
#define NVME_F_METADATA_SUPPORTED (1 << 1)
#define NVME_F_PCI_P2PDMA (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);


@ -30,6 +30,7 @@
#include <linux/types.h>
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/sed-opal.h>
#include <linux/pci-p2pdma.h>
#include "nvme.h"
@ -99,9 +100,8 @@ struct nvme_dev {
struct work_struct remove_work;
struct mutex shutdown_lock;
bool subsystem;
void __iomem *cmb;
pci_bus_addr_t cmb_bus_addr;
u64 cmb_size;
bool cmb_use_sqes;
u32 cmbsz;
u32 cmbloc;
struct nvme_ctrl ctrl;
@ -158,7 +158,7 @@ struct nvme_queue {
struct nvme_dev *dev;
spinlock_t sq_lock;
struct nvme_command *sq_cmds;
struct nvme_command __iomem *sq_cmds_io;
bool sq_cmds_is_io;
spinlock_t cq_lock ____cacheline_aligned_in_smp;
volatile struct nvme_completion *cqes;
struct blk_mq_tags **tags;
@ -447,11 +447,8 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
static void nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd)
{
spin_lock(&nvmeq->sq_lock);
if (nvmeq->sq_cmds_io)
memcpy_toio(&nvmeq->sq_cmds_io[nvmeq->sq_tail], cmd,
sizeof(*cmd));
else
memcpy(&nvmeq->sq_cmds[nvmeq->sq_tail], cmd, sizeof(*cmd));
memcpy(&nvmeq->sq_cmds[nvmeq->sq_tail], cmd, sizeof(*cmd));
if (++nvmeq->sq_tail == nvmeq->q_depth)
nvmeq->sq_tail = 0;
@ -748,8 +745,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
goto out;
ret = BLK_STS_RESOURCE;
nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir,
DMA_ATTR_NO_WARN);
if (is_pci_p2pdma_page(sg_page(iod->sg)))
nr_mapped = pci_p2pdma_map_sg(dev->dev, iod->sg, iod->nents,
dma_dir);
else
nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
dma_dir, DMA_ATTR_NO_WARN);
if (!nr_mapped)
goto out;
@ -791,7 +793,10 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
DMA_TO_DEVICE : DMA_FROM_DEVICE;
if (iod->nents) {
dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
/* P2PDMA requests do not need to be unmapped */
if (!is_pci_p2pdma_page(sg_page(iod->sg)))
dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
if (blk_integrity_rq(req))
dma_unmap_sg(dev->dev, &iod->meta_sg, 1, dma_dir);
}
@ -1232,9 +1237,18 @@ static void nvme_free_queue(struct nvme_queue *nvmeq)
{
dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
if (nvmeq->sq_cmds)
dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
nvmeq->sq_cmds, nvmeq->sq_dma_addr);
if (nvmeq->sq_cmds) {
if (nvmeq->sq_cmds_is_io)
pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
nvmeq->sq_cmds,
SQ_SIZE(nvmeq->q_depth));
else
dma_free_coherent(nvmeq->q_dmadev,
SQ_SIZE(nvmeq->q_depth),
nvmeq->sq_cmds,
nvmeq->sq_dma_addr);
}
}
static void nvme_free_queues(struct nvme_dev *dev, int lowest)
@ -1323,12 +1337,21 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
int qid, int depth)
{
/* CMB SQEs will be mapped before creation */
if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS))
return 0;
struct pci_dev *pdev = to_pci_dev(dev->dev);
if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
nvmeq->sq_cmds);
nvmeq->sq_cmds_is_io = true;
}
if (!nvmeq->sq_cmds) {
nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
&nvmeq->sq_dma_addr, GFP_KERNEL);
nvmeq->sq_cmds_is_io = false;
}
nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
&nvmeq->sq_dma_addr, GFP_KERNEL);
if (!nvmeq->sq_cmds)
return -ENOMEM;
return 0;
@ -1405,13 +1428,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
int result;
s16 vector;
if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
dev->ctrl.page_size);
nvmeq->sq_dma_addr = dev->cmb_bus_addr + offset;
nvmeq->sq_cmds_io = dev->cmb + offset;
}
/*
* A queue's vector matches the queue identifier unless the controller
* has only one vector available.
@ -1652,9 +1668,6 @@ static void nvme_map_cmb(struct nvme_dev *dev)
return;
dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
if (!use_cmb_sqes)
return;
size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
bar = NVME_CMB_BIR(dev->cmbloc);
@ -1671,11 +1684,18 @@ static void nvme_map_cmb(struct nvme_dev *dev)
if (size > bar_size - offset)
size = bar_size - offset;
dev->cmb = ioremap_wc(pci_resource_start(pdev, bar) + offset, size);
if (!dev->cmb)
if (pci_p2pdma_add_resource(pdev, bar, size, offset)) {
dev_warn(dev->ctrl.device,
"failed to register the CMB\n");
return;
dev->cmb_bus_addr = pci_bus_address(pdev, bar) + offset;
}
dev->cmb_size = size;
dev->cmb_use_sqes = use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS);
if ((dev->cmbsz & (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS)) ==
(NVME_CMBSZ_WDS | NVME_CMBSZ_RDS))
pci_p2pmem_publish(pdev, true);
if (sysfs_add_file_to_group(&dev->ctrl.device->kobj,
&dev_attr_cmb.attr, NULL))
@ -1685,12 +1705,10 @@ static void nvme_map_cmb(struct nvme_dev *dev)
static inline void nvme_release_cmb(struct nvme_dev *dev)
{
if (dev->cmb) {
iounmap(dev->cmb);
dev->cmb = NULL;
if (dev->cmb_size) {
sysfs_remove_file_from_group(&dev->ctrl.device->kobj,
&dev_attr_cmb.attr, NULL);
dev->cmbsz = 0;
dev->cmb_size = 0;
}
}
@ -1889,13 +1907,13 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
if (nr_io_queues == 0)
return 0;
if (dev->cmb && (dev->cmbsz & NVME_CMBSZ_SQS)) {
if (dev->cmb_use_sqes) {
result = nvme_cmb_qdepth(dev, nr_io_queues,
sizeof(struct nvme_command));
if (result > 0)
dev->q_depth = result;
else
nvme_release_cmb(dev);
dev->cmb_use_sqes = false;
}
do {
@ -2390,7 +2408,8 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name = "pcie",
.module = THIS_MODULE,
.flags = NVME_F_METADATA_SUPPORTED,
.flags = NVME_F_METADATA_SUPPORTED |
NVME_F_PCI_P2PDMA,
.reg_read32 = nvme_pci_reg_read32,
.reg_write32 = nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,


@ -17,6 +17,8 @@
#include <linux/slab.h>
#include <linux/stat.h>
#include <linux/ctype.h>
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>
#include "nvmet.h"
@ -340,6 +342,48 @@ static ssize_t nvmet_ns_device_path_store(struct config_item *item,
CONFIGFS_ATTR(nvmet_ns_, device_path);
#ifdef CONFIG_PCI_P2PDMA
static ssize_t nvmet_ns_p2pmem_show(struct config_item *item, char *page)
{
struct nvmet_ns *ns = to_nvmet_ns(item);
return pci_p2pdma_enable_show(page, ns->p2p_dev, ns->use_p2pmem);
}
static ssize_t nvmet_ns_p2pmem_store(struct config_item *item,
const char *page, size_t count)
{
struct nvmet_ns *ns = to_nvmet_ns(item);
struct pci_dev *p2p_dev = NULL;
bool use_p2pmem;
int ret = count;
int error;
mutex_lock(&ns->subsys->lock);
if (ns->enabled) {
ret = -EBUSY;
goto out_unlock;
}
error = pci_p2pdma_enable_store(page, &p2p_dev, &use_p2pmem);
if (error) {
ret = error;
goto out_unlock;
}
ns->use_p2pmem = use_p2pmem;
pci_dev_put(ns->p2p_dev);
ns->p2p_dev = p2p_dev;
out_unlock:
mutex_unlock(&ns->subsys->lock);
return ret;
}
CONFIGFS_ATTR(nvmet_ns_, p2pmem);
#endif /* CONFIG_PCI_P2PDMA */
static ssize_t nvmet_ns_device_uuid_show(struct config_item *item, char *page)
{
return sprintf(page, "%pUb\n", &to_nvmet_ns(item)->uuid);
@ -509,6 +553,9 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
&nvmet_ns_attr_ana_grpid,
&nvmet_ns_attr_enable,
&nvmet_ns_attr_buffered_io,
#ifdef CONFIG_PCI_P2PDMA
&nvmet_ns_attr_p2pmem,
#endif
NULL,
};


@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/random.h>
#include <linux/rculist.h>
#include <linux/pci-p2pdma.h>
#include "nvmet.h"
@ -365,9 +366,93 @@ static void nvmet_ns_dev_disable(struct nvmet_ns *ns)
nvmet_file_ns_disable(ns);
}
static int nvmet_p2pmem_ns_enable(struct nvmet_ns *ns)
{
int ret;
struct pci_dev *p2p_dev;
if (!ns->use_p2pmem)
return 0;
if (!ns->bdev) {
pr_err("peer-to-peer DMA is not supported by non-block device namespaces\n");
return -EINVAL;
}
if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
pr_err("peer-to-peer DMA is not supported by the driver of %s\n",
ns->device_path);
return -EINVAL;
}
if (ns->p2p_dev) {
ret = pci_p2pdma_distance(ns->p2p_dev, nvmet_ns_dev(ns), true);
if (ret < 0)
return -EINVAL;
} else {
/*
* Right now we just check that there is p2pmem available so
* we can report an error to the user right away if there
* is not. We'll find the actual device to use once we
* setup the controller when the port's device is available.
*/
p2p_dev = pci_p2pmem_find(nvmet_ns_dev(ns));
if (!p2p_dev) {
pr_err("no peer-to-peer memory is available for %s\n",
ns->device_path);
return -EINVAL;
}
pci_dev_put(p2p_dev);
}
return 0;
}
/*
* Note: ctrl->subsys->lock should be held when calling this function
*/
static void nvmet_p2pmem_ns_add_p2p(struct nvmet_ctrl *ctrl,
struct nvmet_ns *ns)
{
struct device *clients[2];
struct pci_dev *p2p_dev;
int ret;
if (!ctrl->p2p_client)
return;
if (ns->p2p_dev) {
ret = pci_p2pdma_distance(ns->p2p_dev, ctrl->p2p_client, true);
if (ret < 0)
return;
p2p_dev = pci_dev_get(ns->p2p_dev);
} else {
clients[0] = ctrl->p2p_client;
clients[1] = nvmet_ns_dev(ns);
p2p_dev = pci_p2pmem_find_many(clients, ARRAY_SIZE(clients));
if (!p2p_dev) {
pr_err("no peer-to-peer memory is available that's supported by %s and %s\n",
dev_name(ctrl->p2p_client), ns->device_path);
return;
}
}
ret = radix_tree_insert(&ctrl->p2p_ns_map, ns->nsid, p2p_dev);
if (ret < 0)
pci_dev_put(p2p_dev);
pr_info("using p2pmem on %s for nsid %d\n", pci_name(p2p_dev),
ns->nsid);
}
int nvmet_ns_enable(struct nvmet_ns *ns)
{
struct nvmet_subsys *subsys = ns->subsys;
struct nvmet_ctrl *ctrl;
int ret;
mutex_lock(&subsys->lock);
@ -384,6 +469,13 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
if (ret)
goto out_unlock;
ret = nvmet_p2pmem_ns_enable(ns);
if (ret)
goto out_unlock;
list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
nvmet_p2pmem_ns_add_p2p(ctrl, ns);
ret = percpu_ref_init(&ns->ref, nvmet_destroy_namespace,
0, GFP_KERNEL);
if (ret)
@ -418,6 +510,9 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
mutex_unlock(&subsys->lock);
return ret;
out_dev_put:
list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
pci_dev_put(radix_tree_delete(&ctrl->p2p_ns_map, ns->nsid));
nvmet_ns_dev_disable(ns);
goto out_unlock;
}
@ -425,6 +520,7 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
void nvmet_ns_disable(struct nvmet_ns *ns)
{
struct nvmet_subsys *subsys = ns->subsys;
struct nvmet_ctrl *ctrl;
mutex_lock(&subsys->lock);
if (!ns->enabled)
@ -434,6 +530,10 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
list_del_rcu(&ns->dev_link);
if (ns->nsid == subsys->max_nsid)
subsys->max_nsid = nvmet_max_nsid(subsys);
list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
pci_dev_put(radix_tree_delete(&ctrl->p2p_ns_map, ns->nsid));
mutex_unlock(&subsys->lock);
/*
@ -450,6 +550,7 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
percpu_ref_exit(&ns->ref);
mutex_lock(&subsys->lock);
subsys->nr_namespaces--;
nvmet_ns_changed(subsys, ns->nsid);
nvmet_ns_dev_disable(ns);
@ -725,6 +826,51 @@ void nvmet_req_execute(struct nvmet_req *req)
}
EXPORT_SYMBOL_GPL(nvmet_req_execute);
int nvmet_req_alloc_sgl(struct nvmet_req *req)
{
struct pci_dev *p2p_dev = NULL;
if (IS_ENABLED(CONFIG_PCI_P2PDMA)) {
if (req->sq->ctrl && req->ns)
p2p_dev = radix_tree_lookup(&req->sq->ctrl->p2p_ns_map,
req->ns->nsid);
req->p2p_dev = NULL;
if (req->sq->qid && p2p_dev) {
req->sg = pci_p2pmem_alloc_sgl(p2p_dev, &req->sg_cnt,
req->transfer_len);
if (req->sg) {
req->p2p_dev = p2p_dev;
return 0;
}
}
/*
* If no P2P memory was available we fallback to using
* regular memory
*/
}
req->sg = sgl_alloc(req->transfer_len, GFP_KERNEL, &req->sg_cnt);
if (!req->sg)
return -ENOMEM;
return 0;
}
EXPORT_SYMBOL_GPL(nvmet_req_alloc_sgl);
void nvmet_req_free_sgl(struct nvmet_req *req)
{
if (req->p2p_dev)
pci_p2pmem_free_sgl(req->p2p_dev, req->sg);
else
sgl_free(req->sg);
req->sg = NULL;
req->sg_cnt = 0;
}
EXPORT_SYMBOL_GPL(nvmet_req_free_sgl);
static inline bool nvmet_cc_en(u32 cc)
{
return (cc >> NVME_CC_EN_SHIFT) & 0x1;
@ -921,6 +1067,37 @@ bool nvmet_host_allowed(struct nvmet_req *req, struct nvmet_subsys *subsys,
return __nvmet_host_allowed(subsys, hostnqn);
}
/*
* Note: ctrl->subsys->lock should be held when calling this function
*/
static void nvmet_setup_p2p_ns_map(struct nvmet_ctrl *ctrl,
struct nvmet_req *req)
{
struct nvmet_ns *ns;
if (!req->p2p_client)
return;
ctrl->p2p_client = get_device(req->p2p_client);
list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link)
nvmet_p2pmem_ns_add_p2p(ctrl, ns);
}
/*
* Note: ctrl->subsys->lock should be held when calling this function
*/
static void nvmet_release_p2p_ns_map(struct nvmet_ctrl *ctrl)
{
struct radix_tree_iter iter;
void __rcu **slot;
radix_tree_for_each_slot(slot, &ctrl->p2p_ns_map, &iter, 0)
pci_dev_put(radix_tree_deref_slot(slot));
put_device(ctrl->p2p_client);
}
u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp)
{
@ -962,6 +1139,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
INIT_LIST_HEAD(&ctrl->async_events);
INIT_RADIX_TREE(&ctrl->p2p_ns_map, GFP_KERNEL);
memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE);
memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE);
@ -1026,6 +1204,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
mutex_lock(&subsys->lock);
list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
nvmet_setup_p2p_ns_map(ctrl, req);
mutex_unlock(&subsys->lock);
*ctrlp = ctrl;
@ -1053,6 +1232,7 @@ static void nvmet_ctrl_free(struct kref *ref)
struct nvmet_subsys *subsys = ctrl->subsys;
mutex_lock(&subsys->lock);
nvmet_release_p2p_ns_map(ctrl);
list_del(&ctrl->subsys_entry);
mutex_unlock(&subsys->lock);


@ -78,6 +78,9 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
op = REQ_OP_READ;
}
if (is_pci_p2pdma_page(sg_page(req->sg)))
op_flags |= REQ_NOMERGE;
sector = le64_to_cpu(req->cmd->rw.slba);
sector <<= (req->ns->blksize_shift - 9);


@ -26,6 +26,7 @@
#include <linux/configfs.h>
#include <linux/rcupdate.h>
#include <linux/blkdev.h>
#include <linux/radix-tree.h>
#define NVMET_ASYNC_EVENTS 4
#define NVMET_ERROR_LOG_SLOTS 128
@ -77,6 +78,9 @@ struct nvmet_ns {
struct completion disable_done;
mempool_t *bvec_pool;
struct kmem_cache *bvec_cache;
int use_p2pmem;
struct pci_dev *p2p_dev;
};
static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item)
@ -84,6 +88,11 @@ static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item)
return container_of(to_config_group(item), struct nvmet_ns, group);
}
static inline struct device *nvmet_ns_dev(struct nvmet_ns *ns)
{
return ns->bdev ? disk_to_dev(ns->bdev->bd_disk) : NULL;
}
struct nvmet_cq {
u16 qid;
u16 size;
@ -184,6 +193,9 @@ struct nvmet_ctrl {
char subsysnqn[NVMF_NQN_FIELD_LEN];
char hostnqn[NVMF_NQN_FIELD_LEN];
struct device *p2p_client;
struct radix_tree_root p2p_ns_map;
};
struct nvmet_subsys {
@ -294,6 +306,9 @@ struct nvmet_req {
void (*execute)(struct nvmet_req *req);
const struct nvmet_fabrics_ops *ops;
struct pci_dev *p2p_dev;
struct device *p2p_client;
};
extern struct workqueue_struct *buffered_io_wq;
@ -336,6 +351,8 @@ bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq,
void nvmet_req_uninit(struct nvmet_req *req);
void nvmet_req_execute(struct nvmet_req *req);
void nvmet_req_complete(struct nvmet_req *req, u16 status);
int nvmet_req_alloc_sgl(struct nvmet_req *req);
void nvmet_req_free_sgl(struct nvmet_req *req);
void nvmet_cq_setup(struct nvmet_ctrl *ctrl, struct nvmet_cq *cq, u16 qid,
u16 size);


@ -503,7 +503,7 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
}
if (rsp->req.sg != rsp->cmd->inline_sg)
sgl_free(rsp->req.sg);
nvmet_req_free_sgl(&rsp->req);
if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list)))
nvmet_rdma_process_wr_wait_list(queue);
@ -652,24 +652,24 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
{
struct rdma_cm_id *cm_id = rsp->queue->cm_id;
u64 addr = le64_to_cpu(sgl->addr);
u32 len = get_unaligned_le24(sgl->length);
u32 key = get_unaligned_le32(sgl->key);
int ret;
rsp->req.transfer_len = get_unaligned_le24(sgl->length);
/* no data command? */
if (!len)
if (!rsp->req.transfer_len)
return 0;
rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt);
if (!rsp->req.sg)
return NVME_SC_INTERNAL;
ret = nvmet_req_alloc_sgl(&rsp->req);
if (ret < 0)
goto error_out;
ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
nvmet_data_dir(&rsp->req));
if (ret < 0)
return NVME_SC_INTERNAL;
rsp->req.transfer_len += len;
goto error_out;
rsp->n_rdma += ret;
if (invalidate) {
@ -678,6 +678,10 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
}
return 0;
error_out:
rsp->req.transfer_len = 0;
return NVME_SC_INTERNAL;
}
static u16 nvmet_rdma_map_sgl(struct nvmet_rdma_rsp *rsp)
@ -745,6 +749,8 @@ static void nvmet_rdma_handle_command(struct nvmet_rdma_queue *queue,
cmd->send_sge.addr, cmd->send_sge.length,
DMA_TO_DEVICE);
cmd->req.p2p_client = &queue->dev->device->dev;
if (!nvmet_req_init(&cmd->req, &queue->nvme_cq,
&queue->nvme_sq, &nvmet_rdma_ops))
return;


@ -132,6 +132,23 @@ config PCI_PASID
If unsure, say N.
config PCI_P2PDMA
bool "PCI peer-to-peer transfer support"
depends on PCI && ZONE_DEVICE
select GENERIC_ALLOCATOR
help
Enables drivers to do PCI peer-to-peer transactions to and from
BARs that are exposed in other devices that are part of
the hierarchy where peer-to-peer DMA is guaranteed by the PCI
specification to work (i.e. anything below a single PCI bridge).
Many PCIe root complexes do not support P2P transactions and
it's hard to tell which support it at all, so at this time,
P2P DMA transactions must be between devices behind the same root
port.
If unsure, say N.
config PCI_LABEL
def_bool y if (DMI || ACPI)
depends on PCI


@ -26,6 +26,7 @@ obj-$(CONFIG_PCI_SYSCALL) += syscall.o
obj-$(CONFIG_PCI_STUB) += pci-stub.o
obj-$(CONFIG_PCI_PF_STUB) += pci-pf-stub.o
obj-$(CONFIG_PCI_ECAM) += ecam.o
obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o
obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
# Endpoint library must be initialized before its users

drivers/pci/p2pdma.c (new file)

@ -0,0 +1,805 @@
// SPDX-License-Identifier: GPL-2.0
/*
* PCI Peer 2 Peer DMA support.
*
* Copyright (c) 2016-2018, Logan Gunthorpe
* Copyright (c) 2016-2017, Microsemi Corporation
* Copyright (c) 2017, Christoph Hellwig
* Copyright (c) 2018, Eideticom Inc.
*/
#define pr_fmt(fmt) "pci-p2pdma: " fmt
#include <linux/ctype.h>
#include <linux/pci-p2pdma.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/genalloc.h>
#include <linux/memremap.h>
#include <linux/percpu-refcount.h>
#include <linux/random.h>
#include <linux/seq_buf.h>
struct pci_p2pdma {
struct percpu_ref devmap_ref;
struct completion devmap_ref_done;
struct gen_pool *pool;
bool p2pmem_published;
};
static ssize_t size_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
struct pci_dev *pdev = to_pci_dev(dev);
size_t size = 0;
if (pdev->p2pdma->pool)
size = gen_pool_size(pdev->p2pdma->pool);
return snprintf(buf, PAGE_SIZE, "%zd\n", size);
}
static DEVICE_ATTR_RO(size);
static ssize_t available_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
struct pci_dev *pdev = to_pci_dev(dev);
size_t avail = 0;
if (pdev->p2pdma->pool)
avail = gen_pool_avail(pdev->p2pdma->pool);
return snprintf(buf, PAGE_SIZE, "%zd\n", avail);
}
static DEVICE_ATTR_RO(available);
static ssize_t published_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
struct pci_dev *pdev = to_pci_dev(dev);
return snprintf(buf, PAGE_SIZE, "%d\n",
pdev->p2pdma->p2pmem_published);
}
static DEVICE_ATTR_RO(published);
static struct attribute *p2pmem_attrs[] = {
&dev_attr_size.attr,
&dev_attr_available.attr,
&dev_attr_published.attr,
NULL,
};
static const struct attribute_group p2pmem_group = {
.attrs = p2pmem_attrs,
.name = "p2pmem",
};
static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
{
struct pci_p2pdma *p2p =
container_of(ref, struct pci_p2pdma, devmap_ref);
complete_all(&p2p->devmap_ref_done);
}
static void pci_p2pdma_percpu_kill(void *data)
{
struct percpu_ref *ref = data;
/*
* pci_p2pdma_add_resource() may be called multiple times
* by a driver and may register the percpu_kill devm action multiple
* times. We only want the first action to actually kill the
* percpu_ref.
*/
if (percpu_ref_is_dying(ref))
return;
percpu_ref_kill(ref);
}
static void pci_p2pdma_release(void *data)
{
struct pci_dev *pdev = data;
if (!pdev->p2pdma)
return;
wait_for_completion(&pdev->p2pdma->devmap_ref_done);
percpu_ref_exit(&pdev->p2pdma->devmap_ref);
gen_pool_destroy(pdev->p2pdma->pool);
sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
pdev->p2pdma = NULL;
}
static int pci_p2pdma_setup(struct pci_dev *pdev)
{
int error = -ENOMEM;
struct pci_p2pdma *p2p;
p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
if (!p2p)
return -ENOMEM;
p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
if (!p2p->pool)
goto out;
init_completion(&p2p->devmap_ref_done);
error = percpu_ref_init(&p2p->devmap_ref,
pci_p2pdma_percpu_release, 0, GFP_KERNEL);
if (error)
goto out_pool_destroy;
error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
if (error)
goto out_pool_destroy;
pdev->p2pdma = p2p;
error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
if (error)
goto out_pool_destroy;
return 0;
out_pool_destroy:
pdev->p2pdma = NULL;
gen_pool_destroy(p2p->pool);
out:
devm_kfree(&pdev->dev, p2p);
return error;
}
/**
* pci_p2pdma_add_resource - add memory for use as p2p memory
* @pdev: the device to add the memory to
* @bar: PCI BAR to add
* @size: size of the memory to add, may be zero to use the whole BAR
* @offset: offset into the PCI BAR
*
* The memory will be given ZONE_DEVICE struct pages so that it may
* be used with any DMA request.
*/
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset)
{
struct dev_pagemap *pgmap;
void *addr;
int error;
if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
return -EINVAL;
if (offset >= pci_resource_len(pdev, bar))
return -EINVAL;
if (!size)
size = pci_resource_len(pdev, bar) - offset;
if (size + offset > pci_resource_len(pdev, bar))
return -EINVAL;
if (!pdev->p2pdma) {
error = pci_p2pdma_setup(pdev);
if (error)
return error;
}
pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
if (!pgmap)
return -ENOMEM;
pgmap->res.start = pci_resource_start(pdev, bar) + offset;
pgmap->res.end = pgmap->res.start + size - 1;
pgmap->res.flags = pci_resource_flags(pdev, bar);
pgmap->ref = &pdev->p2pdma->devmap_ref;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
pci_resource_start(pdev, bar);
addr = devm_memremap_pages(&pdev->dev, pgmap);
if (IS_ERR(addr)) {
error = PTR_ERR(addr);
goto pgmap_free;
}
error = gen_pool_add_virt(pdev->p2pdma->pool, (unsigned long)addr,
pci_bus_address(pdev, bar) + offset,
resource_size(&pgmap->res), dev_to_node(&pdev->dev));
if (error)
goto pgmap_free;
error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
&pdev->p2pdma->devmap_ref);
if (error)
goto pgmap_free;
pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
&pgmap->res);
return 0;
pgmap_free:
devm_kfree(&pdev->dev, pgmap);
return error;
}
EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
/*
* Note this function returns the parent PCI device with a
* reference taken. It is the caller's responsibility to drop
* the reference.
*/
static struct pci_dev *find_parent_pci_dev(struct device *dev)
{
struct device *parent;
dev = get_device(dev);
while (dev) {
if (dev_is_pci(dev))
return to_pci_dev(dev);
parent = get_device(dev->parent);
put_device(dev);
dev = parent;
}
return NULL;
}
/*
* Check if a PCI bridge has its ACS redirection bits set to redirect P2P
* TLPs upstream via ACS. Returns 1 if the packets will be redirected
* upstream, 0 otherwise.
*/
static int pci_bridge_has_acs_redir(struct pci_dev *pdev)
{
int pos;
u16 ctrl;
pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
if (!pos)
return 0;
pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
if (ctrl & (PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC))
return 1;
return 0;
}
static void seq_buf_print_bus_devfn(struct seq_buf *buf, struct pci_dev *pdev)
{
if (!buf)
return;
seq_buf_printf(buf, "%s;", pci_name(pdev));
}
/*
* Find the distance through the nearest common upstream bridge between
* two PCI devices.
*
* If the two devices are the same device then 0 will be returned.
*
* If there are two virtual functions of the same device behind the same
* bridge port then 2 will be returned (one step down to the PCIe switch,
* then one step back to the same device).
*
* In the case where two devices are connected to the same PCIe switch, the
* value 4 will be returned. This corresponds to the following PCI tree:
*
* -+ Root Port
* \+ Switch Upstream Port
* +-+ Switch Downstream Port
* + \- Device A
* \-+ Switch Downstream Port
* \- Device B
*
* The distance is 4 because we traverse from Device A through the downstream
* port of the switch, to the common upstream port, back up to the second
* downstream port and then to Device B.
*
* Any two devices that don't have a common upstream bridge will return -1.
* In this way devices on separate PCIe root ports will be rejected, which
* is what we want for peer-to-peer seeing each PCIe root port defines a
* separate hierarchy domain and there's no way to determine whether the root
* complex supports forwarding between them.
*
* In the case where two devices are connected to different PCIe switches,
* this function will still return a positive distance as long as both
* switches eventually have a common upstream bridge. Note this covers
* the case of using multiple PCIe switches to achieve a desired level of
* fan-out from a root port. The exact distance will be a function of the
* number of switches between Device A and Device B.
*
* If a bridge which has any ACS redirection bits set is in the path
* then this function will return -2. This is so we reject any
* cases where the TLPs are forwarded up into the root complex.
* In this case, a list of all infringing bridge addresses will be
* populated in acs_list (assuming it's non-null) for printk purposes.
*/
static int upstream_bridge_distance(struct pci_dev *a,
struct pci_dev *b,
struct seq_buf *acs_list)
{
int dist_a = 0;
int dist_b = 0;
struct pci_dev *bb = NULL;
int acs_cnt = 0;
/*
* Note, we don't need to take references to devices returned by
* pci_upstream_bridge() seeing we hold a reference to a child
* device which will already hold a reference to the upstream bridge.
*/
while (a) {
dist_b = 0;
if (pci_bridge_has_acs_redir(a)) {
seq_buf_print_bus_devfn(acs_list, a);
acs_cnt++;
}
bb = b;
while (bb) {
if (a == bb)
goto check_b_path_acs;
bb = pci_upstream_bridge(bb);
dist_b++;
}
a = pci_upstream_bridge(a);
dist_a++;
}
return -1;
check_b_path_acs:
bb = b;
while (bb) {
if (a == bb)
break;
if (pci_bridge_has_acs_redir(bb)) {
seq_buf_print_bus_devfn(acs_list, bb);
acs_cnt++;
}
bb = pci_upstream_bridge(bb);
}
if (acs_cnt)
return -2;
return dist_a + dist_b;
}
static int upstream_bridge_distance_warn(struct pci_dev *provider,
struct pci_dev *client)
{
struct seq_buf acs_list;
int ret;
seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, GFP_KERNEL), PAGE_SIZE);
if (!acs_list.buffer)
return -ENOMEM;
ret = upstream_bridge_distance(provider, client, &acs_list);
if (ret == -2) {
pci_warn(client, "cannot be used for peer-to-peer DMA as ACS redirect is set between the client and provider (%s)\n",
pci_name(provider));
/* Drop final semicolon */
acs_list.buffer[acs_list.len-1] = 0;
pci_warn(client, "to disable ACS redirect for this path, add the kernel parameter: pci=disable_acs_redir=%s\n",
acs_list.buffer);
} else if (ret < 0) {
pci_warn(client, "cannot be used for peer-to-peer DMA as the client and provider (%s) do not share an upstream bridge\n",
pci_name(provider));
}
kfree(acs_list.buffer);
return ret;
}
/**
* pci_p2pdma_distance_many - Determine the cumulative distance between
* a p2pdma provider and the clients in use.
* @provider: p2pdma provider to check against the client list
* @clients: array of devices to check (NULL-terminated)
* @num_clients: number of clients in the array
* @verbose: if true, print warnings for devices when we return -1
*
* Returns -1 if any of the clients are not compatible (i.e. not behind the
* same root port as the provider), otherwise returns a positive number where
* a lower number is the preferable choice. (If there's one client
* that's the same as the provider it will return 0, which is the best choice).
*
* For now, "compatible" means the provider and the clients are all behind
* the same PCI root port. This cuts out cases that may work but is safest
* for the user. Future work can expand this to white-list root complexes that
* can safely forward between their ports.
*/
int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
int num_clients, bool verbose)
{
bool not_supported = false;
struct pci_dev *pci_client;
int distance = 0;
int i, ret;
if (num_clients == 0)
return -1;
for (i = 0; i < num_clients; i++) {
pci_client = find_parent_pci_dev(clients[i]);
if (!pci_client) {
if (verbose)
dev_warn(clients[i],
"cannot be used for peer-to-peer DMA as it is not a PCI device\n");
return -1;
}
if (verbose)
ret = upstream_bridge_distance_warn(provider,
pci_client);
else
ret = upstream_bridge_distance(provider, pci_client,
NULL);
pci_dev_put(pci_client);
if (ret < 0)
not_supported = true;
if (not_supported && !verbose)
break;
distance += ret;
}
if (not_supported)
return -1;
return distance;
}
EXPORT_SYMBOL_GPL(pci_p2pdma_distance_many);
/**
* pci_has_p2pmem - check if a given PCI device has published any p2pmem
* @pdev: PCI device to check
*/
bool pci_has_p2pmem(struct pci_dev *pdev)
{
return pdev->p2pdma && pdev->p2pdma->p2pmem_published;
}
EXPORT_SYMBOL_GPL(pci_has_p2pmem);
/**
* pci_p2pmem_find - find a peer-to-peer DMA memory device compatible with
* the specified list of clients and shortest distance (as determined
* by pci_p2pdma_distance_many())
* @clients: array of devices to check (NULL-terminated)
* @num_clients: number of client devices in the list
*
* If multiple devices are behind the same switch, the one "closest" to the
* client devices in use will be chosen first. (So if one of the providers is
* the same as one of the clients, that provider will be used ahead of any
* other providers that are unrelated). If multiple providers are an equal
* distance away, one will be chosen at random.
*
* Returns a pointer to the PCI device with a reference taken (use pci_dev_put
* to return the reference) or NULL if no compatible device is found. The
* found provider will also be assigned to the client list.
*/
struct pci_dev *pci_p2pmem_find_many(struct device **clients, int num_clients)
{
struct pci_dev *pdev = NULL;
int distance;
int closest_distance = INT_MAX;
struct pci_dev **closest_pdevs;
int dev_cnt = 0;
const int max_devs = PAGE_SIZE / sizeof(*closest_pdevs);
int i;
closest_pdevs = kmalloc(PAGE_SIZE, GFP_KERNEL);
if (!closest_pdevs)
return NULL;
while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
if (!pci_has_p2pmem(pdev))
continue;
distance = pci_p2pdma_distance_many(pdev, clients,
num_clients, false);
if (distance < 0 || distance > closest_distance)
continue;
if (distance == closest_distance && dev_cnt >= max_devs)
continue;
if (distance < closest_distance) {
for (i = 0; i < dev_cnt; i++)
pci_dev_put(closest_pdevs[i]);
dev_cnt = 0;
closest_distance = distance;
}
closest_pdevs[dev_cnt++] = pci_dev_get(pdev);
}
if (dev_cnt)
pdev = pci_dev_get(closest_pdevs[prandom_u32_max(dev_cnt)]);
for (i = 0; i < dev_cnt; i++)
pci_dev_put(closest_pdevs[i]);
kfree(closest_pdevs);
return pdev;
}
EXPORT_SYMBOL_GPL(pci_p2pmem_find_many);
/**
* pci_alloc_p2pmem - allocate peer-to-peer DMA memory
* @pdev: the device to allocate memory from
* @size: number of bytes to allocate
*
* Returns the allocated memory or NULL on error.
*/
void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
{
void *ret;
if (unlikely(!pdev->p2pdma))
return NULL;
if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
return NULL;
ret = (void *)gen_pool_alloc(pdev->p2pdma->pool, size);
if (unlikely(!ret))
percpu_ref_put(&pdev->p2pdma->devmap_ref);
return ret;
}
EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
/**
* pci_free_p2pmem - free peer-to-peer DMA memory
* @pdev: the device the memory was allocated from
* @addr: address of the memory that was allocated
* @size: number of bytes that was allocated
*/
void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
{
gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);
percpu_ref_put(&pdev->p2pdma->devmap_ref);
}
EXPORT_SYMBOL_GPL(pci_free_p2pmem);
/**
* pci_p2pmem_virt_to_bus - return the PCI bus address for a given virtual
* address obtained with pci_alloc_p2pmem()
* @pdev: the device the memory was allocated from
* @addr: address of the memory that was allocated
*/
pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
{
if (!addr)
return 0;
if (!pdev->p2pdma)
return 0;
/*
* Note: when we added the memory to the pool we used the PCI
* bus address as the physical address. So gen_pool_virt_to_phys()
* actually returns the bus address despite the misleading name.
*/
return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
}
EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
/**
* pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in a scatterlist
* @pdev: the device to allocate memory from
* @nents: the number of SG entries in the list
* @length: number of bytes to allocate
*
* Returns the allocated scatterlist or NULL on error
*/
struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
unsigned int *nents, u32 length)
{
struct scatterlist *sg;
void *addr;
sg = kzalloc(sizeof(*sg), GFP_KERNEL);
if (!sg)
return NULL;
sg_init_table(sg, 1);
addr = pci_alloc_p2pmem(pdev, length);
if (!addr)
goto out_free_sg;
sg_set_buf(sg, addr, length);
*nents = 1;
return sg;
out_free_sg:
kfree(sg);
return NULL;
}
EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
/**
* pci_p2pmem_free_sgl - free a scatterlist allocated by pci_p2pmem_alloc_sgl()
* @pdev: the device to allocate memory from
* @sgl: the allocated scatterlist
*/
void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl)
{
struct scatterlist *sg;
int count;
for_each_sg(sgl, sg, INT_MAX, count) {
if (!sg)
break;
pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
}
kfree(sgl);
}
EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
/**
* pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
* other devices with pci_p2pmem_find()
* @pdev: the device with peer-to-peer DMA memory to publish
* @publish: set to true to publish the memory, false to unpublish it
*
* Published memory can be used by other PCI device drivers for
* peer-to-peer DMA operations. Non-published memory is reserved for
* exclusive use of the device driver that registers the peer-to-peer
* memory.
*/
void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
{
if (pdev->p2pdma)
pdev->p2pdma->p2pmem_published = publish;
}
EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
/**
* pci_p2pdma_map_sg - map a PCI peer-to-peer scatterlist for DMA
* @dev: device doing the DMA request
* @sg: scatter list to map
* @nents: elements in the scatterlist
* @dir: DMA direction
*
* Scatterlists mapped with this function should not be unmapped in any way.
*
* Returns the number of SG entries mapped or 0 on error.
*/
int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
enum dma_data_direction dir)
{
struct dev_pagemap *pgmap;
struct scatterlist *s;
phys_addr_t paddr;
int i;
/*
* p2pdma mappings are not compatible with devices that use
* dma_virt_ops. If the upper layers do the right thing
* this should never happen because it will be prevented
* by the check in pci_p2pdma_add_client()
*/
if (WARN_ON_ONCE(IS_ENABLED(CONFIG_DMA_VIRT_OPS) &&
dev->dma_ops == &dma_virt_ops))
return 0;
for_each_sg(sg, s, nents, i) {
pgmap = sg_page(s)->pgmap;
paddr = sg_phys(s);
s->dma_address = paddr - pgmap->pci_p2pdma_bus_offset;
sg_dma_len(s) = s->length;
}
return nents;
}
EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg);
/**
* pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
* to enable p2pdma
* @page: contents of the value to be stored
* @p2p_dev: returns the PCI device that was selected to be used
* (if one was specified in the stored value)
* @use_p2pdma: returns whether to enable p2pdma or not
*
* Parses an attribute value to decide whether to enable p2pdma.
* The value can select a PCI device (using its full BDF device
* name) or a boolean (in any format strtobool() accepts). A false
* value disables p2pdma; a true value expects the caller
* to automatically find a compatible device; and specifying a PCI device
* expects the caller to use that specific provider.
*
* pci_p2pdma_enable_show() should be used as the show operation for
* the attribute.
*
* Returns 0 on success
*/
int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma)
{
struct device *dev;
dev = bus_find_device_by_name(&pci_bus_type, NULL, page);
if (dev) {
*use_p2pdma = true;
*p2p_dev = to_pci_dev(dev);
if (!pci_has_p2pmem(*p2p_dev)) {
pci_err(*p2p_dev,
"PCI device has no peer-to-peer memory: %s\n",
page);
pci_dev_put(*p2p_dev);
return -ENODEV;
}
return 0;
} else if ((page[0] == '0' || page[0] == '1') && !iscntrl(page[1])) {
/*
* If the user enters a PCI device that doesn't exist
* like "0000:01:00.1", we don't want strtobool to think
* it's a '0' when it's clearly not what the user wanted.
* So we require 0's and 1's to be exactly one character.
*/
} else if (!strtobool(page, use_p2pdma)) {
return 0;
}
pr_err("No such PCI device: %.*s\n", (int)strcspn(page, "\n"), page);
return -ENODEV;
}
EXPORT_SYMBOL_GPL(pci_p2pdma_enable_store);
/**
* pci_p2pdma_enable_show - show a configfs/sysfs attribute indicating
* whether p2pdma is enabled
* @page: contents of the stored value
* @p2p_dev: the selected p2p device (NULL if no device is selected)
* @use_p2pdma: whether p2pdma has been enabled
*
* Attributes that use pci_p2pdma_enable_store() should use this function
* to show the value of the attribute.
*
* Returns the number of characters printed
*/
ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
bool use_p2pdma)
{
if (!use_p2pdma)
return sprintf(page, "0\n");
if (!p2p_dev)
return sprintf(page, "1\n");
return sprintf(page, "%s\n", pci_name(p2p_dev));
}
EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);


@ -699,6 +699,7 @@ struct request_queue {
#define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
#define QUEUE_FLAG_QUIESCED 28 /* queue has been quiesced */
#define QUEUE_FLAG_PREEMPT_ONLY 29 /* only process REQ_PREEMPT requests */
#define QUEUE_FLAG_PCI_P2PDMA 30 /* device supports PCI p2p requests */
#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_SAME_COMP) | \
@ -731,6 +732,8 @@ bool blk_queue_flag_test_and_clear(unsigned int flag, struct request_queue *q);
#define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
#define blk_queue_scsi_passthrough(q) \
test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
#define blk_queue_pci_p2pdma(q) \
test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags)
#define blk_noretry_request(rq) \
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \


@ -53,11 +53,16 @@ struct vmem_altmap {
* wakeup event whenever a page is unpinned and becomes idle. This
* wakeup is used to coordinate physical address space management (ex:
* fs truncate/hole punch) vs pinned pages (ex: device dma).
*
* MEMORY_DEVICE_PCI_P2PDMA:
* Device memory residing in a PCI BAR intended for use with Peer-to-Peer
* transactions.
*/
enum memory_type {
MEMORY_DEVICE_PRIVATE = 1,
MEMORY_DEVICE_PUBLIC,
MEMORY_DEVICE_FS_DAX,
MEMORY_DEVICE_PCI_P2PDMA,
};
/*
@ -120,6 +125,7 @@ struct dev_pagemap {
struct device *dev;
void *data;
enum memory_type type;
u64 pci_p2pdma_bus_offset;
};
#ifdef CONFIG_ZONE_DEVICE


@ -890,6 +890,19 @@ static inline bool is_device_public_page(const struct page *page)
page->pgmap->type == MEMORY_DEVICE_PUBLIC;
}
#ifdef CONFIG_PCI_P2PDMA
static inline bool is_pci_p2pdma_page(const struct page *page)
{
return is_zone_device_page(page) &&
page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
}
#else /* CONFIG_PCI_P2PDMA */
static inline bool is_pci_p2pdma_page(const struct page *page)
{
return false;
}
#endif /* CONFIG_PCI_P2PDMA */
#else /* CONFIG_DEV_PAGEMAP_OPS */
static inline void dev_pagemap_get_ops(void)
{
@ -913,6 +926,11 @@ static inline bool is_device_public_page(const struct page *page)
{
return false;
}
static inline bool is_pci_p2pdma_page(const struct page *page)
{
return false;
}
#endif /* CONFIG_DEV_PAGEMAP_OPS */
static inline void get_page(struct page *page)

include/linux/pci-p2pdma.h (new file)

@ -0,0 +1,114 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
* PCI Peer 2 Peer DMA support.
*
* Copyright (c) 2016-2018, Logan Gunthorpe
* Copyright (c) 2016-2017, Microsemi Corporation
* Copyright (c) 2017, Christoph Hellwig
* Copyright (c) 2018, Eideticom Inc.
*/
#ifndef _LINUX_PCI_P2PDMA_H
#define _LINUX_PCI_P2PDMA_H
#include <linux/pci.h>
struct block_device;
struct scatterlist;
#ifdef CONFIG_PCI_P2PDMA
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset);
int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
int num_clients, bool verbose);
bool pci_has_p2pmem(struct pci_dev *pdev);
struct pci_dev *pci_p2pmem_find_many(struct device **clients, int num_clients);
void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size);
void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size);
pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);
struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
unsigned int *nents, u32 length);
void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
enum dma_data_direction dir);
int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
bool use_p2pdma);
#else /* CONFIG_PCI_P2PDMA */
static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
{
return -EOPNOTSUPP;
}
static inline int pci_p2pdma_distance_many(struct pci_dev *provider,
struct device **clients, int num_clients, bool verbose)
{
return -1;
}
static inline bool pci_has_p2pmem(struct pci_dev *pdev)
{
return false;
}
static inline struct pci_dev *pci_p2pmem_find_many(struct device **clients,
int num_clients)
{
return NULL;
}
static inline void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
{
return NULL;
}
static inline void pci_free_p2pmem(struct pci_dev *pdev, void *addr,
size_t size)
{
}
static inline pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev,
void *addr)
{
return 0;
}
static inline struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
unsigned int *nents, u32 length)
{
return NULL;
}
static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
struct scatterlist *sgl)
{
}
static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
{
}
static inline int pci_p2pdma_map_sg(struct device *dev,
struct scatterlist *sg, int nents, enum dma_data_direction dir)
{
return 0;
}
static inline int pci_p2pdma_enable_store(const char *page,
struct pci_dev **p2p_dev, bool *use_p2pdma)
{
*use_p2pdma = false;
return 0;
}
static inline ssize_t pci_p2pdma_enable_show(char *page,
struct pci_dev *p2p_dev, bool use_p2pdma)
{
return sprintf(page, "none\n");
}
#endif /* CONFIG_PCI_P2PDMA */
static inline int pci_p2pdma_distance(struct pci_dev *provider,
struct device *client, bool verbose)
{
return pci_p2pdma_distance_many(provider, &client, 1, verbose);
}
static inline struct pci_dev *pci_p2pmem_find(struct device *client)
{
return pci_p2pmem_find_many(&client, 1);
}
#endif /* _LINUX_PCI_P2PDMA_H */


@ -281,6 +281,7 @@ struct pcie_link_state;
struct pci_vpd;
struct pci_sriov;
struct pci_ats;
struct pci_p2pdma;
/* The pci_dev structure describes PCI devices */
struct pci_dev {
@ -440,6 +441,9 @@ struct pci_dev {
#endif
#ifdef CONFIG_PCI_PASID
u16 pasid_features;
#endif
#ifdef CONFIG_PCI_P2PDMA
struct pci_p2pdma *p2pdma;
#endif
phys_addr_t rom; /* Physical address if not from BAR */
size_t romlen; /* Length if not from BAR */