Commit graph

547 commits

Author SHA1 Message Date
Linus Torvalds 617a814f14 ALong with the usual shower of singleton patches, notable patch series in
this pull request are:
 
 "Align kvrealloc() with krealloc()" from Danilo Krummrich.  Adds
 consistency to the APIs and behaviour of these two core allocation
 functions.  This also simplifies/enables Rustification.
 
 "Some cleanups for shmem" from Baolin Wang.  No functional changes - mode
 code reuse, better function naming, logic simplifications.
 
 "mm: some small page fault cleanups" from Josef Bacik.  No functional
 changes - code cleanups only.
 
 "Various memory tiering fixes" from Zi Yan.  A small fix and a little
 cleanup.
 
 "mm/swap: remove boilerplate" from Yu Zhao.  Code cleanups and
 simplifications and .text shrinkage.
 
 "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt.  This
 is a feature, it adds new feilds to /proc/vmstat such as
 
     $ grep kstack /proc/vmstat
     kstack_1k 3
     kstack_2k 188
     kstack_4k 11391
     kstack_8k 243
     kstack_16k 0
 
 which tells us that 11391 processes used 4k of stack while none at all
 used 16k.  Useful for some system tuning things, but partivularly useful
 for "the dynamic kernel stack project".
 
 "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov.
 Teaches kmemleak to detect leaksage of percpu memory.
 
 "mm: memcg: page counters optimizations" from Roman Gushchin.  "3
 independent small optimizations of page counters".
 
 "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David
 Hildenbrand.  Improves PTE/PMD splitlock detection, makes powerpc/8xx work
 correctly by design rather than by accident.
 
 "mm: remove arch_make_page_accessible()" from David Hildenbrand.  Some
 folio conversions which make arch_make_page_accessible() unneeded.
 
 "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel.
 Cleans up and fixes our handling of the resetting of the cgroup/process
 peak-memory-use detector.
 
 "Make core VMA operations internal and testable" from Lorenzo Stoakes.
 Rationalizaion and encapsulation of the VMA manipulation APIs.  With a
 view to better enable testing of the VMA functions, even from a
 userspace-only harness.
 
 "mm: zswap: fixes for global shrinker" from Takero Funaki.  Fix issues in
 the zswap global shrinker, resulting in improved performance.
 
 "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao.  Fill in
 some missing info in /proc/zoneinfo.
 
 "mm: replace follow_page() by folio_walk" from David Hildenbrand.  Code
 cleanups and rationalizations (conversion to folio_walk()) resulting in
 the removal of follow_page().
 
 "improving dynamic zswap shrinker protection scheme" from Nhat Pham.  Some
 tuning to improve zswap's dynamic shrinker.  Significant reductions in
 swapin and improvements in performance are shown.
 
 "mm: Fix several issues with unaccepted memory" from Kirill Shutemov.
 Improvements to the new unaccepted memory feature,
 
 "mm/mprotect: Fix dax puds" from Peter Xu.  Implements mprotect on DAX
 PUDs.  This was missing, although nobody seems to have notied yet.
 
 "Introduce a store type enum for the Maple tree" from Sidhartha Kumar.
 Cleanups and modest performance improvements for the maple tree library
 code.
 
 "memcg: further decouple v1 code from v2" from Shakeel Butt.  Move more
 cgroup v1 remnants away from the v2 memcg code.
 
 "memcg: initiate deprecation of v1 features" from Shakeel Butt.  Adds
 various warnings telling users that memcg v1 features are deprecated.
 
 "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li.
 Greatly improves the success rate of the mTHP swap allocation.
 
 "mm: introduce numa_memblks" from Mike Rapoport.  Moves various disparate
 per-arch implementations of numa_memblk code into generic code.
 
 "mm: batch free swaps for zap_pte_range()" from Barry Song.  Greatly
 improves the performance of munmap() of swap-filled ptes.
 
 "support large folio swap-out and swap-in for shmem" from Baolin Wang.
 With this series we no longer split shmem large folios into simgle-page
 folios when swapping out shmem.
 
 "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao.  Nice performance
 improvements and code reductions for gigantic folios.
 
 "support shmem mTHP collapse" from Baolin Wang.  Adds support for
 khugepaged's collapsing of shmem mTHP folios.
 
 "mm: Optimize mseal checks" from Pedro Falcato.  Fixes an mprotect()
 performance regression due to the addition of mseal().
 
 "Increase the number of bits available in page_type" from Matthew Wilcox.
 Increases the number of bits available in page_type!
 
 "Simplify the page flags a little" from Matthew Wilcox.  Many legacy page
 flags are now folio flags, so the page-based flags and their
 accessors/mutators can be removed.
 
 "mm: store zero pages to be swapped out in a bitmap" from Usama Arif.  An
 optimization which permits us to avoid writing/reading zero-filled zswap
 pages to backing store.
 
 "Avoid MAP_FIXED gap exposure" from Liam Howlett.  Fixes a race window
 which occurs when a MAP_FIXED operqtion is occurring during an unrelated
 vma tree walk.
 
 "mm: remove vma_merge()" from Lorenzo Stoakes.  Major rotorooting of the
 vma_merge() functionality, making ot cleaner, more testable and better
 tested.
 
 "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.  Minor
 fixups of DAMON selftests and kunit tests.
 
 "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.  Code
 cleanups and folio conversions.
 
 "Shmem mTHP controls and stats improvements" from Ryan Roberts.  Cleanups
 for shmem controls and stats.
 
 "mm: count the number of anonymous THPs per size" from Barry Song.  Expose
 additional anon THP stats to userspace for improved tuning.
 
 "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio
 conversions and removal of now-unused page-based APIs.
 
 "replace per-quota region priorities histogram buffer with per-context
 one" from SeongJae Park.  DAMON histogram rationalization.
 
 "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae
 Park.  DAMON documentation updates.
 
 "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve
 related doc and warn" from Jason Wang: fixes usage of page allocator
 __GFP_NOFAIL and GFP_ATOMIC flags.
 
 "mm: split underused THPs" from Yu Zhao.  Improve THP=always policy - this
 was overprovisioning THPs in sparsely accessed memory areas.
 
 "zram: introduce custom comp backends API" frm Sergey Senozhatsky.  Add
 support for zram run-time compression algorithm tuning.
 
 "mm: Care about shadow stack guard gap when getting an unmapped area" from
 Mark Brown.  Fix up the various arch_get_unmapped_area() implementations
 to better respect guard areas.
 
 "Improve mem_cgroup_iter()" from Kinsey Ho.  Improve the reliability of
 mem_cgroup_iter() and various code cleanups.
 
 "mm: Support huge pfnmaps" from Peter Xu.  Extends the usage of huge
 pfnmap support.
 
 "resource: Fix region_intersects() vs add_memory_driver_managed()" from
 Huang Ying.  Fix a bug in region_intersects() for systems with CXL memory.
 
 "mm: hwpoison: two more poison recovery" from Kefeng Wang.  Teaches a
 couple more code paths to correctly recover from the encountering of
 poisoned memry.
 
 "mm: enable large folios swap-in support" from Barry Song.  Support the
 swapin of mTHP memory into appropriately-sized folios, rather than into
 single-page folios.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZu1BBwAKCRDdBJ7gKXxA
 jlWNAQDYlqQLun7bgsAN4sSvi27VUuWv1q70jlMXTfmjJAvQqwD/fBFVR6IOOiw7
 AkDbKWP2k0hWPiNJBGwoqxdHHx09Xgo=
 =s0T+
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Along with the usual shower of singleton patches, notable patch series
  in this pull request are:

   - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
     consistency to the APIs and behaviour of these two core allocation
     functions. This also simplifies/enables Rustification.

   - "Some cleanups for shmem" from Baolin Wang. No functional changes -
     mode code reuse, better function naming, logic simplifications.

   - "mm: some small page fault cleanups" from Josef Bacik. No
     functional changes - code cleanups only.

   - "Various memory tiering fixes" from Zi Yan. A small fix and a
     little cleanup.

   - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
     simplifications and .text shrinkage.

   - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
     Butt. This is a feature, it adds new feilds to /proc/vmstat such as

       $ grep kstack /proc/vmstat
       kstack_1k 3
       kstack_2k 188
       kstack_4k 11391
       kstack_8k 243
       kstack_16k 0

     which tells us that 11391 processes used 4k of stack while none at
     all used 16k. Useful for some system tuning things, but
     partivularly useful for "the dynamic kernel stack project".

   - "kmemleak: support for percpu memory leak detect" from Pavel
     Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory.

   - "mm: memcg: page counters optimizations" from Roman Gushchin. "3
     independent small optimizations of page counters".

   - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
     David Hildenbrand. Improves PTE/PMD splitlock detection, makes
     powerpc/8xx work correctly by design rather than by accident.

   - "mm: remove arch_make_page_accessible()" from David Hildenbrand.
     Some folio conversions which make arch_make_page_accessible()
     unneeded.

   - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David
     Finkel. Cleans up and fixes our handling of the resetting of the
     cgroup/process peak-memory-use detector.

   - "Make core VMA operations internal and testable" from Lorenzo
     Stoakes. Rationalizaion and encapsulation of the VMA manipulation
     APIs. With a view to better enable testing of the VMA functions,
     even from a userspace-only harness.

   - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
     issues in the zswap global shrinker, resulting in improved
     performance.

   - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
     in some missing info in /proc/zoneinfo.

   - "mm: replace follow_page() by folio_walk" from David Hildenbrand.
     Code cleanups and rationalizations (conversion to folio_walk())
     resulting in the removal of follow_page().

   - "improving dynamic zswap shrinker protection scheme" from Nhat
     Pham. Some tuning to improve zswap's dynamic shrinker. Significant
     reductions in swapin and improvements in performance are shown.

   - "mm: Fix several issues with unaccepted memory" from Kirill
     Shutemov. Improvements to the new unaccepted memory feature,

   - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
     DAX PUDs. This was missing, although nobody seems to have notied
     yet.

   - "Introduce a store type enum for the Maple tree" from Sidhartha
     Kumar. Cleanups and modest performance improvements for the maple
     tree library code.

   - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
     more cgroup v1 remnants away from the v2 memcg code.

   - "memcg: initiate deprecation of v1 features" from Shakeel Butt.
     Adds various warnings telling users that memcg v1 features are
     deprecated.

   - "mm: swap: mTHP swap allocator base on swap cluster order" from
     Chris Li. Greatly improves the success rate of the mTHP swap
     allocation.

   - "mm: introduce numa_memblks" from Mike Rapoport. Moves various
     disparate per-arch implementations of numa_memblk code into generic
     code.

   - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
     improves the performance of munmap() of swap-filled ptes.

   - "support large folio swap-out and swap-in for shmem" from Baolin
     Wang. With this series we no longer split shmem large folios into
     simgle-page folios when swapping out shmem.

   - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
     performance improvements and code reductions for gigantic folios.

   - "support shmem mTHP collapse" from Baolin Wang. Adds support for
     khugepaged's collapsing of shmem mTHP folios.

   - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
     performance regression due to the addition of mseal().

   - "Increase the number of bits available in page_type" from Matthew
     Wilcox. Increases the number of bits available in page_type!

   - "Simplify the page flags a little" from Matthew Wilcox. Many legacy
     page flags are now folio flags, so the page-based flags and their
     accessors/mutators can be removed.

   - "mm: store zero pages to be swapped out in a bitmap" from Usama
     Arif. An optimization which permits us to avoid writing/reading
     zero-filled zswap pages to backing store.

   - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
     window which occurs when a MAP_FIXED operqtion is occurring during
     an unrelated vma tree walk.

   - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
     the vma_merge() functionality, making ot cleaner, more testable and
     better tested.

   - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
     Minor fixups of DAMON selftests and kunit tests.

   - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
     Code cleanups and folio conversions.

   - "Shmem mTHP controls and stats improvements" from Ryan Roberts.
     Cleanups for shmem controls and stats.

   - "mm: count the number of anonymous THPs per size" from Barry Song.
     Expose additional anon THP stats to userspace for improved tuning.

   - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
     folio conversions and removal of now-unused page-based APIs.

   - "replace per-quota region priorities histogram buffer with
     per-context one" from SeongJae Park. DAMON histogram
     rationalization.

   - "Docs/damon: update GitHub repo URLs and maintainer-profile" from
     SeongJae Park. DAMON documentation updates.

   - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
     improve related doc and warn" from Jason Wang: fixes usage of page
     allocator __GFP_NOFAIL and GFP_ATOMIC flags.

   - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
     This was overprovisioning THPs in sparsely accessed memory areas.

   - "zram: introduce custom comp backends API" frm Sergey Senozhatsky.
     Add support for zram run-time compression algorithm tuning.

   - "mm: Care about shadow stack guard gap when getting an unmapped
     area" from Mark Brown. Fix up the various arch_get_unmapped_area()
     implementations to better respect guard areas.

   - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
     of mem_cgroup_iter() and various code cleanups.

   - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
     pfnmap support.

   - "resource: Fix region_intersects() vs add_memory_driver_managed()"
     from Huang Ying. Fix a bug in region_intersects() for systems with
     CXL memory.

   - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
     a couple more code paths to correctly recover from the encountering
     of poisoned memry.

   - "mm: enable large folios swap-in support" from Barry Song. Support
     the swapin of mTHP memory into appropriately-sized folios, rather
     than into single-page folios"

* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
  zram: free secondary algorithms names
  uprobes: turn xol_area->pages[2] into xol_area->page
  uprobes: introduce the global struct vm_special_mapping xol_mapping
  Revert "uprobes: use vm_special_mapping close() functionality"
  mm: support large folios swap-in for sync io devices
  mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
  mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
  mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
  set_memory: add __must_check to generic stubs
  mm/vma: return the exact errno in vms_gather_munmap_vmas()
  memcg: cleanup with !CONFIG_MEMCG_V1
  mm/show_mem.c: report alloc tags in human readable units
  mm: support poison recovery from copy_present_page()
  mm: support poison recovery from do_cow_fault()
  resource, kunit: add test case for region_intersects()
  resource: make alloc_free_mem_region() works for iomem_resource
  mm: z3fold: deprecate CONFIG_Z3FOLD
  vfio/pci: implement huge_fault support
  mm/arm64: support large pfn mappings
  mm/x86: support large pfn mappings
  ...
2024-09-21 07:29:05 -07:00
Jason Wang 955abe0a1b vduse: avoid using __GFP_NOFAIL
Patch series "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL
and improve related doc and warn", v4.

__GFP_NOFAIL carries the semantics of never failing, so its callers do not
check the return value:

  %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
  cannot handle allocation failures. The allocation could block
  indefinitely but will never return with failure. Testing for
  failure is pointless.

However, __GFP_NOFAIL can sometimes fail if it exceeds size limits or is
used with GFP_ATOMIC/GFP_NOWAIT in a non-sleepable context.  This patchset
handles illegal using __GFP_NOFAIL together with GFP_ATOMIC lacking
__GFP_DIRECT_RECLAIM(without this, we can't do anything to reclaim memory
to satisfy the nofail requirement) and improve related document and
warnings.

The proper size limits for __GFP_NOFAIL will be handled separately after
more discussions.


This patch (of 3):

mm doesn't support non-blockable __GFP_NOFAIL allocation.  Because
persisting in providing __GFP_NOFAIL services for non-block users who
cannot perform direct memory reclaim may only result in an endless busy
loop.

Therefore, in such cases, the current mm-core may directly return a NULL
pointer:

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                                                struct alloc_context *ac)
{
        ...
        if (gfp_mask & __GFP_NOFAIL) {
                /*
                 * All existing users of the __GFP_NOFAIL are blockable, so warn
                 * of any new users that actually require GFP_NOWAIT
                 */
                if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
                        goto fail;
                ...
        }
        ...
fail:
        warn_alloc(gfp_mask, ac->nodemask,
                        "page allocation failure: order:%u", order);
got_pg:
        return page;
}

Unfortuantely, vpda does that nofail allocation under non-sleepable lock. 
A possible way to fix that is to move the pages allocation out of the lock
into the caller, but having to allocate a huge number of pages and
auxiliary page array seems to be problematic as well per Tetsuon: " You
should implement proper error handling instead of using __GFP_NOFAIL if
count can become large."

So I chose another way, which does not release kernel bounce pages when
user tries to register userspace bounce pages.  Then we can avoid
allocating in paths where failure is not expected.(e.g in the release). 
We pay this for more memory usage as we don't release kernel bounce pages
but further optimizations could be done on top.

[v-songbaohua@oppo.com: Refine the changelog]
Link: https://lkml.kernel.org/r/20240830202823.21478-1-21cnbao@gmail.com
Link: https://lkml.kernel.org/r/20240830202823.21478-2-21cnbao@gmail.com
Fixes: 6c77ed2288 ("vduse: Support using userspace pages as bounce buffer")
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
Tested-by: Xie Yongji <xieyongji@bytedance.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hailong.Liu <hailong.liu@oppo.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:02 -07:00
Christoph Hellwig de6c85bf91 dma-mapping: clearly mark DMA ops as an architecture feature
DMA ops are a helper for architectures and not for drivers to override
the DMA implementation.

Unfortunately driver authors keep ignoring this.  Make the fact more
clear by renaming the symbol to ARCH_HAS_DMA_OPS and having the two drivers
overriding their dma_ops depend on that.  These drivers should probably be
marked broken, but we can give them a bit of a grace period for that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com> # for IPU6
Acked-by: Robin Murphy <robin.murphy@arm.com>
2024-09-04 07:08:51 +03:00
Christoph Hellwig 92193b3569 vdpa_sim: don't select DMA_OPS
vdpa_sim has been fixed to not override the dma_map_ops in commit
6c3d329e64 ("vdpa_sim: get rid of DMA ops"), so don't select the
symbol and don't depend on HAS_DMA.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-04 07:08:38 +03:00
Linus Torvalds 10826505f5 virtio: fixes
This fixes 3 issues:
 - prevent admin commands on one VF blocking another:
   prevents a bad VF from blocking a good one, as well
   as fixing a scalability issue with large # of VFs
 - correctly return error on command failure on octeon:
   currently if any commands fail we treat it as success
 - fix modpost warning when building virtio_dma_buf:
   harmless, but the fix is trivial
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQFCBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmajTPEPHG1zdEByZWRo
 YXQuY29tAAoJECgfDbjSjVRposcH90Gjchks1cqznVEGiH/FRzclYeK2VZKZDR/A
 EVfoe+vnSgCO1vAx0fla6Db2IR/FM81jBFolfwpCKhA4W/vbczeUTop2/kAM+JQD
 uP+yjQ0xpEIqVBp+tpUVuekiGGxfJd1Ky3jCp6kYf7TPbQ50+BczCUd3YBnZ7YtY
 WHNod1wpxTDj66bvHPBxb3dnAcHsVUHWmB8kpgFQMsd6CNVk9mvA9WIj9UgIfbj1
 /PNyETZRyqEjwl8/A11LdTqcUeZKm5PJjB7FfWc/ZfJOqtRpP/9848ESVvlgPjMd
 I4YSN5E5rLjv1R1ErTKfgiq5R82JJv6WhzqxdbhhlKb+TJaTDQ==
 =wH2A
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio fixes from Michael Tsirkin:
 "The biggest thing here is the adminq change - but it looks like the
  only way to avoid headq blocking causing indefinite stalls.

  This fixes three issues:

   - Prevent admin commands on one VF blocking another.

     This prevents a bad VF from blocking a good one, as well as fixing
     a scalability issue with large # of VFs

   - Correctly return error on command failure on octeon. We used to
     treat failed commands as a success.

   - Fix modpost warning when building virtio_dma_buf. Harmless, but the
     fix is trivial"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio_pci_modern: remove admin queue serialization lock
  virtio_pci_modern: use completion instead of busy loop to wait on admin cmd result
  virtio_pci_modern: pass cmd as an identification token
  virtio_pci_modern: create admin queue of queried size
  virtio: create admin queues alongside other virtqueues
  virtio_pci: pass vq info as an argument to vp_setup_vq()
  virtio: push out code to vp_avq_index()
  virtio_pci_modern: treat vp_dev->admin_vq.info.vq pointer as static
  virtio_pci: introduce vector allocation fallback for slow path virtqueues
  virtio_pci: pass vector policy enum to vp_find_one_vq_msix()
  virtio_pci: pass vector policy enum to vp_find_vqs_msix()
  virtio_pci: simplify vp_request_msix_vectors() call a bit
  virtio_pci: push out single vq find code to vp_find_one_vq_msix()
  vdpa/octeon_ep: Fix error code in octep_process_mbox()
  virtio: add missing MODULE_DESCRIPTION() macro
2024-07-29 12:53:37 -07:00
Linus Torvalds c2a96b7f18 Driver core changes for 6.11-rc1
Here is the big set of driver core changes for 6.11-rc1.
 
 Lots of stuff in here, with not a huge diffstat, but apis are evolving
 which required lots of files to be touched.  Highlights of the changes
 in here are:
   - platform remove callback api final fixups (Uwe took many releases to
     get here, finally!)
   - Rust bindings for basic firmware apis and initial driver-core
     interactions.  It's not all that useful for a "write a whole driver
     in rust" type of thing, but the firmware bindings do help out the
     phy rust drivers, and the driver core bindings give a solid base on
     which others can start their work.  There is still a long way to go
     here before we have a multitude of rust drivers being added, but
     it's a great first step.
   - driver core const api changes.  This reached across all bus types,
     and there are some fix-ups for some not-common bus types that
     linux-next and 0-day testing shook out.  This work is being done to
     help make the rust bindings more safe, as well as the C code, moving
     toward the end-goal of allowing us to put driver structures into
     read-only memory.  We aren't there yet, but are getting closer.
   - minor devres cleanups and fixes found by code inspection
   - arch_topology minor changes
   - other minor driver core cleanups
 
 All of these have been in linux-next for a very long time with no
 reported problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZqH+aQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymoOQCfVBdLcBjEDAGh3L8qHRGMPy4rV2EAoL/r+zKm
 cJEYtJpGtWX6aAtugm9E
 =ZyJV
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the big set of driver core changes for 6.11-rc1.

  Lots of stuff in here, with not a huge diffstat, but apis are evolving
  which required lots of files to be touched. Highlights of the changes
  in here are:

   - platform remove callback api final fixups (Uwe took many releases
     to get here, finally!)

   - Rust bindings for basic firmware apis and initial driver-core
     interactions.

     It's not all that useful for a "write a whole driver in rust" type
     of thing, but the firmware bindings do help out the phy rust
     drivers, and the driver core bindings give a solid base on which
     others can start their work.

     There is still a long way to go here before we have a multitude of
     rust drivers being added, but it's a great first step.

   - driver core const api changes.

     This reached across all bus types, and there are some fix-ups for
     some not-common bus types that linux-next and 0-day testing shook
     out.

     This work is being done to help make the rust bindings more safe,
     as well as the C code, moving toward the end-goal of allowing us to
     put driver structures into read-only memory. We aren't there yet,
     but are getting closer.

   - minor devres cleanups and fixes found by code inspection

   - arch_topology minor changes

   - other minor driver core cleanups

  All of these have been in linux-next for a very long time with no
  reported problems"

* tag 'driver-core-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (55 commits)
  ARM: sa1100: make match function take a const pointer
  sysfs/cpu: Make crash_hotplug attribute world-readable
  dio: Have dio_bus_match() callback take a const *
  zorro: make match function take a const pointer
  driver core: module: make module_[add|remove]_driver take a const *
  driver core: make driver_find_device() take a const *
  driver core: make driver_[create|remove]_file take a const *
  firmware_loader: fix soundness issue in `request_internal`
  firmware_loader: annotate doctests as `no_run`
  devres: Correct code style for functions that return a pointer type
  devres: Initialize an uninitialized struct member
  devres: Fix memory leakage caused by driver API devm_free_percpu()
  devres: Fix devm_krealloc() wasting memory
  driver core: platform: Switch to use kmemdup_array()
  driver core: have match() callback in struct bus_type take a const *
  MAINTAINERS: add Rust device abstractions to DRIVER CORE
  device: rust: improve safety comments
  MAINTAINERS: add Danilo as FIRMWARE LOADER maintainer
  MAINTAINERS: add Rust FW abstractions to FIRMWARE LOADER
  firmware: rust: improve safety comments
  ...
2024-07-25 10:42:22 -07:00
Dan Carpenter e14f480df7 vdpa/octeon_ep: Fix error code in octep_process_mbox()
Return -EINVAL for invalid signatures.  Don't return success.

Fixes: 8b6c724cda ("virtio: vdpa: vDPA driver for Marvell OCTEON DPU devices")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Message-Id: <623e885b-1a05-479e-ab97-01bcf10bf5b8@stanley.mountain>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-17 05:43:20 -04:00
Dragos Tatulea 8e0751af1b vdpa/mlx5: Don't enable non-active VQs in .set_vq_ready()
VQ indices in the range [cur_num_qps, max_vqs) represent queues that
have not yet been activated. .set_vq_ready should not activate these
VQs.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-24-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:51 -04:00
Dragos Tatulea 2638134f71 vdpa/mlx5: Don't reset VQs more than necessary
The vdpa device can be reset many times in sequence without any
significant state changes in between. Previously this was not a problem:
VQs were torn down only on first reset. But after VQ pre-creation was
introduced, each reset will delete and re-create the hardware VQs and
their associated resources.

To solve this problem, avoid resetting hardware VQs if the VQs are still
in a blank state.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-23-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:50 -04:00
Dragos Tatulea 0fe963d6fc vdpa/mlx5: Re-create HW VQs under certain conditions
There are a few conditions under which the hardware VQs need a full
teardown and setup:

- VQ size changed to something else than default value. Hardware VQ size
  modification is not supported.

- User turns off certain device features: mergeable buffers, checksum
  virtio 1.0 compliance. In these cases, the TIR and RQT need to be
  re-created.

Add a needs_teardown configuration variable and set it when detecting
the above scenarios. On next DRIVER_OK, the resources will be torn down
first.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-22-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:50 -04:00
Dragos Tatulea ffb1aae43e vdpa/mlx5: Pre-create hardware VQs at vdpa .dev_add time
Currently, hardware VQs are created right when the vdpa device gets into
DRIVER_OK state. That is easier because most of the VQ state is known by
then.

This patch switches to creating all VQs and their associated resources
at device creation time. The motivation is to reduce the vdpa device
live migration downtime by moving the expensive operation of creating
all the hardware VQs and their associated resources out of downtime on
the destination VM.

The VQs are now created in a blank state. The VQ configuration will
happen later, on DRIVER_OK. Then the configuration will be applied when
the VQs are moved to the Ready state.

When .set_vq_ready() is called on a VQ before DRIVER_OK, special care is
needed: now that the VQ is already created a resume_vq() will be
triggered too early when no mr has been configured yet. Skip calling
resume_vq() in this case, let it be handled during DRIVER_OK.

For virtio-vdpa, the device configuration is done earlier during
.vdpa_dev_add() by vdpa_register_device(). Avoid calling
setup_vq_resources() a second time in that case.

On a 64 CPU, 256 GB VM with 1 vDPA device of 16 VQps, the full VQ
resource creation + resume time was ~370ms. Now it's down to 60 ms
(only VQ config and resume). The measurements were done on a ConnectX6DX
based vDPA device.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-21-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:49 -04:00
Dragos Tatulea 3b3adb3bbf vdpa/mlx5: Use suspend/resume during VQP change
Resume a VQ if it is already created when the number of VQ pairs
increases. This is done in preparation for VQ pre-creation which is
coming in a later patch. It is necessary because calling setup_vq() on
an already created VQ will return early and will not enable the queue.

For symmetry, suspend a VQ instead of tearing it down when the number of
VQ pairs decreases. But only if the resume operation is supported.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-20-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:49 -04:00
Dragos Tatulea ac85cd904d vdpa/mlx5: Forward error in suspend/resume device
Start using the suspend/resume_vq() error return codes previously added.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-19-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
2024-07-09 08:42:49 -04:00
Dragos Tatulea 843250271b vdpa/mlx5: Consolidate all VQ modify to Ready to use resume_vq()
There are a few more places modifying the VQ to Ready directly. Let's
consolidate them into resume_vq().

The redundant warnings for resume_vq() errors can also be dropped.

There is one special case that needs to be handled for virtio-vdpa:
the initialized flag must be set to true earlier in setup_vq() so that
resume_vq() doesn't return early.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-18-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:48 -04:00
Dragos Tatulea b89bb349f2 vdpa/mlx5: Add error code for suspend/resume VQ
Instead of blindly calling suspend/resume_vqs(), make then return error
codes.

To keep compatibility, keep suspending or resuming VQs on error and
return the last error code. The assumption here is that the error code
would be the same.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-17-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:48 -04:00
Dragos Tatulea fc9af25d04 vdpa/mlx5: Accept Init -> Ready VQ transition in resume_vq()
Until now resume_vq() was used only for the suspend/resume scenario.
This change also allows calling resume_vq() to bring it from Init to
Ready state (VQ initialization).

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-16-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
2024-07-09 08:42:48 -04:00
Dragos Tatulea e60e9eeb36 vdpa/mlx5: Allow creation of blank VQs
Based on the filled flag, create VQs that are filled or blank.
Blank VQs will be filled in later through VQ modify.

Downstream patches will make use of this to pre-create blank VQs at
vdpa device creation.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-15-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
2024-07-09 08:42:47 -04:00
Dragos Tatulea ebebaf45e8 vdpa/mlx5: Set mkey modified flags on all VQs
Otherwise, when virtqueues are moved from INIT to READY the latest mkey
will not be set appropriately.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-14-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:47 -04:00
Dragos Tatulea 1e8dac7bb6 vdpa/mlx5: Start off rqt_size with max VQPs
Currently rqt_size is initialized during device flag configuration.
That's because it is the earliest moment when device knows if MQ
(multi queue) is on or off.

Shift this configuration earlier to device creation time. This implies
that non-MQ devices will have a larger RQT size. But the configuration
will still be correct.

This is done in preparation for the pre-creation of hardware virtqueues
at device add time. When that change will be added, RQT will be created
at device creation time so it needs to be initialized to its max size.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-13-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:46 -04:00
Dragos Tatulea ad9758fdaf vdpa/mlx5: Set an initial size on the VQ
The virtqueue size is a pre-requisite for setting up any virtqueue
resources. For the upcoming optimization of creating virtqueues at
device add, the virtqueue size has to be configured.

The queue size check in setup_vq() will always be false. So remove it.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-12-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:46 -04:00
Dragos Tatulea cdc3c7eaae vdpa/mlx5: Add support for modifying the VQ features field
This is done in preparation for the pre-creation of hardware virtqueues
at device add time.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-11-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:46 -04:00
Dragos Tatulea f70080c5bc vdpa/mlx5: Add support for modifying the virtio_version VQ field
This is done in preparation for the pre-creation of hardware virtqueues
at device add time.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-10-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:45 -04:00
Dragos Tatulea 4a19f2942a vdpa/mlx5: Rename init_mvqs
Function is used to set default values, so name it accordingly.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-9-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
2024-07-09 08:42:45 -04:00
Dragos Tatulea e5bcbd1de6 vdpa/mlx5: Clear and reinitialize software VQ data on reset
The hardware VQ configuration is mirrored by data in struct
mlx5_vdpa_virtqueue . Instead of clearing just a few fields at reset,
fully clear the struct and initialize with the appropriate default
values.

As clear_vqs_ready() is used only during reset, get rid of it.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-8-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:45 -04:00
Dragos Tatulea 1835ed4a5d vdpa/mlx5: Initialize and reset device with one queue pair
The virtio spec says that a vdpa device should start off with one queue
pair. The driver is already compliant.

This patch moves the initialization to device add and reset times. This
is done in preparation for the pre-creation of hardware virtqueues at
device add time.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-7-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
2024-07-09 08:42:44 -04:00
Dragos Tatulea a366465b48 vdpa/mlx5: Remove duplicate suspend code
Use the dedicated suspend_vqs() function instead.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-6-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:44 -04:00
Dragos Tatulea 34bd86c720 vdpa/mlx5: Iterate over active VQs during suspend/resume
No need to iterate over max number of VQs.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-5-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:44 -04:00
Dragos Tatulea ad80739262 vdpa/mlx5: Drop redundant check in teardown_virtqueues()
The check is done inside teardown_vq().

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-4-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:43 -04:00
Dragos Tatulea 4c90a60ac2 vdpa/mlx5: Drop redundant code
Originally, the second loop initialized the CVQ. But (acde392949
("vdpa/mlx5: Use consistent RQT size") initialized all the queues in the
first loop, so the second iteration in init_mvqs() is never called
because the first one will iterate up to max_vqs.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-3-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:43 -04:00
Dragos Tatulea 63f0cbad97 vdpa/mlx5: Make setup/teardown_vq_resources() symmetrical
... by changing the setup_vq_resources() parameter type.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-2-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:43 -04:00
Dragos Tatulea 1f5d6476f1 vdpa/mlx5: Clarify meaning thorough function rename
setup_driver()/teardown_driver() are a bit vague. These functions are
used for virtqueue resources.

Same for alloc_resources()/teardown_resources(): they represent fixed
resources that are meant to exist during the device lifetime.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-1-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:42 -04:00
Jeff Johnson e7909ad6cb vDPA: add missing MODULE_DESCRIPTION() macros
With ARCH=x86, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/vdpa/vdpa.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/vdpa/ifcvf/ifcvf.o

Add the missing invocations of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Message-Id: <20240611-md-drivers-vdpa-v1-1-efaf2de15152@quicinc.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:41 -04:00
Srujana Challa 8b6c724cda virtio: vdpa: vDPA driver for Marvell OCTEON DPU devices
This commit introduces a new vDPA driver specifically designed for
managing the virtio control plane over the vDPA bus for OCTEON DPU
devices. The driver consists of two layers:

1. Octep HW Layer (Octeon Endpoint): Responsible for handling hardware
operations and configurations related to the DPU device.

2. Octep Main Layer: Compliant with the vDPA bus framework, this layer
implements device operations for the vDPA bus. It handles device
probing, bus attachment, vring operations, and other relevant tasks.

Signed-off-by: Srujana Challa <schalla@marvell.com>
Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
Signed-off-by: Shijith Thotton <sthotton@marvell.com>
Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20240614144659.1776067-1-schalla@marvell.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:39 -04:00
Greg Kroah-Hartman d69d804845 driver core: have match() callback in struct bus_type take a const *
In the match() callback, the struct device_driver * should not be
changed, so change the function callback to be a const *.  This is one
step of many towards making the driver core safe to have struct
device_driver in read-only memory.

Because the match() callback is in all busses, all busses are modified
to handle this properly.  This does entail switching some container_of()
calls to container_of_const() to properly handle the constant *.

For some busses, like PCI and USB and HV, the const * is cast away in
the match callback as those busses do want to modify those structures at
this point in time (they have a local lock in the driver structure.)
That will have to be changed in the future if they wish to have their
struct device * in read-only-memory.

Cc: Rafael J. Wysocki <rafael@kernel.org>
Reviewed-by: Alex Elder <elder@kernel.org>
Acked-by: Sumit Garg <sumit.garg@linaro.org>
Link: https://lore.kernel.org/r/2024070136-wrongdoer-busily-01e8@gregkh
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-03 15:16:54 +02:00
Michael S. Tsirkin c0a7233f58 Merge tag 'stable/vduse-virtio-net' into vhost
This adds support for virtio-net to vduse.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-05-22 08:32:48 -04:00
Yuxue Liu 4d685629b7 vp_vdpa: don't allocate unused msix vectors
When there is a ctlq and it doesn't require interrupt
callbacks,the original method of calculating vectors
wastes hardware msi or msix resources as well as system
IRQ resources.

When conducting performance testing using testpmd in the
guest os, it was found that the performance was lower compared
to directly using vfio-pci to passthrough the device

In scenarios where the virtio device in the guest os does
not utilize interrupts, the vdpa driver still configures
the hardware's msix vector. Therefore, the hardware still
sends interrupts to the host os. Because of this unnecessary
action by the hardware, hardware performance decreases, and
it also affects the performance of the host os.

Before modification:(interrupt mode)
 32:  0   0  0  0 PCI-MSI 32768-edge    vp-vdpa[0000:00:02.0]-0
 33:  0   0  0  0 PCI-MSI 32769-edge    vp-vdpa[0000:00:02.0]-1
 34:  0   0  0  0 PCI-MSI 32770-edge    vp-vdpa[0000:00:02.0]-2
 35:  0   0  0  0 PCI-MSI 32771-edge    vp-vdpa[0000:00:02.0]-config

After modification:(interrupt mode)
 32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[0000:00:02.0]-0
 33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[0000:00:02.0]-1
 34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[0000:00:02.0]-config

Before modification:(virtio pmd mode for guest os)
 32:  0   0  0  0 PCI-MSI 32768-edge    vp-vdpa[0000:00:02.0]-0
 33:  0   0  0  0 PCI-MSI 32769-edge    vp-vdpa[0000:00:02.0]-1
 34:  0   0  0  0 PCI-MSI 32770-edge    vp-vdpa[0000:00:02.0]-2
 35:  0   0  0  0 PCI-MSI 32771-edge    vp-vdpa[0000:00:02.0]-config

After modification:(virtio pmd mode for guest os)
 32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[0000:00:02.0]-config

To verify the use of the virtio PMD mode in the guest operating
system, the following patch needs to be applied to QEMU:
https://lore.kernel.org/all/20240408073311.2049-1-yuxue.liu@jaguarmicro.com

Signed-off-by: Yuxue Liu <yuxue.liu@jaguarmicro.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Heng Qi <hengqi@linux.alibaba.com>
Message-Id: <20240410033020.1310-1-yuxue.liu@jaguarmicro.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-05-22 08:31:18 -04:00
Li Zhijian 7b1b5c7fff vdpa: Convert sprintf/snprintf to sysfs_emit
Per filesystems/sysfs.rst, show() should only use sysfs_emit()
or sysfs_emit_at() when formatting the value to be returned to user space.

coccinelle complains that there are still a couple of functions that use
snprintf(). Convert them to sysfs_emit().

sprintf() will be converted as weel if they have.

Generally, this patch is generated by
make coccicheck M=<path/to/file> MODE=patch \
COCCI=scripts/coccinelle/api/device_attr_show.cocci

No functional change intended

CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: Jason Wang <jasowang@redhat.com>
CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
CC: virtualization@lists.linux.dev
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Message-Id: <20240314095853.1326111-1-lizhijian@fujitsu.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-05-22 08:31:15 -04:00
Yuxue Liu f181a3736b vp_vdpa: Fix return value check vp_vdpa_request_irq
In the vp_vdpa_set_status function, when setting the device status to
VIRTIO_CONFIG_S_DRIVER_OK, the vp_vdpa_request_irq function may fail.
In such cases, the device status should not be set to DRIVER_OK. Add
exception printing to remind the user.

Signed-off-by: Yuxue Liu <yuxue.liu@jaguarmicro.com>
Message-Id: <20240325105448.235-1-gavin.liu@jaguarmicro.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-05-22 08:31:15 -04:00
Zhu Lingshan 98a821546b vDPA: code clean for vhost_vdpa uapi
This commit cleans up the uapi for vhost_vdpa by
better naming some of the enums which report blk
information to user space, and they are not
in any official releases yet.

Fixes: 1ac61ddfee ("vDPA: report virtio-blk flush info to user space")
Fixes: ae1374b7f7 ("vDPA: report virtio-block read-only info to user space")
Fixes: 330b8aea69 ("vDPA: report virtio-block max segment size to user space")
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240415111047.1047774-1-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-04-22 17:07:13 -04:00
Maxime Coquelin 894452180d vduse: enable Virtio-net device type
This patch adds Virtio-net device type to the supported
devices types.

Initialization fails if the device does not support
VIRTIO_F_VERSION_1 feature, in order to guarantee the
configuration space is read-only. It also fails with
-EPERM if the CAP_NET_ADMIN is missing.

Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Message-Id: <20240109111025.1320976-4-maxime.coquelin@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
2024-03-19 08:19:15 -04:00
Maxime Coquelin 56e71885b0 vduse: Temporarily fail if control queue feature requested
Virtio-net driver control queue implementation is not safe
when used with VDUSE. If the VDUSE application does not
reply to control queue messages, it currently ends up
hanging the kernel thread sending this command.

Some work is on-going to make the control queue
implementation robust with VDUSE. Until it is completed,
let's fail features check if control-queue feature is
requested.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Message-Id: <20240109111025.1320976-3-maxime.coquelin@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
2024-03-19 08:19:14 -04:00
Maxime Coquelin a115b5716f vduse: validate block features only with block devices
This patch is preliminary work to enable network device
type support to VDUSE.

As VIRTIO_BLK_F_CONFIG_WCE shares the same value as
VIRTIO_NET_F_HOST_TSO4, we need to restrict its check
to Virtio-blk device type.

Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Message-Id: <20240109111025.1320976-2-maxime.coquelin@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-19 08:18:35 -04:00
Zhu Lingshan 1ac61ddfee vDPA: report virtio-blk flush info to user space
This commit reports whether a virtio-blk device
support cache flush command to user space

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-11-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-19 02:45:51 -04:00
Zhu Lingshan ae1374b7f7 vDPA: report virtio-block read-only info to user space
This commit report read-only information of
virtio-blk devices to user space.

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-10-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-19 02:45:51 -04:00
Zhu Lingshan 6bdc7846e6 vDPA: report virtio-block write zeroes configuration to user space
This commits reports write zeroes configuration of
virtio-block devices to user space, includes:
1)maximum write zeroes sectors size
2)maximum write zeroes segment number

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-9-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-19 02:45:51 -04:00
Zhu Lingshan 65848f46e1 vDPA: report virtio-block discarding configuration to user space
This commit reports virtio-blk discarding configuration
to user space,includes:
1) the maximum discard sectors
2) maximum number of discard segments for the block driver to use
3) the alignment for splitting a discarding request

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-8-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-19 02:45:51 -04:00
Zhu Lingshan c9d989b4ab vDPA: report virtio-block topology info to user space
This commit allows vDPA reporting topology information of
virtio-blk devices to user space, includes:
1) the number of logical blocks per physical block
2) offset of first aligned logical block
3) suggested minimum I/O size in blocks
4) optimal (suggested maximum) I/O size in blocks

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-7-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-19 02:45:51 -04:00
Zhu Lingshan 54fb04b02e vDPA: report virtio-block MQ info to user space
This commits allows vDPA reporting virtio-block multi-queue
configuration to user sapce.

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-6-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-19 02:45:51 -04:00
Zhu Lingshan 81f64e1d91 vDPA: report virtio-block max segments in a request to user space
This commit allows vDPA reporting the maximum number of
segments in a request of virtio-block devices to
user space.

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-5-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-19 02:45:51 -04:00
Zhu Lingshan 3a1d33fb7e vDPA: report virtio-block block-size to user space
This commit allows reporting the block size of a
virtio-block device to user space.

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-4-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-19 02:45:51 -04:00