Commit graph

866 commits

Author SHA1 Message Date
Linus Torvalds 2ef32ad224 virtio: features, fixes, cleanups
Several new features here:
 
 - virtio-net is finally supported in vduse.
 
 - Virtio (balloon and mem) interaction with suspend is improved
 
 - vhost-scsi now handles signals better/faster.
 
 Fixes, cleanups all over the place.
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmZN570PHG1zdEByZWRo
 YXQuY29tAAoJECgfDbjSjVRp2JUH/1K3fZOHymop6Y5Z3USFS7YdlF+dniedY/vg
 TKyWERkXOlxq1d9DVxC0mN7tk72DweuWI0YJjLXofrEW1VuW29ecSbyFXxpeWJls
 b7ErffxDAFRas5jkMCngD8TuFnbEegU0mGP5kbiHpEndBydQ2hH99Gg0x7swW+cE
 xsvU5zonCCLwLGIP2DrVrn9qGOHtV6o8eZfVKDVXfvicn3lFBkUSxlwEYsO9RMup
 aKxV4FT2Pb1yBicwBK4TH1oeEXqEGy1YLEn+kAHRbgoC/5L0/LaiqrkzwzwwOIPj
 uPGkacf8CIbX0qZo5EzD8kvfcYL1xhU3eT9WBmpp2ZwD+4bINd4=
 =nax1
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:
 "Several new features here:

   - virtio-net is finally supported in vduse

   - virtio (balloon and mem) interaction with suspend is improved

   - vhost-scsi now handles signals better/faster

  And fixes, cleanups all over the place"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits)
  virtio-pci: Check if is_avq is NULL
  virtio: delete vq in vp_find_vqs_msix() when request_irq() fails
  MAINTAINERS: add Eugenio Pérez as reviewer
  vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API
  vp_vdpa: don't allocate unused msix vectors
  sound: virtio: drop owner assignment
  fuse: virtio: drop owner assignment
  scsi: virtio: drop owner assignment
  rpmsg: virtio: drop owner assignment
  nvdimm: virtio_pmem: drop owner assignment
  wifi: mac80211_hwsim: drop owner assignment
  vsock/virtio: drop owner assignment
  net: 9p: virtio: drop owner assignment
  net: virtio: drop owner assignment
  net: caif: virtio: drop owner assignment
  misc: nsm: drop owner assignment
  iommu: virtio: drop owner assignment
  drm/virtio: drop owner assignment
  gpio: virtio: drop owner assignment
  firmware: arm_scmi: virtio: drop owner assignment
  ...
2024-05-23 12:04:36 -07:00
Li Zhang c8fae27d14 virtio-pci: Check if is_avq is NULL
[bug]
In the virtio_pci_common.c function vp_del_vqs, vp_dev->is_avq is involved
to determine whether it is admin virtqueue, but this function vp_dev->is_avq
 may be empty. For installations, virtio_pci_legacy does not assign a value
 to vp_dev->is_avq.

[fix]
Check whether it is vp_dev->is_avq before use.

[test]
Test with virsh Attach device
Before this patch, the following command would crash the guest system

After applying the patch, everything seems to be working fine.

Signed-off-by: Li Zhang <zhanglikernel@gmail.com>
Message-Id: <1710566754-3532-1-git-send-email-zhanglikernel@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-05-22 08:39:41 -04:00
Jiri Pirko 89875151fc virtio: delete vq in vp_find_vqs_msix() when request_irq() fails
When request_irq() fails, error path calls vp_del_vqs(). There, as vq is
present in the list, free_irq() is called for the same vector. That
causes following splat:

[    0.414355] Trying to free already-free IRQ 27
[    0.414403] WARNING: CPU: 1 PID: 1 at kernel/irq/manage.c:1899 free_irq+0x1a1/0x2d0
[    0.414510] Modules linked in:
[    0.414540] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.9.0-rc4+ #27
[    0.414540] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
[    0.414540] RIP: 0010:free_irq+0x1a1/0x2d0
[    0.414540] Code: 1e 00 48 83 c4 08 48 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 90 8b 74 24 04 48 c7 c7 98 80 6c b1 e8 00 c9 f7 ff 90 <0f> 0b 90 90 48 89 ee 4c 89 ef e8 e0 20 b8 00 49 8b 47 40 48 8b 40
[    0.414540] RSP: 0000:ffffb71480013ae0 EFLAGS: 00010086
[    0.414540] RAX: 0000000000000000 RBX: ffffa099c2722000 RCX: 0000000000000000
[    0.414540] RDX: 0000000000000000 RSI: ffffb71480013998 RDI: 0000000000000001
[    0.414540] RBP: 0000000000000246 R08: 00000000ffffdfff R09: 0000000000000001
[    0.414540] R10: 00000000ffffdfff R11: ffffffffb18729c0 R12: ffffa099c1c91760
[    0.414540] R13: ffffa099c1c916a4 R14: ffffa099c1d2f200 R15: ffffa099c1c91600
[    0.414540] FS:  0000000000000000(0000) GS:ffffa099fec40000(0000) knlGS:0000000000000000
[    0.414540] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.414540] CR2: 0000000000000000 CR3: 0000000008e3e001 CR4: 0000000000370ef0
[    0.414540] Call Trace:
[    0.414540]  <TASK>
[    0.414540]  ? __warn+0x80/0x120
[    0.414540]  ? free_irq+0x1a1/0x2d0
[    0.414540]  ? report_bug+0x164/0x190
[    0.414540]  ? handle_bug+0x3b/0x70
[    0.414540]  ? exc_invalid_op+0x17/0x70
[    0.414540]  ? asm_exc_invalid_op+0x1a/0x20
[    0.414540]  ? free_irq+0x1a1/0x2d0
[    0.414540]  vp_del_vqs+0xc1/0x220
[    0.414540]  vp_find_vqs_msix+0x305/0x470
[    0.414540]  vp_find_vqs+0x3e/0x1a0
[    0.414540]  vp_modern_find_vqs+0x1b/0x70
[    0.414540]  init_vqs+0x387/0x600
[    0.414540]  virtnet_probe+0x50a/0xc80
[    0.414540]  virtio_dev_probe+0x1e0/0x2b0
[    0.414540]  really_probe+0xc0/0x2c0
[    0.414540]  ? __pfx___driver_attach+0x10/0x10
[    0.414540]  __driver_probe_device+0x73/0x120
[    0.414540]  driver_probe_device+0x1f/0xe0
[    0.414540]  __driver_attach+0x88/0x180
[    0.414540]  bus_for_each_dev+0x85/0xd0
[    0.414540]  bus_add_driver+0xec/0x1f0
[    0.414540]  driver_register+0x59/0x100
[    0.414540]  ? __pfx_virtio_net_driver_init+0x10/0x10
[    0.414540]  virtio_net_driver_init+0x90/0xb0
[    0.414540]  do_one_initcall+0x58/0x230
[    0.414540]  kernel_init_freeable+0x1a3/0x2d0
[    0.414540]  ? __pfx_kernel_init+0x10/0x10
[    0.414540]  kernel_init+0x1a/0x1c0
[    0.414540]  ret_from_fork+0x31/0x50
[    0.414540]  ? __pfx_kernel_init+0x10/0x10
[    0.414540]  ret_from_fork_asm+0x1a/0x30
[    0.414540]  </TASK>

Fix this by calling deleting the current vq when request_irq() fails.

Fixes: 0b0f9dc52e ("Revert "virtio_pci: use shared interrupts for virtqueues"")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Message-Id: <20240426150845.3999481-1-jiri@resnulli.us>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-05-22 08:31:38 -04:00
Krzysztof Kozlowski b0e55a950e virtio: mem: drop owner assignment
virtio core already sets the .owner, so driver does not need to.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>

Message-Id: <20240331-module-owner-virtio-v2-4-98f04bfaf46a@linaro.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-05-22 08:31:16 -04:00
Krzysztof Kozlowski ebfb92a967 virtio: input: drop owner assignment
virtio core already sets the .owner, so driver does not need to.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>

Message-Id: <20240331-module-owner-virtio-v2-3-98f04bfaf46a@linaro.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-05-22 08:31:16 -04:00
Krzysztof Kozlowski 7cc5d6caaf virtio: balloon: drop owner assignment
virtio core already sets the .owner, so driver does not need to.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>

Message-Id: <20240331-module-owner-virtio-v2-2-98f04bfaf46a@linaro.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-05-22 08:31:16 -04:00
David Stevens c578123e63 virtio_balloon: Treat stats requests as wakeup events
Treat stats requests as wakeup events to ensure that the driver responds
to device requests in a timely manner.

Signed-off-by: David Stevens <stevensd@chromium.org>
Acked-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240321012445.1593685-3-stevensd@google.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-05-22 08:31:16 -04:00
David Stevens 810d831bbb virtio_balloon: Give the balloon its own wakeup source
Wakeup sources don't support nesting multiple events, so sharing a
single object between multiple drivers can result in one driver
overriding the wakeup event processing period specified by another
driver. Have the virtio balloon driver use the wakeup source of the
device it is bound to rather than the wakeup source of the parent
device, to avoid conflicts with the transport layer.

Note that although the virtio balloon's virtio_device itself isn't what
actually wakes up the device, it is responsible for processing wakeup
events. In the same way that EPOLLWAKEUP uses a dedicated wakeup_source
to prevent suspend when userspace is processing wakeup events, a
dedicated wakeup_source is necessary when processing wakeup events in a
higher layer in the kernel.

Fixes: b12fbc3f78 ("virtio_balloon: stay awake while adjusting balloon")
Signed-off-by: David Stevens <stevensd@chromium.org>
Acked-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240321012445.1593685-2-stevensd@google.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-05-22 08:31:16 -04:00
David Hildenbrand e4544c550e virtio-mem: support suspend+resume
With virtio-mem, primarily hibernation is problematic: as the machine shuts
down, the virtio-mem device loses its state. Powering the machine back up
is like losing a bunch of DIMMs. While there would be ways to add limited
support, suspend+resume is more commonly used for VMs and "easier" to
support cleanly.

s2idle can be supported without any device dependencies. Similarly, one
would expect suspend-to-ram (i.e., S3) to work out of the box. However,
QEMU currently unplugs all device memory when resuming the VM, using a
cold reset on the "wakeup" path. In order to support S3, we need a feature
flag for the device to tell us if memory remains plugged when waking up. In
the future, QEMU will implement this feature.

So let's always support s2idle and support S3 with plugged memory only if
the device indicates support. Block hibernation early using the PM
notifier.

Trying to hibernate now fails early:
	# echo disk > /sys/power/state
	[   26.455369] PM: hibernation: hibernation entry
	[   26.458271] virtio_mem virtio0: hibernation is not supported.
	[   26.462498] PM: hibernation: hibernation exit
	-bash: echo: write error: Operation not permitted

s2idle works even without the new feature bit:
	# echo s2idle > /sys/power/mem_sleep
	# echo mem > /sys/power/state
	[   52.083725] PM: suspend entry (s2idle)
	[   52.095950] Filesystems sync: 0.010 seconds
	[   52.101493] Freezing user space processes
	[   52.104213] Freezing user space processes completed (elapsed 0.001 seconds)
	[   52.106520] OOM killer disabled.
	[   52.107655] Freezing remaining freezable tasks
	[   52.110880] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
	[   52.113296] printk: Suspending console(s) (use no_console_suspend to debug)

S3 does not work without the feature bit when memory is plugged:
	# echo deep > /sys/power/mem_sleep
	# echo mem > /sys/power/state
	[   32.788281] PM: suspend entry (deep)
	[   32.816630] Filesystems sync: 0.027 seconds
	[   32.820029] Freezing user space processes
	[   32.823870] Freezing user space processes completed (elapsed 0.001 seconds)
	[   32.827756] OOM killer disabled.
	[   32.829608] Freezing remaining freezable tasks
	[   32.833842] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
	[   32.837953] printk: Suspending console(s) (use no_console_suspend to debug)
	[   32.916172] virtio_mem virtio0: suspend+resume with plugged memory is not supported
	[   32.916181] virtio-pci 0000:00:02.0: PM: pci_pm_suspend(): virtio_pci_freeze+0x0/0x50 returns -1
	[   32.916197] virtio-pci 0000:00:02.0: PM: dpm_run_callback(): pci_pm_suspend+0x0/0x170 returns -1
	[   32.916210] virtio-pci 0000:00:02.0: PM: failed to suspend async: error -1

But S3 works with the new feature bit when memory is plugged (patched
QEMU):
	# echo deep > /sys/power/mem_sleep
	# echo mem > /sys/power/state
	[   33.983694] PM: suspend entry (deep)
	[   34.009828] Filesystems sync: 0.024 seconds
	[   34.013589] Freezing user space processes
	[   34.016722] Freezing user space processes completed (elapsed 0.001 seconds)
	[   34.019092] OOM killer disabled.
	[   34.020291] Freezing remaining freezable tasks
	[   34.023549] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
	[   34.026090] printk: Suspending console(s) (use no_console_suspend to debug)

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240318120645.105664-1-david@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-05-22 08:31:15 -04:00
Linus Torvalds 61307b7be4 The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs.  Notable
 series include:
 
 - Lucas Stach has provided some page-mapping
   cleanup/consolidation/maintainability work in the series "mm/treewide:
   Remove pXd_huge() API".
 
 - In the series "Allow migrate on protnone reference with
   MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
   MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one
   test.
 
 - In their series "Memory allocation profiling" Kent Overstreet and
   Suren Baghdasaryan have contributed a means of determining (via
   /proc/allocinfo) whereabouts in the kernel memory is being allocated:
   number of calls and amount of memory.
 
 - Matthew Wilcox has provided the series "Various significant MM
   patches" which does a number of rather unrelated things, but in largely
   similar code sites.
 
 - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes
   Weiner has fixed the page allocator's handling of migratetype requests,
   with resulting improvements in compaction efficiency.
 
 - In the series "make the hugetlb migration strategy consistent" Baolin
   Wang has fixed a hugetlb migration issue, which should improve hugetlb
   allocation reliability.
 
 - Liu Shixin has hit an I/O meltdown caused by readahead in a
   memory-tight memcg.  Addressed in the series "Fix I/O high when memory
   almost met memcg limit".
 
 - In the series "mm/filemap: optimize folio adding and splitting" Kairui
   Song has optimized pagecache insertion, yielding ~10% performance
   improvement in one test.
 
 - Baoquan He has cleaned up and consolidated the early zone
   initialization code in the series "mm/mm_init.c: refactor
   free_area_init_core()".
 
 - Baoquan has also redone some MM initializatio code in the series
   "mm/init: minor clean up and improvement".
 
 - MM helper cleanups from Christoph Hellwig in his series "remove
   follow_pfn".
 
 - More cleanups from Matthew Wilcox in the series "Various page->flags
   cleanups".
 
 - Vlastimil Babka has contributed maintainability improvements in the
   series "memcg_kmem hooks refactoring".
 
 - More folio conversions and cleanups in Matthew Wilcox's series
 
 	"Convert huge_zero_page to huge_zero_folio"
 	"khugepaged folio conversions"
 	"Remove page_idle and page_young wrappers"
 	"Use folio APIs in procfs"
 	"Clean up __folio_put()"
 	"Some cleanups for memory-failure"
 	"Remove page_mapping()"
 	"More folio compat code removal"
 
 - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb
   functions to work on folis".
 
 - Code consolidation and cleanup work related to GUP's handling of
   hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".
 
 - Rick Edgecombe has developed some fixes to stack guard gaps in the
   series "Cover a guard gap corner case".
 
 - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series
   "mm/ksm: fix ksm exec support for prctl".
 
 - Baolin Wang has implemented NUMA balancing for multi-size THPs.  This
   is a simple first-cut implementation for now.  The series is "support
   multi-size THP numa balancing".
 
 - Cleanups to vma handling helper functions from Matthew Wilcox in the
   series "Unify vma_address and vma_pgoff_address".
 
 - Some selftests maintenance work from Dev Jain in the series
   "selftests/mm: mremap_test: Optimizations and style fixes".
 
 - Improvements to the swapping of multi-size THPs from Ryan Roberts in
   the series "Swap-out mTHP without splitting".
 
 - Kefeng Wang has significantly optimized the handling of arm64's
   permission page faults in the series
 
 	"arch/mm/fault: accelerate pagefault when badaccess"
 	"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"
 
 - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it
   GUP-fast".
 
 - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to
   use struct vm_fault".
 
 - selftests build fixes from John Hubbard in the series "Fix
   selftests/mm build without requiring "make headers"".
 
 - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
   series "Improved Memory Tier Creation for CPUless NUMA Nodes".  Fixes
   the initialization code so that migration between different memory types
   works as intended.
 
 - David Hildenbrand has improved follow_pte() and fixed an errant driver
   in the series "mm: follow_pte() improvements and acrn follow_pte()
   fixes".
 
 - David also did some cleanup work on large folio mapcounts in his
   series "mm: mapcount for large folios + page_mapcount() cleanups".
 
 - Folio conversions in KSM in Alex Shi's series "transfer page to folio
   in KSM".
 
 - Barry Song has added some sysfs stats for monitoring multi-size THP's
   in the series "mm: add per-order mTHP alloc and swpout counters".
 
 - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled
   and limit checking cleanups".
 
 - Matthew Wilcox has been looking at buffer_head code and found the
   documentation to be lacking.  The series is "Improve buffer head
   documentation".
 
 - Multi-size THPs get more work, this time from Lance Yang.  His series
   "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes
   the freeing of these things.
 
 - Kemeng Shi has added more userspace-visible writeback instrumentation
   in the series "Improve visibility of writeback".
 
 - Kemeng Shi then sent some maintenance work on top in the series "Fix
   and cleanups to page-writeback".
 
 - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the
   series "Improve anon_vma scalability for anon VMAs".  Intel's test bot
   reported an improbable 3x improvement in one test.
 
 - SeongJae Park adds some DAMON feature work in the series
 
 	"mm/damon: add a DAMOS filter type for page granularity access recheck"
 	"selftests/damon: add DAMOS quota goal test"
 
 - Also some maintenance work in the series
 
 	"mm/damon/paddr: simplify page level access re-check for pageout"
 	"mm/damon: misc fixes and improvements"
 
 - David Hildenbrand has disabled some known-to-fail selftests ni the
   series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL".
 
 - memcg metadata storage optimizations from Shakeel Butt in "memcg:
   reduce memory consumption by memcg stats".
 
 - DAX fixes and maintenance work from Vishal Verma in the series
   "dax/bus.c: Fixups for dax-bus locking".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZkgQYwAKCRDdBJ7gKXxA
 jrdKAP9WVJdpEcXxpoub/vVE0UWGtffr8foifi9bCwrQrGh5mgEAx7Yf0+d/oBZB
 nvA4E0DcPrUAFy144FNM0NTCb7u9vAw=
 =V3R/
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm updates from Andrew Morton:
 "The usual shower of singleton fixes and minor series all over MM,
  documented (hopefully adequately) in the respective changelogs.
  Notable series include:

   - Lucas Stach has provided some page-mapping cleanup/consolidation/
     maintainability work in the series "mm/treewide: Remove pXd_huge()
     API".

   - In the series "Allow migrate on protnone reference with
     MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
     MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
     one test.

   - In their series "Memory allocation profiling" Kent Overstreet and
     Suren Baghdasaryan have contributed a means of determining (via
     /proc/allocinfo) whereabouts in the kernel memory is being
     allocated: number of calls and amount of memory.

   - Matthew Wilcox has provided the series "Various significant MM
     patches" which does a number of rather unrelated things, but in
     largely similar code sites.

   - In his series "mm: page_alloc: freelist migratetype hygiene"
     Johannes Weiner has fixed the page allocator's handling of
     migratetype requests, with resulting improvements in compaction
     efficiency.

   - In the series "make the hugetlb migration strategy consistent"
     Baolin Wang has fixed a hugetlb migration issue, which should
     improve hugetlb allocation reliability.

   - Liu Shixin has hit an I/O meltdown caused by readahead in a
     memory-tight memcg. Addressed in the series "Fix I/O high when
     memory almost met memcg limit".

   - In the series "mm/filemap: optimize folio adding and splitting"
     Kairui Song has optimized pagecache insertion, yielding ~10%
     performance improvement in one test.

   - Baoquan He has cleaned up and consolidated the early zone
     initialization code in the series "mm/mm_init.c: refactor
     free_area_init_core()".

   - Baoquan has also redone some MM initializatio code in the series
     "mm/init: minor clean up and improvement".

   - MM helper cleanups from Christoph Hellwig in his series "remove
     follow_pfn".

   - More cleanups from Matthew Wilcox in the series "Various
     page->flags cleanups".

   - Vlastimil Babka has contributed maintainability improvements in the
     series "memcg_kmem hooks refactoring".

   - More folio conversions and cleanups in Matthew Wilcox's series:
	"Convert huge_zero_page to huge_zero_folio"
	"khugepaged folio conversions"
	"Remove page_idle and page_young wrappers"
	"Use folio APIs in procfs"
	"Clean up __folio_put()"
	"Some cleanups for memory-failure"
	"Remove page_mapping()"
	"More folio compat code removal"

   - David Hildenbrand chipped in with "fs/proc/task_mmu: convert
     hugetlb functions to work on folis".

   - Code consolidation and cleanup work related to GUP's handling of
     hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".

   - Rick Edgecombe has developed some fixes to stack guard gaps in the
     series "Cover a guard gap corner case".

   - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
     series "mm/ksm: fix ksm exec support for prctl".

   - Baolin Wang has implemented NUMA balancing for multi-size THPs.
     This is a simple first-cut implementation for now. The series is
     "support multi-size THP numa balancing".

   - Cleanups to vma handling helper functions from Matthew Wilcox in
     the series "Unify vma_address and vma_pgoff_address".

   - Some selftests maintenance work from Dev Jain in the series
     "selftests/mm: mremap_test: Optimizations and style fixes".

   - Improvements to the swapping of multi-size THPs from Ryan Roberts
     in the series "Swap-out mTHP without splitting".

   - Kefeng Wang has significantly optimized the handling of arm64's
     permission page faults in the series
	"arch/mm/fault: accelerate pagefault when badaccess"
	"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"

   - GUP cleanups from David Hildenbrand in "mm/gup: consistently call
     it GUP-fast".

   - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
     path to use struct vm_fault".

   - selftests build fixes from John Hubbard in the series "Fix
     selftests/mm build without requiring "make headers"".

   - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
     series "Improved Memory Tier Creation for CPUless NUMA Nodes".
     Fixes the initialization code so that migration between different
     memory types works as intended.

   - David Hildenbrand has improved follow_pte() and fixed an errant
     driver in the series "mm: follow_pte() improvements and acrn
     follow_pte() fixes".

   - David also did some cleanup work on large folio mapcounts in his
     series "mm: mapcount for large folios + page_mapcount() cleanups".

   - Folio conversions in KSM in Alex Shi's series "transfer page to
     folio in KSM".

   - Barry Song has added some sysfs stats for monitoring multi-size
     THP's in the series "mm: add per-order mTHP alloc and swpout
     counters".

   - Some zswap cleanups from Yosry Ahmed in the series "zswap
     same-filled and limit checking cleanups".

   - Matthew Wilcox has been looking at buffer_head code and found the
     documentation to be lacking. The series is "Improve buffer head
     documentation".

   - Multi-size THPs get more work, this time from Lance Yang. His
     series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
     optimizes the freeing of these things.

   - Kemeng Shi has added more userspace-visible writeback
     instrumentation in the series "Improve visibility of writeback".

   - Kemeng Shi then sent some maintenance work on top in the series
     "Fix and cleanups to page-writeback".

   - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
     the series "Improve anon_vma scalability for anon VMAs". Intel's
     test bot reported an improbable 3x improvement in one test.

   - SeongJae Park adds some DAMON feature work in the series
	"mm/damon: add a DAMOS filter type for page granularity access recheck"
	"selftests/damon: add DAMOS quota goal test"

   - Also some maintenance work in the series
	"mm/damon/paddr: simplify page level access re-check for pageout"
	"mm/damon: misc fixes and improvements"

   - David Hildenbrand has disabled some known-to-fail selftests ni the
     series "selftests: mm: cow: flag vmsplice() hugetlb tests as
     XFAIL".

   - memcg metadata storage optimizations from Shakeel Butt in "memcg:
     reduce memory consumption by memcg stats".

   - DAX fixes and maintenance work from Vishal Verma in the series
     "dax/bus.c: Fixups for dax-bus locking""

* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
  memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
  selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
  selftests: cgroup: add tests to verify the zswap writeback path
  mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
  mm/damon/core: fix return value from damos_wmark_metric_value
  mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
  selftests: cgroup: remove redundant enabling of memory controller
  Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
  Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
  Docs/mm/damon/design: use a list for supported filters
  Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
  Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
  selftests/damon: classify tests for functionalities and regressions
  selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
  selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
  selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
  mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
  selftests/damon: add a test for DAMOS quota goal
  ...
2024-05-19 09:21:03 -07:00
Xuan Zhuo f9dac92ba9 virtio_ring: enable premapped mode whatever use_dma_api
Now, we have virtio DMA APIs, the driver can be the premapped
mode whatever the virtio core uses dma api or not.

So remove the limit of checking use_dma_api from
virtqueue_set_dma_premapped().

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20240511031404.30903-2-xuanzhuo@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13 17:07:41 -07:00
Uwe Kleine-König e117d9b6e7 virtio-mmio: Convert to platform remove callback returning void
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.

To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Message-Id: <ef71f955531d5e41b20d801e1149bb08d155679a.1709886922.git.u.kleine-koenig@pengutronix.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-05-01 12:08:40 -04:00
Jiri Pirko 96a8326d69 virtio: add debugfs infrastructure to allow to debug virtio features
Currently there is no way for user to set what features the driver
should obey or not, it is hard wired in the code.

In order to be able to debug the device behavior in case some feature is
disabled, introduce a debugfs infrastructure with couple of files
allowing user to see what features the device advertises and
to set filter for features used by driver.

Example:
$cat /sys/bus/virtio/devices/virtio0/features
1110010111111111111101010000110010000000100000000000000000000000
$ echo "5" >/sys/kernel/debug/virtio/virtio0/filter_feature_add
$ cat /sys/kernel/debug/virtio/virtio0/filter_features
5
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/unbind
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/bind
$ cat /sys/bus/virtio/devices/virtio0/features
1110000111111111111101010000110010000000100000000000000000000000

Note that sysfs "features" now already exists, this patch does not
touch it.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-26 13:26:53 +02:00
Kent Overstreet 0069455bcb fix missing vmalloc.h includes
Patch series "Memory allocation profiling", v6.

Overview:
Low overhead [1] per-callsite memory allocation profiling. Not just for
debug kernels, overhead low enough to be deployed in production.

Example output:
  root@moria-kvm:~# sort -rn /proc/allocinfo
   127664128    31168 mm/page_ext.c:270 func:alloc_page_ext
    56373248     4737 mm/slub.c:2259 func:alloc_slab_page
    14880768     3633 mm/readahead.c:247 func:page_cache_ra_unbounded
    14417920     3520 mm/mm_init.c:2530 func:alloc_large_system_hash
    13377536      234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
    11718656     2861 mm/filemap.c:1919 func:__filemap_get_folio
     9192960     2800 kernel/fork.c:307 func:alloc_thread_stack_node
     4206592        4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
     4136960     1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
     3940352      962 mm/memory.c:4214 func:alloc_anon_folio
     2894464    22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
     ...

Usage:
kconfig options:
 - CONFIG_MEM_ALLOC_PROFILING
 - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
 - CONFIG_MEM_ALLOC_PROFILING_DEBUG
   adds warnings for allocations that weren't accounted because of a
   missing annotation

sysctl:
  /proc/sys/vm/mem_profiling

Runtime info:
  /proc/allocinfo

Notes:

[1]: Overhead
To measure the overhead we are comparing the following configurations:
(1) Baseline with CONFIG_MEMCG_KMEM=n
(2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n)
(3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y)
(4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1)
(5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT
(6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n)  && CONFIG_MEMCG_KMEM=y
(7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y

Performance overhead:
To evaluate performance we implemented an in-kernel test executing
multiple get_free_page/free_page and kmalloc/kfree calls with allocation
sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
affinity set to a specific CPU to minimize the noise. Below are results
from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
56 core Intel Xeon:

                        kmalloc                 pgalloc
(1 baseline)            6.764s                  16.902s
(2 default disabled)    6.793s  (+0.43%)        17.007s (+0.62%)
(3 default enabled)     7.197s  (+6.40%)        23.666s (+40.02%)
(4 runtime enabled)     7.405s  (+9.48%)        23.901s (+41.41%)
(5 memcg)               13.388s (+97.94%)       48.460s (+186.71%)
(6 def disabled+memcg)  13.332s (+97.10%)       48.105s (+184.61%)
(7 def enabled+memcg)   13.446s (+98.78%)       54.963s (+225.18%)

Memory overhead:
Kernel size:

   text           data        bss         dec         diff
(1) 26515311	      18890222    17018880    62424413
(2) 26524728	      19423818    16740352    62688898    264485
(3) 26524724	      19423818    16740352    62688894    264481
(4) 26524728	      19423818    16740352    62688898    264485
(5) 26541782	      18964374    16957440    62463596    39183

Memory consumption on a 56 core Intel CPU with 125GB of memory:
Code tags:           192 kB
PageExts:         262144 kB (256MB)
SlabExts:           9876 kB (9.6MB)
PcpuExts:            512 kB (0.5MB)

Total overhead is 0.2% of total memory.

Benchmarks:

Hackbench tests run 100 times:
hackbench -s 512 -l 200 -g 15 -f 25 -P
      baseline       disabled profiling           enabled profiling
avg   0.3543         0.3559 (+0.0016)             0.3566 (+0.0023)
stdev 0.0137         0.0188                       0.0077


hackbench -l 10000
      baseline       disabled profiling           enabled profiling
avg   6.4218         6.4306 (+0.0088)             6.5077 (+0.0859)
stdev 0.0933         0.0286                       0.0489

stress-ng tests:
stress-ng --class memory --seq 4 -t 60
stress-ng --class cpu --seq 4 -t 60
Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/

[2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/


This patch (of 37):

The next patch drops vmalloc.h from a system header in order to fix a
circular dependency; this adds it to all the files that were pulling it in
implicitly.

[kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c]
  Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev
[surenb@google.com: fix arch/x86/mm/numa_32.c]
  Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com
[kent.overstreet@linux.dev: a few places were depending on sizes.h]
  Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev
[arnd@arndb.de: fix mm/kasan/hw_tags.c]
  Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org
[surenb@google.com: fix arc build]
  Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com
Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com
Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Gary Guo <gary@garyguo.net>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:55:49 -07:00
Krzysztof Kozlowski ffe6176b7f virtio: store owner from modules with register_virtio_driver()
Modules registering driver with register_virtio_driver() might forget to
set .owner field.  i2c-virtio.c for example has it missing.  The field
is used by some other kernel parts for reference counting
(try_module_get()), so it is expected that drivers will set it.

Solve the problem by moving this task away from the drivers to the core
virtio code, just like we did for platform_driver in
commit 9447057eaf ("platform_device: use a macro instead of
platform_driver_register").

Fixes: 3cfc883804 ("i2c: virtio: add a virtio i2c frontend driver")
Cc: "Jie Deng" <jie.deng@intel.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Message-Id: <20240331-module-owner-virtio-v2-1-98f04bfaf46a@linaro.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-04-08 04:11:04 -04:00
Xuan Zhuo d5c0ed17fe virtio: packed: fix unmap leak for indirect desc table
When use_dma_api and premapped are true, then the do_unmap is false.

Because the do_unmap is false, vring_unmap_extra_packed is not called by
detach_buf_packed.

  if (unlikely(vq->do_unmap)) {
                curr = id;
                for (i = 0; i < state->num; i++) {
                        vring_unmap_extra_packed(vq,
                                                 &vq->packed.desc_extra[curr]);
                        curr = vq->packed.desc_extra[curr].next;
                }
  }

So the indirect desc table is not unmapped. This causes the unmap leak.

So here, we check vq->use_dma_api instead. Synchronously, dma info is
updated based on use_dma_api judgment

This bug does not occur, because no driver use the premapped with
indirect.

Fixes: b319940f83 ("virtio_ring: skip unmap for premapped")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Message-Id: <20240223071833.26095-1-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-19 03:19:22 -04:00
Ricardo B. Marliere 2b666ee296 virtio: make virtio_bus const
Now that the driver core can properly handle constant struct bus_type,
move the virtio_bus variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Message-Id: <20240204-bus_cleanup-virtio-v1-1-3bcb2212aaa0@marliere.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Jason Wang <jasowang@redhat.com>
2024-03-19 02:45:50 -04:00
Zhu Lingshan 273ae08fa0 virtio_vdpa: create vqs with the actual size
The size of a virtqueue is a per vq configuration,
this commit allows virtio_vdpa to create
virtqueues with the actual size of a specific
vq size that supported by the backend device.

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240202163905.8834-9-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-19 02:45:50 -04:00
David Hildenbrand 310227f428 virtio: reenable config if freezing device failed
Currently, we don't reenable the config if freezing the device failed.

For example, virtio-mem currently doesn't support suspend+resume, and
trying to freeze the device will always fail. Afterwards, the device
will no longer respond to resize requests, because it won't get notified
about config changes.

Let's fix this by re-enabling the config if freezing fails.

Fixes: 22b7050a02 ("virtio: defer config changed notifications")
Cc: <stable@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240213135425.795001-1-david@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-19 02:45:49 -04:00
Linus Torvalds 0b7359ccdd virtio: features, fixes
vdpa/mlx5: support for resumable vqs
 virtio_scsi: mq_poll support
 3virtio_pmem: support SHMEM_REGION
 virtio_balloon: stay awake while adjusting balloon
 virtio: support for no-reset virtio PCI PM
 
 Fixes, cleanups.
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmWmgP8PHG1zdEByZWRo
 YXQuY29tAAoJECgfDbjSjVRpcfgH/0RD2S+NFY0ZEJz8BuI6GjykzYnyRW9iyxcw
 epTLjPUcoEBttlw8TA+3PiPoNIJGfuU8Q4iKXJ8Jzql081tP9G1UxTIbj0v3Hx+q
 0L2DUXfdAMYMLo5WQVl/PADV/10xLgExEh9jMqpU3IJIxVaLE/knD9ghRCDvDbs/
 fOo3sSUGaNsSHYZs5bH73Q7cRKKmTLO+MzvHBbavFfz2fQ1b3vwecmJuQtAtK0JC
 6JxH6Y38VfOl8jA6IHeEpGIHeF661HABkDDUh4UVEGOeyBl4E6ZcG4fjWSMinZ08
 U3TbQLYOq10i8ki2LJKgoZHRv1HkxbM1Ogn0bsIh1hish8dPORM=
 =RWjR
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:

 - vdpa/mlx5: support for resumable vqs

 - virtio_scsi: mq_poll support

 - 3virtio_pmem: support SHMEM_REGION

 - virtio_balloon: stay awake while adjusting balloon

 - virtio: support for no-reset virtio PCI PM

 - Fixes, cleanups

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vdpa/mlx5: Add mkey leak detection
  vdpa/mlx5: Introduce reference counting to mrs
  vdpa/mlx5: Use vq suspend/resume during .set_map
  vdpa/mlx5: Mark vq state for modification in hw vq
  vdpa/mlx5: Mark vq addrs for modification in hw vq
  vdpa/mlx5: Introduce per vq and device resume
  vdpa/mlx5: Allow modifying multiple vq fields in one modify command
  vdpa/mlx5: Expose resumable vq capability
  vdpa: Block vq property changes in DRIVER_OK
  vdpa: Track device suspended state
  scsi: virtio_scsi: Add mq_poll support
  virtio_pmem: support feature SHMEM_REGION
  virtio_balloon: stay awake while adjusting balloon
  vdpa: Remove usage of the deprecated ida_simple_xx() API
  virtio: Add support for no-reset virtio PCI PM
  virtio_net: fix missing dma unmap for resize
  vhost-vdpa: account iommu allocations
  vdpa: Fix an error handling path in eni_vdpa_probe()
2024-01-18 16:44:03 -08:00
Linus Torvalds 244aefb1c6 VFIO updates for v6.8-rc1
- Add debugfs support, initially used for reporting device migration
    state. (Longfang Liu)
 
  - Fixes and support for migration dirty tracking across multiple IOVA
    regions in the pds-vfio-pci driver. (Brett Creeley)
 
  - Improved IOMMU allocation accounting visibility. (Pasha Tatashin)
 
  - Virtio infrastructure and a new virtio-vfio-pci variant driver, which
    provides emulation of a legacy virtio interfaces on modern virtio
    hardware for virtio-net VF devices where the PF driver exposes
    support for legacy admin queues, ie. an emulated IO BAR on an SR-IOV
    VF to provide driver ABI compatibility to legacy devices.
    (Yishai Hadas & Feng Liu)
 
  - Migration fixes for the hisi-acc-vfio-pci variant driver.
    (Shameer Kolothum)
 
  - Kconfig dependency fix for new virtio-vfio-pci variant driver.
    (Arnd Bergmann)
 -----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmWhkhEbHGFsZXgud2ls
 bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsiCLgQAJv6mzD79dVWKAZH27Lj
 PK0ZSyu3fwgPxTmhRXysKKMs79WI2GlVx6nyW8pVe3w+OGWpdTcbZK2H/T/FryZQ
 QsbKteueG83ni1cIdJFzmIM1jO79jhtsPxpclRS/VmECRhYA6+c7smynHyZNrVAL
 wWkJIkS2uUEx3eUefzH4U2CRen3TILwHAXi27fJ8pHbr6Yor+XvUOgM3eQDjUj+t
 eABL/pJr0qFDQstom6k7GLAsenRHKMLUG88ziSciSJxOg5YiT4py7zeLXuoEhVD1
 kI9KE+Vle5EdZe8MzLLhmzLZoFVfhjyNfj821QjtfP3Gkj6TqnUWBKJAptMuQpdf
 HklOLNmabrZbat+i6QqswrnQ5Z1doPz1uNBsl2lH+2/KIaT8bHZI+QgjK7pg2H2L
 O679My0od4rVLpjnSLDdRoXlcLd6mmvq3663gPogziHBNdNl3oQBI3iIa7ixljkA
 lxJbOZIDBAjzPk+t5NLYwkTsab1AY4zGlfr0M3Sk3q7tyj/MlBcX/fuqyhXjUfqR
 Zhqaw2OaWD8R0EqfSK+wRXr1+z7EWJO/y1iq8RYlD5Mozo+6YMVThjLDUO+8mrtV
 6/PL0woGALw0Tq1u0tw3rLjzCd9qwD9BD2fFUQwUWEe3j3wG2HCLLqyomxcmaKS8
 WgvUXtufWyvonCcIeLKXI9Kt
 =IuK2
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v6.8-rc1' of https://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Add debugfs support, initially used for reporting device migration
   state (Longfang Liu)

 - Fixes and support for migration dirty tracking across multiple IOVA
   regions in the pds-vfio-pci driver (Brett Creeley)

 - Improved IOMMU allocation accounting visibility (Pasha Tatashin)

 - Virtio infrastructure and a new virtio-vfio-pci variant driver, which
   provides emulation of a legacy virtio interfaces on modern virtio
   hardware for virtio-net VF devices where the PF driver exposes
   support for legacy admin queues, ie. an emulated IO BAR on an SR-IOV
   VF to provide driver ABI compatibility to legacy devices (Yishai
   Hadas & Feng Liu)

 - Migration fixes for the hisi-acc-vfio-pci variant driver (Shameer
   Kolothum)

 - Kconfig dependency fix for new virtio-vfio-pci variant driver (Arnd
   Bergmann)

* tag 'vfio-v6.8-rc1' of https://github.com/awilliam/linux-vfio: (22 commits)
  vfio/virtio: fix virtio-pci dependency
  hisi_acc_vfio_pci: Update migration data pointer correctly on saving/resume
  vfio/virtio: Declare virtiovf_pci_aer_reset_done() static
  vfio/virtio: Introduce a vfio driver over virtio devices
  vfio/pci: Expose vfio_pci_core_iowrite/read##size()
  vfio/pci: Expose vfio_pci_core_setup_barmap()
  virtio-pci: Introduce APIs to execute legacy IO admin commands
  virtio-pci: Initialize the supported admin commands
  virtio-pci: Introduce admin commands
  virtio-pci: Introduce admin command sending function
  virtio-pci: Introduce admin virtqueue
  virtio: Define feature bit for administration virtqueue
  vfio/type1: account iommu allocations
  vfio/pds: Add multi-region support
  vfio/pds: Move seq/ack bitmaps into region struct
  vfio/pds: Pass region info to relevant functions
  vfio/pds: Move and rename region specific info
  vfio/pds: Only use a single SGL for both seq and ack
  vfio/pds: Fix calculations in pds_vfio_dirty_sync
  MAINTAINERS: Add vfio debugfs interface doc link
  ...
2024-01-18 15:57:25 -08:00
David Stevens b12fbc3f78 virtio_balloon: stay awake while adjusting balloon
A virtio_balloon's parent device may be configured so that a
configuration change interrupt is a wakeup event. Extend the processing
of such a wakeup event until the balloon finishes inflating or deflating
by calling pm_stay_awake/pm_relax in the virtio_balloon driver. Note
that these calls are no-ops if the parent device doesn't support wakeup
events or if the wakeup events are not enabled.

This change allows the guest to use system power states such as s2idle
without running the risk the virtio_balloon's cooperative memory
management becoming unresponsive to the host's requests.

Tested-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: David Stevens <stevensd@chromium.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Message-Id: <20240110021925.1137333-1-stevensd@google.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-01-10 13:01:06 -05:00
Linus Torvalds fb46e22a9e Many singleton patches against the MM code. The patch series which
are included in this merge do the following:
 
 - Peng Zhang has done some mapletree maintainance work in the
   series
 
 	"maple_tree: add mt_free_one() and mt_attr() helpers"
 	"Some cleanups of maple tree"
 
 - In the series "mm: use memmap_on_memory semantics for dax/kmem"
   Vishal Verma has altered the interworking between memory-hotplug
   and dax/kmem so that newly added 'device memory' can more easily
   have its memmap placed within that newly added memory.
 
 - Matthew Wilcox continues folio-related work (including a few
   fixes) in the patch series
 
 	"Add folio_zero_tail() and folio_fill_tail()"
 	"Make folio_start_writeback return void"
 	"Fix fault handler's handling of poisoned tail pages"
 	"Convert aops->error_remove_page to ->error_remove_folio"
 	"Finish two folio conversions"
 	"More swap folio conversions"
 
 - Kefeng Wang has also contributed folio-related work in the series
 
 	"mm: cleanup and use more folio in page fault"
 
 - Jim Cromie has improved the kmemleak reporting output in the
   series "tweak kmemleak report format".
 
 - In the series "stackdepot: allow evicting stack traces" Andrey
   Konovalov to permits clients (in this case KASAN) to cause
   eviction of no longer needed stack traces.
 
 - Charan Teja Kalla has fixed some accounting issues in the page
   allocator's atomic reserve calculations in the series "mm:
   page_alloc: fixes for high atomic reserve caluculations".
 
 - Dmitry Rokosov has added to the samples/ dorectory some sample
   code for a userspace memcg event listener application.  See the
   series "samples: introduce cgroup events listeners".
 
 - Some mapletree maintanance work from Liam Howlett in the series
   "maple_tree: iterator state changes".
 
 - Nhat Pham has improved zswap's approach to writeback in the
   series "workload-specific and memory pressure-driven zswap
   writeback".
 
 - DAMON/DAMOS feature and maintenance work from SeongJae Park in
   the series
 
 	"mm/damon: let users feed and tame/auto-tune DAMOS"
 	"selftests/damon: add Python-written DAMON functionality tests"
 	"mm/damon: misc updates for 6.8"
 
 - Yosry Ahmed has improved memcg's stats flushing in the series
   "mm: memcg: subtree stats flushing and thresholds".
 
 - In the series "Multi-size THP for anonymous memory" Ryan Roberts
   has added a runtime opt-in feature to transparent hugepages which
   improves performance by allocating larger chunks of memory during
   anonymous page faults.
 
 - Matthew Wilcox has also contributed some cleanup and maintenance
   work against eh buffer_head code int he series "More buffer_head
   cleanups".
 
 - Suren Baghdasaryan has done work on Andrea Arcangeli's series
   "userfaultfd move option".  UFFDIO_MOVE permits userspace heap
   compaction algorithms to move userspace's pages around rather than
   UFFDIO_COPY'a alloc/copy/free.
 
 - Stefan Roesch has developed a "KSM Advisor", in the series
   "mm/ksm: Add ksm advisor".  This is a governor which tunes KSM's
   scanning aggressiveness in response to userspace's current needs.
 
 - Chengming Zhou has optimized zswap's temporary working memory
   use in the series "mm/zswap: dstmem reuse optimizations and
   cleanups".
 
 - Matthew Wilcox has performed some maintenance work on the
   writeback code, both code and within filesystems.  The series is
   "Clean up the writeback paths".
 
 - Andrey Konovalov has optimized KASAN's handling of alloc and
   free stack traces for secondary-level allocators, in the series
   "kasan: save mempool stack traces".
 
 - Andrey also performed some KASAN maintenance work in the series
   "kasan: assorted clean-ups".
 
 - David Hildenbrand has gone to town on the rmap code.  Cleanups,
   more pte batching, folio conversions and more.  See the series
   "mm/rmap: interface overhaul".
 
 - Kinsey Ho has contributed some maintenance work on the MGLRU
   code in the series "mm/mglru: Kconfig cleanup".
 
 - Matthew Wilcox has contributed lruvec page accounting code
   cleanups in the series "Remove some lruvec page accounting
   functions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
 jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
 Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
 =0NHs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Peng Zhang has done some mapletree maintainance work in the series

	'maple_tree: add mt_free_one() and mt_attr() helpers'
	'Some cleanups of maple tree'

   - In the series 'mm: use memmap_on_memory semantics for dax/kmem'
     Vishal Verma has altered the interworking between memory-hotplug
     and dax/kmem so that newly added 'device memory' can more easily
     have its memmap placed within that newly added memory.

   - Matthew Wilcox continues folio-related work (including a few fixes)
     in the patch series

	'Add folio_zero_tail() and folio_fill_tail()'
	'Make folio_start_writeback return void'
	'Fix fault handler's handling of poisoned tail pages'
	'Convert aops->error_remove_page to ->error_remove_folio'
	'Finish two folio conversions'
	'More swap folio conversions'

   - Kefeng Wang has also contributed folio-related work in the series

	'mm: cleanup and use more folio in page fault'

   - Jim Cromie has improved the kmemleak reporting output in the series
     'tweak kmemleak report format'.

   - In the series 'stackdepot: allow evicting stack traces' Andrey
     Konovalov to permits clients (in this case KASAN) to cause eviction
     of no longer needed stack traces.

   - Charan Teja Kalla has fixed some accounting issues in the page
     allocator's atomic reserve calculations in the series 'mm:
     page_alloc: fixes for high atomic reserve caluculations'.

   - Dmitry Rokosov has added to the samples/ dorectory some sample code
     for a userspace memcg event listener application. See the series
     'samples: introduce cgroup events listeners'.

   - Some mapletree maintanance work from Liam Howlett in the series
     'maple_tree: iterator state changes'.

   - Nhat Pham has improved zswap's approach to writeback in the series
     'workload-specific and memory pressure-driven zswap writeback'.

   - DAMON/DAMOS feature and maintenance work from SeongJae Park in the
     series

	'mm/damon: let users feed and tame/auto-tune DAMOS'
	'selftests/damon: add Python-written DAMON functionality tests'
	'mm/damon: misc updates for 6.8'

   - Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
     memcg: subtree stats flushing and thresholds'.

   - In the series 'Multi-size THP for anonymous memory' Ryan Roberts
     has added a runtime opt-in feature to transparent hugepages which
     improves performance by allocating larger chunks of memory during
     anonymous page faults.

   - Matthew Wilcox has also contributed some cleanup and maintenance
     work against eh buffer_head code int he series 'More buffer_head
     cleanups'.

   - Suren Baghdasaryan has done work on Andrea Arcangeli's series
     'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
     compaction algorithms to move userspace's pages around rather than
     UFFDIO_COPY'a alloc/copy/free.

   - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
     Add ksm advisor'. This is a governor which tunes KSM's scanning
     aggressiveness in response to userspace's current needs.

   - Chengming Zhou has optimized zswap's temporary working memory use
     in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.

   - Matthew Wilcox has performed some maintenance work on the writeback
     code, both code and within filesystems. The series is 'Clean up the
     writeback paths'.

   - Andrey Konovalov has optimized KASAN's handling of alloc and free
     stack traces for secondary-level allocators, in the series 'kasan:
     save mempool stack traces'.

   - Andrey also performed some KASAN maintenance work in the series
     'kasan: assorted clean-ups'.

   - David Hildenbrand has gone to town on the rmap code. Cleanups, more
     pte batching, folio conversions and more. See the series 'mm/rmap:
     interface overhaul'.

   - Kinsey Ho has contributed some maintenance work on the MGLRU code
     in the series 'mm/mglru: Kconfig cleanup'.

   - Matthew Wilcox has contributed lruvec page accounting code cleanups
     in the series 'Remove some lruvec page accounting functions'"

* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
  mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
  mm, treewide: introduce NR_PAGE_ORDERS
  selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
  selftests/mm: skip test if application doesn't has root privileges
  selftests/mm: conform test to TAP format output
  selftests: mm: hugepage-mmap: conform to TAP format output
  selftests/mm: gup_test: conform test to TAP format output
  mm/selftests: hugepage-mremap: conform test to TAP format output
  mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
  mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
  mm/memcontrol: remove __mod_lruvec_page_state()
  mm/khugepaged: use a folio more in collapse_file()
  slub: use a folio in __kmalloc_large_node
  slub: use folio APIs in free_large_kmalloc()
  slub: use alloc_pages_node() in alloc_slab_page()
  mm: remove inc/dec lruvec page state functions
  mm: ratelimit stat flush from workingset shrinker
  kasan: stop leaking stack trace handles
  mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
  mm/mglru: add dummy pmd_dirty()
  ...
2024-01-09 11:18:47 -08:00
Kirill A. Shutemov 5e0a760b44 mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
commit 23baf831a3 ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive.  This has caused
issues with code that was not yet upstream and depended on the previous
definition.

To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.

Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-08 15:27:15 -08:00
David Stevens dff4fa0e57 virtio: Add support for no-reset virtio PCI PM
If a virtio_pci_device supports native PCI power management and has the
No_Soft_Reset bit set, then skip resetting and reinitializing the device
when suspending and restoring the device. This allows system-wide low
power states like s2idle to be used in systems with stateful virtio
devices that can't simply be re-initialized (e.g. virtio-fs).

Signed-off-by: David Stevens <stevensd@chromium.org>
Message-Id: <20231208070754.3132339-1-stevensd@chromium.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
2023-12-27 16:10:38 -05:00
Yishai Hadas c3fc3e098b virtio-pci: Introduce APIs to execute legacy IO admin commands
Introduce APIs to execute legacy IO admin commands.

It includes: io_legacy_read/write for both common and the device
configuration, io_legacy_notify_info.

In addition, exposing an API to check whether the legacy IO commands are
supported. (i.e. virtio_pci_admin_has_legacy_io()).

Those APIs will be used by the next patches from this series.

Note:
Unlike modern drivers which support hardware virtio devices, legacy
drivers assume software-based devices: e.g. they don't use proper memory
barriers on ARM, use big endian on PPC, etc. X86 drivers are mostly ok
though, more or less by chance. For now, only support legacy IO on X86.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20231219093247.170936-7-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19 11:51:33 -07:00
Yishai Hadas f51e146f1e virtio-pci: Initialize the supported admin commands
Initialize the supported admin commands upon activating the admin queue.

The supported commands are saved as part of the admin queue context.

Next patches in this series will expose APIs to use them.

Reviewed-by: Feng Liu <feliu@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20231219093247.170936-6-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19 11:51:32 -07:00
Feng Liu 92792ac752 virtio-pci: Introduce admin command sending function
Add support for sending admin command through admin virtqueue interface.
Abort any inflight admin commands once device reset completes. Activate
admin queue when device becomes ready; deactivate on device reset.

To comply to the below specification statement [1], the admin virtqueue
is activated for upper layer users only after setting DRIVER_OK status.

[1] The driver MUST NOT send any buffer available notifications to the
device before setting DRIVER_OK.

Signed-off-by: Feng Liu <feliu@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20231219093247.170936-4-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19 11:51:31 -07:00
Feng Liu fd27ef6b44 virtio-pci: Introduce admin virtqueue
Introduce support for the admin virtqueue. By negotiating
VIRTIO_F_ADMIN_VQ feature, driver detects capability and creates one
administration virtqueue. Administration virtqueue implementation in
virtio pci generic layer, enables multiple types of upper layer
drivers such as vfio, net, blk to utilize it.

Signed-off-by: Feng Liu <feliu@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20231219093247.170936-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19 11:51:31 -07:00
Xuan Zhuo 1f475cd572 virtio_ring: fix syncs DMA memory with different direction
Now the APIs virtqueue_dma_sync_single_range_for_{cpu,device} ignore
the parameter 'dir', that is a mistake.

[    6.101666] ------------[ cut here ]------------
[    6.102079] DMA-API: virtio-pci 0000:00:04.0: device driver syncs DMA memory with different direction [device address=0x00000000ae010000] [size=32752 bytes] [mapped with DMA_FROM_DEVICE] [synced with DMA_BIDIRECTIONAL]
[    6.103630] WARNING: CPU: 6 PID: 0 at kernel/dma/debug.c:1125 check_sync+0x53e/0x6c0
[    6.107420] CPU: 6 PID: 0 Comm: swapper/6 Tainted: G            E      6.6.0+ #290
[    6.108030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[    6.108936] RIP: 0010:check_sync+0x53e/0x6c0
[    6.109289] Code: 24 10 e8 f5 d9 74 00 4c 8b 4c 24 10 4c 8b 44 24 18 48 8b 4c 24 20 48 89 c6 41 56 4c 89 ea 48 c7 c7 b0 f1 50 82 e8 32 fc f3 ff <0f> 0b 48 c7 c7 48 4b 4a 82 e8 74 d9 fc ff 8b 73 4c 48 8d 7b 50 31
[    6.110750] RSP: 0018:ffffc90000180cd8 EFLAGS: 00010092
[    6.111178] RAX: 00000000000000ce RBX: ffff888100aa5900 RCX: 0000000000000000
[    6.111744] RDX: 0000000000000104 RSI: ffffffff824c3208 RDI: 00000000ffffffff
[    6.112316] RBP: ffffc90000180d40 R08: 0000000000000000 R09: 00000000fffeffff
[    6.112893] R10: ffffc90000180b98 R11: ffffffff82f63308 R12: ffffffff83d5af00
[    6.113460] R13: ffff888100998200 R14: ffffffff824a4b5f R15: 0000000000000286
[    6.114027] FS:  0000000000000000(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
[    6.114665] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.115128] CR2: 00007f10f1e03030 CR3: 0000000108272004 CR4: 0000000000770ee0
[    6.115701] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    6.116272] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    6.116842] PKRU: 55555554
[    6.117069] Call Trace:
[    6.117275]  <IRQ>
[    6.117452]  ? __warn+0x84/0x140
[    6.117727]  ? check_sync+0x53e/0x6c0
[    6.118034]  ? __report_bug+0xea/0x100
[    6.118353]  ? check_sync+0x53e/0x6c0
[    6.118653]  ? report_bug+0x41/0xc0
[    6.118944]  ? handle_bug+0x3c/0x70
[    6.119237]  ? exc_invalid_op+0x18/0x70
[    6.119551]  ? asm_exc_invalid_op+0x1a/0x20
[    6.119900]  ? check_sync+0x53e/0x6c0
[    6.120199]  ? check_sync+0x53e/0x6c0
[    6.120499]  debug_dma_sync_single_for_cpu+0x5c/0x70
[    6.120906]  ? dma_sync_single_for_cpu+0xb7/0x100
[    6.121291]  virtnet_rq_unmap+0x158/0x170 [virtio_net]
[    6.121716]  virtnet_receive+0x196/0x220 [virtio_net]
[    6.122135]  virtnet_poll+0x48/0x1b0 [virtio_net]
[    6.122524]  __napi_poll+0x29/0x1b0
[    6.123083]  net_rx_action+0x282/0x360
[    6.123612]  __do_softirq+0xf3/0x2fb
[    6.124138]  __irq_exit_rcu+0x8e/0xf0
[    6.124663]  common_interrupt+0xbc/0xe0
[    6.125202]  </IRQ>

We need to enable CONFIG_DMA_API_DEBUG and work with need sync mode(such
as swiotlb) to reproduce this warn.

Fixes: 8bd2f71054 ("virtio_ring: introduce dma sync api for virtqueue")
Reported-by: "Ning, Hongyu" <hongyu.ning@linux.intel.com>
Closes: https://lore.kernel.org/all/f37cb55a-6fc8-4e21-8789-46d468325eea@linux.intel.com/
Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Message-Id: <20231201033303.25141-1-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
2023-12-04 08:36:25 -05:00
Linus Torvalds 372bed5fbb vhost,virtio,vdpa,firmware: bugfixes
bugfixes all over the place
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmVT1w0PHG1zdEByZWRo
 YXQuY29tAAoJECgfDbjSjVRpYXMH/AxTzofnpIJOvmY0QYLbBg3+wu0ml4X1pvJc
 Ia2pm5NZLTPucesw+AFM1q8YIiRX5mBRTmxtGLFzqiC+miAKPvMJK2hQe1jlNqgM
 7zBb0jeHvURvo4rvDi8a2Xvf6T2ypGapBhmcNy7VtEugLrXGHGMMzjE3IS1/Zsbm
 AK70P0ujTpLZtp4TFy2qqZoaikijbAdHhu2GTo1KfFh3RcwSc1uhbrtoTHUQ42bw
 0nXz8eLyuu9VmUryVgLy+j9pnS9KDMNhVCpjpVIHD4iI9L6sQVj8fwc/Bnxq8mh7
 yDcagvcIK54M786cAmLWC+zQu7feMQ8FcC2zduJT+Yl2X4fX04E=
 =xFJf
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio fixes from Michael Tsirkin:
 "Bugfixes all over the place"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vhost-vdpa: fix use after free in vhost_vdpa_probe()
  virtio_pci: Switch away from deprecated irq_set_affinity_hint
  riscv, qemu_fw_cfg: Add support for RISC-V architecture
  vdpa_sim_blk: allocate the buffer zeroed
  virtio_pci: move structure to a header
2023-11-16 07:39:37 -05:00
Linus Torvalds 77fa2fbe87 vhost,virtio,vdpa: features, fixes, cleanups
vdpa/mlx5:
 	VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK
 	new maintainer
 vdpa:
 	support for vq descriptor mappings
 	decouple reset of iotlb mapping from device reset
 
 fixes, cleanups all over the place
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmVCUMYPHG1zdEByZWRo
 YXQuY29tAAoJECgfDbjSjVRp4L0H/RKcnNXPRqzhhBI1XVQ11Z8CO8WjovcmJalu
 ADHNEGmvuWnY79fp9eLiZ4iVaTx1qbzqIB5Q500DJ65jh71W7UQ8ww6CGjNUoRGs
 Zoe4G09WoOf4bvDZZzVV7ml/AzMdsHWSZK8pxY3QI9CsC9Zfp9hg20QYxPylCqYx
 SIJx7w2MkoojfmtOHRx1WUxaQz99yfU4Z0C5PxtRE1HGN6/a1aY0P0CAl5jq8uCK
 U5sCRsfCmP7VKlspeEddMiPA35ADbCiysSobCbwGVQEs5cHpMUX7KWa+oV0tF/PY
 9uyJb2rJy6zG3tXmL4XNib665ZR86HX6qiWRfm2nBQQStuHaJyg=
 =mXgo
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:
 "vhost,virtio,vdpa: features, fixes, cleanups.

  vdpa/mlx5:
   - VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK
   - new maintainer

  vdpa:
   - support for vq descriptor mappings
   - decouple reset of iotlb mapping from device reset

  and fixes, cleanups all over the place"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (34 commits)
  vdpa_sim: implement .reset_map support
  vdpa/mlx5: implement .reset_map driver op
  vhost-vdpa: clean iotlb map during reset for older userspace
  vdpa: introduce .compat_reset operation callback
  vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
  vhost-vdpa: reset vendor specific mapping to initial state in .release
  vdpa: introduce .reset_map operation callback
  virtio_pci: add check for common cfg size
  virtio-blk: fix implicit overflow on virtio_max_dma_size
  virtio_pci: add build offset check for the new common cfg items
  virtio: add definition of VIRTIO_F_NOTIF_CONFIG_DATA feature bit
  vduse: make vduse_class constant
  vhost-scsi: Spelling s/preceeding/preceding/g
  virtio: kdoc for struct virtio_pci_modern_device
  vdpa: Update sysfs ABI documentation
  MAINTAINERS: Add myself as mlx5_vdpa driver
  virtio-balloon: correct the comment of virtballoon_migratepage()
  mlx5_vdpa: offer VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK
  vdpa/mlx5: Update cvq iotlb mapping on ASID change
  vdpa/mlx5: Make iotlb helper functions more generic
  ...
2023-11-05 09:02:32 -10:00
Linus Torvalds ecae0bd517 Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
 
 - Kemeng Shi has contributed some compation maintenance work in the
   series "Fixes and cleanups to compaction".
 
 - Joel Fernandes has a patchset ("Optimize mremap during mutual
   alignment within PMD") which fixes an obscure issue with mremap()'s
   pagetable handling during a subsequent exec(), based upon an
   implementation which Linus suggested.
 
 - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
   following patch series:
 
 	mm/damon: misc fixups for documents, comments and its tracepoint
 	mm/damon: add a tracepoint for damos apply target regions
 	mm/damon: provide pseudo-moving sum based access rate
 	mm/damon: implement DAMOS apply intervals
 	mm/damon/core-test: Fix memory leaks in core-test
 	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
 
 - In the series "Do not try to access unaccepted memory" Adrian Hunter
   provides some fixups for the recently-added "unaccepted memory' feature.
   To increase the feature's checking coverage.  "Plug a few gaps where
   RAM is exposed without checking if it is unaccepted memory".
 
 - In the series "cleanups for lockless slab shrink" Qi Zheng has done
   some maintenance work which is preparation for the lockless slab
   shrinking code.
 
 - Qi Zheng has redone the earlier (and reverted) attempt to make slab
   shrinking lockless in the series "use refcount+RCU method to implement
   lockless slab shrink".
 
 - David Hildenbrand contributes some maintenance work for the rmap code
   in the series "Anon rmap cleanups".
 
 - Kefeng Wang does more folio conversions and some maintenance work in
   the migration code.  Series "mm: migrate: more folio conversion and
   unification".
 
 - Matthew Wilcox has fixed an issue in the buffer_head code which was
   causing long stalls under some heavy memory/IO loads.  Some cleanups
   were added on the way.  Series "Add and use bdev_getblk()".
 
 - In the series "Use nth_page() in place of direct struct page
   manipulation" Zi Yan has fixed a potential issue with the direct
   manipulation of hugetlb page frames.
 
 - In the series "mm: hugetlb: Skip initialization of gigantic tail
   struct pages if freed by HVO" has improved our handling of gigantic
   pages in the hugetlb vmmemmep optimizaton code.  This provides
   significant boot time improvements when significant amounts of gigantic
   pages are in use.
 
 - Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
   rationalization and folio conversions in the hugetlb code.
 
 - Yin Fengwei has improved mlock()'s handling of large folios in the
   series "support large folio for mlock"
 
 - In the series "Expose swapcache stat for memcg v1" Liu Shixin has
   added statistics for memcg v1 users which are available (and useful)
   under memcg v2.
 
 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
   prctl so that userspace may direct the kernel to not automatically
   propagate the denial to child processes.  The series is named "MDWE
   without inheritance".
 
 - Kefeng Wang has provided the series "mm: convert numa balancing
   functions to use a folio" which does what it says.
 
 - In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
   makes is possible for a process to propagate KSM treatment across
   exec().
 
 - Huang Ying has enhanced memory tiering's calculation of memory
   distances.  This is used to permit the dax/kmem driver to use "high
   bandwidth memory" in addition to Optane Data Center Persistent Memory
   Modules (DCPMM).  The series is named "memory tiering: calculate
   abstract distance based on ACPI HMAT"
 
 - In the series "Smart scanning mode for KSM" Stefan Roesch has
   optimized KSM by teaching it to retain and use some historical
   information from previous scans.
 
 - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
   series "mm: memcg: fix tracking of pending stats updates values".
 
 - In the series "Implement IOCTL to get and optionally clear info about
   PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
   us to atomically read-then-clear page softdirty state.  This is mainly
   used by CRIU.
 
 - Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
   - a bunch of relatively minor maintenance tweaks to this code.
 
 - Matthew Wilcox has increased the use of the VMA lock over file-backed
   page faults in the series "Handle more faults under the VMA lock".  Some
   rationalizations of the fault path became possible as a result.
 
 - In the series "mm/rmap: convert page_move_anon_rmap() to
   folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
   and folio conversions.
 
 - In the series "various improvements to the GUP interface" Lorenzo
   Stoakes has simplified and improved the GUP interface with an eye to
   providing groundwork for future improvements.
 
 - Andrey Konovalov has sent along the series "kasan: assorted fixes and
   improvements" which does those things.
 
 - Some page allocator maintenance work from Kemeng Shi in the series
   "Two minor cleanups to break_down_buddy_pages".
 
 - In thes series "New selftest for mm" Breno Leitao has developed
   another MM self test which tickles a race we had between madvise() and
   page faults.
 
 - In the series "Add folio_end_read" Matthew Wilcox provides cleanups
   and an optimization to the core pagecache code.
 
 - Nhat Pham has added memcg accounting for hugetlb memory in the series
   "hugetlb memcg accounting".
 
 - Cleanups and rationalizations to the pagemap code from Lorenzo
   Stoakes, in the series "Abstract vma_merge() and split_vma()".
 
 - Audra Mitchell has fixed issues in the procfs page_owner code's new
   timestamping feature which was causing some misbehaviours.  In the
   series "Fix page_owner's use of free timestamps".
 
 - Lorenzo Stoakes has fixed the handling of new mappings of sealed files
   in the series "permit write-sealed memfd read-only shared mappings".
 
 - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
   series "Batch hugetlb vmemmap modification operations".
 
 - Some buffer_head folio conversions and cleanups from Matthew Wilcox in
   the series "Finish the create_empty_buffers() transition".
 
 - As a page allocator performance optimization Huang Ying has added
   automatic tuning to the allocator's per-cpu-pages feature, in the series
   "mm: PCP high auto-tuning".
 
 - Roman Gushchin has contributed the patchset "mm: improve performance
   of accounted kernel memory allocations" which improves their performance
   by ~30% as measured by a micro-benchmark.
 
 - folio conversions from Kefeng Wang in the series "mm: convert page
   cpupid functions to folios".
 
 - Some kmemleak fixups in Liu Shixin's series "Some bugfix about
   kmemleak".
 
 - Qi Zheng has improved our handling of memoryless nodes by keeping them
   off the allocation fallback list.  This is done in the series "handle
   memoryless nodes more appropriately".
 
 - khugepaged conversions from Vishal Moola in the series "Some
   khugepaged folio conversions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
 jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
 FgeUPAD1oasg6CP+INZvCj34waNxwAc=
 =E+Y4
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Kemeng Shi has contributed some compation maintenance work in the
     series 'Fixes and cleanups to compaction'

   - Joel Fernandes has a patchset ('Optimize mremap during mutual
     alignment within PMD') which fixes an obscure issue with mremap()'s
     pagetable handling during a subsequent exec(), based upon an
     implementation which Linus suggested

   - More DAMON/DAMOS maintenance and feature work from SeongJae Park i
     the following patch series:

	mm/damon: misc fixups for documents, comments and its tracepoint
	mm/damon: add a tracepoint for damos apply target regions
	mm/damon: provide pseudo-moving sum based access rate
	mm/damon: implement DAMOS apply intervals
	mm/damon/core-test: Fix memory leaks in core-test
	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval

   - In the series 'Do not try to access unaccepted memory' Adrian
     Hunter provides some fixups for the recently-added 'unaccepted
     memory' feature. To increase the feature's checking coverage. 'Plug
     a few gaps where RAM is exposed without checking if it is
     unaccepted memory'

   - In the series 'cleanups for lockless slab shrink' Qi Zheng has done
     some maintenance work which is preparation for the lockless slab
     shrinking code

   - Qi Zheng has redone the earlier (and reverted) attempt to make slab
     shrinking lockless in the series 'use refcount+RCU method to
     implement lockless slab shrink'

   - David Hildenbrand contributes some maintenance work for the rmap
     code in the series 'Anon rmap cleanups'

   - Kefeng Wang does more folio conversions and some maintenance work
     in the migration code. Series 'mm: migrate: more folio conversion
     and unification'

   - Matthew Wilcox has fixed an issue in the buffer_head code which was
     causing long stalls under some heavy memory/IO loads. Some cleanups
     were added on the way. Series 'Add and use bdev_getblk()'

   - In the series 'Use nth_page() in place of direct struct page
     manipulation' Zi Yan has fixed a potential issue with the direct
     manipulation of hugetlb page frames

   - In the series 'mm: hugetlb: Skip initialization of gigantic tail
     struct pages if freed by HVO' has improved our handling of gigantic
     pages in the hugetlb vmmemmep optimizaton code. This provides
     significant boot time improvements when significant amounts of
     gigantic pages are in use

   - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
     rationalization and folio conversions in the hugetlb code

   - Yin Fengwei has improved mlock()'s handling of large folios in the
     series 'support large folio for mlock'

   - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
     added statistics for memcg v1 users which are available (and
     useful) under memcg v2

   - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
     prctl so that userspace may direct the kernel to not automatically
     propagate the denial to child processes. The series is named 'MDWE
     without inheritance'

   - Kefeng Wang has provided the series 'mm: convert numa balancing
     functions to use a folio' which does what it says

   - In the series 'mm/ksm: add fork-exec support for prctl' Stefan
     Roesch makes is possible for a process to propagate KSM treatment
     across exec()

   - Huang Ying has enhanced memory tiering's calculation of memory
     distances. This is used to permit the dax/kmem driver to use 'high
     bandwidth memory' in addition to Optane Data Center Persistent
     Memory Modules (DCPMM). The series is named 'memory tiering:
     calculate abstract distance based on ACPI HMAT'

   - In the series 'Smart scanning mode for KSM' Stefan Roesch has
     optimized KSM by teaching it to retain and use some historical
     information from previous scans

   - Yosry Ahmed has fixed some inconsistencies in memcg statistics in
     the series 'mm: memcg: fix tracking of pending stats updates
     values'

   - In the series 'Implement IOCTL to get and optionally clear info
     about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
     which permits us to atomically read-then-clear page softdirty
     state. This is mainly used by CRIU

   - Hugh Dickins contributed the series 'shmem,tmpfs: general
     maintenance', a bunch of relatively minor maintenance tweaks to
     this code

   - Matthew Wilcox has increased the use of the VMA lock over
     file-backed page faults in the series 'Handle more faults under the
     VMA lock'. Some rationalizations of the fault path became possible
     as a result

   - In the series 'mm/rmap: convert page_move_anon_rmap() to
     folio_move_anon_rmap()' David Hildenbrand has implemented some
     cleanups and folio conversions

   - In the series 'various improvements to the GUP interface' Lorenzo
     Stoakes has simplified and improved the GUP interface with an eye
     to providing groundwork for future improvements

   - Andrey Konovalov has sent along the series 'kasan: assorted fixes
     and improvements' which does those things

   - Some page allocator maintenance work from Kemeng Shi in the series
     'Two minor cleanups to break_down_buddy_pages'

   - In thes series 'New selftest for mm' Breno Leitao has developed
     another MM self test which tickles a race we had between madvise()
     and page faults

   - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
     and an optimization to the core pagecache code

   - Nhat Pham has added memcg accounting for hugetlb memory in the
     series 'hugetlb memcg accounting'

   - Cleanups and rationalizations to the pagemap code from Lorenzo
     Stoakes, in the series 'Abstract vma_merge() and split_vma()'

   - Audra Mitchell has fixed issues in the procfs page_owner code's new
     timestamping feature which was causing some misbehaviours. In the
     series 'Fix page_owner's use of free timestamps'

   - Lorenzo Stoakes has fixed the handling of new mappings of sealed
     files in the series 'permit write-sealed memfd read-only shared
     mappings'

   - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
     series 'Batch hugetlb vmemmap modification operations'

   - Some buffer_head folio conversions and cleanups from Matthew Wilcox
     in the series 'Finish the create_empty_buffers() transition'

   - As a page allocator performance optimization Huang Ying has added
     automatic tuning to the allocator's per-cpu-pages feature, in the
     series 'mm: PCP high auto-tuning'

   - Roman Gushchin has contributed the patchset 'mm: improve
     performance of accounted kernel memory allocations' which improves
     their performance by ~30% as measured by a micro-benchmark

   - folio conversions from Kefeng Wang in the series 'mm: convert page
     cpupid functions to folios'

   - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
     kmemleak'

   - Qi Zheng has improved our handling of memoryless nodes by keeping
     them off the allocation fallback list. This is done in the series
     'handle memoryless nodes more appropriately'

   - khugepaged conversions from Vishal Moola in the series 'Some
     khugepaged folio conversions'"

[ bcachefs conflicts with the dynamically allocated shrinkers have been
  resolved as per Stephen Rothwell in

     https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/

  with help from Qi Zheng.

  The clone3 test filtering conflict was half-arsed by yours truly ]

* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
  mm/damon/sysfs: update monitoring target regions for online input commit
  mm/damon/sysfs: remove requested targets when online-commit inputs
  selftests: add a sanity check for zswap
  Documentation: maple_tree: fix word spelling error
  mm/vmalloc: fix the unchecked dereference warning in vread_iter()
  zswap: export compression failure stats
  Documentation: ubsan: drop "the" from article title
  mempolicy: migration attempt to match interleave nodes
  mempolicy: mmap_lock is not needed while migrating folios
  mempolicy: alloc_pages_mpol() for NUMA policy without vma
  mm: add page_rmappable_folio() wrapper
  mempolicy: remove confusing MPOL_MF_LAZY dead code
  mempolicy: mpol_shared_policy_init() without pseudo-vma
  mempolicy trivia: use pgoff_t in shared mempolicy tree
  mempolicy trivia: slightly more consistent naming
  mempolicy trivia: delete those ancient pr_debug()s
  mempolicy: fix migrate_pages(2) syscall return nr_failed
  kernfs: drop shared NUMA mempolicy hooks
  hugetlbfs: drop shared NUMA mempolicy pretence
  mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
  ...
2023-11-02 19:38:47 -10:00
Jakub Sitnicki b2c8b644fa virtio_pci: Switch away from deprecated irq_set_affinity_hint
Since commit 65c7cdedeb ("genirq: Provide new interfaces for affinity
hints") irq_set_affinity_hint is being phased out.

Switch to new interfaces for setting and applying irq affinity hints.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Message-Id: <20231025145319.380775-1-jakub@cloudflare.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
2023-11-01 09:31:16 -04:00
Michael S. Tsirkin 3503895788 virtio_pci: move structure to a header
These are guest/host interfaces, so they belong in the header where e.g.
qemu will know to find them.

Note: we added a new structure as opposed to extending existing one
because someone might be relying on the size of the existing structure
staying unchanged.  Add a warning to avoid using sizeof.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
2023-11-01 09:31:16 -04:00
Si-Wei Liu bc91df5c70 vhost-vdpa: clean iotlb map during reset for older userspace
Using .compat_reset op from the previous patch, the buggy .reset
behaviour can be kept as-is on older userspace apps, which don't ack the
IOTLB_PERSIST backend feature. As this compatibility quirk is limited to
those drivers that used to be buggy in the past, it won't affect change
the behaviour or affect ABI on the setups with API compliant driver.

The separation of .compat_reset from the regular .reset allows
vhost-vdpa able to know which driver had broken behaviour before, so it
can apply the corresponding compatibility quirk to the individual driver
whenever needed.  Compared to overloading the existing .reset with
flags, .compat_reset won't cause any extra burden to the implementation
of every compliant driver.

[mst: squashed in two fixup commits]

Message-Id: <1697880319-4937-6-git-send-email-si-wei.liu@oracle.com>
Message-Id: <1698102863-21122-1-git-send-email-si-wei.liu@oracle.com>
Reported-by: Dragos Tatulea <dtatulea@nvidia.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <1698275594-19204-1-git-send-email-si-wei.liu@oracle.com>
Reported-by: Lei Yang <leiyang@redhat.com>
Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
2023-11-01 09:20:00 -04:00
Xuan Zhuo e0592acd1e virtio_pci: add check for common cfg size
Some buggy devices, the common cfg size may not match the features.

This patch checks the common cfg size for the
features(VIRTIO_F_NOTIF_CONFIG_DATA, VIRTIO_F_RING_RESET). When the
common cfg size does not match the corresponding feature, we fail the
probe and print error message.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20231019034902.7346-1-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-11-01 09:19:59 -04:00
Xuan Zhuo 327e0ab32c virtio_pci: add build offset check for the new common cfg items
Add checks to the check_offsets(void) for queue_notify_data and
queue_reset.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20231010031120.81272-5-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-11-01 09:19:59 -04:00
Xueshi Hu 0cd43eef0a virtio-balloon: correct the comment of virtballoon_migratepage()
After commit 68f2736a85 ("mm: Convert all PageMovable users to
movable_operations"), the execution path has been changed to

move_to_new_folio
	movable_operations->migrate_page
		balloon_page_migrate
			balloon_page_migrate->balloon_page_migrate
				balloon_page_migrate

Correct the outdated comment.

Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
Message-Id: <20230813140709.835536-1-xueshi.hu@smartx.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
2023-11-01 09:19:58 -04:00
Xuan Zhuo 061b39fdfe virtio_pci: fix the common cfg map size
The function vp_modern_map_capability() takes the size parameter,
which corresponds to the size of virtio_pci_common_cfg. As a result,
this indicates the size of memory area to map.

Now the size is the size of virtio_pci_common_cfg, but some feature(such
as the _F_RING_RESET) needs the virtio_pci_modern_common_cfg, so this
commit changes the size to the size of virtio_pci_modern_common_cfg.

Cc: stable@vger.kernel.org
Fixes: 0b50cece0b ("virtio_pci: introduce helper to get/set queue reset")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Message-Id: <20231010031120.81272-3-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
2023-10-18 11:30:12 -04:00
Gavin Shan 07622bd415 virtio_balloon: Fix endless deflation and inflation on arm64
The deflation request to the target, which isn't unaligned to the
guest page size causes endless deflation and inflation actions. For
example, we receive the flooding QMP events for the changes on memory
balloon's size after a deflation request to the unaligned target is
sent for the ARM64 guest, where we have 64KB base page size.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64      \
  -accel kvm -machine virt,gic-version=host -cpu host          \
  -smp maxcpus=8,cpus=8,sockets=2,clusters=2,cores=2,threads=1 \
  -m 1024M,slots=16,maxmem=64G                                 \
  -object memory-backend-ram,id=mem0,size=512M                 \
  -object memory-backend-ram,id=mem1,size=512M                 \
  -numa node,nodeid=0,memdev=mem0,cpus=0-3                     \
  -numa node,nodeid=1,memdev=mem1,cpus=4-7                     \
    :                                                          \
  -device virtio-balloon-pci,id=balloon0,bus=pcie.10

  { "execute" : "balloon", "arguments": { "value" : 1073672192 } }
  {"return": {}}
  {"timestamp": {"seconds": 1693272173, "microseconds": 88667},   \
   "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
  {"timestamp": {"seconds": 1693272174, "microseconds": 89704},   \
   "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
  {"timestamp": {"seconds": 1693272175, "microseconds": 90819},   \
   "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
  {"timestamp": {"seconds": 1693272176, "microseconds": 91961},   \
   "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
  {"timestamp": {"seconds": 1693272177, "microseconds": 93040},   \
   "event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
  {"timestamp": {"seconds": 1693272178, "microseconds": 94117},   \
   "event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
  {"timestamp": {"seconds": 1693272179, "microseconds": 95337},   \
   "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
  {"timestamp": {"seconds": 1693272180, "microseconds": 96615},   \
   "event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
  {"timestamp": {"seconds": 1693272181, "microseconds": 97626},   \
   "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
  {"timestamp": {"seconds": 1693272182, "microseconds": 98693},   \
   "event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
  {"timestamp": {"seconds": 1693272183, "microseconds": 99698},   \
   "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
  {"timestamp": {"seconds": 1693272184, "microseconds": 100727},  \
   "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
  {"timestamp": {"seconds": 1693272185, "microseconds": 90430},   \
   "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
  {"timestamp": {"seconds": 1693272186, "microseconds": 102999},  \
   "event": "BALLOON_CHANGE", "data": {"actual": 1073676288}}
     :
  <The similar QMP events repeat>

Fix it by aligning the target up to the guest page size, 64KB in this
specific case. With this applied, no flooding QMP events are observed
and the memory balloon's size can be stablizied to 0x3ffe0000 soon
after the deflation request is sent.

  { "execute" : "balloon", "arguments": { "value" : 1073672192 } }
  {"return": {}}
  {"timestamp": {"seconds": 1693273328, "microseconds": 793075},  \
   "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}}
  { "execute" : "query-balloon" }
  {"return": {"actual": 1073610752}}

Cc: stable@vger.kernel.org
Signed-off-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zhenyu Zhang <zhenyzha@redhat.com>
Message-Id: <20230831011007.1032822-1-gshan@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
2023-10-18 11:29:35 -04:00
Maximilian Heyne fab7f25922 virtio-mmio: fix memory leak of vm_dev
With the recent removal of vm_dev from devres its memory is only freed
via the callback virtio_mmio_release_dev. However, this only takes
effect after device_add is called by register_virtio_device. Until then
it's an unmanaged resource and must be explicitly freed on error exit.

This bug was discovered and resolved using Coverity Static Analysis
Security Testing (SAST) by Synopsys, Inc.

Cc: stable@vger.kernel.org
Fixes: 55c91fedd0 ("virtio-mmio: don't break lifecycle of vm_dev")
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>

Message-Id: <20230911090328.40538-1-mheyne@amazon.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
2023-10-18 11:29:09 -04:00
Qi Zheng 0fbb99698b virtio_balloon: dynamically allocate the virtio-balloon shrinker
In preparation for implementing lockless slab shrink, use new APIs to
dynamically allocate the virtio-balloon shrinker, so that it can be freed
asynchronously via RCU. Then it doesn't need to wait for RCU read-side
critical section when releasing the struct virtio_balloon.

Link: https://lkml.kernel.org/r/20230911094444.68966-29-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Coly Li <colyli@suse.de>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:25 -07:00
Yuan Yao 1acfe2c122 virtio_ring: fix avail_wrap_counter in virtqueue_add_packed
In current packed virtqueue implementation, the avail_wrap_counter won't
flip, in the case when the driver supplies a descriptor chain with a
length equals to the queue size; total_sg == vq->packed.vring.num.

Let’s assume the following situation:
vq->packed.vring.num=4
vq->packed.next_avail_idx: 1
vq->packed.avail_wrap_counter: 0

Then the driver adds a descriptor chain containing 4 descriptors.

We expect the following result with avail_wrap_counter flipped:
vq->packed.next_avail_idx: 1
vq->packed.avail_wrap_counter: 1

But, the current implementation gives the following result:
vq->packed.next_avail_idx: 1
vq->packed.avail_wrap_counter: 0

To reproduce the bug, you can set a packed queue size as small as
possible, so that the driver is more likely to provide a descriptor
chain with a length equal to the packed queue size. For example, in
qemu run following commands:
sudo qemu-system-x86_64 \
-enable-kvm \
-nographic \
-kernel "path/to/kernel_image" \
-m 1G \
-drive file="path/to/rootfs",if=none,id=disk \
-device virtio-blk,drive=disk \
-drive file="path/to/disk_image",if=none,id=rwdisk \
-device virtio-blk,drive=rwdisk,packed=on,queue-size=4,\
indirect_desc=off \
-append "console=ttyS0 root=/dev/vda rw init=/bin/bash"

Inside the VM, create a directory and mount the rwdisk device on it. The
rwdisk will hang and mount operation will not complete.

This commit fixes the wrap counter error by flipping the
packed.avail_wrap_counter, when start of descriptor chain equals to the
end of descriptor chain (head == i).

Fixes: 1ce9e6055f ("virtio_ring: introduce packed ring support")
Signed-off-by: Yuan Yao <yuanyaogoog@chromium.org>
Message-Id: <20230808051110.3492693-1-yuanyaogoog@chromium.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03 18:10:24 -04:00
Jason Wang ae15aceaa9 virtio_vdpa: build affinity masks conditionally
We try to build affinity mask via create_affinity_masks()
unconditionally which may lead several issues:

- the affinity mask is not used for parent without affinity support
  (only VDUSE support the affinity now)
- the logic of create_affinity_masks() might not work for devices
  other than block. For example it's not rare in the networking device
  where the number of queues could exceed the number of CPUs. Such
  case breaks the current affinity logic which is based on
  group_cpus_evenly() who assumes the number of CPUs are not less than
  the number of groups. This can trigger a warning[1]:

	if (ret >= 0)
		WARN_ON(nr_present + nr_others < numgrps);

Fixing this by only build the affinity masks only when

- Driver passes affinity descriptor, driver like virtio-blk can make
  sure to limit the number of queues when it exceeds the number of CPUs
- Parent support affinity setting config ops

This help to avoid the warning. More optimizations could be done on
top.

[1]
[  682.146655] WARNING: CPU: 6 PID: 1550 at lib/group_cpus.c:400 group_cpus_evenly+0x1aa/0x1c0
[  682.146668] CPU: 6 PID: 1550 Comm: vdpa Not tainted 6.5.0-rc5jason+ #79
[  682.146671] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[  682.146673] RIP: 0010:group_cpus_evenly+0x1aa/0x1c0
[  682.146676] Code: 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc e8 1b c4 74 ff 48 89 ef e8 13 ac 98 ff 4c 89 e7 45 31 e4 e8 08 ac 98 ff eb c2 <0f> 0b eb b6 e8 fd 05 c3 00 45 31 e4 eb e5 cc cc cc cc cc cc cc cc
[  682.146679] RSP: 0018:ffffc9000215f498 EFLAGS: 00010293
[  682.146682] RAX: 000000000001f1e0 RBX: 0000000000000041 RCX: 0000000000000000
[  682.146684] RDX: ffff888109922058 RSI: 0000000000000041 RDI: 0000000000000030
[  682.146686] RBP: ffff888109922058 R08: ffffc9000215f498 R09: ffffc9000215f4a0
[  682.146687] R10: 00000000000198d0 R11: 0000000000000030 R12: ffff888107e02800
[  682.146689] R13: 0000000000000030 R14: 0000000000000030 R15: 0000000000000041
[  682.146692] FS:  00007fef52315740(0000) GS:ffff888237380000(0000) knlGS:0000000000000000
[  682.146695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  682.146696] CR2: 00007fef52509000 CR3: 0000000110dbc004 CR4: 0000000000370ee0
[  682.146698] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  682.146700] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  682.146701] Call Trace:
[  682.146703]  <TASK>
[  682.146705]  ? __warn+0x7b/0x130
[  682.146709]  ? group_cpus_evenly+0x1aa/0x1c0
[  682.146712]  ? report_bug+0x1c8/0x1e0
[  682.146717]  ? handle_bug+0x3c/0x70
[  682.146721]  ? exc_invalid_op+0x14/0x70
[  682.146723]  ? asm_exc_invalid_op+0x16/0x20
[  682.146727]  ? group_cpus_evenly+0x1aa/0x1c0
[  682.146729]  ? group_cpus_evenly+0x15c/0x1c0
[  682.146731]  create_affinity_masks+0xaf/0x1a0
[  682.146735]  virtio_vdpa_find_vqs+0x83/0x1d0
[  682.146738]  ? __pfx_default_calc_sets+0x10/0x10
[  682.146742]  virtnet_find_vqs+0x1f0/0x370
[  682.146747]  virtnet_probe+0x501/0xcd0
[  682.146749]  ? vp_modern_get_status+0x12/0x20
[  682.146751]  ? get_cap_addr.isra.0+0x10/0xc0
[  682.146754]  virtio_dev_probe+0x1af/0x260
[  682.146759]  really_probe+0x1a5/0x410

Fixes: 3dad56823b ("virtio-vdpa: Support interrupt affinity spreading mechanism")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230811091539.1359865-1-jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03 18:10:24 -04:00
Xuan Zhuo 8bd2f71054 virtio_ring: introduce dma sync api for virtqueue
These API has been introduced:

* virtqueue_dma_need_sync
* virtqueue_dma_sync_single_range_for_cpu
* virtqueue_dma_sync_single_range_for_device

These APIs can be used together with the premapped mechanism to sync the
DMA address.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Message-Id: <20230810123057.43407-12-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03 18:10:23 -04:00
Xuan Zhuo b6253b4e21 virtio_ring: introduce dma map api for virtqueue
Added virtqueue_dma_map_api* to map DMA addresses for virtual memory in
advance. The purpose is to keep memory mapped across multiple add/get
buf operations.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Message-Id: <20230810123057.43407-11-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03 18:10:23 -04:00
Xuan Zhuo ba3e0c47c0 virtio_ring: introduce virtqueue_reset()
Introduce virtqueue_reset() to release all buffer inside vq.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230810123057.43407-10-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03 18:10:23 -04:00
Xuan Zhuo ad48d53b5b virtio_ring: separate the logic of reset/enable from virtqueue_resize
The subsequent reset function will reuse these logic.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230810123057.43407-9-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03 18:10:23 -04:00
Xuan Zhuo 4d09f24080 virtio_ring: correct the expression of the description of virtqueue_resize()
Modify the "useless" to a more accurate "unused".

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230810123057.43407-8-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03 18:10:23 -04:00