mirror of
https://github.com/torvalds/linux
synced 2024-10-08 20:34:15 +00:00
7935 commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Linus Torvalds | 9e06150d3c |
New code for 6.5:
* Fix a problem where shrink would blow out the space reserve by declining to shrink the filesystem. * Drop the EXPERIMENTAL tag for the large extent counts feature. * Set FMODE_CAN_ODIRECT and get rid of an address space op. * Fix an AG count overflow bug in growfs if the new device size is redonkulously large. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHQEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZIs45AAKCRBKO3ySh0YR ps5NAP92oOaMlXeaxTTGLnbCe/sQhQiVfjE45sQL2BziHN/s2gD2OX01yn2w+Mpg CdQ6HChUzL2fU3eleh1yMNR7McuaCA== =hQX7 -----END PGP SIGNATURE----- Merge tag 'xfs-6.5-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs updates from Darrick Wong: "There's not much going on this cycle -- the large extent counts feature graduated, so now users can create more extremely fragmented files! :P The rest are bug fixes; and I'll be sending more next week. - Fix a problem where shrink would blow out the space reserve by declining to shrink the filesystem - Drop the EXPERIMENTAL tag for the large extent counts feature - Set FMODE_CAN_ODIRECT and get rid of an address space op - Fix an AG count overflow bug in growfs if the new device size is redonkulously large" * tag 'xfs-6.5-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix ag count overflow during growfs xfs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method xfs: drop EXPERIMENTAL tag for large extent counts xfs: don't deplete the reserve pool when trying to shrink the fs |
||
Linus Torvalds | 6e17c6de3d |
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs.
- Yosry has also eliminated cgroup's atomic rstat flushing. - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability. - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning. - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface. - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree. - Johannes Weiner has done some cleanup work on the compaction code. - David Hildenbrand has contributed additional selftests for get_user_pages(). - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code. - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code. - Huang Ying has done some maintenance on the swap code's usage of device refcounting. - Christoph Hellwig has some cleanups for the filemap/directio code. - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses. - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings. - John Hubbard has a series of fixes to the MM selftesting code. - ZhangPeng continues the folio conversion campaign. - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock. - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8. - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management. - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code. - Vishal Moola also has done some folio conversion work. - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh J0qXCzqaaN8+BuEyLGDVPaXur9KirwY= =B7yQ -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ... |
||
Linus Torvalds | 582c161cf3 |
hardening updates for v6.5-rc1
- Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko) - Convert strreplace() to return string start (Andy Shevchenko) - Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook) - Add missing function prototypes seen with W=1 (Arnd Bergmann) - Fix strscpy() kerndoc typo (Arne Welzel) - Replace strlcpy() with strscpy() across many subsystems which were either Acked by respective maintainers or were trivial changes that went ignored for multiple weeks (Azeem Shaikh) - Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers) - Add KUnit tests for strcat()-family - Enable KUnit tests of FORTIFY wrappers under UML - Add more complete FORTIFY protections for strlcat() - Add missed disabling of FORTIFY for all arch purgatories. - Enable -fstrict-flex-arrays=3 globally - Tightening UBSAN_BOUNDS when using GCC - Improve checkpatch to check for strcpy, strncpy, and fake flex arrays - Improve use of const variables in FORTIFY - Add requested struct_size_t() helper for types not pointers - Add __counted_by macro for annotating flexible array size members -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmSbftQWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJj0MD/9X9jzJzCmsAU+yNldeoAzC84Sk GVU3RBxGcTNysL1gZXynkIgigw7DWc4htMGeSABHHwQRVP65JCH1Kw/VqIkyumbx 9LdX6IklMJb4pRT4PVU3azebV4eNmSjlur2UxMeW54Czm91/6I8RHbJOyAPnOUmo 2oomGdP/hpEHtKR7hgy8Axc6w5ySwQixh2V5sVZG3VbvCS5WKTmTXbs6puuRT5hz iHt7v+7VtEg/Qf1W7J2oxfoghvVBsaRrSLrExWT/oZYh1ZxM7DsCAAoG/IsDgHGA 9LBXiRECgAFThbHVxLvvKZQMXdVk0i8iXLX43XMKC0wTA+NTyH7wlcQQ4RWNMuo8 sfA9Qm9gMArXaf64aymr3Uwn20Zan0391HdlbhOJZAE6v3PPJbleUnM58AzD2d3r 5Lz6AIFBxDImy+3f9iDWgacCT5/PkeiXTHzk9QnKhJyKKtRA58XJxj4q2+rPnGJP n4haXqoxD5FJbxdXiGKk31RS0U5HBug7wkOcUrTqDHUbc/QNU2b7dxTKUx+zYtCU uV5emPzpF4H4z+91WpO47n9gkMAfwV0lt9S2dwS8pxsgqctbmIan+Jgip7rsqZ2G OgLXBsb43eEs+6WgO8tVt/ZHYj9ivGMdrcNcsIfikzNs/xweUJ53k2xSEn2xEa5J cwANDmkL6QQK7yfeeg== =s0j1 -----END PGP SIGNATURE----- Merge tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: "There are three areas of note: A bunch of strlcpy()->strscpy() conversions ended up living in my tree since they were either Acked by maintainers for me to carry, or got ignored for multiple weeks (and were trivial changes). The compiler option '-fstrict-flex-arrays=3' has been enabled globally, and has been in -next for the entire devel cycle. This changes compiler diagnostics (though mainly just -Warray-bounds which is disabled) and potential UBSAN_BOUNDS and FORTIFY _warning_ coverage. In other words, there are no new restrictions, just potentially new warnings. Any new FORTIFY warnings we've seen have been fixed (usually in their respective subsystem trees). For more details, see commit |
||
Linus Torvalds | a0433f8cae |
for-6.5/block-2023-06-23
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8dwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpilGD/9Yys1oxIXJpRf00fzrylAlBthRxMjFQVWw zAut106hAQiBHvU8IkmGA3MvEFVHxtzwYhHI7IR8K3aZBIqscweCqmVI9JyogJw9 U9Twnzel47VmuKdM94FeoN+hbj1fP8EWTjzmy67/zEEfFCdmHvNlMi3lSrGYIpFy 39LxTB99Y4UarM5PtWbes37GYYljzMSWKuo4AfBkvq1eQa+sZ0Vq2xAABKq3UM7f apqhgHtkJooRePDP0eQp+kAyyVMgW2jIK+oIdJDxNF3CKTu2w40RzaYz6fp+jVSU H4R/xS59GW4/xql+VBJDh/qJg9K62DPPYjlW8BmSR8+IjvfFpsyH3/MacE50CD3P 20fs/Mnj49H79fDrQEHJI53cOOb2EmUitbwLbvOcColNTPpt8loBtdQxjF2RMU8R Nyort9DJPFclYCxky1LYg1CNEC2Ln4Zy/jD47wPvqRmOQphOoVlV/hPnOEqvjaZC 49Vn70W2DeE9cXvYI7ha+XIg6/oj+Gs3iusEbV08Ci7EAtXgI+ZUUsQ97K8UNiUh h2lqSJtuI7lBpYP9sf+BeCch5UCC+xGYyTdoM5f58lehWBBPtbs0g7S9RyRyOYxe n+yxEUo3dAGzJ/xsKAjinbZfeWIpr0b1TkAh4w3Cq/BKzRr9Bp8lBAxYuancbQ+Y 1ADPteUOTA== =zP4Y -----END PGP SIGNATURE----- Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe pull request via Keith: - Various cleanups all around (Irvin, Chaitanya, Christophe) - Better struct packing (Christophe JAILLET) - Reduce controller error logs for optional commands (Keith) - Support for >=64KiB block sizes (Daniel Gomez) - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner) - bcache updates via Coly: - Fix a race at init time (Mingzhe Zou) - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye) - use page pinning in the block layer for dio (David) - convert old block dio code to page pinning (David, Christoph) - cleanups for pktcdvd (Andy) - cleanups for rnbd (Guoqing) - use the unchecked __bio_add_page() for the initial single page additions (Johannes) - fix overflows in the Amiga partition handling code (Michael) - improve mq-deadline zoned device support (Bart) - keep passthrough requests out of the IO schedulers (Christoph, Ming) - improve support for flush requests, making them less special to deal with (Christoph) - add bdev holder ops and shutdown methods (Christoph) - fix the name_to_dev_t() situation and use cases (Christoph) - decouple the block open flags from fmode_t (Christoph) - ublk updates and cleanups, including adding user copy support (Ming) - BFQ sanity checking (Bart) - convert brd from radix to xarray (Pankaj) - constify various structures (Thomas, Ivan) - more fine grained persistent reservation ioctl capability checks (Jingbo) - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan, Jordy, Li, Min, Yu, Zhong, Waiman) * tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits) scsi/sg: don't grab scsi host module reference ext4: Fix warning in blkdev_put() block: don't return -EINVAL for not found names in devt_from_devname cdrom: Fix spectre-v1 gadget block: Improve kernel-doc headers blk-mq: don't insert passthrough request into sw queue bsg: make bsg_class a static const structure ublk: make ublk_chr_class a static const structure aoe: make aoe_class a static const structure block/rnbd: make all 'class' structures const block: fix the exclusive open mask in disk_scan_partitions block: add overflow checks for Amiga partition support block: change all __u32 annotations to __be32 in affs_hardblocks.h block: fix signed int overflow in Amiga partition support block: add capacity validation in bdev_add_partition() block: fine-granular CAP_SYS_ADMIN for Persistent Reservation block: disallow Persistent Reservation on partitions reiserfs: fix blkdev_put() warning from release_journal_dev() block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions() block: document the holder argument to blkdev_get_by_path ... |
||
Linus Torvalds | 3eccc0c886 |
for-6.5/splice-2023-06-23
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8QgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpupIEADKEZvpxDyaxHjYZFFeoSJRkh+AEJHe0Xtr J5vUL8t8zmAV3F7i8XaoAEcR0dC0VQcoTc8fAOty71+5hsc7gvtyyNjqU/YWRVqK Xr+VJuSJ+OGx3MzpRWEkepagfPyqP5cyyCOK6gqIgqzc3IwqkR/3QHVRc6oR8YbY AQd7tqm2fQXK9WDHEy5hcaQeqb9uKZjQQoZejpPPerpJM+9RMgKxpCGtnLLIUhr/ sgl7KyLIQPBmveO2vfOR+dmsJBqsLqneqkXDKMAIfpeVEEkHHAlCH4E5Ne1XUS+s ie4If+reuyn1Ktt5Ry1t7w2wr8cX1fcay3K28tgwjE2Bvremc5YnYgb3pyUDW38f tXXkpg/eTXd/Pn0Crpagoa9zJ927tt5JXIO1/PagPEP1XOqUuthshDFsrVqfqbs+ 36gqX2JWB4NJTg9B9KBHA3+iVCJyZLjUqOqws7hOJOvhQytZVm/IwkGBg1Slhe1a J5WemBlqX8lTgXz0nM7cOhPYTZeKe6hazCcb5VwxTUTj9SGyYtsMfqqTwRJO9kiF j1VzbOAgExDYe+GvfqOFPh9VqZho66+DyOD/Xtca4eH7oYyHSmP66o8nhRyPBPZA maBxQhUkPQn4/V/0fL2TwIdWYKsbj8bUyINKPZ2L35YfeICiaYIctTwNJxtRmItB M3VxWD3GZQ== =KhW4 -----END PGP SIGNATURE----- Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux Pull splice updates from Jens Axboe: "This kills off ITER_PIPE to avoid a race between truncate, iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio with unpinned/unref'ed pages from an O_DIRECT splice read. This causes memory corruption. Instead, we either use (a) filemap_splice_read(), which invokes the buffered file reading code and splices from the pagecache into the pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads into it and then pushes the filled pages into the pipe; or (c) handle it in filesystem-specific code. Summary: - Rename direct_splice_read() to copy_splice_read() - Simplify the calculations for the number of pages to be reclaimed in copy_splice_read() - Turn do_splice_to() into a helper, vfs_splice_read(), so that it can be used by overlayfs and coda to perform the checks on the lower fs - Make vfs_splice_read() jump to copy_splice_read() to handle direct-I/O and DAX - Provide shmem with its own splice_read to handle non-existent pages in the pagecache. We don't want a ->read_folio() as we don't want to populate holes, but filemap_get_pages() requires it - Provide overlayfs with its own splice_read to call down to a lower layer as overlayfs doesn't provide ->read_folio() - Provide coda with its own splice_read to call down to a lower layer as coda doesn't provide ->read_folio() - Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs and random files as they just copy to the output buffer and don't splice pages - Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3, ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation - Make cifs use filemap_splice_read() - Replace pointers to generic_file_splice_read() with pointers to filemap_splice_read() as DIO and DAX are handled in the caller; filesystems can still provide their own alternate ->splice_read() op - Remove generic_file_splice_read() - Remove ITER_PIPE and its paraphernalia as generic_file_splice_read was the only user" * tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits) splice: kdoc for filemap_splice_read() and copy_splice_read() iov_iter: Kill ITER_PIPE splice: Remove generic_file_splice_read() splice: Use filemap_splice_read() instead of generic_file_splice_read() cifs: Use filemap_splice_read() trace: Convert trace/seq to use copy_splice_read() zonefs: Provide a splice-read wrapper xfs: Provide a splice-read wrapper orangefs: Provide a splice-read wrapper ocfs2: Provide a splice-read wrapper ntfs3: Provide a splice-read wrapper nfs: Provide a splice-read wrapper f2fs: Provide a splice-read wrapper ext4: Provide a splice-read wrapper ecryptfs: Provide a splice-read wrapper ceph: Provide a splice-read wrapper afs: Provide a splice-read wrapper 9p: Add splice_read wrapper net: Make sock_splice_read() use copy_splice_read() by default tty, proc, kernfs, random: Use copy_splice_read() ... |
||
Long Li | c3b880acad |
xfs: fix ag count overflow during growfs
I found a corruption during growfs: XFS (loop0): Internal error agbno >= mp->m_sb.sb_agblocks at line 3661 of file fs/xfs/libxfs/xfs_alloc.c. Caller __xfs_free_extent+0x28e/0x3c0 CPU: 0 PID: 573 Comm: xfs_growfs Not tainted 6.3.0-rc7-next-20230420-00001-gda8c95746257 Call Trace: <TASK> dump_stack_lvl+0x50/0x70 xfs_corruption_error+0x134/0x150 __xfs_free_extent+0x2c1/0x3c0 xfs_ag_extend_space+0x291/0x3e0 xfs_growfs_data+0xd72/0xe90 xfs_file_ioctl+0x5f9/0x14a0 __x64_sys_ioctl+0x13e/0x1c0 do_syscall_64+0x39/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd XFS (loop0): Corruption detected. Unmount and run xfs_repair XFS (loop0): Internal error xfs_trans_cancel at line 1097 of file fs/xfs/xfs_trans.c. Caller xfs_growfs_data+0x691/0xe90 CPU: 0 PID: 573 Comm: xfs_growfs Not tainted 6.3.0-rc7-next-20230420-00001-gda8c95746257 Call Trace: <TASK> dump_stack_lvl+0x50/0x70 xfs_error_report+0x93/0xc0 xfs_trans_cancel+0x2c0/0x350 xfs_growfs_data+0x691/0xe90 xfs_file_ioctl+0x5f9/0x14a0 __x64_sys_ioctl+0x13e/0x1c0 do_syscall_64+0x39/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f2d86706577 The bug can be reproduced with the following sequence: # truncate -s 1073741824 xfs_test.img # mkfs.xfs -f -b size=1024 -d agcount=4 xfs_test.img # truncate -s 2305843009213693952 xfs_test.img # mount -o loop xfs_test.img /mnt/test # xfs_growfs -D 1125899907891200 /mnt/test The root cause is that during growfs, user space passed in a large value of newblcoks to xfs_growfs_data_private(), due to current sb_agblocks is too small, new AG count will exceed UINT_MAX. Because of AG number type is unsigned int and it would overflow, that caused nagcount much smaller than the actual value. During AG extent space, delta blocks in xfs_resizefs_init_new_ags() will much larger than the actual value due to incorrect nagcount, even exceed UINT_MAX. This will cause corruption and be detected in __xfs_free_extent. Fix it by growing the filesystem to up to the maximally allowed AGs and not return EINVAL when new AG count overflow. Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> |
||
Christoph Hellwig | b294349993 |
xfs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
Since commit
|
||
Darrick J. Wong | 61d7e8274c |
xfs: drop EXPERIMENTAL tag for large extent counts
This feature has been baking in upstream for ~10mo with no bug reports. It seems to work fine here, let's get rid of the scary warnings? Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> |
||
Darrick J. Wong | 06f3ef6e17 |
xfs: don't deplete the reserve pool when trying to shrink the fs
Every now and then, xfs/168 fails with this logged in dmesg:
Reserve blocks depleted! Consider increasing reserve pool size.
EXPERIMENTAL online shrink feature in use. Use at your own risk!
Per-AG reservation for AG 1 failed. Filesystem may run out of space.
Per-AG reservation for AG 1 failed. Filesystem may run out of space.
Error -28 reserving per-AG metadata reserve pool.
Corruption of in-memory data (0x8) detected at xfs_ag_shrink_space+0x23c/0x3b0 [xfs] (fs/xfs/libxfs/xfs_ag.c:1007). Shutting down filesystem.
It's silly to deplete the reserved blocks pool just to shrink the
filesystem, particularly since the fs goes down after that.
Fixes:
|
||
Christoph Hellwig | 05bdb99653 |
block: replace fmode_t with a block-specific type for block open flags
The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
Christoph Hellwig | 2736e8eeb0 |
block: use the holder as indication for exclusive opens
The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
Christoph Hellwig | 219580eea1 |
iomap: update ki_pos in iomap_file_buffered_write
All callers of iomap_file_buffered_write need to updated ki_pos, move it into common code. Link: https://lkml.kernel.org/r/20230601145904.1385409-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Damien Le Moal <dlemoal@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Christoph Hellwig | 0d625446d0 |
backing_dev: remove current->backing_dev_info
Patch series "cleanup the filemap / direct I/O interaction", v4.
This series cleans up some of the generic write helper calling conventions
and the page cache writeback / invalidation for direct I/O. This is a
spinoff from the no-bufferhead kernel project, for which we'll want to an
use iomap based buffered write path in the block layer.
This patch (of 12):
The last user of current->backing_dev_info disappeared in commit
|
||
Christoph Hellwig | 8067ca1dcd |
xfs: wire up the ->mark_dead holder operation for log and RT devices
Implement a set of holder_ops that shut down the file system when the block device used as log or RT device is removed undeneath the file system. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
Christoph Hellwig | e7caa877e5 |
xfs: wire up sops->shutdown
Wire up the shutdown method to shut down the file system when the underlying block device is marked dead. Add a new message to clearly distinguish this shutdown reason from other shutdowns. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
Christoph Hellwig | 0718afd47f |
block: introduce holder ops
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and installed in the block_device for exclusive claims. It will be used to allow the block layer to call back into the user of the block device for thing like notification of a removed device or a device resize. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
Dave Chinner | d4d12c02bf |
xfs: collect errors from inodegc for unlinked inode recovery
Unlinked list recovery requires errors removing the inode the from the unlinked list get fed back to the main recovery loop. Now that we offload the unlinking to the inodegc work, we don't get errors being fed back when we trip over a corruption that prevents the inode from being removed from the unlinked list. This means we never clear the corrupt unlinked list bucket, resulting in runtime operations eventually tripping over it and shutting down. Fix this by collecting inodegc worker errors and feed them back to the flush caller. This is largely best effort - the only context that really cares is log recovery, and it only flushes a single inode at a time so we don't need complex synchronised handling. Essentially the inodegc workers will capture the first error that occurs and the next flush will gather them and clear them. The flush itself will only report the first gathered error. In the cases where callers can return errors, propagate the collected inodegc flush error up the error handling chain. In the case of inode unlinked list recovery, there are several superfluous calls to flush queued unlinked inodes - xlog_recover_iunlink_bucket() guarantees that it has flushed the inodegc and collected errors before it returns. Hence nothing in the calling path needs to run a flush, even when an error is returned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
Dave Chinner | 7dfee17b13 |
xfs: validate block number being freed before adding to xefi
Bad things happen in defered extent freeing operations if it is passed a bad block number in the xefi. This can come from a bogus agno/agbno pair from deferred agfl freeing, or just a bad fsbno being passed to __xfs_free_extent_later(). Either way, it's very difficult to diagnose where a null perag oops in EFI creation is coming from when the operation that queued the xefi has already been completed and there's no longer any trace of it around.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
Dave Chinner | 3148ebf2c0 |
xfs: validity check agbnos on the AGFL
If the agfl or the indexing in the AGF has been corrupted, getting a block form the AGFL could return an invalid block number. If this happens, bad things happen. Check the agbno we pull off the AGFL and return -EFSCORRUPTED if we find somethign bad. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
Dave Chinner | e0a8de7da3 |
xfs: fix agf/agfl verification on v4 filesystems
When a v4 filesystem has fl_last - fl_first != fl_count, we do not not detect the corruption and allow the AGF to be used as it if was fully valid. On V5 filesystems, we reset the AGFL to empty in these cases and avoid the corruption at a small cost of leaked blocks. If we don't catch the corruption on V4 filesystems, bad things happen later when an allocation attempts to trim the free list and either double-frees stale entries in the AGFl or tries to free NULLAGBNO entries. Either way, this is bad. Prevent this from happening by using the AGFL_NEED_RESET logic for v4 filesysetms, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
Dave Chinner | 1e473279f4 |
xfs: fix double xfs_perag_rele() in xfs_filestream_pick_ag()
xfs_bmap_longest_free_extent() can return an error when accessing
the AGF fails. In this case, the behaviour of
xfs_filestream_pick_ag() is conditional on the error. We may
continue the loop, or break out of it. The error handling after the
loop cleans up the perag reference held when the break occurs. If we
continue, the next loop iteration handles cleaning up the perag
reference.
EIther way, we don't need to release the active perag reference when
xfs_bmap_longest_free_extent() fails. Doing so means we do a double
decrement on the active reference count, and this causes tha active
reference count to fall to zero. At this point, new active
references will fail.
This leads to unmount hanging because it tries to grab active
references to that perag, only for it to fail. This happens inside a
loop that retries until a inode tree radix tree tag is cleared,
which cannot happen because we can't get an active reference to the
perag.
The unmount livelocks in this path:
xfs_reclaim_inodes+0x80/0xc0
xfs_unmount_flush_inodes+0x5b/0x70
xfs_unmountfs+0x5b/0x1a0
xfs_fs_put_super+0x49/0x110
generic_shutdown_super+0x7c/0x1a0
kill_block_super+0x27/0x50
deactivate_locked_super+0x30/0x90
deactivate_super+0x3c/0x50
cleanup_mnt+0xc2/0x160
__cleanup_mnt+0x12/0x20
task_work_run+0x5e/0xa0
exit_to_user_mode_prepare+0x1bc/0x1c0
syscall_exit_to_user_mode+0x16/0x40
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Fixes:
|
||
Darrick J. Wong | 6be73cecb5 |
xfs: fix broken logic when detecting mergeable bmap records
Commit 6bc6c99a944c was a well-intentioned effort to initiate consolidation of adjacent bmbt mapping records by setting the PREEN flag. Consolidation can only happen if the length of the combined record doesn't overflow the 21-bit blockcount field of the bmbt recordset. Unfortunately, the length test is inverted, leading to it triggering on data forks like these: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..16777207]: 76110848..92888055 0 (76110848..92888055) 16777208 1: [16777208..20639743]: 92888056..96750591 0 (92888056..96750591) |
||
Geert Uytterhoeven | 4320f34666 |
xfs: Fix undefined behavior of shift into sign bit
With gcc-5: In file included from ./include/trace/define_trace.h:102:0, from ./fs/xfs/scrub/trace.h:988, from fs/xfs/scrub/trace.c:40: ./fs/xfs/./scrub/trace.h: In function ‘trace_raw_output_xchk_fsgate_class’: ./fs/xfs/scrub/scrub.h:111:28: error: initializer element is not constant #define XREP_ALREADY_FIXED (1 << 31) /* checking our repair work */ ^ Shifting the (signed) value 1 into the sign bit is undefined behavior. Fix this for all definitions in the file by shifting "1U" instead of "1". This was exposed by the first user added in commit |
||
Dave Chinner | 82842fee6e |
xfs: fix AGF vs inode cluster buffer deadlock
Lock order in XFS is AGI -> AGF, hence for operations involving
inode unlinked list operations we always lock the AGI first. Inode
unlinked list operations operate on the inode cluster buffer,
so the lock order there is AGI -> inode cluster buffer.
For O_TMPFILE operations, this now means the lock order set down in
xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
unlinked ops are done before the directory modifications that may
allocate space and lock the AGF.
Unfortunately, we also now lock the inode cluster buffer when
logging an inode so that we can attach the inode to the cluster
buffer and pin it in memory. This creates a lock order of AGF ->
inode cluster buffer in directory operations as we have to log the
inode after we've allocated new space for it.
This creates a lock inversion between the AGF and the inode cluster
buffer. Because the inode cluster buffer is shared across multiple
inodes, the inversion is not specific to individual inodes but can
occur when inodes in the same cluster buffer are accessed in
different orders.
To fix this we need move all the inode log item cluster buffer
interactions to the end of the current transaction. Unfortunately,
xfs_trans_log_inode() calls are littered throughout the transactions
with no thought to ordering against other items or locking. This
makes it difficult to do anything that involves changing the call
sites of xfs_trans_log_inode() to change locking orders.
However, we do now have a mechanism that allows is to postpone dirty
item processing to just before we commit the transaction: the
->iop_precommit method. This will be called after all the
modifications are done and high level objects like AGI and AGF
buffers have been locked and modified, thereby providing a mechanism
that guarantees we don't lock the inode cluster buffer before those
high level objects are locked.
This change is largely moving the guts of xfs_trans_log_inode() to
xfs_inode_item_precommit() and providing an extra flag context in
the inode log item to track the dirty state of the inode in the
current transaction. This also means we do a lot less repeated work
in xfs_trans_log_inode() by only doing it once per transaction when
all the work is done.
Fixes:
|
||
Dave Chinner | cb04211748 |
xfs: defered work could create precommits
To fix a AGI-AGF-inode cluster buffer deadlock, we need to move inode cluster buffer operations to the ->iop_precommit() method. However, this means that deferred operations can require precommits to be run on the final transaction that the deferred ops pass back to xfs_trans_commit() context. This will be exposed by attribute handling, in that the last changes to the inode in the attr set state machine "disappear" because the precommit operation is not run. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
Dave Chinner | 00dcd17cfa |
xfs: restore allocation trylock iteration
It was accidentally dropped when refactoring the allocation code,
resulting in the AG iteration always doing blocking AG iteration.
This results in a small performance regression for a specific fsmark
test that runs more user data writer threads than there are AGs.
Reported-by: kernel test robot <oliver.sang@intel.com>
Fixes:
|
||
Dave Chinner | 89a4bf0dc3 |
xfs: buffer pins need to hold a buffer reference
When a buffer is unpinned by xfs_buf_item_unpin(), we need to access the buffer after we've dropped the buffer log item reference count. This opens a window where we can have two racing unpins for the buffer item (e.g. shutdown checkpoint context callback processing racing with journal IO iclog completion processing) and both attempt to access the buffer after dropping the BLI reference count. If we are unlucky, the "BLI freed" context wins the race and frees the buffer before the "BLI still active" case checks the buffer pin count. This results in a use after free that can only be triggered in active filesystem shutdown situations. To fix this, we need to ensure that buffer existence extends beyond the BLI reference count checks and until the unpin processing is complete. This implies that a buffer pin operation must also take a buffer reference to ensure that the buffer cannot be freed until the buffer unpin processing is complete. Reported-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
Kees Cook | d67790ddf0 |
overflow: Add struct_size_t() helper
While struct_size() is normally used in situations where the structure type already has a pointer instance, there are places where no variable is available. In the past, this has been worked around by using a typed NULL first argument, but this is a bit ugly. Add a helper to do this, and replace the handful of instances of the code pattern with it. Instances were found with this Coccinelle script: @struct_size_t@ identifier STRUCT, MEMBER; expression COUNT; @@ - struct_size((struct STRUCT *)\(0\|NULL\), + struct_size_t(struct STRUCT, MEMBER, COUNT) Suggested-by: Christoph Hellwig <hch@infradead.org> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: Tony Nguyen <anthony.l.nguyen@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: James Smart <james.smart@broadcom.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: HighPoint Linux Team <linux@highpoint-tech.com> Cc: "James E.J. Bottomley" <jejb@linux.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Cc: Don Brace <don.brace@microchip.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: Guo Xuenan <guoxuenan@huawei.com> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Daniel Latypov <dlatypov@google.com> Cc: kernel test robot <lkp@intel.com> Cc: intel-wired-lan@lists.osuosl.org Cc: netdev@vger.kernel.org Cc: linux-nvme@lists.infradead.org Cc: linux-scsi@vger.kernel.org Cc: megaraidlinux.pdl@broadcom.com Cc: storagedev@microchip.com Cc: linux-xfs@vger.kernel.org Cc: linux-hardening@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230522211810.never.421-kees@kernel.org |
||
David Howells | 54919f94ec |
xfs: Provide a splice-read wrapper
Provide a splice_read wrapper for XFS. This does a stat count and a shutdown check before proceeding, then emits a new trace line and locks the inode across the call to filemap_splice_read() and adds to the stats afterwards. Splicing from direct I/O or DAX is handled by the caller. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Darrick J. Wong <djwong@kernel.org> cc: linux-xfs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-block@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20230522135018.2742245-25-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
Linus Torvalds | 849a4f0973 |
xfs: bug fixes for 6.4-rc2
o fixes for inode garbage collection shutdown racing with work queue updates o ensure inodegc workers run on the CPU they are supposed to o disable counter scrubbing until we can exclusively freeze the filesystem from the kernel o Regression fixes for new allocation related bugs o a couple of minor cleanups -----BEGIN PGP SIGNATURE----- iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmRcSIsUHGRhdmlkQGZy b21vcmJpdC5jb20ACgkQregpR/R1+h2Y8xAAxtsTdOx71XtDuNyfBOiqzZgTCq6b 6LsckJIDQa1AXjUNq9G3zWcUcWBcRWcw+CWbkqjqQ9W47K/ijLuoKnjRsQ+5B4DU TBUctVq+/Zk2lBlb6HKuKdzqDGnIFWGVKVd7u8KlowqnXuzUeQ0vFkT7ZHTepUKG P+midgGNVT4+tykq7oH0H8WxoTyNPZhKiAUcZjneBgA60IAoQWHA2iUt+SKpbrkL 1HyK+/edVMTXiDXtyHfXmDaH9Pgy6NCpw3TNkPDhuL1UDpLhg/zgT39rFZGBsAUt gaDM3wN5jBrot/mvJE3rH9bdZhkcf+NQKPx/1DDg3DL8plS/1/LUC4cImdolBJ3w RNmgJv1lK+AlE4MUJ/bUDlEpHUmwAjnnsxBXwEvnYNfj+9V6/mDB+HqKiY7/XxVK vF77s6z+CWvefdnZavJ4/72pVVJNkcDYCYmvh/donRP6vtnwZyzocFUeBeNMInV1 /s3WMrF9hwmJqAClKG7p1fnszWp658yFIuw/TXVs+NrjTtQgXwMpl2cEYYvUZEJN Trq2p0xH/JSwcnOPSPJO6WHb8UPoqrM6lgGFaJVWJx1AWt1i1CFLf5eA5X+XisDV AJKgpqlnDg02bBMQ0tMFGZUaNx/1S1mwtxcZsyEFTutpUNxqJKDaMohpxxrWb0WC ppSqDvyJN4wtlFI= =qok2 -----END PGP SIGNATURE----- Merge tag 'xfs-6.4-rc1-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs bug fixes from Dave Chinner: "Largely minor bug fixes and cleanups, th emost important of which are probably the fixes for regressions in the extent allocation code: - fixes for inode garbage collection shutdown racing with work queue updates - ensure inodegc workers run on the CPU they are supposed to - disable counter scrubbing until we can exclusively freeze the filesystem from the kernel - regression fixes for new allocation related bugs - a couple of minor cleanups" * tag 'xfs-6.4-rc1-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix xfs_inodegc_stop racing with mod_delayed_work xfs: disable reaping in fscounters scrub xfs: check that per-cpu inodegc workers actually run on that cpu xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately xfs: fix negative array access in xfs_getbmap xfs: don't allocate into the data fork for an unshare request xfs: flush dirty data and drain directios before scrubbing cow fork xfs: set bnobt/cntbt numrecs correctly when formatting new AGs xfs: don't unconditionally null args->pag in xfs_bmap_btalloc_at_eof |
||
Darrick J. Wong | 2254a7396a |
xfs: fix xfs_inodegc_stop racing with mod_delayed_work
syzbot reported this warning from the faux inodegc shrinker that tries
to kick off inodegc work:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 102 at kernel/workqueue.c:1445 __queue_work+0xd44/0x1120 kernel/workqueue.c:1444
RIP: 0010:__queue_work+0xd44/0x1120 kernel/workqueue.c:1444
Call Trace:
__queue_delayed_work+0x1c8/0x270 kernel/workqueue.c:1672
mod_delayed_work_on+0xe1/0x220 kernel/workqueue.c:1746
xfs_inodegc_shrinker_scan fs/xfs/xfs_icache.c:2212 [inline]
xfs_inodegc_shrinker_scan+0x250/0x4f0 fs/xfs/xfs_icache.c:2191
do_shrink_slab+0x428/0xaa0 mm/vmscan.c:853
shrink_slab+0x175/0x660 mm/vmscan.c:1013
shrink_one+0x502/0x810 mm/vmscan.c:5343
shrink_many mm/vmscan.c:5394 [inline]
lru_gen_shrink_node mm/vmscan.c:5511 [inline]
shrink_node+0x2064/0x35f0 mm/vmscan.c:6459
kswapd_shrink_node mm/vmscan.c:7262 [inline]
balance_pgdat+0xa02/0x1ac0 mm/vmscan.c:7452
kswapd+0x677/0xd60 mm/vmscan.c:7712
kthread+0x2e8/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
This warning corresponds to this code in __queue_work:
/*
* For a draining wq, only works from the same workqueue are
* allowed. The __WQ_DESTROYING helps to spot the issue that
* queues a new work item to a wq after destroy_workqueue(wq).
*/
if (unlikely(wq->flags & (__WQ_DESTROYING | __WQ_DRAINING) &&
WARN_ON_ONCE(!is_chained_work(wq))))
return;
For this to trip, we must have a thread draining the inodedgc workqueue
and a second thread trying to queue inodegc work to that workqueue.
This can happen if freezing or a ro remount race with reclaim poking our
faux inodegc shrinker and another thread dropping an unlinked O_RDONLY
file:
Thread 0 Thread 1 Thread 2
xfs_inodegc_stop
xfs_inodegc_shrinker_scan
xfs_is_inodegc_enabled
<yes, will continue>
xfs_clear_inodegc_enabled
xfs_inodegc_queue_all
<list empty, do not queue inodegc worker>
xfs_inodegc_queue
<add to list>
xfs_is_inodegc_enabled
<no, returns>
drain_workqueue
<set WQ_DRAINING>
llist_empty
<no, will queue list>
mod_delayed_work_on(..., 0)
__queue_work
<sees WQ_DRAINING, kaboom>
In other words, everything between the access to inodegc_enabled state
and the decision to poke the inodegc workqueue requires some kind of
coordination to avoid the WQ_DRAINING state. We could perhaps introduce
a lock here, but we could also try to eliminate WQ_DRAINING from the
picture.
We could replace the drain_workqueue call with a loop that flushes the
workqueue and queues workers as long as there is at least one inode
present in the per-cpu inodegc llists. We've disabled inodegc at this
point, so we know that the number of queued inodes will eventually hit
zero as long as xfs_inodegc_start cannot reactivate the workers.
There are four callers of xfs_inodegc_start. Three of them come from the
VFS with s_umount held: filesystem thawing, failed filesystem freezing,
and the rw remount transition. The fourth caller is mounting rw (no
remount or freezing possible).
There are three callers ofs xfs_inodegc_stop. One is unmounting (no
remount or thaw possible). Two of them come from the VFS with s_umount
held: fs freezing and ro remount transition.
Hence, it is correct to replace the drain_workqueue call with a loop
that drains the inodegc llists.
Fixes:
|
||
Darrick J. Wong | 2d5f38a319 |
xfs: disable reaping in fscounters scrub
The fscounters scrub code doesn't work properly because it cannot quiesce updates to the percpu counters in the filesystem, hence it returns false corruption reports. This has been fixed properly in one of the online repair patchsets that are under review by replacing the xchk_disable_reaping calls with an exclusive filesystem freeze. Disabling background gc isn't sufficient to fix the problem. In other words, scrub doesn't need to call xfs_inodegc_stop, which is just as well since it wasn't correct to allow scrub to call xfs_inodegc_start when something else could be calling xfs_inodegc_stop (e.g. trying to freeze the filesystem). Neuter the scrubber for now, and remove the xchk_*_reaping functions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
Darrick J. Wong | b37c4c8339 |
xfs: check that per-cpu inodegc workers actually run on that cpu
Now that we've allegedly worked out the problem of the per-cpu inodegc workers being scheduled on the wrong cpu, let's put in a debugging knob to let us know if a worker ever gets mis-scheduled again. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
Darrick J. Wong | 03e0add80f |
xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately
I've been noticing odd racing behavior in the inodegc code that could
only be explained by one cpu adding an inode to its inactivation llist
at the same time that another cpu is processing that cpu's llist.
Preemption is disabled between get/put_cpu_ptr, so the only explanation
is scheduler mayhem. I inserted the following debug code into
xfs_inodegc_worker (see the next patch):
ASSERT(gc->cpu == smp_processor_id());
This assertion tripped during overnight tests on the arm64 machines, but
curiously not on x86_64. I think we haven't observed any resource leaks
here because the lockfree list code can handle simultaneous llist_add
and llist_del_all functions operating on the same list. However, the
whole point of having percpu inodegc lists is to take advantage of warm
memory caches by inactivating inodes on the last processor to touch the
inode.
The incorrect scheduling seems to occur after an inodegc worker is
subjected to mod_delayed_work(). This wraps mod_delayed_work_on with
WORK_CPU_UNBOUND specified as the cpu number. Unbound allows for
scheduling on any cpu, not necessarily the same one that scheduled the
work.
Because preemption is disabled for as long as we have the gc pointer, I
think it's safe to use current_cpu() (aka smp_processor_id) to queue the
delayed work item on the correct cpu.
Fixes:
|
||
Darrick J. Wong | 1bba82fe1a |
xfs: fix negative array access in xfs_getbmap
In commit |
||
Darrick J. Wong | 1f1397b721 |
xfs: don't allocate into the data fork for an unshare request
For an unshare request, we only have to take action if the data fork has a shared mapping. We don't care if someone else set up a cow operation. If we find nothing in the data fork, return a hole to avoid allocating space. Note that fallocate will replace the delalloc reservation with an unwritten extent anyway, so this has no user-visible effects outside of avoiding unnecessary updates. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
Darrick J. Wong | 397b2d7e0f |
xfs: flush dirty data and drain directios before scrubbing cow fork
When we're scrubbing the COW fork, we need to take MMAPLOCK_EXCL to prevent page_mkwrite from modifying any inode state. The ILOCK should suffice to avoid confusing online fsck, but let's take the same locks that we do everywhere else. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
Darrick J. Wong | 8e698ee72c |
xfs: set bnobt/cntbt numrecs correctly when formatting new AGs
Through generic/300, I discovered that mkfs.xfs creates corrupt
filesystems when given these parameters:
# mkfs.xfs -d size=512M /dev/sda -f -d su=128k,sw=4 --unsupported
Filesystems formatted with --unsupported are not supported!!
meta-data=/dev/sda isize=512 agcount=8, agsize=16352 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
data = bsize=4096 blocks=130816, imaxpct=25
= sunit=32 swidth=128 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=8192, version=2
= sectsz=512 sunit=32 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
= rgcount=0 rgsize=0 blks
Discarding blocks...Done.
# xfs_repair -n /dev/sda
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
- 16:30:50: zeroing log - 16320 of 16320 blocks done
- scan filesystem freespace and inode maps...
agf_freeblks 25, counted 0 in ag 4
sb_fdblocks 8823, counted 8798
The root cause of this problem is the numrecs handling in
xfs_freesp_init_recs, which is used to initialize a new AG. Prior to
calling the function, we set up the new bnobt block with numrecs == 1
and rely on _freesp_init_recs to format that new record. If the last
record created has a blockcount of zero, then it sets numrecs = 0.
That last bit isn't correct if the AG contains the log, the start of the
log is not immediately after the initial blocks due to stripe alignment,
and the end of the log is perfectly aligned with the end of the AG. For
this case, we actually formatted a single bnobt record to handle the
free space before the start of the (stripe aligned) log, and incremented
arec to try to format a second record. That second record turned out to
be unnecessary, so what we really want is to leave numrecs at 1.
The numrecs handling itself is overly complicated because a different
function sets numrecs == 1. Change the bnobt creation code to start
with numrecs set to zero and only increment it after successfully
formatting a free space extent into the btree block.
Fixes:
|
||
Darrick J. Wong | b82a5c42a5 |
xfs: don't unconditionally null args->pag in xfs_bmap_btalloc_at_eof
xfs/170 on a filesystem with su=128k,sw=4 produces this splat:
BUG: kernel NULL pointer dereference, address: 0000000000000010
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP
CPU: 1 PID: 4022907 Comm: dd Tainted: G W 6.3.0-xfsx #2 6ebeeffbe9577d32
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-bu
RIP: 0010:xfs_perag_rele+0x10/0x70 [xfs]
RSP: 0018:ffffc90001e43858 EFLAGS: 00010217
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000100
RDX: ffffffffa054e717 RSI: 0000000000000005 RDI: 0000000000000000
RBP: ffff888194eea000 R08: 0000000000000000 R09: 0000000000000037
R10: ffff888100ac1cb0 R11: 0000000000000018 R12: 0000000000000000
R13: ffffc90001e43a38 R14: ffff888194eea000 R15: ffff888194eea000
FS: 00007f93d1a0e740(0000) GS:ffff88843fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 000000018a34f000 CR4: 00000000003506e0
Call Trace:
<TASK>
xfs_bmap_btalloc+0x1a7/0x5d0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
xfs_bmapi_allocate+0xee/0x470 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
xfs_bmapi_write+0x539/0x9e0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
xfs_iomap_write_direct+0x1bb/0x2b0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
xfs_direct_write_iomap_begin+0x51c/0x710 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
iomap_iter+0x132/0x2f0
__iomap_dio_rw+0x2f8/0x840
iomap_dio_rw+0xe/0x30
xfs_file_dio_write_aligned+0xad/0x180 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
xfs_file_write_iter+0xfb/0x190 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
vfs_write+0x2eb/0x410
ksys_write+0x65/0xe0
do_syscall_64+0x2b/0x80
This crash occurs under the "out_low_space" label. We grabbed a perag
reference, passed it via args->pag into xfs_bmap_btalloc_at_eof, and
afterwards args->pag is NULL. Fix the second function not to clobber
args->pag if the caller had passed one in.
Fixes:
|
||
Linus Torvalds | 56c455b38d |
xfs: New code for 6.4
o Added detailed design documentation for the upcoming online repair feature o major update to online scrub to complete the reverse mapping cross-referencing infrastructure enabling us to fully validate allocated metadata against owner records. This is the last piece of scrub infrastructure needed before we can start merging online repair functionality. o Fixes for the ascii-ci hashing issues o deprecation of the ascii-ci functionality o on-disk format verification bug fixes o various random bug fixes for syzbot and other bug reports Signed-off-by: Dave Chinner <david@fromorbit.com> -----BEGIN PGP SIGNATURE----- iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmRMTZ8UHGRhdmlkQGZy b21vcmJpdC5jb20ACgkQregpR/R1+h3XtA//bZYjsYRU3hzyGLKee++5t/zbiqZB KWw8zuPEdEsSAPphK4DQYO7XPWetgFh8iBU39M8TM0+g5YacjzBLGADjQiEv7naN IxSoElQQzZbvMcUPOnuRaoopn0v7pbWIDRo3hKWaRDKCrnMGOnTvDFuC/VX0RAbn GzPimbuvaYJPXTnWTwsKeAuVYP4HLdTh2R1gUMjyY80Ed08hxhCzrXSvjEtuxOOy tDk50wJUhgx7UTgFBsXug1wXLCYwDFvAUjpsBKnmq+vSl0MpI3TdCetmSQbuvAeu gvkRyBMOcqcY5rlozcKPpyXwy7I0ftXOY4xpUSW8H9tAx0oVImkC69DsAjotQV0r r6vEtcw7LgmaS9kbA6G2Z4JfKEHuf2d/6OI4onZh25b5SWq7+qFBPo67AoFg8UQf bKSf3QQNVLTyWqpRf8Z3XOEBygYGsDUuxrm2AA5Aar4t4T3y5oAKFKkf4ZAlAYxH KViQsq0qVcoQ4k4txZgU7XQrftKyu2csqxqtKDozH7FutxscchZEwvjdQ6jnS2+L 2Qlf6On8edfEkPKzF7/1cgxUXCXuTqakFVetChXjZ1/ZFt9LUYphvESdaolJ8Aqz lwEy5UrbC2oMrBDT7qESLWs3U66mPhRaaFfuLUJRyhHN3Y0tVVA2mgNzyD6oBQVy ffIbZ3+1QEPOaOQ= =lBJy -----END PGP SIGNATURE----- Merge tag 'xfs-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs updates from Dave Chinner: "This consists mainly of online scrub functionality and the design documentation for the upcoming online repair functionality built on top of the scrub code: - Added detailed design documentation for the upcoming online repair feature - major update to online scrub to complete the reverse mapping cross-referencing infrastructure enabling us to fully validate allocated metadata against owner records. This is the last piece of scrub infrastructure needed before we can start merging online repair functionality. - Fixes for the ascii-ci hashing issues - deprecation of the ascii-ci functionality - on-disk format verification bug fixes - various random bug fixes for syzbot and other bug reports" * tag 'xfs-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (107 commits) xfs: fix livelock in delayed allocation at ENOSPC xfs: Extend table marker on deprecated mount options table xfs: fix duplicate includes xfs: fix BUG_ON in xfs_getbmap() xfs: verify buffer contents when we skip log replay xfs: _{attr,data}_map_shared should take ILOCK_EXCL until iread_extents is completely done xfs: remove WARN when dquot cache insertion fails xfs: don't consider future format versions valid xfs: deprecate the ascii-ci feature xfs: test the ascii case-insensitive hash xfs: stabilize the dirent name transformation function used for ascii-ci dir hash computation xfs: cross-reference rmap records with refcount btrees xfs: cross-reference rmap records with inode btrees xfs: cross-reference rmap records with free space btrees xfs: cross-reference rmap records with ag btrees xfs: introduce bitmap type for AG blocks xfs: convert xbitmap to interval tree xfs: drop the _safe behavior from the xbitmap foreach macro xfs: don't load local xattr values during scrub xfs: remove the for_each_xbitmap_ helpers ... |
||
Linus Torvalds | 7fa8a8ee94 |
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page(). - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96 eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY= =J+Dj -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ... |
||
Linus Torvalds | 888d3c9f7f |
sysctl-6.4-rc1
This pull request goes with only a few sysctl moves from the kernel/sysctl.c file, the rest of the work has been put towards deprecating two API calls which incur recursion and prevent us from simplifying the registration process / saving memory per move. Most of the changes have been soaking on linux-next since v6.3-rc3. I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's feedback that we should see if we could *save* memory with these moves instead of incurring more memory. We currently incur more memory since when we move a syctl from kernel/sysclt.c out to its own file we end up having to add a new empty sysctl used to register it. To achieve saving memory we want to allow syctls to be passed without requiring the end element being empty, and just have our registration process rely on ARRAY_SIZE(). Without this, supporting both styles of sysctls would make the sysctl registration pretty brittle, hard to read and maintain as can be seen from Meng Tang's efforts to do just this [0]. Fortunately, in order to use ARRAY_SIZE() for all sysctl registrations also implies doing the work to deprecate two API calls which use recursion in order to support sysctl declarations with subdirectories. And so during this development cycle quite a bit of effort went into this deprecation effort. I've annotated the following two APIs are deprecated and in few kernel releases we should be good to remove them: * register_sysctl_table() * register_sysctl_paths() During this merge window we should be able to deprecate and unexport register_sysctl_paths(), we can probably do that towards the end of this merge window. Deprecating register_sysctl_table() will take a bit more time but this pull request goes with a few example of how to do this. As it turns out each of the conversions to move away from either of these two API calls *also* saves memory. And so long term, all these changes *will* prove to have saved a bit of memory on boot. The way I see it then is if remove a user of one deprecated call, it gives us enough savings to move one kernel/sysctl.c out from the generic arrays as we end up with about the same amount of bytes. Since deprecating register_sysctl_table() and register_sysctl_paths() does not require maintainer coordination except the final unexport you'll see quite a bit of these changes from other pull requests, I've just kept the stragglers after rc3. Most of these changes have been soaking on linux-next since around rc3. [0] https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmRHAjQSHG1jZ3JvZkBr ZXJuZWwub3JnAAoJEM4jHQowkoinTzgQAI/uKHKi0VlUR1l2Psl0XbseUVueuyj3 ZDxSJpbVUmsoDf2MlLjzB8mYE3ricnNTDbLr7qOyA6pXdM1N0mY5LQmRVRu8/ffd 2T1hQ5pl7YnJdWP5dPhcF9Y+jnu1tjX1MW5DS4fzllwK7FnD86HuIruGq52RAPS/ /FH+BD9eodLWWXk6A/o2GFqoWxPKQI0GLxEYWa7Hg7yt8E/3PQL9QsRzn8i6U+HW BrN/+G3YD1VCCzXu0UAeXnm+i1Z7CdvqNdZuSkvE3DObiZ5WpOS+/i7FrDB7zdiu zAbHaifHnDPtcK3w2ZodbLAAwEWD/mG4iwIjE2kgIMVYxBv7TFDBRREXAWYAevIT UUuZnWDQsGaWdjywrebaUycEfd6dytKyan0fTXgMFkcoWRjejhitfdM2iZDdQROg q453p4HqOw4vTrhy4ov4zOX7J3EFiBzpZdl+SmLqcXk+jbLVb/Q9snUWz1AFtHBl gHoP5bS82uVktGG3MsObjgTzYYMQjO9YGIrVuW1VP9uWs8WaoWx6M9FQJIIhtwE+ h6wG2s7CjuFWnS0/IxWmDOn91QyUn1w7ohiz9TuvYj/5GLSBpBDGCJHsNB5T2WS1 qbQRaZ2Kg3j9TeyWfXxdlxBx7bt3ni+J/IXDY0zom2sTpGHKl8D2g5AzmEXJDTpl kd7Z3gsmwhDh =0U0W -----END PGP SIGNATURE----- Merge tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "This only does a few sysctl moves from the kernel/sysctl.c file, the rest of the work has been put towards deprecating two API calls which incur recursion and prevent us from simplifying the registration process / saving memory per move. Most of the changes have been soaking on linux-next since v6.3-rc3. I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's feedback that we should see if we could *save* memory with these moves instead of incurring more memory. We currently incur more memory since when we move a syctl from kernel/sysclt.c out to its own file we end up having to add a new empty sysctl used to register it. To achieve saving memory we want to allow syctls to be passed without requiring the end element being empty, and just have our registration process rely on ARRAY_SIZE(). Without this, supporting both styles of sysctls would make the sysctl registration pretty brittle, hard to read and maintain as can be seen from Meng Tang's efforts to do just this [0]. Fortunately, in order to use ARRAY_SIZE() for all sysctl registrations also implies doing the work to deprecate two API calls which use recursion in order to support sysctl declarations with subdirectories. And so during this development cycle quite a bit of effort went into this deprecation effort. I've annotated the following two APIs are deprecated and in few kernel releases we should be good to remove them: - register_sysctl_table() - register_sysctl_paths() During this merge window we should be able to deprecate and unexport register_sysctl_paths(), we can probably do that towards the end of this merge window. Deprecating register_sysctl_table() will take a bit more time but this pull request goes with a few example of how to do this. As it turns out each of the conversions to move away from either of these two API calls *also* saves memory. And so long term, all these changes *will* prove to have saved a bit of memory on boot. The way I see it then is if remove a user of one deprecated call, it gives us enough savings to move one kernel/sysctl.c out from the generic arrays as we end up with about the same amount of bytes. Since deprecating register_sysctl_table() and register_sysctl_paths() does not require maintainer coordination except the final unexport you'll see quite a bit of these changes from other pull requests, I've just kept the stragglers after rc3" Link: https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org [0] * tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (29 commits) fs: fix sysctls.c built mm: compaction: remove incorrect #ifdef checks mm: compaction: move compaction sysctl to its own file mm: memory-failure: Move memory failure sysctls to its own file arm: simplify two-level sysctl registration for ctl_isa_vars ia64: simplify one-level sysctl registration for kdump_ctl_table utsname: simplify one-level sysctl registration for uts_kern_table ntfs: simplfy one-level sysctl registration for ntfs_sysctls coda: simplify one-level sysctl registration for coda_table fs/cachefiles: simplify one-level sysctl registration for cachefiles_sysctls xfs: simplify two-level sysctl registration for xfs_table nfs: simplify two-level sysctl registration for nfs_cb_sysctls nfs: simplify two-level sysctl registration for nfs4_cb_sysctls lockd: simplify two-level sysctl registration for nlm_sysctls proc_sysctl: enhance documentation xen: simplify sysctl registration for balloon md: simplify sysctl registration hv: simplify sysctl registration scsi: simplify sysctl registration with register_sysctl() csky: simplify alignment sysctl registration ... |
||
Dave Chinner | 9419092fb2 |
xfs: fix livelock in delayed allocation at ENOSPC
On a filesystem with a non-zero stripe unit and a large sequential
write, delayed allocation will set a minimum allocation length of
the stripe unit. If allocation fails because there are no extents
long enough for an aligned minlen allocation, it is supposed to
fall back to unaligned allocation which allows single block extents
to be allocated.
When the allocator code was rewritting in the 6.3 cycle, this
fallback was broken - the old code used args->fsbno as the both the
allocation target and the allocation result, the new code passes the
target as a separate parameter. The conversion didn't handle the
aligned->unaligned fallback path correctly - it reset args->fsbno to
the target fsbno on failure which broke allocation failure detection
in the high level code and so it never fell back to unaligned
allocations.
This resulted in a loop in writeback trying to allocate an aligned
block, getting a false positive success, trying to insert the result
in the BMBT. This did nothing because the extent already was in the
BMBT (merge results in an unchanged extent) and so it returned the
prior extent to the conversion code as the current iomap.
Because the iomap returned didn't cover the offset we tried to map,
xfs_convert_blocks() then retries the allocation, which fails in the
same way and now we have a livelock.
Reported-and-tested-by: Brian Foster <bfoster@redhat.com>
Fixes:
|
||
Linus Torvalds | 5b9a7bb72f |
for-6.4/io_uring-2023-04-21
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRCvawQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpiKTEACvp0jm3Lyhxb8RMsx5T6Ko0pFH3DIymiL4 xpoZmAUflOjD0c+99FwHRQqKKXuo3OelhW+YOm0N6OOAt6JMSGmKpZh0UNJx+Fgj wMwiQ0X3Y5SaAsr5ZpXM+G1BV7ajihsMpu8/a718ERB3U3cLDz2qJfnzJh+E5Ip5 pYB4vS3+/FAER2MYQ7IPeovch2wWYtxDPztOxNX6SORu3OvpWiz1GR/+8u0tqj50 ROq97Jwjh5Tl4zP356EUSj/Vkfdr2yb+NlLbun8My5x8tYftZjnrNQ/+qeJNLwB8 tWTrg166ox/VX3aYZruAgUPv0IyGPZg7qZV5R72ChBK3VhIbOOLOCm/V6dhvl/XH vu2FG7J8WylWHmc+OU8u7TeSJdrwxTLs4e2IFUBK9ymAYFp0Q9S924fgvSYsFvVB iNn58SPRIbuA4SPtRfCd7pENtZW/QKfBC5CYK+pjsZVX40c9dbe40foVu4t2/EAo gi9+gSWEUVRRW2osxjaHXh78cW63g0j9bNfS6n1Vy32Oo5Mwm7n+bVWqCU5bCBXI MpPOk6AgME3UPwFzGzSmx+PVw8VacPxYP1NF8RFTCwj7OowFnrolJtruDmKJgXWY BN41EDo41k/C5mEu16Jr9rAkHeVhHaNZ+JhyDrzv8llJ/rv+4zEJw9SrhnpufmOX +YERd/ndAw== =Erfk -----END PGP SIGNATURE----- Merge tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: - Cleanup of the io-wq per-node mapping, notably getting rid of it so we just have a single io_wq entry per ring (Breno) - Followup to the above, move accounting to io_wq as well and completely drop struct io_wqe (Gabriel) - Enable KASAN for the internal io_uring caches (Breno) - Add support for multishot timeouts. Some applications use timeouts to wake someone waiting on completion entries, and this makes it a bit easier to just have a recurring timer rather than needing to rearm it every time (David) - Support archs that have shared cache coloring between userspace and the kernel, and hence have strict address requirements for mmap'ing the ring into userspace. This should only be parisc/hppa. (Helge, me) - XFS has supported O_DIRECT writes without needing to lock the inode exclusively for a long time, and ext4 now supports it as well. This is true for the common cases of not extending the file size. Flag the fs as having that feature, and utilize that to avoid serializing those writes in io_uring (me) - Enable completion batching for uring commands (me) - Revert patch adding io_uring restriction to what can be GUP mapped or not. This does not belong in io_uring, as io_uring isn't really special in this regard. Since this is also getting in the way of cleanups and improvements to the GUP code, get rid of if (me) - A few series greatly reducing the complexity of registered resources, like buffers or files. Not only does this clean up the code a lot, the simplified code is also a LOT more efficient (Pavel) - Series optimizing how we wait for events and run task_work related to it (Pavel) - Fixes for file/buffer unregistration with DEFER_TASKRUN (Pavel) - Misc cleanups and improvements (Pavel, me) * tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux: (71 commits) Revert "io_uring/rsrc: disallow multi-source reg buffers" io_uring: add support for multishot timeouts io_uring/rsrc: disassociate nodes and rsrc_data io_uring/rsrc: devirtualise rsrc put callbacks io_uring/rsrc: pass node to io_rsrc_put_work() io_uring/rsrc: inline io_rsrc_put_work() io_uring/rsrc: add empty flag in rsrc_node io_uring/rsrc: merge nodes and io_rsrc_put io_uring/rsrc: infer node from ctx on io_queue_rsrc_removal io_uring/rsrc: remove unused io_rsrc_node::llist io_uring/rsrc: refactor io_queue_rsrc_removal io_uring/rsrc: simplify single file node switching io_uring/rsrc: clean up __io_sqe_buffers_update() io_uring/rsrc: inline switch_start fast path io_uring/rsrc: remove rsrc_data refs io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce io_uring/rsrc: use wq for quiescing io_uring/rsrc: refactor io_rsrc_ref_quiesce io_uring/rsrc: remove io_rsrc_node::done io_uring/rsrc: use nospec'ed indexes ... |
||
Linus Torvalds | 7bcff5a396 |
v6.4/vfs.acl
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEEhwgAKCRCRxhvAZXjc otwgAQDXHnKiPm/d76lITXbxdUNCtvZz+ig26EbOrD+vEszzIQEA81dru0QbCNCt ctoZdcsmtKbt2VaYQF1CDOhlnNg5VQM= =pER1 -----END PGP SIGNATURE----- Merge tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull acl updates from Christian Brauner: "After finishing the introduction of the new posix acl api last cycle the generic POSIX ACL xattr handlers are still around in the filesystems xattr handlers for two reasons: (1) Because a few filesystems rely on the ->list() method of the generic POSIX ACL xattr handlers in their ->listxattr() inode operation. (2) POSIX ACLs are only available if IOP_XATTR is raised. The IOP_XATTR flag is raised in inode_init_always() based on whether the sb->s_xattr pointer is non-NULL. IOW, the registered xattr handlers of the filesystem are used to raise IOP_XATTR. Removing the generic POSIX ACL xattr handlers from all filesystems would risk regressing filesystems that only implement POSIX ACL support and no other xattrs (nfs3 comes to mind). This contains the work to decouple POSIX ACLs from the IOP_XATTR flag as they don't depend on xattr handlers anymore. So it's now possible to remove the generic POSIX ACL xattr handlers from the sb->s_xattr list of all filesystems. This is a crucial step as the generic POSIX ACL xattr handlers aren't used for POSIX ACLs anymore and POSIX ACLs don't depend on the xattr infrastructure anymore. Adressing problem (1) will require more long-term work. It would be best to get rid of the ->list() method of xattr handlers completely at some point. For erofs, ext{2,4}, f2fs, jffs2, ocfs2, and reiserfs the nop POSIX ACL xattr handler is kept around so they can continue to use array-based xattr handler indexing. This update does simplify the ->listxattr() implementation of all these filesystems however" * tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: acl: don't depend on IOP_XATTR ovl: check for ->listxattr() support reiserfs: rework priv inode handling fs: rename generic posix acl handlers reiserfs: rework ->listxattr() implementation fs: simplify ->listxattr() implementation fs: drop unused posix acl handlers xattr: remove unused argument xattr: add listxattr helper xattr: simplify listxattr helpers |
||
Dave Chinner | 422d56536f |
xfs: fix duplicate includes
Header files were already included, just not in the normal order.
Remove the duplicates, preserving normal order. Also move xfs_ag.h
include to before the scrub internal includes which are normally
last in the include list.
Fixes:
|
||
Yosry Ahmed | c7b23b68e2 |
mm: vmscan: refactor updating current->reclaim_state
During reclaim, we keep track of pages reclaimed from other means than LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab, which we stash a pointer to in current task_struct. However, we keep track of more than just reclaimed slab pages through this. We also use it for clean file pages dropped through pruned inodes, and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add a helper function that wraps updating it through current, so that future changes to this logic are contained within include/linux/swap.h. Link: https://lkml.kernel.org/r/20230413104034.1086717-4-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Dave Chinner | 798352cb25 |
xfs: fix ascii-ci problems, then kill it [v2]
Last week, I was fiddling around with the metadump name obfuscation code while writing a debugger command to generate directories full of names that all have the same hash name. I had a few questions about how well all that worked with ascii-ci mode, and discovered a nasty discrepancy between the kernel and glibc's implementations of the tolower() function. I discovered that I could create a directory that is large enough to require separate leaf index blocks. The hashes stored in the dabtree use the ascii-ci specific hash function, which uses a library function to convert the name to lowercase before hashing. If the kernel and C library's versions of tolower do not behave exactly identically, xfs_ascii_ci_hashname will not produce the same results for the same inputs. xfs_repair will deem the leaf information corrupt and rebuild the directory. After that, lookups in the kernel will fail because the hash index doesn't work. The kernel's tolower function will convert extended ascii uppercase letters (e.g. A-with-umlaut) to extended ascii lowercase letters (e.g. a-with-umlaut), whereas glibc's will only do that if you force LANG to ascii. Tiny embedded libc implementations just plain won't do it at all, and the result is a mess. Stabilize the behavior of the hash function by encoding the name transformation function in libxfs, add it to the selftest, and fix all the userspace tools, none of which handle this transformation correctly. The v1 series generated a /lot/ of discussion, in which several things became very clear: (1) Linus is not enamored of case folding of any kind; (2) Dave and Christoph don't seem to agree on whether the feature is supposed to work for 7-bit ascii or latin1; (3) it trashes UTF8 encoded names if those happen to show up; and (4) I don't want to maintain this mess any longer than I have to. Kill it in 2030. v2: rename the functions to make it clear we're moving away from the letters t, o, l, o, w, e, and r; and deprecate the whole feature once we've fixed the bugs and added tests. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdqwAKCRBKO3ySh0YR pi33AQC4MFCz0uP1aF64zRgE+wtU2YBGw5cGps7nWIljVptbkAEAubfoY88wAop8 /KHIgZ8pHIb7ooPrYKpPZL5m0udtMw8= =3Up6 -----END PGP SIGNATURE----- Merge tag 'fix-asciici-bugs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: fix ascii-ci problems, then kill it [v2] Last week, I was fiddling around with the metadump name obfuscation code while writing a debugger command to generate directories full of names that all have the same hash name. I had a few questions about how well all that worked with ascii-ci mode, and discovered a nasty discrepancy between the kernel and glibc's implementations of the tolower() function. I discovered that I could create a directory that is large enough to require separate leaf index blocks. The hashes stored in the dabtree use the ascii-ci specific hash function, which uses a library function to convert the name to lowercase before hashing. If the kernel and C library's versions of tolower do not behave exactly identically, xfs_ascii_ci_hashname will not produce the same results for the same inputs. xfs_repair will deem the leaf information corrupt and rebuild the directory. After that, lookups in the kernel will fail because the hash index doesn't work. The kernel's tolower function will convert extended ascii uppercase letters (e.g. A-with-umlaut) to extended ascii lowercase letters (e.g. a-with-umlaut), whereas glibc's will only do that if you force LANG to ascii. Tiny embedded libc implementations just plain won't do it at all, and the result is a mess. Stabilize the behavior of the hash function by encoding the name transformation function in libxfs, add it to the selftest, and fix all the userspace tools, none of which handle this transformation correctly. The v1 series generated a /lot/ of discussion, in which several things became very clear: (1) Linus is not enamored of case folding of any kind; (2) Dave and Christoph don't seem to agree on whether the feature is supposed to work for 7-bit ascii or latin1; (3) it trashes UTF8 encoded names if those happen to show up; and (4) I don't want to maintain this mess any longer than I have to. Kill it in 2030. v2: rename the functions to make it clear we're moving away from the letters t, o, l, o, w, e, and r; and deprecate the whole feature once we've fixed the bugs and added tests. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
Dave Chinner | b89116c2fb |
xfs: strengthen rmapbt scrubbing [v24.5]
This series strengthens space allocation record cross referencing by using AG block bitmaps to compute the difference between space used according to the rmap records and the primary metadata, and reports cross-referencing errors for any discrepancies. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdqgAKCRBKO3ySh0YR pgQ8AP4uxtGQY7RXaJ6aKXTwqarZO9vwZyQiL/nFo1onGCHv4wD/cIwn7hvJmbnF 7avZMCsYJlVFNAJUFZuACmz0v4PwWQ8= =KqDr -----END PGP SIGNATURE----- Merge tag 'scrub-strengthen-rmap-checking-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: strengthen rmapbt scrubbing [v24.5] This series strengthens space allocation record cross referencing by using AG block bitmaps to compute the difference between space used according to the rmap records and the primary metadata, and reports cross-referencing errors for any discrepancies. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
Dave Chinner | 43223ef72e |
xfs: rework online fsck incore bitmap [v24.5]
In this series, we make some changes to the incore bitmap code: First, we shorten the prefix to 'xbitmap'. Then, we rework some utility functions for later use by online repair and clarify how the walk functions are supposed to be used. Finally, we use all these new pieces to convert the incore bitmap to use an interval tree instead of linked lists. This lifts the limitation that callers had to be careful not to set a range that was already set; and gets us ready for the btree rebuilder functions needing to be able to set bits in a bitmap and generate maximal contiguous extents for the set ranges. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdqQAKCRBKO3ySh0YR pmoIAQDszM48pGVRa7zh942ACCa+CbPABFyKDaVePwhccnX2/QD/a9gi4lmEoPVw 915wIjWbk5rzAjZPp2k3IwYkEf4v2wI= =QCWU -----END PGP SIGNATURE----- Merge tag 'repair-bitmap-rework-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: rework online fsck incore bitmap [v24.5] In this series, we make some changes to the incore bitmap code: First, we shorten the prefix to 'xbitmap'. Then, we rework some utility functions for later use by online repair and clarify how the walk functions are supposed to be used. Finally, we use all these new pieces to convert the incore bitmap to use an interval tree instead of linked lists. This lifts the limitation that callers had to be careful not to set a range that was already set; and gets us ready for the btree rebuilder functions needing to be able to set bits in a bitmap and generate maximal contiguous extents for the set ranges. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com> |