Commit graph

5422 commits

Author SHA1 Message Date
Linus Torvalds 7d3bf613e9 libnvdimm for 4.18
* DAX broke a fundamental assumption of truncate of file mapped pages.
   The truncate path assumed that it is safe to disconnect a pinned page
   from a file and let the filesystem reclaim the physical block. With DAX
   the page is equivalent to the filesystem block. Introduce
   dax_layout_busy_page() to enable filesystems to wait for pinned DAX
   pages to be released. Without this wait a filesystem could allocate
   blocks under active device-DMA to a new file.
 
 * DAX arranges for the block layer to be bypassed and uses
   dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
   However, the memcpy_mcsafe() facility is available through the pmem
   block driver. In order to safely handle media errors, via the DAX
   block-layer bypass, introduce copy_to_iter_mcsafe().
 
 * Fix cache management policy relative to the ACPI NFIT Platform
   Capabilities Structure to properly elide cache flushes when they are not
   necessary. The table indicates whether CPU caches are power-fail
   protected. Clarify that a deep flush is always performed on
   REQ_{FUA,PREFLUSH} requests.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbGxI7AAoJEB7SkWpmfYgCDjsP/2Lcibu9Kf4tKIzuInsle6iE
 6qP29qlkpHVTpDKbhvIxTYTYL9sMU0DNUrpPCJR/EYdeyztLWDFC5EAT1wF240vf
 maV37s/uP331jSC/2VJnKWzBs2ztQxmKLEIQCxh6aT0qs9cbaOvJgB/WlVu+qtsl
 aGJFLmb6vdQacp31noU5plKrMgMA1pADyF5qx9I9K2HwowHE7T368ZEFS/3S//c3
 LXmpx/Nfq52sGu/qbRbu6B1CTJhIGhmarObyQnvBYoKntK1Ov4e8DS95wD3EhNDe
 FuRkOCUKhjl6cFy7QVWh1ct1bFm84ny+b4/AtbpOmv9l/+0mveJ7e+5mu8HQTifT
 wYiEe2xzXJ+OG/xntv8SvlZKMpjP3BqI0jYsTutsjT4oHrciiXdXM186cyS+BiGp
 KtFmWyncQJgfiTq6+Hj5XpP9BapNS+OYdYgUagw9ZwzdzptuGFYUMSVOBrYrn6c/
 fwqtxjubykJoW0P3pkIoT91arFSea7nxOKnGwft06imQ7TwR4ARsI308feQ9itJq
 2P2e7/20nYMsw2aRaUDDA70Yu+Lagn1m8WL87IybUGeUDLb1BAkjphAlWa6COJ+u
 PhvAD2tvyM9m0c7O5Mytvz7iWKG6SVgatoAyOPkaeplQK8khZ+wEpuK58sO6C1w8
 4GBvt9ri9i/Ww/A+ppWs
 =4bfw
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This adds a user for the new 'bytes-remaining' updates to
  memcpy_mcsafe() that you already received through Ingo via the
  x86-dax- for-linus pull.

  Not included here, but still targeting this cycle, is support for
  handling memory media errors (poison) consumed via userspace dax
  mappings.

  Summary:

   - DAX broke a fundamental assumption of truncate of file mapped
     pages. The truncate path assumed that it is safe to disconnect a
     pinned page from a file and let the filesystem reclaim the physical
     block. With DAX the page is equivalent to the filesystem block.
     Introduce dax_layout_busy_page() to enable filesystems to wait for
     pinned DAX pages to be released. Without this wait a filesystem
     could allocate blocks under active device-DMA to a new file.

   - DAX arranges for the block layer to be bypassed and uses
     dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
     However, the memcpy_mcsafe() facility is available through the pmem
     block driver. In order to safely handle media errors, via the DAX
     block-layer bypass, introduce copy_to_iter_mcsafe().

   - Fix cache management policy relative to the ACPI NFIT Platform
     Capabilities Structure to properly elide cache flushes when they
     are not necessary. The table indicates whether CPU caches are
     power-fail protected. Clarify that a deep flush is always performed
     on REQ_{FUA,PREFLUSH} requests"

* tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
  dax: Use dax_write_cache* helpers
  libnvdimm, pmem: Do not flush power-fail protected CPU caches
  libnvdimm, pmem: Unconditionally deep flush on *sync
  libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
  acpi, nfit: Remove ecc_unit_size
  dax: dax_insert_mapping_entry always succeeds
  libnvdimm, e820: Register all pmem resources
  libnvdimm: Debug probe times
  linvdimm, pmem: Preserve read-only setting for pmem devices
  x86, nfit_test: Add unit test for memcpy_mcsafe()
  pmem: Switch to copy_to_iter_mcsafe()
  dax: Report bytes remaining in dax_iomap_actor()
  dax: Introduce a ->copy_to_iter dax operation
  uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
  xfs, dax: introduce xfs_break_dax_layouts()
  xfs: prepare xfs_break_layouts() for another layout type
  xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
  mm, fs, dax: handle layout changes to pinned dax mappings
  mm: fix __gup_device_huge vs unmap
  mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
  ...
2018-06-08 17:21:52 -07:00
Dan Williams 930218affe Merge branch 'for-4.18/mcsafe' into libnvdimm-for-next 2018-06-08 15:16:44 -07:00
Linus Torvalds a3818841bd for-linus-20180608
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlsa4sQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqPNEADbZby01Q6i+dTZYIosz5+/gq8gSkCcpQ/T
 krK/f2MlwD7Rdog1BnGNNP5XOqK8pKIGdARL1FQKpViii6xGIoOc2F4VK+vO44yR
 LI+BeeOM6rWNOAoBO4CqeZz/Fv5IYi7KURWogYhZMqrxBqT2OeD9MMowm5NulBix
 YZ2ttFWiTJScJttJCDPE6cu9EjHDeK63Nr7+UU80k3atU4eUpUp1mRFGmtaYWulq
 l3KaENCwm00WCVqM4i/gVWr2AkgTZqAAyeCx7IrPsrQrCMEhxEpMnU52e2kXSxhM
 Qx6FLNEOjzARuBDurtfJE74usQcW2xDLzT8fh2UStnPpt6S/JX6f9GMBVk0G7I8B
 8COF4DF+bzdbhhz2SiZaTFOmDML5H1iQ8t6lTTms0Bnq29mE3E4QFom8lO+2BxN3
 g6PFhvYaOkhTVtV5BPXpXs9xZBLHrv5G/JopXsZh0RF1kpiova+nfA1K2uJPFpJ0
 NcHuMZKmIG3uBqY3fj5Ul+zuVhZ/1v8B69zWoSWafLrk+VRdcEAniuY2E6SsQFP5
 gV4GNja85S53DnlIVwEUXPYMiY6opiwP53yMNMvkB/FdzaQB5Ehdif2fhZu64QmE
 TtqbHtAuV0VZ3z4GrJ3XNbV6Np4wMOhYls4lTkZsnqNNO2sw/eoTYcmwxLDEYOQw
 uQ9rhZh4IQ==
 =N3BP
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180608' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few fixes for this merge window, where some of them should go in
  sooner rather than later, hence a new pull this week. This pull
  request contains:

   - Set of NVMe fixes, mostly follow up cleanups/fixes to the queue
     changes, but also teardown/removal and misc changes (Christop/Dan/
     Johannes/Sagi/Steve).

   - Two lightnvm fixes for issues that showed up in this window
     (Colin/Wei).

   - Failfast/driver flags inheritance for flush requests (Hannes).

   - The md device put sanitization and fix (Kent).

   - dm bio_set inheritance fix (me).

   - nbd discard granularity fix (Josef).

   - nbd consistency in command printing (Kevin).

   - Loop recursion validation fix (Ted).

   - Partition overlap check (Wang)"

[ .. and now my build is warning-free again thanks to the md fix  - Linus ]

* tag 'for-linus-20180608' of git://git.kernel.dk/linux-block: (22 commits)
  nvme: cleanup double shift issue
  nvme-pci: make CMB SQ mod-param read-only
  nvme-pci: unquiesce dead controller queues
  nvme-pci: remove HMB teardown on reset
  nvme-pci: queue creation fixes
  nvme-pci: remove unnecessary completion doorbell check
  nvme-pci: remove unnecessary nested locking
  nvmet: filter newlines from user input
  nvme-rdma: correctly check for target keyed sgl support
  nvme: don't hold nvmf_transports_rwsem for more than transport lookups
  nvmet: return all zeroed buffer when we can't find an active namespace
  md: Unify mddev destruction paths
  dm: use bioset_init_from_src() to copy bio_set
  block: add bioset_init_from_src() helper
  block: always set partition number to '0' in blk_partition_remap()
  block: pass failfast and driver-specific flags to flush requests
  nbd: set discard_alignment to the granularity
  nbd: Consistently use request pointer in debug messages.
  block: add verifier for cmdline partition
  lightnvm: pblk: fix resource leak of invalid_bitmap
  ...
2018-06-08 13:36:19 -07:00
Mikulas Patocka 48debafe4f dm: add writecache target
The writecache target caches writes on persistent memory or SSD.
It is intended for databases or other programs that need extremely low
commit latency.

The writecache target doesn't cache reads because reads are supposed to
be cached in page cache in normal RAM.

If persistent memory isn't available this target can still be used in
SSD mode.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com> # fix missing goto
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> # fix compilation issue with !DAX
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> # use msecs_to_jiffies
Acked-by: Dan Williams <dan.j.williams@intel.com> # reworks to unify ARM and x86 flushing
Signed-off-by: Mike Snitzer <msnitzer@redhat.com>
2018-06-08 11:59:51 -04:00
Mike Snitzer 72d711c876 dm: adjust structure members to improve alignment
Eliminate most holes in DM data structures that were modified by
commit 6f1c819c21 ("dm: convert to bioset_init()/mempool_init()").
Also prevent structure members from unnecessarily spanning cache
lines.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-08 11:53:14 -04:00
Kent Overstreet 28dec870aa md: Unify mddev destruction paths
Previously, mddev_put() had a couple different paths for freeing a
mddev, due to the fact that the kobject wasn't initialized when the
mddev was first allocated. If we move the kobject_init() to when it's
first allocated and just use kobject_add() later, we can clean all this
up.

This also removes a hack in mddev_put() to avoid freeing biosets under a
spinlock, which involved copying biosets on the stack after the reset
bioset_init() changes.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-08 08:41:17 -06:00
Mike Snitzer b2b04e7e2d dm: report which conflicting type caused error during table_load()
Eases troubleshooting to know the before vs after types.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-08 09:50:15 -04:00
Jens Axboe 2a2a4c510b dm: use bioset_init_from_src() to copy bio_set
We can't just copy and clear a bio_set, use the bio helper to
setup a new bio_set with the settings from another one.

Fixes: 6f1c819c21 ("dm: convert to bioset_init()/mempool_init()")
Reported-by: Venkat R.B <vrbagal1@linux.vnet.ibm.com>
Tested-by: Venkat R.B <vrbagal1@linux.vnet.ibm.com>
Tested-by: Li Wang <liwang@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-08 07:06:29 -06:00
Linus Torvalds 2857676045 - Introduce arithmetic overflow test helper functions (Rasmus)
- Use overflow helpers in 2-factor allocators (Kees, Rasmus)
 - Introduce overflow test module (Rasmus, Kees)
 - Introduce saturating size helper functions (Matthew, Kees)
 - Treewide use of struct_size() for allocators (Kees)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsYJ1gWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlCTEACwdEeriAd2VwxknnsstojGD/3g
 8TTFA19vSu4Gxa6WiDkjGoSmIlfhXTlZo1Nlmencv16ytSvIVDNLUIB3uDxUIv1J
 2+dyHML9JpXYHHR7zLXXnGFJL0wazqjbsD3NYQgXqmun7EVVYnOsAlBZ7h/Lwiej
 jzEJd8DaHT3TA586uD3uggiFvQU0yVyvkDCDONIytmQx+BdtGdg9TYCzkBJaXuDZ
 YIthyKDvxIw5nh/UaG3L+SKo73tUr371uAWgAfqoaGQQCWe+mxnWL4HkCKsjFzZL
 u9ouxxF/n6pij3E8n6rb0i2fCzlsTDdDF+aqV1rQ4I4hVXCFPpHUZgjDPvBWbj7A
 m6AfRHVNnOgI8HGKqBGOfViV+2kCHlYeQh3pPW33dWzy/4d/uq9NIHKxE63LH+S4
 bY3oO2ela8oxRyvEgXLjqmRYGW1LB/ZU7FS6Rkx2gRzo4k8Rv+8K/KzUHfFVRX61
 jEbiPLzko0xL9D53kcEn0c+BhofK5jgeSWxItdmfuKjLTW4jWhLRlU+bcUXb6kSS
 S3G6aF+L+foSUwoq63AS8QxCuabuhreJSB+BmcGUyjthCbK/0WjXYC6W/IJiRfBa
 3ZTxBC/2vP3uq/AGRNh5YZoxHL8mSxDfn62F+2cqlJTTKR/O+KyDb1cusyvk3H04
 KCDVLYPxwQQqK1Mqig==
 =/3L8
 -----END PGP SIGNATURE-----

Merge tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull overflow updates from Kees Cook:
 "This adds the new overflow checking helpers and adds them to the
  2-factor argument allocators. And this adds the saturating size
  helpers and does a treewide replacement for the struct_size() usage.
  Additionally this adds the overflow testing modules to make sure
  everything works.

  I'm still working on the treewide replacements for allocators with
  "simple" multiplied arguments:

     *alloc(a * b, ...) -> *alloc_array(a, b, ...)

  and

     *zalloc(a * b, ...) -> *calloc(a, b, ...)

  as well as the more complex cases, but that's separable from this
  portion of the series. I expect to have the rest sent before -rc1
  closes; there are a lot of messy cases to clean up.

  Summary:

   - Introduce arithmetic overflow test helper functions (Rasmus)

   - Use overflow helpers in 2-factor allocators (Kees, Rasmus)

   - Introduce overflow test module (Rasmus, Kees)

   - Introduce saturating size helper functions (Matthew, Kees)

   - Treewide use of struct_size() for allocators (Kees)"

* tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  treewide: Use struct_size() for devm_kmalloc() and friends
  treewide: Use struct_size() for vmalloc()-family
  treewide: Use struct_size() for kmalloc()-family
  device: Use overflow helpers for devm_kmalloc()
  mm: Use overflow helpers in kvmalloc()
  mm: Use overflow helpers in kmalloc_array*()
  test_overflow: Add memory allocation overflow tests
  overflow.h: Add allocation size calculation helpers
  test_overflow: Report test failures
  test_overflow: macrofy some more, do more tests for free
  lib: add runtime test of check_*_overflow functions
  compiler.h: enable builtin overflow checkers and add fallback code
2018-06-06 17:27:14 -07:00
Kees Cook acafe7e302 treewide: Use struct_size() for kmalloc()-family
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
uses. It was done via automatic conversion with manual review for the
"CHECKME" non-standard cases noted below, using the following Coccinelle
script:

// pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
//                      sizeof *pkey_cache->table, GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// Same pattern, but can't trivially locate the trailing element name,
// or variable name.
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
expression SOMETHING, COUNT, ELEMENT;
@@

- alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
+ alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-06 11:15:43 -07:00
Kees Cook 610b15c50e overflow.h: Add allocation size calculation helpers
In preparation for replacing unchecked overflows for memory allocations,
this creates helpers for the 3 most common calculations:

array_size(a, b): 2-dimensional array
array3_size(a, b, c): 3-dimensional array
struct_size(ptr, member, n): struct followed by n-many trailing members

Each of these return SIZE_MAX on overflow instead of wrapping around.

(Additionally renames a variable named "array_size" to avoid future
collision.)

Co-developed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-05 12:16:51 -07:00
Linus Torvalds 25d80be86c Refactors rslib and callers to provide a per-instance allocation area
instead of performing VLAs on the stack.
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsUv+MWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsL4D/0arl2+HZtG3GV6zMHbH1T3J0a3
 mZojEEZqgByx0vlN/Ai57YK4ADHKlF3oYqpXamc718iAmDfKgvbdG+IPv6hg0MQR
 c6F+ogSbcZ38vY7t08Iaa9cBQcKFJwJGAbYXyKV6dKp0ddPqNBBxHG0CLSWL+3DE
 ZxKRKQFQKZWJR587CjA9geZOImgYJG8Jq9Ue4fp1fKacBEBiZTtI/Y8C4XlrxZz6
 6p7BdarMXKqxW8B8vpE4xfCCkm8QGAfEuAXzMrqFLZhj48bg7+Tqd3d4AHm28hMp
 Q2eHyS9j4wWN+OxA1R+VIuzciz/vfWZJi1N7zSV8LznPPLICX8Vl4gXw+MkrELYV
 vzZRlRH67E993bsj0Xp65w7Zwc3MJKVfJOUq+3qR3I1JytCSOBsLs2w/VE2bzpeO
 jFD4gUJxYnl5s2UnCiHVgUzbgXE/n+JQS3Acf71qfAsb+Ld2/q9Tyjx7p7sm5IfB
 m/lokrSgBajtHPZfB6pQV2AvaoLIbWW9Lf1p84zA6XAwdAnAESffUGQHEK/W02a7
 ftwxXnOvjgRZJ/lUfOb0GQDgQniXO7cN7AcmKAfL+DNZb0j86esR6PuZDCIeDhTJ
 KcTVcwPtFnI4W+H3vagh+mLHf7bL/RgLngjUZd6PHqS+cVQfKFiNe43zlRl9+eRj
 U20+w5VfwSkrmAgfSw==
 =Z1if
 -----END PGP SIGNATURE-----

Merge tag 'rslib-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull reed-salomon library updates from Kees Cook:
 "Refactors rslib and callers to provide a per-instance allocation area
  instead of performing VLAs on the stack"

* tag 'rslib-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  rslib: Allocate decoder buffers to avoid VLAs
  mtd: rawnand: diskonchip: Allocate rs control per instance
  rslib: Split rs control struct
  rslib: Simplify error path
  rslib: Remove GPL boilerplate
  rslib: Add SPDX identifiers
  rslib: Cleanup top level comments
  rslib: Cleanup whitespace damage
  dm/verity_fec: Use GFP aware reed solomon init
  rslib: Add GFP aware init function
2018-06-05 10:48:05 -07:00
Kent Overstreet d377535405 dm: Use kzalloc for all structs with embedded biosets/mempools
mempool_init()/bioset_init() require that the mempools/biosets be zeroed
first; they probably should not _require_ this, but not allocating those
structs with kzalloc is a fairly nonsensical thing to do (calling
mempool_exit()/bioset_exit() on an uninitialized mempool/bioset is legal
and safe, but only works if said memory was zeroed.)

Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-05 08:47:43 -06:00
Linus Torvalds f459c34538 for-4.18/block-20180603
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJbFIrHAAoJEPfTWPspceCm2+kQAKo7o7HL30aRxJYu+gYafkuW
 PV47zr3e4vhMDEzDaMsh1+V7I7bm3uS+NZu6cFbcV+N9KXFpeb4V4Hvvm5cs+OC3
 WCOBi4eC1h4qnDQ3ZyySrCMN+KHYJ16pZqddEjqw+fhVudx8i+F+jz3Y4ZMDDc3q
 pArKZvjKh2wEuYXUMFTjaXY46IgPt+er94OwvrhyHk+4AcA+Q/oqSfSdDahUC8jb
 BVR3FV4I3NOHUaru0RbrUko13sVZSboWPCIFrlTDz8xXcJOnVHzdVS1WLFDXLHnB
 O8q9cADCfa4K08kz68RxykcJiNxNvz5ChDaG0KloCFO+q1tzYRoXLsfaxyuUDg57
 Zd93OFZC6hAzXdhclDFIuPET9OQIjDzwphodfKKmDsm3wtyOtydpA0o7JUEongp0
 O1gQsEfYOXmQsXlo8Ot+Z7Ne/HvtGZ91JahUa/59edxQbcKaMrktoyQsQ/d1nOEL
 4kXID18wPcFHWRQHYXyVuw6kbpRtQnh/U2m1eenSZ7tVQHwoe6mF3cfSf5MMseak
 k8nAnmsfEvOL4Ar9ftg61GOrImaQlidxOC2A8fmY5r0Sq/ZldvIFIZizsdTTCcni
 8SOTxcQowyqPf5NvMNQ8cKqqCJap3ppj4m7anZNhbypDIF2TmOWsEcXcMDn4y9on
 fax14DPLo59gBRiPCn5f
 =nga/
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - clean up how we pass around gfp_t and
   blk_mq_req_flags_t (Christoph)

 - prepare us to defer scheduler attach (Christoph)

 - clean up drivers handling of bounce buffers (Christoph)

 - fix timeout handling corner cases (Christoph/Bart/Keith)

 - bcache fixes (Coly)

 - prep work for bcachefs and some block layer optimizations (Kent).

 - convert users of bio_sets to using embedded structs (Kent).

 - fixes for the BFQ io scheduler (Paolo/Davide/Filippo)

 - lightnvm fixes and improvements (Matias, with contributions from Hans
   and Javier)

 - adding discard throttling to blk-wbt (me)

 - sbitmap blk-mq-tag handling (me/Omar/Ming).

 - remove the sparc jsflash block driver, acked by DaveM.

 - Kyber scheduler improvement from Jianchao, making it more friendly
   wrt merging.

 - conversion of symbolic proc permissions to octal, from Joe Perches.
   Previously the block parts were a mix of both.

 - nbd fixes (Josef and Kevin Vigor)

 - unify how we handle the various kinds of timestamps that the block
   core and utility code uses (Omar)

 - three NVMe pull requests from Keith and Christoph, bringing AEN to
   feature completeness, file backed namespaces, cq/sq lock split, and
   various fixes

 - various little fixes and improvements all over the map

* tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-block: (196 commits)
  blk-mq: update nr_requests when switching to 'none' scheduler
  block: don't use blocking queue entered for recursive bio submits
  dm-crypt: fix warning in shutdown path
  lightnvm: pblk: take bitmap alloc. out of critical section
  lightnvm: pblk: kick writer on new flush points
  lightnvm: pblk: only try to recover lines with written smeta
  lightnvm: pblk: remove unnecessary bio_get/put
  lightnvm: pblk: add possibility to set write buffer size manually
  lightnvm: fix partial read error path
  lightnvm: proper error handling for pblk_bio_add_pages
  lightnvm: pblk: fix smeta write error path
  lightnvm: pblk: garbage collect lines with failed writes
  lightnvm: pblk: rework write error recovery path
  lightnvm: pblk: remove dead function
  lightnvm: pass flag on graceful teardown to targets
  lightnvm: pblk: check for chunk size before allocating it
  lightnvm: pblk: remove unnecessary argument
  lightnvm: pblk: remove unnecessary indirection
  lightnvm: pblk: return NVM_ error on failed submission
  lightnvm: pblk: warn in case of corrupted write buffer
  ...
2018-06-04 07:58:06 -07:00
Kent Overstreet d00a11df69 dm-crypt: fix warning in shutdown path
The counter for the number of allocated pages includes pages in the
mempool's reserve, so checking that the number of allocated pages is 0
needs to happen after we exit the mempool.

Fixes: 6f1c819c21 ("dm: convert to bioset_init()/mempool_init()")
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Reported-by: Krzysztof Kozlowski <krzk@kernel.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>

Fixed to always just use percpu_counter_sum()

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-02 20:35:00 -06:00
Kent Overstreet 6f1c819c21 dm: convert to bioset_init()/mempool_init()
Convert dm to embedded bio sets.

Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Kent Overstreet afeee514ce md: convert to bioset_init()/mempool_init()
Convert md to embedded bio sets.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Kent Overstreet d19936a266 bcache: convert to bioset_init()/mempool_init()
Convert bcache to embedded bio sets.

Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Kent Overstreet 338aa96d56 block: convert bounce, q->bio_split to bioset_init()/mempool_init()
Convert the core block functionality to embedded bio sets.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Andy Shevchenko ce4c3e19e5 bcache: Replace bch_read_string_list() by __sysfs_match_string()
Kernel library has a common function to match user input from sysfs
against an array of strings. Thus, replace bch_read_string_list() by
__sysfs_match_string().

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-28 14:53:22 -06:00
Andy Shevchenko ecb37ce9ba bcache: Move couple of functions to sysfs.c
There is couple of functions that are used exclusively in sysfs.c.
Move it to there and make them static.

Besides above, it will allow further clean up.

No functional change intended.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-28 14:53:20 -06:00
Andy Shevchenko 04cbc21137 bcache: Move couple of string arrays to sysfs.c
There is couple of string arrays that are used exclusively in sysfs.c.
Move it to there and make them static.

Besides above, it will allow further clean up.

No functional change intended.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-28 14:53:18 -06:00
Coly Li 0f0709e6bf bcache: stop bcache device when backing device is offline
Currently bcache does not handle backing device failure, if backing
device is offline and disconnected from system, its bcache device can still
be accessible. If the bcache device is in writeback mode, I/O requests even
can success if the requests hit on cache device. That is to say, when and
how bcache handles offline backing device is undefined.

This patch tries to handle backing device offline in a rather simple way,
- Add cached_dev->status_update_thread kernel thread to update backing
  device status in every 1 second.
- Add cached_dev->offline_seconds to record how many seconds the backing
  device is observed to be offline. If the backing device is offline for
  BACKING_DEV_OFFLINE_TIMEOUT (30) seconds, set dc->io_disable to 1 and
  call bcache_device_stop() to stop the bache device which linked to the
  offline backing device.

Now if a backing device is offline for BACKING_DEV_OFFLINE_TIMEOUT seconds,
its bcache device will be removed, then user space application writing on
it will get error immediately, and handler the device failure in time.

This patch is quite simple, does not handle more complicated situations.
Once the bcache device is stopped, users need to recovery the backing
device, register and attach it manually.

Changelog:
v3: call wait_for_kthread_stop() before exits kernel thread.
v2: remove "bcache: " prefix when calling pr_warn().
v1: initial version.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-28 14:53:16 -06:00
Dan Williams b3a9a0c36e dax: Introduce a ->copy_to_iter dax operation
Similar to the ->copy_from_iter() operation, a platform may want to
deploy an architecture or device specific routine for handling reads
from a dax_device like /dev/pmemX. On x86 this routine will point to a
machine check safe version of copy_to_iter(). For now, add the plumbing
to device-mapper and the dax core.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-22 23:18:31 -07:00
Xiao Ni 5a409b4f56 MD: fix lock contention for flush bios
There is a lock contention when there are many processes which send flush bios
to md device. eg. Create many lvs on one raid device and mkfs.xfs on each lv.

Now it just can handle flush request sequentially. It needs to wait mddev->flush_bio
to be NULL, otherwise get mddev->lock.

This patch remove mddev->flush_bio and handle flush bio asynchronously.
I did a test with command dbench -s 128 -t 300. This is the test result:

=================Without the patch============================
 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                    11165   167.595  5879.560
 Close                   107469     1.391  2231.094
 LockX                      384     0.003     0.019
 Rename                    5944     2.141  1856.001
 ReadX                   208121     0.003     0.074
 WriteX                   98259  1925.402 15204.895
 Unlink                   25198    13.264  3457.268
 UnlockX                    384     0.001     0.009
 FIND_FIRST               47111     0.012     0.076
 SET_FILE_INFORMATION     12966     0.007     0.065
 QUERY_FILE_INFORMATION   27921     0.004     0.085
 QUERY_PATH_INFORMATION  124650     0.005     5.766
 QUERY_FS_INFORMATION     22519     0.003     0.053
 NTCreateX               141086     4.291  2502.812

Throughput 3.7181 MB/sec (sync open)  128 clients  128 procs  max_latency=15204.905 ms

=================With the patch============================
 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                     4500   174.134   406.398
 Close                    48195     0.060   467.062
 LockX                      256     0.003     0.029
 Rename                    2324     0.026     0.360
 ReadX                    78846     0.004     0.504
 WriteX                   66832   562.775  1467.037
 Unlink                    5516     3.665  1141.740
 UnlockX                    256     0.002     0.019
 FIND_FIRST               16428     0.015     0.313
 SET_FILE_INFORMATION      6400     0.009     0.520
 QUERY_FILE_INFORMATION   17865     0.003     0.089
 QUERY_PATH_INFORMATION   47060     0.078   416.299
 QUERY_FS_INFORMATION      7024     0.004     0.032
 NTCreateX                55921     0.854  1141.452

Throughput 11.744 MB/sec (sync open)  128 clients  128 procs  max_latency=1467.041 ms

The test is done on raid1 disk with two rotational disks

V5: V4 is more complicated than the version with memory pool. So revert to the memory pool
version

V4: use address of fbio to do hash to choose free flush info.
V3:
Shaohua suggests mempool is overkill. In v3 it allocs memory during creating raid device
and uses a simple bitmap to record which resource is free.

Fix a bug from v2. It should set flush_pending to 1 at first.

V2:
Neil pointed out two problems. One is counting error problem and another is return value
when allocat memory fails.
1. counting error problem
This isn't safe.  It is only safe to call rdev_dec_pending() on rdevs
that you previously called
                          atomic_inc(&rdev->nr_pending);
If an rdev was added to the list between the start and end of the flush,
this will do something bad.

Now it doesn't use bio_chain. It uses specified call back function for each
flush bio.
2. Returned on IO error when kmalloc fails is wrong.
I use mempool suggested by Neil in V2
3. Fixed some places pointed by Guoqing

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-21 09:30:26 -07:00
Linus Torvalds 61c2ad9a2e for-linus-20180518
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJa/uljAAoJEPfTWPspceCmanYP/3ini0vEOawCIXdyzlnF1nrU
 j8ggoo1C2mcTtex29ZuNxCYtTifelptomgCzBfdQfJOZw7OJ7R6Qn1eqTBbbAV7h
 GKtYfMvg7MptfSK3yFfRCFm0a+xWiK1l4ywRwkKo0et3LqM0d+CjGfNfV5+iwvlb
 2ezn/NfF97FCKDISSaagFmSvppbj//QutdD/A1ZiDrct/ZTWWbkIYsrMSGeRbQuq
 6p0JdHAZ7OdVzcc9wUnrfNE63TbcRQc1/GBnEyymI6bkXnNUh4NMnulFyzRvzK2r
 Yv6FFxdsZOQgUPE7nEx+gfZIelgEFQ1pec8txc4vs5PJgR7Knv4EfU3jZjTH3GYx
 YZTKVU9q4WGxv1CnyoJMu/bhJm2s07Vv7g+4vB80xF+XlF1J8rLx2fxlGKFkQUbH
 0VTIc42eqWIip9Owb8RNwe6DqX5dIMewr1r0VXgG+y4ZnBnuOxBpo07NcMjcmHTv
 flgzSPRn2tzbVuSNqR08PzaZUhbDlb10LV9csePxvVokPNTrJ4jfIcA9Fl0w0dM3
 UgUFndDOXoBE2zuZyWmzBAHv/rFQPq88kBNHo5GBMhPOWG0C60jZzQDvFwmbUKTF
 6jQJ9lYDaVLzJ+5qE4AD0x/cLRGGAFqjiQYwlgQi6RSU68hUagA7NFBtAxz7Fo0V
 rBwfbZEX+/fcGWrbAo1N
 =1mzk
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180518' of git://git.kernel.dk/linux-block

Pull block fix from Jens Axboe:
 "Single fix this time, from Coly, fixing a failure case when
  CONFIG_DEBUGFS isn't enabled"

* tag 'for-linus-20180518' of git://git.kernel.dk/linux-block:
  bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=n
2018-05-18 10:10:43 -07:00
Amy Chiang 448ec638c6 md/raid5: Assigning NULL to sh->batch_head before testing bit R5_Overlap of a stripe
In add_stripe_bio(), if the stripe_head is in batch list, the incoming
bio is regarded as overlapping, and the bit R5_Overlap on this stripe_head
is set. break_stripe_batch_list() checks bit R5_Overlap on each stripe_head
first then assigns NULL to sh->batch_head.

If break_stripe_batch_list() checks bit R5_Overlap on stripe_head A
after add_stripe_bio() finds stripe_head A is in batch list and before
add_stripe_bio() sets bit R5_Overlapt of stripe_head A,
break_stripe_batch_list() would not know there's a process in
wait_for_overlap and needs to call wake_up(). There's a huge chance a
process never returns from schedule() if add_stripe_bio() is called
from raid5_make_request().

In break_stripe_batch_list(), assigning NULL to sh->batch_head should
be done before it checks bit R5_Overlap of a stripe_head.

Signed-off-by: Amy Chiang <amychiang@qnap.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-17 09:56:00 -07:00
Gioh Kim b33d10624f md/raid1: add error handling of read error from FailFast device
Current handle_read_error() function calls fix_read_error()
only if md device is RW and rdev does not include FailFast flag.
It does not handle a read error from a RW device including
FailFast flag.

I am not sure it is intended. But I found that write IO error
sets rdev faulty. The md module should handle the read IO error and
write IO error equally. So I think read IO error should set rdev faulty.

Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Reviewed-by: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-17 09:55:59 -07:00
Yufen Yu c42a0e2675 md: fix NULL dereference of mddev->pers in remove_and_add_spares()
We met NULL pointer BUG as follow:

[  151.760358] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[  151.761340] PGD 80000001011eb067 P4D 80000001011eb067 PUD 1011ea067 PMD 0
[  151.762039] Oops: 0000 [#1] SMP PTI
[  151.762406] Modules linked in:
[  151.762723] CPU: 2 PID: 3561 Comm: mdadm-test Kdump: loaded Not tainted 4.17.0-rc1+ #238
[  151.763542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[  151.764432] RIP: 0010:remove_and_add_spares.part.56+0x13c/0x3a0
[  151.765061] RSP: 0018:ffffc90001d7fcd8 EFLAGS: 00010246
[  151.765590] RAX: 0000000000000000 RBX: ffff88013601d600 RCX: 0000000000000000
[  151.766306] RDX: 0000000000000000 RSI: ffff88013601d600 RDI: ffff880136187000
[  151.767014] RBP: ffff880136187018 R08: 0000000000000003 R09: 0000000000000051
[  151.767728] R10: ffffc90001d7fed8 R11: 0000000000000000 R12: ffff88013601d600
[  151.768447] R13: ffff8801298b1300 R14: ffff880136187000 R15: 0000000000000000
[  151.769160] FS:  00007f2624276700(0000) GS:ffff88013ae80000(0000) knlGS:0000000000000000
[  151.769971] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  151.770554] CR2: 0000000000000060 CR3: 0000000111aac000 CR4: 00000000000006e0
[  151.771272] Call Trace:
[  151.771542]  md_ioctl+0x1df2/0x1e10
[  151.771906]  ? __switch_to+0x129/0x440
[  151.772295]  ? __schedule+0x244/0x850
[  151.772672]  blkdev_ioctl+0x4bd/0x970
[  151.773048]  block_ioctl+0x39/0x40
[  151.773402]  do_vfs_ioctl+0xa4/0x610
[  151.773770]  ? dput.part.23+0x87/0x100
[  151.774151]  ksys_ioctl+0x70/0x80
[  151.774493]  __x64_sys_ioctl+0x16/0x20
[  151.774877]  do_syscall_64+0x5b/0x180
[  151.775258]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

For raid6, when two disk of the array are offline, two spare disks can
be added into the array. Before spare disks recovery completing,
system reboot and mdadm thinks it is ok to restart the degraded
array by md_ioctl(). Since disks in raid6 is not only_parity(),
raid5_run() will abort, when there is no PPL feature or not setting
'start_dirty_degraded' parameter. Therefore, mddev->pers is NULL.

But, mddev->raid_disks has been set and it will not be cleared when
raid5_run abort. md_ioctl() can execute cmd 'HOT_REMOVE_DISK' to
remove a disk by mdadm, which will cause NULL pointer dereference
in remove_and_add_spares() finally.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-17 09:55:59 -07:00
Mariusz Dabrowski 2cd259a77d raid5: copy write hint from origin bio to stripe
Store write hint from original bio in stripe head so it can be assigned
to bio sent to each RAID device.

Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Reviewed-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-17 09:55:58 -07:00
Coly Li 1c1a2ee1b5 bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=n
Commit 539d39eb27 ("bcache: fix wrong return value in bch_debug_init()")
returns the return value of debugfs_create_dir() to bcache_init(). When
CONFIG_DEBUG_FS=n, bch_debug_init() always returns 1 and makes
bcache_init() failedi.

This patch makes bch_debug_init() always returns 0 if CONFIG_DEBUG_FS=n,
so bcache can continue to work for the kernels which don't have debugfs
enanbled.

Changelog:
v4: Add Acked-by from Kent Overstreet.
v3: Use IS_ENABLED(CONFIG_DEBUG_FS) to replace #ifdef DEBUG_FS.
v2: Remove a warning information
v1: Initial version.

Fixes: Commit 539d39eb27 ("bcache: fix wrong return value in bch_debug_init()")
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Massimo B. <massimo.b@gmx.net>
Reported-by: Kai Krakow <kai@kaishome.de>
Tested-by: Kai Krakow <kai@kaishome.de>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-17 09:43:40 -06:00
Christoph Hellwig ff005a0662 block: sanitize blk_get_request calling conventions
Switch everyone to blk_get_request_flags, and then rename
blk_get_request_flags to blk_get_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 08:55:12 -06:00
Linus Torvalds 94d7dbf108 - A stable fix for DM integrity to use kvfree.
- Fix for a 4.17-rc1 change to dm-bufio's buffer alignment.
 
 - Fixes for a few sparse warnings.
 
 - Remove VLA usage in DM mirror target.
 
 - Improve DM thinp Documentation for the "read_only" feature.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJa9Gn4AAoJEMUj8QotnQNa+6sH/0lu9XA7fpaopD29wPkjtFLg
 nDY8yCBrBvo1GRQBCAM3TWYDdTYLgO0srn24y9AM0AnhuiR+YFeDuoMyhzIQjWay
 X+RGgiMtLwroLWO9t7hhP1eK3u3SX+40bhvle6vNOn/KGb7XOuFnEksUJ85B9pJZ
 xF1aGos8+YIXTqBRP4RLJPWKPme1HIpdVGUcwnt9fW3J9PYzkN9xIry/cow0JWEl
 xmr69l2KpOQ7jVpcBhA52NDosW/LCOipyr9mhe0+lq60BDcsCbjCkK5p6F38Ufa8
 +24cqgdT2fdbogL3JTfs0lIgrOaLOhYmL5qJemopcxE6TKXvP6AzHhihlT48fNA=
 =w2u8
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - a stable fix for DM integrity to use kvfree

 - fix for a 4.17-rc1 change to dm-bufio's buffer alignment

 - fixes for a few sparse warnings

 - remove VLA usage in DM mirror target

 - improve DM thinp Documentation for the "read_only" feature

* tag 'for-4.17/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm thin: update Documentation to clarify when "read_only" is valid
  dm mirror: remove VLA usage
  dm: fix some sparse warnings and whitespace in dax methods
  dm cache background tracker: fix sparse warning
  dm bufio: fix buffer alignment
  dm integrity: use kvfree for kvmalloc'd memory
2018-05-10 11:42:01 -07:00
Omar Sandoval 522a777566 block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:

- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
  used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq

These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 08:33:09 -06:00
Kees Cook 65972a6fa9 dm mirror: remove VLA usage
On the quest to remove all VLAs from the kernel[1], this avoids VLAs
in dm-raid1.c by just using the maximum size for the stack arrays.
The nr_mirrors value was already capped at 9, so this makes it a trivial
adjustment to the array sizes.

[1] https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-05-04 10:35:20 -04:00
Coly Li 09a44ca211 bcache: use pr_info() to inform duplicated CACHE_SET_IO_DISABLE set
It is possible that multiple I/O requests hits on failed cache device or
backing device, therefore it is quite common that CACHE_SET_IO_DISABLE is
set already when a task tries to set the bit from bch_cache_set_error().
Currently the message "CACHE_SET_IO_DISABLE already set" is printed by
pr_warn(), which might mislead users to think a serious fault happens in
source code.

This patch uses pr_info() to print the information in such situation,
avoid extra worries. This information is helpful to understand bcache
behavior in cache device failures, so I still keep them in source code.

Fixes: 771f393e8f ("bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:16 -06:00
Coly Li 4fd8e13843 bcache: set dc->io_disable to true in conditional_stop_bcache_device()
Commit 7e027ca4b5 ("bcache: add stop_when_cache_set_failed option to
backing device") adds stop_when_cache_set_failed option and stops bcache
device if stop_when_cache_set_failed is auto and there is dirty data on
broken cache device. There might exists a small time gap that the cache
set is released and set to NULL but bcache device is not released yet
(because they are released in parallel). During this time gap, dc->c is
NULL so CACHE_SET_IO_DISABLE won't be checked, and dc->io_disable is still
false, so new coming I/O requests will be accepted and directly go into
backing device as no cache set attached to. If there is dirty data on
cache device, this behavior may introduce potential inconsistent data.

This patch sets dc->io_disable to true before calling bcache_device_stop()
to make sure the backing device will reject new coming I/O request as
well, so even in the small time gap no I/O will directly go into backing
device to corrupt data consistency.

Fixes: 7e027ca4b5 ("bcache: add stop_when_cache_set_failed option to backing device")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:15 -06:00
Coly Li ecb2ba8cb8 bcache: add wait_for_kthread_stop() in bch_allocator_thread()
When CACHE_SET_IO_DISABLE is set on cache set flags, bcache allocator
thread routine bch_allocator_thread() may stop the while-loops and
exit. Then it is possible to observe the following kernel oops message,

[  631.068366] bcache: bch_btree_insert() error -5
[  631.069115] bcache: cached_dev_detach_finish() Caching disabled for sdf
[  631.070220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  631.070250] PGD 0 P4D 0
[  631.070261] Oops: 0002 [#1] SMP PTI
[snipped]
[  631.070578] Workqueue: events cache_set_flush [bcache]
[  631.070597] RIP: 0010:exit_creds+0x1b/0x50
[  631.070610] RSP: 0018:ffffc9000705fe08 EFLAGS: 00010246
[  631.070626] RAX: 0000000000000001 RBX: ffff880a622ad300 RCX: 000000000000000b
[  631.070645] RDX: 0000000000000601 RSI: 000000000000000c RDI: 0000000000000000
[  631.070663] RBP: ffff880a622ad300 R08: ffffea00190c66e0 R09: 0000000000000200
[  631.070682] R10: ffff880a48123000 R11: ffff880000000000 R12: 0000000000000000
[  631.070700] R13: ffff880a4b160e40 R14: ffff880a4b160000 R15: 0ffff880667e2530
[  631.070719] FS:  0000000000000000(0000) GS:ffff880667e00000(0000) knlGS:0000000000000000
[  631.070740] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  631.070755] CR2: 0000000000000000 CR3: 000000000200a001 CR4: 00000000003606e0
[  631.070774] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  631.070793] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  631.070811] Call Trace:
[  631.070828]  __put_task_struct+0x55/0x160
[  631.070845]  kthread_stop+0xee/0x100
[  631.070863]  cache_set_flush+0x11d/0x1a0 [bcache]
[  631.070879]  process_one_work+0x146/0x340
[  631.070892]  worker_thread+0x47/0x3e0
[  631.070906]  kthread+0xf5/0x130
[  631.070917]  ? max_active_store+0x60/0x60
[  631.070930]  ? kthread_bind+0x10/0x10
[  631.070945]  ret_from_fork+0x35/0x40
[snipped]
[  631.071017] RIP: exit_creds+0x1b/0x50 RSP: ffffc9000705fe08
[  631.071033] CR2: 0000000000000000
[  631.071045] ---[ end trace 011c63a24b22c927 ]---
[  631.071085] bcache: bcache_device_free() bcache0 stopped

The reason is when cache_set_flush() tries to call kthread_stop() to stop
allocator thread, but it exits already due to cache device I/O errors.

This patch adds wait_for_kthread_stop() at tail of bch_allocator_thread(),
to prevent the thread routine exiting directly. Then the allocator thread
can be blocked at wait_for_kthread_stop() and wait for cache_set_flush()
to stop it by calling kthread_stop().

changelog:
v3: add Reviewed-by from Hannnes.
v2: not directly return from allocator_wait(), move 'return 0' to tail of
    bch_allocator_thread().
v1: initial version.

Fixes: 771f393e8f ("bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:13 -06:00
Coly Li bf78980fcc bcache: count backing device I/O error for writeback I/O
Commit c7b7bd0740 ("bcache: add io_disable to struct cached_dev")
counts backing device I/O requets and set dc->io_disable to true if error
counters exceeds dc->io_error_limit. But it only counts I/O errors for
regular I/O request, neglects errors of write back I/Os when backing device
is offline.

This patch counts the errors of writeback I/Os, in dirty_endio() if
bio->bi_status is  not 0, it means error happens when writing dirty keys
to backing device, then bch_count_backing_io_errors() is called.

By this fix, even there is no reqular I/O request coming, if writeback I/O
errors exceed dc->io_error_limit, the bcache device may still be stopped
for the broken backing device.

Fixes: c7b7bd0740 ("bcache: add io_disable to struct cached_dev")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:12 -06:00
Coly Li 6147305c73 bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()
Commit c7b7bd0740 ("bcache: add io_disable to struct cached_dev") tries
to stop bcache device by calling bcache_device_stop() when too many I/O
errors happened on backing device. But if there is internal I/O happening
on cache device (writeback scan, garbage collection, etc), a regular I/O
request triggers the internal I/Os may still holds a refcount of dc->count,
and the refcount may only be dropped after the internal I/O stopped.

By this patch, bch_cached_dev_error() will check if the backing device is
attached to a cache set, if yes that CACHE_SET_IO_DISABLE will be set to
flags of this cache set. Then internal I/Os on cache device will be
rejected and stopped immediately, and the bcache device can be stopped.

For people who are not familiar with the interesting refcount dependance,
let me explain a bit more how the fix works. Example the writeback thread
will scan cache device for dirty data writeback purpose. Before it stopps,
it holds a refcount of dc->count. When CACHE_SET_IO_DISABLE bit is set,
the internal I/O will stopped and the while-loop in bch_writeback_thread()
quits and calls cached_dev_put() to drop dc->count. If this is the last
refcount to drop, then cached_dev_detach_finish() will be called. In this
call back function, in turn closure_put(dc->disk.cl) is called to drop a
refcount of closure dc->disk.cl. If this is the last refcount of this
closure to drop, then cached_dev_flush() will be called. Then the cached
device is freed. So if CACHE_SET_IO_DISABLE is not set, the bache device
can not be stopped until all inernal cache device I/O stopped. For large
size cache device, and writeback thread competes locks with gc thread,
there might be a quite long time to wait.

Fixes: c7b7bd0740 ("bcache: add io_disable to struct cached_dev")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:10 -06:00
Coly Li 6e916a7eb1 bcache: store disk name in struct cache and struct cached_dev
Current code uses bdevname() or bio_devname() to reference gendisk
disk name when bcache needs to display the disk names in kernel message.
It was safe before bcache device failure handling patch set merged in,
because when devices are failed, there was deadlock to prevent bcache
printing error messages with gendisk disk name. But after the failure
handling patch set merged, the deadlock is fixed, so it is possible
that the gendisk structure bdev->hd_disk is released when bdevname() is
called to reference bdev->bd_disk->disk_name[]. This is why I receive
bug report of NULL pointers deference panic.

This patch stores gendisk disk name in a buffer inside struct cache and
struct cached_dev, then print out the offline device name won't reference
bdev->hd_disk anymore. And this patch also avoids extra function calls
of bdevname() and bio_devnmae().

Changelog:
v3, add Reviewed-by from Hannes.
v2, call bdevname() earlier in register_bdev()
v1, first version with segguestion from Junhui Tang.

Fixes: c7b7bd0740 ("bcache: add io_disable to struct cached_dev")
Fixes: 5138ac6748 ("bcache: fix misleading error message in bch_count_io_errors()")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:08 -06:00
NeilBrown 011abdc9df md: fix two problems with setting the "re-add" device state.
If "re-add" is written to the "state" file for a device
which is faulty, this has an effect similar to removing
and re-adding the device.  It should take up the
same slot in the array that it previously had, and
an accelerated (e.g. bitmap-based) rebuild should happen.

The slot that "it previously had" is determined by
rdev->saved_raid_disk.
However this is not set when a device fails (only when a device
is added), and it is cleared when resync completes.
This means that "re-add" will normally work once, but may not work a
second time.

This patch includes two fixes.
1/ when a device fails, record the ->raid_disk value in
    ->saved_raid_disk before clearing ->raid_disk
2/ when "re-add" is written to a device for which
    ->saved_raid_disk is not set, fail.

I think this is suitable for stable as it can
cause re-adding a device to be forced to do a full
resync which takes a lot longer and so puts data at
more risk.

Cc: <stable@vger.kernel.org> (v4.1)
Fixes: 97f6cd39da ("md-cluster: re-add capabilities")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-01 09:47:50 -07:00
Guoqing Jiang eb81b32826 raid10: check bio in r10buf_pool_free to void NULL pointer dereference
For recovery case, r10buf_pool_alloc only allocates 2 bios,
so we can't access more than 2 bios in r10buf_pool_free.
Otherwise, we can see NULL pointer dereference as follows:

[   98.347009] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000050
[   98.355783] IP: r10buf_pool_free+0x38/0xe0 [raid10]
[...]
[   98.543734] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   98.550161] CR2: 0000000000000050 CR3: 000000089500a001 CR4: 00000000001606f0
[   98.558145] Call Trace:
[   98.560881]  <IRQ>
[   98.563136]  put_buf+0x19/0x20 [raid10]
[   98.567426]  end_sync_request+0x6b/0x70 [raid10]
[   98.572591]  end_sync_write+0x9b/0x160 [raid10]
[   98.577662]  blk_update_request+0x78/0x2c0
[   98.582254]  scsi_end_request+0x2c/0x1e0 [scsi_mod]
[   98.587719]  scsi_io_completion+0x22f/0x610 [scsi_mod]
[   98.593472]  blk_done_softirq+0x8e/0xc0
[   98.597767]  __do_softirq+0xde/0x2b3
[   98.601770]  irq_exit+0xae/0xb0
[   98.605285]  do_IRQ+0x81/0xd0
[   98.608606]  common_interrupt+0x7d/0x7d
[   98.612898]  </IRQ>

So we need to check the bio is valid or not before the bio is
used in r10buf_pool_free. Another workable way is to free 2 bios
for recovery case just like r10buf_pool_alloc.

Fixes: f025061836 ("md: raid10: don't use bio's vec table to manage resync pages")
Reported-by: Alexis Castilla <pencerval@gmail.com>
Tested-by: Alexis Castilla <pencerval@gmail.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-01 09:47:50 -07:00
Yufen Yu 13db16d74c md: fix an error code format and remove unsed bio_sector
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-01 09:47:49 -07:00
Mike Snitzer 3d97c829ed dm: fix some sparse warnings and whitespace in dax methods
Eliminate these sparse warnings:
drivers/md/dm.c:1062:9: warning: context imbalance in 'dm_dax_direct_access' - unexpected unlock
drivers/md/dm.c:1086:9: warning: context imbalance in 'dm_dax_copy_from_iter' - unexpected unlock

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-30 17:05:17 -04:00
Mike Snitzer 280884fadc dm cache background tracker: fix sparse warning
Fix drivers/md/dm-cache-background-tracker.c:169:16: warning: symbol
'alloc_work' was not declared. Should it be static?

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-30 15:40:40 -04:00
Mikulas Patocka f7879b4cea dm bufio: fix buffer alignment
Commit 6b5e718cc1 ("dm bufio: relax alignment constraint on slab
cache") relaxed alignment on dm-bufio cache, however it may break
dm-crypt or dm-integrity.

dm-crypt and dm-integrity require that the size of bio vector entries
(bv_len) is aligned on its sector size. bv_offset doesn't have to be
aligned, but bv_len must be. XFS sends unaligned bios, but they do not
cross page boundary, so the requirement for aligned bv_len is met.

Commit 6b5e718cc1 made dm-bufio send unaligned bios that cross page
boundary, this could break dm-crypt and dm-integrity.

Reinstates the alignment. Note that misaligned entries only happen when
we use slab/slub debugging. Without debugging, the entries are always
aligned.

Fixes: 6b5e718cc1 ("dm bufio: relax alignment constraint on slab cache")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-30 11:51:39 -04:00
Mikulas Patocka fc8cec1139 dm integrity: use kvfree for kvmalloc'd memory
Use kvfree instead of kfree because the array is allocated with kvmalloc.

Fixes: 7eada909bf ("dm: add integrity target")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-30 11:51:39 -04:00
Thomas Gleixner eb366989aa dm/verity_fec: Use GFP aware reed solomon init
Allocations from the rs_pool can invoke init_rs() from the mempool
allocation callback. This is problematic in fec_alloc_bufs() which invokes
mempool_alloc() with GFP_NOIO to prevent a swap deadlock because init_rs()
uses GFP_KERNEL allocations.

Switch it to init_rs_gfp() and invoke it with the gfp_t flags which are
handed in from the allocator.

Note: This is not a problem today because the rs control struct is shared
between the instances and its created when the mempool is initialized. But
the upcoming changes which switch to a rs_control struct per instance to
embed decoder buffers will trigger the swap vs. GFP_KERNEL issue.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-04-24 19:50:05 -07:00
Linus Torvalds 7768ee3f45 Merge tag 'md/4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD fixes from Shaohua Li:
 "Three small fixes for MD:

   - md-cluster fix for faulty device from Guoqing

   - writehint fix for writebehind IO for raid1 from Mariusz

   - a live lock fix for interrupted recovery from Yufen"

* tag 'md/4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  raid1: copy write hint from master bio to behind bio
  md/raid1: exit sync request if MD_RECOVERY_INTR is set
  md-cluster: don't update recovery_offset for faulty device
2018-04-20 10:39:44 -07:00
Linus Torvalds 9f3a0941fb libnvdimm for 4.17
* A rework of the filesytem-dax implementation provides for detection of
   unmap operations (truncate / hole punch) colliding with in-progress
   device-DMA. A fix for these collisions remains a work-in-progress
   pending resolution of truncate latency and starvation regressions.
 
 * The of_pmem driver expands the users of libnvdimm outside of x86 and
   ACPI to describe an implementation of persistent memory on PowerPC with
   Open Firmware / Device tree.
 
 * Address Range Scrub (ARS) handling is completely rewritten to account for
   the fact that ARS may run for 100s of seconds and there is no platform
   defined way to cancel it. ARS will now no longer block namespace
   initialization.
 
 * The NVDIMM Namespace Label implementation is updated to handle label
   areas as small as 1K, down from 128K.
 
 * Miscellaneous cleanups and updates to unit test infrastructure.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJazDt5AAoJEB7SkWpmfYgCqGMQALLwdPeY87cUK7AvQ2IXj46B
 lJgeVuHPzyQDbC03AS5uUYnnU3I5lFd7i4y7ZrywNpFs4lsb/bNmbUpQE5xp+Yvc
 1MJ/JYDIP5X4misWYm3VJo85N49+VqSRgAQk52PBigwnZ7M6/u4cSptXM9//c9JL
 /NYbat6IjjY6Tx49Tec6+F3GMZjsFLcuTVkQcREoOyOqVJE4YpP0vhNjEe0vq6vr
 EsSWiqEI5VFH4PfJwKdKj/64IKB4FGKj2A5cEgjQBxW2vw7tTJnkRkdE3jDUjqtg
 xYAqGp/Dqs4+bgdYlT817YhiOVrcr5mOHj7TKWQrBPgzKCbcG5eKDmfT8t+3NEga
 9kBlgisqIcG72lwZNA7QkEHxq1Omy9yc1hUv9qz2YA0G+J1WE8l1T15k1DOFwV57
 qIrLLUypklNZLxvrzNjclempboKc4JCUlj+TdN5E5Y6pRs55UWTXaP7Xf5O7z0vf
 l/uiiHkc3MPH73YD2PSEGFJ8m8EU0N8xhrcz3M9E2sHgYCnbty1Lw3FH0/GhThVA
 ya1mMeDdb8A2P7gWCBk1Lqeig+rJKXSey4hKM6D0njOEtMQO1H4tFqGjyfDX1xlJ
 3plUR9WBVEYzN5+9xWbwGag/ezGZ+NfcVO2gmy6yXiEph796BxRAZx/18zKRJr0m
 9eGJG1H+JspcbtLF9iHn
 =acZQ
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This cycle was was not something I ever want to repeat as there were
  several late changes that have only now just settled.

  Half of the branch up to commit d2c997c0f1 ("fs, dax: use
  page->mapping to warn...") have been in -next for several releases.
  The of_pmem driver and the address range scrub rework were late
  arrivals, and the dax work was scaled back at the last moment.

  The of_pmem driver missed a previous merge window due to an oversight.
  A sense of obligation to rectify that miss is why it is included for
  4.17. It has acks from PowerPC folks. Stephen reported a build failure
  that only occurs when merging it with your latest tree, for now I have
  fixed that up by disabling modular builds of of_pmem. A test merge
  with your tree has received a build success report from the 0day robot
  over 156 configs.

  An initial version of the ARS rework was submitted before the merge
  window. It is self contained to libnvdimm, a net code reduction, and
  passing all unit tests.

  The filesystem-dax changes are based on the wait_var_event()
  functionality from tip/sched/core. However, late review feedback
  showed that those changes regressed truncate performance to a large
  degree. The branch was rewound to drop the truncate behavior change
  and now only includes preparation patches and cleanups (with full acks
  and reviews). The finalization of this dax-dma-vs-trnucate work will
  need to wait for 4.18.

  Summary:

   - A rework of the filesytem-dax implementation provides for detection
     of unmap operations (truncate / hole punch) colliding with
     in-progress device-DMA. A fix for these collisions remains a
     work-in-progress pending resolution of truncate latency and
     starvation regressions.

   - The of_pmem driver expands the users of libnvdimm outside of x86
     and ACPI to describe an implementation of persistent memory on
     PowerPC with Open Firmware / Device tree.

   - Address Range Scrub (ARS) handling is completely rewritten to
     account for the fact that ARS may run for 100s of seconds and there
     is no platform defined way to cancel it. ARS will now no longer
     block namespace initialization.

   - The NVDIMM Namespace Label implementation is updated to handle
     label areas as small as 1K, down from 128K.

   - Miscellaneous cleanups and updates to unit test infrastructure"

* tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
  libnvdimm, of_pmem: workaround OF_NUMA=n build error
  nfit, address-range-scrub: add module option to skip initial ars
  nfit, address-range-scrub: rework and simplify ARS state machine
  nfit, address-range-scrub: determine one platform max_ars value
  powerpc/powernv: Create platform devs for nvdimm buses
  doc/devicetree: Persistent memory region bindings
  libnvdimm: Add device-tree based driver
  libnvdimm: Add of_node to region and bus descriptors
  libnvdimm, region: quiet region probe
  libnvdimm, namespace: use a safe lookup for dimm device name
  libnvdimm, dimm: fix dpa reservation vs uninitialized label area
  libnvdimm, testing: update the default smart ctrl_temperature
  libnvdimm, testing: Add emulation for smart injection commands
  nfit, address-range-scrub: introduce nfit_spa->ars_state
  libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
  nfit, address-range-scrub: fix scrub in-progress reporting
  dax, dm: allow device-mapper to operate without dax support
  dax: introduce CONFIG_DAX_DRIVER
  fs, dax: use page->mapping to warn if truncate collides with a busy page
  ext2, dax: introduce ext2_dax_aops
  ...
2018-04-10 10:25:57 -07:00
Dan Williams e13e75b86e Merge branch 'for-4.17/dax' into libnvdimm-for-next 2018-04-09 10:50:17 -07:00
Mariusz Dabrowski dba40d46eb raid1: copy write hint from master bio to behind bio
Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Reviewed-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-04-09 08:54:34 -07:00
Yufen Yu 8c24259323 md/raid1: exit sync request if MD_RECOVERY_INTR is set
We met a sync thread stuck as follows:

 raid1_sync_request+0x2c9/0xb50
 md_do_sync+0x983/0xfa0
 md_thread+0x11c/0x160
 kthread+0x111/0x130
 ret_from_fork+0x35/0x40
 0xffffffffffffffff

At the same time, there is a stuck mdadm thread (mdadm --manage
/dev/md2 --add /dev/sda). It is trying to stop the sync thread:

 kthread_stop+0x42/0xf0
 md_unregister_thread+0x3a/0x70
 md_reap_sync_thread+0x15/0x160
 action_store+0x142/0x2a0
 md_attr_store+0x6c/0xb0
 kernfs_fop_write+0x102/0x180
 __vfs_write+0x33/0x170
 vfs_write+0xad/0x1a0
 SyS_write+0x52/0xc0
 do_syscall_64+0x6e/0x190
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Debug tools show that the sync thread is waiting in raise_barrier(),
until raid1d() end all normal IO bios into bio_end_io_list(introduced
in commit 55ce74d4bf). But, raid1d() cannot end these bios if
MD_CHANGE_PENDING bit is set. It needs to get mddev->reconfig_mutex lock
and then clear the bit in md_check_recovery().
However, the lock is holding by mdadm in action_store().

Thus, there is a loop:
mdadm waiting for sync thread to stop, sync thread waiting for
raid1d() to end bios, raid1d() waiting for mdadm to release
mddev->reconfig_mutex lock and then it can end bios.

Fix this by checking MD_RECOVERY_INTR while waiting in raise_barrier(),
so that sync thread can exit while mdadm is stoping the sync thread.

Fixes: 55ce74d4bf ("md/raid1: ensure device failure recorded before write request returns.")
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-04-09 08:41:16 -07:00
Guoqing Jiang 0ea9924abe md-cluster: don't update recovery_offset for faulty device
Device could become faulty when clustered array handling
METADATA_UPDATED msg, so we don't need to call read_rdev
for this device.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-04-09 08:39:36 -07:00
Linus Torvalds 83c7c18b16 - DM core passthrough ioctl fix to retain reference to DM table, and
that table's block devices, while issuing the ioctl to one of those
   block devices.
 
 - DM core passthrough ioctl fix to _not_ override the fmode_t used to
   issue the ioctl.  Overriding by using the fmode_t that the block
   device was originally open with during DM table load is a liability.
 
 - Add DM core support for secure erase forwarding and update the DM
   linear and DM striped targets to support them.
 
 - A DM core 4.16 stable fix to allow abnormal IO (e.g. discard, write
   same, write zeroes) for targets that make use of the non-splitting IO
   variant (as is done for multipath or thinp when layered directly on
   NVMe).
 
 - Allow DM targets to return a payload in response to a DM message that
   they are sent.  This is useful for DM targets that would like to
   provide statistics data in response to DM messages.
 
 - Update DM bufio to support non-power-of-2 block sizes.  Numerous other
   related changes prepare the DM bufio code for this support.
 
 - Fix DM crypt to use a bounded amount of memory across the entire
   system.  This is to avoid OOM that can otherwise occur in response to
   certain pathological IO workloads (e.g. discarding a large DM crypt
   device).
 
 - Add a 'check_at_most_once' feature to the DM verity target to allow
   verity to be used on mobile devices that have very limited resources.
 
 - Fix the DM integrity target to fail early if a keyed algorithm
   (e.g. HMAC) is to be used but the key isn't set.
 
 - Add non-power-of-2 support to the DM unstripe target.
 
 - Eliminate the use of a Variable Length Array in the DM stripe target.
 
 - Update the DM log-writes target to record metadata (REQ_META flag).
 
 - DM raid fixes for its nosync status and some variable range issues.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJaxjrCAAoJEMUj8QotnQNaEW8IAMeb6dWbxBgCleafK7GV5I0B
 jdm6NWDGoGrhz5TyZXbd6pLXD6fbU51O2Jn/LyQ7KOY1lRlS66TBprR60gNyzPQn
 bQVVEbKnSMwNAnkVpStiVZSs2e9HyNIRsM0yvwZXPfcVw6Q8XttiJQHRdGu4A5jm
 i+/aoyAf1iJowe5ituyE569gFGnOzPmesKZoF2/A36ik3yq3HF4FR0IPxWKseSpt
 gCeGTPpBv3aRonRBDxT9uOwb3SXC8TGrStRLm1I6UFwmHKs7nWx/o/P1ghbSOgSt
 Gu+GdoXnG5qtzAZ1t3sR2Iw3zEmpO1t5Jht5VqR7T+1145DaoPkW5M8vyjHll5o=
 =H7V5
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - DM core passthrough ioctl fix to retain reference to DM table, and
   that table's block devices, while issuing the ioctl to one of those
   block devices.

 - DM core passthrough ioctl fix to _not_ override the fmode_t used to
   issue the ioctl. Overriding by using the fmode_t that the block
   device was originally open with during DM table load is a liability.

 - Add DM core support for secure erase forwarding and update the DM
   linear and DM striped targets to support them.

 - A DM core 4.16 stable fix to allow abnormal IO (e.g. discard, write
   same, write zeroes) for targets that make use of the non-splitting IO
   variant (as is done for multipath or thinp when layered directly on
   NVMe).

 - Allow DM targets to return a payload in response to a DM message that
   they are sent. This is useful for DM targets that would like to
   provide statistics data in response to DM messages.

 - Update DM bufio to support non-power-of-2 block sizes. Numerous other
   related changes prepare the DM bufio code for this support.

 - Fix DM crypt to use a bounded amount of memory across the entire
   system. This is to avoid OOM that can otherwise occur in response to
   certain pathological IO workloads (e.g. discarding a large DM crypt
   device).

 - Add a 'check_at_most_once' feature to the DM verity target to allow
   verity to be used on mobile devices that have very limited resources.

 - Fix the DM integrity target to fail early if a keyed algorithm (e.g.
   HMAC) is to be used but the key isn't set.

 - Add non-power-of-2 support to the DM unstripe target.

 - Eliminate the use of a Variable Length Array in the DM stripe target.

 - Update the DM log-writes target to record metadata (REQ_META flag).

 - DM raid fixes for its nosync status and some variable range issues.

* tag 'for-4.17/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (28 commits)
  dm: remove fmode_t argument from .prepare_ioctl hook
  dm: hold DM table for duration of ioctl rather than use blkdev_get
  dm raid: fix parse_raid_params() variable range issue
  dm verity: make verity_for_io_block static
  dm verity: add 'check_at_most_once' option to only validate hashes once
  dm bufio: don't embed a bio in the dm_buffer structure
  dm bufio: support non-power-of-two block sizes
  dm bufio: use slab cache for dm_buffer structure allocations
  dm bufio: reorder fields in dm_buffer structure
  dm bufio: relax alignment constraint on slab cache
  dm bufio: remove code that merges slab caches
  dm bufio: get rid of slab cache name allocations
  dm bufio: move dm-bufio.h to include/linux/
  dm bufio: delete outdated comment
  dm: add support for secure erase forwarding
  dm: backfill abnormal IO support to non-splitting IO submission
  dm raid: fix nosync status
  dm mpath: use DM_MAPIO_SUBMITTED instead of magic number 0 in process_queued_bios()
  dm stripe: get rid of a Variable Length Array (VLA)
  dm log writes: record metadata flag for better flags record
  ...
2018-04-06 11:50:19 -07:00
Linus Torvalds 3526dd0c78 for-4.17/block-20180402
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJawr05AAoJEPfTWPspceCmT2UP/1uuaqwzyl4VjFNb/k7KS7UM
 +Cs/1HBlGomgMA8orDTGqtWqLRdR3z4RSh0+MvXTzQ78HpFVYz7CbDc9itHm+G9M
 X0ypD4kF/JGCFb5cxk+x6qv28uO2nv4DP3+0hHqJWLH4UVJBWDY6bs4BPShsf9QB
 I6XjioNMhoqylXgdOITLODJZz+TcChlJMDAqwhpJwh9TH1wjobleAZ6AdmCPfgi5
 h0UCKMUKzcVJlNZwQUrzrs2cxcx9Uhunnbz7HK0ZV4n/FKFtDpGynFpQQ71pZxKe
 Be0ZOBPCQvC3ykOM/egCIvC/e5y7FgrjORD6jxyu1PTwAugI5E1VYSMxHkXvgPAx
 zOo9A7RT4GPO2tDQv+DbzNFpqeSAclTgSmr+/y1wmheBs8DiSt7MPVBiNM4zdCNv
 NLk9z7IEjFhdmluSB/LbTb1aokypMb/q7QTLouPHdwGn80k7yrhFyLHgdjpNTQ2K
 UHfHZvGxkOX6SmFhBNOtIFUkuSceenh64a0RkRle7filx+ImpbCVm2/GYi9zZNCu
 EtctgzLbLmz40zMiyDaZS2bxBgGzfn6yf4xd9LsaAJPMhvZnmXogT0D9ctWXB0WU
 mMaS7sOkLnNjnGkzF1fHkeiZ/oigrstJbe+CA7BtOdwxpWn6MZBgKEoFQ6iA2b3X
 5J1axMgVH5LAsIEcEQVq
 =RVhK
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "It's a pretty quiet round this time, which is nice. This contains:

   - series from Bart, cleaning up the way we set/test/clear atomic
     queue flags.

   - series from Bart, fixing races between gendisk and queue
     registration and removal.

   - set of bcache fixes and improvements from various folks, by way of
     Michael Lyle.

   - set of lightnvm updates from Matias, most of it being the 1.2 to
     2.0 transition.

   - removal of unused DIO flags from Nikolay.

   - blk-mq/sbitmap memory ordering fixes from Omar.

   - divide-by-zero fix for BFQ from Paolo.

   - minor documentation patches from Randy.

   - timeout fix from Tejun.

   - Alpha "can't write a char atomically" fix from Mikulas.

   - set of NVMe fixes by way of Keith.

   - bsg and bsg-lib improvements from Christoph.

   - a few sed-opal fixes from Jonas.

   - cdrom check-disk-change deadlock fix from Maurizio.

   - various little fixes, comment fixes, etc from various folks"

* tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block: (139 commits)
  blk-mq: Directly schedule q->timeout_work when aborting a request
  blktrace: fix comment in blktrace_api.h
  lightnvm: remove function name in strings
  lightnvm: pblk: remove some unnecessary NULL checks
  lightnvm: pblk: don't recover unwritten lines
  lightnvm: pblk: implement 2.0 support
  lightnvm: pblk: implement get log report chunk
  lightnvm: pblk: rename ppaf* to addrf*
  lightnvm: pblk: check for supported version
  lightnvm: implement get log report chunk helpers
  lightnvm: make address conversions depend on generic device
  lightnvm: add support for 2.0 address format
  lightnvm: normalize geometry nomenclature
  lightnvm: complete geo structure with maxoc*
  lightnvm: add shorten OCSSD version in geo
  lightnvm: add minor version to generic geometry
  lightnvm: simplify geometry structure
  lightnvm: pblk: refactor init/exit sequences
  lightnvm: Avoid validation of default op value
  lightnvm: centralize permission check for lightnvm ioctl
  ...
2018-04-05 14:27:02 -07:00
Mike Snitzer 5bd5e8d891 dm: remove fmode_t argument from .prepare_ioctl hook
Use the fmode_t that is passed to dm_blk_ioctl() rather than
inconsistently (varies across targets) drop it on the floor by
overriding it with the fmode_t stored in 'struct dm_dev'.

All the persistent reservation functions weren't using the fmode_t they
got back from .prepare_ioctl so remove them.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-04 12:12:39 -04:00
Mike Snitzer 971888c469 dm: hold DM table for duration of ioctl rather than use blkdev_get
Commit 519049afea ("dm: use blkdev_get rather than bdgrab when issuing
pass-through ioctl") inadvertantly introduced a regression relative to
users of device cgroups that issue ioctls (e.g. libvirt).  Using
blkdev_get() in DM's passthrough ioctl support implicitly introduced a
cgroup permissions check that would fail unless care were taken to add
all devices in the IO stack to the device cgroup.  E.g. rather than just
adding the top-level DM multipath device to the cgroup all the
underlying devices would need to be allowed.

Fix this, to no longer require allowing all underlying devices, by
simply holding the live DM table (which includes the table's original
blkdev_get() reference on the blockdevice that the ioctl will be issued
to) for the duration of the ioctl.

Also, bump the DM ioctl version so a user can know that their device
cgroup allow workaround is no longer needed.

Reported-by: Michal Privoznik <mprivozn@redhat.com>
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 519049afea ("dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl")
Cc: stable@vger.kernel.org # 4.16
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-04 12:12:38 -04:00
Heinz Mauelshagen 13bc62d4a6 dm raid: fix parse_raid_params() variable range issue
parse_raid_params() compares variable "int value" with INT_MAX.

E.g. related Coverity report excerpt:
   CID 1364818 (#2 of 3): Operands don't affect result (CONSTANT_EXPRESSION_RESULT) [select issue]
1433                        if (value > INT_MAX) {

Fix by changing checks to avoid INT_MAX.

Whilst on it, avoid unnecessary checks against constants
and add check for sane recovery speed min/max.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-04 12:12:37 -04:00
weiyongjun (A) d4b1aaf53c dm verity: make verity_for_io_block static
Fixes the following sparse warning:

drivers/md/dm-verity-target.c:375:6: warning:
 symbol 'verity_for_io_block' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-04 12:12:36 -04:00
Patrik Torstensson 843f38d382 dm verity: add 'check_at_most_once' option to only validate hashes once
This allows platforms that are CPU/memory contrained to verify data
blocks only the first time they are read from the data device, rather
than every time.  As such, it provides a reduced level of security
because only offline tampering of the data device's content will be
detected, not online tampering.

Hash blocks are still verified each time they are read from the hash
device, since verification of hash blocks is less performance critical
than data blocks, and a hash block will not be verified any more after
all the data blocks it covers have been verified anyway.

This option introduces a bitset that is used to check if a block has
been validated before or not.  A block can be validated more than once
as there is no thread protection for the bitset.

These changes were developed and tested on entry-level Android Go
devices.

Signed-off-by: Patrik Torstensson <totte@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:29 -04:00
Mikulas Patocka 45354f1eb6 dm bufio: don't embed a bio in the dm_buffer structure
The bio structure consumes a substantial part of dm_buffer.  The bio
structure is only needed when doing I/O on the buffer, thus we don't
have to embed it in the buffer.

Allocate the bio structure only when doing I/O.

We don't need to create a bio_set because, in case of allocation
failure, dm-bufio falls back to using dm-io (which keeps its own
bio_set).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:29 -04:00
Mikulas Patocka f51f2e0a7f dm bufio: support non-power-of-two block sizes
Support block sizes that are not a power-of-two (but they must be a
multiple of 512b).  As always, a slab cache is used for allocations.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:28 -04:00
Mikulas Patocka 359dbf19ab dm bufio: use slab cache for dm_buffer structure allocations
kmalloc padded to the next power of two, using a slab cache avoids this.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:27 -04:00
Mikulas Patocka 03b0293959 dm bufio: reorder fields in dm_buffer structure
Reorder fields in dm_buffer structure to improve packing and reduce
structure size.  The compiler allocates 32-bit integer for field 'enum
data_mode', so change it to unsigned char.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:26 -04:00
Mikulas Patocka 6b5e718cc1 dm bufio: relax alignment constraint on slab cache
The I/O buffer doesn't have to be aligned on block size granularity,
relax alignment to ARCH_KMALLOC_MINALIGN (required to allow DMA from
slab cache memory on some architectures).

Also, set SLAB_RECLAIM_ACCOUNT so that the memory allocated from the
cache is accounted as reclaimable and doesn't inflate the 'used' entry
in the free command.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:25 -04:00
Mikulas Patocka 21bb132767 dm bufio: remove code that merges slab caches
All slab allocators can merge duplicate caches.  So dm-bufio doesn't
need extra slab merging logic.  Instead it can just allocate one slab
cache per client and let the allocator merge them.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:25 -04:00
Mikulas Patocka eeb67a0ba0 dm bufio: get rid of slab cache name allocations
dm-bufio keeps the dm_bufio_cache_names array that holds names of the
slab caches.

Since the commit db265eca77 ("mm/sl[aou]b: Move duping of slab name to
slab_common.c"), the kernel automatically duplicates the slab cache name
when creating the slab cache, so we no longer have to keep the name
allocated.

Remove the code that allocates the slab names and keeps them around.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:24 -04:00
Mikulas Patocka afa53df869 dm bufio: move dm-bufio.h to include/linux/
Move dm-bufio.h to include/linux/ so that external GPL'd DM target
modules can use it.

It is better to allow the use of dm-bufio than force external modules
to implement the equivalent buffered IO mechanism in some new way.  The
hope is this will encourage the use of dm-bufio; which will then make it
easier for a GPL'd external DM target module to be included upstream.

A couple dm-bufio EXPORT_SYMBOL exports have also been updated to use
EXPORT_SYMBOL_GPL.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:23 -04:00
Mikulas Patocka 1f013174b3 dm bufio: delete outdated comment
This comment was true when dm-bufio was written but, since 4.3, bios can
now have arbitrary size and the driver splits them.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:22 -04:00
Denis Semakin 00716545c8 dm: add support for secure erase forwarding
Set QUEUE_FLAG_SECERASE in DM device's queue_flags if a DM table's
data devices support secure erase.

Also, add support for secure erase to both the linear and striped
targets.

Signed-off-by: Denis Semakin <d.semakin@omprussia.ru>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:21 -04:00
Mike Snitzer 0519c71e8d dm: backfill abnormal IO support to non-splitting IO submission
Otherwise, these abnormal IOs would be sent to the DM target
regardless of whether the target advertised support for them.

Factor out __process_abnormal_io() from __split_and_process_non_flush()
so that discards, write same, etc may be conditionally processed.

Fixes: 978e51ba3 ("dm: optimize bio-based NVMe IO submission")
Cc: stable@vger.kernel.org # 4.16
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:20 -04:00
Heinz Mauelshagen 880bcce0dc dm raid: fix nosync status
Fix a race for "nosync" activations providing "aa.." device health
characters and "0/N" sync ratio rather than "AA..." and "N/N".  Occurs
when status for the raid set is retrieved during resume before the MD
sync thread starts and clears the MD_RECOVERY_NEEDED flag.

Cc: stable@vger.kernel.org # 4.16+
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:19 -04:00
Wang Sheng-Hui 8192a0cd76 dm mpath: use DM_MAPIO_SUBMITTED instead of magic number 0 in process_queued_bios()
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:19 -04:00
Tycho Andersen 706dd22f12 dm stripe: get rid of a Variable Length Array (VLA)
Ideally, we'd like to get rid of all VLAs in the kernel and add -Wvla to
the build args: https://lkml.org/lkml/2018/3/7/621

This one is a simple case, since we don't actually need the VLA at all: we
can just iterate over the stripes twice, once to emit their names, and the
second time to emit status (i.e. trade memory for time). Since the number
of stripes is probably low, this is hopefully not that expensive.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:18 -04:00
Qu Wenruo e5c4cb9b1b dm log writes: record metadata flag for better flags record
So developer could distinguish data and metadata bios easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:17 -04:00
Milan Broz e16b4f99f0 dm integrity: fail early if required HMAC key is not available
Since crypto API commit 9fa68f6200 ("crypto: hash - prevent using keyed
hashes without setting key") dm-integrity cannot use keyed algorithms
without the key being set.

The dm-integrity recognizes this too late (during use of HMAC), so it
allows creation and formatting of superblock, but the device is in fact
unusable.

Fix it by detecting the key requirement in integrity table constructor.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:16 -04:00
Wang Sheng-Hui 2d77dafe23 dm: remove unused macro DM_MOD_NAME_SIZE
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:15 -04:00
Heinz Mauelshagen afac6bd6d1 dm unstripe: remove unnecessary header includes
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:15 -04:00
Heinz Mauelshagen 91e065d8f2 dm unstripe: remove superfluous module init error path message
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Scott Bauer <Scott.Bauer@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:14 -04:00
Heinz Mauelshagen ba5dfbb712 dm unstripe: add "dm-unstriped" module alias
This target's kernel module being named dm-unstripe.ko doesn't allow
lvm2's DM module autoload capability to load the dm-unstripe.ko
because lvm2 looks for dm-unstriped.ko due to the target name being
"unstriped".

Add the "dm-unstriped" module alias to resolve this oversight.

NOTE: this isn't needed for the "striped" target, despite its source
file being named dm-stripe.c, because it is part of dm-mod.ko.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:13 -04:00
Heinz Mauelshagen 2ae600cd15 dm unstripe: support non-power-of-2 chunk size
Address "FIXME: must support non power of 2 chunk_size, dm-stripe.c does".

Bump target version to indicate change.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Scott Bauer <Scott.Bauer@intel.com>
Reviewed-by: Scott Bauer <Scott.Bauer@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:12 -04:00
Mikulas Patocka 5059353df8 dm crypt: limit the number of allocated pages
dm-crypt consumes an excessive amount memory when the user attempts to
zero a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z"
calls the BLKZEROOUT ioctl, it goes to the function __blkdev_issue_zeroout,
__blkdev_issue_zeroout sends a large amount of write bios that contain
the zero page as their payload.

For each incoming page, dm-crypt allocates another page that holds the
encrypted data, so when processing "blkdiscard -z", dm-crypt tries to
allocate the amount of memory that is equal to the size of the device.
This can trigger OOM killer or cause system crash.

Fix this by limiting the amount of memory that dm-crypt allocates to 2%
of total system memory. This limit is system-wide and is divided by the
number of active dm-crypt devices and each device receives an equal
share.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:11 -04:00
Mike Snitzer 1eb5fa849f dm: allow targets to return output from messages they are sent
Could be useful for a target to return stats or other information.
If a target does DMEMIT() anything to @result from its .message method
then it must return 1 to the caller.

Signed-off-By: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:10 -04:00
Dan Williams 976431b02c dax, dm: allow device-mapper to operate without dax support
Change device-mapper's DAX dependency to require the presence of at
least one DAX_DRIVER. This allows device-mapper to be built without
bringing the DAX core along which is especially wasteful when there are
no DAX drivers, like BLK_DEV_PMEM, configured.

Cc: Alasdair Kergon <agk@redhat.com>
Reported-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-04-03 05:41:19 -07:00
Mike Snitzer da5dadb4f1 dm: fix dropped return code from dm_get_bdev_for_ioctl
dm_get_bdev_for_ioctl()'s return of 0 or 1 must be the result from
prepare_ioctl (1 means the ioctl was issued to a partition, 0 means it
wasn't).  Unfortunately commit 519049afea ("dm: use blkdev_get rather
than bdgrab when issuing pass-through ioctl") reused the variable 'r'
to store the return from blkdev_get() that follows prepare_ioctl()
-- whereby dropping prepare_ioctl()'s result on the floor.

This can lead to an ioctl or persistent reservation being issued to a
partition going unnoticed, which implies the extra permission check for
CAP_SYS_RAWIO is skipped.

Fix this by using a different variable to store blkdev_get()'s return.

Fixes: 519049afea ("dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl")
Reported-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-03-29 23:31:32 -04:00
Mike Snitzer e457edf0b2 dm mpath: fix support for loading scsi_dh modules during table load
The ability to have multipath dynamically attach a scsi_dh, that the user
specified in the multipath table, was broken by commit e8f74a0f00 ("dm
mpath: eliminate need to use scsi_device_from_queue").

Restore the ability to load, and attach, a particular scsi_dh module if
one is specified (as noticed by checking m->hw_handler_name).

Fixes: e8f74a0f00 ("dm mpath: eliminate need to use scsi_device_from_queue")
Reported-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-03-29 17:00:44 -04:00
Bart Van Assche 5f2b18ec8e bcache: Fix a compiler warning in bcache_device_init()
Avoid that building with W=1 triggers the following compiler warning:

drivers/md/bcache/super.c:776:20: warning: comparison is always false due to limited range of data type [-Wtype-limits]
      d->nr_stripes > SIZE_MAX / sizeof(atomic_t)) {
                    ^

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Bart Van Assche 20d3a51871 bcache: Reduce the number of sparse complaints about lock imbalances
Add more annotations for sparse to inform it about which functions do
not have the same number of spin_lock() and spin_unlock() calls.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Bart Van Assche 42361469ae bcache: Suppress more warnings about set-but-not-used variables
This patch does not change any functionality.

Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Bart Van Assche f0d3814090 bcache: Remove an unused variable
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Bart Van Assche 47344e330e bcache: Fix kernel-doc warnings
Avoid that building with W=1 triggers warnings about the kernel-doc
headers.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Bart Van Assche 9dfbdec7b7 bcache: Annotate switch fall-through
This patch avoids that building with W=1 triggers complaints about
switch fall-throughs.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Bart Van Assche 4a4e443835 bcache: Add __printf annotation to __bch_check_keys()
Make it possible for the compiler to verify the consistency of the
format string passed to __bch_check_keys() and the arguments that
should be formatted according to that format string.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Bart Van Assche fd01991d5c bcache: Fix indentation
This patch avoids that smatch complains about inconsistent indentation.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Coly Li c7b7bd0740 bcache: add io_disable to struct cached_dev
If a bcache device is configured to writeback mode, current code does not
handle write I/O errors on backing devices properly.

In writeback mode, write request is written to cache device, and
latter being flushed to backing device. If I/O failed when writing from
cache device to the backing device, bcache code just ignores the error and
upper layer code is NOT noticed that the backing device is broken.

This patch tries to handle backing device failure like how the cache device
failure is handled,
- Add a error counter 'io_errors' and error limit 'error_limit' in struct
  cached_dev. Add another io_disable to struct cached_dev to disable I/Os
  on the problematic backing device.
- When I/O error happens on backing device, increase io_errors counter. And
  if io_errors reaches error_limit, set cache_dev->io_disable to true, and
  stop the bcache device.

The result is, if backing device is broken of disconnected, and I/O errors
reach its error limit, backing device will be disabled and the associated
bcache device will be removed from system.

Changelog:
v2: remove "bcache: " prefix in pr_error(), and use correct name string to
    print out bcache device gendisk name.
v1: indeed this is new added in v2 patch set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Coly Li 27a40ab926 bcache: add backing_request_endio() for bi_end_io
In order to catch I/O error of backing device, a separate bi_end_io
call back is required. Then a per backing device counter can record I/O
errors number and retire the backing device if the counter reaches a
per backing device I/O error limit.

This patch adds backing_request_endio() to bcache backing device I/O code
path, this is a preparation for further complicated backing device failure
handling. So far there is no real code logic change, I make this change a
separate patch to make sure it is stable and reliable for further work.

Changelog:
v2: Fix code comments typo, remove a redundant bch_writeback_add() line
    added in v4 patch set.
v1: indeed this is new added in this patch set.

[mlyle: truncated commit subject]

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Cc: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Chengguang Xu df2b94313a bcache: move closure debug file into debug directory
In current code closure debug file is outside of debug directory
and when unloading module there is lack of removing operation
for closure debug file, so it will cause creating error when trying
to reload  module.

This patch move closure debug file into "bcache" debug direcory
so that the file can get deleted properly.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Tang Junhui ca71df3166 bcache: fix using of loop variable in memory shrink
In bch_mca_scan(), There are some confusion and logical error in the use of
loop variables. In this patch, we clarify them as:
1) nr: the number of btree nodes needs to scan, which will decrease after
we scan a btree node, and should not be less than 0;
2) i: the number of btree nodes have scanned, includes both
btree_cache_freeable and btree_cache, which should not be bigger than
btree_cache_used;
3) freed: the number of btree nodes have freed.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00