Commit graph

4683 commits

Author SHA1 Message Date
Keith Busch a67a95134f NVMe: Meta data handling through submit io ioctl
This adds support for the extended metadata formats through the submit
IO ioctl, and simplifies the rest when using a separate metadata format.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-07 19:11:06 -06:00
Keith Busch 7f749d9c10 NVMe: Add translation for block limits
Adds SCSI-to-NVMe translation for VPD B0h, block limits inquiry data.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-07 19:11:04 -06:00
Keith Busch 447228023e NVMe: Remove check for null
Checking fails static analysis due to additional arithmetic prior to
the NULL check. Mapping doesn't return NULL here anyway, so removing
the check.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-07 19:10:51 -06:00
Alexey Khoroshilov c727040bda NVMe: Fix error handling of class_create("nvme")
class_create() returns ERR_PTR on failure,
so IS_ERR() should be used instead of check for NULL.

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-07 19:08:58 -06:00
Markus Pargmann de9ad6d4ed nbd: Return error pointer directly
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-02 12:39:28 -06:00
Markus Pargmann dab5313aa4 nbd: Return error code directly
By returning the error code directly, we can avoid the jump label
error_out.

Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-02 12:39:26 -06:00
Markus Pargmann e018e7570c nbd: Remove fixme that was already fixed
The mentioned problem is not present anymore.

Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-02 12:39:24 -06:00
Markus Pargmann d18509f597 nbd: Restructure debugging prints
dprintk has some name collisions with other frameworks and drivers. It
is also not necessary to have these custom debug print filters. Dynamic
debug offers the same amount of filtered debugging.

This patch replaces all dprintks with dev_dbg(). It also removes the
ioctl dprintk which prints the ingoing ioctls which should be
replaceable by strace or similar stuff.

Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-02 12:39:22 -06:00
Markus Pargmann b9c495bb6d nbd: Fix device bytesize type
The block subsystem uses loff_t to store the device size. Change the
type for nbd_device bytesize to loff_t.

Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-02 12:39:21 -06:00
Markus Pargmann d06df60b94 nbd: Replace kthread_create with kthread_run
kthread_run includes the wake_up_process() call, so instead of
kthread_create() followed by wake_up_process() we can use this macro.

Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-02 12:39:19 -06:00
Markus Pargmann 13e71d69cc nbd: Remove kernel internal header
The header is not included anywhere. Remove it and include the private
nbd_device struct in nbd.c.

Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-02 12:39:18 -06:00
Jens Axboe d31af0a325 NVMe: increase depth of admin queue
Usually the admin queue depth of 64 is plenty, but for some use cases we
really need it larger. Examples are use cases like MAT, where you have
to touch all of NAND for init/format like purposes. In those cases, we
see a good 2x increase with an increased queue depth.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Keith Busch <keith.busch@intel.com>
2015-03-31 10:41:34 -06:00
Murali Iyer f137e0f151 nvme: Fix PRP list calculation for non-4k system page size
PRP list calculation is supposed to be based on device's page size.
Systems with page size larger than device's page size cause corruption
to the name space as well as system memory with out this fix.
Systems like x86 might not experience this issue because it uses
PAGE_SIZE of 4K where as powerpc uses PAGE_SIZE of 64k while NVMe device's
page size varies depending upon the vendor.

Signed-off-by: Murali Iyer <mniyer@us.ibm.com>
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-31 10:39:56 -06:00
Keith Busch 1efccc9ddb NVMe: Fix blk-mq hot cpu notification
The driver may issue commands to a device that may never return, so its
request_queue could always have active requests while the controller is
running. Waiting for the queue to freeze could block forever, which is
what blk-mq's hot cpu notification handler was doing when nvme drives
were in use.

This has the nvme driver make the asynchronous event command's tag
reserved and does not keep the request active. We can't have more than
one since the request is released back to the request_queue before the
command is completed. Having only one avoids potential tag collisions,
and reserving the tag for this purpose prevents other admin tasks from
reusing the tag.

I also couldn't think of a scenario where issuing AEN requests single
depth is worse than issuing them in batches, so I don't think we lose
anything with this change.

As an added bonus, doing it this way removes "Cancelling I/O" warnings
observed when unbinding the nvme driver from a device.

Reported-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-31 10:39:56 -06:00
Chong Yuan fda631ffe5 NVMe: embedded iod mask cleanup
Remove unused mask in nvme_alloc_iod

Signed-off-by: Chong Yuan <chong.yuan@memblaze.com>
Reviewed-by: Wenbo Wang  <wenbo.wang@memblaze.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-31 10:36:41 -06:00
Keith Busch 6df3dbc83f NVMe: Freeze admin queue on device failure
This fixes a race accessing an invalid address when a controller's admin
queue is in use during a reset for failure or hot removal occurs. The
admin queue will be frozen to prevent new users from entering prior to
the doorbell queue being unmapped.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-31 10:36:06 -06:00
David Rientjes cbc4ffdbe7 block, drbd: use mempool_create_slab_pool()
Mempools created for slab caches should use
mempool_create_slab_pool().

Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-24 20:00:18 -06:00
David Rientjes 23fe8f8b11 block, drbd: fix drbd_req_new() initialization
mempool_alloc() does not support __GFP_ZERO since elements may come from
memory that has already been released by mempool_free().

Remove __GFP_ZERO from mempool_alloc() in drbd_req_new() and properly
initialize it to 0.

Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-24 20:00:16 -06:00
Jens Axboe b8be79b714 NBD fixes based on v4.0-rc1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJU+As6AAoJEEpcgKtcEGQQr5UQAJeNbwo3t45u+ftmL+rzznqr
 ospLlVirYqKcB3IZk8WUZ+xWtP2EyzJ6qiTPdYRC2lvgQMDiXTpqYkbi4lNCUr0V
 CBMVSWuOwlZOwlD0Cxr5L6m2A0X+M8aGI1OUISYZuogLaHxjgPRQiEs5vccQKcoF
 1ZHBg0Lguk+alfTLxEtddDfzETx72I+CRaJyxa/g9SNk91C1R8U/+YT6g9ryrF9H
 8bJ5K4p5LYJDecIZMTNTFCndNa1fFZmynDv4cvla0MOfrOwA+lSaIFdCtbSPEbUu
 WHjXzwDlsIKGjJY1Pt7YfND1B+khFSVS9Cgh11hQ0nOer6y3WRIxWg0Jyo7m+PUz
 W4J7Pd1sHr/IgPLLOOI0o21n+Glkqp/wh5O/mDEP0ZEWMpm4pdd2R+UX76Fv62QT
 YCKcYl/0oxbSPjLBKWZLw2hhmd/NRvU4GM7r/mHtQvOxrHEQf5HW9p4ZlQvU7GV3
 kSYZwbMZo/FsiP1KCC/VSv3l+BQJMcZRlBRg0pzSJTvZmGEaBSvydLd7lsV05OsJ
 txmmDw7Xd14rT8Uav83i9olXuZGX8tK+QDOdQxY7SMXmEiMWre1B4pvJu10lXrmo
 tmZ9zPEQyWdcsuf/og8coSk3QKQlqHHmlcdtz9WZKqUHbvE0l4g8GWS9OWBdyGR5
 RXpfcbwvd396f9JXxxob
 =ttVz
 -----END PGP SIGNATURE-----

Merge tag 'nbd_fixes_20150305' of git://git.pengutronix.de/git/mpa/linux-nbd into for-linus

NBD fixes based on v4.0-rc1
2015-03-05 07:46:13 -07:00
Sudip Mukherjee ff6b8090e2 nbd: fix possible memory leak
we have already allocated memory for nbd_dev, but we were not
releasing that memory and just returning the error value.

Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Acked-by: Paul Clements <Paul.Clements@SteelEye.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
2015-03-05 08:51:03 +01:00
Linus Torvalds a015d33c98 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "Two smaller fixes for this cycle:

   - A fixup from Keith so that NVMe compiles without BLK_INTEGRITY,
     basically just moving the code around appropriately.

   - A fixup for shm, fixing an oops in shmem_mapping() for mapping with
     no inode.  From Sasha"

[ The shmem fix doesn't look block-layer-related, but fixes a bug that
  happened due to the backing_dev_info removal..  - Linus ]

* 'for-linus' of git://git.kernel.dk/linux-block:
  mm: shmem: check for mapping owner before dereferencing
  NVMe: Fix for BLK_DEV_INTEGRITY not set
2015-02-28 10:21:57 -08:00
Joonsoo Kim 2ea55a2cae zram: use proper type to update max_used_pages
max_used_pages is defined as atomic_long_t so we need to use unsigned
long to keep temporary value for it rather than int which is smaller
than unsigned long in a 64 bit system.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-28 09:57:51 -08:00
Keith Busch 52b68d7ef8 NVMe: Fix for BLK_DEV_INTEGRITY not set
Need to define and use appropriate functions for when BLK_DEV_INTEGRITY
is not set.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-02-23 09:17:54 -08:00
Jens Axboe decf6d79de Merge branch 'for-3.20' of git://git.infradead.org/users/kbusch/linux-nvme into for-linus
Merge 3.20 NVMe changes from Keith.
2015-02-20 22:12:02 -08:00
Keith Busch 0c0f9b95c8 NVMe: Fix potential corruption on sync commands
This makes all sync commands uninterruptible and schedules without timeout
so the controller either has to post a completion or the timeout recovery
fails the command. This fixes potential memory or data corruption from
a command timing out too early or woken by a signal. Previously any DMA
buffers mapped for that command would have been released even though we
don't know what the controller is planning to do with those addresses.

Signed-off-by: Keith Busch <keith.busch@intel.com>
2015-02-19 16:15:38 -07:00
Keith Busch 4832851840 NVMe: Remove unused variables
We don't track queues in a llist, subscribe to hot-cpu notifications,
or internally retry commands. Delete the unused artifacts.

Signed-off-by: Keith Busch <keith.busch@intel.com>
2015-02-19 16:15:38 -07:00
Keith Busch 9ac16938ab NVMe: Fix scsi mode select llbaa setting
It should be a logical bitwise AND, not conditional.

Signed-off-by: Keith Busch <keith.busch@intel.com>
2015-02-19 16:15:37 -07:00
Keith Busch 07836e659c NVMe: Fix potential corruption during shutdown
The driver has to end unreturned commands at some point even if the
controller has not provided a completion. The driver tried to be safe by
deleting IO queues prior to ending all unreturned commands. That should
cause the controller to internally abort inflight commands, but IO queue
deletion request does not have to be successful, so all bets are off. We
still have to make progress, so to be extra safe, this patch doesn't
clear a queue to release the dma mapping for a command until after the
pci device has been disabled.

This patch removes the special handling during device initialization
so controller recovery can be done all the time. This is possible since
initialization is not inlined with pci probe anymore.

Reported-by: Nilish Choudhury <nilesh.choudhury@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
2015-02-19 16:15:37 -07:00
Keith Busch 2e1d844819 NVMe: Asynchronous controller probe
This performs the longest parts of nvme device probe in scheduled work.
This speeds up probe significantly when multiple devices are in use.

Signed-off-by: Keith Busch <keith.busch@intel.com>
2015-02-19 16:15:36 -07:00
Keith Busch b3fffdefab NVMe: Register management handle under nvme class
This creates a new class type for nvme devices to register their
management character devices with. This is so we do not rely on miscdev
to provide enough minors for as many nvme devices some people plan to
use. The previous limit was approximately 60 NVMe controllers, depending
on the platform and kernel. Now the limit is 1M, which ought to be enough
for anybody.

Since we have a new device class, it makes sense to attach the block
devices under this as well, so part of this patch moves the management
handle initialization prior to the namespaces discovery.

Signed-off-by: Keith Busch <keith.busch@intel.com>
2015-02-19 16:15:36 -07:00
Keith Busch 4f1982b4e2 NVMe: Update SCSI Inquiry VPD 83h translation
The original translation created collisions on Inquiry VPD 83 for many
existing devices. Newer specifications provide other ways to translate
based on the device's version can be used to create unique identifiers.

Version 1.1 provides an EUI64 field that uniquely identifies each
namespace, and 1.2 added the longer NGUID field for the same reason.
Both follow the IEEE EUI format and readily translate to the SCSI device
identification EUI designator type 2h. For devices implementing either,
the translation will use this type, defaulting to the EUI64 8-byte type if
implemented then NGUID's 16 byte version if not. If neither are provided,
the 1.0 translation is used, and is updated to use the SCSI String format
to guarantee a unique identifier.

Knowing when to use the new fields depends on the nvme controller's
revision. The NVME_VS macro was not decoding this correctly, so that is
fixed in this patch and moved to a more appropriate place.

Since the Identify Namespace structure required an update for the NGUID
field, this patch adds the remaining new 1.2 fields to the structure.

Signed-off-by: Keith Busch <keith.busch@intel.com>
2015-02-19 16:15:35 -07:00
Keith Busch e1e5e5641e NVMe: Metadata format support
Adds support for NVMe metadata formats and exposes block devices for
all namespaces regardless of their format. Namespace formats that are
unusable will have disk capacity set to 0, but a handle to the block
device is created to simplify device management. A namespace is not
usable when the format requires host interleave block and metadata in
single buffer, has no provisioned storage, or has better data but failed
to register with blk integrity.

The namespace has to be scanned in two phases to support separate
metadata formats. The first establishes the sector size and capacity
prior to invoking add_disk. If metadata is required, the capacity will
be temporarilly set to 0 until it can be revalidated and registered with
the integrity extenstions after add_disk completes.

The driver relies on the integrity extensions to provide the metadata
buffer. NVMe requires this be a single physically contiguous region,
so only one integrity segment is allowed per command. If the metadata
is used for T10 PI, the driver provides mappings to save and restore
the reftag physical block translation. The driver provides no-op
functions for generate and verify if metadata is not used for protection
information. This way the setup is always provided by the block layer.

If a request does not supply a required metadata buffer, the command
is failed with bad address. This could only happen if a user manually
disables verify/generate on such a disk. The only exception to where
this is okay is if the controller is capable of stripping/generating
the metadata, which is possible on some types of formats.

The metadata scatter gather list now occupies the spot in the nvme_iod
that used to be used to link retryable IOD's, but we don't do that
anymore, so the field was unused.

Signed-off-by: Keith Busch <keith.busch@intel.com>
2015-02-19 16:15:35 -07:00
Linus Torvalds 4533f6e27a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph changes from Sage Weil:
 "On the RBD side, there is a conversion to blk-mq from Christoph,
  several long-standing bug fixes from Ilya, and some cleanup from
  Rickard Strandqvist.

  On the CephFS side there is a long list of fixes from Zheng, including
  improved session handling, a few IO path fixes, some dcache management
  correctness fixes, and several blocking while !TASK_RUNNING fixes.

  The core code gets a few cleanups and Chaitanya has added support for
  TCP_NODELAY (which has been used on the server side for ages but we
  somehow missed on the kernel client).

  There is also an update to MAINTAINERS to fix up some email addresses
  and reflect that Ilya and Zheng are doing most of the maintenance for
  RBD and CephFS these days.  Do not be surprised to see a pull request
  come from one of them in the future if I am unavailable for some
  reason"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits)
  MAINTAINERS: update Ceph and RBD maintainers
  libceph: kfree() in put_osd() shouldn't depend on authorizer
  libceph: fix double __remove_osd() problem
  rbd: convert to blk-mq
  ceph: return error for traceless reply race
  ceph: fix dentry leaks
  ceph: re-send requests when MDS enters reconnecting stage
  ceph: show nocephx_require_signatures and notcp_nodelay options
  libceph: tcp_nodelay support
  rbd: do not treat standalone as flatten
  ceph: fix atomic_open snapdir
  ceph: properly mark empty directory as complete
  client: include kernel version in client metadata
  ceph: provide seperate {inode,file}_operations for snapdir
  ceph: fix request time stamp encoding
  ceph: fix reading inline data when i_size > PAGE_SIZE
  ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions)
  ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps)
  ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync)
  rbd: fix error paths in rbd_dev_refresh()
  ...
2015-02-19 14:14:42 -08:00
Christoph Hellwig 7ad18afad0 rbd: convert to blk-mq
This converts the rbd driver to use the blk-mq infrastructure.  Except
for switching to a per-request work item this is almost mechanical.

This was tested by Alexandre DERUMIER in November, and found to give
him 120000 iops, although the only comparism available was an old
3.10 kernel which gave 80000iops.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <elder@linaro.org>
[idryomov@gmail.com: context, blk_mq_init_queue() EH]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-02-19 14:27:42 +03:00
Ilya Dryomov cf32bd9c86 rbd: do not treat standalone as flatten
If the clone is resized down to 0, it becomes standalone.  If such
resize is carried over while an image is mapped we would detect this
and call rbd_dev_parent_put() which means "let go of all parent state,
including the spec(s) of parent images(s)".  This leads to a mismatch
between "rbd info" and sysfs parent fields, so a fix is in order.

    # rbd create --image-format 2 --size 1 foo
    # rbd snap create foo@snap
    # rbd snap protect foo@snap
    # rbd clone foo@snap bar
    # DEV=$(rbd map bar)
    # rbd resize --allow-shrink --size 0 bar
    # rbd resize --size 1 bar
    # rbd info bar | grep parent
            parent: rbd/foo@snap

Before:

    # cat /sys/bus/rbd/devices/0/parent
    (no parent image)

After:

    # cat /sys/bus/rbd/devices/0/parent
    pool_id 0
    pool_name rbd
    image_id 10056b8b4567
    image_name foo
    snap_id 2
    snap_name snap
    overlap 0

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-02-19 13:31:39 +03:00
Ilya Dryomov 73e39e4dba rbd: fix error paths in rbd_dev_refresh()
header_rwsem should be released on errors.  Also remove useless
rbd_dev->mapping.size != rbd_dev->header.image_size test.

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-02-19 13:31:38 +03:00
Rickard Strandqvist 3a25cf43e0 rbd: nuke copy_token()
It's been largely superseded by dup_token() and unused for over
2 years, identified by cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
[idryomov@redhat.com: changelog]
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-02-19 13:31:38 +03:00
Linus Torvalds 53861af9a1 OK, this has the big virtio 1.0 implementation, as specified by OASIS.
On top of tht is the major rework of lguest, to use PCI and virtio 1.0, to
 double-check the implementation.
 
 Then comes the inevitable fixes and cleanups from that work.
 
 Thanks,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJU5B9cAAoJENkgDmzRrbjxPacP/jajliXX353JJ/g/hkZ6oDN5
 o7FhELBKiUMr7enVZYwj2BBYk5OM36nB9pQkiqHMSbjJGoS5IK70enxb4YRxSHBn
 YCLblZMNqutGS0kclZ9DDysztjAhxH7CvLM6pMZ7eHP0f3+FM/QhbxHfbG9DTBUH
 2U/nybvd3M/+YBe7ptwQdrH8aOCAD6RTIsXellfm99dNMK6K/5lqnWQ98WSXmNXq
 vyvdaAQsqqUkmxtajjcBumaCH4/SehOJJjUqojCMsR3aBkgOBWDZJURMek+KA5Dt
 X996fBsTAlvTtCUKRrmLTb2ScDH7fu+jwbWRqMYDk8zpEr3XqiLTTPV4/TiHGmi7
 Wiw3g1wIY1YbETlZyongB5MIoVyUfmDAd+bT8nBsj3KIITD84gOUQFDMl6d63c0I
 z6A9Pu/UzpJGsXZT3WoFLi6TO67QyhOseqZnhS4wBgLabjxffNM7yov9RVKUVH/n
 JHunnpUk2iTtSgscBarOBz5867dstuurnaUIspZthVBo6y6N0z+GrU+agJ8Y4DXx
 mvwzeYLhQH2208PjxPFiah/kA/gHNm1m678TbpS+CUsgmpQiJ4gTwtazDSi4TwZY
 Hs9T9GulkzpZIzEyKL3qG2TsfyDhW5Avn+GvKInAT9+Fkig4BnP3DUONBxcwGZ78
 eI3FDUWsE36NqE5ECWmz
 =ivCe
 -----END PGP SIGNATURE-----

Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull virtio updates from Rusty Russell:
 "OK, this has the big virtio 1.0 implementation, as specified by OASIS.

  On top of tht is the major rework of lguest, to use PCI and virtio
  1.0, to double-check the implementation.

  Then comes the inevitable fixes and cleanups from that work"

* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (80 commits)
  virtio: don't set VIRTIO_CONFIG_S_DRIVER_OK twice.
  virtio_net: unconditionally define struct virtio_net_hdr_v1.
  tools/lguest: don't use legacy definitions for net device in example launcher.
  virtio: Don't expose legacy net features when VIRTIO_NET_NO_LEGACY defined.
  tools/lguest: use common error macros in the example launcher.
  tools/lguest: give virtqueues names for better error messages
  tools/lguest: more documentation and checking of virtio 1.0 compliance.
  lguest: don't look in console features to find emerg_wr.
  tools/lguest: don't start devices until DRIVER_OK status set.
  tools/lguest: handle indirect partway through chain.
  tools/lguest: insert driver references from the 1.0 spec (4.1 Virtio Over PCI)
  tools/lguest: insert device references from the 1.0 spec (4.1 Virtio Over PCI)
  tools/lguest: rename virtio_pci_cfg_cap field to match spec.
  tools/lguest: fix features_accepted logic in example launcher.
  tools/lguest: handle device reset correctly in example launcher.
  virtual: Documentation: simplify and generalize paravirt_ops.txt
  lguest: remove NOTIFY call and eventfd facility.
  lguest: remove NOTIFY facility from demonstration launcher.
  lguest: use the PCI console device's emerg_wr for early boot messages.
  lguest: always put console in PCI slot #1.
  ...
2015-02-18 09:24:01 -08:00
Matthew Wilcox a7a97fc9ff brd: rename XIP to DAX
Since this is relating to FS_XIP, not KERNEL_XIP, it should be called
DAX instead of XIP.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:04 -08:00
Linus Torvalds 818099574b Merge branch 'akpm' (patches from Andrew)
Merge third set of updates from Andrew Morton:

 - the rest of MM

   [ This includes getting rid of the numa hinting bits, in favor of
     just generic protnone logic.  Yay.     - Linus ]

 - core kernel

 - procfs

 - some of lib/ (lots of lib/ material this time)

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (104 commits)
  lib/lcm.c: replace include
  lib/percpu_ida.c: remove redundant includes
  lib/strncpy_from_user.c: replace module.h include
  lib/stmp_device.c: replace module.h include
  lib/sort.c: move include inside #if 0
  lib/show_mem.c: remove redundant include
  lib/radix-tree.c: change to simpler include
  lib/plist.c: remove redundant include
  lib/nlattr.c: remove redundant include
  lib/kobject_uevent.c: remove redundant include
  lib/llist.c: remove redundant include
  lib/md5.c: simplify include
  lib/list_sort.c: rearrange includes
  lib/genalloc.c: remove redundant include
  lib/idr.c: remove redundant include
  lib/halfmd4.c: simplify includes
  lib/dynamic_queue_limits.c: simplify includes
  lib/sort.c: use simpler includes
  lib/interval_tree.c: simplify includes
  hexdump: make it return number of bytes placed in buffer
  ...
2015-02-12 18:54:28 -08:00
Rasmus Villemoes 02f1f2170d kernel.h: remove ancient __FUNCTION__ hack
__FUNCTION__ hasn't been treated as a string literal since gcc 3.4, so
this only helps people who only test-compile using 3.3 (compiler-gcc3.h
barks at anything older than that).  Besides, there are almost no
occurrences of __FUNCTION__ left in the tree.

[akpm@linux-foundation.org: convert remaining __FUNCTION__ references]
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:13 -08:00
Ganesh Mahendran 3eba0c6a56 mm/zpool: add name argument to create zpool
Currently the underlay of zpool: zsmalloc/zbud, do not know who creates
them.  There is not a method to let zsmalloc/zbud find which caller they
belong to.

Now we want to add statistics collection in zsmalloc.  We need to name the
debugfs dir for each pool created.  The way suggested by Minchan Kim is to
use a name passed by caller(such as zram) to create the zsmalloc pool.

    /sys/kernel/debug/zsmalloc/zram0

This patch adds an argument `name' to zs_create_pool() and other related
functions.

Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:12 -08:00
Sergey Senozhatsky ee98016010 zram: remove request_queue from struct zram
`struct zram' contains both `struct gendisk' and `struct request_queue'.
the latter can be deleted, because zram->disk carries ->queue pointer, and
->queue carries zram pointer:

create_device()
	zram->queue->queuedata = zram
	zram->disk->queue = zram->queue
	zram->disk->private_data = zram

so zram->queue is not needed, we can access all necessary data anyway.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:12 -08:00
Minchan Kim 08eee69fcf zram: remove init_lock in zram_make_request
Admin could reset zram during I/O operation going on so we have used
zram->init_lock as read-side lock in I/O path to prevent sudden zram
meta freeing.

However, the init_lock is really troublesome.  We can't do call
zram_meta_alloc under init_lock due to lockdep splat because
zram_rw_page is one of the function under reclaim path and hold it as
read_lock while other places in process context hold it as write_lock.
So, we have used allocation out of the lock to avoid lockdep warn but
it's not good for readability and fainally, I met another lockdep splat
between init_lock and cpu_hotplug from kmem_cache_destroy during working
zsmalloc compaction.  :(

Yes, the ideal is to remove horrible init_lock of zram in rw path.  This
patch removes it in rw path and instead, add atomic refcount for meta
lifetime management and completion to free meta in process context.
It's important to free meta in process context because some of resource
destruction needs mutex lock, which could be held if we releases the
resource in reclaim context so it's deadlock, again.

As a bonus, we could remove init_done check in rw path because
zram_meta_get will do a role for it, instead.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:12 -08:00
Minchan Kim 2b269ce6fc zram: check bd_openers instead of bd_holders
bd_holders is increased only when user open the device file as FMODE_EXCL
so if something opens zram0 as !FMODE_EXCL and request I/O while another
user reset zram0, we can see following warning.

  zram0: detected capacity change from 0 to 64424509440
  Buffer I/O error on dev zram0, logical block 180823, lost async page write
  Buffer I/O error on dev zram0, logical block 180824, lost async page write
  Buffer I/O error on dev zram0, logical block 180825, lost async page write
  Buffer I/O error on dev zram0, logical block 180826, lost async page write
  Buffer I/O error on dev zram0, logical block 180827, lost async page write
  Buffer I/O error on dev zram0, logical block 180828, lost async page write
  Buffer I/O error on dev zram0, logical block 180829, lost async page write
  Buffer I/O error on dev zram0, logical block 180830, lost async page write
  Buffer I/O error on dev zram0, logical block 180831, lost async page write
  Buffer I/O error on dev zram0, logical block 180832, lost async page write
  ------------[ cut here ]------------
  WARNING: CPU: 11 PID: 1996 at fs/block_dev.c:57 __blkdev_put+0x1d7/0x210()
  Modules linked in:
  CPU: 11 PID: 1996 Comm: dd Not tainted 3.19.0-rc6-next-20150202+ #1125
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Call Trace:
    dump_stack+0x45/0x57
    warn_slowpath_common+0x8a/0xc0
    warn_slowpath_null+0x1a/0x20
    __blkdev_put+0x1d7/0x210
    blkdev_put+0x50/0x130
    blkdev_close+0x25/0x30
    __fput+0xdf/0x1e0
    ____fput+0xe/0x10
    task_work_run+0xa7/0xe0
    do_notify_resume+0x49/0x60
    int_signal+0x12/0x17
  ---[ end trace 274fbbc5664827d2 ]---

The warning comes from bdev_write_node in blkdev_put path.

   static void bdev_write_inode(struct inode *inode)
   {
        spin_lock(&inode->i_lock);
        while (inode->i_state & I_DIRTY) {
                spin_unlock(&inode->i_lock);
                WARN_ON_ONCE(write_inode_now(inode, true)); <========= here.
                spin_lock(&inode->i_lock);
        }
        spin_unlock(&inode->i_lock);
   }

The reason is dd process encounters I/O fails due to sudden block device
disappear so in filemap_check_errors in __writeback_single_inode returns
-EIO.

If we check bd_openers instead of bd_holders, we could address the
problem.  When I see the brd, it already have used it rather than
bd_holders so although I'm not a expert of block layer, it seems to be
better.

I can make following warning with below simple script.  In addition, I
added msleep(2000) below set_capacity(zram->disk, 0) after applying your
patch to make window huge(Kudos to Ganesh!)

script:

   echo $((60<<30)) > /sys/block/zram0/disksize
   setsid dd if=/dev/zero of=/dev/zram0 &
   sleep 1
   setsid echo 1 > /sys/block/zram0/reset

Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:11 -08:00
Sergey Senozhatsky a096cafc31 zram: rework reset and destroy path
We need to return set_capacity(disk, 0) from reset_store() back to
zram_reset_device(), a catch by Ganesh Mahendran.  Potentially, we can
race set_capacity() calls from init and reset paths.

The problem is that zram_reset_device() is also getting called from
zram_exit(), which performs operations in misleading reversed order -- we
first create_device() and then init it, while zram_exit() perform
destroy_device() first and then does zram_reset_device().  This is done to
remove sysfs group before we reset device, so we can continue with device
reset/destruction not being raced by sysfs attr write (f.e.  disksize).

Apart from that, destroy_device() releases zram->disk (but we still have
->disk pointer), so we cannot acces zram->disk in later
zram_reset_device() call, which may cause additional errors in the future.

So, this patch rework and cleanup destroy path.

1) remove several unneeded goto labels in zram_init()

2) factor out zram_init() error path and zram_exit() into
   destroy_devices() function, which takes the number of devices to
   destroy as its argument.

3) remove sysfs group in destroy_devices() first, so we can reorder
   operations -- reset device (as expected) goes before disk destroy and
   queue cleanup.  So we can always access ->disk in zram_reset_device().

4) and, finally, return set_capacity() back under ->init_lock.

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:11 -08:00
Sergey Senozhatsky ba6b17d68c zram: fix umount-reset_store-mount race condition
Ganesh Mahendran was the first one who proposed to use bdev->bd_mutex to
avoid ->bd_holders race condition:

        CPU0                            CPU1
umount /* zram->init_done is true */
reset_store()
bdev->bd_holders == 0                   mount
...                                     zram_make_request()
zram_reset_device()

However, his solution required some considerable amount of code movement,
which we can avoid.

Apart from using bdev->bd_mutex in reset_store(), this patch also
simplifies zram_reset_device().

zram_reset_device() has a bool parameter reset_capacity which tells it
whether disk capacity and itself disk should be reset.  There are two
zram_reset_device() callers:

-- zram_exit() passes reset_capacity=false
-- reset_store() passes reset_capacity=true

So we can move reset_capacity-sensitive work out of zram_reset_device()
and perform it unconditionally in reset_store().  This also lets us drop
reset_capacity parameter from zram_reset_device() and pass zram pointer
only.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:11 -08:00
Ganesh Mahendran 1fec117281 zram: free meta table in zram_meta_free
zram_meta_alloc() and zram_meta_free() are a pair.  In
zram_meta_alloc(), meta table is allocated.  So it it better to free it
in zram_meta_free().

Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:11 -08:00
Sergey Senozhatsky b817995832 zram: clean up zram_meta_alloc()
A trivial cleanup of zram_meta_alloc() error handling.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:11 -08:00
Linus Torvalds 8494bcf5b7 Merge branch 'for-3.20/drivers' of git://git.kernel.dk/linux-block
Pull block driver changes from Jens Axboe:
 "This contains:

   - The 4k/partition fixes for brd from Boaz/Matthew.

   - A few xen front/back block fixes from David Vrabel and Roger Pau
     Monne.

   - Floppy changes from Takashi, cleaning the device file creation.

   - Switching libata to use the new blk-mq tagging policy, removing
     code (and a suboptimal implementation) from libata.  This will
     throw you a merge conflict, since a bug in the original libata
     tagging code was fixed since this code was branched.  Trivial.
     From Shaohua.

   - Conversion of loop to blk-mq, from Ming Lei.

   - Cleanup of the io_schedule() handling in bsg from Peter Zijlstra.
     He claims it improves on unreadable code, which will cost him a
     beer.

   - Maintainer update or NDB, now handled by Markus Pargmann.

   - NVMe:
        - Optimization from me that avoids a kmalloc/kfree per IO for
          smaller (<= 8KB) IO. This cuts about 1% of high IOPS CPU
          overhead.
        - Removal of (now) dead RCU code, a relic from before NVMe was
          converted to blk-mq"

* 'for-3.20/drivers' of git://git.kernel.dk/linux-block:
  xen-blkback: default to X86_32 ABI on x86
  xen-blkfront: fix accounting of reqs when migrating
  xen-blkback,xen-blkfront: add myself as maintainer
  block: Simplify bsg complete all
  floppy: Avoid manual call of device_create_file()
  NVMe: avoid kmalloc/kfree for smaller IO
  MAINTAINERS: Update NBD maintainer
  libata: make sata_sil24 use fifo tag allocator
  libata: move sas ata tag allocation to libata-scsi.c
  libata: use blk taging
  NVMe: within nvme_free_queues(), delete RCU sychro/deferred free
  null_blk: suppress invalid partition info
  brd: Request from fdisk 4k alignment
  brd: Fix all partitions BUGs
  axonram: Fix bug in direct_access
  loop: add blk-mq.h include
  block: loop: don't handle REQ_FUA explicitly
  block: loop: introduce lo_discard() and lo_req_flush()
  block: loop: say goodby to bio
  block: loop: improve performance via blk-mq
2015-02-12 14:30:53 -08:00