linux/fs/anon_inodes.c
Sean Christopherson a7800aa80e KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.

A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem.  With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings.   E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allowed guest protection.  Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that's suboptimal on multiple fronts.
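
The two-mapping workaround described above can be sketched in userspace
as follows.  This is only an illustration under stated assumptions: the
helper name map_guest_region_twice() and the memfd name are hypothetical,
and the step of registering the writable view with KVM as a memslot is
omitted entirely.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of KVM or of this patch): back a region
 * with a memfd and map it twice, once writable (the view a VMM would hand
 * to KVM) and once read-only (the view the VMM keeps for itself).
 * Returns 0 on success, -1 on failure.
 */
static int map_guest_region_twice(size_t len, void **guest_view,
				  const void **vmm_view)
{
	void *rw, *ro;
	int fd;

	assert(len > 0);

	fd = memfd_create("guest-region-demo", MFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0)
		goto err_close;

	/* Writable view: VMA protections allow guest writes. */
	rw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (rw == MAP_FAILED)
		goto err_close;

	/* Read-only view of the same pages for the VMM's own accesses. */
	ro = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (ro == MAP_FAILED) {
		munmap(rw, len);
		goto err_close;
	}

	close(fd);	/* the mappings keep the backing memory alive */
	*guest_view = rw;
	*vmm_view = ro;
	return 0;

err_close:
	close(fd);
	return -1;
}
```

Both views alias the same pages, so a store through the writable view is
visible through the read-only one.  The cost is a second VMA per region
and the bookkeeping to keep the two in sync, which is part of why the
text above calls the approach suboptimal.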

Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn't support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping.  Decoupling the mapping sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.

Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.

A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).

More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd.  While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption.  And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must prevent host
userspace from accessing guest memory irrespective of hardware behavior.

Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.

Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping.  And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.

Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory.  That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to its demise.

Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem.  I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.

Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay.  Paraphrasing heavily, Christian suggested KVM
stop being lazy and create a proper API.

Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-14 08:01:03 -05:00

// SPDX-License-Identifier: GPL-2.0-only
/*
 *  fs/anon_inodes.c
 *
 *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
 *
 *  Thanks to Arnd Bergmann for code review and suggestions.
 *  More changes for Thomas Gleixner suggestions.
 *
 */

#include <linux/cred.h>
#include <linux/file.h>
#include <linux/poll.h>
#include <linux/sched.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/mount.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/magic.h>
#include <linux/anon_inodes.h>
#include <linux/pseudo_fs.h>
#include <linux/uaccess.h>

static struct vfsmount *anon_inode_mnt __read_mostly;
static struct inode *anon_inode_inode;

/*
 * anon_inodefs_dname() is called from d_path().
 */
static char *anon_inodefs_dname(struct dentry *dentry, char *buffer, int buflen)
{
	return dynamic_dname(buffer, buflen, "anon_inode:%s",
				dentry->d_name.name);
}

static const struct dentry_operations anon_inodefs_dentry_operations = {
	.d_dname	= anon_inodefs_dname,
};

static int anon_inodefs_init_fs_context(struct fs_context *fc)
{
	struct pseudo_fs_context *ctx = init_pseudo(fc, ANON_INODE_FS_MAGIC);
	if (!ctx)
		return -ENOMEM;
	ctx->dops = &anon_inodefs_dentry_operations;
	return 0;
}

static struct file_system_type anon_inode_fs_type = {
	.name		= "anon_inodefs",
	.init_fs_context = anon_inodefs_init_fs_context,
	.kill_sb	= kill_anon_super,
};

static struct inode *anon_inode_make_secure_inode(
	const char *name,
	const struct inode *context_inode)
{
	struct inode *inode;
	const struct qstr qname = QSTR_INIT(name, strlen(name));
	int error;

	inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
	if (IS_ERR(inode))
		return inode;
	inode->i_flags &= ~S_PRIVATE;
	error = security_inode_init_security_anon(inode, &qname, context_inode);
	if (error) {
		iput(inode);
		return ERR_PTR(error);
	}
	return inode;
}

static struct file *__anon_inode_getfile(const char *name,
					 const struct file_operations *fops,
					 void *priv, int flags,
					 const struct inode *context_inode,
					 bool make_inode)
{
	struct inode *inode;
	struct file *file;

	if (fops->owner && !try_module_get(fops->owner))
		return ERR_PTR(-ENOENT);

	if (make_inode) {
		inode =	anon_inode_make_secure_inode(name, context_inode);
		if (IS_ERR(inode)) {
			file = ERR_CAST(inode);
			goto err;
		}
	} else {
		inode =	anon_inode_inode;
		if (IS_ERR(inode)) {
			file = ERR_PTR(-ENODEV);
			goto err;
		}
		/*
		 * We know the anon_inode inode count is always
		 * greater than zero, so ihold() is safe.
		 */
		ihold(inode);
	}

	file = alloc_file_pseudo(inode, anon_inode_mnt, name,
				 flags & (O_ACCMODE | O_NONBLOCK), fops);
	if (IS_ERR(file))
		goto err_iput;

	file->f_mapping = inode->i_mapping;

	file->private_data = priv;

	return file;

err_iput:
	iput(inode);
err:
	module_put(fops->owner);
	return file;
}

/**
 * anon_inode_getfile - creates a new file instance by hooking it up to an
 *                      anonymous inode, and a dentry that describe the "class"
 *                      of the file
 *
 * @name:    [in]    name of the "class" of the new file
 * @fops:    [in]    file operations for the new file
 * @priv:    [in]    private data for the new file (will be file's private_data)
 * @flags:   [in]    flags
 *
 * Creates a new file by hooking it on a single inode. This is useful for files
 * that do not need to have a full-fledged inode in order to operate correctly.
 * All the files created with anon_inode_getfile() will share a single inode,
 * hence saving memory and avoiding code duplication for the file/inode/dentry
 * setup.  Returns the newly created file* or an error pointer.
 */
struct file *anon_inode_getfile(const char *name,
				const struct file_operations *fops,
				void *priv, int flags)
{
	return __anon_inode_getfile(name, fops, priv, flags, NULL, false);
}
EXPORT_SYMBOL_GPL(anon_inode_getfile);

/**
 * anon_inode_create_getfile - Like anon_inode_getfile(), but creates a new
 *                             !S_PRIVATE anon inode rather than reuse the
 *                             singleton anon inode and calls the
 *                             inode_init_security_anon() LSM hook.
 *
 * @name:    [in]    name of the "class" of the new file
 * @fops:    [in]    file operations for the new file
 * @priv:    [in]    private data for the new file (will be file's private_data)
 * @flags:   [in]    flags
 * @context_inode:
 *           [in]    the logical relationship with the new inode (optional)
 *
 * Create a new anonymous inode and file pair.  This can be done for two
 * reasons:
 *
 * - for the inode to have its own security context, so that LSMs can enforce
 *   policy on the inode's creation;
 *
 * - if the caller needs a unique inode, for example in order to customize
 *   the size returned by fstat()
 *
 * The LSM may use @context_inode in inode_init_security_anon(), but a
 * reference to it is not held.
 *
 * Returns the newly created file* or an error pointer.
 */
struct file *anon_inode_create_getfile(const char *name,
				       const struct file_operations *fops,
				       void *priv, int flags,
				       const struct inode *context_inode)
{
	return __anon_inode_getfile(name, fops, priv, flags,
				    context_inode, true);
}
EXPORT_SYMBOL_GPL(anon_inode_create_getfile);

static int __anon_inode_getfd(const char *name,
			      const struct file_operations *fops,
			      void *priv, int flags,
			      const struct inode *context_inode,
			      bool make_inode)
{
	int error, fd;
	struct file *file;

	error = get_unused_fd_flags(flags);
	if (error < 0)
		return error;
	fd = error;

	file = __anon_inode_getfile(name, fops, priv, flags, context_inode,
				    make_inode);
	if (IS_ERR(file)) {
		error = PTR_ERR(file);
		goto err_put_unused_fd;
	}
	fd_install(fd, file);

	return fd;

err_put_unused_fd:
	put_unused_fd(fd);
	return error;
}

/**
 * anon_inode_getfd - creates a new file instance by hooking it up to
 *                    an anonymous inode and a dentry that describe
 *                    the "class" of the file
 *
 * @name:    [in]    name of the "class" of the new file
 * @fops:    [in]    file operations for the new file
 * @priv:    [in]    private data for the new file (will be file's private_data)
 * @flags:   [in]    flags
 *
 * Creates a new file by hooking it on a single inode. This is
 * useful for files that do not need to have a full-fledged inode in
 * order to operate correctly.  All the files created with
 * anon_inode_getfd() will use the same singleton inode, reducing
 * memory use and avoiding code duplication for the file/inode/dentry
 * setup.  Returns a newly created file descriptor or an error code.
 */
int anon_inode_getfd(const char *name, const struct file_operations *fops,
		     void *priv, int flags)
{
	return __anon_inode_getfd(name, fops, priv, flags, NULL, false);
}
EXPORT_SYMBOL_GPL(anon_inode_getfd);

/**
 * anon_inode_create_getfd - Like anon_inode_getfd(), but creates a new
 * !S_PRIVATE anon inode rather than reuse the singleton anon inode, and calls
 * the inode_init_security_anon() LSM hook.
 *
 * @name:    [in]    name of the "class" of the new file
 * @fops:    [in]    file operations for the new file
 * @priv:    [in]    private data for the new file (will be file's private_data)
 * @flags:   [in]    flags
 * @context_inode:
 *           [in]    the logical relationship with the new inode (optional)
 *
 * Create a new anonymous inode and file pair.  This can be done for two
 * reasons:
 *
 * - for the inode to have its own security context, so that LSMs can enforce
 *   policy on the inode's creation;
 *
 * - if the caller needs a unique inode, for example in order to customize
 *   the size returned by fstat()
 *
 * The LSM may use @context_inode in inode_init_security_anon(), but a
 * reference to it is not held.
 *
 * Returns a newly created file descriptor or an error code.
 */
int anon_inode_create_getfd(const char *name, const struct file_operations *fops,
			    void *priv, int flags,
			    const struct inode *context_inode)
{
	return __anon_inode_getfd(name, fops, priv, flags, context_inode, true);
}

static int __init anon_inode_init(void)
{
	anon_inode_mnt = kern_mount(&anon_inode_fs_type);
	if (IS_ERR(anon_inode_mnt))
		panic("anon_inode_init() kernel mount failed (%ld)\n", PTR_ERR(anon_inode_mnt));

	anon_inode_inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
	if (IS_ERR(anon_inode_inode))
		panic("anon_inode_init() inode allocation failed (%ld)\n", PTR_ERR(anon_inode_inode));

	return 0;
}

fs_initcall(anon_inode_init);
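
The "anon_inode:%s" name built by anon_inodefs_dname() is visible from
userspace through /proc/self/fd.  The sketch below is a userspace
illustration, not part of this file: the helper names fd_link_name() and
fd_is_anon_inode() are hypothetical, and it assumes /proc is mounted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/eventfd.h>

/*
 * Hypothetical helper: resolve the /proc/self/fd/<fd> symlink into buf.
 * Returns the length of the link text, or -1 on failure.
 */
static ssize_t fd_link_name(int fd, char *buf, size_t buflen)
{
	char path[64];
	ssize_t n;

	assert(buflen > 0);
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	n = readlink(path, buf, buflen - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';	/* readlink() does not NUL-terminate */
	return n;
}

/*
 * Hypothetical helper: returns 1 if the fd's link text starts with the
 * "anon_inode:" prefix, 0 if it does not, and -1 on error.
 */
static int fd_is_anon_inode(int fd)
{
	static const char prefix[] = "anon_inode:";
	char name[128];

	if (fd_link_name(fd, name, sizeof(name)) < 0)
		return -1;
	return strncmp(name, prefix, sizeof(prefix) - 1) == 0;
}
```

For example, calling fd_is_anon_inode() on a descriptor returned by
eventfd(0, EFD_CLOEXEC) is expected to return 1, since eventfd files are
created through the anon inode helpers; the text after the prefix is the
"class" name passed as @name.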