When QEMU exposes a VirtIO-RNG device to the guest, that device needs a
source of entropy, and that source needs to be "non-blocking", like
`/dev/urandom`. However, currently QEMU defaults to the problematic
`/dev/random`, which on Linux is "blocking" (as in, it waits until
sufficient entropy is available).
Why prefer `/dev/urandom` over `/dev/random`?
---------------------------------------------
The man pages of urandom(4) and random(4) state:
"The /dev/random device is a legacy interface which dates back to a
time where the cryptographic primitives used in the implementation
of /dev/urandom were not widely trusted. It will return random
bytes only within the estimated number of bits of fresh noise in the
entropy pool, blocking if necessary. /dev/random is suitable for
applications that need high quality randomness, and can afford
indeterminate delays."
Further, the "Usage" section of the said man pages state:
"The /dev/random interface is considered a legacy interface, and
/dev/urandom is preferred and sufficient in all use cases, with the
exception of applications which require randomness during early boot
time; for these applications, getrandom(2) must be used instead,
because it will block until the entropy pool is initialized.
"If a seed file is saved across reboots as recommended below (all
major Linux distributions have done this since 2000 at least), the
output is cryptographically secure against attackers without local
root access as soon as it is reloaded in the boot sequence, and
perfectly adequate for network encryption session keys. Since reads
from /dev/random may block, users will usually want to open it in
nonblocking mode (or perform a read with timeout), and provide some
sort of user notification if the desired entropy is not immediately
available."
And refer to random(7) for a comparison of `/dev/random` and
`/dev/urandom`.
What about other OSes?
----------------------
`/dev/urandom` exists and works on OS-X, FreeBSD, DragonFlyBSD, NetBSD
and OpenBSD, which cover all the non-Linux platforms we explicitly
support, aside from Windows.
On Windows `/dev/random` doesn't work either so we don't regress.
This is actually another argument in favour of using the newly
proposed 'rng-builtin' backend by default, as that will work on
Windows.
- - -
Given the above, change the entropy source for VirtIO-RNG device to
`/dev/urandom`.
Related discussion in these[1][2] past threads.
[1] https://lists.nongnu.org/archive/html/qemu-devel/2018-06/msg08335.html
-- "RNG: Any reason QEMU doesn't default to `/dev/urandom`?"
[2] https://lists.nongnu.org/archive/html/qemu-devel/2018-09/msg02724.html
-- "[RFC] Virtio RNG: Consider changing the default entropy source to
/dev/urandom"
Signed-off-by: Kashyap Chamarthy <kchamart@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-Id: <20190529143106.11789-2-lvivier@redhat.com>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
* Fix build
* xen-block: support feature-large-sector-size
* xen-block: Support IOThread polling for PV shared rings
* Avoid usage of a VLA
* Cleanup Xen headers usage
-----BEGIN PGP SIGNATURE-----
iQFOBAABCgA4FiEE+AwAYwjiLP2KkueYDPVXL9f7Va8FAl0Q7JgaHGFudGhvbnku
cGVyYXJkQGNpdHJpeC5jb20ACgkQDPVXL9f7Va8B1wf/bL9gdT/1R9474ZfbWAGZ
KzkCo0C3jXUWRXd9z/UVwkmOhz0tLj1otx0fR+HFM4An+YAY6D0oZAKO9SCHhGDQ
XflAK74dw1ieuZI+3Q5PXQO5xM1Oz0J+3TGlOFdZlh5UD68mEzteGnzU/zzs7i4E
AZiKVOdO4YzMdHLVO4X/AqZH48n82FxjKcog7cZ9fTqDUz8SZGwJVSWocUZ0yOWb
uhAacvhwHeZj64NuNShyF/RM7jolTk4CZWJv8Gy9CPxOM7noQIv0ttwu+QWCmODg
pdTxd8HrE4rTnKaQFiHVas/AZ3cOfRw9RjdsARhXtGJq8AaQag9Q0iLZpyBYBfsF
6w==
=MDEA
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/aperard/tags/pull-xen-20190624' into staging
Xen queue
* Fix build
* xen-block: support feature-large-sector-size
* xen-block: Support IOThread polling for PV shared rings
* Avoid usage of a VLA
* Cleanup Xen headers usage
# gpg: Signature made Mon 24 Jun 2019 16:30:32 BST
# gpg: using RSA key F80C006308E22CFD8A92E7980CF5572FD7FB55AF
# gpg: issuer "anthony.perard@citrix.com"
# gpg: Good signature from "Anthony PERARD <anthony.perard@gmail.com>" [marginal]
# gpg: aka "Anthony PERARD <anthony.perard@citrix.com>" [marginal]
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 5379 2F71 024C 600F 778A 7161 D8D5 7199 DF83 42C8
# Subkey fingerprint: F80C 0063 08E2 2CFD 8A92 E798 0CF5 572F D7FB 55AF
* remotes/aperard/tags/pull-xen-20190624:
xen: Import other xen/io/*.h
Revert xen/io/ring.h of "Clean up a few header guard symbols"
xen: Drop includes of xen/hvm/params.h
xen: Avoid VLA
xen-bus / xen-block: add support for event channel polling
xen-bus: allow AioContext to be specified for each event channel
xen-bus: use a separate fd for each event channel
xen-block: support feature-large-sector-size
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
- The SSH block driver now uses libssh instead of libssh2
- The VMDK block driver gets read-only support for the seSparse
subformat
- Various fixes
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEkb62CjDbPohX0Rgp9AfbAGHVz0AFAl0Q4XASHG1yZWl0ekBy
ZWRoYXQuY29tAAoJEPQH2wBh1c9AIU8H+wVaUzYwlIZIgUkC6A9ZiCNbur9TngnZ
ywt9kc4CW8kEuIHm/m87qayt6cwWlw6OnWktQW14HHEqFVKdg2LwbJN20yG3a+tm
YAkwcfkIbJ6ZMKx2QZ4TTwov6rbf57yQtwxPtkxXMMMe488WYoN/wZig25+6BYgU
Z8FLhSW36TVAjbYAxuO+O9/JSx4DzjmHFnTkwW3XRLFYRMmH+eY3jB2T90W2zV/8
QDDTo8ZvdaSASKlMRnH4iik8lX4QXKRTjfprTzuynoth0Zvtt8jarpfLKKSPyf8Z
9bpEHBGFxYSyapGq1fHt9Le1jNx3lPVpzVsbEeGpai0FjnHXg+fntDQ=
=YFoG
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/maxreitz/tags/pull-block-2019-06-24' into staging
Block patches:
- The SSH block driver now uses libssh instead of libssh2
- The VMDK block driver gets read-only support for the seSparse
subformat
- Various fixes
# gpg: Signature made Mon 24 Jun 2019 15:42:56 BST
# gpg: using RSA key 91BEB60A30DB3E8857D11829F407DB0061D5CF40
# gpg: issuer "mreitz@redhat.com"
# gpg: Good signature from "Max Reitz <mreitz@redhat.com>" [full]
# Primary key fingerprint: 91BE B60A 30DB 3E88 57D1 1829 F407 DB00 61D5 CF40
* remotes/maxreitz/tags/pull-block-2019-06-24:
iotests: Fix 205 for concurrent runs
ssh: switch from libssh2 to libssh
vmdk: Add read-only support for seSparse snapshots
vmdk: Reduce the max bound for L1 table size
vmdk: Fix comment regarding max l1_size coverage
iotest 134: test cluster-misaligned encrypted write
blockdev: enable non-root nodes for transaction drive-backup source
nvme: do not advertise support for unsupported arbitration mechanism
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Xen folks are the actual maintainers for this.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Message-Id: <155912548463.2019004.3515830305299809902.stgit@bahia.lan>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
The sem_timedwait function has been annotated as requiring
non-null args in latest header files from GCC snapshot
representing the future 2.30 release.
This causes configure to fail when -Werror is used:
config-temp/qemu-conf.c: In function ‘main’:
config-temp/qemu-conf.c:2:25: error: null argument where non-null required (argument 1) [-Werror=nonnull]
2 | int main(void) { return sem_timedwait(0, 0); }
| ^~~~~~~~~~~~~
config-temp/qemu-conf.c:2:25: error: null argument where non-null required (argument 2) [-Werror=nonnull]
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <20190617114114.24897-1-berrange@redhat.com>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
The configure script breaks when the qemu source directory is in a path
containing white spaces, in particular the list of targets is not
correctly generated when calling "./configure --help" because of how the
default_target_list variable is built.
In addition to that, *building* qemu from a directory with spaces breaks
some assumptions in the Makefiles, even if the original source path does
not contain spaces like in the case of an out-of-tree build, or when
symlinks are involved.
To avoid these issues, refuse to run the configure script and the
Makefile if there are spaces or colons in the source path or the build
path, taking as inspiration what the kbuild system in linux does.
Buglink: https://bugs.launchpad.net/qemu/+bug/1817345
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Antonio Ospite <antonio.ospite@collabora.com>
Message-Id: <20190526144747.30019-3-ao2@ao2.it>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Since commit 79d77bcd36 (configure: Remove --source-path option,
2019-04-29) source_path cannot be overridden anymore, move it out of the
"default parameters" block since the word "default" may suggest that the
value can change, while in fact it does not.
While at it, only set source_path once and separate the positional
argument of basename with "--" to more robustly cover the case of path
names starting with a dash.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Antonio Ospite <antonio.ospite@collabora.com>
Message-Id: <20190526144747.30019-2-ao2@ao2.it>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
This interface has been introduced in 2005 with the
coldfire implementation (e6e5906b6e ColdFire target.)
and looks like to do what the linux-user interface already
does with the TRAP exception rather than the ILLEGAL
exception.
This interface has not been maintained since that.
The semi-hosting interface is not removed so coldfire kernel
with semi-hosting is always supported.
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <20190524162049.806-1-laurent@vivier.eu>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Altering all comments in target/m68k to match Qemu coding styles so that future
patches wont fail due to style breaches.
Signed-off-by: Lucien Murray-Pitts <lucienmp.qemu@gmail.com>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <20190606234125.GA4830@localhost.localdomain>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
The register request via gdbstub would return the SR part
which contains the Trace/Master/IRQ state flags, but
would be missing the CR (Condition Register) state bits.
This fix adds this support by merging them in the m68k
specific gdbstub handler m68k_cpu_gdb_read_register for SR register.
Signed-off-by: Lucien Murray-Pitts <lucienmp.qemu@gmail.com>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <20190609105154.GA16755@localhost.localdomain>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Fix big endian host behavior for interleave MSA instructions. Previous
fix used TARGET_WORDS_BIGENDIAN instead of HOST_WORDS_BIGENDIAN, which
was a mistake.
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Reviewed-by: Aleksandar Rikalo <arikalo@wavecomp.com>
Message-Id: <1561543629-20327-9-git-send-email-aleksandar.markovic@rt-rk.com>
Fix certian test cases for MSA pack instructions.
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Reviewed-by: Aleksandar Rikalo <arikalo@wavecomp.com>
Message-Id: <1561543629-20327-8-git-send-email-aleksandar.markovic@rt-rk.com>
Add files for MSA MIPS32R6 target testings (copiling and running).
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Reviewed-by: Aleksandar Rikalo <arikalo@wavecomp.com>
Message-Id: <1561543629-20327-7-git-send-email-aleksandar.markovic@rt-rk.com>
Add files for MSA big-endian target testings (copiling and running).
Little-endian files are renamed and ammended too.
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Reviewed-by: Aleksandar Rikalo <arikalo@wavecomp.com>
Message-Id: <1561543629-20327-6-git-send-email-aleksandar.markovic@rt-rk.com>
Amend tests for MSA int multiply instructions.
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Reviewed-by: Aleksandar Rikalo <arikalo@wavecomp.com>
Message-Id: <1561543629-20327-5-git-send-email-aleksandar.markovic@rt-rk.com>
Add tests for instructions whose result depends on the value in destination
register (prior to instruction execution).
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Reviewed-by: Aleksandar Rikalo <arikalo@wavecomp.com>
Message-Id: <1561543629-20327-4-git-send-email-aleksandar.markovic@rt-rk.com>
Add tests for MSA bit move instructions.
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Reviewed-by: Aleksandar Rikalo <arikalo@wavecomp.com>
Message-Id: <1561543629-20327-2-git-send-email-aleksandar.markovic@rt-rk.com>
Fix some simple checkpatch.pl warnings in rc4030.c.
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Reviewed-by: Aleksandar Rikalo <arikalo@wavecomp.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <1561472838-32272-3-git-send-email-aleksandar.markovic@rt-rk.com>
The size is one byte less than it should be:
address-space: rc4030-dma
0000000000000000-00000000fffffffe (prio 0, i/o): rc4030.dma
rc4030 is used in MIPS Jazz board context.
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Reviewed-by: Aleksandar Rikalo <arikalo@wavecomp.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <1561472838-32272-2-git-send-email-aleksandar.markovic@rt-rk.com>
One byte is missing, use an aligned size.
(qemu) info mtree
memory-region: pci0-mem
0000000000000000-00000000fffffffe (prio 0, i/o): pci0-mem
^
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Message-Id: <20190624222844.26584-8-f4bug@amsat.org>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Message-Id: <20190624222844.26584-7-f4bug@amsat.org>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Message-Id: <20190624222844.26584-6-f4bug@amsat.org>
Since we'll move this code around, fix its style first:
ERROR: space prohibited between function name and open parenthesis
ERROR: line over 90 characters
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Message-Id: <20190624222844.26584-5-f4bug@amsat.org>
Since we'll move this code around, fix its style first:
ERROR: braces {} are necessary for all arms of this statement
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Message-Id: <20190624222844.26584-4-f4bug@amsat.org>
Since we'll move this code around, fix its style first:
ERROR: code indent should never use tabs
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Message-Id: <20190624222844.26584-3-f4bug@amsat.org>
Since commit 8c06fbdf36 checkpatch.pl enforce a new multiline
comment syntax. Since we'll move this code around, fix its style
first.
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Message-Id: <20190624222844.26584-2-f4bug@amsat.org>
The default CPU for pseries has been set to POWER9 by default.
We can use the same default for linux-user
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <20190609143521.19374-2-laurent@vivier.eu>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
QEMU_PPC_FEATURE2_VEC_CRYPTO enables the use
of VSX instructions in libcrypto that are accelerated
by the TCG vector instructions now.
QEMU_PPC_FEATURE2_DARN allows to use the new builtin
qemu_guest_getrandom() function.
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <20190609143521.19374-1-laurent@vivier.eu>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Add support for the option IPV6_<ADD|DROP>_MEMBERSHIP of the syscall
setsockopt(). This option controls membership in multicast groups.
Argument is a pointer to a struct ipv6_mreq.
The glibc <netinet/in.h> header defines the ipv6_mreq structure,
which includes the following members:
struct in6_addr ipv6mr_multiaddr;
unsigned int ipv6mr_interface;
Whereas the kernel in its <linux/in6.h> header defines following
members of the same structure:
struct in6_addr ipv6mr_multiaddr;
int ipv6mr_ifindex;
POSIX defines ipv6mr_interface [1].
__UAPI_DEF_IVP6_MREQ appears in kernel headers with v3.12:
cfd280c91253 net: sync some IP headers with glibc
Without __UAPI_DEF_IVP6_MREQ, kernel defines ipv6mr_ifindex, and
this is explained in cfd280c91253:
"If you include the kernel headers first you get those,
and if you include the glibc headers first you get those,
and the following patch arranges a coordination and
synchronization between the two."
So before 3.12, a program can't include both <netinet/in.h> and
<linux/in6.h>.
In linux-user/syscall.c, we only include <netinet/in.h> (glibc) and
not <linux/in6.h> (kernel headers), so ipv6mr_interface is the one
to use.
[1] http://pubs.opengroup.org/onlinepubs/009695399/basedefs/netinet/in.h.html
Signed-off-by: Neng Chen <nchen@wavecomp.com>
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <1560953834-29584-2-git-send-email-aleksandar.markovic@rt-rk.com>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Add support for options SOL_ALG of the syscall setsockopt(). This
option is used in relation to Linux kernel Crypto API, and allows
a user to set additional information for the cipher operation via
syscall setsockopt(). The field "optname" must be one of the
following:
- ALG_SET_KEY – seting the key
- ALG_SET_AEAD_AUTHSIZE – set the authentication tag size
SOL_ALG is relatively newer setsockopt() option. Therefore, the
code that handles SOL_ALG is enclosed in "ifdef" so that the build
does not fail for older kernels that do not contain support for
SOL_ALG. "ifdef" also contains check if ALG_SET_KEY and
ALG_SET_AEAD_AUTHSIZE are defined.
Signed-off-by: Yunqiang Su <ysu@wavecomp.com>
Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <1560953834-29584-3-git-send-email-aleksandar.markovic@rt-rk.com>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
When we have updated kernel headers to 5.2-rc1 we have introduced
new syscall numbers that can be not supported by older kernels
and fail with ENOSYS while the guest emulation succeeded before
because the syscalls were emulated with ipc().
This patch fixes the problem by using ipc() if the new syscall
returns ENOSYS.
Fixes: 86e636951d ("linux-user: fix __NR_semtimedop undeclared error")
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Message-Id: <20190529084804.25950-1-laurent@vivier.eu>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
If one uses -L $PATH to point to a full chroot, the startup time
is significant. In addition, the existing probing algorithm fails
to handle symlink loops.
Instead, probe individual paths on demand. Cache both positive
and negative results within $PATH, so that any one filename is
probed only once.
Use glib filename functions for clarity.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Tested-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <20190519201953.20161-2-richard.henderson@linaro.org>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Tests should place their files into the test directory. This includes
Unix sockets. 205 currently fails to do so, which prevents it from
being run concurrently.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20190618210238.9524-1-mreitz@redhat.com
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Rewrite the implementation of the ssh block driver to use libssh instead
of libssh2. The libssh library has various advantages over libssh2:
- easier API for authentication (for example for using ssh-agent)
- easier API for known_hosts handling
- supports newer types of keys in known_hosts
Use APIs/features available in libssh 0.8 conditionally, to support
older versions (which are not recommended though).
Adjust the iotest 207 according to the different error message, and to
find the default key type for localhost (to properly compare the
fingerprint with).
Contributed-by: Max Reitz <mreitz@redhat.com>
Adjust the various Docker/Travis scripts to use libssh when available
instead of libssh2. The mingw/mxe testing is dropped for now, as there
are no packages for it.
Signed-off-by: Pino Toscano <ptoscano@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Acked-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20190620200840.17655-1-ptoscano@redhat.com
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 5873173.t2JhDm7DL7@lindworm.usersys.redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in
QEMU).
This format was lacking in the following:
* Grain directory (L1) and grain table (L2) entries were 32-bit,
allowing access to only 2TB (slightly less) of data.
* The grain size (default) was 512 bytes - leading to data
fragmentation and many grain tables.
* For space reclamation purposes, it was necessary to find all the
grains which are not pointed to by any grain table - so a reverse
mapping of "offset of grain in vmdk" to "grain table" must be
constructed - which takes large amounts of CPU/RAM.
The format specification can be found in VMware's documentation:
https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf
In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
introduced: SESparse (Space Efficient).
This format fixes the above issues:
* All entries are now 64-bit.
* The grain size (default) is 4KB.
* Grain directory and grain tables are now located at the beginning
of the file.
+ seSparse format reserves space for all grain tables.
+ Grain tables can be addressed using an index.
+ Grains are located in the end of the file and can also be
addressed with an index.
- seSparse vmdks of large disks (64TB) have huge preallocated
headers - mainly due to L2 tables, even for empty snapshots.
* The header contains a reverse mapping ("backmap") of "offset of
grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
specifies for each grain - whether it is allocated or not.
Using these data structures we can implement space reclamation
efficiently.
* Due to the fact that the header now maintains two mappings:
* The regular one (grain directory & grain tables)
* A reverse one (backmap and free bitmap)
These data structures can lose consistency upon crash and result
in a corrupted VMDK.
Therefore, a journal is also added to the VMDK and is replayed
when the VMware reopens the file after a crash.
Since ESXi 6.7 - SESparse is the only snapshot format available.
Unfortunately, VMware does not provide documentation regarding the new
seSparse format.
This commit is based on black-box research of the seSparse format.
Various in-guest block operations and their effect on the snapshot file
were tested.
The only VMware provided source of information (regarding the underlying
implementation) was a log file on the ESXi:
/var/log/hostd.log
Whenever an seSparse snapshot is created - the log is being populated
with seSparse records.
Relevant log records are of the form:
[...] Const Header:
[...] constMagic = 0xcafebabe
[...] version = 2.1
[...] capacity = 204800
[...] grainSize = 8
[...] grainTableSize = 64
[...] flags = 0
[...] Extents:
[...] Header : <1 : 1>
[...] JournalHdr : <2 : 2>
[...] Journal : <2048 : 2048>
[...] GrainDirectory : <4096 : 2048>
[...] GrainTables : <6144 : 2048>
[...] FreeBitmap : <8192 : 2048>
[...] BackMap : <10240 : 2048>
[...] Grain : <12288 : 204800>
[...] Volatile Header:
[...] volatileMagic = 0xcafecafe
[...] FreeGTNumber = 0
[...] nextTxnSeqNumber = 0
[...] replayJournal = 0
The sizes that are seen in the log file are in sectors.
Extents are of the following format: <offset : size>
This commit is a strict implementation which enforces:
* magics
* version number 2.1
* grain size of 8 sectors (4KB)
* grain table size of 64 sectors
* zero flags
* extent locations
Additionally, this commit proivdes only a subset of the functionality
offered by seSparse's format:
* Read-only
* No journal replay
* No space reclamation
* No unmap support
Hence, journal header, journal, free bitmap and backmap extents are
unused, only the "classic" (L1 -> L2 -> data) grain access is
implemented.
However there are several differences in the grain access itself.
Grain directory (L1):
* Grain directory entries are indexes (not offsets) to grain
tables.
* Valid grain directory entries have their highest nibble set to
0x1.
* Since grain tables are always located in the beginning of the
file - the index can fit into 32 bits - so we can use its low
part if it's valid.
Grain table (L2):
* Grain table entries are indexes (not offsets) to grains.
* If the highest nibble of the entry is:
0x0:
The grain in not allocated.
The rest of the bytes are 0.
0x1:
The grain is unmapped - guest sees a zero grain.
The rest of the bits point to the previously mapped grain,
see 0x3 case.
0x2:
The grain is zero.
0x3:
The grain is allocated - to get the index calculate:
((entry & 0x0fff000000000000) >> 48) |
((entry & 0x0000ffffffffffff) << 12)
* The difference between 0x1 and 0x2 is that 0x1 is an unallocated
grain which results from the guest using sg_unmap to unmap the
grain - but the grain itself still exists in the grain extent - a
space reclamation procedure should delete it.
Unmapping a zero grain has no effect (0x2 will not change to 0x1)
but unmapping an unallocated grain will (0x0 to 0x1) - naturally.
In order to implement seSparse some fields had to be changed to support
both 32-bit and 64-bit entry sizes.
Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
Message-id: 20190620091057.47441-4-shmuel.eiderman@oracle.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
512M of L1 entries is a very loose bound, only 32M are required to store
the maximal supported VMDK file size of 2TB.
Fixed qemu-iotest 59# - now failure occures before on impossible L1
table size.
Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
Message-id: 20190620091057.47441-3-shmuel.eiderman@oracle.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Commit b0651b8c24 ("vmdk: Move l1_size check into vmdk_add_extent")
extended the l1_size check from VMDK4 to VMDK3 but did not update the
default coverage in the moved comment.
The previous vmdk4 calculation:
(512 * 1024 * 1024) * 512(l2 entries) * 65536(grain) = 16PB
The added vmdk3 calculation:
(512 * 1024 * 1024) * 4096(l2 entries) * 512(grain) = 1PB
Adding the calculation of vmdk3 to the comment.
In any case, VMware does not offer virtual disks more than 2TB for
vmdk4/vmdk3 or 64TB for the new undocumented seSparse format which is
not implemented yet in qemu.
Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
Message-id: 20190620091057.47441-2-shmuel.eiderman@oracle.com
Reviewed-by: yuchenlin <yuchenlin@synology.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
COW (even empty/zero) areas require encryption too
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20190516143028.81155-1-anton.nefedov@virtuozzo.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
We forget to enable it for transaction .prepare, while it is already
enabled in do_drive_backup since commit a2d665c1bc
"blockdev: loosen restrictions on drive-backup source node"
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20190618140804.59214-1-vsementsov@virtuozzo.com
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>