hyperv-fixes for v6.10-rc5

-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmZv3pUTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXlH8B/wMHji/KH1UFTVjGG8YBT2SSzeDrVjD
 5GZ7HAVvSb2xLHbgLM8ioi7t1YoRv30d+cjGzNcIz6PSICqVB9q7QWaY1Vc3rb9G
 0/77GMnrwF+RFMPQzF2sgbQLILmBYi/47qeJOjPF6P/pvpd4xhrPuGQDJAfS75e7
 UJalTFT4l2ENRxOLuni/8NGjAZG/OVMpQY+XVaoHvYamnGhYcXnOamKPg0nNC6f5
 /oYh2s1HjWH1HCtDT9UHXBHiS8Jt4WrchD8uGII8K8SaxL5dhT9Lm0v9CcAagPq3
 l/7PxtrDE09dDirokZnRMQlhiuYBIEMZoIHl5bCBfODGukGOmnsMk0zi
 =LoeY
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-fixes-signed-20240616' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull Hyper-V fixes from Wei Liu:

 - Some cosmetic changes for hv.c and balloon.c (Aditya Nagesh)

 - Two documentation updates (Michael Kelley)

 - Suppress the invalid warning for packed member alignment (Saurabh
   Sengar)

 - Two hv_balloon fixes (Michael Kelley)

* tag 'hyperv-fixes-signed-20240616' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  Drivers: hv: Cosmetic changes for hv.c and balloon.c
  Documentation: hyperv: Improve synic and interrupt handling description
  Documentation: hyperv: Update spelling and fix typo
  tools: hv: suppress the invalid warning for packed member alignment
  hv_balloon: Enable hot-add for memblock sizes > 128 MiB
  hv_balloon: Use kernel macros to simplify open coded sequences
Linus Torvalds 2024-06-17 11:05:56 -07:00
commit 6226e74900
6 changed files with 207 additions and 205 deletions


@ -62,12 +62,21 @@ shared page with scale and offset values into user space. User
space code performs the same algorithm of reading the TSC and
applying the scale and offset to get the constant 10 MHz clock.
Linux clockevents are based on Hyper-V synthetic timer 0. While
Hyper-V offers 4 synthetic timers for each CPU, Linux only uses
timer 0. Interrupts from stimer0 are recorded on the "HVS" line in
/proc/interrupts. Clockevents based on the virtualized PIT and
local APIC timer also work, but the Hyper-V synthetic timer is
preferred.
Linux clockevents are based on Hyper-V synthetic timer 0 (stimer0).
While Hyper-V offers 4 synthetic timers for each CPU, Linux only uses
timer 0. In older versions of Hyper-V, an interrupt from stimer0
results in a VMBus control message that is demultiplexed by
vmbus_isr() as described in the Documentation/virt/hyperv/vmbus.rst
documentation. In newer versions of Hyper-V, stimer0 interrupts can
be mapped to an architectural interrupt, which is referred to as
"Direct Mode". Linux prefers to use Direct Mode when available. Since
x86/x64 doesn't support per-CPU interrupts, Direct Mode statically
allocates an x86 interrupt vector (HYPERV_STIMER0_VECTOR) across all CPUs
and explicitly codes it to call the stimer0 interrupt handler. Hence
interrupts from stimer0 are recorded on the "HVS" line in /proc/interrupts
rather than being associated with a Linux IRQ. Clockevents based on the
virtualized PIT and local APIC timer also work, but Hyper-V stimer0
is preferred.
The driver for the Hyper-V synthetic system clock and timers is
drivers/clocksource/hyperv_timer.c.
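
The scale-and-offset step described in the context lines of this hunk can be sketched in user-space C. This is a minimal illustration, assuming the scale is a 64-bit fixed-point multiplier whose 128-bit product is shifted right by 64 before the offset is added. The function name is hypothetical, `unsigned __int128` is a GCC/Clang extension, and the sketch omits the consistency checks a real reader of the shared page would need.

```c
#include <stdint.h>

/*
 * Convert a raw TSC reading into 100 ns units (a 10 MHz clock).
 * Hypothetical helper: the 128-bit product of tsc and scale is
 * shifted right by 64, then the signed offset is added.
 */
static uint64_t sketch_tsc_to_ref_time(uint64_t tsc, uint64_t scale,
				       int64_t offset)
{
	unsigned __int128 product = (unsigned __int128)tsc * scale;

	/* Unsigned addition wraps, so a negative offset subtracts. */
	return (uint64_t)(product >> 64) + (uint64_t)offset;
}
```

With a scale of `1ULL << 63` (a multiplier of 0.5), a TSC value of 1000 and an offset of 7 give 507.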


@ -40,7 +40,7 @@ Linux guests communicate with Hyper-V in four different ways:
arm64, these synthetic registers must be accessed using explicit
hypercalls.
* VMbus: VMbus is a higher-level software construct that is built on
* VMBus: VMBus is a higher-level software construct that is built on
the other 3 mechanisms. It is a message passing interface between
the Hyper-V host and the Linux guest. It uses memory that is shared
between Hyper-V and the guest, along with various signaling
@ -54,8 +54,8 @@ x86/x64 architecture only.
.. _Hyper-V Top Level Functional Spec (TLFS): https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/tlfs
VMbus is not documented. This documentation provides a high-level
overview of VMbus and how it works, but the details can be discerned
VMBus is not documented. This documentation provides a high-level
overview of VMBus and how it works, but the details can be discerned
only from the code.
Sharing Memory
@ -74,7 +74,7 @@ follows:
physical address space. How Hyper-V is told about the GPA or list
of GPAs varies. In some cases, a single GPA is written to a
synthetic register. In other cases, a GPA or list of GPAs is sent
in a VMbus message.
in a VMBus message.
* Hyper-V translates the GPAs into "real" physical memory addresses,
and creates a virtual mapping that it can use to access the memory.
@ -133,9 +133,9 @@ only the CPUs actually present in the VM, so Linux does not report
any hot-add CPUs.
A Linux guest CPU may be taken offline using the normal Linux
mechanisms, provided no VMbus channel interrupts are assigned to
the CPU. See the section on VMbus Interrupts for more details
on how VMbus channel interrupts can be re-assigned to permit
mechanisms, provided no VMBus channel interrupts are assigned to
the CPU. See the section on VMBus Interrupts for more details
on how VMBus channel interrupts can be re-assigned to permit
taking a CPU offline.
32-bit and 64-bit
@ -169,14 +169,14 @@ and functionality. Hyper-V indicates feature/function availability
via flags in synthetic MSRs that Hyper-V provides to the guest,
and the guest code tests these flags.
VMbus has its own protocol version that is negotiated during the
initial VMbus connection from the guest to Hyper-V. This version
VMBus has its own protocol version that is negotiated during the
initial VMBus connection from the guest to Hyper-V. This version
number is also output to dmesg during boot. This version number
is checked in a few places in the code to determine if specific
functionality is present.
Furthermore, each synthetic device on VMbus also has a protocol
version that is separate from the VMbus protocol version. Device
Furthermore, each synthetic device on VMBus also has a protocol
version that is separate from the VMBus protocol version. Device
drivers for these synthetic devices typically negotiate the device
protocol version, and may test that protocol version to determine
if specific device functionality is present.


@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
VMbus
VMBus
=====
VMbus is a software construct provided by Hyper-V to guest VMs. It
VMBus is a software construct provided by Hyper-V to guest VMs. It
consists of a control path and common facilities used by synthetic
devices that Hyper-V presents to guest VMs. The control path is
used to offer synthetic devices to the guest VM and, in some cases,
@ -12,9 +12,9 @@ and the synthetic device implementation that is part of Hyper-V, and
signaling primitives to allow Hyper-V and the guest to interrupt
each other.
VMbus is modeled in Linux as a bus, with the expected /sys/bus/vmbus
entry in a running Linux guest. The VMbus driver (drivers/hv/vmbus_drv.c)
establishes the VMbus control path with the Hyper-V host, then
VMBus is modeled in Linux as a bus, with the expected /sys/bus/vmbus
entry in a running Linux guest. The VMBus driver (drivers/hv/vmbus_drv.c)
establishes the VMBus control path with the Hyper-V host, then
registers itself as a Linux bus driver. It implements the standard
bus functions for adding and removing devices to/from the bus.
@ -49,9 +49,9 @@ synthetic NIC is referred to as "netvsc" and the Linux driver for
the synthetic SCSI controller is "storvsc". These drivers contain
functions with names like "storvsc_connect_to_vsp".
VMbus channels
VMBus channels
--------------
An instance of a synthetic device uses VMbus channels to communicate
An instance of a synthetic device uses VMBus channels to communicate
between the VSP and the VSC. Channels are bi-directional and used
for passing messages. Most synthetic devices use a single channel,
but the synthetic SCSI controller and synthetic NIC may use multiple
@ -73,7 +73,7 @@ write indices and some control flags, followed by the memory for the
actual ring. The size of the ring is determined by the VSC in the
guest and is specific to each synthetic device. The list of GPAs
making up the ring is communicated to the Hyper-V host over the
VMbus control path as a GPA Descriptor List (GPADL). See function
VMBus control path as a GPA Descriptor List (GPADL). See function
vmbus_establish_gpadl().
Each ring buffer is mapped into contiguous Linux kernel virtual
@ -102,10 +102,10 @@ resources. For Windows Server 2019 and later, this limit is
approximately 1280 Mbytes. For versions prior to Windows Server
2019, the limit is approximately 384 Mbytes.
VMbus messages
--------------
All VMbus messages have a standard header that includes the message
length, the offset of the message payload, some flags, and a
VMBus channel messages
----------------------
All messages sent in a VMBus channel have a standard header that includes
the message length, the offset of the message payload, some flags, and a
transactionID. The portion of the message after the header is
unique to each VSP/VSC pair.
@ -137,7 +137,7 @@ control message contains a list of GPAs that describe the data
buffer. For example, the storvsc driver uses this approach to
specify the data buffers to/from which disk I/O is done.
Three functions exist to send VMbus messages:
Three functions exist to send VMBus channel messages:
1. vmbus_sendpacket(): Control-only messages and messages with
embedded data -- no GPAs
@ -154,20 +154,51 @@ Historically, Linux guests have trusted Hyper-V to send well-formed
and valid messages, and Linux drivers for synthetic devices did not
fully validate messages. With the introduction of processor
technologies that fully encrypt guest memory and that allow the
guest to not trust the hypervisor (AMD SNP-SEV, Intel TDX), trusting
guest to not trust the hypervisor (AMD SEV-SNP, Intel TDX), trusting
the Hyper-V host is no longer a valid assumption. The drivers for
VMbus synthetic devices are being updated to fully validate any
VMBus synthetic devices are being updated to fully validate any
values read from memory that is shared with Hyper-V, which includes
messages from VMbus devices. To facilitate such validation,
messages from VMBus devices. To facilitate such validation,
messages read by the guest from the "in" ring buffer are copied to a
temporary buffer that is not shared with Hyper-V. Validation is
performed in this temporary buffer without the risk of Hyper-V
maliciously modifying the message after it is validated but before
it is used.
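
The copy-then-validate pattern described in this paragraph can be sketched as follows. This is an illustration only: the header layout, the size limit, and all `sketch_` names are hypothetical and do not reflect the real VMBus structures. The point it shows is that every check runs against a private snapshot, so a later change to the shared memory cannot invalidate a check that already passed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_MSG 256

/* Hypothetical header layout, for illustration only. */
struct sketch_msg_hdr {
	uint32_t len;          /* total message length in bytes, header included */
	uint32_t payload_off;  /* byte offset of the payload from the message start */
};

/*
 * Copy a message from host-shared memory into priv (which must hold at
 * least SKETCH_MAX_MSG bytes) and validate the private copy.
 * Returns 0 on success, -1 if the message is malformed.
 */
static int sketch_copy_and_validate(const void *shared, size_t shared_len,
				    uint8_t *priv, struct sketch_msg_hdr *hdr_out)
{
	struct sketch_msg_hdr hdr;

	if (shared_len < sizeof(hdr))
		return -1;

	/* Snapshot the header once; all checks use this private copy. */
	memcpy(&hdr, shared, sizeof(hdr));

	if (hdr.len < sizeof(hdr) || hdr.len > shared_len ||
	    hdr.len > SKETCH_MAX_MSG)
		return -1;
	if (hdr.payload_off < sizeof(hdr) || hdr.payload_off > hdr.len)
		return -1;

	/*
	 * Copy the whole message, then reinstate the validated header so
	 * that a header rewritten between the two copies is ignored.
	 */
	memcpy(priv, shared, hdr.len);
	memcpy(priv, &hdr, sizeof(hdr));

	*hdr_out = hdr;
	return 0;
}
```

After a successful call, the caller works only with `priv` and `hdr_out` and never rereads the shared buffer.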
VMbus interrupts
Synthetic Interrupt Controller (synic)
--------------------------------------
Hyper-V provides each guest CPU with a synthetic interrupt controller
that is used by VMBus for host-guest communication. While each synic
defines 16 synthetic interrupts (SINT), Linux uses only one of the 16
(VMBUS_MESSAGE_SINT). All interrupts related to communication between
the Hyper-V host and a guest CPU use that SINT.
The SINT is mapped to a single per-CPU architectural interrupt (i.e.,
an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because
each CPU in the guest has a synic and may receive VMBus interrupts,
they are best modeled in Linux as per-CPU interrupts. This model works
well on arm64 where a single per-CPU Linux IRQ is allocated for
VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts as an IRQ labelled
"Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs, an x86
interrupt vector is statically allocated (HYPERVISOR_CALLBACK_VECTOR)
across all CPUs and explicitly coded to call vmbus_isr(). In this case,
there's no Linux IRQ, and the interrupts are visible in aggregate in
/proc/interrupts on the "HYP" line.
The synic provides the means to demultiplex the architectural interrupt into
one or more logical interrupts and route the logical interrupt to the proper
VMBus handler in Linux. This demultiplexing is done by vmbus_isr() and
related functions that access synic data structures.
The synic is not modeled in Linux as an irq chip or irq domain,
and the demultiplexed logical interrupts are not Linux IRQs. As such,
they don't appear in /proc/interrupts or /proc/irq. The CPU
affinity for one of these logical interrupts is controlled via an
entry under /sys/bus/vmbus as described below.
VMBus interrupts
----------------
VMbus provides a mechanism for the guest to interrupt the host when
VMBus provides a mechanism for the guest to interrupt the host when
the guest has queued new messages in a ring buffer. The host
expects that the guest will send an interrupt only when an "out"
ring buffer transitions from empty to non-empty. If the guest sends
@ -176,63 +207,55 @@ unnecessary. If a guest sends an excessive number of unnecessary
interrupts, the host may throttle that guest by suspending its
execution for a few seconds to prevent a denial-of-service attack.
Similarly, the host will interrupt the guest when it sends a new
message on the VMbus control path, or when a VMbus channel "in" ring
buffer transitions from empty to non-empty. Each CPU in the guest
may receive VMbus interrupts, so they are best modeled as per-CPU
interrupts in Linux. This model works well on arm64 where a single
per-CPU IRQ is allocated for VMbus. Since x86/x64 lacks support for
per-CPU IRQs, an x86 interrupt vector is statically allocated (see
HYPERVISOR_CALLBACK_VECTOR) across all CPUs and explicitly coded to
call the VMbus interrupt service routine. These interrupts are
visible in /proc/interrupts on the "HYP" line.
Similarly, the host will interrupt the guest via the synic when
it sends a new message on the VMBus control path, or when a VMBus
channel "in" ring buffer transitions from empty to non-empty due to
the host inserting a new VMBus channel message. The control message stream
and each VMBus channel "in" ring buffer are separate logical interrupts
that are demultiplexed by vmbus_isr(). It demultiplexes by first checking
for channel interrupts by calling vmbus_chan_sched(), which looks at a synic
bitmap to determine which channels have pending interrupts on this CPU.
If multiple channels have pending interrupts for this CPU, they are
processed sequentially. When all channel interrupts have been processed,
vmbus_isr() checks for and processes any messages received on the VMBus
control path.
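
The demultiplexing order described in this paragraph (pending channels first, then the control path) can be sketched in user-space C. All structures and names below are hypothetical stand-ins for the synic data structures; in particular, the sketch clears the bitmap with a plain store, whereas a real handler racing with the host would need an atomic test-and-clear.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_WORD ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))
#define SKETCH_WORDS         2
#define SKETCH_CTRL_EVENT    UINT_MAX  /* log marker for a control message */
#define SKETCH_LOG_MAX       16

/* Hypothetical per-CPU state standing in for the synic data structures. */
struct sketch_synic {
	unsigned long chan_pending[SKETCH_WORDS]; /* one bit per channel */
	int ctrl_msg_pending;                     /* nonzero if a control message waits */
};

/* Records the order in which logical interrupts were dispatched. */
struct sketch_log {
	unsigned int events[SKETCH_LOG_MAX];
	unsigned int count;
};

static void sketch_record(struct sketch_log *log, unsigned int event)
{
	if (log->count < SKETCH_LOG_MAX)
		log->events[log->count++] = event;
}

/* Channel pass: visit and clear every pending bit, lowest index first. */
static void sketch_chan_sched(struct sketch_synic *s, struct sketch_log *log)
{
	for (size_t w = 0; w < SKETCH_WORDS; w++) {
		unsigned long pending = s->chan_pending[w];

		s->chan_pending[w] = 0; /* simplification: not atomic */
		for (unsigned int bit = 0; bit < SKETCH_BITS_PER_WORD; bit++) {
			if (pending & (1UL << bit))
				sketch_record(log, (unsigned int)w *
						   SKETCH_BITS_PER_WORD + bit);
		}
	}
}

/* Top-level demultiplexer: channels first, then the control path. */
static void sketch_isr(struct sketch_synic *s, struct sketch_log *log)
{
	sketch_chan_sched(s, log);
	if (s->ctrl_msg_pending) {
		s->ctrl_msg_pending = 0;
		sketch_record(log, SKETCH_CTRL_EVENT);
	}
}
```

The log makes the ordering visible: every pending channel is dispatched sequentially before the control message is looked at.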
The guest CPU that a VMbus channel will interrupt is selected by the
The guest CPU that a VMBus channel will interrupt is selected by the
guest when the channel is created, and the host is informed of that
selection. VMbus devices are broadly grouped into two categories:
selection. VMBus devices are broadly grouped into two categories:
1. "Slow" devices that need only one VMbus channel. The devices
1. "Slow" devices that need only one VMBus channel. The devices
(such as keyboard, mouse, heartbeat, and timesync) generate
relatively few interrupts. Their VMbus channels are all
relatively few interrupts. Their VMBus channels are all
assigned to interrupt the VMBUS_CONNECT_CPU, which is always
CPU 0.
2. "High speed" devices that may use multiple VMbus channels for
2. "High speed" devices that may use multiple VMBus channels for
higher parallelism and performance. These devices include the
synthetic SCSI controller and synthetic NIC. Their VMbus
synthetic SCSI controller and synthetic NIC. Their VMBus
channel interrupts are assigned to CPUs that are spread out
among the available CPUs in the VM so that interrupts on
multiple channels can be processed in parallel.
The assignment of VMbus channel interrupts to CPUs is done in the
The assignment of VMBus channel interrupts to CPUs is done in the
function init_vp_index(). This assignment is done outside of the
normal Linux interrupt affinity mechanism, so the interrupts are
neither "unmanaged" nor "managed" interrupts.
The CPU that a VMbus channel will interrupt can be seen in
The CPU that a VMBus channel will interrupt can be seen in
/sys/bus/vmbus/devices/<deviceGUID>/channels/<channelRelID>/cpu.
When running on later versions of Hyper-V, the CPU can be changed
by writing a new value to this sysfs entry. Because the interrupt
assignment is done outside of the normal Linux affinity mechanism,
there are no entries in /proc/irq corresponding to individual
VMbus channel interrupts.
by writing a new value to this sysfs entry. Because VMBus channel
interrupts are not Linux IRQs, there are no entries in /proc/interrupts
or /proc/irq corresponding to individual VMBus channel interrupts.
An online CPU in a Linux guest may not be taken offline if it has
VMbus channel interrupts assigned to it. Any such channel
VMBus channel interrupts assigned to it. Any such channel
interrupts must first be manually reassigned to another CPU as
described above. When no channel interrupts are assigned to the
CPU, it can be taken offline.
When a guest CPU receives a VMbus interrupt from the host, the
function vmbus_isr() handles the interrupt. It first checks for
channel interrupts by calling vmbus_chan_sched(), which looks at a
bitmap setup by the host to determine which channels have pending
interrupts on this CPU. If multiple channels have pending
interrupts for this CPU, they are processed sequentially. When all
channel interrupts have been processed, vmbus_isr() checks for and
processes any message received on the VMbus control path.
The VMbus channel interrupt handling code is designed to work
The VMBus channel interrupt handling code is designed to work
correctly even if an interrupt is received on a CPU other than the
CPU assigned to the channel. Specifically, the code does not use
CPU-based exclusion for correctness. In normal operation, Hyper-V
@ -242,23 +265,23 @@ when Hyper-V will make the transition. The code must work correctly
even if there is a time lag before Hyper-V starts interrupting the
new CPU. See comments in target_cpu_store().
VMbus device creation/deletion
VMBus device creation/deletion
------------------------------
Hyper-V and the Linux guest have a separate message-passing path
that is used for synthetic device creation and deletion. This
path does not use a VMbus channel. See vmbus_post_msg() and
path does not use a VMBus channel. See vmbus_post_msg() and
vmbus_on_msg_dpc().
The first step is for the guest to connect to the generic
Hyper-V VMbus mechanism. As part of establishing this connection,
the guest and Hyper-V agree on a VMbus protocol version they will
Hyper-V VMBus mechanism. As part of establishing this connection,
the guest and Hyper-V agree on a VMBus protocol version they will
use. This negotiation allows newer Linux kernels to run on older
Hyper-V versions, and vice versa.
The guest then tells Hyper-V to "send offers". Hyper-V sends an
offer message to the guest for each synthetic device that the VM
is configured to have. Each VMbus device type has a fixed GUID
known as the "class ID", and each VMbus device instance is also
is configured to have. Each VMBus device type has a fixed GUID
known as the "class ID", and each VMBus device instance is also
identified by a GUID. The offer message from Hyper-V contains
both GUIDs to uniquely (within the VM) identify the device.
There is one offer message for each device instance, so a VM with
@ -275,7 +298,7 @@ type based on the class ID, and invokes the correct driver to set up
the device. Driver/device matching is performed using the standard
Linux mechanism.
The device driver probe function opens the primary VMbus channel to
The device driver probe function opens the primary VMBus channel to
the corresponding VSP. It allocates guest memory for the channel
ring buffers and shares the ring buffer with the Hyper-V host by
giving the host a list of GPAs for the ring buffer memory. See
@ -285,7 +308,7 @@ Once the ring buffer is set up, the device driver and VSP exchange
setup messages via the primary channel. These messages may include
negotiating the device protocol version to be used between the Linux
VSC and the VSP on the Hyper-V host. The setup messages may also
include creating additional VMbus channels, which are somewhat
include creating additional VMBus channels, which are somewhat
mis-named as "sub-channels" since they are functionally
equivalent to the primary channel once they are created.


@ -45,8 +45,8 @@ int hv_init(void)
* This involves a hypercall.
*/
int hv_post_message(union hv_connection_id connection_id,
enum hv_message_type message_type,
void *payload, size_t payload_size)
enum hv_message_type message_type,
void *payload, size_t payload_size)
{
struct hv_input_post_message *aligned_msg;
unsigned long flags;
@ -86,7 +86,7 @@ int hv_post_message(union hv_connection_id connection_id,
status = HV_STATUS_INVALID_PARAMETER;
} else {
status = hv_do_hypercall(HVCALL_POST_MESSAGE,
aligned_msg, NULL);
aligned_msg, NULL);
}
local_irq_restore(flags);
@ -111,7 +111,7 @@ int hv_synic_alloc(void)
hv_context.hv_numa_map = kcalloc(nr_node_ids, sizeof(struct cpumask),
GFP_KERNEL);
if (hv_context.hv_numa_map == NULL) {
if (!hv_context.hv_numa_map) {
pr_err("Unable to allocate NUMA map\n");
goto err;
}
@ -120,11 +120,11 @@ int hv_synic_alloc(void)
hv_cpu = per_cpu_ptr(hv_context.cpu_context, cpu);
tasklet_init(&hv_cpu->msg_dpc,
vmbus_on_msg_dpc, (unsigned long) hv_cpu);
vmbus_on_msg_dpc, (unsigned long)hv_cpu);
if (ms_hyperv.paravisor_present && hv_isolation_type_tdx()) {
hv_cpu->post_msg_page = (void *)get_zeroed_page(GFP_ATOMIC);
if (hv_cpu->post_msg_page == NULL) {
if (!hv_cpu->post_msg_page) {
pr_err("Unable to allocate post msg page\n");
goto err;
}
@ -147,14 +147,14 @@ int hv_synic_alloc(void)
if (!ms_hyperv.paravisor_present && !hv_root_partition) {
hv_cpu->synic_message_page =
(void *)get_zeroed_page(GFP_ATOMIC);
if (hv_cpu->synic_message_page == NULL) {
if (!hv_cpu->synic_message_page) {
pr_err("Unable to allocate SYNIC message page\n");
goto err;
}
hv_cpu->synic_event_page =
(void *)get_zeroed_page(GFP_ATOMIC);
if (hv_cpu->synic_event_page == NULL) {
if (!hv_cpu->synic_event_page) {
pr_err("Unable to allocate SYNIC event page\n");
free_page((unsigned long)hv_cpu->synic_message_page);
@ -203,14 +203,13 @@ int hv_synic_alloc(void)
return ret;
}
void hv_synic_free(void)
{
int cpu, ret;
for_each_present_cpu(cpu) {
struct hv_per_cpu_context *hv_cpu
= per_cpu_ptr(hv_context.cpu_context, cpu);
struct hv_per_cpu_context *hv_cpu =
per_cpu_ptr(hv_context.cpu_context, cpu);
/* It's better to leak the page if the encryption fails. */
if (ms_hyperv.paravisor_present && hv_isolation_type_tdx()) {
@ -262,8 +261,8 @@ void hv_synic_free(void)
*/
void hv_synic_enable_regs(unsigned int cpu)
{
struct hv_per_cpu_context *hv_cpu
= per_cpu_ptr(hv_context.cpu_context, cpu);
struct hv_per_cpu_context *hv_cpu =
per_cpu_ptr(hv_context.cpu_context, cpu);
union hv_synic_simp simp;
union hv_synic_siefp siefp;
union hv_synic_sint shared_sint;
@ -277,8 +276,8 @@ void hv_synic_enable_regs(unsigned int cpu)
/* Mask out vTOM bit. ioremap_cache() maps decrypted */
u64 base = (simp.base_simp_gpa << HV_HYP_PAGE_SHIFT) &
~ms_hyperv.shared_gpa_boundary;
hv_cpu->synic_message_page
= (void *)ioremap_cache(base, HV_HYP_PAGE_SIZE);
hv_cpu->synic_message_page =
(void *)ioremap_cache(base, HV_HYP_PAGE_SIZE);
if (!hv_cpu->synic_message_page)
pr_err("Fail to map synic message page.\n");
} else {
@ -296,8 +295,8 @@ void hv_synic_enable_regs(unsigned int cpu)
/* Mask out vTOM bit. ioremap_cache() maps decrypted */
u64 base = (siefp.base_siefp_gpa << HV_HYP_PAGE_SHIFT) &
~ms_hyperv.shared_gpa_boundary;
hv_cpu->synic_event_page
= (void *)ioremap_cache(base, HV_HYP_PAGE_SIZE);
hv_cpu->synic_event_page =
(void *)ioremap_cache(base, HV_HYP_PAGE_SIZE);
if (!hv_cpu->synic_event_page)
pr_err("Fail to map synic event page.\n");
} else {
@ -348,8 +347,8 @@ int hv_synic_init(unsigned int cpu)
*/
void hv_synic_disable_regs(unsigned int cpu)
{
struct hv_per_cpu_context *hv_cpu
= per_cpu_ptr(hv_context.cpu_context, cpu);
struct hv_per_cpu_context *hv_cpu =
per_cpu_ptr(hv_context.cpu_context, cpu);
union hv_synic_sint shared_sint;
union hv_synic_simp simp;
union hv_synic_siefp siefp;


@ -25,6 +25,7 @@
#include <linux/notifier.h>
#include <linux/percpu_counter.h>
#include <linux/page_reporting.h>
#include <linux/sizes.h>
#include <linux/hyperv.h>
#include <asm/hyperv-tlfs.h>
@ -41,8 +42,6 @@
* Begin protocol definitions.
*/
/*
* Protocol versions. The low word is the minor version, the high word the major
* version.
@ -71,8 +70,6 @@ enum {
DYNMEM_PROTOCOL_VERSION_CURRENT = DYNMEM_PROTOCOL_VERSION_WIN10
};
/*
* Message Types
*/
@ -101,7 +98,6 @@ enum dm_message_type {
DM_VERSION_1_MAX = 12
};
/*
* Structures defining the dynamic memory management
* protocol.
@ -115,7 +111,6 @@ union dm_version {
__u32 version;
} __packed;
union dm_caps {
struct {
__u64 balloon:1;
@ -148,8 +143,6 @@ union dm_mem_page_range {
__u64 page_range;
} __packed;
/*
* The header for all dynamic memory messages:
*
@ -174,7 +167,6 @@ struct dm_message {
__u8 data[]; /* enclosed message */
} __packed;
/*
* Specific message types supporting the dynamic memory protocol.
*/
@ -271,7 +263,6 @@ struct dm_status {
__u32 io_diff;
} __packed;
/*
* Message to ask the guest to allocate memory - balloon up message.
* This message is sent from the host to the guest. The guest may not be
@ -286,14 +277,13 @@ struct dm_balloon {
__u32 reservedz;
} __packed;
/*
* Balloon response message; this message is sent from the guest
* to the host in response to the balloon message.
*
* reservedz: Reserved; must be set to zero.
* more_pages: If FALSE, this is the last message of the transaction.
* if TRUE there will atleast one more message from the guest.
* if TRUE there will be at least one more message from the guest.
*
* range_count: The number of ranges in the range array.
*
@ -314,7 +304,7 @@ struct dm_balloon_response {
* to the guest to give guest more memory.
*
* more_pages: If FALSE, this is the last message of the transaction.
* if TRUE there will atleast one more message from the guest.
* if TRUE there will be at least one more message from the guest.
*
* reservedz: Reserved; must be set to zero.
*
@ -342,7 +332,6 @@ struct dm_unballoon_response {
struct dm_header hdr;
} __packed;
/*
* Hot add request message. Message sent from the host to the guest.
*
@ -390,7 +379,6 @@ enum dm_info_type {
MAX_INFO_TYPE
};
/*
* Header for the information message.
*/
@ -425,11 +413,11 @@ struct dm_info_msg {
* The range start_pfn : end_pfn specifies the range
* that the host has asked us to hot add. The range
* start_pfn : ha_end_pfn specifies the range that we have
* currently hot added. We hot add in multiples of 128M
* chunks; it is possible that we may not be able to bring
* online all the pages in the region. The range
* currently hot added. We hot add in chunks equal to the
* memory block size; it is possible that we may not be able
* to bring online all the pages in the region. The range
* covered_start_pfn:covered_end_pfn defines the pages that can
* be brough online.
* be brought online.
*/
struct hv_hotadd_state {
@ -480,10 +468,10 @@ static unsigned long last_post_time;
static int hv_hypercall_multi_failure;
module_param(hot_add, bool, (S_IRUGO | S_IWUSR));
module_param(hot_add, bool, 0644);
MODULE_PARM_DESC(hot_add, "If set attempt memory hot_add");
module_param(pressure_report_delay, uint, (S_IRUGO | S_IWUSR));
module_param(pressure_report_delay, uint, 0644);
MODULE_PARM_DESC(pressure_report_delay, "Delay in secs in reporting pressure");
static atomic_t trans_id = ATOMIC_INIT(0);
@ -502,11 +490,13 @@ enum hv_dm_state {
DM_INIT_ERROR
};
static __u8 recv_buffer[HV_HYP_PAGE_SIZE];
static __u8 balloon_up_send_buffer[HV_HYP_PAGE_SIZE];
static unsigned long ha_pages_in_chunk;
#define HA_BYTES_IN_CHUNK (ha_pages_in_chunk << PAGE_SHIFT)
#define PAGES_IN_2M (2 * 1024 * 1024 / PAGE_SIZE)
#define HA_CHUNK (128 * 1024 * 1024 / PAGE_SIZE)
struct hv_dynmem_device {
struct hv_device *dev;
@ -595,12 +585,12 @@ static inline bool has_pfn_is_backed(struct hv_hotadd_state *has,
struct hv_hotadd_gap *gap;
/* The page is not backed. */
if ((pfn < has->covered_start_pfn) || (pfn >= has->covered_end_pfn))
if (pfn < has->covered_start_pfn || pfn >= has->covered_end_pfn)
return false;
/* Check for gaps. */
list_for_each_entry(gap, &has->gap_list, list) {
if ((pfn >= gap->start_pfn) && (pfn < gap->end_pfn))
if (pfn >= gap->start_pfn && pfn < gap->end_pfn)
return false;
}
@ -724,28 +714,21 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
unsigned long processed_pfn;
unsigned long total_pfn = pfn_count;
for (i = 0; i < (size/HA_CHUNK); i++) {
start_pfn = start + (i * HA_CHUNK);
for (i = 0; i < (size/ha_pages_in_chunk); i++) {
start_pfn = start + (i * ha_pages_in_chunk);
scoped_guard(spinlock_irqsave, &dm_device.ha_lock) {
has->ha_end_pfn += HA_CHUNK;
if (total_pfn > HA_CHUNK) {
processed_pfn = HA_CHUNK;
total_pfn -= HA_CHUNK;
} else {
processed_pfn = total_pfn;
total_pfn = 0;
}
has->covered_end_pfn += processed_pfn;
has->ha_end_pfn += ha_pages_in_chunk;
processed_pfn = umin(total_pfn, ha_pages_in_chunk);
total_pfn -= processed_pfn;
has->covered_end_pfn += processed_pfn;
}
reinit_completion(&dm_device.ol_waitevent);
nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
ret = add_memory(nid, PFN_PHYS((start_pfn)),
(HA_CHUNK << PAGE_SHIFT), MHP_MERGE_RESOURCE);
HA_BYTES_IN_CHUNK, MHP_MERGE_RESOURCE);
if (ret) {
pr_err("hot_add memory failed error is %d\n", ret);
@ -760,7 +743,7 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
do_hot_add = false;
}
scoped_guard(spinlock_irqsave, &dm_device.ha_lock) {
has->ha_end_pfn -= HA_CHUNK;
has->ha_end_pfn -= ha_pages_in_chunk;
has->covered_end_pfn -= processed_pfn;
}
break;
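
The per-chunk accounting in hv_mem_hot_add(), where an if/else was replaced by a single minimum, can be checked with a small user-space sketch. The helper names are hypothetical, and `sketch_min_ul()` merely stands in for the kernel's umin(); the sketch assumes nothing about locking or the surrounding hot-add state.

```c
/* Stand-in for the kernel's umin() on unsigned long values. */
static unsigned long sketch_min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* New form: take at most one chunk from the remaining page count. */
static unsigned long sketch_take_chunk(unsigned long *total_pfn,
				       unsigned long chunk)
{
	unsigned long processed = sketch_min_ul(*total_pfn, chunk);

	*total_pfn -= processed;
	return processed;
}

/* Old open-coded form, kept only to compare results. */
static unsigned long sketch_take_chunk_old(unsigned long *total_pfn,
					   unsigned long chunk)
{
	unsigned long processed;

	if (*total_pfn > chunk) {
		processed = chunk;
		*total_pfn -= chunk;
	} else {
		processed = *total_pfn;
		*total_pfn = 0;
	}
	return processed;
}
```

Starting from 70000 pages with a chunk of 32768 pages, both forms yield 32768, 32768, 4464 and then 0 on successive calls.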
@ -787,8 +770,8 @@ static void hv_online_page(struct page *pg, unsigned int order)
guard(spinlock_irqsave)(&dm_device.ha_lock);
list_for_each_entry(has, &dm_device.ha_region_list, list) {
/* The page belongs to a different HAS. */
if ((pfn < has->start_pfn) ||
(pfn + (1UL << order) > has->end_pfn))
if (pfn < has->start_pfn ||
(pfn + (1UL << order) > has->end_pfn))
continue;
hv_bring_pgs_online(has, pfn, 1UL << order);
@ -800,7 +783,7 @@ static int pfn_covered(unsigned long start_pfn, unsigned long pfn_cnt)
{
struct hv_hotadd_state *has;
struct hv_hotadd_gap *gap;
unsigned long residual, new_inc;
unsigned long residual;
int ret = 0;
guard(spinlock_irqsave)(&dm_device.ha_lock);
@ -836,15 +819,9 @@ static int pfn_covered(unsigned long start_pfn, unsigned long pfn_cnt)
* our current limit; extend it.
*/
if ((start_pfn + pfn_cnt) > has->end_pfn) {
/* Extend the region by multiples of ha_pages_in_chunk */
residual = (start_pfn + pfn_cnt - has->end_pfn);
/*
* Extend the region by multiples of HA_CHUNK.
*/
new_inc = (residual / HA_CHUNK) * HA_CHUNK;
if (residual % HA_CHUNK)
new_inc += HA_CHUNK;
has->end_pfn += new_inc;
has->end_pfn += ALIGN(residual, ha_pages_in_chunk);
}
ret = 1;
@ -855,7 +832,7 @@ static int pfn_covered(unsigned long start_pfn, unsigned long pfn_cnt)
}
static unsigned long handle_pg_range(unsigned long pg_start,
unsigned long pg_count)
unsigned long pg_count)
{
unsigned long start_pfn = pg_start;
unsigned long pfn_cnt = pg_count;
@ -866,7 +843,7 @@ static unsigned long handle_pg_range(unsigned long pg_start,
unsigned long res = 0, flags;
pr_debug("Hot adding %lu pages starting at pfn 0x%lx.\n", pg_count,
pg_start);
pg_start);
spin_lock_irqsave(&dm_device.ha_lock, flags);
list_for_each_entry(has, &dm_device.ha_region_list, list) {
@ -902,22 +879,19 @@ static unsigned long handle_pg_range(unsigned long pg_start,
if (start_pfn > has->start_pfn &&
online_section_nr(pfn_to_section_nr(start_pfn)))
hv_bring_pgs_online(has, start_pfn, pgs_ol);
}
if ((has->ha_end_pfn < has->end_pfn) && (pfn_cnt > 0)) {
if (has->ha_end_pfn < has->end_pfn && pfn_cnt > 0) {
/*
* We have some residual hot add range
* that needs to be hot added; hot add
* it now. Hot add a multiple of
* HA_CHUNK that fully covers the pages
* ha_pages_in_chunk that fully covers the pages
* we have.
*/
size = (has->end_pfn - has->ha_end_pfn);
if (pfn_cnt <= size) {
size = ((pfn_cnt / HA_CHUNK) * HA_CHUNK);
if (pfn_cnt % HA_CHUNK)
size += HA_CHUNK;
size = ALIGN(pfn_cnt, ha_pages_in_chunk);
} else {
pfn_cnt = size;
}
@ -1010,10 +984,7 @@ static void hot_add_req(struct work_struct *dummy)
rg_start = dm->ha_wrk.ha_region_range.finfo.start_page;
rg_sz = dm->ha_wrk.ha_region_range.finfo.page_cnt;
if ((rg_start == 0) && (!dm->host_specified_ha_region)) {
unsigned long region_size;
unsigned long region_start;
if (rg_start == 0 && !dm->host_specified_ha_region) {
/*
* The host has not specified the hot-add region.
* Based on the hot-add page range being specified,
@ -1021,19 +992,13 @@ static void hot_add_req(struct work_struct *dummy)
* that need to be hot-added while ensuring the alignment
* and size requirements of Linux as it relates to hot-add.
*/
region_size = (pfn_cnt / HA_CHUNK) * HA_CHUNK;
if (pfn_cnt % HA_CHUNK)
region_size += HA_CHUNK;
region_start = (pg_start / HA_CHUNK) * HA_CHUNK;
rg_start = region_start;
rg_sz = region_size;
rg_start = ALIGN_DOWN(pg_start, ha_pages_in_chunk);
rg_sz = ALIGN(pfn_cnt, ha_pages_in_chunk);
}
if (do_hot_add)
resp.page_count = process_hot_add(pg_start, pfn_cnt,
rg_start, rg_sz);
rg_start, rg_sz);
dm->num_pages_added += resp.page_count;
#endif
@ -1211,11 +1176,10 @@ static void post_status(struct hv_dynmem_device *dm)
sizeof(struct dm_status),
(unsigned long)NULL,
VM_PKT_DATA_INBAND, 0);
}
static void free_balloon_pages(struct hv_dynmem_device *dm,
union dm_mem_page_range *range_array)
union dm_mem_page_range *range_array)
{
int num_pages = range_array->finfo.page_cnt;
__u64 start_frame = range_array->finfo.start_page;
@ -1231,8 +1195,6 @@ static void free_balloon_pages(struct hv_dynmem_device *dm,
}
}
static unsigned int alloc_balloon_pages(struct hv_dynmem_device *dm,
unsigned int num_pages,
struct dm_balloon_response *bl_resp,
@ -1278,7 +1240,6 @@ static unsigned int alloc_balloon_pages(struct hv_dynmem_device *dm,
page_to_pfn(pg);
bl_resp->range_array[i].finfo.page_cnt = alloc_unit;
bl_resp->hdr.size += sizeof(union dm_mem_page_range);
}
return i * alloc_unit;
@ -1332,7 +1293,7 @@ static void balloon_up(struct work_struct *dummy)
if (num_ballooned == 0 || num_ballooned == num_pages) {
pr_debug("Ballooned %u out of %u requested pages.\n",
num_pages, dm_device.balloon_wrk.num_pages);
num_pages, dm_device.balloon_wrk.num_pages);
bl_resp->more_pages = 0;
done = true;
@ -1366,16 +1327,15 @@ static void balloon_up(struct work_struct *dummy)
for (i = 0; i < bl_resp->range_count; i++)
free_balloon_pages(&dm_device,
&bl_resp->range_array[i]);
&bl_resp->range_array[i]);
done = true;
}
}
}
static void balloon_down(struct hv_dynmem_device *dm,
struct dm_unballoon_request *req)
struct dm_unballoon_request *req)
{
union dm_mem_page_range *range_array = req->range_array;
int range_count = req->range_count;
@ -1389,7 +1349,7 @@ static void balloon_down(struct hv_dynmem_device *dm,
}
pr_debug("Freed %u ballooned pages.\n",
prev_pages_ballooned - dm->num_pages_ballooned);
prev_pages_ballooned - dm->num_pages_ballooned);
if (req->more_pages == 1)
return;
@ -1414,8 +1374,7 @@ static int dm_thread_func(void *dm_dev)
struct hv_dynmem_device *dm = dm_dev;
while (!kthread_should_stop()) {
wait_for_completion_interruptible_timeout(
&dm_device.config_event, 1*HZ);
wait_for_completion_interruptible_timeout(&dm_device.config_event, 1 * HZ);
/*
* The host expects us to post information on the memory
* pressure every second.
@ -1439,9 +1398,8 @@ static int dm_thread_func(void *dm_dev)
return 0;
}
static void version_resp(struct hv_dynmem_device *dm,
struct dm_version_response *vresp)
struct dm_version_response *vresp)
{
struct dm_version_request version_req;
int ret;
@ -1502,7 +1460,7 @@ static void version_resp(struct hv_dynmem_device *dm,
}
static void cap_resp(struct hv_dynmem_device *dm,
struct dm_capabilities_resp_msg *cap_resp)
struct dm_capabilities_resp_msg *cap_resp)
{
if (!cap_resp->is_accepted) {
pr_err("Capabilities not accepted by host\n");
@ -1535,7 +1493,7 @@ static void balloon_onchannelcallback(void *context)
switch (dm_hdr->type) {
case DM_VERSION_RESPONSE:
version_resp(dm,
(struct dm_version_response *)dm_msg);
(struct dm_version_response *)dm_msg);
break;
case DM_CAPABILITIES_RESPONSE:
@ -1565,7 +1523,7 @@ static void balloon_onchannelcallback(void *context)
dm->state = DM_BALLOON_DOWN;
balloon_down(dm,
(struct dm_unballoon_request *)recv_buffer);
(struct dm_unballoon_request *)recv_buffer);
break;
case DM_MEM_HOT_ADD_REQUEST:
@ -1603,17 +1561,15 @@ static void balloon_onchannelcallback(void *context)
default:
pr_warn_ratelimited("Unhandled message: type: %d\n", dm_hdr->type);
}
}
}
#define HV_LARGE_REPORTING_ORDER 9
#define HV_LARGE_REPORTING_LEN (HV_HYP_PAGE_SIZE << \
HV_LARGE_REPORTING_ORDER)
static int hv_free_page_report(struct page_reporting_dev_info *pr_dev_info,
struct scatterlist *sgl, unsigned int nents)
struct scatterlist *sgl, unsigned int nents)
{
unsigned long flags;
struct hv_memory_hint *hint;
@ -1648,7 +1604,7 @@ static int hv_free_page_report(struct page_reporting_dev_info *pr_dev_info,
*/
/* page reporting for pages 2MB or higher */
if (order >= HV_LARGE_REPORTING_ORDER ) {
if (order >= HV_LARGE_REPORTING_ORDER) {
range->page.largepage = 1;
range->page_size = HV_GPA_PAGE_RANGE_PAGE_SIZE_2MB;
range->base_large_pfn = page_to_hvpfn(
@ -1662,23 +1618,21 @@ static int hv_free_page_report(struct page_reporting_dev_info *pr_dev_info,
range->page.additional_pages =
(sg->length / HV_HYP_PAGE_SIZE) - 1;
}
}
status = hv_do_rep_hypercall(HV_EXT_CALL_MEMORY_HEAT_HINT, nents, 0,
hint, NULL);
local_irq_restore(flags);
if (!hv_result_success(status)) {
pr_err("Cold memory discard hypercall failed with status %llx\n",
status);
status);
if (hv_hypercall_multi_failure > 0)
hv_hypercall_multi_failure++;
if (hv_result(status) == HV_STATUS_INVALID_PARAMETER) {
pr_err("Underlying Hyper-V does not support order less than 9. Hypercall failed\n");
pr_err("Defaulting to page_reporting_order %d\n",
pageblock_order);
pageblock_order);
page_reporting_order = pageblock_order;
hv_hypercall_multi_failure++;
return -EINVAL;
@ -1712,7 +1666,7 @@ static void enable_page_reporting(void)
pr_err("Failed to enable cold memory discard: %d\n", ret);
} else {
pr_info("Cold memory discard hint enabled with order %d\n",
page_reporting_order);
page_reporting_order);
}
}
@ -1795,7 +1749,7 @@ static int balloon_connect_vsp(struct hv_device *dev)
if (ret)
goto out;
t = wait_for_completion_timeout(&dm_device.host_event, 5*HZ);
t = wait_for_completion_timeout(&dm_device.host_event, 5 * HZ);
if (t == 0) {
ret = -ETIMEDOUT;
goto out;
@ -1831,10 +1785,13 @@ static int balloon_connect_vsp(struct hv_device *dev)
cap_msg.caps.cap_bits.hot_add = hot_add_enabled();
/*
* Specify our alignment requirements as it relates
* memory hot-add. Specify 128MB alignment.
* Specify our alignment requirements for memory hot-add. The value is
* the log base 2 of the number of megabytes in a chunk. For example,
* with 256 MiB chunks, the value is 8. The number of MiB in a chunk
* must be a power of 2.
*/
cap_msg.caps.cap_bits.hot_add_alignment = 7;
cap_msg.caps.cap_bits.hot_add_alignment =
ilog2(HA_BYTES_IN_CHUNK / SZ_1M);
/*
* Currently the host does not use these
@ -1850,7 +1807,7 @@ static int balloon_connect_vsp(struct hv_device *dev)
if (ret)
goto out;
t = wait_for_completion_timeout(&dm_device.host_event, 5*HZ);
t = wait_for_completion_timeout(&dm_device.host_event, 5 * HZ);
if (t == 0) {
ret = -ETIMEDOUT;
goto out;
@ -1891,8 +1848,8 @@ static int hv_balloon_debug_show(struct seq_file *f, void *offset)
char *sname;
seq_printf(f, "%-22s: %u.%u\n", "host_version",
DYNMEM_MAJOR_VERSION(dm->version),
DYNMEM_MINOR_VERSION(dm->version));
DYNMEM_MAJOR_VERSION(dm->version),
DYNMEM_MINOR_VERSION(dm->version));
seq_printf(f, "%-22s:", "capabilities");
if (ballooning_enabled())
@ -1941,10 +1898,10 @@ static int hv_balloon_debug_show(struct seq_file *f, void *offset)
seq_printf(f, "%-22s: %u\n", "pages_ballooned", dm->num_pages_ballooned);
seq_printf(f, "%-22s: %lu\n", "total_pages_committed",
get_pages_committed(dm));
get_pages_committed(dm));
seq_printf(f, "%-22s: %llu\n", "max_dynamic_page_count",
dm->max_dynamic_page_count);
dm->max_dynamic_page_count);
return 0;
}
@ -1954,7 +1911,7 @@ DEFINE_SHOW_ATTRIBUTE(hv_balloon_debug);
static void hv_balloon_debugfs_init(struct hv_dynmem_device *b)
{
debugfs_create_file("hv-balloon", 0444, NULL, b,
&hv_balloon_debug_fops);
&hv_balloon_debug_fops);
}
static void hv_balloon_debugfs_exit(struct hv_dynmem_device *b)
@ -1984,8 +1941,23 @@ static int balloon_probe(struct hv_device *dev,
hot_add = false;
#ifdef CONFIG_MEMORY_HOTPLUG
/*
* Hot-add must operate in chunks that are of size equal to the
* memory block size because that's what the core add_memory()
* interface requires. The Hyper-V interface requires that the memory
* block size be a power of 2, which is guaranteed by the check in
* memory_dev_init().
*/
ha_pages_in_chunk = memory_block_size_bytes() / PAGE_SIZE;
do_hot_add = hot_add;
#else
/*
* Without MEMORY_HOTPLUG, the guest returns a failure status for all
* hot add requests from Hyper-V, and the chunk size is used only to
* specify alignment to Hyper-V as required by the host/guest protocol.
* Somewhat arbitrarily, use 128 MiB.
*/
ha_pages_in_chunk = SZ_128M / PAGE_SIZE;
do_hot_add = false;
#endif
dm_device.dev = dev;
@ -2097,7 +2069,6 @@ static int balloon_suspend(struct hv_device *hv_dev)
tasklet_enable(&hv_dev->channel->callback_event);
return 0;
}
static int balloon_resume(struct hv_device *dev)
@ -2156,7 +2127,6 @@ static struct hv_driver balloon_drv = {
static int __init init_balloon_drv(void)
{
return vmbus_driver_register(&balloon_drv);
}
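
Note on the hv_balloon.c hunks above: the removed divide/multiply/remainder sequences and their ALIGN() / ALIGN_DOWN() replacements should compute the same values whenever the chunk size is a power of 2, which the new comment in balloon_probe() says is guaranteed. The following is a minimal userspace sketch of that equivalence and of the hot_add_alignment calculation. The helper names (align_up, align_down, round_up_open_coded, ilog2_ul, hot_add_alignment) are hypothetical stand-ins written for illustration; they are not the kernel macros and are not part of this commit.

```c
/*
 * Userspace stand-ins (assumptions, not the kernel macros themselves) for
 * ALIGN() and ALIGN_DOWN(). As with the kernel versions, the results are
 * only meaningful when "a" is a power of 2.
 */
static unsigned long align_up(unsigned long x, unsigned long a)
{
	return (x + a - 1) & ~(a - 1);
}

static unsigned long align_down(unsigned long x, unsigned long a)
{
	return x & ~(a - 1);
}

/* The open-coded round-up sequence that the diff removes. */
static unsigned long round_up_open_coded(unsigned long x, unsigned long chunk)
{
	unsigned long r = (x / chunk) * chunk;

	if (x % chunk)
		r += chunk;
	return r;
}

/* Integer log base 2 of a nonzero value; stand-in for the kernel's ilog2(). */
static unsigned int ilog2_ul(unsigned long v)
{
	unsigned int n = 0;

	while (v >>= 1)
		n++;
	return n;
}

/*
 * Value reported to the host: log base 2 of the number of MiB in a chunk,
 * as described in the balloon_connect_vsp() comment.
 */
static unsigned int hot_add_alignment(unsigned long long bytes_in_chunk)
{
	return ilog2_ul((unsigned long)(bytes_in_chunk >> 20));
}
```

Assuming 4 KiB pages, a 128 MiB chunk is 32768 pages, so align_up(32769, 32768) and round_up_open_coded(32769, 32768) both give 65536, and hot_add_alignment(128ULL << 20) gives 7, the constant that the diff replaces. A 256 MiB chunk gives 8, matching the example in the new comment.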


@ -17,6 +17,7 @@ endif
MAKEFLAGS += -r
override CFLAGS += -O2 -Wall -g -D_GNU_SOURCE -I$(OUTPUT)include
override CFLAGS += -Wno-address-of-packed-member
ALL_TARGETS := hv_kvp_daemon hv_vss_daemon
ifneq ($(ARCH), aarch64)