Commit graph

6722 commits

Author SHA1 Message Date
Hans Petter Selasky 3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Tycho Nightingale 896d1f7723 Add support for emulating the move instruction: "mov r/m8, imm8".
Reviewed by:	neel
2014-06-26 17:15:41 +00:00
Peter Grehan cf1d80d88c Expose the amount of resident and wired memory from the guest's vmspace.
This is different than the amount shown for the process e.g. by
/usr/bin/top - that is the mappings faulted in by the mmap'd region
of guest memory.

The values can be fetched with bhyvectl

 # bhyvectl --get-stats --vm=myvm
 ...
 Resident memory                         	413749248
 Wired memory                            	0
 ...

vmm_stat.[ch] -
 Modify the counter code in bhyve to allow direct setting of a counter
as opposed to incrementing, and providing a callback to fetch a
counter's value.

Reviewed by:	neel
2014-06-25 22:13:35 +00:00
Konstantin Belousov 633034fe0e Add FPU_KERN_KTHR flag to fpu_kern_enter(9), which avoids saving FPU
context into memory for the kernel threads which called
fpu_kern_thread(9).  This allows the fpu_kern_enter() callers to not
check for is_fpu_kern_thread() to get the optimization.

Apply the flag to padlock(4) and aesni(4).  In aesni_cipher_process(),
do not leak FPU context state on error.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-23 07:37:54 +00:00
Dmitry Chagin 2dedc1281a Revert r266925 as it can lead to instant panic at fexecve():
To allow to run the interpreter itself add a new ELF branding type.

Pointed out by:	kib, mjg
2014-06-17 05:29:18 +00:00
Tycho Nightingale a026dc3fcb Bring an overly enthusiastic KASSERT inline with the Intel SDM.
Reviewed by:	neel
2014-06-16 22:59:18 +00:00
Attilio Rao 3ae10f7477 - Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
  adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI.  __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2014-06-16 18:15:27 +00:00
Roger Pau Monné ef409ede7b amd64/i386: introduce APIC hooks for different APIC implementations.
This is needed for Xen PV(H) guests, since there's no hardware lapic
available on this kind of domains. This commit should not change
functionality.

Sponsored by: Citrix Systems R&D
Reviewed by: jhb
Approved by: gibbs

amd64/include/cpu.h:
amd64/amd64/mp_machdep.c:
i386/include/cpu.h:
i386/i386/mp_machdep.c:
 - Remove lapic_ipi_vectored hook from cpu_ops, since it's now
   implemented in the lapic hooks.

amd64/amd64/mp_machdep.c:
i386/i386/mp_machdep.c:
 - Use lapic_ipi_vectored directly, since it's now an inline function
   that will call the appropiate hook.

x86/x86/local_apic.c:
 - Prefix bare metal public lapic functions with native_ and mark them
   as static.
 - Define default implementation of apic_ops.

x86/include/apicvar.h:
 - Declare the apic_ops structure and create inline functions to
   access the hooks, so the change is transparent to existing users of
   the lapic_ functions.

x86/xen/hvm.c:
 - Switch to use the new apic_ops.
2014-06-16 08:43:03 +00:00
Tycho Nightingale 5ebc578ba6 Replace enum forward declarations with complete definitions.
Reviewed by:	neel
2014-06-10 18:46:00 +00:00
Neel Natu 404874659f Add helper functions to populate VM exit information for rendezvous and
astpending exits. This is to reduce code duplication between VT-x and
SVM implementations.
2014-06-10 16:45:58 +00:00
Neel Natu 0494cb1bcb Turn on interrupt window exiting unconditionally when an ExtINT is being
injected into the guest. This allows the hypervisor to inject another
ExtINT or APIC vector as soon as the guest is able to process interrupts.

This change is not to address any correctness issue but to guarantee that
any pending APIC vector that was preempted by the ExtINT will be injected
as soon as possible. Prior to this change such pending interrupts could be
delayed until the next VM exit.
2014-06-10 01:38:02 +00:00
Neel Natu 051f2bd19d Add reserved bit checking when doing %CR8 emulation and inject #GP if required.
Pointed out by:	grehan
Reviewed by:	tychon
2014-06-09 20:51:08 +00:00
Neel Natu 5fcf252f41 Add ioctl(VM_REINIT) to reinitialize the virtual machine state maintained
by vmm.ko. This allows the virtual machine to be restarted without having
to destroy it first.

Reviewed by:	grehan
2014-06-07 21:36:52 +00:00
Alan Cox dd05fa1945 Add a page size field to struct vm_page. Increase the page size field when
a partially populated reservation becomes fully populated, and decrease this
field when a fully populated reservation becomes partially populated.

Use this field to simplify the implementation of pmap_enter_object() on
amd64, arm, and i386.

On all architectures where we support superpages, the cost of creating a
superpage mapping is roughly the same as creating a base page mapping.  For
example, both kinds of mappings entail the creation of a single PTE and PV
entry.  With this in mind, use the page size field to make the
implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little
smarter.  Previously, if MAP_PREFAULT_PARTIAL was specified to
vm_map_pmap_enter(), that function would only map base pages.  Now, it will
create up to 96 base page or superpage mappings.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-06-07 17:12:26 +00:00
Tycho Nightingale 594db0024e Support guest accesses to %cr8.
Reviewed by:	neel
2014-06-06 18:23:49 +00:00
Warner Losh 3f1afabf09 Restore comments accidentally removed.
MFC after: 3 days
2014-06-06 04:08:55 +00:00
Neel Natu 95ebc360ef Activate vcpus from bhyve(8) using the ioctl VM_ACTIVATE_CPU instead of doing
it implicitly in vmm.ko.

Add ioctl VM_GET_CPUS to get the current set of 'active' and 'suspended' cpus
and display them via /usr/sbin/bhyvectl using the "--get-active-cpus" and
"--get-suspended-cpus" options.

This is in preparation for being able to reset virtual machine state without
having to destroy and recreate it.
2014-05-31 23:37:34 +00:00
Dmitry Chagin 5f56da1891 To allow to run the interpreter itself add a new ELF branding type.
Allow Linux ABI to run ELF interpreter.

MFC after:	3 days
2014-05-31 15:01:51 +00:00
Tycho Nightingale 11669a681c If VMX isn't enabled so long as the lock bit isn't set yet in MSR
IA32_FEATURE_CONTROL it still can be.

Approved by:	grehan (co-mentor)
2014-05-30 23:37:31 +00:00
Neel Natu 92754b1199 Remove bogus check for kmem_malloc() failure even though M_WAITOK is set.
Requested by:	jkim
2014-05-30 20:58:32 +00:00
Neel Natu 8e351f8a3c Allocate a zeroed LDT.
Failing to do this might result in the LDT appearing to run out of free
descriptors because of random junk in the descriptor's 'sd_type' field.

http://lists.freebsd.org/pipermail/freebsd-amd64/2014-May/016088.html

Reviewed by:	kib
MFC after:	2 weeks
2014-05-30 18:59:37 +00:00
Konstantin Belousov 64e9726555 When usermode loaded non-default segment selector into the %gs,
correctly prepare KGSBASE msr to restore the user descriptor base on
the last swapgs during return to usermode.

Reported and tested by:	peterj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-05-29 16:18:31 +00:00
Mark Johnston f2789bd5c7 Commit the rest of the changes that were intended to be part of r266826.
X-MFC-with:	r266826
2014-05-29 01:42:22 +00:00
John Baldwin 44a68c4e40 - Rework the XSAVE/XRSTOR emulation to only expose XCR0 features to the
guest for which the rules regarding xsetbv emulation are known.  In
  particular future extensions like AVX-512 have interdependencies among
  feature bits that could allow a guest to trigger a GP# in the host with
  the current approach of allowing anything the host supports.
- Add proper checking of Intel MPX and AVX-512 XSAVE features in the
  xsetbv emulation and allow these features to be exposed to the guest if
  they are enabled in the host.
- Expose a subset of known-safe features from leaf 0 of the structured
  extended features to guests if they are supported on the host including
  RDFSBASE/RDGSBASE, BMI1/2, AVX2, AVX-512, HLE, ERMS, and RTM.  Aside
  from AVX-512, these features are all new instructions available for use
  in ring 3 with no additional hypervisor changes needed.

Reviewed by:	neel
2014-05-27 19:04:38 +00:00
Neel Natu 65ffa035a7 Add segment protection and limits violation checks in vie_calculate_gla()
for 32-bit x86 guests.

Tested using ins/outs executed in a FreeBSD/i386 guest.
2014-05-27 04:26:22 +00:00
Neel Natu ae0780bbf1 Remove restriction on insb/insw/insl emulation. These instructions are
properly emulated.
2014-05-25 02:05:23 +00:00
Neel Natu 5382c19d81 Do the linear address calculation for the ins/outs emulation using a new
API function 'vie_calculate_gla()'.

While the current implementation is simplistic it forms the basis of doing
segmentation checks if the guest is in 32-bit protected mode.
2014-05-25 00:57:24 +00:00
Neel Natu da11f4aa1d Add libvmmapi functions vm_copyin() and vm_copyout() to copy into and out
of the guest linear address space. These APIs in turn use a new ioctl
'VM_GLA2GPA' to convert the guest linear address to guest physical.

Use the new copyin/copyout APIs when emulating ins/outs instruction in
bhyve(8).
2014-05-24 23:12:30 +00:00
Neel Natu e813a87350 Consolidate all the information needed by the guest page table walker into
'struct vm_guest_paging'.

Check for canonical addressing in vmm_gla2gpa() and inject a protection
fault into the guest if a violation is detected.

If the page table walk is restarted in vmm_gla2gpa() then reset 'ptpphys' to
point to the root of the page tables.
2014-05-24 20:26:57 +00:00
Neel Natu 37a723a5b3 When injecting a page fault into the guest also update the guest's %cr2 to
indicate the faulting linear address.

If the guest PML4 entry has the PG_PS bit set then inject a page fault into
the guest with the PGEX_RSV bit set in the error_code.

Get rid of redundant checks for the PG_RW violations when walking the page
tables.
2014-05-24 19:13:25 +00:00
Neel Natu a7424861fb Check for alignment check violation when processing in/out string instructions. 2014-05-23 19:59:14 +00:00
Neel Natu d17b5104a9 Add emulation of the "outsb" instruction. NetBSD guests use this to write to
the UART FIFO.

The emulation is constrained in a number of ways: 64-bit only, doesn't check
for all exception conditions, limited to i/o ports emulated in userspace.

Some of these constraints will be relaxed in followup commits.

Requested by:	grehan
Reviewed by:	tychon (partially and a much earlier version)
2014-05-23 05:15:17 +00:00
Neel Natu c5e423dd2e A Centos 6.4 guest will write 0xff to the 8259 mask register before beginning
the proper ICWx initialization sequence. It assumes, probably correctly, that
the boot firmware has done the 8259 initialization.

Since grub-bhyve does not initialize the 8259 this write to the mask register
takes a code path in which 'error' remains uninitialized (ready=0,icw_num=0).

Fix this by initializing 'error' at the start of the function.
2014-05-23 05:04:50 +00:00
John Baldwin 0eb7ae8d0a Don't permit users to request a subset of the AVX512 or MPX xsave masks.
These masks are documented in the Intel Architecture Instruction Set
Extensions Programming Reference (March 2014).

Reviewed by:	kib
MFC after:	1 month
2014-05-22 18:22:02 +00:00
Neel Natu ba6f5e23cc Allow vmx_getdesc() and vmx_setdesc() to be called for a vcpu that is in the
VCPU_RUNNING state. This will let the VMX exit handler inspect the vcpu's
segment descriptors without having to exit the critical section.
2014-05-22 17:22:37 +00:00
Justin Hibbits 81e3caaf77 imagact_binmisc builds for all supported architectures, so enable it for all.
Any bugs in execution will be dealt with as they crop up.

MFC after:	3 weeks
Relnotes:	Yes
2014-05-22 05:04:40 +00:00
Neel Natu fd949af642 Inject page fault into the guest if the page table walker detects an invalid
translation for the guest linear address.
2014-05-22 03:14:54 +00:00
Neel Natu f888763dd8 Add PG_RW check when translating a guest linear to guest physical address.
Set the accessed and dirty bits in the page table entry. If it fails then
restart the page table walk from the beginning. This might happen if another
vcpu modifies the page tables simultaneously.

Reviewed by:	alc, kib
2014-05-20 20:30:28 +00:00
John Baldwin 674b6d6e0d Add support for decoding the AMD SVM instructions. 2014-05-19 18:07:37 +00:00
Neel Natu e4c8a13d61 Add PG_U (user/supervisor) checks when translating a guest linear address
to a guest physical address.

PG_PS (page size) field is valid only in a PDE or a PDPTE so it is now
checked only in non-terminal paging entries.

Ignore the upper 32-bits of the CR3 for PAE paging.
2014-05-19 03:50:07 +00:00
Peter Grehan 897bb47e7b Make the vmx asm code dtrace-fbt-friendly by
- inserting frame enter/leave sequences
 - restructuring the vmx_enter_guest routine so that it subsumes
   the vm_exit_guest block, which was the #vmexit RIP and not a
   callable routine.

Reviewed by:	neel
MFC after:	3 weeks
2014-05-18 03:50:17 +00:00
John Baldwin 8b3949c344 Add support for decoding rdrand and rdseed. 2014-05-17 21:10:03 +00:00
John Baldwin 355d8a2f91 Add definitions for more structured extended features as well as
XSAVE Extended Features for AVX512 and MPX (Memory Protection Extensions).

Obtained from:	Intel's Instruction Set Extensions Programming Reference
                (March 2014)
2014-05-16 17:45:09 +00:00
John Baldwin b3e9732a76 Implement a PCI interrupt router to route PCI legacy INTx interrupts to
the legacy 8259A PICs.
- Implement an ICH-comptabile PCI interrupt router on the lpc device with
  8 steerable pins configured via config space access to byte-wide
  registers at 0x60-63 and 0x68-6b.
- For each configured PCI INTx interrupt, route it to both an I/O APIC
  pin and a PCI interrupt router pin.  When a PCI INTx interrupt is
  asserted, ensure that both pins are asserted.
- Provide an initial routing of PCI interrupt router (PIRQ) pins to
  8259A pins (ISA IRQs) and initialize the interrupt line config register
  for the corresponding PCI function with the ISA IRQ as this matches
  existing hardware.
- Add a global _PIC method for OSPM to select the desired interrupt routing
  configuration.
- Update the _PRT methods for PCI bridges to provide both APIC and legacy
  PRT tables and return the appropriate table based on the configured
  routing configuration.  Note that if the lpc device is not configured, no
  routing information is provided.
- When the lpc device is enabled, provide ACPI PCI link devices corresponding
  to each PIRQ pin.
- Add a VMM ioctl to adjust the trigger mode (edge vs level) for 8259A
  pins via the ELCR.
- Mark the power management SCI as level triggered.
- Don't hardcode the number of elements in Packages in the source for
  the DSDT.  iasl(8) will fill in the actual number of elements, and
  this makes it simpler to generate a Package with a variable number of
  elements.

Reviewed by:	tycho
2014-05-15 14:16:55 +00:00
Neel Natu f3db4c53e6 Increase the TSS limit by one byte. The processor requires an additional byte
with all bits set to 1 beyond the I/O permission bitmap.

Prior to this change accessing I/O ports [0xFFF8-0xFFFF] would trigger a
#GP fault even though the I/O bitmap allowed access to those ports.

For more details see section "I/O Permission Bit Map" in the Intel SDM, Vol 1.

Reviewed by:	kib
2014-05-14 22:24:09 +00:00
Neel Natu 055fc2cb5e Virtual machine halt detection is turned on by default. Allow it to be
disabled via the tunable 'hw.vmm.halt_detection'.
2014-05-05 16:19:24 +00:00
Nathan Whitehorn a9d0ed68b3 Disable ACPI and P4TCC throttling by default, following discussion on
freebsd-current. These CPU speed control techniques are usually unhelpful
at best. For now, continue building the relevant code into GENERIC so that
it can trivially be re-enabled at runtime if anyone wants it.

MFC after:	1 month
2014-05-04 16:38:21 +00:00
Kenneth D. Merry 991554f2c4 Bring in the mpr(4) driver for LSI's MPT3 12Gb SAS controllers.
This is derived from the mps(4) driver, but it supports only the 12Gb
IT and IR hardware including the SAS 3004, SAS 3008 and SAS 3108.

Some notes about this driver:
 o The 12Gb hardware can do "FastPath" I/O, and that capability is included in
   this driver.

 o WarpDrive functionality has been removed, since it isn't supported in
   the 12Gb driver interface.

 o The Scatter/Gather list handling code is significantly different between
   the 6Gb and 12Gb hardware.  The 12Gb boards support IEEE Scatter/Gather
   lists.

Thanks to LSI for developing and testing this driver for FreeBSD.

share/man/man4/mpr.4:
	mpr(4) man page.

sys/dev/mpr/*:
	mpr(4) driver files.

sys/modules/Makefile,
sys/modules/mpr/Makefile:
	Add a module Makefile for the mpr(4) driver.

sys/conf/files:
	Add the mpr(4) driver.

sys/amd64/conf/GENERIC,
sys/i386/conf/GENERIC,
sys/mips/conf/OCTEON1,
sys/sparc64/conf/GENERIC:
	Add the mpr(4) driver to all config files that currently
	have the mps(4) driver.

sys/ia64/conf/GENERIC:
	Add the mps(4) and mpr(4) drivers to the ia64 GENERIC
	config file.

sys/i386/conf/XEN:
	Exclude the mpr module from building here.

Submitted by:	Steve McConnell <Stephen.McConnell@lsi.com>
MFC after:	3 days
Tested by:	Chris Reeves <chrisr@spectralogic.com>
Sponsored by:	LSI, Spectra Logic
Relnotes:	LSI 12Gb SAS driver mpr(4) added
2014-05-02 20:25:09 +00:00
Eitan Adler 804e017089 lindev(4): finish the partial commit in r265212
lindev(4) was only used to provide /dev/full which is now a standard feature of
FreeBSD.  /dev/full was never linux-specific and provides a generally useful
feature.

Document this in UPDATING and bump __FreeBSD_version.  This will be documented
in the PH shortly.

Reported by:	jkim
2014-05-02 07:14:22 +00:00
Neel Natu e50ce2aa06 Add logic in the HLT exit handler to detect if the guest has put all vcpus
to sleep permanently by executing a HLT with interrupts disabled.

When this condition is detected the guest with be suspended with a reason of
VM_SUSPEND_HALT and the bhyve(8) process will exit.

Tested by executing "halt" inside a RHEL7-beta guest.

Discussed with:	grehan@
Reviewed by:	jhb@, tychon@
2014-05-02 00:33:56 +00:00