bus provide it with its needed memory resources.
This allows us to use PCIe on the ThunderX2 and, with a previous version
of the patch, on the SoftIron 3000 with ACPI.
Obtained from: ABT Systems Ltd
Sponsored by: The FreeBSD Foundation
Sponsored by: DARPA, AFRL
Sponsored by: Cavium (Hardware)
Differential Revision: https://reviews.freebsd.org/D8767
6.0 spec 6.4.3.5 bit 0 is ignored on QWord, DWord, and Word Address Space
Descriptors, but not Extended Address Space Descriptors.
Reviewed by: jhb
Sponsored by: DARPA, AFRL
Sponsored by: Cavium (Hardware)
Differential Revision: https://reviews.freebsd.org/D14516
allocated with a tag to come from the specified domain if it meets the
other constraints provided by the tag. Automatically create a tag at
the root of each bus specifying the domain local to that bus if
available.
Reviewed by: jhb, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13545
This adds a new acpi_bus interface with a map_intr method. This is similar
to the Open Firmware map_intr method and allows us to create the needed
mapping from ACPI space to INTRNG space.
Obtained from: ABT Systems Ltd
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D8617
By the ACPI standard (ACPI 5 chapter 8.4 Declaring Processors) Processors
can be implemented in 2 distinct ways:
* Through a Processor object type (which provides P_BLK)
* Through a Device object type
Prior to this change, the FreeBSD driver only supported the former. AMD
Epyc / Poweredge systems we are testing both implement the latter only. Add
the missing support.
Because P_BLK is not defined in the device object case, C-states entering
must be completely controlled via _CST methods rather than P_LVL2/3.
John Baldwin points out that ACPI 6.0 formally deprecates the Processor
keyword, so eventually processors will only be enumerated as Device objects.
Submitted by: attilio
Reviewed by: jhb, markj, Anton Rang <rang AT acm.org>
Relnotes: maybe
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13457
From ACPICA 20170929, AcpiOsGetTimer() should be available early because
While() loop timeout mechanism was reimplemented with it. Unfortunately,
it means AcpiLoadTables() may cause panic when a While() loop is executed.
After having lengthy discussions with ACPICA developers, I have concluded
that dummy timecounter is good enough for the purpose and it is the least
intrusive solution for now. Also, they reminded me the ACPI specification
implies OS timer function should be available before loading tables.
This addresses a deadlock during boot when EARLY_AP_STARTUP is configured:
a taskqueue thread may call pause() with an ACPI mutex held, and thread0
may block on this mutex before configuring the eventtimer. In this case
the taskqueue thread will sleep forever waiting for its callout to fire.
PR: 220277
Submitted by: jhb
MFC after: 3 days
For GEN1 Hyper-V, vmbus is attached to pcib0, which contains the
resources for PCI passthrough and SR-IOV. There is no
acpi_syscontainer0 on GEN1 Hyper-V.
For GEN2 Hyper-V, vmbus is attached to acpi_syscontainer0, which
contains the resources for PCI passthrough and SR-IOV. There is
no pcib0 on GEN2 Hyper-V.
The ACPI VMBUS device now only holds its _CRS, which is empty as
of this commit; its existence is mainly for upward compatibility.
Device tree structure is suggested by jhb@.
Tested-by: dexuan@
Collabrated-wth: dexuan@
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D10565
- Rename the default implementation of 'pcib_request_feature' and add
a pcib_request_feature() wrapper function (as is often done for
new-bus APIs implemented via kobj) that accepts a single function.
Previously the call to pcib_request_feature() ended up invoking the
method on the great-great-grandparent of the bridge device instead
of the grandparent. For a bridge that was a direct child of pci0 on
x86 this resulted in the method skipping over the Host-PCI bridge
driver and being invoked against nexus0
- When invoking _OSC from a Host-PCI bridge driver, invoke
device_get_softc() against the Host-PCI bridge device instead of the
child bridge that is requesting HotPlug. Using the wrong softc data
resulted in garbage being passed for the ACPI handle causing the
_OSC call to fail.
- While here, perform some other cleanups to _OSC handling in the ACPI
Host-PCI bridge driver:
- Don't invoke _OSC when requesting a control that has already been
granted by the firmware.
- Don't set the first word of the capability array before invoking
_OSC. This word is always set explicitly by acpi_EvaluateOSC()
since it is UUID-independent.
- Don't modify the set of granted controls unless _OSC doesn't exist
(which is treated as always successful), or the _OSC method
doesn't fail.
- Don't require an _OSC status of 0 for success. _OSC always
returns the updated control mask even if it returns a non-zero
status in the first word.
- Whine if _OSC ever tries to revoke a previously-granted control.
(It is not supposed to do that.)
- While here, add constants for the _OSC status word in acpivar.h
(though currently unused).
Reported by: adrian
Reviewed by: imp
MFC after: 1 week
Tested on: Lenovo x220
Differential Revision: https://reviews.freebsd.org/D10520
The MFC will include a compat definition of smp_no_rendevous_barrier()
that calls smp_no_rendezvous_barrier().
Reviewed by: gnn, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D10313
During acpi_cmbat_attach() the acpi_cmbat_init_battery() notification
handler is registered. It has been observed this notification handler
can be called instantly, before the attach routine has returned. In
the notification handler there is a call to device_is_attached() which
returns false. Because the softc is set we know an attach is in
progress and the fix is simply to wait and try again in this case.
Reviewed by: avg @
MFC after: 1 week
Convert PCIe hot plug support over to asking the firmware, if any, for
permission to use the HotPlug hardware. Implement pci_request_feature
for ACPI. All other host pci connections to allowing all valid feature
requests.
Sponsored by: Netflix
On Core2 and older Intel CPUs, where TSC stops in C2, system does not
allow C2 entrance if timecounter hardware is TSC. This is done by
tc_windup() which tests for TC_FLAGS_C2STOP flag of the new
timecounter and increases cpu_disable_c2_sleep if flag is set. Right
now init_TSC_tc() only sets the flag if cpu_deepest_sleep >= 2, but
TSC is initialized too early for this variable to be set by
acpi_cpu.c.
There is no reason to require that ACPI reported C2 and deeper states
to set TC_FLAGS_C2STOP, so remove cpu_deepest_sleep test from
init_TSC_tc() condition. And since this is the only use of the
variable, remove it at all.
Reported and submitted by: Jia-Shiun Li <jiashiun@gmail.com>
Suggested by: jhb
MFC after: 2 weeks
using the ACPI C1/mwait sleep method.
Previously, the mwait instruction would return when an interrupt was
pending; however, the idle loop did not actually enable interrupts when
this occurred. This led to a situation where the idle loop could quickly
spin through the C1/mwait sleep method a number of times when an interrupt
was pending. (Eventually, the situation corrected itself when something
other than an interrupt triggered the idle loop to either enable interrupts
or schedule another thread.)
Reviewed by: kib, imp (earlier version)
Input from: jhb
MFC after: 1 week
Sponsored by: Netflix
memory-mapped devices that are normally PCIe drives. Devices can then use
the existing pci_get_class, etc. accessors to query this data.
The ivar values are different enough from the existing ACPI and ISA values
to not conflict.
Reviewed by: jhb
Obtained from: ABT Systems Ltd
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D8721
In order to make Prometheus do graphing/alerting on thermal sensors in a
generic fashion, we should attach the name of the thermal zone device as
a label. That way there is only a single metric for the temperature of a
thermal zone, with its name attached as a label.
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D8775
If the bus number assigned to a Host-PCI bridge doesn't match the first
bus number in the associated producer range from _CRS, print a warning and
fail to attach rather than panicking due to an assertion failure.
At least one single-socket Dell machine leaves a "ghost" Host-PCI bridge
device in the ACPI namespace that seems to correspond to the I/O hub in
the second socket of a two-socket machine. However, the BIOS doesn't
configure the settings for this "ghost" bridge correctly, nor does it have
any PCI devices behind it.
Tested by: royger
MFC after: 2 weeks
When handling a GPE ACPI interrupt object the EcSpaceHandler()
function can be called which checks the EC_EVENT_SCI bit and then
recurse on the EcGpeQueryHandler() function. If there are multiple GPE
events pending the EC_EVENT_SCI bit will be set at the next call to
EcSpaceHandler() causing it to recurse again via the
EcGpeQueryHandler() function. This leads to a slow never ending
recursion during boot which prevents proper system startup, because
the EC_EVENT_SCI bit never gets cleared in this scenario.
The behaviour is reproducible with the ALASKA AMI in combination with
a newer Skylake based mainboard in the following way:
Enter BIOS and adjust the clock one hour forward. Save and exit the
BIOS. System fails to boot due to the above mentioned bug in
EcGpeQueryHandler() which was observed recursing multiple times.
This patch adds a simple recursion guard to the EcGpeQueryHandler()
function and also also adds logic to detect if new GPE events occurred
during the execution of EcGpeQueryHandler() and then loop on this
function instead of recursing.
Reviewed by: jhb
MFC after: 2 weeks
Right now, userspace (fast) gettimeofday(2) on x86 only works for
RDTSC. For older machines, like Core2, where RDTSC is not C2/C3
invariant, and which fall to HPET hardware, this means that the call
has both the penalty of the syscall and of the uncached hw behind the
QPI or PCIe connection to the sought bridge. Nothing can me done
against the access latency, but the syscall overhead can be removed.
System already provides mappable /dev/hpetX devices, which gives
straight access to the HPET registers page.
Add yet another algorithm to the x86 'vdso' timehands. Libc is updated
to handle both RDTSC and HPET. For HPET, the index of the hpet device
to mmap is passed from kernel to userspace, index might be changed and
libc invalidates its mapping as needed.
Remove cpu_fill_vdso_timehands() KPI, instead require that
timecounters which can be used from userspace, to provide
tc_fill_vdso_timehands{,32}() methods. Merge i386 and amd64
libc/<arch>/sys/__vdso_gettc.c into one source file in the new
libc/x86/sys location. __vdso_gettc() internal interface is changed
to move timecounter algorithm detection into the MD code.
Measurements show that RDTSC even with the syscall overhead is faster
than userspace HPET access. But still, userspace HPET is three-four
times faster than syscall HPET on several Core2 and SandyBridge
machines.
Tested by: Howard Su <howard0su@gmail.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D7473
time, by, by default disallow writes to the mmaped HPET pages.
Intent is to allow userspace to use HPET as fast (i.e. no-syscall)
timecounter for gettimeofday(2). Unfortunately, the permission model
does not make it possible to safely unhide /dev/hpet in the jails even
if default mode is set to 0444, because untrusted jailed root may
change device permissions to writeable.
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
The SYSINIT runs at SI_SUB_KICK_SCHEDULER after the scheduler is fully
initialized and timers are working. This fixes booting in the
EARLY_AP_STARTUP case.
the graphics drivers can benefit from access to the lid handle for querying and getting notifications
Submitted by: kmacy
Differential Revision: https://reviews.freebsd.org/D6643
After r285994, sysctl(8) was fixed to use 273.15 instead of 273.20 as 0C
reference and as result, the temperature read in sysctl(8) now exibits a
+0.1C difference.
This commit fix the kernel references to match the reference value used in
sysctl(8) after r285994.
Sponsored by: Rubicon Communications (Netgate)
- Add a pcib_detach() function for the PCI-PCI bridge driver. It
tears down the NEW_PCIB and hotplug state including destroying
resource managers, deleting child devices, and disabling hotplug
events.
- Add a detach method to the ACPI PCI-PCI bridge driver which calls
pcib_detach() and then frees the copy of the _PRT interrupt routing
table.
- Add a detach method to the PCI-Cardbus bridge driver which frees
the PCI bus resources in addition to calling cbb_detach().
- Explicitly clear any pending hotplug events during attach to ensure
future events will generate an interrupt.
- If a the Command Completed bit is set in the slot status register
when the command completion timeout fires, treat it as if the
command completed and the completion interrupt was just lost rather
than forcing a detach.
- Don't wait for a Command Completed notification if Command Completion
interrupts are disabled. The spec explicitly says no interrupt is
enabled when clearing CCIE, and on my T400 no interrupt is generated
when CCIE is changed from cleared to set, either. In addition, the
T400 doesn't appear to set the Command Completed bit in the cases
where it doesn't generate an interrupt, so don't schedule the timer
either. (If the CC bit were always set, one could always set the timer
and rely on the logic of treating CC set as a missed interrupt.)
Reviewed by: imp (older version)
Differential Revision: https://reviews.freebsd.org/D6424
Some ACPI operations such as mutex acquires and event waits accept a
timeout. The ACPI OSD layer implements these timeouts by using regular
sleep timeouts. However, this doesn't work during early boot before
event timers are setup. Instead, use polling combined with DELAY()
to spin.
This fixes booting on upcoming Intel systems with Kaby Lake processors.
Tested by: "Jeffrey E Pieper" <jeffrey.e.pieper@intel.com>
Reviewed by: jimharris
MFC after: 1 week
Currently, Application Processors (non-boot CPUs) are started by
MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until
SI_SUB_SMP at which point they are released to run kernel threads.
SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter
the scheduler and start running threads until fairly late in the
boot.
This change moves SI_SUB_SMP up to just before software interrupt
threads are created allowing the APs to start executing kernel
threads much sooner (before any devices are probed). This allows
several initialization routines that need to perform initialization
on all CPUs to now perform that initialization in one step rather
than having to defer the AP initialization to a second SYSINIT run
at SI_SUB_SMP. It also permits all CPUs to be available for
handling interrupts before any devices are probed.
This last feature fixes a problem on with interrupt vector exhaustion.
Specifically, in the old model all device interrupts were routed
onto the boot CPU during boot. Later after the APs were released at
SI_SUB_SMP, interrupts were redistributed across all CPUs.
However, several drivers for multiqueue hardware allocate N interrupts
per CPU in the system. In a system with many CPUs, just a few drivers
doing this could exhaust the available pool of interrupt vectors on
the boot CPU as each driver was allocating N * mp_ncpu vectors on the
boot CPU. Now, drivers will allocate interrupts on their desired CPUs
during boot meaning that only N interrupts are allocated from the boot
CPU instead of N * mp_ncpu.
Some other bits of code can also be simplified as smp_started is
now true much earlier and will now always be true for these bits of
code. This removes the need to treat the single-CPU boot environment
as a special case.
As a transition aid, the new behavior is available under a new kernel
option (EARLY_AP_STARTUP). This will allow the option to be turned off
if need be during initial testing. I plan to enable this on x86 by
default in a followup commit in the next few days and to have all
platforms moved over before 11.0. Once the transition is complete,
the option will be removed along with the !EARLY_AP_STARTUP code.
These changes have only been tested on x86. Other platform maintainers
are encouraged to port their architectures over as well. The main
things to check for are any uses of smp_started in MD code that can be
simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in
the EARLY_AP_STARTUP case (e.g. the interrupt shuffling).
PR: kern/199321
Reviewed by: markj, gnn, kib
Sponsored by: Netflix
bus_get_cpus() returns a specified set of CPUs for a device. It accepts
an enum for the second parameter that indicates the type of cpuset to
request. Currently two valus are supported:
- LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
the device when DEVICE_NUMA is enabled)
- INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus'
by default. The idea is that INTR_CPUS should always return a valid set.
Device drivers which want to use per-CPU interrupts should start using
INTR_CPUS instead of simply assigning interrupts to all available CPUs.
In the future we may wish to add tunables to control the policy of
INTR_CPUS (e.g. should it be local-only or global, should it ignore
SMT threads or not).
The x86 nexus driver exposes the internal set of interrupt CPUs from the
the x86 interrupt code via INTR_CPUS.
The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled. They also and
the global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.
Compared to the r298933, this version uses 'struct _cpuset' in
<sys/bus.h> instead of 'cpuset_t' to avoid requiring <sys/param.h>
(<sys/_cpuset.h> still requires <sys/param.h> for MAXCPU even though
<sys/_bitset.h> does not after recent changes).
PCI-express HotPlug support is implemented via bits in the slot
registers of the PCI-express capability of the downstream port along
with an interrupt that triggers when bits in the slot status register
change.
This is implemented for FreeBSD by adding HotPlug support to the
PCI-PCI bridge driver which attaches to the virtual PCI-PCI bridges
representing downstream ports on HotPlug slots. The PCI-PCI bridge
driver registers an interrupt handler to receive HotPlug events. It
also uses the slot registers to determine the current HotPlug state
and drive an internal HotPlug state machine. For simplicty of
implementation, the PCI-PCI bridge device detaches and deletes the
child PCI device when a card is removed from a slot and creates and
attaches a PCI child device when a card is inserted into the slot.
The PCI-PCI bridge driver provides a bus_child_present which claims
that child devices are present on HotPlug-capable slots only when a
card is inserted. Rather than requiring a timeout in the RC for
config accesses to not-present children, the pcib_read/write_config
methods fail all requests when a card is not present (or not yet
ready).
These changes include support for various optional HotPlug
capabilities such as a power controller, mechanical latch,
electro-mechanical interlock, indicators, and an attention button.
It also includes support for devices which require waiting for
command completion events before initiating a subsequent HotPlug
command. However, it has only been tested on ExpressCard systems
which support surprise removal and have none of these optional
capabilities.
PCI-express HotPlug support is conditional on the PCI_HP option
which is enabled by default on arm64, x86, and powerpc.
Reviewed by: adrian, imp, vangyzen (older versions)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D6136
bus_get_cpus() returns a specified set of CPUs for a device. It accepts
an enum for the second parameter that indicates the type of cpuset to
request. Currently two valus are supported:
- LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
the device when DEVICE_NUMA is enabled)
- INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus'
by default. The idea is that INTR_CPUS should always return a valid set.
Device drivers which want to use per-CPU interrupts should start using
INTR_CPUS instead of simply assigning interrupts to all available CPUs.
In the future we may wish to add tunables to control the policy of
INTR_CPUS (e.g. should it be local-only or global, should it ignore
SMT threads or not).
The x86 nexus driver exposes the internal set of interrupt CPUs from the
the x86 interrupt code via INTR_CPUS.
The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled. They also and
the global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.
Reviewed by: wblock (manpage)
Differential Revision: https://reviews.freebsd.org/D5519
Arguably we should only be doing the probe/attach to children of
these devices as well.
Tested by: Michal Stanek <mst_semihalf.com> (arm64)
Differential Revision: https://reviews.freebsd.org/D6133
This allows the PCI-PCI bridge driver to save a reference to the child
device in its softc.
Note that this required moving the "pci" device creation out of
acpi_pcib_attach(). Instead, acpi_pcib_attach() is renamed to
acpi_pcib_fetch_prt() as it's sole action now is to fetch the PCI
interrupt routing table.
Differential Revision: https://reviews.freebsd.org/D6021
Both of the callers were expecting the input cap_set to be modified.
This fixes them to request cap_set to be updated with the returned buffer.
Reviewed by: jkim
Differential Revision: https://reviews.freebsd.org/D6040
Eventually with earlier AP startup this code will change to call the
startup function synchronously instead of queueing the task. Moving
the time we queue the task should be a no-op since taskqueue threads
don't start executing tasks until much later, but this reduces the diff
with the earlier AP startup patches.
Sponsored by: Netflix
Tell the firmware that we support PCI-express config space access
and MSI.
Reviewed by: jkim
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6023
This wrapper does not translate errors in the first word to ACPI
error status returns. Use this wrapper in the acpi_cpu(4) driver in
place of the existing _OSC code. While here, fix a bug where the wrong
count of words was passed when invoking _OSC.
Reviewed by: jkim
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6022
The ACPI and OFW PCI bus drivers as well as CardBus override this to
allocate the larger ivars to hold additional info beyond the stock PCI ivars.
This removes the need to pass the size to functions like pci_add_iov_child()
and pci_read_device() simplifying IOV and bus rescanning implementations.
As a result of this and earlier changes, the ACPI PCI bus driver no longer
needs its own device_attach and pci_create_iov_child methods but can use
the methods in the stock PCI bus driver instead.
Differential Revision: https://reviews.freebsd.org/D5891
VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in
the virtual memory system. DEVICE_NUMA is used to enable affinity
reporting for devices such as bus_get_domain().
MAXMEMDOM must still be set to a value greater than for any NUMA support
to be effective. Note that 'cpuset -gd' always works if MAXMEMDOM is
enabled and the system supports NUMA.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D5782
Previously, the ACPI PCI bus driver did a single pass over the devices in
the namespace that were a child of a given PCI bus to associate the
PCI bus-enumerated device_t devices with the corresponding ACPI handles.
However, this meant that handles were only established at runtime for devices
found during the initial PCI bus scan.
PCI_IOV adds devices that show up after the initial PCI bus scan, and coming
changes to add a bus rescan can also add devices after the initial scan.
This change adds a pci_child_added() callback to the ACPI PCI bus that walks
the namespace to find the ACPI handle for each device that is added. Using
a callback means that the handle is correctly set for any device no matter
how it is added (initial scan, IOV, or a bus rescan).
Instead of providing a wrapper around device_delete_child() that the PCI
bus and child bus drivers must call explicitly, move the bulk of the logic
from pci_delete_child() into a bus_child_deleted() method
(pci_child_deleted()). This allows PCI devices to be safely deleted via
device_delete_child().
- Add a bus_child_deleted method to the ACPI PCI bus which clears the
device_t associated with the corresponding ACPI handle in addition to
the normal PCI bus cleanup.
- Change cardbus_detach_card to call device_delete_children() and move
CardBus-specific delete logic into a new cardbus_child_deleted() method.
- Use device_delete_child() instead of pci_delete_child() in the SRIOV code.
- Add a bus_child_deleted method to the OpenFirmware PCI bus drivers which
frees the OpenFirmware device info for each PCI device.
Reviewed by: imp
Tested on: amd64 (CardBus and PCI-e hotplug)
Differential Revision: https://reviews.freebsd.org/D5831
On some architectures, u_long isn't large enough for resource definitions.
Particularly, powerpc and arm allow 36-bit (or larger) physical addresses, but
type `long' is only 32-bit. This extends rman's resources to uintmax_t. With
this change, any resource can feasibly be placed anywhere in physical memory
(within the constraints of the driver).
Why uintmax_t and not something machine dependent, or uint64_t? Though it's
possible for uintmax_t to grow, it's highly unlikely it will become 128-bit on
32-bit architectures. 64-bit architectures should have plenty of RAM to absorb
the increase on resource sizes if and when this occurs, and the number of
resources on memory-constrained systems should be sufficiently small as to not
pose a drastic overhead. That being said, uintmax_t was chosen for source
clarity. If it's specified as uint64_t, all printf()-like calls would either
need casts to uintmax_t, or be littered with PRI*64 macros. Casts to uintmax_t
aren't horrible, but it would also bake into the API for
resource_list_print_type() either a hidden assumption that entries get cast to
uintmax_t for printing, or these calls would need the PRI*64 macros. Since
source code is meant to be read more often than written, I chose the clearest
path of simply using uintmax_t.
Tested on a PowerPC p5020-based board, which places all device resources in
0xfxxxxxxxx, and has 8GB RAM.
Regression tested on qemu-system-i386
Regression tested on qemu-system-mips (malta profile)
Tested PAE and devinfo on virtualbox (live CD)
Special thanks to bz for his testing on ARM.
Reviewed By: bz, jhb (previous)
Relnotes: Yes
Sponsored by: Alex Perez/Inertial Computing
Differential Revision: https://reviews.freebsd.org/D4544
acpi_GetInteger() execution. Intel DMAR interrupt remapping code
needs to know UID of the HPET to properly route the FSB interrupts
from the HPET, even when interrupt remapping is disabled, and the code
is executed under some non-sleepable mutexes.
Cache HPET UIDs in the device softc at the attach time and provide
lock-less method to get UID, use the method from the dmar hpet
handling code instead of calling GetInteger().
Reported and tested by: Larry Rosenman <ler@lerctr.org>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This simplifies checking for default resource range for bus_alloc_resource(),
and improves readability.
This is part of, and related to, the migration of rman_res_t from u_long to
uintmax_t.
Discussed with: jhb
Suggested by: marcel
Summary:
Migrate to using the semi-opaque type rman_res_t to specify rman resources. For
now, this is still compatible with u_long.
This is step one in migrating rman to use uintmax_t for resources instead of
u_long.
Going forward, this could feasibly be used to specify architecture-specific
definitions of resource ranges, rather than baking a specific integer type into
the API.
This change has been broken out to facilitate MFC'ing drivers back to 10 without
breaking ABI.
Reviewed By: jhb
Sponsored by: Alex Perez/Inertial Computing
Differential Revision: https://reviews.freebsd.org/D5075
to shut down; close laptop lid" scenario which otherwise tended to end
with a laptop overheating or the battery dying.
The implementation uses a new sysctl, kern.suspend_blocked; init(8) sets
this while rc.suspend runs, and the ACPI sleep code ignores requests while
the sysctl is set.
Discussed on: freebsd-acpi (35 emails)
MFC after: 1 week
When the system has more than a single PCI domain, the bus numbers
are not unique, thus they cannot be used for "pci" device numbering.
Change bus numbers to -1 (i.e. to-be-determined automatically)
wherever the code did not care about domains.
Reviewed by: jhb
Obtained from: Semihalf
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3406
drivers, one for fdt, one for acpi. It then uses this to decide if it will
use fdt or acpi.
The GICv2 (interrupt controller) and Generic Timer drivers have been
updated to handle both cases.
As this is early code we still need FDT to find the kernel console, and
some parts are still missing, including PCI support.
Differential Revision: https://reviews.freebsd.org/D2463
Reviewed by: jhb, jkim, emaste
Obtained from: ABT Systems Ltd
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
years for head. However, it is continuously misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.
Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks
bridges only supported Intel Pentium and Pentium II era processors and there
is no reason for hardware virtualizations to emulate these quirks.
MFC after: 1 week
interacts with interrupts, query ACPI and use MWAIT for entrance into
Cx sleep states. Support C1 "I/O then halt" mode. See Intel'
document 302223-007 "Intelб╝ Processor Vendor-Specific ACPI Interface
Specification" for description.
Move the acpi_cpu_c1() function into x86/cpu_machdep.c and use
it instead of inlining "sti; hlt" sequence in several places.
In the acpi(4) man page, besides documenting the dev.cpu.N.cx_methods
sysctl, correct the names for dev.cpu.N.{cx_usage,cx_lowest,cx_supported}
sysctls.
Both jkim and avg have some other patches implementing the mwait
functionality; this work is unrelated. Linux does not rely on the
ACPI to provide correct tables describing Cx modes. Instead, the
driver has pre-defined knowledge of the CPU models, it was supplied by
Intel.
Tested by: pho (previous versions)
Sponsored by: The FreeBSD Foundation
its use in upcoming code.
This is inspired by something in jhb's NUMA IRQ allocation patchset.
However, the tricky bit here is that the PXM lookup for a node may
fail, requiring a lookup on the parent node. So if it doesn't
exist, don't fail - just go up to the parent. Only error out of the
lookup is the ACPI lookup returns an error.
Sponsored by: Norse Corp, Inc.
"Intelб╝ Processor Vendor-Specific ACPI Interface Specification",
issied Dec 2014. Previous revision 005 was from Sep 2006.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
under bootverbose. Every example I've seen to date has been due to
an ACPI system resource device reserving a range that overlaps with
system memory (which ram0 attempts to reserve) or a local or I/O APIC
(which apic0 attempts to reserve). These are always harmless but look
scary to users.
MFC after: 1 week
Implement the interace to create SR-IOV Virtual Functions (VFs).
When a driver registers that they support SR-IOV by calling
pci_setup_iov(), the SR-IOV code creates a new node in /dev/iov
for that device. An ioctl can be invoked on that device to
create VFs and have the driver initialize them.
At this point, allocating memory I/O windows (BARs) is not
supported.
Differential Revision: https://reviews.freebsd.org/D76
Reviewed by: jhb
MFC after: 1 month
Sponsored by: Sandvine Inc.
identifies the tested condition for _PRT as "BYTE value of 0", so the
remaining part of the conditionals is sufficient.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
allows the user to request administrative changes to individual devices
such as attach or detaching drivers or disabling and re-enabling devices.
- Add a new /dev/devctl2 character device which uses ioctls for device
requests. The ioctls use a common 'struct devreq' which is somewhat
similar to 'struct ifreq'.
- The ioctls identify the device to operate on via a string. This
string can either by the device's name, or it can be a bus-specific
address. (For unattached devices, a bus address is the only way to
locate a device.) Bus drivers register an eventhandler to claim
unrecognized device names that the driver recognizes as a valid address.
Two buses currently support addresses: ACPI recognizes any device
in the ACPI namespace via its full path starting with "\" and
the PCI bus driver recognizes an address specification of
'pci[<domain>:]<bus>:<slot>:<func>' (identical to the PCI selector
strings supported by pciconf).
- To make it easier to cut and paste, change the PnP location string
in the PCI bus driver to output a full PCI selector string rather
than 'slot=<slot> function=<func>'.
- Add a devctl(3) interface in libdevctl which provides a wrapper around
the ioctls and is the preferred interface for other userland code.
- Add a devctl(8) program which is a simple wrapper around the requests
supported by devctl(3).
- Add a device_is_suspended() function to check DF_SUSPENDED.
- Add a resource_unset_value() function that can be used to remove a
hint from the kernel environment. This is used to clear a
hint.<driver>.<unit>.disabled hint when re-enabling a boot-time
disabled device.
Reviewed by: imp (parts)
Requested by: imp (changing PCI location string)
Relnotes: yes
Also, split power_suspend into power_suspend and power_suspend_early.
power_suspend_early is called before the userland is frozen.
power_suspend is called after the userland is frozen.
Currently only VT switching is hooked to power_suspend_early.
This is needed because switching away from X server requires its
cooperation, so obviously X server must not be frozen when that happens.
Freezing userland during ACPI suspend is useful because not all drivers
correctly handle suspension concurrent with other activity. This is
especially applicable to drivers ported from other operating systems
that suspend all software activity between placing drivers and hardware
into suspended state.
In particular drm2/radeon (radeonkms) depends on the described
procedure. The driver does not have any internal synchronization
between suspension activities and processing of userland requests.
Many thanks to kib for the code that allows to freeze and thaw all
userland threads.
Note that ideally we also need to park / inhibit (non-special) kernel
threads as well to ensure that they do not call into drivers.
MFC after: 17 days
accidentally enable non-existent states.
This bug was triggered if ACPI advertises the presence of a C2 state
which we fail to parse via acpi_PkgGas due to our lack of support for
FFixedHW resources, and causes an immediate panic when an attempt is
made to enter the (NULL) state.
One affected platform is the EC2 c4.8xlarge VM instance type; there
may be others.
MFC after: 1 week
Thanks to: jkim, @_msw_
may also halt in C2 and not just C3 (it seems that in some cases the BIOS
advertises its C3 state as a C2 state in _CST). Just play it safe and
disable both C2 and C3 states if a user forces the use of the TSC as the
timecounter on such CPUs.
PR: 192316
Differential Revision: https://reviews.freebsd.org/D1441
No objection from: jkim
MFC after: 1 week
state said device should go into.
This was a snafu introduced in the ACPI/PCI awareness separation.
When putting a device into a power state, the bus (and thus firmware,
eg ACPI) should be asked before hand to check whether the device
can indeed go into that power state.
There's a set of nodes in ACPI under each device - the _SxD nodes - which
state which ACPI power state to put the device into when the system is
going into power save state 'x'. So when going into S3, the existence
of an _S3D node would override whatever the system was trying to do.
By default the PCI code wants to put devices into D3 before suspending.
I have a laptop here (Asus Zenbook - check the PR) whose EHCI controller
really wants to be in D2 during suspend, not D3. So if we put it into
D3 and then try to enter S3, everything hangs. The device itself
can go into D3 - it just can't be there when the call to ACPI to enter
S3 occurs. The PCI patch fixes this.
jkim@ noticed that the same is needed for the ACPI child device
enumeration.
Thankyou to Matt Dillon (the programmer, not the actor) for buying me
this particular laptop so I could debug the issues with the Atheros
AR9485 that is in it. It's his fault that I ended up with this
laptop and was sufficiently annoyed by the lack of USB suspend
to go down this rabbit hole.
Tested:
* Thinkpad T400
* Thinkpad X230
* Thinkpad T42
* Thinkpad T60
* Asus Zenbook (see PR)
* Asus EEEPC 701
* Asus EEEPC 1001PX
TODO:
* Figure out what we should do about devices we unload drivers for
that want to be in a specific state when entering S3 / S4 -
the "put devices into D3 if they're not bound to a driver" option
may also mess with things.
PR: kern/194884
Reviewed by: jhb, jkim
MFC after: 1 week
Relnotes: yes
Sponsored by: Matt Dillon <dillon@apollo.backplane.com> (hardware)
directly accessed. Although this will work on some platforms, it can
throw an exception if the pointer is invalid and then panic the kernel.
Add a missing SYSCTL_IN() of "SCTP_BASE_STATS" structure.
MFC after: 3 days
Sponsored by: Mellanox Technologies
It had two bugs: one where mmap was still allowed and another where
D_TRACKCLOSE doesn't handle all cases.
Thanks to jhb and kib for pointing them out.
MFC after: 1 week
In some cases, TSC is broken and special applications might benefit
from memory mapping HPET and reading the registers to count time.
Most often the main HPET counter is 32-bit only[1], so this only gives
the application a 300 second window based on the default HPET
interval.
Other applications, such as Intel's DPDK, expect /dev/hpet to be
present and use it to count time as well.
Although we have an almost userland version of gettimeofday() which
uses rdtsc in userland, it's not always possible to use it, depending
on how broken the multi-socket hardware is.
Install the acpi_hpet.h so that applications can use the HPET register
definitions.
[1] I haven't found a system where HPET's main counter uses more than
32 bit. There seems to be a discrepancy in the Intel documentation
(claiming it's a 64-bit counter) and the actual implementation (a
32-bit counter in a 64-bit memory area).
MFC after: 1 week
Relnotes: yes
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.
Submitted by: kmacy
Tested by: make universe
* Add a bus_if.m method - get_domain() - returning the VM domain or
ENOENT if the device isn't in a VM domain;
* Add bus methods to print out the domain of the device if appropriate;
* Add code in srat.c to save the PXM -> VM domain mapping that's done and
expose a function to translate VM domain -> PXM;
* Add ACPI and ACPI PCI methods to check if the bus has a _PXM attribute
and if so map it to the VM domain;
* (.. yes, this works recursively.)
* Have the pci bus glue print out the device VM domain if present.
Note: this is just the plumbing to start enumerating information -
it doesn't at all modify behaviour.
Differential Revision: D906
Reviewed by: jhb
Sponsored by: Norse Corp
This patch adds support for MSI interrupts when running on Xen. Apart
from adding the Xen related code needed in order to register MSI
interrupts this patch also makes the msi_init function a hook in
init_ops, so different MSI implementations can have different
initialization functions.
Sponsored by: Citrix Systems R&D
xen/interface/physdev.h:
- Add the MAP_PIRQ_TYPE_MULTI_MSI to map multi-vector MSI to the Xen
public interface.
x86/include/init.h:
- Add a hook for setting custom msi_init methods.
amd64/amd64/machdep.c:
i386/i386/machdep.c:
- Set the default msi_init hook to point to the native MSI
initialization method.
x86/xen/pv.c:
- Set the Xen MSI init hook when running as a Xen guest.
x86/x86/local_apic.c:
- Call the msi_init hook instead of directly calling msi_init.
xen/xen_intr.h:
x86/xen/xen_intr.c:
- Introduce support for registering/releasing MSI interrupts with
Xen.
- The MSI interrupts will use the same PIC as the IO APIC interrupts.
xen/xen_msi.h:
x86/xen/xen_msi.c:
- Introduce a Xen MSI implementation.
x86/xen/xen_nexus.c:
- Overwrite the default MSI hooks in the Xen Nexus to use the Xen MSI
implementation.
x86/xen/xen_pci.c:
- Introduce a Xen specific PCI bus that inherits from the ACPI PCI
bus and overwrites the native MSI methods.
- This is needed because when running under Xen the MSI messages used
to configure MSI interrupts on PCI devices are written by Xen
itself.
dev/acpica/acpi_pci.c:
- Lower the quality of the ACPI PCI bus so the newly introduced Xen
PCI bus can take over when needed.
conf/files.i386:
conf/files.amd64:
- Add the newly created files to the build process.
resume that is a superset of a pcb. Move the FPU state out of the pcb and
into this new structure. As part of this, move the FPU resume code on
amd64 into a C function. This allows resumectx() to still operate only on
a pcb and more closely mirrors the i386 code.
Reviewed by: kib (earlier version)
Also disable a couple of ACPI devices that are not usable under Dom0.
To this end a couple of booleans are added that allow disabling ACPI
specific devices.
Sponsored by: Citrix Systems R&D
Reviewed by: jhb
x86/xen/xen_nexus.c:
- Return BUS_PROBE_SPECIFIC in the Xen Nexus attachement routine to
force the usage of the Xen Nexus.
- Attach the ACPI bus when running as Dom0.
dev/acpica/acpi_cpu.c:
dev/acpica/acpi_hpet.c:
dev/acpica/acpi_timer.c
- Add a variable that gates the addition of the devices.
x86/include/init.h:
- Declare variables that control the attachment of ACPI cpu, hpet and
timer devices.
This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation
This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h
Discussed at: BSDcan
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
PCI root bridges except for the one known-valid case on x86 where bridges
claim the I/O port registers used for PCI config space access.
Tested by: Hilko Meyer <hilko.meyer@gmx.de>
MFC after: 1 week
instead of trying to cache it.
Previously, we only trusted the state if we did not have a cached state.
However, once a state was cached, the _STA method was always ignored.
Specifically, once a power resource had been turned on once (e.g.
during resume), the driver assumed it was always on even if _STA said it
was off and never turned it back on. This prevented the power resource
from being turned back on if a laptop was resumed twice, for example.
To fix, just remove the cached state entirely and always use the results
of _STA. The loops already skip any resources where _STA fails.
Submitted by: trasz (initial patch to invoke _ON)
MFC after: 1 week
that are being done by the OS.
For now this'll match up with the "wakeups"; although I'll dig deeper into
this to see if we can determine which sleep state the CPU managed to get
into. Most things I've seen these days only expose up to C2 or C3 via
ACPI even though the CPU goes all the way down to C6 or C7.
I/O windows, the default is to preserve the firmware-assigned resources.
PCI bus numbers are only managed if NEW_PCIB is enabled and the architecture
defines a PCI_RES_BUS resource type.
- Add a helper API to create top-level PCI bus resource managers for each
PCI domain/segment. Host-PCI bridge drivers use this API to allocate
bus numbers from their associated domain.
- Change the PCI bus and CardBus drivers to allocate a bus resource for
their bus number from the parent PCI bridge device.
- Change the PCI-PCI and PCI-CardBus bridge drivers to allocate the
full range of bus numbers from secbus to subbus from their parent bridge.
The drivers also always program their primary bus register. The bridge
drivers also support growing their bus range by extending the bus resource
and updating subbus to match the larger range.
- Add support for managing PCI bus resources to the Host-PCI bridge drivers
used for amd64 and i386 (acpi_pcib, mptable_pcib, legacy_pcib, and qpi_pcib).
- Define a PCI_RES_BUS resource type for amd64 and i386.
Reviewed by: imp
MFC after: 1 month
the memory ranges that they decode for downstream devices rather than
creating ResourceProducer range resource entries. The result is that
we allocate the full range to the PCI root bridge device causing
allocations in child devices to all fail.
As a workaround, ignore any standard memory resources on a PCI root
bridge device. It is normal for a PCI root bridge to allocate an I/O
resource for the I/O ports used for PCI config access, but I have not
seen any PCI root bridges that legitimately allocate a memory resource.
Reviewed by: jkim
MFC after: 1 week
shifts into the sign bit. Instead use (1U << 31) which gets the
expected result.
This fix is not ideal as it assumes a 32 bit int, but does fix the issue
for most cases.
A similar change was made in OpenBSD.
Discussed with: -arch, rdivacky
Reviewed by: cperciva
resist easy conversion since they implement a great deal of their attach
logic inside probe(). Some of this could be fixed by moving it to attach(),
but some requires something more subtle than BUS_PROBE_NOWILDCARD.
1.3 of Intelб╝ Virtualization Technology for Directed I/O Architecture
Specification. The Extended Context and PASIDs from the rev. 2.2 are
not supported, but I am not aware of any released hardware which
implements them. Code does not use queued invalidation, see comments
for the reason, and does not provide interrupt remapping services.
Code implements the management of the guest address space per domain
and allows to establish and tear down arbitrary mappings, but not
partial unmapping. The superpages are created as needed, but not
promoted. Faults are recorded, fault records could be obtained
programmatically, and printed on the console.
Implement the busdma(9) using DMARs. This busdma backend avoids
bouncing and provides security against misbehaving hardware and driver
bad programming, preventing leaks and corruption of the memory by wild
DMA accesses.
By default, the implementation is compiled into amd64 GENERIC kernel
but disabled; to enable, set hw.dmar.enable=1 loader tunable. Code is
written to work on i386, but testing there was low priority, and
driver is not enabled in GENERIC. Even with the DMAR turned on,
individual devices could be directed to use the bounce busdma with the
hw.busdma.pci<domain>:<bus>:<device>:<function>.bounce=1 tunable. If
DMARs are capable of the pass-through translations, it is used,
otherwise, an identity-mapping page table is constructed.
The driver was tested on Xeon 5400/5500 chipset legacy machine,
Haswell desktop and E5 SandyBridge dual-socket boxes, with ahci(4),
ata(4), bce(4), ehci(4), mfi(4), uhci(4), xhci(4) devices. It also
works with em(4) and igb(4), but there some fixes are needed for
drivers, which are not committed yet. Intel GPUs do not work with
DMAR (yet).
Many thanks to John Baldwin, who explained me the newbus integration;
Peter Holm, who did all testing and helped me to discover and
understand several incredible bugs; and to Jim Harris for the access
to the EDS and BWG and for listening when I have to explain my
findings to somebody.
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Xen PVHVM guest.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Reviewed by: gibbs
Approved by: re (blanket Xen)
MFC after: 2 weeks
sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
- Make sure that are no MMU related IPIs pending on migration.
- Reset pending IPI_BITMAP on resume.
- Init vcpu_info on resume.
sys/amd64/include/intr_machdep.h:
sys/i386/include/intr_machdep.h:
sys/x86/acpica/acpi_wakeup.c:
sys/x86/x86/intr_machdep.c:
sys/x86/isa/atpic.c:
sys/x86/x86/io_apic.c:
sys/x86/x86/local_apic.c:
- Add a "suspend_cancelled" parameter to pic_resume(). For the
Xen PIC, restoration of interrupt services differs between
the aborted suspend and normal resume cases, so we must provide
this information.
sys/dev/acpica/acpi_timer.c:
sys/dev/xen/timer/timer.c:
sys/timetc.h:
- Don't swap out "suspend safe" timers across a suspend/resume
cycle. This includes the Xen PV and ACPI timers.
sys/dev/xen/control/control.c:
- Perform proper suspend/resume process for PVHVM:
- Suspend all APs before going into suspension, this allows us
to reset the vcpu_info on resume for each AP.
- Reset shared info page and callback on resume.
sys/dev/xen/timer/timer.c:
- Implement suspend/resume support for the PV timer. Since FreeBSD
doesn't perform a per-cpu resume of the timer, we need to call
smp_rendezvous in order to correctly resume the timer on each CPU.
sys/dev/xen/xenpci/xenpci.c:
- Don't reset the PCI interrupt on each suspend/resume.
sys/kern/subr_smp.c:
- When suspending a PVHVM domain make sure there are no MMU IPIs
in-flight, or we will get a lockup on resume due to the fact that
pending event channels are not carried over on migration.
- Implement a generic version of restart_cpus that can be used by
suspended and stopped cpus.
sys/x86/xen/hvm.c:
- Implement resume support for the hypercall page and shared info.
- Clear vcpu_info so it can be reset by APs when resuming from
suspension.
sys/dev/xen/xenpci/xenpci.c:
sys/x86/xen/hvm.c:
sys/x86/xen/xen_intr.c:
- Support UP kernel configurations.
sys/x86/xen/xen_intr.c:
- Properly rebind per-cpus VIRQs and IPIs on resume.
A warning is emitted again if the temperature became briefly valid
meanwhile. This avoids spamming the user when the sensor is broken.
Other values (ie. not _TMP) always raise a warning.
settings for ACPI-enumerated serial ports by forcing any IRQs that use
an ISA IRQ value with these settings to active-high instead of active-low.
This is known to occur with the BIOS on an Intel D2500CCE motherboard.
Tested by: Robert Ames <robertames@hotmail.com>, lev
Submitted by: Juergen Weiss weiss at uni-mainz.de (original patch)
we are probing a PCI-PCI bridge it is because we found one by enumerating
the devices on a PCI bus, so the bridge is definitely present. A few
BIOSes report incorrect status (_STA) for some bridges that claimed they
were not present when in fact they were.
While here, move this check earlier for Host-PCI bridges so attach fails
before doing any work that needs to be torn down.
PR: kern/91594
Tested by: Jack Vogel @ Intel
MFC after: 1 week
that uses non-ISA IRQs but use a plain IRQ resource in _CRS. However,
a non-ISA IRQ can't fit into a plain IRQ resource. If we encounter a
link like this, build the resource buffer from _PRS instead of _CRS.
- Set the correct size of the end tag in a resource buffer.
Tested by: Benjamin Lee <ben@b1c1l1.com>
MFC after: 2 weeks
Switch eventtimers(9) from using struct bintime to sbintime_t.
Even before this not a single driver really supported full dynamic range of
struct bintime even in theory, not speaking about practical inexpediency.
This change legitimates the status quo and cleans up the code.
When CPU becomes idle, cpu_idleclock() calculates time to the next timer
event in order to reprogram hw timer. Return that time in sbintime_t to
the caller and pass it to acpi_cpu_idle(), where it can be used as one
more factor (quite precise) to extimate furter sleep time and choose
optimal sleep state. This is a preparatory change for further callout
improvements will be committed in the next days.
The commmit is not targeted for MFC.
This hack is picked up from Linux, which claims that it follows
Windows behavior.
PR: amd64/174409
Tested by: Sergey V. Dyatko <sergey.dyatko@gmail.com>,
KAHO Toshikazu <kaho@elam.kais.kyoto-u.ac.jp>,
Slawa Olhovchenkov <slw@zxy.spb.ru>
MFC after: 13 days
ASUS P8Z77-V board reports _AC2, _AC3 and _AC4 setpoints as 0C. With active
cooling already automatically set to _AC2, that still caused driver to print
two useless lines about temperature above _AC3 and _AC4 every ten seconds.
Three setponts of 0C is probably a board bug, but the same spam could happen
also in correct case if system is runnign not with the lowest cooling level.
... to avoid any races or inconsistencies.
This should fix a regression introduced in r243404.
Also, remove a stale comment that has not been true for quite a while
now.
Pointyhat to: avg
Teested by: trociny, emaste, dumbbell (earlier version)
MFC after: 1 week
... instead of the ever increasing ones.
Also, do free old resources when allocating new ones when cx states
change.
Tested by: Tom Lislegaard <Tom.Lislegaard@proact.no>
Obtained from: jkim
MFC after: 1 week
This alleviates issues on newer Sandy/Ivy Bridge gear that seems to require
boatloads more ACPI resources than before.
Reviewed by: avg@
Obtained from: Yahoo! Inc.
MFC after: 2 weeks
parent adapter's _DOD list, only check the low 16 bits of both _ADR and
_DOD. The language in the ACPI spec seems to indicate that the _ADR values
should exactly match the entries in _DOD. However, I assume that the
masking added to _DOD values was added to work around some known busted
machines (the commit history doesn't indicate either way), and the ACPI
spec does require that the low 16 bits are unique for all video outputs,
so only check the low 16 bits should be fine.
This fixes recognition of video outputs that use the new standardized
device ID scheme in ACPI 3.0 that set bit 31 such as certain Dell laptops.
Tested by: Juergen Lock nox jelal kn-bremen de
MFC after: 3 days
(1022) in HPET. But according to report they still haven't fixed problem
with level-triggered interrupts.
Make workaround used for earlier chipsets apply to this new ID also.
PR: amd64/171355
MFC after: 3 days
For C1 and C2 states use cpu_ticks() to measure sleep time instead of much
slower ACPI timer. We can't do it for C3, as TSC may stop there. But it is
less important there as wake up latency is high any way.
For C1 and C2 states do not check/clear bus mastering activity status, as
it is important only for C3. As side effect it can make CPU enter C2 instead
of C3 if last BM activity was two sleeps back (unlike one before), but
that may be even good because of collecting more statistics. Premature BM
wakeup from C3, entered because of overestimation, can easily be worse then
entering C2 from both performance and power consumption points of view.
Together on dual Xeon E5645 system on sequential 512 bytes read test this
change makes cpu_idle_acpi() as fast as simplest cpu_idle_hlt() and only
few percents slower then cpu_idle_mwait(), while deeper states are still
actively used during idle periods.
To help with diagnostics, add C-state type into dev.cpu.X.cx_supported.
Sponsored by: iXsystems, Inc.
... from a user-set persistent limit on the said level.
Allow to set the user-imposed limit below current deepest available level
as the available levels may be dynamically changed by ACPI platform
in both directions.
Allow "Cmax" as an input value for cx_lowest sysctls to mean that there
is not limit and OS can use all available C-states.
Retire global cpu_cx_count as it no longer serves any meaningful
purpose.
Reviewed by: jhb, gianni, sbruno
Tested by: sbruno, Vitaly Magerya <vmagerya@gmail.com>
MFC after: 2 weeks
although by default only C1 is enabled (cx_lowest=0) and enabling deeper
states goes through acpi_cpu_set_cx_lowest which re-evaluates cpu_non_c3
MFC after: 2 weeks
cpu_non_c3 is already evaluated in acpi_cpu_cx_cst and in
acpi_cpu_set_cx_lowest.
Besides acpi_cpu_cx_list is not protected by any locking.
As a result also move setting of cpu_can_deep_sleep to more appropriate
places.
MFC after: 2 weeks
Adjust power_profile script to handle the new world order as well.
Some vendors are opting out of a C2 state and only defining C1 & C3. This
leads the acpi_cpu display to indicate that the machine supports C1 & C2
which is caused by the (mis)use of the index of the cx_state array as the
ACPI_STATE_CX value.
e.g. the code was pretending that cx_state[i] would
always convert to i by subtracting 1.
cx_state[2] == ACPI_STATE_C3
cx_state[1] == ACPI_STATE_C2
cx_state[0] == ACPI_STATE_C1
however, on certain machines this would lead to
cx_state[1] == ACPI_STATE_C3
cx_state[0] == ACPI_STATE_C1
This didn't break anything but led to a display of:
* dev.cpu.0.cx_supported: C1/1 C2/96
Instead of
* dev.cpu.0.cx_supported: C1/1 C3/96
MFC after: 2 weeks
suspend/resume procedures are minimized among them.
common:
- Add global cpuset suspended_cpus to indicate APs are suspended/resumed.
- Remove acpi_waketag and acpi_wakemap from acpivar.h (no longer used).
- Add some variables in acpi_wakecode.S in order to minimize the difference
among amd64 and i386.
- Disable load_cr3() because now CR3 is restored in resumectx().
amd64:
- Add suspend/resume related members (such as MSR) in PCB.
- Modify savectx() for above new PCB members.
- Merge acpi_switch.S into cpu_switch.S as resumectx().
i386:
- Merge(and remove) suspendctx() into savectx() in order to match with
amd64 code.
Reviewed by: attilio@, acpi@
(described in ACPICA source code).
- Move intr_disable() and intr_restore() from acpi_wakeup.c to acpi.c
and call AcpiLeaveSleepStatePrep() in interrupt disabled context.
- Add acpi_wakeup_machdep() to execute wakeup MD procedures and call
it twice in interrupt disabled/enabled context (ia64 version is
just dummy).
- Rename wakeup_cpus variable in acpi_sleep_machdep() to suspcpus in
order to be shared by acpi_sleep_machdep() and acpi_wakeup_machdep().
- Move identity mapping related code to acpi_install_wakeup_handler()
(i386 version) for preparation of x86/acpica/acpi_wakeup.c
(MFC candidate).
Reviewed by: jkim@
MFC after: 2 days
DEVICE_RESUME() should be done before AcpiLeaveSleepState() because
PCI config space evaluation can be occurred during control method
executions.
This should fix one of the hang up problems on resuming.
MFC after: 3 days
processor objects. Instead of forcing the new-bus CPU objects to use
a unit number equal to pc_cpuid, adjust acpi_pcpu_get_id() to honor the
MADT IDs by default. As with the previous change, setting
debug.acpi.cpu_unordered to 1 in the loader will revert to the old
behavior.
Tested by: jimharris
MFC after: 1 month
disabled or the system is in shutdown procedure.
This should fix the problem which kernel never response to the sleep
button press events after the message `suspend request ignored (not
ready yet)'.
MFC after: 3 days
Use MADT to match ACPI Processor objects to CPUs. MADT and DSDT/SSDTs may
list CPUs in different orders, especially for disabled logical cores. Now
we match ACPI IDs from the MADT with Processor objects, strictly order CPUs
accordingly, and ignore disabled cores. This prevents us from executing
methods for other CPUs, e. g., _PSS for disabled logical core, which may not
exist. Unfortunately, it is known that there are a few systems with buggy
BIOSes that do not have unique ACPI IDs for MADT and Processor objects. To
work around these problems, 'debug.acpi.cpu_unordered' tunable is added.
Set this to a non-zero value to restore the old behavior.
Many thanks to jhb for pointing me to the right direction and the manual
page change.
Reported by: Harris, James R (james dot r dot harris at intel dot com)
Tested by: Harris, James R (james dot r dot harris at intel dot com)
Reviewed by: jhb
MFC after: 1 month
list CPUs in different orders, especially for disabled logical cores. Now
we match ACPI IDs from the MADT with Processor objects, strictly order CPUs
accordingly, and ignore disabled cores. This prevents us from executing
methods for other CPUs, e. g., _PSS for disabled logical core, which may not
exist. Unfortunately, it is known that there are a few systems with buggy
BIOSes that do not have unique ACPI IDs for MADT and Processor objects. To
work around these problems
bridges. Rather than blindly enabling the windows on all of them, only
enable the window when an MSI interrupt is enabled for a device behind
the bridge, similar to what already happens for HT PCI-PCI bridges.
To implement this, each x86 Host-PCI bridge driver has to be able to
locate it's actual backing device on bus 0. For ACPI, use the _ADR
method to find the slot and function of the device. For the non-ACPI
case, the legacy(4) driver already scans bus 0 looking for Host-PCI
bridge devices. Now it saves the slot and function of each bridge that
it finds as ivars that the Host-PCI bridge driver can then use in its
pcib_map_msi() method.
This fixes machines where non-MSI interrupts were broken by the previous
round of HT MSI changes.
Tested by: bapt
MFC after: 1 week
Lower (ISA) IRQs are working, but allowed mask is not set correctly.
Block both by default to allow HP BL465c G6 blade system to boot.
Reported by: Attila Nagy <bra@fsn.hu>
MFC after: 1 week
The tag enforces a single restriction that all DMA transactions must not
cross a 4GB boundary. Note that while this restriction technically only
applies to PCI-express, this change applies it to all PCI devices as it
is simpler to implement that way and errs on the side of caution.
- Add a softc structure for PCI bus devices to hold the bus_dma tag and
a new pci_attach_common() routine that performs actions common to the
attach phase of all PCI bus drivers. Right now this only consists of
a bootverbose printf and the allocate of a bus_dma tag if necessary.
- Adjust all PCI bus drivers to allocate a PCI bus softc and to call
pci_attach_common() from their attach routines.
MFC after: 2 weeks
- Increase probing order for ECDT table to match HID-based probing.
- Decrease probing order for HPET table to match HID-based probing.
- Decrease probing order for CPUs and system resources.
- Fix ACPI_DEV_BASE_ORDER to reflect the reality.
decoded ranges. Pass any request for a specific range that fails because
it is not in a decoded range for an ACPI Host-PCI bridge up to the parent
to see if it can still be allocated. This is based on the assumption that
many BIOSes are inconsistent/broken and that settings programmed into BARs
or resources assigned to other built-in components are more trustworthy than
the list of decoded resource ranges in _CRS. This effectively limits the
decoded ranges to only being used for "wildcard" ranges when allocating
fresh resources for a BAR, etc. At some point I would like to only be
this permissive during an early scan of firmware-assigned resources during
boot and to be strict about all later allocations, but that isn't viable
currently.
MFC after: 2 weeks
one. Interestingly, these are actually the default for quite some time
(bus_generic_driver_added(9) since r52045 and bus_generic_print_child(9)
since r52045) but even recently added device drivers do this unnecessarily.
Discussed with: jhb, marcel
- While at it, use DEVMETHOD_END.
Discussed with: jhb
- Also while at it, use __FBSDID.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
a decoded range for an ACPI Host-PCI bridge, try to allocate it from the
ACPI system resource range. If that works, permit the resource allocation
regardless.
MFC after: 1 week
avoid lost timer interrupts. Previous optimization attempt doing it only
for intervals less then 5000 ticks (~300us) reported to be unreliable by
some people. Probably because of some heavy SMI code on their boards.
Introduce additional safety interval of 128 counter ticks (~9us) between
programmed comparator and counter values to cover different cases of
delayed write found on some chipsets.
Approved by: re (kib)
the resource covers the entire range. Some BIOSes appear to mark
endpoints as non-fixed incorrectly (non-fixed endpoints are supposed to
be used in _PRS when OSPM is allowed to allocate a certain chunk of
address space within a larger range, I don't believe it is supposed to be
used for _CRS).
Approved by: re (kib)
resource allocation on x86 platforms:
- Add a new helper API that Host-PCI bridge drivers can use to restrict
resource allocation requests to a set of address ranges for different
resource types.
- For the ACPI Host-PCI bridge driver, use Producer address range resources
in _CRS to enumerate valid address ranges for a given Host-PCI bridge.
This can be disabled by including "hostres" in the debug.acpi.disabled
tunable.
- For the MPTable Host-PCI bridge driver, use entries in the extended
MPTable to determine the valid address ranges for a given Host-PCI
bridge. This required adding code to parse extended table entries.
Similar to the new PCI-PCI bridge driver, these changes are only enabled
if the NEW_PCIB kernel option is enabled (which is enabled by default on
amd64 and i386).
Approved by: re (kib)
processors unless the invariant TSC bit of CPUID is set. Intel processors
may stop incrementing TSC when DPSLP# pin is asserted, according to Intel
processor manuals, i. e., TSC timecounter is useless if the processor can
enter deep sleep state (C3/C4). This problem was accidentally uncovered by
r222869, which increased timecounter quality of P-state invariant TSC, e.g.,
for Core2 Duo T5870 (Family 6, Model f) and Atom N270 (Family 6, Model 1c).
Reported by: Fabian Keil (freebsd-listen at fabiankeil dot de)
Ian FREISLICH (ianf at clue dot co dot za)
Tested by: Fabian Keil (freebsd-listen at fabiankeil dot de)
- Core2 Duo T5870 (C3 state available/enabled)
jkim - Xeon X5150 (C3 state unavailable)
resource allocation from an x86 Host-PCI bridge driver so that it can be
reused by the ACPI Host-PCI bridge driver (and eventually the MPTable
Host-PCI bridge driver) instead of duplicating the same logic. Note that
this means that hw.acpi.host_mem_start is now replaced with the
hw.pci.host_mem_start tunable that was already used in the non-ACPI case.
This also removes hw.acpi.host_mem_start on ia64 where it was not
applicable (the implementation was very x86-specific).
While here, adjust the logic to apply the new start address on any
"wildcard" allocation even if that allocation comes from a subset of
the allowable address range.
Reviewed by: imp (1)
ACPI Device() objects that do not have any device IDs available via the
_HID or _CID methods. Without a device ID a device driver cannot attach
to the device anyway. Namespace objects that are devices but not of
type ACPI_TYPE_DEVICE are not affected.
A few BIOSes have also attached a _CRS method to a PCI device to
allocate resources that are not managed via a BAR. With the previous
code those resources are allocated from acpi0 directly which can interfere
with the new PCI-PCI bridge driver (since the PCI device in question may
be behind a bridge and its resources should be allocated from that
bridge's windows instead). The resources were also orphaned and
and would end up associated with some other random device whose device_t
reused the pointer of the original ACPI-enumerated device (after it was
free'd by the ACPI PCI bus driver) in devinfo output which was confusing.
If we want to handle _CRS on PCI devices we can adjust the ACPI PCI bus
driver to do that in the future and associate the resources with the
proper device object respecting PCI-PCI bridges, etc.
Note that with this change the ACPI PCI bus driver no longer has to
delete ACPI-enumerated device_t devices that mirror PCI devices since
they should in general not exist. There are rare cases when a BIOS
will give a PCI device a _HID (e.g. I've seen a PCI-ISA bridge given
a _HID for a system resource device). In that case we leave both the
ACPI and PCI-enumerated device_t objects around just as in the previous
code.
quality to 950. HPET on modern platforms usually have better resolution and
lower latency than ACPI timer. Effectively this changes default timecounter
hardware from ACPI-fast to HPET by default when both are available.
Discussed with: avg
driver would verify that requests for child devices were confined to any
existing I/O windows, but the driver relied on the firmware to initialize
the windows and would never grow the windows for new requests. Now the
driver actively manages the I/O windows.
This is implemented by allocating a bus resource for each I/O window from
the parent PCI bus and suballocating that resource to child devices. The
suballocations are managed by creating an rman for each I/O window. The
suballocated resources are mapped by passing the bus_activate_resource()
call up to the parent PCI bus. Windows are grown when needed by using
bus_adjust_resource() to adjust the resource allocated from the parent PCI
bus. If the adjust request succeeds, the window is adjusted and the
suballocation request for the child device is retried.
When growing a window, the rman_first_free_region() and
rman_last_free_region() routines are used to determine if the front or
end of the existing I/O window is free. From using that, the smallest
ranges that need to be added to either the front or back of the window
are computed. The driver will first try to grow the window in whichever
direction requires the smallest growth first followed by the other
direction if that fails.
Subtractive bridges will first attempt to satisfy requests for child
resources from I/O windows (including attempts to grow the windows). If
that fails, the request is passed up to the parent PCI bus directly
however.
The PCI-PCI bridge driver will try to use firmware-assigned ranges for
child BARs first and only allocate a "fresh" range if that specific range
cannot be accommodated in the I/O window. This allows systems where the
firmware assigns resources during boot but later wipes the I/O windows
(some ACPI BIOSen are known to do this) to "rediscover" the original I/O
window ranges.
The ACPI Host-PCI bridge driver has been adjusted to correctly honor
hw.acpi.host_mem_start and the I/O port equivalent when a PCI-PCI bridge
makes a wildcard request for an I/O window range.
The new PCI-PCI bridge driver is only enabled if the NEW_PCIB kernel option
is enabled. This is a transition aide to allow platforms that do not
yet support bus_activate_resource() and bus_adjust_resource() in their
Host-PCI bridge drivers (and possibly other drivers as needed) to use the
old driver for now. Once all platforms support the new driver, the
kernel option and old driver will be removed.
PR: kern/143874 kern/149306
Tested by: mav
safer for i386 because it can be easily over 4 GHz now. More worse, it can
be easily changed by user with 'machdep.tsc_freq' tunable (directly) or
cpufreq(4) (indirectly). Note it is intentionally not used in performance
critical paths to avoid performance regression (but we should, in theory).
Alternatively, we may add "virtual TSC" with lower frequency if maximum
frequency overflows 32 bits (and ignore possible incoherency as we do now).