   The PCI Express Advanced Error Reporting Driver Guide HOWTO
                T. Long Nguyen  <[email protected]>
                Yanmin Zhang    <[email protected]>
                                07/29/2006


1. Overview

1.1 About this guide

This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
the PCI Express AER driver.

1.2 Copyright (C) Intel Corporation 2006.

1.3 What is the PCI Express AER Driver?

PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCI Express components and provides a minimum defined
set of error reporting requirements. The Advanced Error Reporting
capability is implemented with a PCI Express advanced error reporting
extended capability structure and provides more robust error reporting.

The PCI Express AER driver provides the infrastructure to support the
PCI Express Advanced Error Reporting capability. The PCI Express AER
driver provides three basic functions:

-       Gathers comprehensive error information when errors occur.
-       Reports errors to the users.
-       Performs error recovery actions.

The AER driver only attaches to root ports that support the PCI Express
AER capability.


2. User Guide

2.1 Include the PCI Express AER Root Driver into the Linux Kernel

The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. To use it, the driver has to be
compiled into the kernel. Option CONFIG_PCIEAER supports this
capability. It depends on CONFIG_PCIEPORTBUS, so please set
CONFIG_PCIEPORTBUS=y and CONFIG_PCIEAER=y.

2.2 Load PCI Express AER Root Driver
Some systems have AER support in BIOS.
Enabling the AER Root driver while the BIOS also has AER support may
result in unpredictable behavior. To avoid this conflict, a successful
load of the AER Root driver requires ACPI _OSC support in the BIOS so
that the AER Root driver can request native control of AER. See the
PCI FW 3.0 Specification for details regarding _OSC usage. Currently,
many firmwares do not provide _OSC support even though they use PCI
Express. To support such firmwares, forceload, a parameter of type
bool, allows AER initialization to continue although the firmware has
no _OSC support. To enable this workaround, please add
aerdriver.forceload=y to the kernel boot parameter line when booting
the kernel. Note that forceload=n by default.

nosourceid, another parameter of type bool, can be used when broken
hardware (mostly chipsets) has root ports that cannot obtain the
reporting source ID. nosourceid=n by default.

2.3 AER error output
When a PCI-E AER error is captured, an error message is printed to the
console. A correctable error is printed as a warning; any other error
is printed as an error. Users can therefore choose a log level that
filters out correctable error messages.

Below is an example:
0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
0000:50:00.0:   device [8086:0329] error status/mask=00100000/00000000
0000:50:00.0:    [20] Unsupported Request    (First)
0000:50:00.0:   TLP Header: 04000001 00200a03 05010000 00050100

In the example, 'Requester ID' means the ID of the device that sent
the error message to the root port. Please refer to the PCI Express
specifications for the other fields.


3. Developer Guide

AER-aware support requires a software driver to configure the AER
capability structure within its device and to provide callbacks.

To support AER well, developers first need to understand how AER
works.

PCI Express errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact the functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When AER is enabled, a PCI Express device automatically sends an
error message to the PCIe root port above it when the device captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its PCI Express
capability structure. Logging the error includes storing the error
reporting agent's requester ID into the Error Source Identification
Registers and setting the error bits of the Root Error Status Register
accordingly. If AER error reporting is enabled in the Root Error
Command Register, the Root Port generates an interrupt when an error
is detected.

Note that the errors as described above are related to the PCI Express
hierarchy and links.
These errors do not include any device specific errors because device
specific errors will still get sent directly to the device driver.

3.1 Configure the AER capability structure

AER-aware drivers of PCI Express components need to change the device
control registers to enable AER. They may also change AER registers,
including the mask and severity registers. The helper function
pci_enable_pcie_error_reporting can be used to enable AER. See
section 3.3.

3.2 Provide callbacks

3.2.1 callback reset_link to reset pci express link

This callback is used to reset the PCI Express physical link when a
fatal error happens. The root port AER service driver provides a
default reset_link function, but different upstream ports might
have different specifications for resetting the PCI Express link, so
all upstream ports should provide their own reset_link functions.

In struct pcie_port_service_driver, a new pointer, reset_link, is
added.

pci_ers_result_t (*reset_link) (struct pci_dev *dev);

Section 3.2.2.2 provides more detailed information on when reset_link
is called.

3.2.2 PCI error-recovery callbacks

The PCI Express AER Root driver uses error callbacks to coordinate
with the downstream device drivers associated with the hierarchy in
question when performing error recovery actions.

The data structure pci_driver has a pointer, err_handler, that points
to a pci_error_handlers structure, which consists of several callback
function pointers. The AER driver follows the rules defined in
pci-error-recovery.txt except for the PCI Express specific parts (e.g.
reset_link). Please refer to pci-error-recovery.txt for detailed
definitions of the callbacks.

The sections below specify when the error callback functions are
called.

3.2.2.1 Correctable errors

Correctable errors have no impact on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data.
These errors do not require any recovery actions. The AER driver
clears the device's correctable error status register accordingly and
logs these errors.

3.2.2.2 Non-correctable (non-fatal and fatal) errors

If an error message indicates a non-fatal error, performing a link
reset at the upstream port is not required. The AER driver calls
error_detected(dev, pci_channel_io_normal) for all drivers associated
with the hierarchy in question. For example:
EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
If Upstream port A captures an AER error, the hierarchy consists of
Downstream port B and EndPoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover; if it can, the AER driver calls mmio_enabled
next.

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. Then, performing a link reset at the
upstream port is necessary. As different kinds of devices might use
different approaches to reset the link, the AER port service driver is
required to provide the function that resets the link. First, the
kernel checks whether the upstream component has an AER driver. If it
has, the kernel uses the reset_link callback of that driver. If the
upstream component has no AER driver and the port is a downstream
port, the kernel performs a hot reset by default, by setting the
Secondary Bus Reset bit of the Bridge Control register associated with
the downstream port. As for upstream ports, they should provide their
own AER service drivers with a reset_link function.
If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and reset_link
returns PCI_ERS_RESULT_RECOVERED, the error handling goes to
mmio_enabled.

3.3 helper functions

3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
pci_enable_pcie_error_reporting enables the device to send error
messages to the root port when an error is detected. Note that devices
do not enable error reporting by default, so device drivers need to
call this function to enable it.

3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
pci_disable_pcie_error_reporting stops the device from sending error
messages to the root port when an error is detected.

3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
pci_cleanup_aer_uncorrect_error_status clears the uncorrectable
error status register.

3.4 Frequently Asked Questions

Q: What happens if a PCI Express device driver does not provide an
error recovery handler (pci_driver->err_handler is equal to NULL)?

A: The devices attached to the driver won't be recovered. If the
error is fatal, the kernel will print warning messages. Please refer
to section 3 for more information.

Q: What happens if an upstream port service driver does not provide
the reset_link callback?

A: Fatal error recovery will fail if the errors are reported by the
upstream ports that the service driver is attached to.

Q: How does this infrastructure deal with a driver that is not PCI
Express aware?

A: This infrastructure calls the error callback functions of the
driver when an error happens. But if the driver is not aware of
PCI Express, the device might not report its own errors to the root
port.

Q: What modifications will that driver need to make it compatible
with the PCI Express AER Root driver?

A: It can call the helper functions to enable AER in devices and to
clear the uncorrectable error status register. Please refer to
section 3.3.


4. Software error injection

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software based error
injection can be used to fake various kinds of PCIe errors.

First, enable PCIe AER software error injection in the kernel
configuration, that is, the following item should be in your .config.

CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device
file named /dev/aer_inject should be created.

Then, you need a user space tool named aer-inject, which can be
obtained from:
    http://www.kernel.org/pub/linux/utils/pci/aer-inject/

More information about aer-inject can be found in the document that
comes with its source code.