PCIe error logging and handling on a typical SoC
Umesh Pratap Singh, Truechip Solutions Pvt. Ltd.

Introduction:

PCI Express (PCIe, Peripheral Component Interconnect Express) has become the backbone of today’s
high-speed systems. PCIe is a third-generation, high-performance I/O bus used to interconnect
peripheral devices in applications such as computing and communication platforms. It connects
motherboard peripherals such as graphics cards and Ethernet cards to the CPU and main memory.

Because PCIe is so widely used, understanding PCIe error handling on an SoC (system on chip) has
become crucial. PCIe provides a rich set of mechanisms for error logging and handling, where error
handling may involve only hardware, device-specific software, or system software. This paper
describes the errors associated with the PCIe interface and the errors that occur while transactions
are delivered between transmitter and receiver. It covers the errors associated with each PCIe
layer, advanced error reporting (AER), advisory errors, and recommendations for handling multiple
errors.

The paper first details PCIe errors, then error logging, and finally error handling on a typical SoC.

An Itinerary to PCIe errors and handling mechanisms:

PCIe errors corresponding to each layer:

PCIe is a packet-based serial bus that provides a high-speed, high-performance, point-to-point,
dual-simplex, differential-signaling link for interconnecting devices. PCIe has a three-layer
architecture for communication between two devices. The errors found at each layer are detailed below.

Transaction layer errors:

This is the upper layer, where the packet is formed. Transaction layer (TL) checks are done end to
end, i.e. only by the requester and the completer. Switches and bridges do not check for the errors
below.

• ECRC check failure (optional check based on end-to-end CRC and AER)
• Malformed TLP (error in packet format)
• Completion Time-outs during split transactions
• Flow Control Protocol errors (optional)
• Unsupported Requests
• Data Corruption (reported as a poisoned packet)
• Completer Abort (optional)
• Unexpected Completion (completion does not match any Request pending completion)
• Receiver Overflow (optional check)

Data Link Layer Errors:

This is the middle layer, which is responsible for packet error detection and response handling. The
errors below are checked at the Data Link (DL) layer of the requester, the switch and the completer.

• LCRC check failure for TLPs
• Sequence Number check for TLPs
• CRC check failure for DLLPs
• Replay Time-out
• Replay Number Rollover
• Data Link Layer Protocol errors

Physical Layer Errors:

This is the lowest of the three layers. It is responsible for link training and for transmitting and
receiving packets at the interface level. These errors are checked at the requester, the switch and
the completer.

• Receiver errors
• Link errors

PCIe error classification:

Based on severity, PCIe errors are categorized as below:

• Correctable errors: handled by hardware
• Uncorrectable errors: classified as fatal and non-fatal errors
o Uncorrectable non-fatal errors: handled by device-specific software
o Uncorrectable fatal errors: handled by system software

Correctable errors are errors that may have an impact on performance (such as latency or
bandwidth), but no data/information is lost and the PCIe fabric remains reliable. Such errors are
corrected by hardware and no software intervention is required.

Examples: Bad TLP (bad LCRC or incorrect sequence number), Bad DLLP, Replay Timer time-out,
Receiver error (for example, a framing error).

Uncorrectable non-fatal errors are errors that do not affect the integrity of the PCI Express
fabric, but data/information is lost. Non-fatal errors are corrupted transactions that cannot be
corrected by PCIe hardware. However, the PCI Express fabric continues to function correctly and
other transactions are unaffected; only the particular transaction is affected. Recovery from a
non-fatal error may or may not be possible, depending on the device-specific software associated
with the requester that initiated the transaction.

Examples: Poisoned TLP received, Unsupported Request (UR), Completion Timeout (CTO), Completer
Abort (CA), and Unexpected Completion.

Uncorrectable fatal errors are errors that affect the integrity of the PCI Express fabric, i.e. the
PCIe link is no longer reliable and data/information is lost. Recovery from fatal errors is done by
resetting the component and the link.

Examples: Malformed TLP Error, Link Training Error, DLL Protocol Error, Receiver Overflow, Flow
Control Protocol Error.

This classification gives the related hardware or software a way to recover from an error, where
possible, without resetting the components on the link and disturbing other transactions in progress.

Table 1: PCIe error classification

Type of error               Error example                  PCIe layer at which error is found
Correctable                 Receiver Error                 Physical
Correctable                 Bad TLP                        Data Link
Correctable                 Bad DLLP                       Data Link
Correctable                 Replay Time-out                Data Link
Correctable                 Replay Number Rollover         Data Link
Uncorrectable - Non Fatal   Poisoned TLP Received          Transaction
Uncorrectable - Non Fatal   ECRC Check Failed              Transaction
Uncorrectable - Non Fatal   Unsupported Request            Transaction
Uncorrectable - Non Fatal   Completion Time-out            Transaction
Uncorrectable - Non Fatal   Completer Abort                Transaction
Uncorrectable - Non Fatal   Unexpected Completion          Transaction
Uncorrectable - Fatal       Training Error                 Physical
Uncorrectable - Fatal       DLL Protocol Error             Data Link
Uncorrectable - Fatal       Receiver Overflow              Transaction
Uncorrectable - Fatal       Flow Control Protocol Error    Transaction
Uncorrectable - Fatal       Malformed TLP                  Transaction
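
The classification can be held as a simple lookup table. The sketch below is only an illustration of
Table 1 in Python: the names `Severity`, `ERROR_TABLE` and `classify` are hypothetical, the layer
names are spelled out in full, and the "handled by" strings are taken from the severity list above.

```python
from enum import Enum


class Severity(Enum):
    # The value records who is expected to handle the error.
    CORRECTABLE = "hardware"
    NONFATAL = "device-specific software"
    FATAL = "system software"


# error name -> (default severity, layer that detects it), transcribed from Table 1
ERROR_TABLE = {
    "Receiver Error": (Severity.CORRECTABLE, "Physical"),
    "Bad TLP": (Severity.CORRECTABLE, "Data Link"),
    "Bad DLLP": (Severity.CORRECTABLE, "Data Link"),
    "Replay Time-out": (Severity.CORRECTABLE, "Data Link"),
    "Replay Number Rollover": (Severity.CORRECTABLE, "Data Link"),
    "Poisoned TLP Received": (Severity.NONFATAL, "Transaction"),
    "ECRC Check Failed": (Severity.NONFATAL, "Transaction"),
    "Unsupported Request": (Severity.NONFATAL, "Transaction"),
    "Completion Time-out": (Severity.NONFATAL, "Transaction"),
    "Completer Abort": (Severity.NONFATAL, "Transaction"),
    "Unexpected Completion": (Severity.NONFATAL, "Transaction"),
    "Training Error": (Severity.FATAL, "Physical"),
    "DLL Protocol Error": (Severity.FATAL, "Data Link"),
    "Receiver Overflow": (Severity.FATAL, "Transaction"),
    "Flow Control Protocol Error": (Severity.FATAL, "Transaction"),
    "Malformed TLP": (Severity.FATAL, "Transaction"),
}


def classify(error_name):
    """Return (severity name, detecting layer, handler) for an error in Table 1."""
    severity, layer = ERROR_TABLE[error_name]
    return severity.name, layer, severity.value
```

For instance, `classify("Bad TLP")` reports a correctable Data Link error handled by hardware, while
`classify("Malformed TLP")` reports a fatal Transaction layer error handled by system software.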

Description of common PCIe errors:

• Malformed packets:

PCIe defines transaction rules at each layer. Any transaction/packet that violates these rules is
considered a malformed TLP.

Examples: the data payload exceeds the max payload size, the actual data length does not match the
data length specified in the header, or a TC-to-VC mapping violation occurs.

• Corrupted or poisoned data errors (also called error forwarding):

Data poisoning is optional and indicates that the data in a packet is corrupted. If the data is
corrupted, the “EP” bit in the packet header is set. Data poisoning is used with memory, I/O, and
configuration transactions that have a data payload, and it is done at the transaction layer of a
device.

For example, when a requester performs a memory write transaction, the data to be written, fetched
from local memory, can have a parity error. In such a case the requester sends the memory write
transaction with the “EP” bit set in the packet header.

The packet carrying corrupted data is sent to the recipient with the “EP” bit set. Whether the
recipient drops or processes the packet depends on the implementation.
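
As a minimal sketch of how software or a testbench might test and set the “EP” bit, the Python below
assumes the standard TLP header layout in which EP is bit 6 of the third header byte (bit 14 of the
first DWORD). That bit position is not stated in this paper, and the helper names `is_poisoned` and
`poison` are hypothetical.

```python
EP_BYTE_INDEX = 2   # third byte of the first header DWORD (assumed layout)
EP_BIT_MASK = 0x40  # bit 6 of that byte, i.e. bit 14 of DW0 (assumed layout)


def is_poisoned(header: bytes) -> bool:
    """Return True if the EP (poisoned data) bit is set in a TLP header."""
    if len(header) < 4:
        raise ValueError("need at least the first header DWORD")
    return bool(header[EP_BYTE_INDEX] & EP_BIT_MASK)


def poison(header: bytes) -> bytes:
    """Return a copy of the header with the EP bit set (error forwarding)."""
    out = bytearray(header)
    out[EP_BYTE_INDEX] |= EP_BIT_MASK
    return bytes(out)
```

A requester that finds a parity error in the data it fetched would call something like `poison()` on
the header before sending the write, and the recipient would use `is_poisoned()` to decide whether to
drop or process the payload.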

• ECRC error:

ECRC stands for end-to-end CRC. It is checked and reported by the ultimate recipient of the
transaction. ECRC generation and checking are optional. Any device or system that supports ECRC must
implement advanced error reporting (AER).

Examples of ECRC errors are:

ECRC error in a request packet: The completer drops the packet and no completion is returned. This
results in a completion time-out within the requesting device, and the requester reschedules the
same transaction.

ECRC error in a completion packet: The requester drops the packet and the error is reported to the
function's device driver via a function-specific interrupt.

• Flow control-related errors:

The TL layer of PCIe provides credit-based flow control, i.e. before passing a packet down for
transmission the transaction layer checks the flow control credits to ensure that the receive
buffers have sufficient space to hold the transaction. The credits themselves are exchanged in flow
control DLLPs (data link layer packets).

Flow control protocol errors can occur, and they prevent transactions from being sent. These errors
are reported to the root complex (RC) and are considered uncorrectable.

For example:

1. The maximum number of data payload credits that can be reported is restricted to 2048
unused credits, and to 128 unused credits for headers. Exceeding these limits is considered an
FC protocol error.
2. During flow control (FC) initialization, receivers are allowed to report infinite FC credits. FC
update DLLPs follow FC initialization. For a credit type that was advertised as infinite, FC
updates are allowed provided that the credit value field is set to zero, which is ignored by the
recipient. If the field contains any value other than zero, it is considered an FC protocol error.
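
The transmitter-side credit check can be sketched as below. This is a sketch under stated
assumptions: it uses the modulo-arithmetic gating rule as commonly described for the PCIe Base
Specification (8-bit counters for header credits, 12-bit counters for data credits), which this
paper does not spell out, and the function name `may_transmit` is hypothetical.

```python
HEADER_FIELD_BITS = 8   # assumed width of the header credit counters
DATA_FIELD_BITS = 12    # assumed width of the data credit counters


def may_transmit(credit_limit, credits_consumed, credits_required, field_bits):
    """Return True if a TLP needing `credits_required` credits may be sent.

    All counters wrap modulo 2**field_bits, so the comparison is done in
    modulo arithmetic: the TLP passes the gate while the remaining credit
    count is at most half of the counter range.
    """
    modulus = 1 << field_bits
    remaining = (credit_limit - (credits_consumed + credits_required)) % modulus
    return remaining <= modulus // 2
```

With a header credit limit of 10 and 8 credits already consumed, a TLP needing 2 credits passes the
gate and one needing 3 does not. The modulo form also keeps working after the counters wrap around.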

Completion transaction errors:

The completion packet header has a “completion status” field, which indicates the status of the
completion transaction. The errors below are related to completion transactions.

• Unsupported Request error:

When the receiver at the other end receives a transaction that it does not support, it returns
a completion transaction with unsupported request (UR) in the “completion status” field of
the packet header.

A few possible cases of unsupported request are:

o A message request is received with an unsupported or undefined message code.
o The request does not reference address space mapped within the device.
o A Type 1 configuration request is received at an endpoint.

• Completer Abort error:

This error is optional, and when a completer aborts depends on the implementation. A completer
that aborts a request may report the error to the root complex (RC) as a Non-Fatal Error
message, and it returns the completion packet with completer abort in the completion status field
of the packet header.

A possible scenario for a completer abort condition is:

A completer receives a request that it cannot complete because the request violates
the programming rules for the device. For example, some devices may be designed to permit
access to a single location within a specific Double Word, while any attempt to access the
other locations within the same Double Word will fail.

• Unexpected Completion:

Sometimes a requester receives a completion that it did not expect, i.e. one whose tag/ID does not
match any packet it sent.

The typical reason for an unexpected completion is that the completion was mis-routed on its
journey back to the intended requester.

• Completion Time-out:

Per the PCIe specification, the completion for a request must be returned within a specified time,
otherwise a completion time-out occurs. The completion time-out mechanism is implemented by any
device that initiates requests and requires completions to be returned.

A completion time-out can occur because the completion was wrongly routed or because the PHY on the
completer side dropped the packet.
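
Both checks come down to the requester tracking its outstanding requests by tag. The Python sketch
below is only a model of that bookkeeping: the class `RequestTracker`, its method names and the
time-out value are hypothetical, and time is in arbitrary units.

```python
class RequestTracker:
    """Toy model of a requester's table of outstanding non-posted requests."""

    def __init__(self, timeout):
        self.timeout = timeout  # hypothetical completion time-out, arbitrary units
        self.pending = {}       # tag -> time at which the request was issued

    def issue(self, tag, now):
        """Record that a request with this tag was sent at time `now`."""
        self.pending[tag] = now

    def on_completion(self, tag):
        """Match a completion to a pending request by tag."""
        if tag not in self.pending:
            return "Unexpected Completion"
        del self.pending[tag]
        return "OK"

    def expire(self, now):
        """Return the tags whose completion did not arrive in time."""
        timed_out = sorted(
            tag for tag, issued in self.pending.items()
            if now - issued >= self.timeout
        )
        for tag in timed_out:
            del self.pending[tag]
        return timed_out
```

Note that a completion arriving after its request has already timed out is no longer in the table,
so in this model it is treated as an unexpected completion.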

PCIe error reporting and handling mechanisms: how the errors are reported and handled

Fig 1: PCIe error handling flow

PCIe error reporting:

PCIe provides two main ways of reporting errors:

• By the completion status field, which the completer uses to report errors to the
requester. The completer or requester may be an EP or the RC.
• By error message transactions, which are used to report errors to the host/RC.

Error reporting by Completion Status:

The completion TLP has a “completion status” field to report errors from the completer to the
requester.
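
As an illustration, the completion status field can be decoded with a small table. The 3-bit
encodings below are the standard ones as far as I am aware; they are not listed in this paper, and
the names `COMPLETION_STATUS` and `decode_completion_status` are hypothetical.

```python
# 3-bit completion status encodings (assumed from the standard header format)
COMPLETION_STATUS = {
    0b000: "Successful Completion (SC)",
    0b001: "Unsupported Request (UR)",
    0b010: "Configuration Request Retry Status (CRS)",
    0b100: "Completer Abort (CA)",
}


def decode_completion_status(status):
    """Map the 3-bit completion status field to a readable name."""
    return COMPLETION_STATUS.get(status & 0b111, "Reserved")


def is_error_completion(status):
    """UR and CA are the two error statuses a completer can report this way."""
    return (status & 0b111) in (0b001, 0b100)
```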

Error reporting by Message TLP:

The message type of TLP was introduced in PCIe to serve many purposes, such as error reporting and
interrupt handling. For error reporting, the message identifies the device that detected the error
and indicates the severity of each error.
In a message TLP, the “message code” field gives the purpose of the message transaction.

Message Code   Name           Description
30h            ERR_COR        used when a PCI Express device detects a correctable error
31h            ERR_NONFATAL   used when a device detects a non-fatal, uncorrectable error
33h            ERR_FATAL      used when a device detects a fatal, uncorrectable error

NOTE: Error message TLPs are always routed to the RC.
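
The three error message codes can be decoded as follows. This is a minimal sketch that uses only the
codes from the table above; the names `ERROR_MESSAGE_CODES`, `decode_error_message` and
`is_uncorrectable` are hypothetical.

```python
# Message codes taken from the table above.
ERROR_MESSAGE_CODES = {
    0x30: "ERR_COR",
    0x31: "ERR_NONFATAL",
    0x33: "ERR_FATAL",
}


def decode_error_message(code):
    """Return the error message name, or None if the code is not an error message."""
    return ERROR_MESSAGE_CODES.get(code)


def is_uncorrectable(code):
    """True for the two message codes that signal an uncorrectable error."""
    return code in (0x31, 0x33)
```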

PCIe error handling:

PCIe provides two mechanisms for error handling.

• Baseline error handling mechanism. It can be further categorized as below:
o PCI-compatible/legacy error handling mechanism: supports software or devices
that have no knowledge of PCIe.
o PCI Express/native device error handling mechanism: supports software or
devices that have knowledge of PCIe.
• Advanced error reporting mechanism.

Baseline error reporting is done through the PCI-compatible registers and the PCI Express Capability
registers, while advanced error reporting (AER) is done through the Advanced Error Reporting
registers, which are mapped into extended configuration address space. In other words, error
reporting is done through configuration registers that are mapped into three distinct regions of
configuration space.

1. Error logging using PCI-compatible registers: This method provides backward compatibility
with existing PCI-compatible software and is enabled via the PCI configuration Command
Register. These errors are mapped within the PCI-compatible error registers.
2. Error logging using PCIe capability registers: This method is the error reporting of PCIe native
devices. Error reporting is enabled via the PCI Express Device Control
Register, which is mapped within PCI-compatible configuration space.
3. Error logging using PCIe Advanced Error Reporting registers: This is an optional method in which
error reporting is done by registers that are mapped into the extended configuration
address space. In this method, reporting is enabled or disabled for individual errors via the
Error Mask Registers.

PCI-Compatible or legacy error handling mechanism:

PCIe provides register mappings to support PCI-style error reporting. The PCI error reporting
mechanism involves the assertion of the signals PERR# (data parity errors) and SERR# (unrecoverable
errors). PCI Express handles these events via the split transaction mechanism
(transaction completions) and via virtual SERR# signaling using error messages.

This involves enabling error reporting and setting status bits that can be read by PCI-compliant
software. The configuration Status and Command registers contain the error-related bits.

Below are the details of some important registers required for PCI-compatible error handling.

• PCI-Compatible Configuration Command Register

Signal name in PCI       Description in PCIe
SERR# Enable             Setting this bit (1) enables the generation of the appropriate PCI Express
                         error messages to the Root Complex. Error messages are sent by the device
                         that has detected either a fatal or non-fatal error.
Parity Error Response    This bit enables poisoned TLP reporting. This error is typically reported as
                         an Unsupported Request (UR) and may also result in a non-fatal error message
                         if SERR# Enable = 1b. Note that reporting in some cases is device-specific.

• PCI-Compatible Status Register (error-related bits): This provides the bits that indicate
the type of error, such as system error or target abort.
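
A minimal sketch of how software might use these two registers is shown below. The bit positions are
the conventional PCI Command and Status register positions as I understand them; they are not given
in this paper, and the helper names `enable_legacy_reporting` and `decode_status` are hypothetical.

```python
# Command register bits (assumed conventional PCI positions)
CMD_PARITY_ERROR_RESPONSE = 1 << 6
CMD_SERR_ENABLE = 1 << 8

# Error-related Status register bits (assumed conventional PCI positions)
STATUS_ERROR_BITS = {
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}


def enable_legacy_reporting(command):
    """Return the Command register value with both error enables set."""
    return command | CMD_PARITY_ERROR_RESPONSE | CMD_SERR_ENABLE


def decode_status(status):
    """List the error-related Status bits that are set, lowest bit first."""
    return [name for bit, name in sorted(STATUS_ERROR_BITS.items())
            if status & (1 << bit)]
```

On real hardware the two values would be read from and written to configuration space. Here they are
plain integers so that the bit handling can be checked in isolation.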

PCI Express/native device error handling mechanism:

This is the PCI Express baseline error handling mechanism, which uses the PCI Express Capability
register set. These registers include error detection and handling bit fields that give more
information about the nature of an error than standard PCI error handling supplies. The baseline
capability register space is different for RC and EP mode.

Fig 2: PCIe baseline capability register structure

These registers provide support for:

• Enabling/disabling error reporting (error message generation)
• Providing error status
• Providing status for link training errors
• Initiating link re-training

Below are the details of some important registers required for baseline error handling.

• Device Control Register:

Setting the corresponding bit in the Device Control register enables the generation of the
error message that reports errors of the associated classification. Unsupported
Request errors are specified as Non-Fatal errors and are reported via a Non-Fatal Error Message, but
only when the UR Reporting Enable bit is set.

• Device Status Register:

An error status bit is set any time an error associated with its classification is detected. These
bits are set irrespective of the setting of the error reporting enable bits within the Device Control
register. Because Unsupported Request errors are by default considered Non-Fatal errors, when these
errors occur both the Non-Fatal Error status bit and the Unsupported Request status bit are set.
Note that software clears these bits by writing a one (1) to the bit field.
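
The interaction between the two registers for an Unsupported Request can be modelled as below. This
is a simplified sketch: the bit positions (bits 0 to 3 for correctable, non-fatal, fatal and UR) are
the usual Device Control/Status layout as I understand it rather than something stated here, and the
function names are hypothetical.

```python
# Device Control enable bits (assumed positions)
CTL_CORRECTABLE_EN = 1 << 0
CTL_NONFATAL_EN = 1 << 1
CTL_FATAL_EN = 1 << 2
CTL_UR_EN = 1 << 3

# Device Status detected bits (assumed positions)
STA_CORRECTABLE = 1 << 0
STA_NONFATAL = 1 << 1
STA_FATAL = 1 << 2
STA_UR = 1 << 3


def record_unsupported_request(status, control):
    """Model a detected UR: return (new status value, message sent or None).

    Status bits are set regardless of the enables. The message is sent
    only when UR Reporting Enable is set, as described above.
    """
    status |= STA_NONFATAL | STA_UR
    message = "ERR_NONFATAL" if control & CTL_UR_EN else None
    return status, message


def write_one_to_clear(status, written):
    """Model a software write: every 1 written clears that status bit."""
    return status & ~written
```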

• Link Errors: Link Control and Link Status registers

The physical link connecting two devices may fail, causing a variety of errors. Link failures are
typically detected within the physical layer and communicated to the Data Link Layer. Because the
link itself has incurred errors, the error cannot be reported to the host via the failed link.
Therefore, link errors must be reported via the upstream port of switches or by the Root Port
itself. Also, the related fields in the PCI Express Link Control and Status registers are only valid
in Switch and Root downstream ports (never within endpoint devices or switch upstream ports). This
permits system software to access link-related error registers on the port that is closest to the
host.

Advanced Error Reporting Mechanism (this is optional):

Importance of AER: AER provides fine-grained, pinpoint details of correctable and uncorrectable
errors. It has registers to define error severity, to log errors, to mask errors and to identify
the source of an error.

Fig 3: PCIe advanced error reporting register structure

Below are the details of some important registers required for advanced error handling.

• Advanced Correctable Error status register:

When a correctable error occurs, the corresponding bit within the advanced correctable error status
register is set, independent of the mask register setting. These bits are automatically set by
hardware and are cleared by software by writing a "1" to the bit position.

• Advanced Correctable Error mask register:

Correctable errors can be masked by setting the corresponding bit in this register. Masking only
affects error reporting, not the status bits. Masked errors are not logged in the header log
register and are not reported to the RC.
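
The status/mask behaviour can be sketched as below. The bit assignments are the AER Correctable
Error Status positions as I recall them (they match the constants used by the Linux kernel) and are
not listed in this paper. The function name `log_correctable` is hypothetical.

```python
# Correctable Error Status/Mask bit positions (assumed from the AER layout)
COR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}


def log_correctable(status, mask, bit):
    """Model hardware detecting the correctable error at position `bit`.

    Returns (new status value, whether ERR_COR is sent). The status bit is
    always set. The mask only suppresses the report.
    """
    if bit not in COR_BITS:
        raise ValueError("not a defined correctable error bit")
    status |= 1 << bit
    send_err_cor = not (mask & (1 << bit))
    return status, send_err_cor
```

This sketch ignores the Device Control enable bits, which also gate whether the message is sent.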

• Advanced Uncorrectable Error handling registers:

Uncorrectable errors can selectively cause an uncorrectable error message to be sent to
the host system. Uncorrectable errors that are selected to be non-fatal result in a non-fatal
error message being delivered, and those selected as fatal result in a fatal error message
being delivered. Whether or not an error message is generated for a given error is specified in
the advanced uncorrectable error mask register.

• Advanced Uncorrectable Error status register:

When an uncorrectable error occurs, the corresponding bit within the advanced uncorrectable error
status register is set, independent of the mask register setting. These bits are automatically set
by hardware and are cleared by software by writing a "1" to the bit position.

• Advanced Uncorrectable Error severity register:

The AER mechanism lets software define the severity of each uncorrectable error individually. Each
bit in this register selects whether the corresponding error is treated as fatal (bit set) or
non-fatal (bit clear).

• Uncorrectable Error mask register:

Uncorrectable errors can be masked by setting the corresponding bit in this register. The
default condition is to generate error messages for each type of error. Masking only affects error
reporting, not the status bits. Masked errors are not logged in the header log register and are not
reported to the RC.
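
How the status, severity and mask registers combine can be sketched as below. The bit positions are
the AER Uncorrectable Error Status positions as I recall them (matching the Linux kernel constants),
not values given in this paper. The function name `message_for_uncorrectable` is hypothetical, and
the Device Control enable bits are ignored for brevity.

```python
# Uncorrectable Error Status/Mask/Severity bit positions (assumed AER layout)
UNC_BITS = {
    4: "Data Link Protocol Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}


def message_for_uncorrectable(bit, mask, severity):
    """Return the message sent for the error at `bit`, or None if masked.

    A set severity bit selects ERR_FATAL and a clear bit ERR_NONFATAL.
    """
    if bit not in UNC_BITS:
        raise ValueError("not a defined uncorrectable error bit")
    if mask & (1 << bit):
        return None
    return "ERR_FATAL" if severity & (1 << bit) else "ERR_NONFATAL"
```

For example, with the severity bit for Malformed TLP set and the one for Unsupported Request clear,
the first error produces ERR_FATAL and the second ERR_NONFATAL, unless the mask suppresses them.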

• Root Complex error tracking and reporting:

The root complex is the target of all error messages issued by devices within the PCI Express
fabric. Errors received by the RC result in status registers being updated and in the error being
conditionally reported to the appropriate software handler or handlers.

• Root Error Status register:

When the RC receives an error message, it sets status bits within the root error status register.
This register indicates the types of errors received and also indicates when multiple errors of the
same type have been received.

• Root Error Command register:

The root error command register enables interrupt generation for correctable, non-fatal and fatal
errors.
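
The root port bookkeeping can be modelled roughly as follows. This is a sketch under assumptions:
the bit positions follow the AER Root Error Status and Root Error Command layout as I understand it,
only the "received" and "multiple received" bits are modelled, and the class `RootPort` is
hypothetical.

```python
# Root Error Status bits (assumed positions)
ROOT_COR_RCV = 1 << 0          # ERR_COR received
ROOT_MULTI_COR_RCV = 1 << 1    # multiple ERR_COR received
ROOT_UNCOR_RCV = 1 << 2        # ERR_FATAL/ERR_NONFATAL received
ROOT_MULTI_UNCOR_RCV = 1 << 3  # multiple ERR_FATAL/ERR_NONFATAL received

# Root Error Command interrupt enables (assumed positions)
CMD_COR_EN = 1 << 0
CMD_NONFATAL_EN = 1 << 1
CMD_FATAL_EN = 1 << 2


class RootPort:
    def __init__(self, command=0):
        self.status = 0
        self.command = command

    def receive(self, message):
        """Update the status bits and return True if an interrupt is raised."""
        if message == "ERR_COR":
            first, multiple, enable = ROOT_COR_RCV, ROOT_MULTI_COR_RCV, CMD_COR_EN
        elif message == "ERR_NONFATAL":
            first, multiple, enable = ROOT_UNCOR_RCV, ROOT_MULTI_UNCOR_RCV, CMD_NONFATAL_EN
        elif message == "ERR_FATAL":
            first, multiple, enable = ROOT_UNCOR_RCV, ROOT_MULTI_UNCOR_RCV, CMD_FATAL_EN
        else:
            raise ValueError("not an error message")
        # A second message of the same kind sets the "multiple" bit instead.
        self.status |= multiple if self.status & first else first
        return bool(self.command & enable)
```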

Basic flow chart for error handling:

Fig 4: Basic flow chart for PCIe error handling

Note: in the above diagram, ANF stands for Advisory Non-Fatal error and DC reg for Device Control
register.

Advisory Non-Fatal errors:

Errors are reported and signaled as ERR_COR, ERR_NONFATAL or ERR_FATAL, or are not signaled at all,
depending on the role of the agent that detects the error and on whether the agent implements AER.
In some cases the detecting agent is not the appropriate agent to determine the ultimate disposition
of the error. A detecting agent with AER can then signal the non-fatal error with ERR_COR, which
serves as an advisory notification to software. For example, when a receiver that is not the
ultimate destination of a TLP detects an error with the TLP and the severity is non-fatal, this
“intermediate” receiver handles the case as an Advisory Non-Fatal error. A receiver with AER
signals the error (if enabled) by sending an ERR_COR message. A receiver without AER sends no
error message for this case. If the severity is fatal, the error is not an Advisory Non-Fatal error
and must be signaled (if enabled) with ERR_FATAL.

The other case is where operation should continue after an uncorrectable non-fatal error. Such a
scenario is also handled as an Advisory Non-Fatal error by sending ERR_COR. For example, when a
poisoned TLP is received by its ultimate destination, the severity is non-fatal and the receiver
deals with the poisoned data in a manner that permits continued operation, the receiver handles this
case as an Advisory Non-Fatal error. A receiver with AER signals the error (if enabled) by sending
an ERR_COR message, and a receiver without AER sends no error message for this case. If the severity
is fatal, the error is not an Advisory Non-Fatal error and must be signaled (if enabled) with
ERR_FATAL.
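
The signaling decision described above can be condensed into a small function. This is only a sketch
of the rules as stated in this section. The function name `signal_for` and its parameters are
hypothetical, and the "if enabled" conditions are left out for brevity.

```python
def signal_for(severity, advisory_case, has_aer):
    """Return the error message an agent signals, or None for no message.

    severity:      "fatal" or "nonfatal" (the programmed severity of the error)
    advisory_case: True when the agent is in one of the advisory situations,
                   e.g. an intermediate receiver, or a destination that can
                   continue operating after a poisoned TLP
    has_aer:       True if the agent implements AER
    """
    if severity == "fatal":
        return "ERR_FATAL"       # never advisory
    if advisory_case:
        return "ERR_COR" if has_aer else None
    return "ERR_NONFATAL"
```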

Nullified packet: This feature, also called switch cut-through, is an improvement in PCIe over
its predecessor PCI. Previously, a packet at the ingress (incoming) port of a switch was not sent to
the egress (outgoing) port until the tail end of the packet had been received and its CRC checked.
In PCIe, the packet can be passed from the ingress port to the egress port without waiting for the
tail end. If a CRC error is detected on receiving the tail end of the TLP, the TLP’s END symbol is
replaced with EDB (end bad) at the egress port of the switch and the LCRC is inverted from what it
should be. The switch sends a NAK for the bad TLP back to the device that transmitted it. When the
nullified TLP (the TLP with the EDB tail end) reaches the endpoint (EP), the EP discards it and does
not send a NAK for it. After receiving the NAK, the transmitter sends the same TLP again.

PCIe error handling on a typical SoC:

A typical SoC (system on chip) consists of a core (CPU), memory blocks (RAM/flash), timing sources,
PLLs, reset handling, external/off-chip interfaces, industry-standard peripherals such as
USB/Ethernet/SPI/PCIe/UART, analog interfaces such as ADCs/DACs, voltage regulators and
power management controllers. The core communicates (provides stimulus in hex/binary format)
with the modules (slaves such as PCIe) through an interface that acts as the application layer. A
typical case of PCIe error handling on an SoC is described below.

Suppose the core generates an MRd (memory read) transaction to an EP, and for the EP this is an
unsupported request.

The EP returns a completion with status field “UR” to the RC. The EP may also send an ERR_NONFATAL
message, if enabled in the EP’s Device Control register. The EP logs this error in its:

• Device Status Register
• Uncorrectable Error Status Register
• Header Log Register

For this “UR” completion packet, the RC terminates the MRd transaction and returns an internal
completion to the requester, i.e. the core. The result of the transaction is marked to the core as
an error with “Bad Data”. The RC logs this error in its:

• Secondary Status Register (for the received UR completion), and Root Error Status Register if it
receives an ERR_NONFATAL message

The core does not complete the instruction that received the error status/“Bad Data”. Instruction
execution is paused and the core’s execution pointer jumps to the interrupt handler corresponding to
the error.

How the core proceeds with recovery from this point depends on the application and on the
vendor/implementation.

Similarly, for other PCIe errors the core jumps to the interrupt handler corresponding to the error
and takes implementation-dependent actions.

Requirements and recommendations for reporting multiple errors:

Error pollution can occur when the root cause of an error for a transaction cannot be isolated.
For example, if the DL layer detects an error, subsequent errors that occur for the same
packet should not be reported by the transaction layer. Similarly, if the physical layer detects a
receiver error, that error should not be allowed to propagate and cause subsequent errors at upper
layers (for example, a TLP error at the Data Link Layer), because this makes it more difficult to
determine the root cause of the error.

For such cases it is required and recommended that no more than one error is reported for a single
received TLP, and that the precedence below (from highest to lowest) is used:

• Uncorrectable internal error
• Receiver Overflow
• Flow Control Protocol Error
• Malformed TLP
• ECRC Check Failed
• AtomicOp Egress Blocked
• TLP Prefix Blocked
• ACS Violation
• MC Blocked TLP
• Unsupported Request (UR), Completer Abort (CA), or Unexpected Completion
• Poisoned TLP Received or Poisoned TLP Egress Blocked
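
Selecting the single error to report is then a matter of ranking. The sketch below encodes the
precedence list above, with errors that share a bullet sharing a rank. The names `PRECEDENCE`,
`RANK` and `error_to_report` are hypothetical.

```python
# Highest precedence first; errors listed together share a rank.
PRECEDENCE = [
    ["Uncorrectable Internal Error"],
    ["Receiver Overflow"],
    ["Flow Control Protocol Error"],
    ["Malformed TLP"],
    ["ECRC Check Failed"],
    ["AtomicOp Egress Blocked"],
    ["TLP Prefix Blocked"],
    ["ACS Violation"],
    ["MC Blocked TLP"],
    ["Unsupported Request", "Completer Abort", "Unexpected Completion"],
    ["Poisoned TLP Received", "Poisoned TLP Egress Blocked"],
]

RANK = {name: rank for rank, group in enumerate(PRECEDENCE) for name in group}


def error_to_report(detected):
    """Return the one error to report for a TLP, or None if none was detected."""
    if not detected:
        return None
    return min(detected, key=RANK.__getitem__)
```

For a TLP on which both an ECRC failure and a poisoned payload are detected, only the ECRC failure
would be reported.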

Conclusion:

PCIe provides very descriptive error reporting and handling methods, with various registers for
handling different kinds of errors. This paper has detailed the error handling methods for legacy
and native devices.

The actions taken by a function when an error is detected are governed by the type of error and the
settings of the error-related configuration registers. The resultant actions for PCIe errors on SoCs
are application and implementation specific.

References:

https://www.kernel.org/doc/Documentation/PCI/pcieaer-howto.txt

Book: PCI Express System Architecture, Ravi Budruk, Don Anderson, Tom Shanley, MindShare, Inc., 2006

All material on this site Copyright © 2017 Design And Reuse S.A. All rights reserved.
