The Single Point Fault Metric, abbreviated SPFM, is a hardware architecture metric from the Functional Safety. It is primarily used in the context of ISO 26262 and assesses how well a safety-related hardware architecture is protected against random hardware failures that can directly lead to a violation of a safety goal or do so despite existing diagnostics. The SPFM is therefore a key metric for evaluating hardware in safety-critical systems, especially in Automotive-applications, but also in other areas where random hardware failures must be systematically analyzed and controlled.

Content

At its core, the SPFM answers a simple question: What proportion of safety-related hardware failures are not effectively mitigated as single-point faults or latent faults? The higher the metric, the better the architecture is protected against these types of failures. The ideal value is 100 percent. A value of 100 percent would mean that no relevant single-point faults and no relevant latent faults remain. In practice, this ideal value is rarely achieved. The crucial factor is whether the required target value for the respective Automotive Safety Integrity Level, or ASIL for short, is met.

SPFM is of particular importance for all safety-critical components in the vehicle.

Automotive ASIL Classification Overview - SPFM Plays a Role — Vehicle safety systems (Source: TUV Rheinland)

Definition

SPFM stands for Single Point Fault Metric. The translation is roughly „Metric for Single Faults.“ However, the term should be understood more narrowly: both Single Point Faults and Residual Faults. Both types of faults are relevant for evaluation because they can lead to a violation of a safety goal.

A single point fault is a fault that can directly lead to the violation of a safety goal without the simultaneous occurrence of another independent fault. This means: A single hardware fault is sufficient to jeopardize the safety function. Such a fault is particularly critical from a functional safety perspective because no additional fault combination is required.

A residual fault is a remaining error. In principle, there is a safety mechanism designed to detect or control a fault, but this mechanism does not fully cover the fault. The portion that is not diagnosed or controlled remains as a residual fault. Therefore, residual faults arise from incomplete diagnostic coverage.

Calculation

The Single Point Fault Metric is calculated from the failure rates of safety-related hardware elements. The failure rates of Single Point Faults and Residual Faults are put into relation to the total safety-related failure rate.

$SPFM = 1 – \frac{ \sum \lambda_{\mathrm{SPF}} + \sum \lambda_{\mathrm{RF}} }{ \sum \lambda_{\mathrm{SR}} }$

With the following conditions: $λ_SPF = Single Point Fault Failure Rate$ $$\lambda_{\mathrm{RF}}$ = Error rate of residual faults$ $λSR = safety-related error rate$

The formula can also be understood as the proportion of non-critical or controlled error rates: $SPFM = (sum lambda_SR – sum lambda_SPF – sum lambda_RF) / sum lambda_SR$

This second representation shows more clearly what it's about: the critical remaining portions are subtracted from the entire safety-related failure rate. What remains is the proportion of safety-related hardware failures that, from the perspective of the Single Point Fault Metric, do not remain effective as a Single Point Fault or Residual Fault.

Meaning of the individual formula components

The error rate $lambda$ describes the statistical frequency with which a hardware element fails. In safety analyses, it is often given in FIT. FIT stands for Failures in Time and means failures per $10^9$ 109 operating hours. $1 FIT = 1 failure per 10^9 hours$

The sum $∑ λ$ _SRincludes the safety-related error rates of the hardware under consideration. Not every conceivable hardware failure is safety-related. Those failures that must be considered in relation to a specific safety goal are relevant. The Single Point Fault Metric is therefore always a safety-related architectural metric.

The sum $Sum of SPF lambda$ contains the error rates of all errors classified as single point faults. These faults are particularly critical because they can lead to a violation of the safety goal without any further fault combination.

The sum $\sum \lambda_{\mathrm{RF}}$ It includes the error rates of residual faults. A residual fault remains after considering a safety mechanism. A typical example is a diagnostic mechanism with a certain diagnostic coverage. If this mechanism detects 90 percent of a certain fault type, 10 percent remain as a residual portion. This residual portion can be incorporated into the SPFM as a residual fault.

Example calculation

Assume a security-related hardware analysis yields the following failure rates: $\sum \lambda_{\mathrm{SR}} = 1000\, \mathrm{FIT}$ $Sum of $\lambda_{\mathrm{SPF}} = 20\,\mathrm{FIT}$$ $Sum of λ_RF = 30 FIT$

Then the result is: $SPFM = 1 – (20 FIT + 30 FIT) / 1000 FIT$ $SPFM = 1 – 50/1000$ $SPFM = 1 – 0.05 = 0.95$

As a percentage $\mathrm{SPFM} = 95\,\%$

The result means: 95 percent of the safety-related failure rate do not remain as a single point fault or residual fault. 5 percent of the safety-related failure rate is critical from an SPFM perspective.

Connection with FMEDA

The Single Point Fault Metric is typically determined based on an FMEDA. FMEDA stands for Failure Modes, Effects, and Diagnostic Analysis. This process systematically links hardware elements, failure modes, failure rates, failure effects, safety mechanisms, and diagnostic coverages.

In an FMEDA, potential failure modes are analyzed for each relevant hardware element. For each failure mode, it is assessed whether it is safety-relevant, how it affects the system, and whether a safety mechanism is present. Afterward, the failure mode is classified, for example, as a Safe Fault, Single Point Fault, Residual Fault, or Latent Fault.

The SPFM therefore arises from the classification of concrete error types. The quality of the metric therefore strongly depends on the quality of the underlying FMEDA. Incorrect assumptions about failure modes, wrong diagnostic coverage levels, or incomplete component lists lead to an incorrect value.

Single Point Fault

A single point fault is a fault that can, on its own, cause a violation of the safety goal. A simple example is a switching element that gets stuck in a dangerous state without diagnostics detecting this state and without a redundant shutdown possibility.

In system architecture, single point faults are particularly problematic. They indicate that a single hardware failure is sufficient to lose the safety function or create a dangerous output condition. The reduction of single point faults is therefore a main goal in improving SPFM.

Possible measures against single points of failure include redundancy, plausibility checks, monitoring, fail-safe states, diagnostic paths, watchdogs, readback signals, current measurement, voltage monitoring, or architectural separation. Which measure is suitable depends on the specific system, the failure mode, and the safety goal.

Residual Fault

A residual fault arises when a safety mechanism is in place, but not all relevant faults are completely detected or controlled. Therefore, a residual fault is the remaining critical portion after diagnosis.

Example: A sensor path is monitored by a plausibility check. This plausibility check detects many errors, but not all of them. If certain error patterns remain within plausible limits, they may go undetected. The proportion of these undetected errors is incorporated into the calculation as a residual fault.

Mathematically, the residual portion of a failure portion can be expressed via the diagnostic coverage: $\lambda_{\mathrm{RF}} = \lambda_{\mathrm{fault}} \cdot \left( 1 – DC \right)$

In doing so $Washington D.C.$ DC, the Diagnostic Coverage, as a factor between 0 and 1. With a Diagnostic Coverage of 90 percent, the following applies: $DC = 0.90$

Then remains: $1 – DC = 0.10$

This leaves 10 percent of the relevant error rate as residual error.

Single Point Fault Metric and Diagnostic Coverage

The SPFM is closely related to Diagnostic Coverage, but it is not the same. Diagnostic Coverage describes how well a particular safety mechanism detects or handles a specific fault class. The Single Point Fault Metric, on the other hand, is an architectural metric that is calculated across all considered safety-related hardware elements.

A high diagnostic coverage of individual mechanisms usually improves the Single Point Fault Metric because it reduces residual faults. However, a single good diagnostic mechanism is not automatically sufficient. If other critical failure modes remain undetected, the SPFM can still be too low.

Single Point Fault Metric in relation to LFM

In addition to the SPFM, the Latent Fault Metric, or LFM, is often used in hardware assessment. While the SPFM evaluates the handling of single point faults and residual faults, the LFM considers latent multiple point faults.

The LFM is described by the following formula: $LFM = 1 – SUM(λ_MPF,L) / (SUM(λ_SR) – SUM(λ_SPF) – SUM(λ_RF))$

With the following conditions: $\lambda_{\mathrm{MPF,L}} = Failure rate of latent multi-point faults$

Latent multipoint faults do not solely lead to the violation of a safety goal. They only become critical when an additional independent fault occurs. Nevertheless, they must be considered because they can exist in the system undetected for extended periods.

SPFM and LFM therefore assess different aspects of the hardware architecture. SPFM directly considers single-point and latent intermittent faults. LFM considers hidden multi-point faults that only become dangerous in combination.

Single Point Fault Metric in relation to PMHF

Another important metric is the PMHF. PMHF stands for Probabilistic Metric for random Hardware Failures. While SPFM and LFM are architectural metrics, PMHF considers the overall probabilistic failure rate of dangerous random hardware failures.

Simplified, the PMHF can be formed from the relevant hazardous parts: $PMHF = Σ λ$ _SPF + Σ λ_RF + Σ λ_MPF,L

The PMHF is given as a failure rate, for example in FIT or as a probability of failure per hour. SPFM and LFM, on the other hand, are dimensionless ratio values, often given in percent.

With these, the key figures fulfill different tasks. The SPFM answers the question of how well the architecture is protected against single point faults and residual faults. The LFM answers the question of how well latent multipoint faults are managed. The PMHF answers the question of how high the remaining dangerous random hardware failure rate is.

Why the Single Point Fault Metric is Important

The SPFM forces developers to consider hardware failures not only qualitatively but also quantitatively. Without such a metric, an architecture might seem plausible, yet still contain too many dangerous single or residual faults.

This quantitative consideration is particularly important with complex electronic systems. Microcontroller, Sensors, actuators, power supplies, driver stages, memory, communication interfaces, and monitoring circuits each have their own failure modes. These failure modes can have different effects on safety goals.

The SPFM makes visible where the architecture is weak. If the metric does not reach the required target value, the critical portion lies in the sum of Single Point Faults and Residual Faults. Then it must be analyzed which failure modes contribute the most. Often, a few dominant failure modes with high failure rates or poor diagnostic coverage are the cause.

Typical Measures

An SPFM can be improved by reducing the failure rates of single point faults and residual faults. There are several technical ways to achieve this.

Firstly, single point faults can be eliminated or reduced through architectural measures. These include redundant paths, independent shutdown mechanisms, readback functions, monitors, comparison logic, or watchdogs. The goal is that a single fault no longer immediately leads to the violation of the safety objective.

Secondly, residual faults can be reduced through better safety mechanisms. If a fault is already diagnosed but the diagnostic coverage is insufficient, improved diagnostics can help. Examples include tighter plausibility limits, additional measurement channels, cyclic tests, end-to-end monitoring, RAM tests, Flash CRC checks, clock monitoring, or voltage monitoring.

Third, the error rate of individual components can be reduced. If a component with a high error rate contributes significantly to the critical sum, a more robust component or a different circuit topology can help.

Fourth, the safety function can be distributed differently architecturally. Sometimes a poor SPFM value arises because too much safety responsibility is hanging on a single hardware path. A different distribution of monitoring, control, and shutdown can improve the metric.

Classification of Embedded Systems

In embedded systems, the Single Point Fault Metric concerns many typical hardware areas. These include microcontrollers, clock supply, memory, voltage regulators, sensors, actuators, power drivers, communication interfaces, and external monitoring circuits.

A microcontroller can be affected by, for example, CPU errors, memory errors, register errors, peripheral errors, or clock errors. Safety mechanisms can include lockstep architectures, RAM tests, flash checksums, MPU configurations, watchdogs, clock monitors, or internal safety mechanisms.

For sensors, errors such as shorts, opens, drift, stuck-at values, or plausible but incorrect signals play a role. For actuators and power drivers, errors such as short circuits to supply, short circuits to ground, open loads, output hangs, or thermal overload are relevant.

The SPFM forces an evaluation of its contribution to the dangerous failure rate. This makes it visible whether the safety architecture is only formally in place or if it actually reduces critical failure rates.

Meaning of ASIL Ratings

In the automotive context, the Single Point Fault Metric is evaluated depending on the ASIL. Higher ASIL levels place greater demands on the hardware architecture. A system with ASIL D requires stricter verification than a system with ASIL B.

The SPFM is not an isolated proof in this context. It is part of the hardware assessment and is related to safety goals, technical safety requirements, FMEDA, safety mechanisms, LFM, and PMHF. Passing an SPFM value does not replace the other proofs. It only shows that the architecture has reached the required target range with regard to single-point faults and residual faults.

Zurück zum Glossar

Threat and Risk Assessment