Do we need functional safety RTOS software?

A frequently asked question: do we need an RTOS for functional safety? Basically, no. Functional safety does not require the use of an RTOS as a basic condition. A safety-related embedded controller can also work with bare-metal software if the system deterministically meets its functional requirements, its timing conditions, and the fault tolerant time interval (FTTI) derived from the safety goals. An RTOS becomes technically relevant when scheduling, task separation, resource management, stack monitoring, or partitioning become part of the safety case.

Starting point

Developers and architects must decide whether to use bare metal, a standard RTOS, or a safety RTOS for embedded software. The decision affects not only runtime behavior and development effort. It also impacts the safety case, evidence generation, qualification of software components, and the system's architecture.

Bare-metal software runs directly on the microcontroller or processor. The application manages hardware access, memory, interrupts, and timing itself. A typical pattern is an infinite loop in which tasks are executed sequentially. Interrupts preempt this loop, execute a service routine, and then return control to it. Periodic tasks can be triggered via timer interrupts.
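The superloop pattern described here can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: all names (tick_isr, superloop_step, the two placeholder tasks) and the periods of 10 ms and 100 ms are assumptions made for the example, and a real project would connect tick_isr() to a hardware timer interrupt.

```c
#include <stdint.h>

/* All names and periods below are hypothetical, chosen for illustration. */
static volatile uint32_t g_tick_ms = 0u;  /* incremented by the timer ISR */

static uint32_t g_sensor_runs = 0u;
static uint32_t g_diag_runs = 0u;

/* Would be called from a 1 ms timer interrupt on real hardware. */
void tick_isr(void)
{
    g_tick_ms++;
}

/* Placeholder tasks: they only count how often they were executed. */
static void task_sensor(void) { g_sensor_runs++; }
static void task_diag(void)   { g_diag_runs++; }

/* One pass of the superloop: run every task whose period has elapsed.
 * Assumes that a 32-bit read of g_tick_ms is atomic on the target. */
void superloop_step(void)
{
    static uint32_t last_sensor = 0u;
    static uint32_t last_diag = 0u;
    const uint32_t now = g_tick_ms;  /* single read of the shared counter */

    /* Unsigned subtraction stays correct across counter wrap-around. */
    if ((uint32_t)(now - last_sensor) >= 10u) {
        last_sensor = now;
        task_sensor();
    }
    if ((uint32_t)(now - last_diag) >= 100u) {
        last_diag = now;
        task_diag();
    }
}
```

On a target, main() would call superloop_step() inside for (;;). The tasks run strictly one after another, so a long-running task delays every task that follows it in the loop.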

In contrast, an RTOS provides services for task scheduling, synchronization, inter-task communication, timers, memory management, and error handling. It does not perform the safety function itself, but it can provide mechanisms upon which safety-related software can be built.

Technical background

Functional safety is by no means achieved by the mere choice of an operating system. It results from requirements, architecture, implementation, verification, safety mechanisms, and a traceable safety case. Theoretically and practically, a system can be developed in compliance with safety standards on bare metal, with a standard RTOS, or with a functional safety RTOS, provided that the safety case for the overall system is sound.

For simple control devices, bare metal can suffice. This is especially true if the function remains manageable, the number of interrupt sources is low, no extensive communication stacks are needed, and the timing requirements can demonstrably be met within a superloop or timer structure. SIL goals can also be achieved in such systems with bare metal, provided the necessary safety mechanisms, memory protection, tool qualification, and verification measures are considered.

An RTOS changes what has to be proven. It adds software that must be understood, configured, integrated, and audited. In return, it can provide services that would otherwise have to be developed in-house for more extensive systems: priority-based scheduling, event processing, queues, semaphores, stack monitoring, resource management, and system analysis.

Make-or-Buy Decision: Safety RTOS

In functional safety, a certified RTOS initially seems like the easier procurement decision. Products such as SAFERTOS from Wittenstein High Integrity Systems, the safety variant of embOS (embOS-Safe from Segger), or PikeOS from SysGo are used when teams do not want to produce the full safety evidence for a runtime environment they have put together themselves. However, a certificate does not replace system responsibility. The specific configuration, the use of APIs, the task architecture, priorities, memory areas, and error responses remain part of the project evidence. The advantage of a functional safety certification is usually that qualification evidence already exists.

A standard RTOS like FreeRTOS or Zephyr can be used in a safety project if the project's own safety case covers its use. The effort then lies in qualification, limiting the scope of functionality used, verification, and documentation. This effort can be higher for safety-related products than the license and integration costs of a safety RTOS. However, it can be lower than changing to a safety RTOS if the application is small and the RTOS portion used remains tightly limited.

Bare metal is also not a cost-free solution. Once scheduling logic, priorities, communication buffers, mutual exclusion, watchdog reactions, and stack checks are self-built, a project-specific runtime environment is created. This must be treated like other safety-related software in the safety case.

The topic of tool qualification

Tool qualification means that development tools must be evaluated systematically when their results are incorporated into safety-related development evidence. What is crucial is not the tool as a whole, but the specific function used in the specific development process: depending on how it is applied, a compiler, a code generator, a static analysis tool, or a test tool can either play a purely supporting role or itself introduce or conceal safety-relevant errors. Therefore, it must be evaluated which tool functions are actually relevant, which malfunctions are conceivable, and whether these malfunctions would be discovered by downstream checks.

In practice, tool qualification primarily means establishing robust evidence of trust in critical tool functions. A sensible practice is therefore to regression-test the tools used: for defined, known inputs, it is regularly checked whether the tool produces the expected results and whether typical error cases are reliably detected. Tool qualification is thus a technical control mechanism in the development process. For a safety-qualified compiler or a safety-qualified RTOS, a significant part of this evidence has already been provided by the manufacturer: defined assumptions of use, documented limitations, qualification artifacts, and test evidence, for example in the form of regression tests.
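The idea of checking a tool against known inputs and expected results can be illustrated with a very small golden-vector check for one compiler behavior. The sketch below is only an assumption about what such a check might look like: the type golden_case_t, the table, and the function run_toolchain_regression are invented for the example, and a real qualification suite covers far more than signed division and remainder.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical golden vectors: inputs with the results that C99 and later
 * require for signed division (truncation toward zero) and remainder. */
typedef struct {
    int32_t a;
    int32_t b;
    int32_t expected_div;
    int32_t expected_mod;
} golden_case_t;

static const golden_case_t k_cases[] = {
    {  7,  2,  3,  1 },
    { -7,  2, -3, -1 },
    {  7, -2, -3,  1 },
    { -7, -2,  3, -1 },
};

/* Returns the number of cases in which the generated code deviates from
 * the expected result. The volatile operands keep the compiler from
 * folding the expressions at compile time, so the emitted instructions
 * are what actually gets exercised. */
size_t run_toolchain_regression(void)
{
    size_t failures = 0u;

    for (size_t i = 0u; i < sizeof k_cases / sizeof k_cases[0]; i++) {
        volatile int32_t a = k_cases[i].a;
        volatile int32_t b = k_cases[i].b;

        if ((a / b) != k_cases[i].expected_div) {
            failures++;
        }
        if ((a % b) != k_cases[i].expected_mod) {
            failures++;
        }
    }
    return failures;
}
```

Run after every toolchain update or change of compiler options, a non-zero result signals that a tool function the project relies on no longer behaves as documented.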

Assumptions about Schedulers, Tasks, and Timing Behavior

In an RTOS, the scheduler is the mechanism that executes tasks based on priorities, states, and events. For functional safety, it becomes relevant when safety functions must react within defined time limits. The FTTI describes the period within which a fault must be detected and a reaction triggered before a safety violation occurs.

With bare metal, this behavior is easy to justify when the sequence of operations is fixed and the maximum runtime of each loop iteration is known. However, new functions change the reaction times of all polled events when they are inserted into the same loop. Critical events can be prioritized via interrupts, but even this structure requires a traceable analysis of runtimes, interrupt lock times, and side effects.
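The effect of adding a function to the loop can be made concrete with a simplified timing budget. The model below is an assumption made for illustration: an event that just missed its poll point waits at most one full loop period until it is polled again, and the reaction then takes a bounded time. All function names and all numbers in the usage example are invented, and the sketch does no overflow handling.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Upper bound for one superloop period: the sum of the worst-case
 * execution times of all functions in the loop plus the worst-case
 * interrupt load that can occur during one pass. */
uint32_t superloop_period_max_us(const uint32_t *wcet_us, size_t n,
                                 uint32_t isr_load_us)
{
    uint32_t sum = isr_load_us;

    for (size_t i = 0u; i < n; i++) {
        sum += wcet_us[i];
    }
    return sum;
}

/* Simplified check: worst-case polling delay (one full loop period) plus
 * the time from detection to the safe state must fit into the FTTI. */
bool meets_ftti(uint32_t loop_max_us, uint32_t reaction_us, uint32_t ftti_us)
{
    return (loop_max_us + reaction_us) <= ftti_us;
}
```

With three functions of 200, 350, and 150 µs and 100 µs of interrupt load, the loop period is bounded by 800 µs, and a 500 µs reaction fits easily into an assumed FTTI of 10 ms. Adding a single function with a worst-case runtime of 9 ms raises the bound to 9.8 ms, and the same check fails although the safety function itself was not changed.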

An RTOS can reduce the temporal coupling between tasks. Events are assigned to tasks, tasks are assigned priorities, and blocking waits can be mapped via RTOS primitives instead of polling. This only helps if priorities, stack sizes, runtimes, and critical sections are analyzed. An incorrectly configured or incorrectly used RTOS can cause deadlocks, priority inversion, stack overflows, or missed response times.
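One common way to analyze priorities and runtimes is the classic response-time analysis for fixed-priority preemptive scheduling. The article does not prescribe this method; it is used here as an example technique. The sketch assumes independent periodic tasks sorted by descending priority, periods greater than zero, deadlines no longer than periods, and a known worst-case blocking time per task. The type rt_task_t and the function response_time are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t wcet;      /* worst-case execution time */
    uint32_t period;    /* activation period, must be > 0 */
    uint32_t deadline;  /* relative deadline, assumed <= period */
    uint32_t blocking;  /* worst-case blocking by lower-priority tasks */
} rt_task_t;

/* Computes the worst-case response time of tasks[i] by fixed-point
 * iteration. tasks[] is sorted by descending priority, so indices
 * 0..i-1 are the higher-priority tasks. Returns true if the iteration
 * converges within the deadline; *out_r then holds the response time.
 * Returns false as soon as the deadline is exceeded. */
bool response_time(const rt_task_t *tasks, size_t i, uint32_t *out_r)
{
    uint32_t r = tasks[i].wcet + tasks[i].blocking;

    for (;;) {
        uint32_t next = tasks[i].wcet + tasks[i].blocking;

        for (size_t j = 0u; j < i; j++) {
            /* Number of activations of task j within r, rounded up. */
            uint32_t releases = (r + tasks[j].period - 1u) / tasks[j].period;
            next += releases * tasks[j].wcet;
        }
        if (next > tasks[i].deadline) {
            *out_r = next;
            return false;
        }
        if (next == r) {
            *out_r = r;
            return true;
        }
        r = next;
    }
}
```

For an invented task set with (wcet, period) of (1, 4), (2, 6), and (3, 12) time units, the lowest-priority task has a response time of 10 and meets its deadline of 12. If the same task can additionally be blocked for 3 time units, for example by a lock held at lower priority, the analysis reports a deadline miss. This is why critical sections belong in the analysis alongside priorities.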

What do task systems, resources, and error handling achieve?

A task system separates software functions into executable units. This separation can make code more understandable when communication stacks, sensor processing, actuator control, and diagnostics require their own execution contexts. Queues and semaphores then facilitate controlled handoffs between these contexts.

For safety, this separation is only useful if the data flows and dependencies are documented. A task that processes safety-related data must not be uncontrollably blocked by a less critical function. Shared resources must be protected in such a way that race conditions and inconsistent data states are detected or avoided.
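Detecting inconsistent data states can, for example, be supported by storing a safety-related value together with its bitwise inverse. The following sketch shows this common pattern under stated assumptions: the type safe_u32_t and both functions are hypothetical names, and the pattern only detects corruption. It does not make the access atomic, so in a concurrent system the write would still need a critical section or an RTOS primitive.

```c
#include <stdbool.h>
#include <stdint.h>

/* A value stored redundantly with its bitwise inverse. */
typedef struct {
    uint32_t value;
    uint32_t inverse;
} safe_u32_t;

/* Writes both copies. Not atomic: callers in a multitasking or
 * interrupt context must protect this with a critical section. */
static void safe_u32_write(safe_u32_t *s, uint32_t v)
{
    s->value = v;
    s->inverse = ~v;
}

/* Returns true and stores the value in *out only if both copies are
 * consistent. Any mismatch is reported so that the caller can trigger
 * its defined fault reaction. */
static bool safe_u32_read(const safe_u32_t *s, uint32_t *out)
{
    if ((s->value ^ s->inverse) != 0xFFFFFFFFu) {
        return false;
    }
    *out = s->value;
    return true;
}
```

A reader that receives false must not use the value. Which reaction follows (substitute value, degraded mode, safe state) has to come from the hazard analysis, not from the data type.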

Many RTOSes offer error handling mechanisms. These include overflow detection, handling of invalid states, runtime error hooks, or task utilization diagnostics. These mechanisms can be part of a safety architecture. However, they must align with the specific hazard analysis. A stack overflow hook is not a safety concept if it is not defined which reaction the system performs in case of an error, or whether that reaction is performed at all.
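What a defined reaction might look like can be sketched with a simple stack guard word and a hook that leads into a safe state. Everything here is an assumption for illustration: the names, the guard pattern, and the fault code are invented and do not correspond to any particular RTOS API. On real hardware the safe state would typically not return (outputs off, then halt or wait for the watchdog). The sketch returns only so that it can be exercised on a host.

```c
#include <stdbool.h>
#include <stdint.h>

#define STACK_GUARD_PATTERN  0xA5A5A5A5u  /* hypothetical guard value */
#define FAULT_STACK_OVERFLOW 1u           /* hypothetical fault code */

typedef enum { SYS_RUN = 0, SYS_SAFE_STATE = 1 } sys_state_t;

static sys_state_t g_state = SYS_RUN;
static uint32_t g_latched_fault = 0u;
static bool g_outputs_enabled = true;

/* The defined reaction: disable outputs, latch the first fault for
 * later diagnosis, and leave normal operation. */
static void enter_safe_state(uint32_t fault)
{
    g_outputs_enabled = false;
    if (g_latched_fault == 0u) {
        g_latched_fault = fault;
    }
    g_state = SYS_SAFE_STATE;
}

/* Hook that an RTOS or a bare-metal check would call on overflow. */
void on_stack_overflow(void)
{
    enter_safe_state(FAULT_STACK_OVERFLOW);
}

/* Checks a guard word placed at the end of a stack area. An overwritten
 * guard word indicates that the stack has grown beyond its limit. */
void stack_guard_check(const volatile uint32_t *guard_word)
{
    if (*guard_word != STACK_GUARD_PATTERN) {
        on_stack_overflow();
    }
}
```

The point of the sketch is the link between detection and reaction: the hook has exactly one documented effect, and that effect can be tested.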

Memory and Hardware Partitioning

Hardware mechanisms can reduce the need for a fully certified software stack. Microcontrollers with a Memory Protection Unit can separate memory areas from each other. Lockstep cores can detect certain hardware failures. Such mechanisms support the safety case because they limit error propagation or make errors visible.

An RTOS can leverage these hardware capabilities when it manages memory areas on a per-task or per-partition basis. In mixed-criticality systems, spatial and temporal partitioning can prevent a less critical function from impacting the execution of a safety-related function. This requires suitable hardware and an RTOS configuration that effectively enforces this separation.

In this context, a hypervisor is only useful for systems that run multiple operating systems, partitions, or functional domains on one hardware platform. For a small MCU system with a manageable control function, a hypervisor usually means additional verification and integration effort. In high-performance computing (HPC) platforms or systems with separate safety and non-safety components, it can be part of the architecture. The use of specialized multicore controllers is sometimes advisable for hypervisor functions.

Standard RTOS, Functional Safety RTOS, or Bare Metal?

The decision can be narrowed down by specific system characteristics. Bare metal is suitable if the software remains small, timing paths are calculable, communication protocols are limited, and no task isolation is required. As a rough guideline, very small MCU applications below approximately 64 KB often do not require an RTOS, while applications in the range of 1 MB often use RTOS-based structures. These sizes are not safety rules but technical rules of thumb for software scope and architectural form.

A standard RTOS is suitable if multitasking, communication stacks, drivers, debugging tools, or reuse are required, and the safety case can provide qualification for the RTOS portion used. The project must then provide the evidence for the specific RTOS version, configuration, and API usage. However, a fundamental weakness of freely available RTOSes is that fault revelation, i.e., displaying and categorizing faults in threads or tasks, is not intrinsically part of the software.

A safety RTOS is suitable when the verification effort for a standard RTOS would overburden the project or when partitioning, qualified artifacts, safety manuals, and predefined usage restrictions are required. The certificate is a building block in this process. It does not automatically answer whether the application meets its safety requirements.

Boundaries, Risks, and Open Issues

An RTOS can conceal timing problems if tasks are prioritized without runtime analysis. Polling then disappears from the superloop, but blocking times, interrupt latencies, and resource conflicts remain and still have to be measured and demonstrated. Communication stacks and drivers from an RTOS package also need to be examined for their influence on safety-related functions.

Bare metal can be easier to test as long as the software structure remains small. The limit is reached when self-developed scheduler logic, queues, task states, and synchronization mechanisms are introduced. Then, bare metal effectively becomes a project-specific RTOS replacement with less documentation and fewer pre-verified tools.

Open-source RTOSes are not automatically unsuitable for functional safety. The effort lies in qualification, configuration control, test coverage, change history, toolchain, and limiting the code used. For standards such as IEC 61508 or comparable safety frameworks, the project must demonstrate that the chosen software base can achieve the required safety integrity.

What follows from this for practice

The first question is not whether an RTOS should be present. The first question is whether the safety function demonstrably meets its timing, diagnostic, and fault reaction requirements with the chosen runtime architecture.

For bare metal, at least the following points must be reliably answered: maximum loop execution times, interrupt latencies, FTTI coverage, memory accesses, stack consumption, error responses, toolchain, and protection of safety-related data. As a rule, these points can be proven with reasonable effort. Bare metal therefore remains a viable architecture for systems with little communication.

For an RTOS, the task model, priorities, synchronization, memory areas, error hooks, drivers, communication stacks, and configuration must be included in the safety case. For a functional safety RTOS, manufacturer artifacts are added to support the evidence. These products are often "closed source", which means that customization may have to be purchased externally, i.e., from the manufacturer. For a standard RTOS, the project must bear the qualification burden itself. This, too, is usually cost- and personnel-intensive.

Schedulers, task systems, thread management, and hypervisors are only justifiable if they fulfill a concrete purpose within the system: structuring timing behavior, controlling resources, limiting error propagation, or separating functional domains. Without such a purpose, they increase the scope of verification without improving the safety argument.
