Safety-critical design engineering underpins many of the products and processes used in high-assurance applications. Hydrix has delivered safety-critical projects for more than 20 years, since the start of the company, with more recent examples ranging from implanted medical devices such as artificial hearts through to railway signalling and aerospace avionics.
These markets and projects are diverse, yet safety-critical developments share several common challenges. In this article we explore those challenges and how they can be successfully addressed by multidisciplinary, innovative development teams: teams whose diverse experience lets them intelligently adapt tools and techniques to alternative problem spaces, leading to novel approaches to safety design and the management of safety risks.
What is a safety-critical system?
A safety-critical system is defined as a system whose failure or malfunction may result in any of the following outcomes:
- death or serious injury to people
- loss or severe damage to equipment/property
- environmental harm.
Given the severity of these outcomes, it is no surprise that the development of safety-critical systems requires highly specialised expertise, coupled with a safety-focused development approach and culture.
In the context of this article, we are particularly interested in the development of software and hardware products that are used to realise safety-critical systems.
Our own experience as a contract R&D services company developing safety-critical solutions has shown that it is possible to develop cross-domain tools and techniques that provide a rigorous approach to safety-critical development. Often, it is the adaptation of methodologies, techniques and previously developed solutions from one industry that leads experienced developers to an efficient resolution of a safety-critical challenge in another industry.
The development of safety-critical hardware and software is conducted within a rigid regulatory framework. Understanding and implementing regulatory requirements and development standards in any specific highly regulated domain can be daunting, and the very thought of concurrently developing safety-critical products across multiple domains may seem altogether too difficult, but it doesn’t need to be viewed that way…
Synergies and opportunities exist that can enable flexible and agile organisations to merge individuals with deep domain-specific knowledge into highly effective teams that work together to derive innovative ways to manage safety and risk. The cross-pollination of ideas that this facilitates enables innovation while staying firmly within the strict boundaries of the regulated domains.
Examples of this commonality are evident when comparing quality standards such as ISO 9001 (Generic Quality Management Systems) and ISO 13485 (Medical Device Quality), and also when reviewing domain-specific design requirements such as DO-178C (Avionics Software), IEC 62304 (Medical Device Software) and EN 50128 (Railway applications).
Examples of Hydrix operating domains for safety-critical software development
While each domain is unique, common risk management strategies can be observed, and indeed, best practice methodologies are sometimes recognisable across domains. For example, most industries:
- Must comply with an organisational Quality Management System
- Conduct development activities in accordance with a controlled development lifecycle process
- Are required to comply with safety and risk management standards
- Comply with discipline-specific design lifecycle standards.
The rigour applied to the development process under these frameworks varies with each domain but is generally driven by the consequences of a failure or malfunction, coupled with an assessment of its likelihood. In the medical domain, assessments of likelihood are predominantly qualitative, with consequences considered at an individual level. In the rail sector, Safety Integrity Levels (SILs) are derived from probabilistic determinations of likelihood, with consequences often having a broad impact. Understanding the different treatments of risk under each domain is the key to a successful project outcome.
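The idea that rigour follows from consequence combined with likelihood can be sketched as a simple qualitative risk matrix. This is a minimal illustration only: the category names, the numeric scores and the acceptance thresholds below are assumptions made for the example and are not taken from ISO 14971, EN 50128 or any other standard.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity scale (not from any standard)
    NEGLIGIBLE = 1
    MARGINAL = 2
    CRITICAL = 3
    CATASTROPHIC = 4


class Likelihood(IntEnum):
    # Hypothetical qualitative likelihood scale
    IMPROBABLE = 1
    REMOTE = 2
    OCCASIONAL = 3
    FREQUENT = 4


def risk_class(severity: Severity, likelihood: Likelihood) -> str:
    """Classify a hazard from its severity and likelihood.

    The score is a plain product and the thresholds (9 and 4) are
    illustrative assumptions; a real project would take its matrix
    from the applicable standard and its own risk management plan.
    """
    score = severity * likelihood
    if score >= 9:
        return "unacceptable"
    if score >= 4:
        return "reduce as far as practicable"
    return "acceptable"


print(risk_class(Severity.CRITICAL, Likelihood.OCCASIONAL))
```

In a probabilistic domain the likelihood axis would typically be replaced by a quantified failure rate, but the structure of the decision stays the same.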
Establishing a common approach
At the heart of all safety-critical product development is structured systems engineering. The relationships between use cases, system and lower-level architecture(s), risk management activities, and requirements derived from safety standards, are typically managed by a systems engineering team. The controls derived as a result must be kept at the forefront of planning and development throughout the entire product development lifecycle.
A systems-based approach keeps development controlled and methodical. It provides precision in understanding the necessary standards, ensures suitable risk control measures are applied from early design planning, and maintains traceability of requirements and Risk Control Measures (RCMs) all the way to the final implementation.
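Traceability of this kind lends itself to automated checking. The sketch below assumes a hypothetical data model in which each RCM points to the requirements that capture it, and each requirement lists the test cases that verify it. The class names, field names and identifier formats are invented for illustration and do not represent any particular requirements management tool.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    req_id: str
    verified_by: list[str] = field(default_factory=list)  # test-case IDs


@dataclass
class RiskControlMeasure:
    rcm_id: str
    hazard_id: str
    requirement_ids: list[str] = field(default_factory=list)


def find_trace_gaps(rcms, requirements):
    """Return human-readable descriptions of broken trace links."""
    by_id = {req.req_id: req for req in requirements}
    gaps = []
    for rcm in rcms:
        if not rcm.requirement_ids:
            gaps.append(f"{rcm.rcm_id}: no requirement")
        for rid in rcm.requirement_ids:
            req = by_id.get(rid)
            if req is None:
                gaps.append(f"{rcm.rcm_id}: unknown requirement {rid}")
            elif not req.verified_by:
                gaps.append(f"{rid}: no verification evidence")
    return gaps


requirements = [Requirement("REQ-1", ["TC-1"]), Requirement("REQ-2")]
rcms = [
    RiskControlMeasure("RCM-1", "HAZ-1", ["REQ-1"]),
    RiskControlMeasure("RCM-2", "HAZ-1", ["REQ-2", "REQ-9"]),
    RiskControlMeasure("RCM-3", "HAZ-2"),
]
for gap in find_trace_gaps(rcms, requirements):
    print(gap)
```

Running a check like this on every baseline makes a missing link visible long before a certification audit would find it.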
Decomposition of the system architecture is necessary to understand the contribution of system components to the overall safety and function of the design. Viewing the design as a sum of components encourages the development team to consider each function or block individually, and to design an architecture that inherently supports both the applicable design regulations, and the acceptable risk profile (as they pertain to the particular domain).
Looking to each of the relevant domain standards, there is a common thread of design control that typically follows a pattern of:
- Establishing a Project Management Plan
  - Document the Safety Approach
  - Identify Applicable Safety Standards
  - If required, conduct a standards gap-analysis
  - Identify any special training required for the team
- Following best Systems Engineering practices
  - Decompose the system
  - Document the System Architecture and detailed design
  - Document the requirements
  - Implement the system
  - Integrate components
  - Verify the system
- Undertaking risk and hazards analyses
  - Identify hazards and risks
- Responding to identified safety risks and hazards with design countermeasures
  - Determine risk control measures (RCMs)
  - Capture safety needs as requirements
  - Assign design assurance/safety integrity levels (DAL/SIL)
- Confirming adequate mitigations are delivered and functional
  - Verify all RCMs have been implemented
- Delivering all essential safety artefacts throughout the development process
  - System Safety Plan
  - Hazard Log
  - Hazard Analysis Report
  - Safety controls captured as requirements
  - Verification plans and reports for all (safety) requirements
  - Certification Deliverables
Common engineering activities undertaken during safety-critical design
An architecture developed and managed in this way will facilitate a risk management approach from the start of development activities. Establishing this mindset from the outset sets the direction for the entire development lifecycle.
Common safety issues to overcome
Across guidance in the regulatory standards, and indeed in practice, the failure modes to consider are often similar. Examples of failure modes that can lead to hazards include:
- Single component failure
- Design margin inadequacy
- Unexpected operating conditions
- Third-party malicious intent
- Foreseeable misuse
- Third-party software systems
- Physical damage.
Failure modes may be specific to a particular domain, in which case they may require specific domain experience from the team and external subject matter experts. Alternatively, some failure modes may be common across domains, but the measures taken to mitigate or overcome them may vary according to specific domain or product class needs.
Where a failure can lead to an unacceptable hazard, a Risk Control Measure (RCM) must be implemented to reduce or remove the likelihood of occurrence, and/or the effect of the failure should it occur. Even after implementing an RCM as a primary risk mitigation, residual risk may remain. It is therefore essential to evaluate potential consequences of any such residual risk, and to further mitigate if possible. This cycle of analysis, evaluation, and mitigation must continue until the hazard has been reduced to a level safe for the domain and application.
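The cycle of analysis, evaluation and mitigation can be sketched as a loop that keeps applying RCMs until the residual risk meets the acceptance target. This is a deliberately simplified model: it assumes each RCM independently scales the hazard rate by a fixed factor, and the rates, factors and RCM names are hypothetical. Real analyses must also account for common-cause failures and for new hazards introduced by the RCMs themselves.

```python
def reduce_risk(initial_rate, acceptable_rate, rcms):
    """Apply RCMs in order until the residual rate is acceptable.

    rcms is a list of (name, reduction_factor) pairs, where a factor
    of 0.01 means the RCM is assumed to cut the hazard rate to 1%.
    Returns (residual_rate, applied_rcm_names, is_acceptable).
    """
    residual = initial_rate
    applied = []
    for name, factor in rcms:
        if residual <= acceptable_rate:
            break  # target met; no further mitigation needed
        residual *= factor
        applied.append(name)
    return residual, applied, residual <= acceptable_rate


residual, applied, ok = reduce_risk(
    initial_rate=1e-3,      # hypothetical hazard rate per hour
    acceptable_rate=1e-6,   # hypothetical target for the domain
    rcms=[("watchdog", 0.01), ("redundant sensor", 0.01), ("alarm", 0.1)],
)
print(applied, ok)
```

If the loop runs out of candidate RCMs while the result is still not acceptable, the design itself has to change; that outcome is exactly what the final flag is there to expose.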
Industry-based standards can provide guidance for common hazards. For example, ISO 14971 – Application of Risk Management to Medical Devices, lists hazards often associated with medical devices.
Common risk mitigation methods
RCMs are identified through a rigorous process and must be strictly documented, tracked, implemented, and tested in the ensuing development activity. Again, countermeasures to common risk categories, and the verification techniques used to measure their effectiveness, may be shared across domains, with the depth of protection applied and the degree of confirmation required typically dependent on the specific domain.
The approaches and tools that are commonly used across domains include:
- Process related:
- Requirements and Traceability Matrices (RTM, VCRM)
- Test coverage targets and reports
- Technical design reviews
- Tool validation
- Qualified model-based development and auto-code generation
- Application of coding standards and reviews
- Static code analysis
- Dynamic code analysis
- Fault-tolerant designs
- Built-in test
- Component de-rating
- Component selection
- Use of pre-certified components
- Design simulation
- Digital twin and Monte Carlo simulations
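As one illustration of the simulation items in this list, a Monte Carlo run can estimate how often component tolerances push a design outside its margin. The sketch below uses a resistive voltage divider as a stand-in circuit. The component values, the uniform tolerance distribution and the output limits are all assumptions chosen for the example.

```python
import random


def divider_out_of_spec_fraction(v_in, r1, r2, tol, v_min, v_max,
                                 trials=10_000, seed=0):
    """Estimate the fraction of builds whose divider output is out of spec.

    Each resistor value is drawn uniformly within +/- tol of nominal
    (an assumption; real parts may follow other distributions).
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    failures = 0
    for _ in range(trials):
        a = r1 * (1 + rng.uniform(-tol, tol))
        b = r2 * (1 + rng.uniform(-tol, tol))
        v_out = v_in * b / (a + b)
        if not (v_min <= v_out <= v_max):
            failures += 1
    return failures / trials


# Nominal 2.5 V output with a hypothetical 2.45 V to 2.55 V window
for tol in (0.01, 0.05):
    frac = divider_out_of_spec_fraction(5.0, 10_000, 10_000, tol, 2.45, 2.55)
    print(f"tolerance {tol:.0%}: {frac:.1%} of builds out of spec")
```

With 1% parts the worst case stays inside the window, while 5% parts fail in a substantial share of builds. Evidence of this kind feeds directly into component selection and de-rating decisions.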
The value of a Culture of Innovation
Regulatory standards are the starting point for all safety-critical design projects, and a structured and actively managed QMS ensures that processes are in place and followed. However, another crucially important aspect of safety-critical design is team culture and in particular creative or innovative thinking.
While rigorous and systematic analysis of a product design and architecture is essential, it is only the first step in the process of efficient product delivery. Only a team with engineering curiosity, experience in multiple domains, a willingness to explore ‘corner cases’, and a culture that rewards searching for the best possible solutions can generate the most efficient, effective, and timely outcomes in safety-critical design projects.
Astronaut Frank Borman aptly illustrated this during an inquiry into the Apollo 1 fire, when he noted that, despite the pressure to meet deadlines, safety had never been intentionally compromised. When asked “what caused the fire?”, Borman’s response was “a failure of imagination”. The exact scenario hadn’t been conceived when the hazards were considered.
Safety-critical product design is a challenging activity and is often thought of as being a practice unique to the application domain. However, by identifying significant overlaps in the types of potential hazards, evaluating the types of mitigation strategies available, applying rigorous methodologies for analysis and evaluation of risks, and using creativity in the exploration of use scenarios and potential failure modes, the Hydrix team has evolved safety-critical design as an advanced engineering offering.
A broad portfolio of safety-critical applications has given our teams a very strong library of experiences. This allows us to identify where risk mitigation strategies can be cross-pollinated between domains, and enables our team to confidently deliver significant value to our clients and end users.
While our experience in safety-critical systems development gives us a broad ‘library’ of knowledge that can quickly lead to finding efficient risk mitigation solutions, structured innovation processes and skills can be the final piece of the puzzle. In some instances, the ‘best’ solution isn’t obvious or easily identified. In these cases, a team must be fully aware of the constraints within which they are working, and at the same time be flexible enough to consider innovative solutions. A structured innovation approach is essential in these situations as it both controls the solution boundaries and ensures the innovation process is applied in the way most appropriate for the situation, thereby minimising the risk of creating unexpected consequences, and driving efficiency.