A resilient system is one that can survive and recover from damage caused by disruptive events or mishaps. The concept that incorporates resiliency into engineering practices is known as engineering resilience. To date, engineering resilience is still predominantly application-oriented. Despite the growing use of the engineering resilience concept, the diversity of its applications across engineering sectors complicates universal agreement on its quantification and associated measurement techniques. There is a pressing need to develop a generally applicable engineering resilience analysis framework that standardizes the modeling, assessment, and improvement of engineering resilience across engineering disciplines. This paper provides a literature survey of engineering resilience from the design perspective, with a focus on engineering resilience metrics and their design implications. The currently available engineering resilience quantification metrics are reviewed and summarized, the design implications for the development of resilient-engineered systems are discussed, and the challenges of incorporating resilience into engineering design processes are evaluated. The study is intended to serve as a building block toward a generally applicable engineering resilience analysis framework that can be readily used for system design.
Introduction
Change occurs perpetually in life. For an engineered system to adapt to changes, this ability has to be designed into the system. This practice is also known as engineering resilience. To promote a better understanding of engineering resilience, there are several basic questions that should be considered: (1) What is engineering resilience? (2) Why is engineering resilience necessary? (3) Where could engineering resilience be implemented? (4) When is engineering resilience desired? and (5) How can engineering resilience be modeled and quantified, and used to improve the design of engineered systems?
Primarily popularized by researchers in the field of ecology, resilience in an ecosystem is defined as the speed with which an ecosystem returns to its equilibrium state following a perturbation [1]. This idea of “speed of returning to equilibrium” has influenced the origin of the engineering resilience concept [2]. In engineering, speed of returning to equilibrium is typically associated with: (1) how fast an engineered system can adapt to deviation following a misfortune and/or (2) how swiftly an engineered system can be restored from its disrupted states. Engineering resilience is the concept that fuses resilience ability into engineering practices. Resilience in engineering implies the ability of an engineered system to autonomously sense and respond to adverse changes in health conditions, to withstand failure events, and to recover from the effects of these unpredicted events [3]. A resilient system, from the perspective of the U.S. Department of Defense as reported in the literature [4], is a system that exhibits specific resilience properties, such as the ability to repel, resist, or absorb, the ability to recover, and the ability to adapt. A survey of the definitions of resilience that have been reported in different disciplines can be found in Refs. [5,6]. Engineering resilience has been sought as an alternative or as a complement to the traditional view of system safety to endure the possibility of failure [7–9]. The resilience of engineered systems has been addressed from many different angles, leading to the fast-growing engineering discipline referred to as “engineering resilience,” sometimes also called “resilience engineering” in the engineering community.
The continuous pursuit of better, safer, and longer lasting engineered systems has driven continuous growth in the complexity and scale of engineering systems [3,10]. Because they operate in unpredictable and uncertain conditions, complex engineered systems may require extraordinarily high safety precautions in design to account for unforeseen failure modes, such as those induced by natural disasters. However, in the early design stage, it is very challenging, if not impossible, for system designers to determine all the possible failure modes. Thus, considerable attention has been given to engineering resilience, which needs to be designed into engineered systems in order to cope with system complexity and unforeseen failure modes.
To date, the engineering resilience concept has been implemented in a wide range of engineering disciplines. Many of these implementations are associated with large, interconnected, complex systems, such as transportation systems [11–19], power systems [5,20–24], production systems [25–30], multitier supply chains [3,25,31–39], general infrastructure systems [5,20,40–47], health care systems [48–51], and many more. The implementation of engineering resilience is not limited to complex systems; the concept can also be applied to single mechanical design systems, such as aircraft actuators [52], aircraft controllers [53–55], or computer numerical control machining systems [56].
Traditional research efforts focused on developing systems with high reliability to prevent failures. Although the high reliability concept has improved system performance, there are two main reasons why high reliability is no longer sufficient in some instances: (1) High reliability is costly. Improving reliability in a system typically involves backup, redundant, or standby systems and/or components, all of which incur additional costs. These costs increase substantially as the system reliability level approaches the maximum achievable reliability. At some point, it is no longer economical to improve system reliability further because the law of diminishing returns applies. (2) Failures could be inevitable in many engineering applications, even with very high system reliability. For instance, a failure event with a zero probability of failure could still occur in engineering practice, as suggested by probability theory. In addition, there are cases where the damage caused by failure events is unavoidable and uncontrollable, especially for failure events induced by nature. Engineering resilience has presented itself as a turning point in recent research efforts toward a more systematic way of addressing failures of engineering systems. When achieving higher system reliability is no longer affordable and failure is inevitable, engineering resilience offers the ability to survive failures and to recover from calamities. Resilience is particularly appropriate when the system is expected to survive and recover from low-frequency, high-impact disruptions [57].
Although engineering resilience has gained popularity among designers, engineers, and practitioners, a consensus on how engineering resilience can be designed, quantified, and improved in engineered systems has not yet been reached. This may be partly because engineering resilience, in its implementation, depends heavily on the application context. It depends on the architecture of the system, the operating conditions, the type of disruptive events, and the magnitude of damage [57]. Different systems may be designed to be resilient to different disruptions, which would most likely require different approaches. The key question is how engineering resilience can be translated into unambiguous, quantifiable measures. To design resilience into a system, a set of actions describing resiliency can then be interpreted in the same quantifiable measures as engineering resilience. Once a proper way to quantify engineering resilience has been identified, system designs and operations can be modified to improve resilience.
This paper provides a literature survey of existing studies in engineering resilience from a system design perspective, with a focus on engineering resilience metrics and their design implications. It aims to offer the engineering design community a better understanding of the engineering resilience concept and to help promote further development of generally applicable resilience quantification metrics, resilience analysis methodologies, and resilience design tools. These potential developments are expected to be applicable to a broad range of applications in the design of resilient-engineered systems. The rest of the paper is structured as follows. The conceptual attributes of an engineering resilience curve are first presented in Sec. 2, a survey of the available resilience quantification metrics is presented in Sec. 3, the design implications of engineering resilience are then discussed in Sec. 4, and the conclusions are summarized in Sec. 5.
Engineering Resilience Curve
Most engineered systems are exposed to uncertain, unpredictable, and potentially harsh operating conditions, which alter the system performance level over time, P(t). Figure 1 shows the performance behavior of a resilient-engineered system compared to that of a nonresilient-engineered system, after being subjected to a disruptive event.
A resilient-engineered system possesses the ability to recover the system performance level from its disrupted state to its operating state, as indicated in Fig. 1(a). On the other hand, a nonresilient-engineered system may gradually decline toward a significantly lower performance level due to an unexpected disruptive event. Depending on the inherent capabilities of the system to withstand mishaps, the system may reach an unhealthy or degraded stable state (Fig. 1(b)). This scenario is indicated by a lower performance level (Pv). If the system cannot survive the disruption, it will continue to worsen until the system reaches a complete failure or collapse state (Fig. 1(c)). From Fig. 1, it is apparent that engineering resilience is favorable when the system is subjected to disruptive events.
Since resilience is generally associated with the loss of system performance after a disruptive event, a resilience curve is typically represented as a system performance curve, P(t), plotted against time, t. In general, there are four states in the timeline of the engineering resilience concept. As illustrated in Fig. 2, these four states are briefly explained as follows:
(1) Reliability state (SI): Baseline or original state, when the system operates normally at Po before the occurrence of disruptive events.
(2) Unreliability state (SII): Vulnerable state, when the system degrades to Pv following a disruptive event at time td.
(3) Recovery state (SIII): The system improves its performance functions as a result of restorative efforts. Restoration begins immediately at tv and continues until tn.
(4) Recovered steady state (SIV): System performance reaches a newly recovered steady state after successfully completing the recovery state at time tn.
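The four states can be sketched as a piecewise-linear performance function. This is a minimal illustration rather than a model taken from the surveyed literature: the function name `performance`, the straight-line degradation and recovery segments, the optional recovered level `p_r`, and the example values in the usage note are all assumptions.

```python
def performance(t, p_o, p_v, t_d, t_v, t_n, p_r=None):
    """Piecewise-linear P(t) across states S_I to S_IV (illustrative only)."""
    if not (t_d <= t_v < t_n):
        raise ValueError("expected t_d <= t_v < t_n")
    if p_r is None:
        p_r = p_o  # assumption: full recovery to the baseline by default
    if t < t_d:
        # S_I, reliability state: normal operation at the baseline P_o
        return p_o
    if t < t_v:
        # S_II, unreliability state: linear degradation from P_o to P_v
        frac = (t - t_d) / (t_v - t_d)
        return p_o + frac * (p_v - p_o)
    if t < t_n:
        # S_III, recovery state: linear restoration from P_v to p_r
        frac = (t - t_v) / (t_n - t_v)
        return p_v + frac * (p_r - p_v)
    # S_IV, recovered steady state
    return p_r
```

For example, `performance(3, 1.0, 0.4, 2, 4, 8)` evaluates the curve halfway through the degradation segment of a system that drops from 1.0 to 0.4 between t = 2 and t = 4.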
There are many variations of the engineering resilience curve apart from the one illustrated in Fig. 2. The various versions of engineering resilience curves originate from different perspectives that are mostly for conceptual and qualitative illustration of resilience in the application of interest. These variations are mostly due to differences in the unreliability profile and the recovery profile for different engineering applications. As a disruptive event typically varies in terms of severity and duration, the recovery response may also vary in different scenarios [36]. Figure 3 shows some examples of the conceptual attributes which lead to various forms of the engineering resilience curve.
Following a disruptive event, the impact level captures the severity of the event on the system performance. Impact level could be measured through the difference between the initial performance level and the performance after the disruptive event (Po − Pv) [36,60].
The unreliability profile and the degree of unreliability (θ) vary with the impact level and the inherent ability of the system to survive a disruption. Figure 3 shows three different unreliability profiles (u1, u2, u3). The first unreliability profile (u1) exhibits a sharp vertical performance drop (θu1 = 0 deg). In this scenario, the system is interpreted as unable to endure the impact of a disruption, where the disruption may be unavoidable, sudden, and destructive. The second unreliability profile (u2) shows a gradual decrease in system performance that settles into a stable disrupted state before the recovery takes place. In the literature [18,20,61], this scenario is often referred to as a five-state resilience curve, as depicted in Fig. 4. The third unreliability profile (u3) shows a gradual decline in system performance followed by immediate recovery. Since engineering resilience is associated with swift recovery action, the recovery action should take place as soon as the system has sensed a continuous drop in system performance due to a disruptive event. The recovery action should be proactive and preferably triggered before the system settles into a stable disrupted state, as depicted in u2. The system unreliability state and disrupted state in Fig. 4 can generally be treated as one unreliability state, as both states exhibit a nonoptimal performance level. Note that θu3 > θu2, which indicates that u3 endures the impact of the disruption better than u2. The performance loss area of u3 is smaller than that of u2, although recovery in both scenarios takes place at the same time tv in Fig. 3.
The degree of recovery (γ) determines how much system performance can be recovered. Although some failure events cannot be foreseen, engineering resilience offers swift recovery abilities to return the system performance function rapidly to its ideal operating condition (SIV). As seen in Fig. 3, there are three possible outcomes: SIV could be improved (higher than baseline), stabilized (same as baseline), or deteriorated (lower than baseline), depending on the built-in resilience ability of the system and the availability of required resources. The unreliability and recovery profiles in most resilience curves in this paper are drawn as straight lines for simplicity. In practical engineering applications, due to the presence of uncertainties, both unreliability and recovery profiles are more likely to exhibit nonlinear behavior. In some cases, convex and concave profiles are also observed [36,51]. Figure 5 shows four representative behaviors of a recovery profile.
Resilience Quantification Metrics
Quantification of engineering resilience plays an important role in defining the resilience of an engineered system and in applying the resilience concept in the engineering design process. Although it has been explored in diverse engineering disciplines, to date, available engineering resilience quantification metrics still exhibit very little standardization. Agreement on a general quantifiable measure remains a challenge. Many different approaches and aspects (including uncertainties) should be taken into consideration when quantifying engineering resilience. Depending on the application of interest, quantification metrics could be classified as deterministic–probabilistic and/or static–dynamic [62].
In this section, the available metrics are grouped according to the approach used to derive them. Some metrics could fit into more than one category. Every available resilience quantification metric has strengths and weaknesses, depending on the purpose of the study and the application of interest. A compilation of resilience metrics reported in the literature is provided in this section to show the diversity of the available metrics. These resilience metrics are grouped into three categories, namely, (1) resilience curve, (2) pre- and postdisruption performances, and (3) reliability and restoration, which are detailed below. Note that the list of resilience quantification metrics provided in this paper is not exhaustive.
Resilience Metrics Based on Resilience Curve.
Since the resilience curve is often used to illustrate the resilient behavior of an engineered system undergoing a disruptive event, many researchers have used the properties of the resilience curve to quantitatively measure the resilience level of the system. In the resilience curve, the area of concern is the shaded area in Fig. 3 or Fig. 6. This area is also referred to as the “impacted area (IA),” which defines the performance loss after a disturbance or disruptive event. If the area is enclosed by a nonlinear recovery profile, the performance loss can be approximated using the integral method. In Ref. [63], loss of resilience (Ψloss) is denoted as performance loss. Ψloss can be quantified by the magnitude of the expected degradation in performance quality over recovery time, mathematically expressed in the following equation:
where Po(to) is the initial performance function before a disruptive event at time (td), and P(t) is the performance quality of a system which varies with time.
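The performance loss described above can be approximated numerically from a sampled performance curve. The following is a minimal sketch, assuming the loss is the area between the baseline Po(to) and P(t), integrated from the disruption time td to the recovery time tn with the trapezoidal rule. The function name `performance_loss` and the sampled-curve inputs are illustrative and are not taken from Ref. [63].

```python
def performance_loss(times, perf, p_o):
    """Trapezoidal estimate of the area between the baseline p_o and P(t).

    times: sample times from t_d to t_n, in increasing order
    perf:  sampled performance P(t) at those times
    p_o:   baseline performance before the disruption
    """
    if len(times) != len(perf) or len(times) < 2:
        raise ValueError("need two or more matching samples")
    loss = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        gap_prev = p_o - perf[i - 1]  # shortfall at the start of the interval
        gap_curr = p_o - perf[i]      # shortfall at the end of the interval
        loss += 0.5 * (gap_prev + gap_curr) * dt
    return loss
```

A triangular dip, for instance `performance_loss([0, 1, 2], [1.0, 0.5, 1.0], 1.0)`, gives the area of the triangle formed by the drop and the recovery.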
The system performance does not necessarily drop as steeply in the aftermath of a disruptive event as illustrated in Fig. 6. Between td and tv, the system may experience a gradual performance degradation, as illustrated in Fig. 7. Most gradual performance drops exhibit nonlinear behavior. For nonlinear unreliability and recovery profiles, resilience can be explained as the functional capability of a system following a hazard over the control period (T = tn − td). As mathematically shown in Eq. (3), Ψ can be quantified as the normalized shaded region under the system response (describing the functionality of a system) after a disruptive event, denoted as AP(t) in Fig. 7 [65].
In cases where BP(t) is measured on a relative scale and assumed to be 100% (in other words, a constant value of 1.0), the integral in the denominator of Eq. (4) evaluates to T*, and thus, in this specific case, Eq. (4) can be rewritten as Eq. (3). Note that the time period proposed by Renschler et al. [65] in Eq. (3) is different from the one in Eq. (4).
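The reduction described here can be checked numerically. The sketch below assumes that resilience is the ratio of the area under AP(t) to the area under BP(t) over the same time window, and that BP(t) is the reference (undisrupted) performance; both assumptions, along with the helper names `trapezoid` and `resilience_ratio`, are illustrative. With BP(t) fixed at 1.0, the denominator equals the window length, so the ratio becomes the time-averaged performance.

```python
def trapezoid(times, values):
    """Area under a sampled curve using the trapezoidal rule."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need two or more matching samples")
    area = 0.0
    for i in range(1, len(times)):
        area += 0.5 * (values[i - 1] + values[i]) * (times[i] - times[i - 1])
    return area


def resilience_ratio(times, actual, reference):
    """Area under the actual response divided by area under the reference."""
    denom = trapezoid(times, reference)
    if denom <= 0:
        raise ValueError("reference area must be positive")
    return trapezoid(times, actual) / denom


times = [0.0, 2.0, 4.0]
actual = [1.0, 0.5, 1.0]          # hypothetical response after a disruption
reference = [1.0, 1.0, 1.0]       # constant reference of 1.0 (100%)
window = times[-1] - times[0]

ratio = resilience_ratio(times, actual, reference)
time_average = trapezoid(times, actual) / window  # normalized-area form
```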
where E[IA] is the expected impact area caused by the disruptive event. E[IA] accounts for all the possible damage intensities. λ is the occurrence rate of the disruptive event per year. P(t), AP(t), and BP(t) could be either deterministic or stochastic variables, depending on the application of interest. To mimic reality, stochastic variables are preferred in resilience analysis because they incorporate probabilities, randomness, and uncertainties.
In addition to the performance loss, other resilience dimensions could also be derived from a resilience curve. Figure 8 depicts five resilience dimensions: recovery, impact, performance loss, recovery profile function (f(t)), and weighted-sum (g(t)), as proposed in Ref. [36]. The description of each dimension and the corresponding equations are listed in Table 1. The resulting resilience value could be calculated by the summation of the weighted resilience dimensions in Eq. (7), where w1,…,5 are the weights corresponding to the dimensions of resilience.
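A weighted summation of this kind can be sketched as follows. The sketch assumes that each dimension has already been converted to a score in [0, 1] and that the weights are normalized by their total; neither assumption comes from Ref. [36], and the function name `weighted_resilience` and the example scores are hypothetical.

```python
def weighted_resilience(scores, weights):
    """Weighted sum of resilience-dimension scores, normalized by total weight."""
    if len(scores) != len(weights) or not scores:
        raise ValueError("scores and weights must be non-empty and equal length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    weighted = sum(w * s for w, s in zip(weights, scores))
    return weighted / total_weight  # assumption: weights rescaled to sum to 1


# Hypothetical scores for five dimensions, each already scaled to [0, 1].
example_scores = [0.9, 0.6, 0.7, 0.8, 0.5]
example_weights = [1, 1, 1, 1, 1]  # equal importance, purely illustrative
example_value = weighted_resilience(example_scores, example_weights)
```

Changing the weights shifts the emphasis between dimensions, which is one reason such a metric remains application-specific.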
Resilience Metrics Based on Pre- and Postdisruption Performances.
Engineering resilience is often associated with the performance loss of a system undergoing a disruptive event. Therefore, one approach to quantifying resilience is to measure performance changes, where resilience metrics could be represented as the ratio of system performance before (pre-) and after (post-) disruption. Expressing resilience based on system performance is highly application-specific, as different applications generally have different performance functions. In addition, there are many cases where a single application can be described by multiple performance functions. For example, in a networked system, the performance function could be characterized in various ways, such as the flow/delivery value in a network, the system travel time (STT), the demand that has to be satisfied, etc.
where Ψi is the resilience of the demand node i, pj is the reliability of supply node j, qk is the reliability of supply link k, di is the demand quantity of demand node i, sj is the availability of supply node j, and ck is the capacity of supply link k, respectively.
In general, the maximum performance drop represents the worst case scenario that could happen for a system as the postdisruption effect, as shown in Fig. 9, where the worst case scenario has been denoted by Pmax. Based upon the worst case scenario, a resilience index was defined as the ratio of the avoided performance drop postdisruption and the potential maximum performance drop [60], as expressed in Eq. (12) mathematically.
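The ratio described above can be sketched directly. The sketch assumes both quantities are expressed as performance drops measured from the predisruption level, so the avoided drop is the maximum drop minus the actual drop. The function name `resilience_index` and its argument names are illustrative and are not taken from Ref. [60].

```python
def resilience_index(actual_drop, max_drop):
    """Avoided performance drop divided by the potential maximum drop."""
    if max_drop <= 0:
        raise ValueError("max_drop must be positive")
    if not 0 <= actual_drop <= max_drop:
        raise ValueError("actual_drop must lie between 0 and max_drop")
    avoided_drop = max_drop - actual_drop
    return avoided_drop / max_drop
```

Under these assumptions the index is 1 when no performance is lost and 0 when the worst case scenario is realized.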
Resilience Metrics Based on Reliability and Restoration.
in which the capacity restoration (ρ) can be considered as the degree of reliability recovery. The reliability and restoration can be derived as a set of conditional probabilities. The restoration in Eq. (16) was further quantified as a conditional probability of a system failure event (1 − R), a correct diagnosis event (ΛD), a correct prognosis event (ΛP), and a mitigation/recovery action success event (κ) [52].
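The relationship between reliability and restoration can be illustrated with a short sketch. It assumes that the four probabilities combine multiplicatively to give the restoration, and that resilience is the sum of reliability and restoration; both are assumptions made for illustration, as are the function names `restoration` and `resilience`.

```python
def restoration(reliability, diagnosis, prognosis, recovery_success):
    """Restoration as (1 - R) * Lambda_D * Lambda_P * kappa (assumed product form)."""
    for p in (reliability, diagnosis, prognosis, recovery_success):
        if not 0.0 <= p <= 1.0:
            raise ValueError("all inputs must be probabilities in [0, 1]")
    failure = 1.0 - reliability  # probability of a system failure event
    return failure * diagnosis * prognosis * recovery_success


def resilience(reliability, diagnosis, prognosis, recovery_success):
    """Resilience as reliability plus restoration (assumed additive form)."""
    return reliability + restoration(
        reliability, diagnosis, prognosis, recovery_success
    )
```

With perfect diagnosis, prognosis, and recovery, every failure is restored and the resilience value reaches 1 regardless of the reliability; with no recovery capability, resilience falls back to the reliability alone.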
Reliability has also been expressed in terms of damage or performance loss in many available resilience quantification metrics. Resilience was quantified in Ref. [69] as Pr(A|i), which is the conditional probability that the system will meet predefined system performance standards (A) after the disruptive event i. The performance standards introduced in Ref. [69] include robustness (r*) and rapidity (t*). Robustness has been defined as the maximum acceptable loss, which can be considered as the ability of the system to endure failure or ensure reliability. Rapidity has been defined as the maximum acceptable disruption time, or maximum time to full recovery. After a disruptive event, the initial loss (ro) and the time to full recovery (tn) should not exceed the performance standards, as shown in Fig. 10, where ro < r* and tn < t*. With this resilience quantification metric, a resilience objective can be set in which Pr(A|i) should also meet a reliability goal R*.
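A probability of this kind can be estimated from simulated or observed outcomes. The sketch below assumes that each outcome is a pair of initial loss and time to full recovery, that the standards are met when neither exceeds its threshold, and that the objective is satisfied when the estimated probability is at least R*. The function names and the sample values are hypothetical and are not taken from Ref. [69].

```python
def estimate_resilience(samples, r_star, t_star):
    """Fraction of (initial_loss, recovery_time) samples meeting both standards."""
    if not samples:
        raise ValueError("need at least one sample")
    met = sum(1 for r_o, t_n in samples if r_o < r_star and t_n < t_star)
    return met / len(samples)


def meets_objective(samples, r_star, t_star, reliability_goal):
    """True when the estimated Pr(A|i) reaches the reliability goal R*."""
    return estimate_resilience(samples, r_star, t_star) >= reliability_goal


# Hypothetical outcomes for one disruptive event: (initial loss, recovery time).
outcomes = [(0.1, 5.0), (0.5, 5.0), (0.1, 20.0), (0.2, 8.0)]
estimate = estimate_resilience(outcomes, r_star=0.3, t_star=10.0)
```

In practice the samples would come from a system simulation or from field data, and the estimate becomes more reliable as the number of samples grows.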
where Sp is the speed recovery factor, Fr is the performance at the recovered stable state, Fd is the performance level immediately postdisruption, and Fo is the original stable system performance level predisruption. Fd/Fo and Fr/Fo are deemed to be the absorptive capacity and the adaptive capacity of the system, as discussed in Ref. [5]. In this scenario, absorptive capacity can be considered as reliability whereas the adaptive capacity can be considered as restoration of reliability losses.
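These quantities can be combined in a short sketch. It assumes that the metric is the product of the speed recovery factor, the absorptive capacity Fd/Fo, and the adaptive capacity Fr/Fo, and it treats Sp as a given input rather than deriving it. The product form and the function name `resilience_factor` are assumptions made for illustration.

```python
def resilience_factor(s_p, f_r, f_d, f_o):
    """Speed recovery factor times absorptive and adaptive capacity (assumed form)."""
    if f_o <= 0:
        raise ValueError("original performance f_o must be positive")
    absorptive = f_d / f_o  # share of performance retained right after disruption
    adaptive = f_r / f_o    # share of performance at the recovered stable state
    return s_p * absorptive * adaptive
```

For example, a system that retains half of its original performance immediately after a disruption and recovers to 90% of it yields `resilience_factor(1.0, 90.0, 50.0, 100.0)`, which scales linearly with the speed recovery factor.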
where the failure profile (F) and the recovery profile (ρ) are measured based on the failure event (f) and the recovery event (p), respectively, over the performance P(t). The time notations are labeled in Fig. 11 accordingly. Moreover, the efficiency of the system prior to disruption, Eo, is also believed to have an effect on the recovery process. Resilience has been quantified in Ref. [72] for civil infrastructures under earthquake disruptions as the recovery over the loss of efficiency by taking into account Eo, the measure of damage incurred (Pd) after a disruptive event, and the measure of the recovery process (Pρ). The resulting resilience metric was then formulated as
where E(Pρ) indicates the efficiency of the recovery curve.
Resilience Scale.
Although resilience has been quantified in different manners for different application purposes, as discussed in Secs. 3.1–3.3, it is important for the community to agree on a scale on which the resilience of an engineered system can be measured. Such a scale facilitates resilience analysis and the assessment of resilience performance for different system design alternatives. A resilience scale allows one to evaluate how much resilience has been gained or lost in a system. As reported in the literature, most resilience metrics use a resilience scale between 0 and 1 [25,52,65], which may also be expressed as a percentage between 0% and 100%. Quantifying resilience based on different system performances of interest with a universal scale between 0 and 1 could reduce the complications introduced by the many different resilience metrics and help reach a generally applicable quantity.
First, as resilience could also be considered as one of the system characteristics, it is more convenient to quantify it on a relative scale based upon the performance changes before and after a disruptive event. In addition, when uncertainties are incorporated in resilience analysis, probabilistic resilience metrics can be used, which generally take a probability value between 0 and 1. By using a resilience scale, a resilience value could be interpreted based on system performance recovery after a disruptive event or based on the probability that the system would survive or recover from the disruptive event. For example, a system that has a resilience value of 0.9 can be interpreted as being 90% resilient to a particular disruptive event. Specifically, it could indicate a 90% probability that the system will survive a given disruptive event or recover to a predefined system performance within a given time period after the disruptive event.
From the resilience scale perspective, success in engineering resilience would point toward the ability of a system to sense the changes in health conditions, to prevent and/or survive the likelihood of damage, and to recover from the postdisruption effects successfully. Failures in engineering resilience imply the inability of a system to adequately adapt to changes following a mishap, instead of system breakdowns or malfunctions [73]. In addition, while there are multiple potential disruptions, an engineered system may possess different resilience performance toward different disruptive events. Depending on the severity of disruptions, the system could be more resilient to one type of disruption, but not to other types [57].
Engineering Design Implication
Based on the surveyed resilience quantification metrics, what engineering resilience has to offer from a system design perspective will be discussed in this section. Considering the perception of failure probability, a certain level of resiliency can be designed into a system to improve the system performance against disruptive events, as depicted in Fig. 12. In order to develop a high-resilience and low-cost engineered system from the system design perspectives, there are two questions with regards to integrating resilience in engineered systems: (1) How to connect the resilience quantification metrics to system design parameters, thereby assessing the resilience of different design alternatives? and (2) What resilience strategies can be used in engineering design to generate design alternatives and improve resilience of engineered systems? These two key questions will be further discussed as follows. Section 4.1 discusses the resilience attributes in general from a design perspective, Sec. 4.2 discusses predictive resilience analysis of system design alternatives, Sec. 4.3 describes potential resilience strategies that could be used in design in order to improve the resilience of an engineered system, and Sec. 4.4 provides the discussion for the challenges and further research needs in design for resilience.
Resilience Attributes for Design.
For a system to be resilient against disruptive events or potential failures, there are two essential properties that a system should possess before or after the occurrence of a perturbation, as shown in Fig. 13. The first is the ability of the system to maintain function without failures, generally referred to as “reliability.” The second is the ability of the system to recover from misfortunes, referred to as “recovery” or “recoverability.” These two key attributes of resilience could be designed and engineered to enable failure resilience in an engineered system. Reliability and recovery attributes have also been viewed as passive and proactive survival rates [25,52], static and dynamic resilience [57], or absorptive capacity and adaptive capacity [5].
Considering the resilience quantification metrics suggested by the resilience curve, a resilient-engineered system can be designed by minimizing the performance losses for a given disruptive event. This design strategy can be further realized through reducing the impact of a disruptive event, such as reducing the magnitude and duration of performance losses, or increasing the speed of recovery. Similar implications can be drawn from the resilience quantification metrics based on the pre- and postdisruption performances. Although these metrics provide a conceptual representation of resilience in a straightforward manner, incorporating them into engineering design is still very challenging. Due to the growing complexity of an engineered system, as well as the difficulty of precisely knowing how a system would respond to the disruptive event at the early stage of system design, it would be very challenging to measure the resilience level for different design alternatives precisely.
Compared to the resilience metrics based on a resilience curve or pre- and postdisruption performances, the resilience metrics based on reliability and restoration surveyed in Sec. 3.3 (which are often probabilistically measured) could offer a better choice for system designers when designing an engineered system to be failure resilient. As mentioned in Sec. 3.3, reliability can be precisely quantified through the probability that the system or component will perform its required functions under stated conditions for a specified operating period, and measured systematically by a probability distribution of time to failure. In addition, unreliability, survivability, and vulnerability are other terms that could be used to describe reliability in a system. The resilience concept extends the concept of reliability by incorporating the ability to recover from disruptive events into the system. As suggested by the resilience quantification metrics based on reliability and restoration, for an engineered system to be failure resilient, not only must reliability be designed into the system, but the ability to recover from a performance disruption must also be engineered. Compared to the tremendous amount of research and development in the area of design for reliability, research in the area of design for restoration is still very limited, despite its importance in realizing engineering resilience.
Besides reliability and recovery, other resilience attributes have also been studied, including the ability of a system to monitor its operations, anticipate potential failures, respond to failures, and learn from failures [74]. The ability of a system to monitor includes tracking the changes in its own performance as well as its environment, allowing a disruptive event to be anticipated, minimized, or avoided. When a disruptive event has been anticipated, more coherent, timely, and effective responses can be expected from the system. If the responses of the system are not the desired responses, the ability to learn allows the system to draw on the experience, so that the ability to monitor, anticipate, and respond can be enhanced.
Predictive Resilience Analysis.
When designing an engineered system to be failure resilient, it is essential for system designers to be able to assess the resilience levels of different design alternatives in order to make the best design decision. Many uncertainty factors should be taken into account when converting a conceptual framework into a designable resilience measure and when developing predictive resilience analysis techniques. A conceptual resilience framework is composed of many factors that affect the system performance in terms of resilience characteristics inherent in the system. As surveyed in Sec. 3, resilience quantification metrics have mostly related system performance outcomes after a disruption to system resilience. How the system responds in the aftermath of a disruption will largely determine the resiliency level of the system. Thus, one of the essential and challenging tasks in predictive resilience analysis is analyzing system disruption responses at the early system design stage.
In the early design stage, an engineering assessment technique for predictive resilience analysis is very much needed for system designers to gain the necessary knowledge of how the system responds to a disruptive event, and whether the resilience level of a system design is sufficient. The methodologies and tools available in the literature for assessing engineering resilience in the design process are still very limited. This is primarily because assessing, during the design process, the future performance of a system in its operating stage is challenging. Although advanced system simulation techniques have given system designers more capability in predictive analysis, it is still challenging to take into account the interdependencies and complexities of an engineered system, the uncertainties associated with system design and operation, and the emergent changes in the long term that may affect the system operating conditions.
One of the primary challenges in predictive resilience analysis is the development of effective system modeling techniques, so that the interdependencies and complexity of an engineered system can be modeled, and the performance of the system undergoing a disruption can be simulated and analyzed at the design stage. Some preliminary studies addressing this challenge have been reported in the literature. One way to understand the design architecture of a complex engineered system is to utilize approaches from game theory and social network analysis [75]. Interdependency between entities can be expressed in terms of algebraic connectivity. However, this approach requires an accurate modeling of a complex engineered system as an interconnected graph, which could be very challenging in cases where a large number of interdependent components and subsystems are considered and the graphical model therefore expands tremendously in size. Thus, recent research efforts have been directed toward adopting a combination of logical and statistical approaches, such as the Bayesian or the Markov approach. In such approaches, reasoning copes with complexity, and probability handles uncertainty. The Bayesian network (BN) approach has been proposed as a way to handle interdependencies [57,70], and has been applied to assessing the resiliency of a supply chain [3], a production system line [25], and a system of systems [70]. Figure 14 shows the BN modeling framework that has been reported for engineering resilience analysis and design [25]. In the BN approach, the important system characteristics or critical components are represented as nodes, the interdependencies between components are modeled as links, and the overall complexity of the system structure is demonstrated through the combination of links and nodes. Moreover, in a BN the uncertainties are represented as conditional probabilities over multiple possible states.
Considering the dynamic or evolving behavior of the system performance over time, the dynamic Bayesian network (DBN) could be further employed [76–78]. However, updating the BN or DBN to accommodate system changes for a complex system may be laborious and computationally intensive.
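The BN representation described above can be illustrated with a very small example. The sketch below is not the framework of Fig. 14 or of any cited study; it is a hypothetical four-node network (a disruptive event, two components, and the system) in which all node names and conditional probabilities are assumed values chosen for illustration, and inference is performed by brute-force enumeration, which is only practical for networks of this size.

```python
from itertools import product

# Hypothetical network: disruption (d) -> component A fails (a),
# disruption (d) -> component B fails (b), and (a, b) -> system fails (s).
# All probabilities below are assumed values for illustration only.
P_DISRUPTION = 0.05
P_FAIL_A = {True: 0.60, False: 0.02}   # P(a = True | d)
P_FAIL_B = {True: 0.50, False: 0.01}   # P(b = True | d)
P_SYS_FAIL = {                         # P(s = True | a, b)
    (True, True): 0.99,
    (True, False): 0.20,
    (False, True): 0.10,
    (False, False): 0.001,
}


def bernoulli(p_true, outcome):
    """Probability of a binary outcome given P(outcome is True)."""
    return p_true if outcome else 1.0 - p_true


def joint(d, a, b, s):
    """Joint probability factorized along the links of the network."""
    return (bernoulli(P_DISRUPTION, d)
            * bernoulli(P_FAIL_A[d], a)
            * bernoulli(P_FAIL_B[d], b)
            * bernoulli(P_SYS_FAIL[(a, b)], s))


def probability(query, evidence=None):
    """P(query | evidence) by enumerating all joint states."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for d, a, b, s in product([True, False], repeat=4):
        world = {"d": d, "a": a, "b": b, "s": s}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(d, a, b, s)
        denominator += p
        if all(world[k] == v for k, v in query.items()):
            numerator += p
    return numerator / denominator


p_system_failure = probability({"s": True})
p_disruption_given_failure = probability({"d": True}, {"s": True})
print(f"P(system fails) = {p_system_failure:.4f}")
print(f"P(disruption | system fails) = {p_disruption_given_failure:.4f}")
```

The same network supports both predictive queries (the probability of system failure for a given design) and diagnostic queries (the probability that a disruption occurred given an observed failure). The enumeration cost doubles with every added binary node, which reflects the computational burden noted above for complex systems.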
Besides the interdependency and complexity of an engineered system, it is also challenging to take into account the emergent behavior of the system due to recovery effects, as well as the evolving operating environment. An example would be the design of a transportation infrastructure system to accommodate more autonomous vehicles in the future. In a different application, a partially observable Markov decision process (POMDP) has been proposed for designing resilient spacecraft swarms [79]. Although a POMDP allows self-learning and is self-adaptive, a demanding initial specification of the behavior, reward, and actions is required to enable an accurate self-learning capability. Considering the evolving characteristics of complex adaptive systems (CASs), the agent-based simulation technique could potentially be used by system designers as a sophisticated tool for analyzing disruptions in an adaptive, evolving simulation environment [26,79,80]. Although some initial efforts in modeling engineering systems for resilience assessment have been reported in the literature, more effective predictive resilience analysis methodologies and tools that can be readily used in various system design applications should be developed, in addition to uncovering different engineering resilience quantification metrics.
Engineering Resilience Strategies.
As discussed in Sec. 4.1, there are two essential resilience attributes that an engineered system must possess in order to be failure resilient, namely, reliability and recovery. The resilience strategies discussed in this section are focused on how to improve the reliability and ability to recover through system designs. Since reliability and recovery are designable quantities, they could be utilized in transforming the conceptual resilience to the designable resilience attributes, enabling system designers to develop resilient-engineered systems, as demonstrated in Fig. 15. Accordingly, design strategies used for advancing reliability and recovery could be implemented for the purpose of advancing resilience in the system. In the rest of this section, design strategies for the improvement of reliability and recovery are further discussed.
Improving Reliability Through Design.
As one of the important design attributes for engineering resilience, reliability is a relatively mature concept within the design community. Reliability can be generally defined as the probability that the system or component will perform its required functions under stated conditions for a specified operating period. Accordingly, substantial research efforts have been made in the past few decades in designing engineered systems for reliability, leading to mature design frameworks and tools being developed in the literature, such as the reliability-based design optimization framework [66,81–84], effective reliability analysis methods for design [85–89], and postdesign reliability assessment and growth.
There are different approaches and design strategies to improve the reliability of an engineered system or component. When considering a single failure mode, it is beneficial to understand the failure mechanism and physics of failure so that an appropriate reliability design strategy can be identified, such as discovering new materials, mechanisms, or design concepts, or developing a reliability growth plan. When considering reliability at the system level with multiple components and failure modes, one of the most used design techniques for improving reliability is the incorporation of redundancy into the system. Reliability allocation could be used to allocate reliability attributes to components and subsystems optimally in design while considering redundancy levels. In addition, when dealing with uncertainties in most engineering applications, there is no certain way to take all the failure modes into account in the early design stage. Therefore, derating and diversity are other design techniques that can be adopted to improve reliability. Derating can be found in applications where higher-tolerance components are used for extra endurance instead of components with normal specifications. Diversity can be seen in logistics applications, such as having a diversity of suppliers to ensure the reliability of the continuous supply process.
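The effect of redundancy on system reliability can be sketched with the standard series and parallel reliability formulas. The example below assumes statistically independent component failures, an assumption that common cause failures would violate; the function names, component reliabilities, and target value are hypothetical and serve only as an illustration.

```python
def series_reliability(reliabilities):
    """Series system: every component must work (independence assumed)."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result


def parallel_reliability(component_reliability, n_redundant):
    """Parallel system: at least one of n identical components must work."""
    return 1.0 - (1.0 - component_reliability) ** n_redundant


def min_redundancy(component_reliability, target):
    """Smallest number of identical parallel components meeting the target."""
    if not (0.0 < component_reliability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("reliability and target must lie strictly in (0, 1)")
    n = 1
    while parallel_reliability(component_reliability, n) < target:
        n += 1
    return n


# Hypothetical subsystem: a pump with reliability 0.90 in series with
# a controller with reliability 0.98, with and without a redundant pump.
single = series_reliability([0.90, 0.98])
duplicated = series_reliability([parallel_reliability(0.90, 2), 0.98])
needed = min_redundancy(0.90, target=0.995)
print(f"single pump: {single:.4f}, redundant pumps: {duplicated:.4f}")
print(f"pumps needed for a 0.995 pump-stage target: {needed}")
```

A simple search of this kind is one elementary form of reliability allocation: it identifies the redundancy level at which a subsystem target is met, after which the added cost and weight can be weighed against the reliability gain.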
Besides the design strategies for improving system reliability, failure diagnostics, prognostics and health management (PHM), and appropriate operation and maintenance (O/M) plans could also be developed to improve system reliability in operation. PHM is an emerging engineering discipline that has been applied to a large variety of engineered systems to improve system reliability [90–94]. It diagnoses the performance degradation of a system through its operational performance data and, on that basis, predicts the remaining useful life (RUL) of the system. PHM can significantly enhance the reliability, availability, and predictability of the system by providing early awareness of potential system failures, thus enabling optimized planning of failure mitigation and recovery activities.
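As a minimal illustration of how operational performance data can be turned into an RUL estimate, the sketch below fits a straight line to a health indicator and extrapolates it to a failure threshold. This is a deliberately simple, assumed degradation model and not a method taken from the cited PHM literature; practical PHM systems typically use richer degradation models and quantify the uncertainty of the prediction. The function names, health indicator values, and threshold are hypothetical.

```python
def fit_line(times, values):
    """Ordinary least-squares fit of values = slope * time + intercept."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def remaining_useful_life(times, health, failure_threshold):
    """Extrapolate a linear degradation trend to the failure threshold.

    Returns the time remaining after the last observation, or None when
    the fitted trend shows no degradation (slope is not negative).
    """
    slope, intercept = fit_line(times, health)
    if slope >= 0.0:
        return None
    time_of_failure = (failure_threshold - intercept) / slope
    return max(0.0, time_of_failure - times[-1])


# Hypothetical health indicator (1.0 = as new) sampled every 100 h,
# with failure assumed to occur when the indicator drops to 0.4.
hours = [0.0, 100.0, 200.0, 300.0]
health_index = [1.00, 0.90, 0.80, 0.70]
rul = remaining_useful_life(hours, health_index, failure_threshold=0.4)
print(f"estimated RUL: {rul:.0f} h")
```

An estimate of this kind is what allows maintenance or recovery actions to be scheduled before the threshold is reached, rather than after a failure has occurred.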
Improving the Ability to Recover.
Unlike improving reliability through design, improving the ability of an engineered system to recover often relates to the aftermath of disruptive events, which makes it more challenging for system designers to consider it thoroughly at the early stage of a system design process. In many applications, a swift recovery process also depends on the amount of available resources and time. Thus, optimal resource allocation, high-level preparedness, and good collaboration can be designed into a system in coordination with decision makers at the managerial level.
Redundancy is also in line with recovery strategies, since it offers an alternative path for maintaining system functionality when a disruptive event occurs due to the failure of a component or subsystem. Similarly, maintenance actions for mitigating potential failures or recovering the functionality of failed components or subsystems would be another design strategy that could be applied to enhance the ability of an engineered system to recover. Preventive maintenance is associated more with reliability attributes because it is usually used to maintain the healthy condition of a system and prevent a complete system failure. This is in contrast to corrective maintenance, which is typically carried out to restore the system to an operational condition and thus leans more toward recovery attributes. Developing an effective maintenance plan requires not only that the maintenance planning be optimized but also that the system be designed so that the planned maintenance actions can be conducted effectively. Additionally, functional retrofits that apply partial changes to a system at the operation stage to restore its capacity or improve performance have gradually become a major cost-effective means to maintain the desired functionality of an engineered system over its lifecycle. Functional retrofits through partial system repair, replacement, or upgrade could be a viable strategy for improving the ability to recover, provided that these retrofits are appropriately projected and engineered at the system design stage.
The PHM technique, which diagnoses the performance degradation of a system through its operational performance data, could facilitate optimized planning of failure mitigation and recovery actions. The PHM technique could not only improve system reliability by offering early awareness of system failures but also play an important role in improving the ability of an engineered system to recover from the aftermath of disruptive events. This is because the PHM technique enables a proactive approach to addressing failures in the use phase of the life cycle through detecting, diagnosing, and predicting the system-wide effects of disruptive events and providing valuable information for failure mitigation and recovery decisions. A resilient design of an engineered system would expect the system to be intelligent, so that it can make autonomous decisions to recognize the risk induced by a potential hazard or disruptive event, and adjust or reconfigure itself in response to that risk [79,80]. Advanced resilience design could leverage the capability offered by the PHM technology in order to develop self-learning or self-restructuring capabilities for the design of a resilient-engineered system [76]. The PHM technique has been successful in lowering system lifecycle costs by providing precise information about operational stage failures. However, in order to realize resilience through failure prognosis and prognosis-informed maintenance or functional retrofits, a generally applicable PHM system development framework that ensures high accuracy and robustness needs to be developed for the design of resilient-engineered systems.
In summary, both reliability and recovery are essential resilience attributes that are quantifiable and designable. Some examples of design strategies for improving reliability and recovery are listed in Table 2; these strategies are still being refined with the continuous progress of research and development in design for resilience.
Challenges and Further Research.
From the surveyed resilience quantification metrics and the discussion on their engineering design implications, it is postulated that resilience of an engineered system could be enhanced through better design. The enhancement could be realized from the improvement of designable resilience attributes through appropriate design strategies. However, to achieve resilient designs of engineered systems, challenges from multiple aspects must be addressed. In this subsection, based upon the authors' best knowledge, the challenges and the further research needs are discussed.
As shown by most of the resilience quantification metrics, the resilience measure is closely tied to the system performance changes throughout a disruptive event. Early awareness of potential disruptive events and their aftermath at the early system design stage is one of the primary challenges in resilient system design. This challenge is posed to system designers at the early design stage, as they have to be aware of potential disruptive events, the factors of complexity, and the uncertainties in their design applications. They also have to be aware of how these factors would affect the behavior of the system when it undergoes one of the disruptive events, before the system is actually developed.
Disruptions can be categorized according to the types, sources, or impact levels, such as natural or man-made, external or internal, and local or global [95]. Disruptions do not necessarily need to have a sudden fatal impact on the system. Aging and degradation due to long hours of operation could be considered as disruptions as well. Furthermore, a minor disruption will only alter a small part of the system characteristics, whereas a disruption with severe impact could be fatal for the system. From the disruption aspects, the context (behavior, mode, and state), the duration (temporary, permanent, and trend), and the risk (likelihood and severity of damage) should be considered in the design of a resilience scheme [79].
Complexity is generally associated with the hierarchy and the collective behavior of the system. For example, the interdependency between system components and subsystems, between different subfunctions, and between the system and its operating environment would substantially increase the complexity of the system. A system in general consists of multiple components and subsystems that are interconnected and interact with each other in various ways. The collective behaviors of lower-level systems regulate the top-level system performance. Depending on the severity and the impact of the disruptive events, a partial failure, common cause failures, or cascading failures could occur. Any of these failures imposes negative effects on a system, which are indicated by an overall lower system performance level. From the system characteristic viewpoint, the architecture or hierarchy, the collective behavior, the interdependencies, and the functionality of the system should not be disregarded in the scheme of designing resilience.
To address the challenges outlined above in designing resilient-engineered systems, there is a great research need for a theoretical basis that furnishes a better understanding of how engineered systems achieve resilience, and that enables the development of an engineering resilience principle readily applicable to engineering design. In the rest of this subsection, several emergent research needs are discussed from four different aspects. This discussion is not intended to be exhaustive, but rather to shed light on further research directions and to stimulate more valuable insights from the community to address the research challenges in designing resilient-engineered systems.
Early Awareness of Disruptive Events.
In the early design stage, it is essential for system designers to be aware of potential disruptive events for their design applications, and to have the necessary knowledge of the likelihood of occurrence of each of these disruptive events. Although information on the failure rates of different types of system failures exists in the literature, these failure-induced disruptions are largely within the scope of a particular system or due to human error, and are primarily considered independently. Knowledge about disruptions induced by external factors, such as natural disasters or external environments, and about their cascading effects due to the interdependency between system components and subsystems, is primarily dependent on subjective expert judgments. Understanding the characteristics of these potential disruptive events would enable failure mitigation and recovery techniques to be included in the consideration of system designs. In addition, early awareness of the potential disruptions would help the development of system monitoring, diagnostic, and prognostic techniques so that these potential disruptive events could be avoided or their consequences could be minimized.
Capability of Predictive Resilience Analysis.
During the system design process, it is also essential for system designers to be able to assess the resilience levels for different design alternatives. Thus, enabling techniques for predictive resilience analysis applicable at the early design stage is of paramount importance. The development of advanced complex system modeling methodology and associated system simulation tools would largely enhance the capability of system designers in predictive resilience analysis. The modeling technique must be able to take into account the complexity of systems, simulate the aftermath of system disruptions and system responses to these disruptions, consider the uncertainties associated with system design and operation, and further be adaptive to emergent changes in system operating conditions.
Recovery Strategies for Design.
As discussed in Sec. 4.3, recovery is one essential resilience attribute to be designed for a resilient-engineered system. However, recovery of the performance of degraded or partially failed engineered systems has largely relied on maintenance activities or functional retrofits. The strategies that can be used in design primarily depend on the allocation of redundancy, as it offers an alternative path for maintaining system functionality when a disruptive event occurs due to failures of a component or subsystem. Although PHM research could improve the ability to recover by facilitating optimized planning of failure mitigation and recovery actions, failure recovery strategies that can be used by system designers in the design stage are very limited. Further research is very much needed in the new avenue of exploring diverse failure recovery strategies that can be readily used for engineering design. These research needs would generally fall into either developing new performance recovery pathways, such as the use of self-healing materials [96,97] for design, or better implementation of existing recovery strategies, such as an advanced operation and maintenance planning method. Additional efforts would also need to be spent on design decisions regarding different recovery strategies and design alternatives for achieving the recovery of system performance after disruptions.
Cost Assessment and Systems Engineering.
With the increasing complexity and long projected useful lives of engineered systems, design decisions to ensure the resilience of the system generally have to be made while simultaneously considering costs or affordability. Thus, a lifecycle cost assessment framework that takes into account all costs associated with improving each of the resilience attributes through the resilience strategies must be developed and incorporated into the decision-making process while designing a resilient-engineered system.
In addition, during the process of designing a resilient-engineered system, not only reliability but also the ability to recover from a disruption must be designed in order for an engineered system to be failure resilient, as suggested by the resilience quantification metrics. Advanced systems engineering tools for design, such as those for tradespace explorations [98–100], must also be developed and used to facilitate the generation and assessment of different design alternatives considering interdependencies, design constraints, different design outcomes, and their lifecycle costs.
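The trade-off between resilience and lifecycle cost across design alternatives can be illustrated with a simple nondominated (Pareto) filter, which is one elementary building block of tradespace exploration. The sketch below is not taken from the cited tradespace exploration tools; the design alternatives, their resilience values, and their lifecycle costs are hypothetical, and a practical framework would also account for design constraints and uncertainty in both quantities.

```python
from typing import NamedTuple


class Design(NamedTuple):
    name: str
    resilience: float       # higher is better
    lifecycle_cost: float   # lower is better (arbitrary cost units)


def dominates(a, b):
    """True if design a is at least as good as b on both criteria
    and strictly better on at least one."""
    no_worse = (a.resilience >= b.resilience
                and a.lifecycle_cost <= b.lifecycle_cost)
    strictly_better = (a.resilience > b.resilience
                       or a.lifecycle_cost < b.lifecycle_cost)
    return no_worse and strictly_better


def pareto_front(designs):
    """Keep only the designs that no other design dominates."""
    return [d for d in designs
            if not any(dominates(other, d) for other in designs)]


# Hypothetical design alternatives for illustration only.
candidates = [
    Design("baseline", 0.90, 100.0),
    Design("redundant", 0.97, 140.0),
    Design("redundant+PHM", 0.99, 180.0),
    Design("overbuilt", 0.96, 200.0),
]
front = pareto_front(candidates)
for design in front:
    print(design)
```

In this example the "overbuilt" alternative is discarded because the "redundant" alternative offers higher resilience at lower cost; the remaining alternatives represent genuine trade-offs that the decision maker must resolve according to affordability and the required resilience level.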
Conclusion
This paper presented a literature survey of engineering resilience quantification metrics from a system design perspective. The engineering resilience quantification metrics reported in the literature were reviewed and summarized in three categories. Based on the surveyed resilience quantification metrics, the design implications toward the development of resilient-engineered systems were discussed, with a focus on the resilience attributes, predictive resilience analysis, and design strategies for resilience. The challenges of incorporating resilience into engineering design processes were discussed, and the future research needs were outlined from four different perspectives, with the aim of inspiring future research directions and eliciting valuable insights from the community to address the research challenges in designing resilient-engineered systems. The presented study is expected to serve as a building block toward developing a generally applicable engineering resilience analysis and design framework that can be readily used for system design.
Acknowledgment
This research was partially supported by the National Science Foundation through the Faculty Early Career Development (CAREER) award (CMMI-1351414) and by the Department of Transportation through the University Transportation Center (UTC) Program.
Nomenclature
- E[•] = expected value
- ei = disruptive event
- P(t) = system performance level over time
- Po = initial system performance level before disruption
- Pv = system performance level after disruption
- R = system reliability
- T = control period (T = tn − td)
- td = occurrence time of the disruptive event
- tn = time to new recovered state
- to = initial time
- tv = time to vulnerable or degraded state
- T* = a long period of time
- ρ = system recovery/restoration
- Ψ = system resilience