MTBF: When Will Failure Strike?





Two standards are used for calculating MTBF. The Defense Department authorized the first issue of Military Handbook 217 (MIL-HDBK-217A), Reliability Prediction of Electronic Equipment, in 1965. The latest issue is MIL-HDBK-217F Notice 2, Feb. 1995.

The former Bell Communications Research (Bellcore) took MIL-HDBK-217 as a starting point and then modified and simplified the failure rate models to better reflect its own field experience. In 1985 it issued Technical Reference 332 (TR-332), Reliability Prediction Procedure for Electronic Equipment, now known as Telcordia SR-332. The latest revision of TR-332 is Issue 6, Dec. 1997. Bellcore developed TR-332 for telecommunications companies, but it has since been adopted for many other commercial electronic products.

MTBF is the inverse of the sum of the failure rates of all the electronic and electromechanical parts (λ) in a piece of equipment.

MTBF = 1 / (λ1 + λ2 + ... + λn)
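A minimal sketch of the formula above, assuming each part's failure rate is expressed in failures per hour and that the failure of any one part fails the equipment (a series model). The function name and the sample rates are illustrative only and are not drawn from either standard.

```python
def mtbf_hours(failure_rates_per_hour):
    """Return MTBF in hours as the inverse of the summed part failure rates."""
    total_rate = sum(failure_rates_per_hour)
    if total_rate <= 0:
        raise ValueError("total failure rate must be positive")
    return 1.0 / total_rate


# Hypothetical part failure rates (failures per hour), for illustration only.
example_rates = [1e-8, 5e-8, 7e-8, 2e-8]
print(f"Predicted MTBF: {mtbf_hours(example_rates):,.0f} hours")
```

Because the rates add, the part with the highest failure rate dominates the result, which is why a single unreliable part can set the MTBF of the whole assembly.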

The failure rate models for each individual part are estimates based on available data, and are the product of the base generic failure rate λ and a number of modifying π factors. The failure rates in TR-332 have fewer modifying π factors than MIL-HDBK-217. For TR-332, the steady-state part failure rate is:

λSS = λG × πQ × πS × πT × πE

where: λSS is the steady-state part failure rate (failures per 10^9 hours)
λG is the generic failure rate
πQ is the quality factor
πS is the electrical stress factor
πT is the temperature stress factor
πE is the environmental conditions factor
πFY is the first year multiplier (applied separately for first-year failure rates; it is not part of the steady-state product above)
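The TR-332 product can be sketched as follows. This is a simplified illustration: the function names and every numeric value are assumptions, and real π factors come from the tables in the procedure. Because λSS is expressed in failures per 10^9 hours, dividing 10^9 by it gives the MTBF of that part alone, in hours.

```python
def steady_state_failure_rate(lambda_g, pi_q, pi_s, pi_t, pi_e):
    """Steady-state part failure rate in failures per 1e9 hours."""
    return lambda_g * pi_q * pi_s * pi_t * pi_e


def fits_to_mtbf_hours(rate_fits):
    """Convert a failure rate in failures per 1e9 hours to MTBF in hours."""
    if rate_fits <= 0:
        raise ValueError("failure rate must be positive")
    return 1e9 / rate_fits


# Hypothetical factor values, not taken from the TR-332 tables.
lambda_ss = steady_state_failure_rate(
    lambda_g=10.0, pi_q=2.0, pi_s=1.0, pi_t=1.0, pi_e=1.5
)
print(f"lambda_SS = {lambda_ss} failures per 1e9 hours")
print(f"MTBF for this part alone = {fits_to_mtbf_hours(lambda_ss):,.0f} hours")
```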

For MIL-HDBK-217, the part failure rates λp (equivalent to TR-332 λSS) are the products of the base generic failure rate λb (equivalent to TR-332 λG) and a larger number of modifying π factors, some with added coefficients. As a minimum:

λp = λb × πT × πS × πQ × πE

where: λp is the part failure rate (failures per 10^6 hours)
λb is the base part failure rate
πT is the temperature stress factor
πS is the electrical stress factor
πQ is the quality factor
πE is the environmental factor
Other π factors used in the handbook are:
πA is the application factor (certain semiconductors)
πC is the construction factor (magnetrons and semiconductors)
πCF is the configuration factor (vacuum capacitors)
πCYC is the cycling factor (mechanical switches and relays)
πCV is the capacitance factor (capacitors)
πF is the function factor (meters)
πK is the mating/unmating factor (connectors)
πLS is the load stress factor (switch and relay contacts)
πL is the learning factor (number of years since a vacuum tube’s introduction to field use)
πM is the matching network factor (RF components)
πP is the power degradation factor (lasers, laser diodes), or active pins (connectors)
πR is the power rating factor (semiconductors) or resistance range factor (resistors)
πS is the size factor (synchros and resolvers)
πSR is the equivalent series resistance factor (electrolytic capacitors)
πU is the utilization factor (magnetron tubes, switches)
πV is the voltage factor (variable resistors and capacitors)
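Since the set of π factors varies from part type to part type, one way to sketch the handbook's approach is a base rate multiplied by whichever factors apply. The sketch below assumes that simple product form only (models with added coefficients are not covered), and the names and values are hypothetical. It also rescales the result to failures per 10^9 hours so that it can be compared with TR-332 figures.

```python
from math import prod


def part_failure_rate(lambda_b, pi_factors):
    """Part failure rate in failures per 1e6 hours: base rate times all pi factors."""
    return lambda_b * prod(pi_factors.values())


def per_1e6_to_per_1e9(rate_per_1e6_hours):
    """Rescale failures per 1e6 hours to failures per 1e9 hours."""
    return rate_per_1e6_hours * 1000.0


# Hypothetical values for illustration; real ones come from the handbook tables.
factors = {"pi_T": 1.5, "pi_S": 2.0, "pi_Q": 1.0, "pi_E": 4.0}
lambda_p = part_failure_rate(lambda_b=0.002, pi_factors=factors)
print(f"lambda_p = {lambda_p:.4f} failures per 1e6 hours")
print(f"         = {per_1e6_to_per_1e9(lambda_p):.1f} failures per 1e9 hours")
```

Keeping the factors in a dictionary lets the same function serve a part that needs only the four minimum factors and one that needs several of the additional ones.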

The failure rates for LSI chips (Large Scale Integrated Circuits) such as microprocessors and gate arrays are calculated by gate count or transistor equivalents, and tables are provided in both reliability specs for determining the generic failure rates. There are also tables for estimating the junction temperatures of these complex parts.

As you might expect, electromechanical parts have higher failure rates than electronic parts. Switches, at λSS = 10, are about the same as power semiconductors. Small fans have λSS = 50. Relays have λSS = 70 or more.

Failure rates are also high in microwave and high-frequency switching applications. The λSS values for HF switching devices, detectors, mixers, and amplifiers are all above 100. The stress factors extend only to 90% of the particular parameter’s rating. Once the applied stress exceeds this limit (overstress), the predicted part failure rate is no longer valid, and catastrophic part failure is likely to result.
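The 90% limit lends itself to a simple design check. The sketch below is an assumption about how such a check might be written, not a procedure from either standard; the function name and the example voltages are hypothetical.

```python
def is_overstressed(applied, rated, limit=0.90):
    """Return True when the applied stress exceeds the given fraction of the rating."""
    if rated <= 0:
        raise ValueError("rating must be positive")
    return applied / rated > limit


# Hypothetical capacitor rated at 50 V.
print(is_overstressed(applied=25.0, rated=50.0))  # 50% stress
print(is_overstressed(applied=48.0, rated=50.0))  # 96% stress
```

A part flagged by this check falls outside the range the stress factors cover, so its predicted failure rate should not be used in the MTBF sum.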

Given the severe penalty for earth orbit, how do satellites last as long as they do? Currently, it is impossible to repair a satellite in geosynchronous orbit, so to ensure long life, the electronics are built with the highest reliability (Hi-Rel) Military specification (Mil-Spec) parts available:

1. Established Reliability (ER) passive parts,
2. Joint Army-Navy Tested-Extra 100% internal Visual inspection (JAN-TXV) discrete semiconductors,
3. MIL-M-38510 monolithic, multichip, and hybrid microcircuits, with their quality and reliability assurance requirements.

The finished equipment is then given a burn-in with full environmental qualification. Redundant circuits are used for critical functions. While redundant circuits do add to the total parts count, properly designed fail-safe/fail-operational redundancy earns some significant credit factors.

Part failure rates are additive, while redundancy reduces the failure rate exponentially, so the overall equipment failure rate drops by far more than the penalty added by the increased parts count.
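The effect of redundancy can be illustrated with the textbook model for active parallel redundancy. This sketch assumes a constant failure rate (exponential reliability), independent failures, identical units, and perfect switching between them; none of these assumptions or numbers come from the standards, and the function names are hypothetical.

```python
import math


def reliability(failure_rate_per_hour, hours):
    """Probability that a constant-failure-rate unit survives the given time."""
    return math.exp(-failure_rate_per_hour * hours)


def parallel_reliability(unit_reliability, n_units):
    """Reliability of n independent identical units where one survivor suffices."""
    return 1.0 - (1.0 - unit_reliability) ** n_units


# Hypothetical unit: 1e-5 failures per hour over a 10,000-hour mission.
r_single = reliability(1e-5, 10_000)
r_dual = parallel_reliability(r_single, 2)
print(f"single unit: {r_single:.4f}, dual redundant: {r_dual:.4f}")
```

Under these assumptions the probability of losing the function is the single-unit failure probability raised to the power of the number of units, which is the exponential gain that outweighs the doubled parts count.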

There are two specific techniques used to evaluate MTBF. During the product development process, engineers calculate predicted MTBF using the failure rates of the components used in the design. Initially, the MTBF is calculated using a parts count method that assumes an operating temperature of 40°C and an electrical stress of 50%. Once component values and ratings are selected and actual junction and hot-spot temperatures are known, the predicted MTBF can be calculated with higher accuracy.

The preferred calculation method for MTBF requires data that documents actual failure rates for electronic components under normal operating conditions. Manufacturers or users analyze their field return rates to determine the actual field-demonstrated MTBF.

Using this data, they can calculate the overall field MTBF using a set of assumptions based on accurate sampled information. It is also important for a manufacturer to have good design rules, careful component selection, quality control and test procedures, and a controlled production process that builds products with the highest possible life expectancy.
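A field-demonstrated MTBF is commonly estimated as total operating hours divided by the number of failures observed. The sketch below assumes every unit accumulates the same operating hours and that every failure is returned and counted, which real return data rarely guarantees; the function name and figures are hypothetical.

```python
def field_mtbf_hours(units_in_service, hours_per_unit, failures):
    """Estimate field MTBF as total accumulated operating hours per failure."""
    if failures <= 0:
        raise ValueError("at least one observed failure is needed for a point estimate")
    total_hours = units_in_service * hours_per_unit
    return total_hours / failures


# Hypothetical fleet: 1,000 units running continuously for one year, 20 returns.
print(f"Field MTBF: {field_mtbf_hours(1000, 8760, 20):,.0f} hours")
```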

As mentioned in the article, one of the most important design tools is the Failure Mode and Effect Analysis (FMEA). The FMEA takes each mode of failure for a given component (short, open, degradation, value shift, etc.) and then determines the effect on the overall equipment. A failure in one device can propagate and lead to subsequent failures in other components. Whenever an undesirable failure effect is found, the engineers will evaluate the compensating conditions, and perhaps make a design change to make the failure effect more benign. An FMEA is complementary to the design process of defining positively what a design must do to satisfy the customer, but it is most useful when applied with failure mode distribution data.

The military failure mode distribution reference, Failure Modes/Mechanism Distributions (document FMD-97), is published by the Reliability Analysis Center (RAC). It lists the percentage probability of each failure mode for electronic and electrical components, obtained from repair facility data. Some examples (not, unfortunately, from the latest version of FMD) are:

Carbon composition resistor: 46% open, 3% short, 51% value drift.
Film resistors: 45% open, 15% short, 40% value drift.
Variable resistors: 70% open, 10% short, 30% mechanical binding.
Ceramic caps: 30% open, 40% short, 30% value drift.
Metallized caps: 43% open, 43% short, 14% value drift.
Film-foil caps: 20% open, 70% short, 10% value drift.
Tantalum caps: 10% open, 85% short, 5% value drift.
Aluminum caps: 60% open, 10% short, 30% value drift.
Signal diode: 40% open, 60% short.
Power diode: 30% open, 70% short.
Zener diode: 40% open, 10% short, 50% value drift.
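In an FMEA, distributions like these are typically used to apportion a part's failure rate among its failure modes. The sketch below shows that arithmetic using the tantalum capacitor percentages listed above; the part failure rate of 20 is a made-up figure, and the function name is hypothetical.

```python
def mode_failure_rates(part_rate, mode_fractions):
    """Split a part failure rate across failure modes by their fractions."""
    total = sum(mode_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"mode fractions sum to {total:.2f}, expected 1.00")
    return {mode: part_rate * fraction for mode, fraction in mode_fractions.items()}


# Fractions from the tantalum capacitor entry; the rate of 20 is hypothetical.
tantalum_modes = {"open": 0.10, "short": 0.85, "value drift": 0.05}
for mode, rate in mode_failure_rates(20.0, tantalum_modes).items():
    print(f"{mode}: {rate:.1f}")
```

The sum check rejects any distribution whose percentages do not total 100%, which is worth verifying before the figures are used in an analysis.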

Failure modes and their probabilities become more complicated with transistors, ICs, and larger scale devices. The concept of an individual transistor “failure” in a 100,000-transistor monolithic microprocessor is not easy to fathom in a practical sense.