When bad things happen to complicated systems


This is a story about complex systems failing obscurely.

"The first thing, of course, was the noise that alerted us to something had gone wrong."

On November 4, 2010, an Airbus A380 (that's their new double-decker superjumbo) blew an engine while in mid-air. That is something that happens, but this one was an impressive and quite rare uncontained turbine failure, which is a euphemism for "blew up bigtime." Bits of the turbine disc cut into the wing, damaging several systems and making a huge headache for the aircrew, who performed admirably and landed the plane without serious incident or any injuries.

The bare details are available elsewhere, and this is largely old news, but I just stumbled on an amazing interview (that's a transcript, plus a link to audio) with Capt. David Evans, a check captain who happened to be aboard during the flight. Oddly enough, there were four pilots aboard a plane that normally needs a crew of two: as explained in the transcript, Evans was supervising the training of another pilot as a "check captain," whose job is to evaluate aircrews by observing them during a flight.

That's all boring background, and I was reading this because aviation incidents are a morbid fascination of mine. But this one was interesting, and not just because everyone lived to tell the tale*: the hidden message is that the airplane was not well-designed to cope with a complex failure. It's overstating the case to say that the crew narrowly averted disaster, but it's a tale of confusing information, poor delivery, badly handled corner cases, and hard decisions.

First, the engine went boom, a noise the aircrew heard. The "Electronic Centralized Aircraft Monitor" (aka "ECAM": it's Twitter for glass-cockpit commercial jets) threw the message ‘Engine 2 turbine overheat’. Then it momentarily said ‘engine fire’ before returning to the first message. There are procedures for dealing with an overheat (basically, throttle back and check the temperature), but it's not an overheat when half of the engine is missing. Not that a catastrophic engine failure necessarily needs its own warning message, but surely there was enough information, if it had been sensed or presented properly, for some combo of the system and the pilots to deduce a complete engine failure (and a big one) sooner. It was clearly wrong to describe this failure primarily as "too warm." (I'm also confused because, presumably, the engine would rapidly have shown far greater signs of malfunction: incorrect or zero rpm, or some such.)

Maybe it's not that bad: the crew followed the overheat procedure, and after the mandated 30-second test at reduced throttle, shut off the engine.

"It was getting very confusing with the avalanche of messages we were getting. So the only course of action we have is the discipline of following the ECAM and dealing with each one as we came through with them."

Capt. Evans goes on to mention they faced 43 ECAM messages within 60 seconds of the explosion, each with its own procedure, sometimes including procedures that were clearly contraindicated (maybe don't try to equalize fuel weight into the suddenly empty wing tank with a visible fluid stream coming out of it...). Even with 2 extra aircrew, it took them 2 hours of flying time to deal with all the messages properly.

Next, they had to figure out their landing profile: given their weight and the nature of their failed systems, what were the correct landing speed and minimum runway length? Unfortunately... "In the Airbus and the A380 we don’t carry performance and landing charts, we have a performance application. Putting in the ten items affecting landing performance on the initial pass, the computation failed. It gave a message saying it was unable to calculate that many failures."

They solved the problem with common sense and creativity, taking out variables that wouldn't really be relevant to their situation, until they finally got a sensible answer. But it's a bit scary that the computerized landing calculation system gave up on them!

After two hours of flying, they had finally coped with enough of their issues to be sure they could safely land back at Singapore, and they did, making what was, considering the circumstances, an uneventful landing.

"I think most probably the most serious part of the whole exercise, when you think back at it, was the time on the runway after we’d stopped."

And then they were stuck. First, critical communications systems were out. Second, the fire crew on the ground was reluctant to approach the plane because an engine was still running, and because the plane had an active fuel leak that was spraying onto the extremely hot landing-gear brakes. The fire crew did act to suppress the fire risk, and the aircrew chose to keep the passengers inside the plane until they could get stairs, rather than use the slides to evacuate people into a pool of kerosene. A judgment call, maybe even a risk, but a decision that worked out.

"Questions were asked ‘why did we spend so long in the air’? But we had to spend that time in the air to determine the state of the aircraft and it took that long to do that. "

What I read into this story was a tale of complex systems in chaotic circumstances, and a crew that seemed, at times, to be fighting with their cockpit management system and the standard procedures and even the reluctant firefighting ground crew. Messages about what they were dealing with were obscure, leading them through confusing processes. They had to spend hours, both in the air and on the ground, inside a seriously crippled airplane, largely because they were working through processes that were not coping with a severe failure in an efficient manner.

It is true that part of the reason the procedures they faced were so complex is that the failure was so complex: when the engine exploded, its debris punched holes in the wing and damaged fuel and hydraulic systems throughout the plane. But this appears to me to be an incident where dealing with the problem took far too much time, and fortunately they had enough time.

"We tried to recreate it in the sim and we can’t! I think it was just such an extraordinary day."

And yet. I worry about complex systems, I worry about rigid procedures (and indeed, the crew demonstrated creative and thoughtful responses to a problem that was so unexpected the standard training sim can't make it happen), and I worry about user interfaces. There are lessons here not only for Airbus and Qantas (and surely for Rolls-Royce, maker of the ill-starred turbine), but also for any student of coping with hard problems. I'm not clever enough to impose morals on this story, except maybe Not Everything on the Test Will Be Covered In Class and You May Be Required to Use Your Brain.

*Aviation is a super-safe means of travel. Incidents are rare, and most of them (like this one) do not end with a plane crash. That said, the survival rate for passengers in aviation crashes is 53%. In airliner incidents with at least one passenger fatality, the general survival rate of passengers was 25% over the last decade. That's a long-winded way of saying that in a fatal airplane accident, most of the time most of the passengers die.