How often do we take an action on our PCs that requires us to answer the “Are you sure?” question? The simple reason for that verification: it’s a well-known fact that user error can cause system, application, or data problems. That message should serve as a constant reminder to think about resilience and continuity from the people perspective.
Staff Training and Effective Continuity
It cannot be stated often enough that we need to integrate staff training to achieve effective continuity programs. A recent issue of CSO magazine included an interesting recap of Malcolm Harkins’ thoughts. Harkins is the CISO at Intel, and he was speaking recently at the Forrester Security Forum. Misperception of risk, according to Harkins, is driven by two key factors – economics and psychology. Economic impacts guide each of us when we make business decisions; the desired outcome of a decision might be increased sales and revenue – a positive economic impact. What CSO didn’t mention is that it’s important to have full information on the business impact when making these policy and program decisions.
Harkins’ psychology angle is interesting, too: he spoke of both exaggerating and underestimating risks, either of which can lead to flawed business decisions. Consider a scenario I often hear in conversation: if you have fail-over architecture and high availability for an application, doesn’t that diminish the need for further continuity investments?
Actually, the fact that an application warrants fail-over protection should raise your sensitivity to making sure that basic recovery is possible: the moment a single point of failure is triggered, you are running exposed to a second failure. Additionally, testing fail-over capability can be very challenging. Wouldn’t it be comforting to know that, should human error enter the mix at just that critical – sometimes chaotic – moment, you have a recovery capability that has been validated, and even more importantly, that your staff knows what to do?
Crisis Decision Making
Decision making in a crisis is never easy; the potential impact of incorrect choices adds a heavy stress burden. However, that’s why recovery programs are developed; decision trees based on potential disruptive events are created during non-stress-filled times to guide decisions when chaos may reach a peak.
Some of this could be categorized as continuity basics, but real-world events often tell a different story. On September 26th, a computer failure at the Australian airline Virgin Blue caused havoc. The direct impact: Virgin Blue staff could not re-ticket passengers, and passengers could not use self-check-in. The root cause was a simple disk failure. The human response that ensued focused on the effort to repair the disk, which did not go as hoped. The decision to transfer operations to a recovery configuration – a process that was in place with a 2-hour RTO – was delayed on the assumption that the repair would be quick. The result: hundreds of disrupted flights, thousands of angry customers, and you can count on many of those customers now having to be categorized as “former customers.” Virgin Blue has offered reimbursement and a free ticket – a costly impact, considering the number of flights disrupted over 24 hours.
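The lesson generalizes into a simple rule that belongs in any runbook built during those non-stress-filled times: if the time already spent plus the optimistic repair estimate would blow past the documented RTO, stop repairing and invoke recovery. The sketch below is a hypothetical illustration of that rule – the function name, parameters, and 2-hour threshold are assumptions for the example, not Virgin Blue’s actual procedure.

```python
# Hypothetical sketch of an RTO-aware "repair or fail over" decision rule.
# All names and thresholds are illustrative, not a real airline runbook.

def should_invoke_recovery(elapsed_minutes: float,
                           estimated_repair_minutes: float,
                           rto_minutes: float = 120) -> bool:
    """Return True when continuing in-place repair risks missing the RTO.

    Rule: if time already spent plus the (optimistic) repair estimate
    exceeds the documented RTO, stop repairing and fail over now.
    """
    return elapsed_minutes + estimated_repair_minutes > rto_minutes

# 30 minutes in, engineers estimate "another hour" to fix the disk:
print(should_invoke_recovery(30, 60))   # prints False: 90 min fits a 2-hour RTO
# 90 minutes in, with the same one-hour estimate:
print(should_invoke_recovery(90, 60))   # prints True: 150 min exceeds the RTO
```

Deciding the threshold in advance is the point: when the rule is written during calm planning sessions, no one has to argue about it while the outage clock is running.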
When faced with any disruptive event, those making decisions to delay invoking a recovery plan should ask themselves, “Are you sure?”