“A thought which does not result in an action is nothing much, and an action which does not proceed from a thought is nothing at all” (Georges Bernanos)
This is the third post in the series covering Incident Management in a SaaS Operational Environment.
The previous post covered the initial activities of the incident and discussed the more reactive tasks, namely Detection, Recording and Classification. This post discusses the proactive stages leading to resolution.
Notification – Inform everybody of the incident.
There are three groups that must be made aware of the incident as soon as it is classified:
- Internal staff. A predefined list of who gets notified within the company must exist. Whether it is done via email, chat, WhatsApp, phone call or carrier pigeon should be determined ahead of time, according to the classification (urgency and impact). You do NOT want a situation where a major customer is the one who informs the sales rep of a problem.
- Customers. Sometimes the classification of the problem determines that no customers are impacted right now and that service can be restored shortly. In this case there is no advantage in creating mass hysteria. The Status Page (as described in the first post) should be updated first. Then, depending on the circumstances, there are options of sending an email to all customers, affected customers, highly valued customers, etc. Under a certain set of rules, account managers may call their customers to inform them personally. If the application is not down and has a notification box, this is a good opportunity to inform actual users of problems.
- Partners / Channels. Don’t forget your partners. Sometimes in the heat of an incident they are not notified. It may affect them and their customers.
The points I am trying to nail down are:
- Do not risk having customers discover on their own that there are problems – if they are likely to find out, make sure you are the one informing them.
- Try to determine all this activity prior to the incident, not while you’re in the middle of it.
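One way to decide all of this ahead of time is to write the notification rules down as data, keyed by the classification. The following is only a minimal sketch: the urgency and impact levels, the audiences, the channels and the names `PLAN`, `DEFAULT` and `notifications_for` are all made up for illustration and would have to be replaced with whatever your own organization has agreed on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    audience: str  # who gets told
    channel: str   # how they get told


# Illustrative plan keyed by (urgency, impact). Every value here is an
# assumption, not a recommendation from the post.
PLAN = {
    ("high", "high"): [
        Notification("internal staff", "phone call"),
        Notification("all customers", "status page"),
        Notification("affected customers", "email"),
        Notification("partners", "email"),
    ],
    ("high", "low"): [
        Notification("internal staff", "chat"),
        Notification("affected customers", "status page"),
        Notification("partners", "email"),
    ],
    ("low", "low"): [
        Notification("internal staff", "email"),
    ],
}

# Fallback for a classification nobody planned for: tell the staff and
# update the status page rather than stay silent.
DEFAULT = [
    Notification("internal staff", "chat"),
    Notification("all customers", "status page"),
]


def notifications_for(urgency: str, impact: str) -> list[Notification]:
    """Look up who must be notified, and how, for a classified incident."""
    return PLAN.get((urgency, impact), DEFAULT)


if __name__ == "__main__":
    for n in notifications_for("high", "high"):
        print(f"Notify {n.audience} via {n.channel}")
```

Keeping the plan as a plain table means it can be reviewed calmly by sales, support and operations long before an incident, and nobody has to improvise the list in the middle of one.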
Note: Status Page
This is part of the Notification process, but it merits its own section.
The first Status Page I implemented was at a SaaS provider whose service was business critical. Before we implemented it, each event, real or imaginary, would generate hundreds of calls to the support center. The lines would clog up and the customers would leave angry or frustrated messages. They would try again later and still get the ‘please leave your message’. After the event was over the exhausted CSRs would have to open a helpdesk ticket for every recorded message, and call back the users. This was not only wasted effort and time consuming, but we ended up with many frustrated customers.
Once the Status Page was implemented, it took a few weeks to get the customers used to checking it, and the number of calls we got during an incident dropped by two orders of magnitude!
Keep in mind that the Status Page should be updated regularly, with a timestamp attached. Any information that can be provided to the customers will boost their confidence and give them a sense of how soon the problem will be resolved.
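As a small illustration of the timestamp point, here is a sketch of a helper that stamps every Status Page entry in UTC. The function name, the message format and the optional "next update" hint are my own assumptions; how the text actually reaches your Status Page depends on the tool you use.

```python
from datetime import datetime, timezone


def format_status_update(message, next_update=None, now=None):
    """Return a status line prefixed with a UTC timestamp.

    `now` can be passed in for testing; it defaults to the current time.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = now.strftime("%Y-%m-%d %H:%M UTC")
    line = f"[{stamp}] {message}"
    if next_update:
        # Telling customers when to expect news helps keep them off the phones.
        line += f" (next update expected by {next_update})"
    return line


if __name__ == "__main__":
    print(format_status_update("Investigating login failures", "15:00 UTC"))
```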
Escalate – Get the relevant people working on the problem ASAP
With the Escalation Path planned in advance, as recommended in the previous post, this should be a straightforward process. Some issues may be resolved by a level-1 operator, but assume that in major incidents everybody will be involved. It is important to stick with the escalation path so as not to hinder the Investigation process.
It is imperative that an Incident Manager (IM) be assigned to the particular event. The assignment may be decided in advance or ad hoc. The IM gathers the relevant staff in the War Room (below), manages the whole process, assigns tasks, collects information and ensures that the whole process is recorded.
Investigate – Determine the root cause
At this point we should have the following:
- An assigned Incident Manager
- People of relevance gathered together in the ‘War Room’, whether physical or virtual
- Understanding of the problem – what is not functioning
- Understanding of the impact – who is suffering from it and how urgent is it
- Understanding of the affected component – sometimes it is obvious from the outset that a major component is down, via monitoring or a report from a service provider, but in complex systems this is not always possible. Sometimes a problem in one sub-system will manifest itself as a problem in another, dependent sub-system. The Known Problems in the knowledgebase should be very helpful.
- The Component-Customer mapping, as described in the previous post, can help determine the culprit.
- Assuming you have been following the practices of the STORM™ Change Management, you will have at your fingertips a query of all changes made to the system in the last X hours. There is a very high correlation between changes and failures, so reviewing this list will often make the problem obvious. Keep in mind that changes should include everything in your production domain, including your service providers and your customers.
- The Knowledgebase, as described in the previous post, might point to similar cases that were encountered in the past.
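The "all changes in the last X hours" query mentioned above can be sketched in a few lines. This is not how STORM™ stores or queries changes; it is a hypothetical in-memory example, and the `Change` record with its `component` and `source` fields is an assumption made purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    when: datetime
    component: str    # what was touched
    source: str       # e.g. "internal", "service provider", "customer"
    description: str


def recent_changes(changes, now, hours):
    """Return changes made in the last `hours` hours, newest first."""
    cutoff = now - timedelta(hours=hours)
    hits = [c for c in changes if cutoff <= c.when <= now]
    return sorted(hits, key=lambda c: c.when, reverse=True)


if __name__ == "__main__":
    now = datetime(2024, 3, 1, 12, 0)
    log = [
        Change(datetime(2024, 3, 1, 11, 30), "web", "internal", "deployed new build"),
        Change(datetime(2024, 3, 1, 9, 0), "db", "service provider", "storage maintenance"),
        Change(datetime(2024, 2, 28, 9, 0), "dns", "service provider", "record update"),
    ]
    for c in recent_changes(log, now, hours=4):
        print(c.when, c.component, c.source, c.description)
```

Listing the newest change first reflects the correlation noted above: the most recent change is usually the first suspect.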
Note: War Room
As described in the Prologue, a quiet environment that admits only people who might contribute to the process is vital. The War Room should include up-to-date information on all aspects and allow open communication between all parties. There should be a single entity, the Incident Manager, running the show, gathering information and assigning tasks to the various participants.
It is important to keep out of the room any person who might add unnecessary pressure, and the IM should feel confident enough to kick the CEO out of the room if necessary.
Remember that a customer support representative is present as well. The CSR’s job is to report on any new developments from the customers’ point of view and to communicate any progress to the customer base, preferably through the Status Page.
Restore Service – Allow your customers to continue working
While still in the War Room, the team carries out the process of restoring the service. There are usually three options:
- Resolving the problem. Sometimes the issue is straightforward and can be resolved by firing up a backup server, restarting a Windows service, switching to the last reliable version, or even relaunching the application. If there is a high probability that taking such action could bring the service back within minutes (this is open to interpretation), that is obviously the preferable route. A knowledgebase of Known Solutions would be a great asset at this point. Predefined scripts, as part of the KB, would be even better.
- Workaround. When the problem is not well understood, there may be no guarantee that any remedial action will bring the service up, or that the problem will not recur shortly after it does. In that case there should be a workaround solution. Such a solution might be temporary (such as reverting to the last working version or database) and may include reduced functionality, but it will at least allow the customers to get back to work until the problem is resolved.
- Failover. Assuming redundancy across production systems (locations?) or a DR site is available, there is always the option of failing over to the backup service. This is not an easy decision and not without its costs, but if a workaround is not available and resolving the problem at the production site is going to take a long time, restoring service to your customers is paramount.
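The order of preference among these three options can be written down as a simple decision rule. Treat the sketch below as a conversation starter for your own runbook rather than a policy: the 15-minute limit for a "quick fix" is an arbitrary assumption (the post deliberately leaves "within minutes" open to interpretation), and the function and parameter names are hypothetical.

```python
from enum import Enum
from typing import Optional


class RestoreOption(Enum):
    RESOLVE = "resolve"
    WORKAROUND = "workaround"
    FAILOVER = "failover"


def choose_restore_option(
    fix_eta_minutes: Optional[int],
    workaround_available: bool,
    failover_available: bool,
    quick_fix_limit: int = 15,  # assumed threshold, tune to your own SLAs
) -> RestoreOption:
    """Pick a restoration route in the order of preference described above.

    `fix_eta_minutes` is the estimated time for a likely fix, or None when
    the problem is not understood well enough to estimate.
    """
    if fix_eta_minutes is not None and fix_eta_minutes <= quick_fix_limit:
        return RestoreOption.RESOLVE
    if workaround_available:
        return RestoreOption.WORKAROUND
    if failover_available:
        return RestoreOption.FAILOVER
    # Nothing else is on offer, so keep working on the fix itself.
    return RestoreOption.RESOLVE


if __name__ == "__main__":
    print(choose_restore_option(5, True, True).value)
    print(choose_restore_option(None, False, True).value)
```

In practice the Incident Manager makes this call with far more context than three parameters can capture, but agreeing on the default order in advance removes one argument from the War Room.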
Throughout the whole process the Status Page should be updated, and obviously, when service has been restored to a satisfactory level, that should be communicated. It is up to the Incident Manager to verify that this is being done and up to the CSR to perform it. The Incident Manager should not be assigned tasks herself; her only responsibilities are to make sure that everything is being documented and to calmly coordinate the activity in the War Room.
In the next post – the Epilogue – we will look at the events and activities that take place after the service has been restored.