What is Problem Management?
According to the IT Infrastructure Library (ITIL), the goal of Problem Management is to minimize the adverse impact of Incidents or prevent these Incidents from occurring at all. Problem Management seeks to get to the root cause of Incidents and take actions to improve or correct the situation. Since problems are typically the cause of one or more incidents, it’s only logical that the best way for organizations to reduce incidents is to have a solid Problem Management process.
Working both reactively and proactively, problem management strives to identify known errors. In many cases, the permanent solution to a known error is too costly to deploy or cannot be achieved for some time. This is where workarounds come in: they are identified and maintained in a Known Error Database (KEDB). With a KEDB, documented known errors and workarounds can be indexed and reused whenever an incident recurs. This should replace the archaic method of keeping ‘sticky notes’ near the phones to remind you which procedures to follow to restore service. Sounds easy, right?
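To make the KEDB idea a little more concrete, here is a minimal sketch of a known-error store that indexes each record by symptom keywords, so a Service Desk agent can paste in an incident description and get back matching workarounds. The record fields (`error_id`, `summary`, `symptoms`, `workaround`) and the naive keyword-matching lookup are illustrative assumptions, not an ITIL-defined schema or any particular tool’s API.

```python
from dataclasses import dataclass, field


@dataclass
class KnownError:
    # Hypothetical fields for illustration; real KEDB tools define their own schema.
    error_id: str
    summary: str
    symptoms: set[str]
    workaround: str


@dataclass
class KnownErrorDatabase:
    records: dict[str, KnownError] = field(default_factory=dict)
    # Inverted index: symptom keyword -> ids of known errors listing that symptom.
    index: dict[str, set[str]] = field(default_factory=dict)

    def add(self, record: KnownError) -> None:
        self.records[record.error_id] = record
        for keyword in record.symptoms:
            self.index.setdefault(keyword.lower(), set()).add(record.error_id)

    def lookup(self, incident_description: str) -> list[KnownError]:
        """Rank known errors by how many of their symptom keywords appear in the text."""
        # Naive tokenizing: lowercase, split on whitespace, trim common punctuation.
        words = {w.strip(".,;:!?") for w in incident_description.lower().split()}
        hits: dict[str, int] = {}
        for word in words:
            for error_id in self.index.get(word, ()):
                hits[error_id] = hits.get(error_id, 0) + 1
        # Most keyword matches first; ties broken by id for a stable order.
        ranked = sorted(hits, key=lambda eid: (-hits[eid], eid))
        return [self.records[eid] for eid in ranked]


if __name__ == "__main__":
    kedb = KnownErrorDatabase()
    kedb.add(KnownError("KE-001", "VPN drops after sleep",
                        {"vpn", "disconnect", "sleep"}, "Restart the VPN client service."))
    kedb.add(KnownError("KE-002", "Printer queue stalls",
                        {"printer", "queue", "stuck"}, "Clear and restart the print spooler."))
    for match in kedb.lookup("User reports VPN disconnect after laptop sleep."):
        print(match.error_id, "->", match.workaround)
```

Keyword matching is deliberately simple here; the point is only that indexing known errors by symptom lets a recurring incident be matched to an approved workaround instead of a sticky note.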
Problem management: the process
There’s no shortage of frameworks, standards, and bodies of knowledge that offer great information not only on the problem management process but also on root cause analysis. These include the IT Infrastructure Library (ITIL), Control Objectives for Information and Related Technology (COBIT), and ISO/IEC 20000, to name a few. Although each approaches IT from a different perspective, they all agree on one thing: Problem Management is a discrete process and should be managed as such.
Below is a high-level model of a typical problem management process. Note that although some might call this process Root Cause Analysis, Root Cause Analysis is actually an activity within the process: one that strives to determine the actual fault causing the incidents.
How can Problem Management be the source of problems?
I submit that Problem Management alone does not accomplish the goals mentioned above, as many other processes come into play. Consider this real scenario, in which I was personally involved:
About three years ago, I was a senior-level IT leader at a managed service provider that essentially ran ‘outsourced’ IT departments for the small to mid-sized market. Unfortunately, our first (and largest) customer contacted us to say they would be leaving our datacenter and moving to a competitor. Ouch. We knew what the issue was. Our datacenter had recently experienced multiple incidents that caused a lot of dissatisfaction among our clients. All we really had to do was fix incident management, right? Wrong. After several conversations, the real reason for their displeasure came out. It turned out that our incident management actually worked quite well. The core issue was that we were attacking the same incidents over and over…
Aha, the real issue was problem management, right? Well, yes and no.
So, off we went to fix the problem management process. As expected, once the process matured, the recurring incident count declined, only to start rising again within a few months.
What do we do now? Go back through problem management to see where the issue is? No.
We were actually deploying changes into the live environment that were creating new incidents! On to change management.
You see, it is impossible to compartmentalize your processes and simply ignore their inputs and outputs, but you can at least get a handle on some core process discipline. The model below illustrates some other processes that affect the quality of problem management.
Here are a few tips to get you on your way
I have assembled a few tips below that might help in your problem management quest. These are my top ten, based on my own experience, so you may have a few of your own to add to the list:
- It’s only as good as the resources you put on it. I’ve seen many organizations staff and manage this process from the Service Desk; let’s call this Level 1. Remember, Level 2 and Level 3 staff are the most appropriate resources to tackle problems. That doesn’t mean you can’t involve Level 1 folks in particular problems; however, their primary role is a different one, and that’s called Incident Management.
- Be aware of other processes. So you have identified the problem and a solution. Now what? Don’t make the mistake of simply deploying the solution, as many other processes must be involved: don’t forget about Financial Management, Change Management, Release Management, and so on.
- Be cautious of unrealistic SLAs. It is common to apply resolution timeframes to different priority levels when dealing with incidents, but problems are a different story altogether. Remember that a problem is essentially an unknown error until you determine its cause (making it a known error) and can identify either a workaround or a final solution. Applying specific resolution timeframes to problems may therefore encourage hasty decisions. I’ve found it more appropriate to commit higher levels of effort to higher-priority problems rather than set specific timeframes.
- Synchronize categorization schemas. I once ran into a situation where we could not effectively compare problem ticket closure to incident reduction. Why? We used different categorization schemas for problems and incidents. If possible, keep the way you categorize incidents and problems consistent.
- Don’t implement – adopt. Deploying a meaningful Problem Management process is never actually complete. Know that you will always be improving and updating the process. Document the process in some type of run book, job book, or Standard Operating Procedure (SOP) so that you have a single version of how the process operates.
- Use models, frameworks, ideas, and methodologies that work. Good practices in problem management can be found in several industry frameworks such as ITIL and COBIT, so do yourself a favor and understand what they say about the process. There are also a few good tools that help specifically with root cause analysis: Kepner-Tregoe, Ishikawa, Pareto, and Cause/Effect analysis are a few worth considering.
- Treat problems like projects. If you break down some core attributes, you might see, as I have, that problems have a lifecycle much like projects… so why not treat them in a similar manner? The similarities are pretty simple: each has a definite start date, a defined scope, some type of deliverable(s), and a need for resources (people, time, money). As a result, problem tickets compete for the same resources and attention that projects do. I’m not saying you should track problem tickets through your project management function, but believe me, it sure helps to keep the two synchronized.
- You have to measure the process. First and foremost, our job is to deliver quality services to our customers and ensure that those services align with their needs. Create your metrics before you deploy the process, and be ruthless about collecting, analyzing, reporting, and acting on them.
- Build some type of reference. Most of us know this as the Known Error Database (KEDB). If you don’t have a tool with this capability, create a document or database that records all known errors and approved workarounds. Typically, the problem management process owns it. Believe me, your Service Desk will thank you.
- Manage organizational change. Last but not least, make sure you consider organizational change enablement when formally adopting this process.
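Two of the tips above, synchronizing categorization schemas and using Pareto analysis, fit together naturally. When incidents and problems share one category scheme, you can rank incident categories by volume to see which few account for most incidents, then compare counts per category before and after a problem is closed. The sketch below assumes incidents can be exported as a plain list of category labels per period, and uses an 80% cumulative cutoff as a conventional illustrative threshold; the function names and sample categories are hypothetical.

```python
from collections import Counter


def pareto_categories(incident_categories: list[str],
                      cutoff: float = 0.8) -> list[tuple[str, int, float]]:
    """Return the top categories whose cumulative share of incidents first reaches `cutoff`.

    Each tuple is (category, incident count, cumulative share of all incidents).
    """
    counts = Counter(incident_categories)
    total = sum(counts.values())
    result: list[tuple[str, int, float]] = []
    running = 0
    # most_common() yields categories in descending order of incident count.
    for category, count in counts.most_common():
        running += count
        share = running / total
        result.append((category, count, share))
        if share >= cutoff:
            break
    return result


def incident_change(before: list[str], after: list[str], category: str) -> int:
    """Change in incident count for one category between two periods (negative = fewer)."""
    return Counter(after)[category] - Counter(before)[category]


if __name__ == "__main__":
    # Made-up sample data: one category label per incident ticket in each period.
    before = ["network"] * 12 + ["email"] * 5 + ["printing"] * 2 + ["storage"]
    after = ["network"] * 4 + ["email"] * 5 + ["printing"] * 2
    for category, count, share in pareto_categories(before):
        print(f"{category}: {count} incidents, cumulative {share:.0%}")
    print("network change:", incident_change(before, after, "network"))
```

A drop in one category after a related problem is closed is consistent with the fix working, but it is only evidence, not proof; the comparison is possible at all only because incidents and problems use the same categories.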
This is not a complete set of good practices, but it should be enough to get you on your way. There are many other great ideas out there, and I don’t mean to ignore them, so if you have any thoughts or ideas, please share!
Good luck in your Problem Management adoption!