Mister IT incident management process
PURPOSE
The primary purpose is to restore normal service operation as quickly as possible and minimize the adverse impact to customers and business operations, thus ensuring that the best possible levels of service quality and availability are maintained.
SCOPE
Applies to all software and information technology services in use at Mister Car Wash.
PROCESS
The steps below run through the process from reporting the incident through closing the incident.
- Incident is reported or detected – anyone noticing or hearing of an incident has the responsibility of working through this process. Team members must monitor and triage tickets throughout the day.
- Incident identification - Failures or potential failures need to be detected early so that the incident management process can be started quickly.
- Incident logging - All incidents must be fully logged in the Service Desk (even those that seem like duplicates) so that anyone assisting in the resolution has immediate access to all information and to maintain a full historical record. Recording each reported issue helps to understand how widespread the issue is.
- Incident categorization - Both Urgency and Impact need to be assessed. If the issue has already been reported, relate the issues together.
- Incident prioritization - Once the impact and urgency are assessed, use the “Incident Priority” matrix shown below to identify the priority.
- Incident management – If this is a priority 0-critical or 1-high incident, declare the Incident Owner and the Communication Owner. If possible, two different people should perform these roles so that time spent communicating with users does not interfere with resolving the issue itself.
  - Priorities 0 or 1: Contact the ticket owner and confirm they are aware of the assigned critical or high priority ticket.
- Incident notification – The Communication Owner uses the Resolution and Notification table to notify the appropriate people based on priority.
- Initial diagnosis – IT uses the collected information about the symptoms to search the Knowledge Base for an appropriate solution. If the resolution is successful, IT resolves and closes the incident.
- Incident escalation - If the necessary information to resolve the incident is not in the Knowledge Base, the appropriate support group (including vendor support) must be consulted for further diagnostics and attempted resolution. Using timelines in the Resolution and Notification table, escalate and provide status updates.
- Incident resolution – Work on diagnosis and solution until requester verifies that the resolution was satisfactory. An incident resolution does not require that the underlying cause of the incident has been corrected. The resolution only needs to make it possible for the requester to be able to continue their work.
- Incident closure - Once the requester verifies the issue is resolved, close the ticket.
  - Confirm and/or update the incident categorization so it is correct.
  - Update the ticket with resolution details.
  - Update the Knowledge Article with all troubleshooting and remediation steps.
  - Determine whether this incident could recur and decide on preventative action.
Use the Incident Priority matrix below when an incident is reported or detected.
Incident Priority
| Urgency \ Impact | High: Service or major portion of a service is unavailable | Medium: Issue prevents personnel from performing business critical, time sensitive functions | Low: Issue prevents personnel from performing a portion of their duties |
| --- | --- | --- | --- |
| High: Significant damage is occurring or will occur rapidly (one or more stores, or all HQ affected) | Urgent | High | Medium |
| Medium: Damage increases considerably over time (part of a store, or HQ departments affected) | High | Medium | Low |
| Low: Damage marginally increases over time (one or two personnel affected) | Medium | Low | Low |
Examples:
Low
Issues that are not significantly affecting the site's ability to process cars.
- Can't print from one machine
- Single peripheral down
- Report is not correct
Medium
Issues that limit the site's ability to process cars.
- Single tablet down
- Send Station down but has a workaround
- Primary internet down but on backup (if processing cars normally)
High
- Point of sale extremely slow
- Kiosk down
- Multiple tablets not working
- Primary internet down, backup is up (not processing cars normally)
- Manager workstation not working
Urgent
- Entire store down
- Server down
- Send Station is non-functional
- Primary and backup internet down
- Unable to process credit cards
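The Incident Priority matrix above can be read as a simple lookup from an (urgency, impact) pair to a priority. The following is a minimal sketch of that lookup, not a description of how the Service Desk actually computes priority. The names `PRIORITY_MATRIX` and `incident_priority` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: PRIORITY_MATRIX and incident_priority are
# hypothetical names, not part of any Service Desk product or API.
# The values are copied from the Incident Priority matrix.
PRIORITY_MATRIX = {
    # (urgency, impact): priority
    ("high", "high"): "Urgent",
    ("high", "medium"): "High",
    ("high", "low"): "Medium",
    ("medium", "high"): "High",
    ("medium", "medium"): "Medium",
    ("medium", "low"): "Low",
    ("low", "high"): "Medium",
    ("low", "medium"): "Low",
    ("low", "low"): "Low",
}


def incident_priority(urgency: str, impact: str) -> str:
    """Return the priority for the assessed urgency and impact levels."""
    key = (urgency.strip().lower(), impact.strip().lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"Unknown urgency/impact combination: {urgency!r}, {impact!r}")
    return PRIORITY_MATRIX[key]
```

For example, an entire store being down (high urgency, high impact) maps to `incident_priority("High", "High")`, which returns `"Urgent"`.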
Categorization & Prioritization
The goals of proper categorization are to:
- Identify the Impact and Urgency to determine the Priority
- Indicate what support groups need to be involved
- Capture meaningful metrics on system reliability
All incidents are important to the user, but incidents that affect large groups or mission critical functions need to be addressed before those affecting 1 or 2 people.
Resolution and Notification
Every issue will be addressed as soon as possible; however, most will need to be triaged and worked in priority order. The Resolution and Notification table below defines how to report an issue, the expected response time, the target resolution time, the communication method and the communication frequency.
For example, requesters reporting Critical (Urgent priority) incidents should expect a response within 15 minutes. Resources will be pulled from lower priority issues to try to resolve the issue within one hour. The requester should call (versus email or portal) to report this priority of issue. Both the requester and IT must notify their manager immediately and provide hourly updates.
| Priority | How to Report | Expected Response Within | Target Resolution Time | When to Notify Manager | How to Respond | Response Update Frequency |
| --- | --- | --- | --- | --- | --- | --- |
| Urgent | Phone | 15 Minutes | 1 Hour | Immediately | Phone, Email Follow-Up | Every hour until resolved |
| High | Phone | 15 Minutes | 1 Day | Immediately | Phone, Email Follow-Up | Every day until resolved |
| Medium | Portal, Email, Phone | 4 Hours | 2 Days | 1 day, if not resolved | Email | Every two days until resolved |
| Low | Portal, Email | 24 Hours | 5 Days | 2 days, if not resolved | Email | Every 5 days until resolved |
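The response and resolution targets in the Resolution and Notification table can be turned into concrete deadlines for a given ticket. The sketch below is one way to do that under stated assumptions: "day" is treated as a 24-hour calendar day (the table does not say whether business days are meant), and the names `SLA_TARGETS`, `sla_deadlines` and `needs_escalation` are hypothetical.

```python
from datetime import datetime, timedelta

# Illustrative sketch only. Targets are copied from the Resolution and
# Notification table. ASSUMPTION: 1 day = 24 calendar hours; the table does
# not specify business days versus calendar days.
SLA_TARGETS = {
    "Urgent": {"response": timedelta(minutes=15), "resolution": timedelta(hours=1)},
    "High": {"response": timedelta(minutes=15), "resolution": timedelta(days=1)},
    "Medium": {"response": timedelta(hours=4), "resolution": timedelta(days=2)},
    "Low": {"response": timedelta(hours=24), "resolution": timedelta(days=5)},
}


def sla_deadlines(priority: str, reported_at: datetime) -> tuple[datetime, datetime]:
    """Return (response_due, resolution_due) for an incident."""
    targets = SLA_TARGETS[priority]
    return reported_at + targets["response"], reported_at + targets["resolution"]


def needs_escalation(
    priority: str,
    reported_at: datetime,
    now: datetime,
    responded: bool = False,
    resolved: bool = False,
) -> bool:
    """True if the response or resolution target has been missed."""
    if resolved:
        return False
    response_due, resolution_due = sla_deadlines(priority, reported_at)
    if not responded and now > response_due:
        return True
    return now > resolution_due
```

With these assumptions, an Urgent incident reported at 9:00 has a response deadline of 9:15 and a resolution deadline of 10:00 the same day.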
Issue Escalation
Requesters, IT and management play important roles in incident management. If the requester has not been contacted within the “Expected Response Within” time, or if the issue is not resolved by the “Target Resolution Time”, the requester must escalate the issue to their manager. If an IT team member has not been able to resolve the issue by the target resolution time, they must escalate the issue to their own manager. Escalate to Mister IT using the tiers defined below.
| Escalation | Contact |
| --- | --- |
| Tier 1 | Support Center Supervisors |
| Tier 2 | Director of IT Support |
| Tier 3 | Director of IT |
| Tier 4 | Director of IT Operations |
| Tier 5 | Chief Technology Officer |
Responsibility Matrix (RACI)
| Obligation | Role Description |
| --- | --- |
| Responsible | Responsible to perform the assigned task |
| Accountable (only 1 person) | Accountable to make certain work is assigned and performed |
| Consulted | Consulted about how to perform the task appropriately |
| Informed | Informed about key events regarding the task |
This RACI pertains to Priority Issues identified as CRITICAL or HIGH.
Activity
Requester
IT Staff
Service Desk Manager
Communication Owner
Incident Owner
Director/ IT Leadership
Regional Manager
Technical Expert
Record Incident in Service Desk
A, R
Incident Assignment
R
A, R
Incident Categorization
A, R
Incident Prioritization
A, R
Declare Incident Owner & Communication Owner
I
R
A
I
I
Incident Notification
I
A
R
C
I
I
Incident Diagnosis
C
I
C
I
A, R
C
Incident Resolution
C
I
I
A, R
I
I
I
Incident Escalation
R
I
C
I
A, R
I
R
C
Incident Closure
C
I
I
I
A, R
I
I
Key Performance Indicators (KPIs)
- Average Cost per Incident: Fixed and variable costs divided by the total number of incidents
- Average Initial Response Time: Total time between when an incident is reported and when IT responds, divided by the total number of incidents
- Average Resolution Time: Average time taken to resolve an incident
- Percentage of Incidents by Priority: Proportion of total incidents broken down by priority
- Total Incidents by Priority: Quantity of incidents within a defined timeframe
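The KPI definitions above are simple averages and proportions, so they can be sketched in a few lines. The example below assumes each incident is a dictionary with `priority`, `reported_at`, `responded_at` and `resolved_at` keys. That record layout and all function names are hypothetical and do not reflect the Service Desk's actual export format.

```python
from collections import Counter
from datetime import datetime

# Illustrative sketch only. ASSUMPTION: each incident is a dict with the keys
# "priority", "reported_at", "responded_at" and "resolved_at" (datetimes).
# This layout is hypothetical, not the Service Desk's real data format.


def average_cost_per_incident(fixed_costs: float, variable_costs: float, incident_count: int) -> float:
    """Fixed and variable costs divided by the total number of incidents."""
    return (fixed_costs + variable_costs) / incident_count


def average_initial_response_minutes(incidents: list[dict]) -> float:
    """Mean time from report to first IT response, in minutes."""
    total = sum((i["responded_at"] - i["reported_at"]).total_seconds() for i in incidents)
    return total / len(incidents) / 60


def average_resolution_hours(incidents: list[dict]) -> float:
    """Mean time from report to resolution, in hours."""
    total = sum((i["resolved_at"] - i["reported_at"]).total_seconds() for i in incidents)
    return total / len(incidents) / 3600


def total_by_priority(incidents: list[dict]) -> dict[str, int]:
    """Number of incidents per priority."""
    return dict(Counter(i["priority"] for i in incidents))


def percentage_by_priority(incidents: list[dict]) -> dict[str, float]:
    """Share of all incidents per priority, as a percentage."""
    counts = total_by_priority(incidents)
    return {priority: 100 * n / len(incidents) for priority, n in counts.items()}
```

All functions expect a non-empty list of incidents already filtered to the reporting timeframe of interest.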
Definitions and Terms
- Communication Owner: Person responsible for communicating the status of the incident and gathering any additional information about the incident for the Incident Owner.
- Impact: A measure of the effect of an incident, problem or change, determined by how many personnel or functions are affected.
- Incident: An unplanned interruption or reduction in quality of an IT Service.
- Incident Manager: Person responsible for driving and continually improving the incident management process.
- Incident Owner: Person responsible for ensuring the incident is resolved.
- Priority: A category used to identify the relative importance of an incident.
- Urgency: A measure of how long it will be until the issue has a significant impact on the business.
Related Procedures
Revision History
| Revised Date | Revised By | Revisions |
| --- | --- | --- |
| 08/15/2019 | Lauren Babson | Document created |
| 08/03/2021 | Tam Rininger | Added Supervisors to the escalation plan. |
| 08/19/2021 | Tam Rininger | Removed Dir of IT Ops from the escalation plan. |
| 10/14/2024 | Andrew Poskey | Moved to Fresh |
| 01/20/2025 | Yolanda Terrazas-Franco | Updated "VP of IT" to Chief Technology Officer in the Escalation Plan. |
| 01/27/2025 | Yolanda Terrazas-Franco | Added Director of IT and Director of IT Operations to the Issue Escalation Contacts. |
| 3/13/2025 | Andrew Poskey | Updated title and content formatting |
| 4/2/2025 | Andrew Poskey | Updated changes from Word Doc, added related article section. |