Mister IT incident management process

PURPOSE

The primary purpose is to restore normal service operation as quickly as possible and minimize the adverse impact to customers and business operations, thus ensuring that the best possible levels of service quality and availability are maintained.

SCOPE

Applies to all software and information technology services in use at Mister Car Wash.

PROCESS

The steps below cover the process from reporting an incident through closing it.

Incident is reported or detected – anyone noticing or hearing of an incident has the responsibility of working through this process. Team members must monitor and triage tickets throughout the day.
Incident identification - Failures or potential failures need to be detected early so that the incident management process can be started quickly.
Incident logging - All incidents must be fully logged in the Service Desk (even those that seem like duplicates) so that anyone assisting in the resolution has immediate access to all information and to maintain a full historical record. Recording each reported issue helps to understand how widespread the issue is.
Incident categorization - Both Urgency and Impact need to be assessed. If the issue has already been reported, relate the issues together.
Incident prioritization - Once the impact and urgency are assessed, use the “Incident Priority” matrix shown below to identify the priority.
Incident management – If this is a priority 0-critical or 1-high incident, declare the Incident Owner and the Communication Owner. If possible, two different people should perform these roles so that the time spent communicating with users does not interfere with resolving the issue itself.
Priorities 0 or 1: Contact the ticket owner and confirm they’re aware of the assigned critical or high priority ticket.
Incident notification – The Communication Owner uses the Resolution and Notification table to notify the appropriate people based on priority.
Initial diagnosis – IT will utilize the collected information on the symptoms to initiate a search of the Knowledge Base to find an appropriate solution. IT will resolve the incident and close the incident if the resolution is successful.
Incident escalation - If the necessary information to resolve the incident is not in the Knowledge Base, the appropriate support group (including vendor support) must be consulted for further diagnostics and attempted resolution. Using timelines in the Resolution and Notification table, escalate and provide status updates.
Incident resolution – Work on diagnosis and solution until requester verifies that the resolution was satisfactory. An incident resolution does not require that the underlying cause of the incident has been corrected. The resolution only needs to make it possible for the requester to be able to continue their work.
Incident closure - Once the requester verifies the issue is resolved, close the ticket.
1. Confirm and/or update incident categorization so it is correct.
2. Update the ticket with resolution details.
3. Update the Knowledge Article with all troubleshooting and remediation steps.
4. Determine if this incident could recur and decide on preventative action.

Use the Incident Priority matrix below when an incident is reported or detected.

Incident Priority

| Urgency (rows) / Impact (columns) | High: Service or major portion of a service is unavailable. | Medium: Issue prevents personnel from performing business-critical, time-sensitive functions. | Low: Issue prevents personnel from performing a portion of their duties. |
|---|---|---|---|
| High: Significant damage is occurring or will occur rapidly (one or more stores, or all HQ affected). | Urgent | High | Medium |
| Medium: Damage increases considerably over time (part of a store, or HQ departments affected). | High | Medium | Low |
| Low: Damage marginally increases over time (one or two personnel affected). | Medium | Low | Low |
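The matrix above can be read as a simple lookup from an (urgency, impact) pair to a priority. The sketch below is illustrative only: the `Level` enum, the `PRIORITY_MATRIX` name, and the `incident_priority` function are assumptions introduced here, not part of the Service Desk tooling. The cell values are transcribed from the matrix.

```python
from enum import Enum


class Level(Enum):
    """Rating used for both urgency and impact (hypothetical helper type)."""
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


# Keys are (urgency, impact); values are the priority from the matrix above.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): "Urgent",
    (Level.HIGH, Level.MEDIUM): "High",
    (Level.HIGH, Level.LOW): "Medium",
    (Level.MEDIUM, Level.HIGH): "High",
    (Level.MEDIUM, Level.MEDIUM): "Medium",
    (Level.MEDIUM, Level.LOW): "Low",
    (Level.LOW, Level.HIGH): "Medium",
    (Level.LOW, Level.MEDIUM): "Low",
    (Level.LOW, Level.LOW): "Low",
}


def incident_priority(urgency: Level, impact: Level) -> str:
    """Return the priority for the assessed urgency and impact."""
    return PRIORITY_MATRIX[(urgency, impact)]
```

For example, `incident_priority(Level.MEDIUM, Level.HIGH)` returns `"High"`, matching the Medium urgency row and High impact column.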

Examples:

Low Issues that are not significantly affecting the site's ability to process cars.

Can't print from one machine
Single peripheral down
Report is not correct

Medium Issues that limit the site's ability to process cars

Single tablet down
Send Station down but has a workaround
Primary internet down but on backup (if processing cars normally)

High

Point of sale extremely slow
Kiosk down
Multiple tablets not working
Primary internet down, backup is up (not processing cars normally)
Manager workstation not working

Urgent

Entire store down
Server down
Send Station is non-functional
Primary and backup internet down
Unable to process credit cards

Categorization & Prioritization

The goals of proper categorization are to:

Identify the Impact and Urgency to determine the Priority
Indicate what support groups need to be involved
Capture meaningful metrics on system reliability

All incidents are important to the user, but incidents that affect large groups or mission critical functions need to be addressed before those affecting 1 or 2 people.

Resolution and Notification

Every issue will be addressed as soon as possible; however, most will need to be triaged and worked in priority order. The Resolution and Notification table below defines how to report an issue, the expected response time, the target resolution time, the communication method, and the update frequency.

For example, requesters reporting Critical (Urgent priority) incidents should expect a response within 15 minutes. Resources will be pulled from lower-priority issues to try to resolve the issue within one hour. The requester should call (rather than use email or the portal) to report an issue of this priority. Both the requester and IT must notify their managers immediately and provide hourly updates.

| Priority | How to Report | Expected Response Within | Target Resolution Time | When to Notify Manager | How to Respond | Response Update Frequency |
|---|---|---|---|---|---|---|
| Urgent | Phone | 15 Minutes | 1 Hour | Immediately | Phone, Email Follow-Up | Every hour until resolved |
| High | Phone | 15 Minutes | 1 Day | Immediately | Phone, Email Follow-Up | Every day until resolved |
| Medium | Portal, Email, Phone | 4 Hours | 2 Days | 1 day, if not resolved | Email | Every two days until resolved |
| Low | Portal, Email | 24 Hours | 5 Days | 2 days, if not resolved | Email | Every 5 days until resolved |
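As a rough illustration of how the response and resolution targets in the table above translate into deadlines, the sketch below adds each target to the time an incident was reported. The names `SLA_TARGETS` and `sla_deadlines` are hypothetical, and the sketch assumes targets run on calendar time, since the table does not say whether business hours apply.

```python
from datetime import datetime, timedelta

# Targets transcribed from the Resolution and Notification table.
# Assumption: "1 Day" means 24 calendar hours, not one business day.
SLA_TARGETS = {
    "Urgent": {"response": timedelta(minutes=15), "resolution": timedelta(hours=1)},
    "High": {"response": timedelta(minutes=15), "resolution": timedelta(days=1)},
    "Medium": {"response": timedelta(hours=4), "resolution": timedelta(days=2)},
    "Low": {"response": timedelta(hours=24), "resolution": timedelta(days=5)},
}


def sla_deadlines(priority: str, reported_at: datetime):
    """Return (response_due, resolution_due) for an incident."""
    targets = SLA_TARGETS[priority]
    return (
        reported_at + targets["response"],
        reported_at + targets["resolution"],
    )
```

Under these assumptions, an Urgent incident reported at 09:00 would be due a response by 09:15 and a resolution by 10:00 the same day.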

Issue Escalation

Requesters, IT, and management all play important roles in incident management. If the requester has not been contacted within the “Expected Response Within” time, or if the issue is not resolved by the “Target Resolution Time”, the requester must escalate the issue to their manager. If an IT team member has not been able to resolve the issue by the target resolution time, they must escalate the issue to their manager. Escalate to Mister IT using the tiers defined below.

| Escalation | Contact |
|---|---|
| Tier 1 | Support Center Supervisors |
| Tier 2 | Director of IT Support |
| Tier 3 | Director of IT |
| Tier 4 | Director of IT Operations |
| Tier 5 | Chief Technology Officer |
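The escalation rule and tiers above could be sketched as follows. This is a minimal illustration, not the Service Desk's actual logic: the function names are hypothetical, the deadlines are passed in by the caller, and the sketch assumes escalation moves up one tier at a time, which the process does not state explicitly.

```python
from datetime import datetime
from typing import Optional

# Contacts transcribed from the escalation tiers; index 0 is Tier 1.
ESCALATION_TIERS = [
    "Support Center Supervisors",
    "Director of IT Support",
    "Director of IT",
    "Director of IT Operations",
    "Chief Technology Officer",
]


def should_escalate(now: datetime, response_due: datetime,
                    resolution_due: datetime,
                    responded: bool, resolved: bool) -> bool:
    """True if the response or resolution target has been missed."""
    if resolved:
        return False
    if not responded and now > response_due:
        return True
    return now > resolution_due


def escalation_contact(tier: int) -> str:
    """Return the contact for a 1-based tier number."""
    if not 1 <= tier <= len(ESCALATION_TIERS):
        raise ValueError(f"unknown escalation tier: {tier}")
    return ESCALATION_TIERS[tier - 1]


def next_tier(tier: int) -> Optional[int]:
    """Return the next tier up, or None when already at the top tier."""
    return tier + 1 if tier < len(ESCALATION_TIERS) else None
```

For instance, `escalation_contact(2)` returns `"Director of IT Support"`, and `next_tier(5)` returns `None` because Tier 5 is the last tier listed.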

Responsibility Matrix (RACI)

| Obligation | Role Description |
|---|---|
| Responsible | Responsible to perform the assigned task |
| Accountable (only 1 person) | Accountable to make certain work is assigned and performed |
| Consulted | Consulted about how to perform the task appropriately |
| Informed | Informed about key events regarding the task |

This RACI pertains to Priority Issues identified as CRITICAL or HIGH.

Roles (matrix columns, in order): Requester, IT Staff, Service Desk Manager, Communication Owner, Incident Owner, Director/IT Leadership, Regional Manager, Technical Expert

Assignments for each activity are listed left to right; blank cells are not shown.

Record Incident in Service Desk: A, R
Incident Assignment: R / A, R
Incident Categorization: A, R
Incident Prioritization: A, R
Declare Incident Owner & Communication Owner: I / R / A / I / I
Incident Notification: I / A / R / C / I / I
Incident Diagnosis: C / I / C / I / A, R / C
Incident Resolution: C / I / I / A, R / I / I / I
Incident Escalation: R / I / C / I / A, R / I / R / C
Incident Closure: C / I / I / I / A, R / I / I
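The rule that each activity has only one Accountable role can be checked mechanically. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and the sample row is the Incident Escalation row, assuming its eight values map to the eight role columns in the order they appear.

```python
from typing import Dict, List, Set

ROLES = [
    "Requester", "IT Staff", "Service Desk Manager", "Communication Owner",
    "Incident Owner", "Director/IT Leadership", "Regional Manager",
    "Technical Expert",
]


def parse_cell(cell: str) -> Set[str]:
    """Split a cell such as "A, R" into the set {"A", "R"}."""
    return {code.strip() for code in cell.split(",") if code.strip()}


def accountable_roles(row: Dict[str, str]) -> List[str]:
    """Return the roles marked Accountable (A) for one activity."""
    return [role for role, cell in row.items() if "A" in parse_cell(cell)]


def has_single_accountable(row: Dict[str, str]) -> bool:
    """True when exactly one role is Accountable for the activity."""
    return len(accountable_roles(row)) == 1


# Incident Escalation row, assuming values align with ROLES in order.
incident_escalation = dict(zip(ROLES, ["R", "I", "C", "I", "A, R", "I", "R", "C"]))
```

With that assumed mapping, `accountable_roles(incident_escalation)` returns `["Incident Owner"]`, so the row satisfies the single-Accountable rule.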

Key Performance Indicators (KPIs)

Average Cost per Incident: Fixed and variable costs divided by the total number of incidents
Average Initial Response Time: Total time between when incidents are reported and when IT responds, divided by the total number of incidents
Average Resolution Time: Average time taken to resolve an incident
Percentage of Incidents by Priority: Proportion of total incidents broken down by priority
Total Incidents by priority: Quantity of incidents within a defined timeframe
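The KPI definitions above can be expressed as small calculations over incident records. The sketch below is illustrative: the `Incident` record and its field names are hypothetical rather than the Service Desk's export format, and resolution time is assumed to be measured from when the incident was reported.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    priority: str
    reported_at: datetime
    responded_at: datetime
    resolved_at: datetime


def average_initial_response_time(incidents: List[Incident]) -> timedelta:
    """Total report-to-response time divided by the number of incidents."""
    total = sum((i.responded_at - i.reported_at for i in incidents), timedelta())
    return total / len(incidents)


def average_resolution_time(incidents: List[Incident]) -> timedelta:
    """Average report-to-resolution time (assumed starting point: report)."""
    total = sum((i.resolved_at - i.reported_at for i in incidents), timedelta())
    return total / len(incidents)


def total_incidents_by_priority(incidents: List[Incident]) -> Counter:
    """Count of incidents per priority within the supplied timeframe."""
    return Counter(i.priority for i in incidents)


def percentage_by_priority(incidents: List[Incident]) -> Dict[str, float]:
    """Share of all incidents, in percent, for each priority."""
    counts = total_incidents_by_priority(incidents)
    return {p: 100 * n / len(incidents) for p, n in counts.items()}


def average_cost_per_incident(fixed_cost: float, variable_cost: float,
                              incident_count: int) -> float:
    """Fixed plus variable costs divided by the number of incidents."""
    return (fixed_cost + variable_cost) / incident_count
```

The caller is expected to pass only the incidents that fall within the reporting period, since the functions do no date filtering themselves.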

Definitions and Terms

Communication Owner: Person responsible for communicating status of the incident and gathering any additional information about the incident for the incident owner.
Impact: A measure of the effect of an incident, problem, or change, determined by how many personnel or functions are affected.
Incident: An unplanned interruption or reduction in quality of an IT Service.
Incident Manager: Person responsible for driving and continually improving the incident management process.
Incident Owner: Person responsible for ensuring the incident is resolved.
Priority: A category used to identify the relative importance of an incident.
Urgency: A measure of how long it will be until the issue has a significant impact on the business.

Related Procedures

Revision History

| Revised Date | Revised By | Revisions |
|---|---|---|
| 08/15/2019 | Lauren Babson | Document created |
| 08/03/2021 | Tam Rininger | Added Supervisors to the escalation plan. |
| 08/19/2021 | Tam Rininger | Removed Dir of IT Ops to the escalation Plan. |
| 10/14/2024 | Andrew Poskey | Moved to Fresh |
| 01/20/2025 | Yolanda Terrazas-Franco | Updated "VP of IT" to Chief Technology Officer in the Escalation Plan. |
| 01/27/2025 | Yolanda Terrazas-Franco | Added Director of IT and Director of IT Operations to the Issue Escalation Contacts. |
| 3/13/2025 | Andrew Poskey | Updated title and content formatting |
| 4/2/2025 | Andrew Poskey | Updated changes from Word Doc, Added related article section. |