This episode was published on 6 August 2020 and is approximately 51 minutes long. It was made possible by Glow Your Soul and Anchor.fm.
Are you looking to improve the incident response process for your applications? Did a recent outage make for a bad night? This is the episode for you!
In this episode we discuss Incident Response at an Enterprise level! From monitoring, to response planning, troubleshooting, and how to handle an active event, we’ve got it covered. There’s even a starter template for your root cause analysis (RCA) available in the show notes.
This episode will help you be better prepared should your application fail.
Who is responsible for watching the alerts from your systems?
Who is first up for PagerDuty alerts?
Think about how to help the on-call resource when they have a long night
The person on call should be highly trained to do the job
The first person receiving an alert should be able to take action
Have an escalation process in case the first responder doesn’t respond or can’t fix the issue in a timely manner (see the escalation sketch after this list)
Duty Phone or Bat Phone
A dedicated phone for use by the On Call team
Have a process for ensuring the right person has the phone, if it’s a physical phone
If it’s a forwarding system, you’ll need to have a process for updating the destination phone number
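The escalation ladder can be captured as plain data. Here’s a minimal sketch in Python, with hypothetical contacts and timeouts; it isn’t tied to any specific paging product:

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    """One rung of the escalation ladder."""
    contact: str          # who gets paged at this rung
    ack_timeout_min: int  # minutes to wait for an acknowledgement

# Hypothetical policy: primary on-call first, then secondary, then the manager.
ESCALATION_POLICY = [
    EscalationStep(contact="primary-oncall", ack_timeout_min=5),
    EscalationStep(contact="secondary-oncall", ack_timeout_min=10),
    EscalationStep(contact="team-manager", ack_timeout_min=15),
]

def next_contact(minutes_unacknowledged: int) -> str:
    """Return who should be paged, given how long the alert has gone unacknowledged."""
    elapsed = 0
    for step in ESCALATION_POLICY:
        elapsed += step.ack_timeout_min
        if minutes_unacknowledged < elapsed:
            return step.contact
    return ESCALATION_POLICY[-1].contact  # keep paging the last rung

print(next_contact(7))  # -> secondary-oncall
```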
8 Min: Application Stack Training
How do you help the First Responder take action?
Create a “Run Book”
As part of making an application ready for production, make sure you have a run book that explains some basic troubleshooting steps (a minimal skeleton follows after this list)
What’s the first thing which should be done?
Who does the First Responder reach out to when the steps provided don’t fix the problem?
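A run book doesn’t need to be elaborate to be useful. Here’s a minimal skeleton; the headings and steps are illustrative, not a standard:

```
# Run Book: <Application Name>

## First steps
1. Check the health dashboard and recent deploys
2. Restart the application service (document the exact command here)
3. Verify dependent services: database, cache, message queue

## Escalation
- If the steps above don't fix it, contact: <SME name / team channel>
- Vendor ticket process: <link>

## Reference
- Log locations, dashboards, architecture diagram
```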
9:50 Min: Alerts which are White Noise
You can’t have meaningless alerts, or your On Call team will start to ignore them
Find a way to clean up spurious alerts
The severity and priority of the alert determine the notification medium (see the routing sketch below).
Not every system is important at 2 AM. Some sites might not be on your critical list for immediate action. Make sure your monitoring, alerts, and response processes take this into account.
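One way to encode that is a simple severity-to-medium table plus a criticality check. A minimal sketch in Python, with made-up severity names and channels:

```python
# Hypothetical mapping from alert severity to notification medium.
NOTIFICATION_MEDIUM = {
    "sev1": "page",        # wake someone up, even at 2 AM
    "sev2": "phone-call",  # urgent, but short of a page
    "sev3": "email",       # handle during business hours
    "sev4": "dashboard",   # visible, but no push notification
}

def route_alert(severity: str, business_critical: bool) -> str:
    """Pick the notification medium; non-critical systems never page at night."""
    medium = NOTIFICATION_MEDIUM.get(severity, "dashboard")
    if medium == "page" and not business_critical:
        return "email"  # don't wake anyone for a site that can wait until morning
    return medium

print(route_alert("sev1", business_critical=False))  # -> email
```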
12:45 Min: Everything Always Works, Right?
No, unfortunately, shit happens.
“Don't wait until the third or fourth time you have an outage to put a plan in place. Do it after the first one! (Or before you have one!)”—Steve Ledwith
14:30 Min: Running an Outage
When the First Responder can’t fix it, start getting others involved.
Maybe you need to alert a third party, or open a ticket with a vendor
Get the right subject matter experts involved
Start a bridge line / conference call / hangout
Pro Tip: Do not have multiple silos of communication that require someone on the main call to go talk with another set of resources. Playing Telephone with support won’t make things better.
You never know who will have the answer to the problem at hand. Having everyone in the same place, be it a conference room, phone call, or hangout, allows everyone to hear the important (and not as important) information.
Working on the problem together magnifies the energy of the team. Don’t cheat yourself.
19 Min: Running an Incident
Define the key players in the Run Book
What is a Run Book?
According to Wikipedia, a run book is a compilation of routine procedures and operations that the system administrator or operator carries out. It usually contains procedures to begin, stop, supervise, and debug the system.
Both internal and external audiences need updates
Have a regular, pre-defined cadence for updates (a template follows below)
Every 30 minutes, post an update to your public-facing site
Ignore this group of stakeholders at your own peril! Let them know what’s going on and the current status. Get in front of the problem.
Keep Senior Leadership in the know, and tell them when the next update will come. This group wants to support you; help them do so.
“Communicate regularly when you are working through an issue. If you're not communicating, people assume there's no progress.”—Steve Ledwith
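A pre-written update template removes one more decision while you’re under pressure. A minimal sketch; the fields are illustrative:

```
[INCIDENT UPDATE #3, posted 02:30]
Status:       Investigating / Identified / Monitoring / Resolved
Impact:       <who or what is affected, and how>
What we know: <one or two sentences>
What's next:  <current action>
Next update:  03:00 (every 30 minutes until resolved)
```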
Have a person dedicated to reviewing the log files for the applications
This person has to know enough to filter out the noise (a filtering sketch follows after this list)
Make sure the people who need access have it; follow the proper channels to grant additional access, if necessary
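The noise filtering can start as a script the log reviewer keeps handy. A minimal sketch, assuming plain-text logs and a list of patterns the team has already triaged as harmless:

```python
import re
import sys

# Hypothetical patterns already triaged as harmless noise.
KNOWN_NOISE = [
    re.compile(r"connection reset by peer"),
    re.compile(r"health[- ]?check", re.IGNORECASE),
]

def interesting_lines(log_path: str):
    """Yield ERROR/WARN lines that don't match a known-noise pattern."""
    with open(log_path) as f:
        for line in f:
            if "ERROR" not in line and "WARN" not in line:
                continue
            if any(p.search(line) for p in KNOWN_NOISE):
                continue
            yield line.rstrip()

if __name__ == "__main__":
    for line in interesting_lines(sys.argv[1]):
        print(line)
```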
Every application is different, and the skills necessary to troubleshoot it will vary every time
Learning how to troubleshoot is an art… It’s based on experience and insight, and knowing how things work together
It’s hard to learn how to figure out the right things to look at
This is the hardest part of the process, and there’s no magic here
Have a plan for how you’re going to do all of these things
Think about it like the fire drill safety plan from your in-person office, or from your school when you were a kid
Practice it. Review it. Make sure those involved know the plan.
What’s your process for granting access to production?
Have a script for what you want to accomplish when an outage starts
“You train the way you fight!”—Imran Kasam
Shoot. Move. Communicate.
Have a mantra for your team.
Focus on the problem at hand, not all the rest of the things going on.
Take a shot at fixing it.
Move on to the next step.
Communicate to your users and stakeholders
Pro Tip: Establish a Note Taker / Outage Tracker
You want someone taking notes and keeping track of time.
You need someone to call out when you’re repeating tasks and to make sure you’re not duplicating efforts.
This is a huge help for the root cause analysis and for working through your timeline (a sample running log follows below).
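The running log doesn’t have to be fancy; timestamped one-liners are enough. An illustrative sketch:

```
02:05  Alert fired: checkout latency above threshold
02:12  First responder acknowledged; bridge line opened
02:20  App servers restarted (no effect)
02:35  DBA joined the call; found a blocking query
02:41  Query killed; latency recovering
02:55  Monitoring confirms recovery; incident closed
```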
34 Min: Root Cause Analysis (RCA)
After an off hours outage, get some rest.
When you’re back at it, figure out what happened.
You probably have a Service Level Agreement (SLA) with your users or customers.
You will have executives who want to know the complete down time and impact to the business.
Be prepared to investigate and learn from what happened.
You could have contracts with penalties related to downtime which will require specifics.
What does “5 Nines” mean?
99.999% (“five nines”) means 5.26 minutes of downtime in a year
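The arithmetic behind the nines is simple: allowed downtime = (1 - availability) x minutes in a year. A quick sketch:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

for nines in range(2, 6):
    availability = 1 - 10 ** -nines                   # e.g. 5 -> 0.99999
    downtime = (1 - availability) * MINUTES_PER_YEAR  # minutes of downtime allowed
    print(f"{availability:.{nines - 2}%} uptime -> {downtime:,.2f} minutes of downtime per year")
```

Running it shows the jump from nines to minutes: 99.9% allows 525.60 minutes of downtime a year, while 99.999% allows only 5.26.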