Episode 10: Incident Response

August 6, 2020 by Imran Kasam & Steve Ledwith

This episode was published on 6 August 2020 and is approximately 51 minutes long. This episode is made possible by Glow Your Soul and Anchor.fm.

Overview

Are you looking to improve your incident response process for your applications? Have a recent outage that made for a bad night? This is the episode for you!

In this episode we discuss Incident Response at an Enterprise level! From monitoring, to response planning, troubleshooting, and how to handle an active event, we’ve got it covered. There’s even a starter template for your root cause analysis (RCA) available in the show notes.

This episode will help you be better prepared should your application fail.

Listen on Apple Podcasts

Listen on Anchor.fm

Listen on Spotify

Show Notes & Selected Links

To be good at incident response, you have to spend time thinking about it! This episode will provide you with all the information you need to build your response plan.

Incident Prevention

Monitoring is key.

On Call Person

Who is responsible for watching the alerts from any systems?
Who is first up for PagerDuty alerts?
Think about how to help the on call person when they have a long night

First Responder

Highly trained to do the job
The first person receiving an alert should be able to take action
Have an escalation process in case the first responder doesn’t respond in a timely manner
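
As a rough illustration of an escalation process, here is a minimal Python sketch. Nothing in it comes from the episode: the names `EscalationStep`, `POLICY`, and `who_to_page`, the contact labels, and the 15 and 30 minute thresholds are all hypothetical placeholders you would replace with your own policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    contact: str        # hypothetical role or person to page
    after_minutes: int  # minutes without acknowledgement before paging them


# Hypothetical policy: the thresholds are placeholders, not recommendations.
POLICY = [
    EscalationStep("first responder", 0),
    EscalationStep("secondary on call", 15),
    EscalationStep("engineering manager", 30),
]


def who_to_page(minutes_unacknowledged: int, policy=POLICY) -> list[str]:
    """Return everyone who should have been paged by now, in escalation order."""
    return [
        step.contact
        for step in policy
        if minutes_unacknowledged >= step.after_minutes
    ]


if __name__ == "__main__":
    for minutes in (0, 20, 45):
        print(minutes, who_to_page(minutes))
```

The point of writing the policy down as data is that anyone on the team can read it, review it, and change it before the next outage.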

Duty Phone or Bat Phone

A dedicated phone for use by the On Call team
Have a process for ensuring the right person has the phone, if it’s a physical phone
If it’s a forwarding system, you’ll need to have a process for updating the destination phone number

8 Min: Application Stack Training

How do you help the First Responder take action?
Create a “Run Book”
As part of making an application ready for production, make sure you have a run book which explains some basic troubleshooting steps
What’s the first thing which should be done?
Who does the First Responder reach out to when the steps provided don’t fix the problem?

9:50 Min: Alerts which are White Noise

You can’t have meaningless alerts, or your On Call team will start to ignore the issue
Find a way to clean up spurious alerts
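
One possible way to find candidates for cleanup is to look at how often each alert fires compared with how often anyone acts on it. The sketch below assumes you can export your alert history as `(alert_name, was_actioned)` pairs. The function name `noisy_alerts`, the alert names, and both thresholds are hypothetical.

```python
from collections import Counter


def noisy_alerts(history, min_count=10, max_action_rate=0.1):
    """Return alert names that fire often but are rarely acted on.

    history: iterable of (alert_name, was_actioned) pairs (assumed format).
    """
    fired = Counter()
    actioned = Counter()
    for name, was_actioned in history:
        fired[name] += 1
        if was_actioned:
            actioned[name] += 1
    return sorted(
        name
        for name, count in fired.items()
        if count >= min_count and actioned[name] / count <= max_action_rate
    )


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [("disk-80-percent", False)] * 12 + [("db-down", True)] * 3
    print(noisy_alerts(sample))
```

An alert that shows up in this list is not automatically useless, but it is a good place to start the conversation about tuning or removing it.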

The severity and priority of the alert determine the medium of the notification.

Not every system is important at 2 AM. Some sites might not be on your critical list for immediate action. Make sure your monitoring, alerts, and response processes take this into account.
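
To make the idea of routing by severity and time of day concrete, here is a small Python sketch. The severity levels, the channel names ("page", "ticket", "email", "chat"), the 8 AM to 6 PM business hours, and the function `notification_medium` are all assumptions for illustration. Your own rules will differ.

```python
from datetime import time
from enum import Enum


class Severity(Enum):
    LOW = 1
    HIGH = 2
    CRITICAL = 3


def notification_medium(severity: Severity, business_critical: bool, now: time) -> str:
    """Pick a notification channel. All channel names and rules are hypothetical."""
    off_hours = now < time(8, 0) or now >= time(18, 0)
    if severity is Severity.CRITICAL and business_critical:
        return "page"    # wake someone up, whatever the hour
    if off_hours:
        return "ticket"  # everything else can wait until morning
    if severity is Severity.LOW:
        return "email"
    return "chat"


if __name__ == "__main__":
    print(notification_medium(Severity.CRITICAL, True, time(2, 0)))
    print(notification_medium(Severity.CRITICAL, False, time(2, 0)))
```

In this sketch a critical alert on a system that is not on the critical list still waits until morning, which is exactly the 2 AM distinction described above.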

12:45 Min: Everything Always Works, Right?

No, unfortunately, shit happens.

“Don't wait until the third or fourth time you have an outage to put a plan in place. Do it after the first one! (Or before you have one!)”—Steve Ledwith

14:30 Min: Running an Outage

When the First Responder can’t fix it, start getting others involved.
Maybe you need to alert a third party, or open a ticket with a vendor
Get the right subject matter experts involved
Start a bridge line / conference call / hangout

Pro Tip: Do not have multiple silos of communication which require someone on the main call to go talk with another set of resources. Support Telephone won’t make things better.

You never know who will have the answer to the problem at hand. Having everyone in the same place, be it a conference room, phone call, or hangout, allows everyone to hear the important (and not as important) information.

Your collective energy working on the problem together magnifies the energy of the team. Don’t cheat yourself.

19 Min: Running an Incident

Define the key players in the Run Book

What is a Run Book? According to Wikipedia, a run book is a compilation of routine procedures and operations that the system administrator or operator carries out. It usually contains procedures to begin, stop, supervise, and debug the system.

Communications

Both internal and external audiences need updates
Have a regular, pre-defined, cadence for updates
- Every 30 minutes post an update to your public facing site
Senior Leadership
- Ignore this group of stakeholders at your own peril! Let this group know what’s going on and the current status. Get in front of the problem.
- Keep the Senior Leadership in the know. Tell them when the next update will be too. This group wants to support you; help them do so.
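
A pre-defined cadence is easy to compute up front, so nobody has to work out when the next update is due in the middle of an outage. This is a minimal sketch; the function name `update_schedule` and the sample start time are hypothetical, and the 30 minute default simply mirrors the example above.

```python
from datetime import datetime, timedelta


def update_schedule(start: datetime, interval_minutes: int = 30, count: int = 4) -> list[datetime]:
    """Return the times at which the next `count` status updates are due."""
    step = timedelta(minutes=interval_minutes)
    return [start + step * i for i in range(1, count + 1)]


if __name__ == "__main__":
    outage_start = datetime(2020, 8, 6, 2, 0)  # made-up start time
    for due in update_schedule(outage_start):
        print(due.strftime("%H:%M"))
```

Including the next due time in each update tells your audience when to expect more news.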

“Communicate regularly when you are working through an issue. If you're not communicating, people assume there's no progress.”—Steve Ledwith

Troubleshooting

Have a person dedicated to reviewing the log files for the applications
This person has to know enough to filter out the noise
Make sure the people who need access have it; follow the proper channels to grant additional access, if necessary
Every application is different, and the skills necessary to troubleshoot it will vary every time
Learning how to troubleshoot is an art… It’s based on experience and insight, and knowing how things work together
It’s hard to learn how to figure out the right things to look at
This is the hardest part of the process, and there’s no magic here

Prior Planning

Have a plan for how you’re going to do all of these things
Think about it like a fire drill safety plan from your in-person office, or your school when you were a kid
Practice it. Review it. Make sure those involved know the plan.
What’s your process for granting access to production?
Have a script for what you want to accomplish when an outage starts
Initial communication
Grant access
Download logs

You train the way you fight! – Imran Kasam

Shoot. Move. Communicate.

Have a mantra for your team.
Focus on the problem at hand, not all the rest of the things going on.
Take a shot at fixing it.
Move on to the next step.
Communicate to your users and stakeholders

Pro Tip: Establish a Note Taker / Outage Tracker

You want to have someone taking notes and keeping track of time.
You need someone to call out when you’re repeating tasks, to make sure you’re not duplicating efforts.
Huge help for doing the root cause analysis and working through your timeline.
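
A shared document works fine for this, but if your note taker prefers a script, a timestamped log could look something like the sketch below. The class names `TimelineEntry` and `OutageLog`, the `already_tried` check, and the sample notes are all hypothetical. The duplicate check is a simple case-insensitive substring match, not anything clever.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TimelineEntry:
    at: datetime
    author: str
    note: str


@dataclass
class OutageLog:
    entries: list[TimelineEntry] = field(default_factory=list)

    def record(self, author: str, note: str, at: Optional[datetime] = None) -> TimelineEntry:
        """Add a timestamped note; defaults to the current UTC time."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, note)
        self.entries.append(entry)
        return entry

    def already_tried(self, action: str) -> bool:
        """True if an earlier note mentions this action (case-insensitive)."""
        wanted = action.strip().lower()
        return any(wanted in entry.note.lower() for entry in self.entries)


if __name__ == "__main__":
    log = OutageLog()
    log.record("note taker", "Restarted web service")
    print(log.already_tried("restarted web service"))
```

Because every entry carries a timestamp, the same log can be sorted and pasted into the timeline section of the root cause analysis.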

34 Min: Root Cause Analysis (RCA)

After an off hours outage, get some rest.
When you’re back at it, figure out what happened.
You probably have a Service Level Agreement (SLA) with your users, or customers.
You will have executives who want to know the complete down time and impact to the business.
Be prepared to investigate and learn from what happened.
You could have contracts with penalties related to downtime which will require specifics.

What Does “5 Nines” Mean?

99.999% (“five nines”) means 5.26 minutes of downtime in a year
See the full chart as part of High Availability on Wikipedia
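
The arithmetic behind that number is simple enough to check yourself. This sketch assumes an average year of 365.25 days; the function name `allowed_downtime_minutes` is just an illustrative choice.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% -> {allowed_downtime_minutes(pct):.2f} minutes per year")
```

For 99.999% this works out to about 5.26 minutes per year, matching the figure above. Each additional nine cuts the allowed downtime by a factor of ten.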

Continuous Learning with the Root Cause Analysis

As you’re working through the RCA look for things you can improve in your process.
How can you update your run book?
Look for new things to monitor?
Document what you did, what you looked into, and what worked.
Write down things which didn’t work.

Key Components of the RCA

Brief description of the event
Accurate timeline
People who were involved
Actions which were taken (things that worked, and those which didn’t)
Root Cause of the Outage
Remedy

Hold an Agile Style Retrospective about your Response

Have a retrospective around the response to the outage.
What worked?
What needs to be improved?
Did you have the right people involved?
How did your partners respond?
Did you include your partners?
Do you have the right level of service from your providers?
Do you have the right partners? Maybe they can’t meet your expectations!
Do you need to update any processes or procedures?
How can you streamline your response?
Was the response appropriate for the system?
Is there a place where you can spend a little money to make this process better?

Episode Wrap Up

Successfully managing an application outage requires prior planning, a documented process, and a dedication to follow-up and learning from past events.

Prevention
- Monitoring
- On call
- Escalating Alerts
- Run Book
Response process
- Bridge line
- Communication with multiple audiences
- Troubleshooting
- Tools
- Access
- People
Post Response Analysis
- Root Cause Analysis (RCA)
- Retrospective

As mentioned in the episode, we have a template for you to start with. You’ll want to dress it up to fit your business needs, but these are the important sections.

The template is available in ODT | PDF | TXT and is free to use as you see fit.

Don’t be the person in this photo by Andrea Piacquadio from Pexels. Have a plan!

Final thought: Don’t go dark. Respond. Communicate. Communicate. Make sure everyone knows you’re working on the problem. Communicate.

Disclaimer: Some of the links provided are affiliate links meaning, at no additional charge to you, The Architect and the Executive may earn a commission if you make a purchase.