Learn from AWS outages with After-Action Reviews (AAR)
This morning at Adzerk we experienced a host of operational problems because of AWS outages. We became aware of the problem early and pulled together to restore all Adzerk systems. After we recovered, I began to think about how we'll improve our response to outages like this. One way we'll improve is by understanding what happened and thinking about how we should respond in the future. We'll do this during a kind of meeting known as an After-Action Review (AAR).
I learned about AARs in the Army. The AAR is a short, memorable agenda for the kind of meeting that is best conducted right after any significant "event". In the Army, "events" are things like training exercises, accidents, or enemy contact.
AARs are time-sensitive so that whatever is learned can be applied to the next event, which could come at any time. The time-sensitivity of AARs is primarily what distinguishes them from similar practices in other professions, like "post-mortem" write-ups in software engineering and morbidity and mortality meetings in medicine. AARs are also notable in that they are team events: to be effective, they require broad participation.
I find the AAR meeting a useful way for software teams to re-group and quickly learn together after events such as faults or service outages.
Tomorrow, we will conduct an AAR in order to collectively understand what happened and how we might want to respond differently when something like this happens again.
Using the AAR
The AAR is a leadership tool, and calling one is a leader's or manager's responsibility. In software teams, though, I think it's a good idea to let anyone call one, because it's an easy step to forget.
Once scheduled, the meeting should involve as many of the people who participated in the event as possible, from the first person to notice it was happening to the person who called the meeting.
In addition to a facilitator, it's good to have an assigned note-taker who will share notes with the team. Leaders should also take their own notes.
The facilitator motivates discussion by asking the following four questions of participants, which anyone can answer:
- What was supposed to happen?
- What did happen?
- How can we improve next time? ("What didn't work?" or "improvements")
- What should we do the same way next time? ("What did work?" or "sustains")
The AAR is not about blame. Events can and will happen for any reason, including human error, and it's no less the team's responsibility to respond to and learn from events. Furthermore, human error is often a consequence of inadequate training or communication, which are failures of leadership or others on the team - not of the individuals who made the error.
Good AARs end with some commentary from leadership. Specific praise and a short list of changes to be made immediately in case the event happens again soon are good ways to conclude.
An outcome of the AAR will be a list of TODOs and action items. Leaders should prioritize and specify these and then turn them into tasks.
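For a software team that likes to keep its notes in a script or repository, the agenda and its outcome could be captured roughly like this. This is only a minimal sketch: the `AARNotes` class, its method names, and the sample answers are all hypothetical and made up for illustration. They are not part of any Army or Adzerk tooling.

```python
from dataclasses import dataclass, field

# The four facilitator questions, in the order the meeting asks them.
AAR_QUESTIONS = (
    "What was supposed to happen?",
    "What did happen?",
    "How can we improve next time?",
    "What should we do the same way next time?",
)


@dataclass
class AARNotes:
    """Hypothetical container for the note-taker's record of one AAR."""

    event: str
    # One list of answers per question, keyed by the question text.
    answers: dict = field(default_factory=lambda: {q: [] for q in AAR_QUESTIONS})
    # (task, owner) pairs that leaders later turn into tracked tasks.
    action_items: list = field(default_factory=list)

    def record(self, question, answer):
        # Only the four agenda questions are accepted, which keeps the
        # discussion on the agenda and away from "who is to blame?".
        if question not in self.answers:
            raise ValueError(f"not an AAR question: {question!r}")
        self.answers[question].append(answer)

    def add_action_item(self, task, owner):
        self.action_items.append((task, owner))

    def render(self):
        # Plain-text summary the note-taker can share with the team.
        lines = [f"AAR: {self.event}"]
        for question in AAR_QUESTIONS:
            lines.append(question)
            lines.extend(f"  - {answer}" for answer in self.answers[question])
        lines.append("Action items")
        lines.extend(
            f"  - {task} (owner: {owner})" for task, owner in self.action_items
        )
        return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder content only; not a record of any real incident.
    notes = AARNotes(event="example outage")
    notes.record(AAR_QUESTIONS[1], "Alerts fired late")
    notes.add_action_item("Tune alert thresholds", "ops lead")
    print(notes.render())
```

Keeping the questions in a fixed tuple means every AAR produces notes in the same shape, which makes it easier to compare one review with the next.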
Further AAR References
- This YouTube video gives a nice overview: AAR (After Action Review) Definition & Explanation