Supporting Your SysAdmins with HugOps

By Ingo Averdunk

When the going gets tough, show empathy and appreciation for your SysAdmins

There certainly is a proliferation of *Ops: DevOps, ChatOps, AIOps, GitOps, DataOps, ModelOps, MLOps. But one Ops concept is special: HugOps. HugOps is a way to show empathy and appreciation for the people who operate the service: your SysAdmins, Site Reliability Engineers (SREs), Production Engineers and Support Center staff.

While physical hugging might not be possible, whether because of COVID-19 or because you simply don’t feel comfortable hugging people, showing appreciation is not only possible, it is a must. Research has shown that Community Care, the practice of sharing burdens together, can effect change and help people manage stress better. A tweet or Slack message with the #HugOps hashtag can go a long way toward showing empathy for the people in the fire.

A stressful calling

Working in operations is a stressful job. Businesses and customers depend on the services offered, and the costs for downtime are rising. Aberdeen estimates that while five years ago, an outage cost about USD 260,000 per hour, costs now are likely greater than USD 1 million.

“Slow is the new down”—even if there isn’t an outage, slowness can affect the bottom line. A delay in website load time can hurt conversion rate; mobile site visitors will leave a page that takes longer than three seconds to load, for example.

The importance of services is ever increasing, and so are their reliability requirements. Moving from three 9s of availability (99.9%) to four 9s (99.99%) means that the service can now be down for only about four minutes per month.

Four minutes. A well-engineered operations function incorporates modern operations approaches (like SRE) on top of a reliable architecture. But still, bad things can happen. Carrying the weight of handling an incident is certainly an enormous source of stress.
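
For readers who want to check the arithmetic behind the four-minute figure, here is a minimal Python sketch. It assumes a 30-day month (43,200 minutes); the function name `allowed_downtime_minutes` is illustrative and not taken from any particular tool.

```python
# Sketch: downtime allowance implied by an availability target.
# Assumption: a 30-day month, i.e. 30 * 24 * 60 = 43,200 minutes.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the minutes of downtime permitted by the availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    # The unavailable fraction of the period is the downtime allowance.
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes per month")
```

Under that assumption, three 9s allows roughly 43 minutes of downtime per month, while four 9s allows a little over four.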

It is important to recognize the different aspects of stress in this job. As with other emergency responders, three aspects of stress can be distinguished (see “Stress Management for Emergency Responders”):

Day-to-day stress: Getting ready to respond to an interrupt (powering up the computer, connecting to the system), coordinating daily life and so on.
Critical incident stress: Performing the incident response for an incident in flight.
Cumulative, chronic stress: Results from an accumulation of various stresses inherent in the job—repeating incident patterns, feeling helpless, feeling alone and so on.

As you can see, even small doses of stress add up, leading toward a risk of chronic stress.

A way out

We must find ways to tackle the impact of stress in this discipline. Social support from friends and family can help get through stressful times. This is where #HugOps comes into play.

Rather than just putting more pressure on top of a stressful job, the technical community can show empathy and support to the people in the fire. By sharing the burdens and vitalizing our community, we can help the operations team to cope with stress better and faster.

Having strong social ties helps us get through stressful times and lowers anxiety.

A culture of collaboration and support

Beyond the typical approaches of #HugOps (tweets, sending food, sweets, swag), here are some thoughts on applying the objective of HugOps in the enterprise.

The main motivation is two-fold. First, operations is a team sport and the responsibility of everyone across the software development lifecycle, not just the team labeled “Operations.” The second aspect is psychological safety, the belief that you won’t be punished when you make a mistake (see “High-Performing Teams Need Psychological Safety”). The following practices put both into action:

Prevent burnout before (preparation), during and after (stress relief) an incident:
- Improve the on-call schedule and policy (for example, no more than two incidents per shift).
- Prepare for the job through exercises like “Wheel of Misfortune.”
- Measure workload and staff appropriately. Don’t treat Ops just as a cost play.
- Provide the necessary training and technology to be well prepared for responding to incidents.
- Offer relaxation and mindfulness services. Learn from other emergency response teams (such as emergency medical technicians, firefighters and police).
Collaboration. In the true spirit of DevOps, collaborate with your operations colleagues:
- As a developer, actively support the operations team during the incident when they need help.
- After the incident, actively participate in the blameless (!) Post Incident Review, treating the incident as a learning opportunity for operations as well as development.
- Continuously improve the service by implementing the resulting non-functional actions, which in turn reduces technical debt.
- Apply concepts like ChatOps to enable collaboration across organizational boundaries.
Shift the responsibility left. Don’t throw your operations colleagues under the bus by writing rogue code that lacks qualities such as reliability or observability:
- Build better reliability and manageability into the software: 12-factor, build for reliability, build to manage.
- Instrument the code to provide richer observability.
- Follow secure coding practices.
- Apply modern principles like chaos engineering to validate the robustness of the service.
Get in front of it:
- Learn from incidents. If you don’t spend time analyzing and determining the conditions that allow an incident to take place, you won’t learn how to remove or recover from these conditions in the future.
- Shift from reacting to avoiding: automation, observability, shift left and so on.
- Give people time and tools to improve the service and the incident response. (“Sharpening the Axe” has come to mean acting to make yourself better at your job, both long term and for the task at hand. Take a moment and think about how to go about the task in a smarter way.)
- When giving people time, do it in a mindful way. For instance, use an on-call rotation scheme that differentiates between “interruptible” and “non-interruptible” work times.
- Hold production readiness reviews to govern what gets into production. An interesting aspect of SRE is that the SRE team has the right to refuse support if the service doesn’t meet the requirements. In that case, responsibility shifts to a “you build it, you run it” model.

Not every service needs five 9s of availability. Product Owners need to clearly negotiate and articulate the reliability targets of the service. Once these service level objectives (SLOs) are defined, appropriate measures must be taken across architecture, implementation and operations to support them.
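
One way to make such a target measurable, sketched here as an assumption rather than a prescribed method, is to track how much of the failure allowance implied by the SLO has been used. Counting by requests and the name `error_budget_remaining` are hypothetical choices made for illustration.

```python
# Sketch: how much of the failure allowance implied by an SLO is still unused.
# Assumption: the SLO is measured as the share of successful requests.


def error_budget_remaining(slo_percent: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return the unused fraction of the failure allowance.

    1.0 means no failures so far, 0.0 means the allowance is exactly used up,
    and a negative value means the SLO has been missed.
    """
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if failed_requests < 0:
        raise ValueError("failed_requests must not be negative")
    # Number of failed requests the SLO tolerates over this window.
    allowed_failures = total_requests * (1 - slo_percent / 100)
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    remaining = error_budget_remaining(99.9, 1_000_000, 400)
    print(f"Failure allowance remaining: {remaining:.0%}")
```

With a 99.9% target and one million requests, about 1,000 failures are tolerated, so 400 failures leave roughly 60% of the allowance. A value near zero is a signal to prioritize reliability work over new features.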

Closing

Working in operations is a stressful job. The Community Care practices described in this article will help reduce that stress significantly by helping teams prepare better, respond better to unforeseen scenarios and learn from them. Sharing the burdens together, and expressing that support visibly through #HugOps, will create a support system that destigmatizes burdening others.

Stress and anxiety might still exist in your workspace, but there are simple ways to reduce the pressure you feel. These tips often involve getting your mind away from the source of stress.

Self-awareness, exercise, stress-reduction techniques (such as mindfulness, meditation), music and hobbies (such as woodwork, photography, knitting) can all work to relieve anxiety—and they will improve your overall work-life balance as well. SAMHSA has some great resources on individual stress management planning for disaster response staff members.

People typically notice the operations role when something is not working. Successful operations tend to go unnoticed. Safety management should move from ensuring that “as few things as possible go wrong” (so-called Safety-I) to ensuring that “as many things as possible go right.” This perspective is called Safety-II, and it relates to the system’s ability to succeed under varying conditions (see “From Safety-I to Safety-II”). Applying these concepts will be the topic of a future blog post.

Links and references

#hugops in practice: Operationalizing Empathy. David Shackelford (PagerDuty). DevOpsDays 2015. https://legacy.devopsdays.org/events/2015-detroit/proposals/hugops and https://www.pagerduty.com/blog/devops/hugops-in-practice/
HugOps is the best ops. On empathy and site reliability engineering. James Governor (RedMonk). September 18, 2017. https://redmonk.com/jgovernor/hugops-is-the-best-ops-on-empathy-and-site-reliability-engineering/
HugOps for Humans. From self-care to selfless-care. Nitya Narasimhan. October 04, 2019. https://speakerdeck.com/nitya/hugops-for-humans-self-care-to-selfless-care
Stress Management for Emergency Responders. Understanding Responder Stress. Dr. Leslie Snider (Antares Foundation). https://www2c.cdc.gov/podcasts/media/pdf/AntaresPgm1.pdf
Disaster Responder Stress Management. Substance Abuse and Mental Health Services Administration (SAMHSA). U.S. Department of Health & Human Services. https://www.samhsa.gov/dtac/dbhis-collections/disaster-response-template-toolkit/disaster-responder-stress-management
High-Performing Teams Need Psychological Safety. Here’s How to Create It. Laura Delizonna. Harvard Business Review. August 24, 2017. https://hbr.org/2017/08/high-performing-teams-need-psychological-safety-heres-how-to-create-it
From Safety-I to Safety-II: A Whitepaper. Erik Hollnagel (University of Southern Denmark). Robert L. Wears (University of Florida Health Science Center). Jeffrey Braithwaite (Australian Institute of Health Innovation). 2015. https://www.england.nhs.uk/signuptosafety/wp-content/uploads/sites/16/2015/10/safety-1-safety-2-whte-papr.pdf

Author

Ingo Averdunk

Distinguished Engineer
