
Supporting Your SysAdmins with HugOps

28 June 2021

5 min read

Author

Ingo Averdunk

Distinguished Engineer

When the going gets tough, show empathy and appreciation for your SysAdmins.

There certainly is a proliferation of *Ops: DevOps, ChatOps, AIOps, GitOps, DataOps, ModelOps, MLOps. But there is one Ops concept that is quite special: HugOps. HugOps is a way to show empathy and appreciation for the people that operate the service — your SysAdmins, Site Reliability Engineers (SRE), Production Engineers and Support Center staff.

While physical hugging may not be possible due to COVID-19, or you may simply not feel comfortable hugging people, showing appreciation is not only possible; it is a must. Research has shown that Community Care (sharing the burdens together) can effect change and help people manage stress better. A tweet or Slack message with the #HugOps hashtag can go a long way in showing your empathy with the people in the fire.

     

    A stressful calling

    Working in operations is a stressful job. Businesses and customers depend on the services offered, and the costs for downtime are rising. Aberdeen estimates that while five years ago, an outage cost about USD 260,000/hour, costs now are likely greater than USD 1 million.

    “Slow is the new Down” — even if there isn’t an outage, slowness can affect the bottom line. A delay in website load time can hurt conversion rate; mobile site visitors will leave a page that takes longer than three seconds to load, for example.

    The importance of services is ever increasing, and so are their reliability requirements. Moving from three 9s of availability (99.9%) to four 9s (99.99%) means the service can now be down for only about four minutes per month. Four minutes! A well-engineered operations function incorporates modern operations approaches (like SRE) on top of a reliable architecture. But still, bad things can happen. Carrying the weight of handling an incident is certainly enormously stressful.
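The arithmetic behind that figure can be sketched as follows. This is a minimal illustration that assumes a 30-day month; the function name `allowed_downtime_minutes` is hypothetical and not taken from any tool.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Return the downtime budget in minutes for a period of `days` days."""
    unavailable_fraction = 1 - availability_percent / 100
    return days * MINUTES_PER_DAY * unavailable_fraction


# Compare three, four and five 9s over an assumed 30-day month.
for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows {minutes:.2f} minutes of downtime")
```

Under these assumptions, 99.9% leaves roughly 43 minutes of downtime in 30 days, while 99.99% leaves a little over 4 minutes.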

    It is important to recognize the different aspects of stress in this job. As with other emergency responders, three aspects of stress can be considered (see “Stress Management for Emergency Responders”):

    • Day-to-day stress: Getting ready to respond to an interrupt (powering up the computer, connecting to the system), coordinating daily life, etc.
    • Critical incident stress: Performing the incident response for an incident in flight.
    • Cumulative, chronic stress: Results from an accumulation of various stresses inherent in the job — repeating incident patterns, feeling helpless, feeling alone, etc.

    As you can see, even small doses of stress add up, leading towards a risk of chronic stress.

    A way out

    We must find ways to tackle the impact of stress in this discipline. Social support from friends and family can help people get through stressful times. This is where #HugOps comes into play. Rather than just putting additional pressure on top of a stressful job, the technical community can show empathy and support to the people in the fire. By sharing the burdens and vitalizing our community, we can help the operations team to cope with stress better and faster.

    Having strong social ties helps us get through stressful times and lowers anxiety.

    A culture of collaboration and support

    Beyond the typical approaches of #HugOps (Tweets, sending food/sweets/swag), below are some thoughts on applying the objective of HugOps in the enterprise.

    The main motivation is two-fold. First, operations is a team sport and a responsibility of everyone across the software development lifecycle, not just the team labelled “Operations.” The second aspect is psychological safety, the belief that you won’t be punished when you make a mistake (see: “High-Performing Teams Need Psychological Safety”):

    • Prevent burnout before (preparation), during and after (stress relief) an incident:
      • Improve the on-call schedule and policy (e.g., no more than two incidents per shift).
      • Prepare for the job through exercises like “Wheels of Misfortune.”
      • Measure workload and staff appropriately. Don’t treat Ops just as a cost play.
      • Provide the necessary training and technology to be well prepared for responding to incidents.
      • Offer relaxing and mindfulness services. Learn from other emergency response teams (such as emergency medical technicians, firefighters and police).
    • Collaboration. In the true spirit of DevOps, collaborate with your operations colleagues:
      • As a developer, actively support the operations team during the incident when they need help.
      • After the incident, seek active participation in the blameless (!) Post Incident Review, treating the incident as a learning opportunity for operations as well as development.
      • Perform continued improvement of the service by implementing the resulting non-functional actions and in turn reducing the technical debt.
      • Apply concepts like ChatOps to enable collaboration across organizational boundaries.
    • Shift the responsibility left. Don’t throw your operations colleagues under the bus by writing rogue code that lacks aspects such as reliability or observability:
      • Build better reliability and manageability into the software: 12-factor, build for reliability, build to manage.
      • Instrument the code to provide richer observability.
      • Follow secure coding practices.
      • Apply modern principles like chaos engineering to validate the robustness of the service.
    • Get in front of it:
      • Learn from incidents. If you don’t spend time analyzing and determining the conditions that must exist for an incident to take place, you won’t learn how to successfully remove or recover from these conditions in the future. Help each other learn.
      • Shift from reacting to avoiding: automation, observability, shift left, etc.
      • Give people time and tools to improve the service and the incident response. (“Sharpening the Axe” has come to mean taking action to make yourself better at your job, both long term and for the task at hand. Take a moment and think about how to go about the task in a smarter way.)
      • When giving people time, do this in a mindful way. For instance, use an on-call rotation scheme that differentiates between “interruptible” and “non-interruptible” work times.
      • Hold production readiness reviews to govern what gets into production. An interesting aspect of SRE is that the SRE team has the right to refuse support if the service doesn’t meet the requirements, switching to a “you build it – you run it” model.
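An on-call policy such as the two-incidents-per-shift example in the list above could be checked with a few lines of code. The following is a rough sketch, not the API of any real paging tool; the shift labels, the `overloaded_shifts` helper and the sample data are all hypothetical.

```python
from collections import Counter

# Policy threshold taken from the example in the text; tune it to your own policy.
MAX_INCIDENTS_PER_SHIFT = 2


def overloaded_shifts(incident_shifts, limit=MAX_INCIDENTS_PER_SHIFT):
    """Return {shift: incident_count} for every shift that exceeded the limit."""
    counts = Counter(incident_shifts)
    return {shift: n for shift, n in counts.items() if n > limit}


# Hypothetical data: one entry per incident, naming the shift that handled it.
incidents = ["mon-night", "mon-night", "tue-day", "mon-night", "wed-night"]
print(overloaded_shifts(incidents))
```

A report like this is only a starting point for a conversation about staffing and workload; it does not replace asking the people on call how they are doing.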

    Not every service needs five 9s of availability: Product Owners need to clearly negotiate and articulate the reliability targets of the service. Once these service level objectives (SLO) are defined, the appropriate measures need to be taken to be able to support these targets (architecture, implementation, operations).
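One common way to make an SLO actionable is to track how much unreliability the target still tolerates, which SRE practice often calls an error budget (a term this article does not use). The sketch below assumes a request-based SLO; the function name and the sample numbers are hypothetical.

```python
def error_budget_remaining(slo_percent, total_requests, failed_requests):
    """Return the unspent fraction of the error budget (negative if overspent)."""
    allowed_failures = total_requests * (1 - slo_percent / 100)
    if allowed_failures <= 0:
        raise ValueError("no error budget: SLO is 100% or there were no requests")
    return 1 - failed_requests / allowed_failures


# Hypothetical month: 1,000,000 requests against a 99.9% SLO, 250 of them failed.
remaining = error_budget_remaining(99.9, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget is left")
```

When the remaining budget approaches zero, the team has a shared, agreed signal to prioritize reliability work over new features.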

    Closing

    Operations is a stressful job. The Community Care practices described in this article will help reduce the stress significantly: teams are better prepared, respond better to unforeseen scenarios and learn from them. Sharing the burdens together, expressed visibly through #HugOps, will create a support system that destigmatizes burdening others.

    Stress and anxiety may still exist in your workspace, but there are simple ways to reduce the pressure you feel. These tips often involve getting your mind away from the source of stress. Self-awareness, exercise, stress-reduction techniques (such as mindfulness, meditation), music and hobbies (such as woodwork, photography, knitting) can all work to relieve anxiety — and they will improve your overall work-life balance as well. SAMHSA has some great resources on individual stress management planning for disaster response staff members.

    People typically only notice the operations role when something is not working; successful operations tend to go unnoticed. Safety management should move from ensuring that “as few things as possible go wrong” (so-called Safety-I) to ensuring that “as many things as possible go right.” This perspective is called Safety-II, and it relates to the system’s ability to succeed under varying conditions (see “From Safety-I to Safety-II”). Applying these concepts will be the topic of a future blog post.

