Guide to Efficient Operations for Cloud

Overview

The Efficient Operations Pillar focuses on solutions that meet requirements for cloud workload insights, digitized processes, and a proactive operational posture. This is enabled through practices and guidance on deploying teams, automation, and AI tooling to monitor, manage, and maintain solutions in a secure, reliable, and performant manner.

Principles

Examples of operational models include “You build it, you run it” or establishing a Site Reliability Engineering practice. These and other operational models depend on understanding the needs and context of the business, customers, administrators, and development teams.

Operating models should be continuously assessed, adjusted, and tailored to an organization’s needs, considering factors such as industry, regulatory requirements, existing solutions, and user objectives.

Automating common and routine operations tasks using scripting, intelligent agents, and other tools helps to maintain high service levels.

This principle should scale across teams and dependencies to achieve end-to-end efficiency, speed, accuracy, and agility, to reduce errors, and to ensure consistency within the operating environment.

Operations teams today have a wide choice of operational tools and may select many best-of-breed tools for specific operational aspects, which can lead to tool sprawl if not managed.

Variation across tools, and the integrations between them, can create challenges in on-boarding team members, controlling licensing and cost, and maintaining agility, and can increase vulnerabilities. Operations teams should continually seek to minimize and consolidate the operational tools and consoles in use.

Not every solution requires 24x7 availability or instantaneous response time. Efficient solutions must support multiple service levels within a single solution and enable workloads to be placed on the infrastructure that best meets the workloads' operating requirements.

Service levels also inform development teams about the application-level configuration and instrumentation needed to support business objectives and a positive user experience.

Efficient operations teams are multi-disciplinary, i.e. they contain all the skills needed to support a set of applications. Achievement of this capability requires consideration across the stack of workloads including application and infrastructure services.

Solutions and operations tooling must support this model by enabling both integration (across solution components), and segregation (from other solutions) of operations tools and information.

What is site reliability engineering?

Many operational practices are common across teams and can be both automated and exposed behind an API for broader access. For example, a team can go beyond creating automation to manage secrets and provide a long-running API that executes secrets management operations.

This approach scales across the on-boarding of new capabilities and integration into dynamic workflows. This can both standardize procedures and reduce the amount of waiting time between teams.
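
As an illustration of placing common automation behind an API, the following minimal sketch routes requests to secrets management operations. Everything in it is an assumption made for illustration: the `SecretsOperations` class is a hypothetical in-memory stand-in for a real secrets backend, and the request shape (`operation`, `name`) is invented here. A real service would sit behind an HTTP framework with authentication and audit logging.

```python
import secrets
from typing import Callable, Dict


class SecretsOperations:
    """Hypothetical in-memory stand-in for a real secrets backend."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def create(self, name: str) -> Dict[str, str]:
        if name in self._store:
            return {"status": "error", "reason": "already exists"}
        self._store[name] = secrets.token_hex(16)
        return {"status": "created", "name": name}

    def rotate(self, name: str) -> Dict[str, str]:
        if name not in self._store:
            return {"status": "error", "reason": "not found"}
        # Replace the stored value with a freshly generated one.
        self._store[name] = secrets.token_hex(16)
        return {"status": "rotated", "name": name}

    def delete(self, name: str) -> Dict[str, str]:
        if self._store.pop(name, None) is None:
            return {"status": "error", "reason": "not found"}
        return {"status": "deleted", "name": name}


def handle_request(ops: SecretsOperations, request: Dict[str, str]) -> Dict[str, str]:
    """Route one API request (assumed shape) to the matching operation."""
    routes: Dict[str, Callable[[str], Dict[str, str]]] = {
        "create": ops.create,
        "rotate": ops.rotate,
        "delete": ops.delete,
    }
    action = routes.get(request.get("operation", ""))
    name = request.get("name", "")
    if action is None:
        return {"status": "error", "reason": "unknown operation"}
    if not name:
        return {"status": "error", "reason": "missing name"}
    return action(name)
```

Because every caller goes through `handle_request`, other teams and workflows can invoke the same standardized procedure without waiting on the owning team.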

Practices

The following practices and guidance support the creation of efficient operational solutions. They help teams implement the principles of efficient operations and ensure the reliability, availability, and performance of complex systems. These practices help organizations achieve their user-centric reliability goals and maintain the health of their services.

Efficient operations practices are adjustable and can be right-sized to suit the specific needs and context of an organization's consumers, systems, and services.

Organization-specific requirements, workloads, and architectures will dictate the best practices to adopt. Continuous improvement through feedback, assessment, and alignment to the organization's cloud strategy is essential for continued efficiency and effectiveness.

The desired outcome is to create a culture of reliability and collaboration that enhances the user experience, drives business value and minimizes disruptions.

Each cloud application with its own administrative or SRE team will tend to adopt or build a monitoring solution. To minimize control planes, ops teams should prefer to work with a centralized tools team to onboard to a centralized monitoring solution.

Consolidating system, event, and application logs to a central location greatly simplifies operational monitoring and problem diagnosis by minimizing the number of log sources and locations that must be monitored and managed by operations staff. This allows teams to set up comprehensive monitoring systems to continuously collect and analyze metrics and logs from various components of the system. These systems trigger alerts when service level indicators (SLIs) deviate from acceptable ranges, allowing engineers or automated processes to respond quickly to early warning signals.
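
The alerting behavior described above can be sketched as a simple range check. This is a minimal illustration, not the behavior of any particular monitoring product; the `SliRange` type, the metric names, and the thresholds are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SliRange:
    """Acceptable range for one SLI (names and bounds are illustrative)."""
    name: str
    minimum: float
    maximum: float


def evaluate_slis(observed: Dict[str, float], ranges: List[SliRange]) -> List[str]:
    """Return one alert message per SLI that is missing or out of range."""
    alerts: List[str] = []
    for r in ranges:
        value = observed.get(r.name)
        if value is None:
            # A silent metric is itself a signal worth alerting on.
            alerts.append(f"{r.name}: no data received")
        elif not (r.minimum <= value <= r.maximum):
            alerts.append(f"{r.name}: {value} outside [{r.minimum}, {r.maximum}]")
    return alerts
```

In practice the `observed` values would come from the centralized metrics and log store, and the returned alerts would be routed to engineers or to automated remediation.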

IBM Observability with Instana

IBM Cloud Pak for AIOps

The old “break-fix” model just doesn’t work in modern IT environments with greater customer demands, sprawling multi-cloud solutions, and fewer skilled employees to manage it all. Artificial intelligence assisted operations (AIOps) tooling helps operations teams to maintain the availability, performance, and security of their environments, and to quickly identify and resolve potential and ongoing problems within their environment.

Adopting AIOps is enabled through key activities including:

Data collection across operational processes such as incident, problem, and change management
Model training aligned to SLOs and SLIs
Automated detection and triage across individual and integrated services
Automated response and remediation to enable self-healing systems
Continuous feedback and learning
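
To make the detection step concrete, here is a deliberately simple statistical stand-in: a baseline is "trained" from historical values and new values are flagged when they deviate by more than a chosen number of standard deviations. This is an assumption-laden sketch, not how any AIOps product works; real tooling uses far richer models. The function name and the default threshold of 3.0 are illustrative.

```python
from statistics import mean, stdev
from typing import List, Sequence


def detect_anomalies(
    baseline: Sequence[float],
    recent: Sequence[float],
    z_threshold: float = 3.0,
) -> List[int]:
    """Return indexes into `recent` whose z-score against `baseline` exceeds the threshold."""
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two observations")
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation
    if sigma == 0:
        # A perfectly flat baseline: any different value is anomalous.
        return [i for i, v in enumerate(recent) if v != mu]
    return [i for i, v in enumerate(recent) if abs(v - mu) / sigma > z_threshold]
```

Continuous feedback could be approximated by appending values that operators confirm as normal to the baseline before the next run.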

IBM Cloud Pak for AIOps

Infrastructure specifications and configuration are managed like code, i.e., using automated provisioning tools to enable configuration management and to ensure consistency of infrastructure across deployments.

Consistent infrastructure configuration in code ensures reproducible environments across software development life cycle (SDLC) stages and across deployments.

This approach enables key benefits including:

Governance, auditability and version control leveraging systems like Git
Enabling collaboration and the ability to roll back changes when needed
Automation and speed of provisioning and management of resources
Scalability according to thresholds and triggers
Re-usability across teams and projects
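
The core idea behind these benefits is that the desired state lives in version-controlled code and a tool computes the changes needed to reach it. The sketch below illustrates that idea only; it is not how Ansible or any specific provisioning tool is implemented, and the resource names and attributes are hypothetical.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, object]


def plan_changes(
    desired: Dict[str, Resource],
    actual: Dict[str, Resource],
) -> List[Tuple[str, str]]:
    """Compare declared and running resources and return (action, name) pairs."""
    plan: List[Tuple[str, str]] = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            plan.append(("create", name))
        elif name not in desired:
            plan.append(("delete", name))
        elif desired[name] != actual[name]:
            plan.append(("update", name))
    return plan


# Desired state as it might be declared in a repository (illustrative values).
desired_state = {
    "web": {"size": "small", "count": 3},
    "db": {"size": "large", "count": 1},
}
# State currently observed in the environment (illustrative values).
actual_state = {
    "web": {"size": "small", "count": 2},
    "cache": {"size": "small", "count": 1},
}
```

Calling `plan_changes(desired_state, actual_state)` yields a delete for `cache`, a create for `db`, and an update for `web`. Running it again once the environment matches the declaration yields an empty plan, which is what makes repeated deployments reproducible.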

Ansible

Product teams work with stakeholders to define Service Level Objectives (SLOs) and establish Service Level Indicators (SLIs) that measure resiliency and delivery of a service. Several SLIs may map to a single SLO (a many-to-one relationship), using measurable metrics such as latency, error rates, and uptime that contribute to meeting established objectives.

The establishment of SLOs and SLIs enables key benefits including:

Measurable goals to ensure teams have well-defined and clear targets
User-centric focus based on business expectations and user experience
Quantifiable metrics allowing objective measurement and assessment
Guidance for aligning on quality of service when using external vendors
Common measurement across teams (development, operations, business)
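
One common way to turn an SLO into a measurable goal is an error budget: the share of requests allowed to fail while still meeting the objective. The source does not prescribe this calculation; the sketch below follows a widely used SRE convention, and the function names and request counts are assumptions for illustration.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of events that were served successfully."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_failures = (1.0 - slo_target) * total_events
    actual_failures = total_events - good_events
    if allowed_failures == 0:
        # A 100% target leaves no budget: any failure exhausts it.
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures
```

For example, with a 99.9% target over 1,000,000 requests, 500 failed requests consume roughly half of the budget. A shared number like this gives development, operations, and business teams the same basis for deciding whether to ship changes or focus on reliability.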

Consistently develop procedures that outline roles, responsibilities, communication channels, and escalation paths during key processes such as incident, change and problem management. These procedures ensure functional, secure, scalable, and cost-efficient use of cloud resources.

Common Cloud Operations procedures include:

Provisioning and Deployment based on predefined templates of Infrastructure as Code
Monitoring and Alerting to notify teams when thresholds or health parameters are breached
Regularly backing up data and configurations to enable robust and timely recovery plans
Monitoring resource utilization and identifying cost optimization opportunities
Regularly conducting disaster recovery tests to validate the effectiveness of recovery plans
Analyzing usage and growth trends to forecast resource needs
Developing incident response plans to address critical events such as outages and breaches
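
The forecasting procedure in the list above can be approximated with a straight-line trend fitted to recent usage. This is a minimal sketch under the assumption that growth is roughly linear; the function name and the sample values are illustrative, and real capacity planning usually accounts for seasonality and step changes.

```python
from typing import Sequence


def forecast_linear(history: Sequence[float], periods_ahead: int) -> float:
    """Fit a least-squares line to equally spaced observations and extrapolate it."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations")
    if periods_ahead < 0:
        raise ValueError("periods_ahead must be non-negative")
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # The last observation sits at x = n - 1, so look periods_ahead beyond it.
    return intercept + slope * (n - 1 + periods_ahead)
```

With monthly storage usage of 100, 110, 120, and 130 units, the fitted trend grows by 10 units per month, so the forecast two months ahead is about 150 units.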

The implementation and management of well-defined procedures includes process simulations to emulate scenarios and ensure teams are well-prepared to execute across squads with quality and efficiency.

After an unexpected incident occurs, teams conduct blameless postmortems and investigations to identify contributing factors, root cause(s), and the efficiency of the response. This is followed by a digitized and/or automated solution to prevent similar incidents in the future.

This practice includes regular reviews and refinement based on data-driven insights, postmortems, and feedback from stakeholders. Additionally, as services are on-boarded and incorporated into environments, integration with existing automated solutions is key to maintain a proactive posture.

This ensures that processes do not remain static, but rather evolve to meet dynamic business objectives, requirements, and challenges.

Collaborate with security teams to ensure that security measures are integrated into the development, deployment, and maintenance processes.

This includes shifting security left in the Software Development Life Cycle (SDLC), with a focus on implementing policies through codified and automated solutions.

Consistent collaboration with security ensures that deployed workloads remain aligned to dynamic organization policies and business objectives. Other processes include employing regular security assessments, incorporating vulnerability management, and invoking periodic compliance checks.
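
Codified policies can be as simple as named checks that run automatically against each workload definition before deployment. The following is a minimal sketch of that idea; the policy names, the workload fields (`encrypted`, `public`, `labels`), and the rules themselves are hypothetical examples rather than recommendations from the source.

```python
from typing import Any, Callable, Dict, List

Workload = Dict[str, Any]
Policy = Callable[[Workload], bool]

# Each policy returns True when the workload complies (illustrative rules).
POLICIES: Dict[str, Policy] = {
    "encryption-at-rest": lambda w: w.get("encrypted") is True,
    "no-public-ingress": lambda w: not w.get("public", False),
    "owner-label-present": lambda w: bool(w.get("labels", {}).get("owner")),
}


def policy_violations(workload: Workload) -> List[str]:
    """Return the names of all policies the workload fails, in declaration order."""
    return [name for name, check in POLICIES.items() if not check(workload)]
```

Running `policy_violations` in a delivery pipeline and failing the build on a non-empty result is one way to keep deployed workloads aligned with policy as the rules evolve.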

Resources

IBM Cloud Pak for AIOps

a comprehensive AI-assisted operations management platform that helps operations teams contextualize operations data and collaboratively solve problems, and that provides proactive recommendations to help teams avoid problems before they occur.

IBM Instana Observability

a full-stack operations observability platform that integrates operations teams through a common platform and contextualized views that support all delivery teams including DevOps, SRE, platform engineering, and ITOps.

IBM Turbonomic

a full-stack visualization and operations automation platform that helps operations teams optimize infrastructure resources for cost and performance.

IBM DevOps Automation

an intelligent software tool that helps teams to deliver software more efficiently.

Red Hat Ansible

a hybrid cloud automation platform that automates repetitive tasks to save time and improve productivity.

OpenShift Pipelines