SRE-003: The Modern DevOps Manifesto

Presenter: Andrea Crawford

Description: Cloud native applications are viewed as collections of microservices that deploy at various speeds, independently of each other. The number of moving parts amplifies the challenge of manually deploying cloud native apps quickly, rendering it an unsustainable effort. Conversely, a huge benefit of DevOps for cloud native is the opportunity to view “everything as code”: not just the cloud native app, but the full stack configuration of the platform and infrastructure that apps run on. Come hear about the “Modern DevOps Manifesto” and how DevOps is not just for apps and Dev teams anymore!

Keywords: devops, automation, cloud native

Speaker Bio:

  • Andrea C. Crawford
    Andrea has more than 20 years of cross-industry experience in application development, architecture, and accelerated delivery, with significant contributions to IBM’s own DevOps transformation and to clients’ DevOps journeys. Andrea provides technical oversight and leadership of technical enablement, solution design, and offering innovation that focus on accelerated application development with Agile, DevOps, and Cloud Native Development. Andrea applies her expertise in accelerated development across all industries, heterogeneous technologies and hybrid cloud scenarios.


SRE-004: Scaling Challenges faced by Video Conferencing Apps during Pandemic

Presenter: Adinarayana Haridas

Description: In early 2020, as the world became aware of Covid-19, people started to look for ways to communicate while staying at home. Video conferencing applications became the mainstream way to communicate online. Whether at schools, colleges, universities or software companies, the use of video conferencing applications rose sharply. Google Meet saw 30x growth in March 2020, Zoom saw 50x growth in usage of its application, and Cisco WebEx user volume tripled during this period. With this increase in demand, companies started to face scaling challenges in terms of infrastructure.
This lightning talk provides details about how Google has used SRE (site reliability engineering) best practices to keep its growth technically and operationally sustainable.

Keywords: Capacity, CapacityPlanning, Scalability

Speaker Bio:

  • Adi Adinarayana is a Site Reliability Engineer by profession and a technology enthusiast. Over the past years, with a good amount of travel, he has gained a large customer base and has had the privilege of being the first person called for troubleshooting activities. He has two client references and is a strong contributor to several Academy of Technology initiatives, an open source contributor and a regular Raspberry Pi/Arduino user. Spending time with his kids and on the tennis court are mandatory daily tasks.


SRE-008: Proactive SRE: Airline company flies high with auto-monitoring and standardized operation

Presenter: Alan Mok, Timothy Leung

Description: Over three years, IBM implemented various standards and tools to help a Hong Kong airline automate its application monitoring and standardize remediation. The results are striking: monthly incident volume dropped from 209 to 31 (across approximately 100 monitored applications), showing that the applications maintain high operational availability. The client satisfaction score rose from 66.5 to 84 after implementation.
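
As a quick check of the figures quoted above, the relative changes can be worked out directly. This is only an illustrative calculation: the helper name `percent_change` is hypothetical, and the inputs are the numbers stated in the description.

```python
def percent_change(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures quoted in the session description.
incident_change = percent_change(209, 31)        # monthly incident volume
satisfaction_change = percent_change(66.5, 84)   # client satisfaction score

print(f"Incident volume change: {incident_change:.1f}%")
print(f"Satisfaction score change: {satisfaction_change:.1f}%")
```

On these numbers, incident volume fell by roughly 85% and the satisfaction score rose by roughly 26%.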

Keywords: SRE, SOP, incident_reduction, availability, metrics_and_monitoring, emergency_response

Speaker Bio:

  • Alan Mok
    Architect, IBM Hong Kong – GBS HCS
    Alan is an architect with over 10 years of experience working on Cloud Architecture and 4 years of experience leading the Technical Operation Centre for IBM SaaS / PaaS products. He specializes in DevSecOps Establishment and Site Reliability Engineering.
  • Timothy Leung
    Architect, IBM Hong Kong – GBS HCS
    Timothy is an architect with 16 years of experience in application management, cloud solution development and enterprise cloud migration strategy consulting.
    He specializes in, and has a keen interest in, application performance tuning.


SRE-010: SRE Manager Panel

Moderator: Ingo Averdunk

Hannah Delvin, IBM
David Gill, IBM
Bastian Spanneberg, Celonis

Description: A panel of SRE Managers discusses implementation patterns for SRE from a manager’s perspective. We will discuss questions like:
– What was their journey to SRE, and to becoming an SRE Manager?
– How technical do you need to be as an SRE Manager?
– How do you build your SRE team?
– What are the biggest challenges?

Speaker Bio:

  • Bastian Spanneberg
    Bastian started his career as a software engineer but soon realized that there is much more to a successful software project than “just coding”, and so he began to look into the full lifecycle of software applications. After working in consulting for several years, where he focused on Continuous Integration, Continuous Delivery and DevOps, he joined Instana as a Site Reliability Engineer and has led the SRE team for the past two years. He is now Director SRE at Celonis, where he will help build a new team and establish SRE practices.
  • Hannah Delvin, IBM
    Hannah started her career in IBM’s Strategic Outsourcing area of the business. She spent many years working on customer sites as a Service Delivery Manager leading the technical teams before she moved to the Tivoli division of SWG. She managed the Customer Support Level 2 team for the Tivoli Monitoring products. In 2016 she joined the Cloud business as the SRE manager of the UK team for the Container service. She is now Programme Director for SRE, Security and Compliance for the IKS/ROKs and Satellite services.
  • David Gill, IBM
    With 20+ years of professional experience, David has led global Managed Services and Operations teams. David joined IBM almost 10 years ago as a Cloud Availability Manager, primarily focusing on the stability, reliability and availability of IBM Cloud SaaS services, managing major incidents while driving resolutions and remediation plans. He also had ownership of Change Management, ensuring adherence to compliance and streamlining cross-team processes. David moved to a DevOps Manager role in 2014, managing a SaaS DevOps engineering team. A significant achievement was managing a migration of a non-IBM third-party SaaS solution to Strategic Cloud Solution. He integrated best-of-breed monitoring and alerting solutions and championed a local DevOps COE for cross-SaaS synergy, encompassing guiding principles for HA, Automation, Continuous Delivery, Monitoring and Optimization. As the business evolves into an SRE organisation, David has shifted to managing SRE Tooling and Data Insights. He takes a holistic approach focused on automation and toil reduction through standardised processes supported by strategic tooling, evangelizes SRE practices with development, operational and product groups, and aligns key initiatives with OKRs to stay outcome focused.


SRE-011: A Quick Starter to Observability using OpenTelemetry

Presenter: Ayush Sharma, Vikrant Kaushik

Description: Telemetry data often lacks standardization, which promotes vendor lock-in and limits data portability. The OpenTelemetry project came into existence to mitigate these issues. Apart from providing a set of APIs, SDKs, tooling and integrations for creating and managing telemetry data in various languages, it offers a single, vendor-agnostic solution that supports both manual and automatic instrumentation.
The goal of this demo is to show attendees how to add OpenTelemetry to an existing distributed application and send standardised telemetry data to a backend of their choice for observability.

Speaker Bio:

  • Ayush Sharma
    Ayush is a Sr. Application Consultant at IBM and an open source contributor to projects like Fedora. He has worked as a full-stack software developer and as a DevOps consultant, helping companies embrace the DevOps and SRE movement. Ayush enjoys programming in Python, tinkering with distributed systems and has a knack for knocking down certifications.
  • Vikrant Kaushik
    Vikrant joined IBM in February 2020 and has more than 15 years of overall IT experience. As an SRE, Vikrant is responsible for implementing SRE best practices for a range of commercial accounts. Prior to this, Vikrant was part of Mahindra and DXC Technology, where he managed the SRE and DevSecOps practices for many industries and accounts.


SRE-015: Microservices above the Cloud – Designing the International Space Station for Reliability

Presenter: Robert Barron

Description: The International Space Station has been orbiting the Earth for over 20 years. It was not launched fully formed, as a monolith in space. It is built out of dozens of individual modules, each with a dedicated role—life support, engineering, science, commercial applications, and more. Each module (or container) functions as a microservice, adding additional capabilities to the whole. While the modules independently deliver both functional and non-functional capabilities, they were designed, developed, and built by different countries on Earth at different times and once launched into space (deployed in multiple different ways) somehow manage to work together—perfectly. In this session I will showcase lessons SREs can learn from the way the ISS was developed.

Keywords: SRE, Reliability, Incidents

Speaker Bio:

  • Robert Barron
    Robert works for IBM as an SRE, ChatOps, and AIOps Solution Engineer who enjoys helping others solve problems even more than he enjoys solving them himself. Robert has over 20 years of experience in IT development & operations and is happiest when learning something new. He blogs about operations, space and AIOps at


SRE-016: Launching the SRE Practice

Presenter: Catherine Darmas-Goings, Bob Abraham

Description: This presentation will focus on standing up an SRE practice and the transformation to become a high performing team. It will include tips for success and how to effectively engage all areas of the infrastructure and application teams. Finally, there will be discussion on potholes to avoid on your SRE journey.

Speaker Bio:

  • Catherine Darmas-Goings
    Catherine Darmas-Goings is passionate about the benefits that SRE can bring to an IT organization. A relative newcomer to the SRE world, she is currently helping to introduce and evangelize this methodology within Kyndryl and Anthem. Trained as a Program Manager, Catherine has 15 years of background in project management, leadership and strategic planning across various business sectors. This provides her clients and the organizations she supports with unique advantages. Currently, she works at Kyndryl as an Infrastructure Specialist on the Anthem account. Prior to that, Catherine spent 10 years at Anthem in Performance Engineering.
  • Bob Abraham
    Bob Abraham has worked in IT for nearly 40 years, mostly working for large customers of IBM. Through that time, he has worked in various roles developing, designing and supporting high volume, resilient systems. Some of his more notable technical designs included a mainframe-based interoffice communications system which detected network availability and latency issues (years before email), a high-performance database thread-reuse interface which reduced system resource overhead, and a number of cross-technology interfaces to enable 24×7 operation of core processing systems. Bob’s designs are based on “what we can do today”, and “how we evolve to the technology options of tomorrow”. As a result of his work, Bob was invited to participate in a Gartner workgroup on high-availability systems, and to work with the CICS lab in Hursley on product feature planning.
    Across any set of deliverables, whether technology-related, rebuilding a kitchen, or planning a road trip, Bob has embraced the importance of good project strategy and planning, and effective project management disciplines necessary to deliver results.
    Bob currently leads the SRE practice for the Anthem account.


SRE-019: The value of empathy vs sympathy in becoming a better SRE

Presenter: Debbie Yang

Description: There is a difference between sympathy and empathy which can influence how SREs handle issues and how they interact with customers. We can build better tools, have better client interactions and become better SREs if we understand the difference and integrate empathy into SRE activities.

Keywords: empathy, empathymap, understandingUserNeeds

Speaker Bio:

  • Debbie Yang
    Debbie is an SRE working with the Multicloud Management Platform (MCMP). She’s particularly interested in the customer experience and usability (UX) of applications, striving to continuously improve the end results as well as the processes that go into conceiving, developing and supporting them.


SRE-022: Toil reduction by using z/OSMF and Python script.

Presenter: Emiko Nakaya

Description: One of the important jobs of the PCM team is to gather system data periodically and analyze it. This is essential for stable system operation, but it is also time-consuming work. Our team uses z/OSMF and Python scripts to reduce the report-making work, which is toil, so that SEs can spend their time analyzing the data. I would like to introduce some PCM report-making workflows that use z/OSMF and Python scripts.

Keywords: z/OSMF, python

Speaker Bio:

  • Emiko Nakaya
    Emiko Nakaya is a z/OS Perf&Capacity Management specialist and delivery team leader of a large outsourcing project. She joined IBM GTS in 2001 and started delivering a local Perf&Cap management tool for z/OS to multiple SO projects. She is also a specialist in IBM z Decision Support, a major product for z/OS Perf&Cap management, and supports projects in creating and analyzing performance reports easily. She now works as a platform SRE in an SO project.


SRE-024: Injecting faults into hybrid cloud environments

Presenter: Frank Bagehorn, Daniel Firebanks-Quevedo

Description: In our talk we will demonstrate a fault injection platform and how we are able to inject faults into an application on both Kubernetes environments and virtual machines. Using such a fault injection platform allows the generation of operational data in situations that occur rarely or never, and thus supports the training of AI Operations models.

Speaker Bio:

  • Frank Bagehorn
    Frank Bagehorn is an IT Architect at IBM Research Zurich. His work focuses on innovation of the way that IT services are delivered and managed especially by infusing AI techniques. In performing his Research he can draw from many years of practical experience working in an IT Service Delivery organization. In his spare time, he enjoys Swiss nature and taking pictures.
  • Daniel Firebanks-Quevedo
    Daniel Firebanks-Quevedo is a software engineer at the IBM TJ Watson Research Center. His work sits at the intersection of AI and IT operations (AIOps), and he is passionate about applying AI to other fields of knowledge. A musician on the side, he enjoys collaborating with people in both data science and music projects.


SRE-025: Infrastructure as code in zDevOps

Presenter: Gerald Mitchell

Description: What SRE means in zDevOps and how Infrastructure as code is influencing business and application modernization on mainframe.

Keywords: IBMZ, ZDevOps, ZCICD, ZSRE

Speaker Bio:

  • Gerald Mitchell is a developer and Eclipse client architect for IBM Developer for z/OS in the z hybrid cloud portfolio in the z devOps space; he has worked in many roles for IBM including Host Access middleware front and back end development, services, build, and product architect, product support Global Response Team, Rational brand Serviceability Architect, and Jazz Foundation core engineering.


SRE-027: How SRE Helped Me Save Time & Money

Presenter: Hellen Fernandes Cavalcante

Description: With this brief presentation, I would like to share a little bit of my experience in the SRE Project.
At the beginning, I had almost no knowledge about the project, but I was able to immerse myself, along with the monitoring team, and I learned a lot that helped me understand how services function, analyze them and implement several improvements in my applications. With that, we made several advances in the analysis of user experience and satisfaction, data flow, specific analysis of microservices, etc. I am deepening my knowledge in this immense area every day, and today I am able to instruct members of my team in these analyses. Come learn more about tips, best practices and standardization that you can apply in your projects.

Keywords: Monitoring, DevSecOps, SRE

Speaker Bio:

  • Hellen Fernandes
    A Systems Analysis student at the Faculty of Technological Education of the State of Rio de Janeiro, Hellen started her career at IBM in 2016 as a young apprentice. Today, she works in CIO as a DevSecOps Facilitator for team applications and as an application tester and developer. In her free time, she plays soccer and loves watching series and listening to music.


SRE-028: Helm and Back again. An SRE guide to choosing between Operators and Helm

Presenter: Hilliary Lipsig

Description: The importance of choosing the best tool for the job cannot be overstated. However, needs change, and what works today may not work as well tomorrow. In today’s Cloud-Native ecosystem, it can be a challenge to pick the right tool given the number of choices we have. In this presentation we’re going to talk about when you might want to use a Helm chart vs an Operator, and the benefits of being able to create an Operator from your Helm chart down the line if needs change.
This allows Operations to be more agile in its response to immediate needs, and scalable and anticipatory of potential future needs as well.

Keywords: helm, operator, cloud-native

Speaker Bio:

  • Hilliary Lipsig
    Hilliary is an autodidact and start-up veteran who has frequently learned and applied technologies to get a job done. She’s had her hand in every part of the application delivery process, honing her skills originally as a QE engineer. Hilliary is an IT polyglot able to talk the lingo of both the Operations and Development teams. She’s currently a Principal SRE and team lead at Red Hat, and she’s passionate about process, consistency in tooling, and scalability.


SRE-030: Around the world in 24 hours!

Presenter: Jack P. Ciejek

Description: So you have an SRE team spanning the globe giving you a follow-the-sun rotation. How do you make sure the next SRE who’s going on-call has all the information they need to pick up where you left off? The answer is Slack! Join me to learn how we leverage Slack to maintain continuous and smooth hand-overs between multiple people across the globe to ensure our services continue functioning for our customers.

Speaker Bio:

  • Jack has been with IBM for over 30 years, working in a variety of roles on many products and platforms. For the last several years, however, Jack has settled into the role of Site Reliability Engineer for several Watson services.


SRE-031: Intro to Red Hat Team Dial Tone

Presenter: Jeremy Eder

Description: In this talk, we will cover how the Red Hat SRE team is evangelising across the company to foster a culture of building stable, secure, performant and boring “dial tone” services that enable success for both our internal and external customers. We will discuss the process of hardening our control plane for the recent Red Hat OpenShift on AWS (ROSA) product launch, as well as providing real world lessons-learned upfitting existing applications to be more operable. Finally, we will round out the session by sharing architectural guidelines for greenfield development of new services with an emphasis on fleet-wide observability and capacity planning.

Keywords: SRE, culture, stability, reliability

Speaker Bio:

  • Jeremy Eder
    A 15+ year tech industry veteran, Jeremy is a Distinguished Engineer within Service Delivery, building Red Hat’s managed service muscle in order to operationalize the vision of OpenShift as a hybrid cloud substrate through building and operating services like Red Hat OpenShift on AWS, OpenShift Dedicated and Azure Red Hat OpenShift. He is a proven technical leader and intrapreneur, having seeded several strategic initiatives that have made their way into Red Hat’s products, services and process. Jeremy was the recipient of Red Hat’s Chairman’s Award and remains a frequent author and presenter. He previously specialized in software performance analysis, and currently works in the managed services space on observability, reliability and other tenets of Site Reliability Engineering. As a Distinguished Engineer, Jeremy continues to challenge the status quo, pushes his teams to continuously improve, is an active mentor for many engineers, inspires confidence by providing long-term vision and context, and believes that infrastructure should be a dial-tone: Stable, Secure, Performant and Boring.


SRE-032: How to keep customers happy and keep your production services running

Presenter: John Thornton

Description: SREs can suddenly get swamped with 5xxs from their services and be overwhelmed about where to start looking for the cause and the impact to customers. Watson AI SREs have created “Health boards” in LogDNA (a series of dashboards) that allow us to view our microservices from the frontend to the backend. Not only does this allow us to see which microservices are generating the 5xxs, it also allows us to look for patterns, which helps us resolve the issue sooner and keep the service running for our customers. By using our “Health board” we can identify when the issue occurred and then follow that timeline and look for patterns. Perhaps only one downstream service was impacted by a customer overloading the service and we needed to scale up a specific microservice, or perhaps we found that multiple downstream microservices were impacted by a database issue.
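
The kind of grouping such a dashboard performs can be sketched in a few lines. This is a minimal illustration only: the record format, service names and helper functions below are hypothetical, and it does not use or represent the LogDNA API.

```python
from collections import Counter

# Hypothetical, simplified log records: (HH:MM timestamp, service, HTTP status).
SAMPLE_LOGS = [
    ("12:00", "frontend", 200),
    ("12:01", "frontend", 502),
    ("12:01", "model-runtime", 503),
    ("12:02", "model-runtime", 503),
    ("12:02", "frontend", 502),
    ("12:03", "datastore", 200),
]


def is_5xx(status: int) -> bool:
    """True for HTTP server-error status codes (500-599)."""
    return 500 <= status <= 599


def count_5xx_by_service(records):
    """Count server-error responses per service."""
    return Counter(service for _, service, status in records if is_5xx(status))


def first_5xx_time(records):
    """Earliest timestamp with a 5xx, or None if there were none.

    Comparing strings works here only because all timestamps share
    the same zero-padded HH:MM format.
    """
    times = [ts for ts, _, status in records if is_5xx(status)]
    return min(times) if times else None


errors = count_5xx_by_service(SAMPLE_LOGS)
print(errors.most_common())
print(first_5xx_time(SAMPLE_LOGS))
```

Grouping by service shows where the errors concentrate, and the earliest error timestamp gives a starting point for following the timeline.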

Speaker Bio:

  • John Thornton
    I have been at IBM for 32 years! I worked at Lotus Development and then Iris Associates before the acquisition by IBM in 1995. While at Lotus I worked in Lotus Technical Support and then began a testing career on the Lotus Notes Quality Assurance (QA) team. Our QA team then became part of Iris Associates, so I continued my career as a QA engineer on the Lotus Notes and Domino teams in a variety of testing roles until 2014. I joined Watson in 2014 as a technical contributor (setting up systems for early adopter customers of Watson Explorer Advisor) and became a Watson SRE in May 2016.


SRE-033: Delivery Service Excellence with an A.I. augmented “Health Check as a Service”

Presenter: Jonathan Young

Description: Do you think your “daily checks” are effective in risk mitigation? Are your teams focused on the systems that are inherently stable, rather than those that carry underlying deviations from best practice?
“Health Check as a Service” (HCaaS) is an integrated health check tooling platform that is data-centric and leverages A.I. technology to assist service delivery in mitigating technical risk and lowering rates of severe incidents. By integrating the automation of Tech-Spec verification and hardware/software currency management with an SRE-led “Technical Health Checks” platform, teams can focus on ensuring smooth service delivery and spend less time dealing with incidents.

Keywords: technical-health-check, watson, delivery-excellence, tech-spec

Speaker Bio:

  • Jonathan Young CEng MIET
    Jonathan is an Open Group Certified Distinguished Architect (Level 3) who is dedicated to customer success. He is currently leading the “Health Check as a Service” platform as key contributor to the Kyndryl delivery-excellence capability. Jonathan has performed the role of SRE leader for many deep dive Technical Health Checks for some of IBM’s largest customers in the past 10 years. His specialisms include Integration Architecture, Technical Governance and the automation of Technical Risk Management.
    Jonathan is also an expert in Threat Modelling and is hands-on with technologies such as DevSecOps, containers, cloud services, single page applications and related technology.
    Some of his career highlights include 3 IBM Outstanding Technical Achievement Awards (including a Gerstner Award) and 3 patents on the subject of technical health check automation.


SRE-034: Learning to localize faults using fault injection

Presenter: Jesus M. Rios Aliaga, Karthikeyan Shanmugam, Qing Wang

Description: Quickly finding the exact location of a fault in large distributed applications such as those based on microservices architectures is not an easy task and typically requires manual analysis of logs and telemetry. This is especially the case for new releases where the application behavior is not yet well known.

We leverage WOLLFI, an automatic fault injection platform developed at IBM, to learn in a staging environment a causal model of how errors propagate between the different components of such applications. We then use this causal model during production to automatically localize the root cause of a problem detected within the application.

We have tested our methodology in a well-known microservice application benchmark and present here the results of such experiments. To do so, we first inject faults in the benchmark application while running a user-flow in the background, and use the observed logging data along with the information of which faults generated those logs to learn the error propagation causal model. We then evaluate its performance by injecting new faults into the application and comparing the predictions from our fault localization algorithm (based on the learnt causal model) with the actual locations of the injected faults.
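
The evaluation step described above, comparing predicted fault locations with the actual injected ones, can be illustrated with a simple top-k accuracy measure. This is a sketch under stated assumptions: the metric, the function name `top_k_accuracy` and the sample data are hypothetical and are not taken from WOLLFI or from the presenters' results.

```python
def top_k_accuracy(predictions, actual, k=1):
    """Fraction of injected faults whose true location appears in the
    first k entries of the corresponding ranked prediction list."""
    if not actual:
        raise ValueError("no injected faults to evaluate")
    if len(predictions) != len(actual):
        raise ValueError("one ranked prediction list is needed per injected fault")
    hits = sum(1 for ranked, truth in zip(predictions, actual) if truth in ranked[:k])
    return hits / len(actual)


# Hypothetical data: ranked candidate components per experiment,
# and the component where the fault was really injected.
predicted = [
    ["cart", "checkout"],
    ["payment", "cart"],
    ["catalog", "frontend"],
    ["checkout", "payment"],
]
injected = ["cart", "cart", "frontend", "checkout"]

top1 = top_k_accuracy(predicted, injected, k=1)
top2 = top_k_accuracy(predicted, injected, k=2)
print(top1, top2)
```

With this made-up data, the true location is ranked first in two of four experiments and within the top two in all four.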

Keywords: fault injection, fault localization, micro-services, causal learning

Speaker Bio:

  • Jesus Rios is a Research Staff Member at the IBM Research Division. He joined IBM Research in 2010. He currently works on applying AI to problems in the IT domain.
  • Karthikeyan Shanmugam is currently a Research Staff Member at IBM Research AI, NY. Previously, he was a Herman Goldstine Postdoctoral Fellow in the Math Sciences Division at IBM Research, NY. His research interests broadly lie in Graph algorithms, Machine learning, Optimization, Coding Theory and Information Theory. In machine learning, his recent focus is on graphical model learning, causal inference, interpretability in ML and large graph analytics.
  • Qing Wang received her PhD degree in computer science from Florida International University. She is currently a researcher at IBM T.J. Watson Research Center, Yorktown Heights, NY. Her research interests are in data mining and machine learning, covering both algorithmic and application issues, and she now focuses on AI for IT operations. She has more than 20 publications and 12 patents. She is an IEEE member.


SRE-037: SREs – The Avengers of Production System

Presenter: Kaushal Kishore

Description: This talk is about who an SRE is and how SREs are the guardians of the Production System.

Keywords: SREdimension, TheGuardians, SREsAssemble

Speaker Bio:

  • Kaushal Kishore
    I joined IBM in mid-June 2021 and work as an SRE for the Planning Analytics on Cloud team in the IBM Cloud and Cognitive Software org. I have 5+ years of SRE experience in total and have set up SRE practices and implementations at the companies I previously worked for.


SRE-038: Iterating to Awesome – An SRE Toil-Reduction Retrospective

Presenter: Kirk Bater

Description: Toil Reduction is the lifeblood of an SRE. We know this, but what happens when SREs are tasked with writing code and features as well as monitoring the service? How do you manage both? This talk covers how our SRE team operates while running OpenShift Dedicated and Red Hat OpenShift on Amazon, and offers a retrospective on how we tackled this challenge through iteration on people and process.

Keywords: toil

Speaker Bio:

  • Kirk Bater
    Kirk Bater is an SRE and operations Region Lead at Red Hat, working on OpenShift Dedicated. When Kirk isn’t developing software, you can find him writing or talking about software, coaching his daughters’ hockey teams, or camping/hiking in the Adirondacks.


SRE-039: Treat your SRE Certification Journey like a project and drive it to completion

Presenter: Kevin Yu

Description: Have you thought about SRE Profession Certification only to be overwhelmed by the journey? This session will describe how to make certification real and achievable by treating it like a project. Start with the user persona and empathy map, then capture the AS-IS to surface the gaps. This leads to prioritization and identification of tasks to reach the goal in an agile and iterative way. This method will help your Certification journey, prompt reflection on your work and achievements, and lead to career dialogs you can have with your manager and mentors.

Speaker Bio:

  • Kevin Yu
    Kevin is a solution reliability architect with over twenty years’ experience enabling solutions to scale and meet peak demands such as the US Cyber Monday. Kevin has worked on end-to-end enterprise SaaS and on-prem solutions ranging from Commerce and Marketing to Supply Chain, with roles in lab services, development and operations. Kevin is a champion of the SRE Profession in IBM. He is driving the technical vitality and, more importantly, the culture and mindset shift. He is currently leading the SRE enablement in IBM AI Applications and driving its SRE transformation to a data-driven engineering organization.


SRE-041: Becoming SRE – A Professional Journey

Moderator: Rod Anami

Attendees: Cindy Mullen, Ralph Bateman, Pavlos Ratis

Description: What does it take to become a full-fledged SRE? Hear from technical thought leaders from 3 different companies what the SRE profession journey is about on this panel. Join Cindy Mullen from Kyndryl, Pavlos Ratis from Red Hat, and Ralph Bateman from IBM to learn their perspectives on this career and how they have succeeded as established engineers. After this panel, you will possess insights on defining your personal career path for the SRE profession.

Keywords: sre_profession, journey, becoming_sre

Speaker Bio:

  • Cindy Mullen
    Cindy Mullen is a Senior Technical Staff Member at Kyndryl with a 20-year career in the Information Technology industry. She shifted her career from System & Security Administration to Site Reliability Engineering (SRE) three years ago. As a Site Reliability Engineer, she is focused on optimizing observability and developing automation. As a core member of the Global SRE@Kyndryl Program, she serves as a Transformation SRE where she facilitates successful integration of SREs around the globe and mentors members of the SRE community. Cindy is a strong advocate for SRE as well as a continuous learner. She has earned Certified Expert IT Specialist, Open Group Master IT Specialist, MCSE/MCITP/Azure, VMware, AWS, PMP, CISSP, CCSP certifications and accreditations.
  • Ralph Bateman
    Ralph is the Distinguished Engineer for IBM Cloud, SRE. He created and leads the new Site Reliability Engineering strategy that keeps the lights on for IBM Cloud. The goal is to deliver high quality services that enable customers to deploy into the cloud quickly and easily, while providing operational excellence for those services. He works closely with Kubernetes, Docker, CNCF and other open source communities to create, deliver and run IBM Cloud.
  • Pavlos Ratis
    Pavlos Ratis is a Senior Site Reliability Engineer at Red Hat, where he works on the OpenShift team. He is the creator and curator of the awesome-sre and awesome-chaos-engineering GitHub repositories.


SRE-043: Enabling SRE at the Enterprise

Moderator: Kitty Smith

Ron Baker, Distinguished Engineer (IBM Software, AI Applications)
Mark Emig, Vice President (Kyndryl Anthem)
Ishan Sehgal, Program Director (IBM Software, AI Applications)

Description: In this panel discussion, we will investigate what it takes to establish SRE across a large enterprise. We will hear from leaders within product management, application development, service delivery, and operations to gain insight from their experiences enabling business value through the adoption of SRE. They will share strategies on facilitating cultural and behavioral changes in their organizations as well as the key metrics and measurements that need to be in place to ensure the success of enabling the practice.

Keywords: enterprise, culture, delivery, KPIs, business value, scaling

Speaker Bio:

  • Ron Baker is an IBM Distinguished Engineer in the AI Applications organization at IBM. He leads the SRE discipline and operations technology, focused on the transition to Hybrid Cloud and the OpenShift environment; bringing experience in moving traditional enterprise applications into containers, Kubernetes, and multi-cloud deployments. Ron is a member of the IBM SRE Global Profession Board, helping bring consistency of the SRE role across IBM. Prior to this, he was the Director of Geospatial Content & Analytics in the Weather Company, leading the curation of global location data from our mobile and web properties and their use in industry and consumer applications.
  • Mark Emig is the senior partner executive of Integrated Managed Services for Anthem at Kyndryl. In 2020, Mark led contract negotiations with Anthem, resulting in a $1B, multi-year modernization and managed services agreement. He is responsible for all aspects of service delivery, managing highly technical and diverse global teams to ensure service excellence. Mark is the executive sponsor for SRE at Anthem, establishing the practice to drive continuous improvement across the enterprise. The SRE mission and vision at Anthem is to provide operational efficiency and technology enablement through automation.
  • Ishan Sehgal is the Program and Offering Manager for the AI Applications organization at IBM. Over his career he has managed worldwide strategic alliances, product marketing, cloud operations, and product management for Watson IoT, IBM Systems HPC solutions and BladeCenter Solutions. Ishan is the champion of SRE within his organization, bringing a unique perspective of how SRE can help meet business goals and deliver client success. As the product owner, he facilitates driving SRE tenets and capabilities as features into the core product. His focus is on reducing clients’ operational costs, improving asset productivity, and increasing process efficiency; all of which are directly aligned to SRE principles.
  • Kitty Smith is a certified Distinguished Architect with more than 20 years of experience in architecture design and service delivery. In her role as Director in the Cloud Engagement Hub, she has combined her background in architecture, systems engineering, and operations management with a passion for client success to become an evangelist for Operating Model transformation. As an Executive Architect for clients across multiple industries, Kitty has been instrumental in leading organizations in the application of technology to meet their business objectives. Through her appointment on the IBM SRE Profession Governance Board, Kitty is influencing the strategy and adoption of Site Reliability Engineering practices globally.


SRE-044: The challenges with transitioning towards SRE

Presenter: Neil Miranda, Lekha Rao

Description: This talk will focus on how to make the transition to SRE if you are purely focused on operations and toil. What are the low-hanging fruits you should begin with? And what are the culture, mindset and principles you need to imbibe and hire for in order to see the benefits of a truly functioning SRE organization? The talk will also cover the lessons learned, the challenges faced, and the benefits of running a data-driven SRE function in your organization.

Speaker Bio:

  • Lekha Rao
    Lekha Rao has over 10 years of Industry experience. She is currently leading the IaaS SRE team in India and is responsible for establishing SRE across IaaS VPC including Monitoring, Automation, Operations and Incidents for NextGen VPC Environment and VMware.
    Prior to this role she worked as a Technical Assistant to the VP of India Software Labs. There she participated in various special projects, helped build strategy, worked with partners and clients, and scaled initiatives across India Software Labs.
    She started her career at IBM in the QA domain, moved on to lead teams, and then became an architect for products predominantly in the retail industry, including Sterling, WebSphere Commerce, Commerce Insights and Business User controls. Her interest in problem solving and keen understanding of the product led to a short stint as a business analyst before moving to the role of Technical Assistant to the VP of ISL.
  • Neil Miranda
    I am a Senior Manager at IBM Public Cloud (IaaS). I manage several teams under the IaaS SRE umbrella, including AIOps, IPOPS Operations, and Security and Compliance Automation, to name a few. I have over 11 years of experience as a Manager in IBM and have led several development teams in the past. I was involved in tools development for InfoSphere Information Server and DB2 development, and was Development Manager for LIFT. I have a wide range of experience that spans most areas of the software development life cycle, be it Data Analysis, Database Management, Development, QA or Release Management. In my spare time I enjoy listening to music (all kinds of genres).


SRE-047: Stakeholder Management & SRE Soft Skills

Presenter: Manjunath Sangappan

Description: In SRE, we hear a lot about how to develop SRE skills: the ‘Developer with Operations mindset’, building up full stack skills, focussing on individual SRE Tenets & Principles, etc. Though these are cornerstones and are important for SREs, the SRE mindset is vital to the success of an SRE. The soft skills and stakeholder management that an SRE brings often play a key role in the success of an SRE delivery and even contribute to an increase in new business opportunities. In this talk, I will take you through identifying the stakeholders, how to build lasting relationships, what actions one can take to break resistance, what positive politics is and how to play it, and more.

Keywords: SoftSkills, StakeholderManagement, SRE, Mindset

Speaker Bio:

  • Manjunath Sangappan
    Manju is the SRE Practice and delivery leader from DevSecOps (DSO), CIC India, IBM Services. Manju has single-handedly built the SRE practice for DSO from scratch and was instrumental in structuring and shaping Technical SRE delivery through the SRE Package. He has set up several SRE Entry, Foundation & Experienced level Bootcamps and was responsible for enabling over 1000 practitioners worldwide. Manju comes with over 25 years of experience, most of it in building high performing technical teams all over the globe.


SRE-048: Using Instana to become a better SRE

Presenter: Marcel Birkner

Description: What does a typical day as an SRE look like? In this presentation I will discuss the challenges we face while running a SaaS platform that is used 24 / 7 / 365 around the globe.
In doing so, we have embraced the core principles described in the Google SRE handbook.
While we are not Google by any means, most of the principles apply to our daily work one way or another.
Having a fully distributed team running a distributed system can be quite challenging.
In this talk I will be covering:

  • Core SRE principles
  • How we have applied them to our daily work using Instana
  • How we use SLOs to improve our platform and the quality of service for our customers
  • Technical, cultural and organisational lessons learned along the way

Keywords: Instana, Alerting, SLO, Distributed Tracing, End-User-Monitoring, EUM

Speaker Bio:

  • Marcel Birkner
    Marcel works as a Staff Site Reliability Engineer at Instana, an Application Performance Monitoring (APM) solution.
    He has long experience in software engineering and software automation.
    Currently he focuses on improving the current Kubernetes stack, reducing overall system complexity and installing Instana SaaS infrastructure in IBM Cloud.


SRE-051: Passion doesn’t sell SRE: Create a business case your boss will say “Yes” to

Presenter: Marion Clelland

Description: You have read the SRE book, attended all the conferences and purchased yourself a novelty SRE themed t-shirt. You are full of passion and excitement on how this is going to entirely transform…. absolutely everything. As a practising engineer or sys admin, how do you then go about selling this crazy idea to your boss? Why would they want to put expensive engineers in a role they could give to a cheaper team with a set of runbooks?
I work in what was a traditional software development department starting to move into DevOps as we built out a cloud service. We are still growing SRE and I continue to nurture the SRE message in our management line. I now have a wide SRE network but that wasn’t the case when I started out. I love to reflect, take on feedback and learn – and what I learned this time is useful to share. The first SRE pitch I made to our leadership team was a bit of a flop. However, the second time around it gained traction and results.
I want to tell you my story so you can get your SRE journey started faster than I did. I’ll tell you the things I got wrong first time, the things I didn’t consider and what I got right in the end.

Speaker Bio:

  • Marion Clelland
    Marion is the Site Reliability Engineering lead for Integration within IBM Cloud and a core team member of the global IBM SRE profession. She has held a variety of technology roles in her 15-year career, including software development, systems architecture, traditional operations and, more recently, SRE. In 2018 Marion led the adoption of SRE within her department, with accolades culminating in an industry award recognising rising female technologists (TechWomen100). Marion is a keen advocate and role model for diversity and inclusion; she was among the first globally to be awarded the IBM Be Equal Ally title in recognition of her continued work in this area.


SRE-052: Financial SRE: Observability with usage-based cost reduction

Presenter: Matheus Bitencourt, Filipe Cesar, Danne Aguiar, Victor Hugo

Description: In this presentation we will introduce the concept of Financial management, where we apply Observability together with an analysis of application usage to suggest actions that reduce the infrastructure cost of applications. We will show a real case, with further benefits that can be applied to any project related to usage-based cost reduction. This workshop will include hands-on exercises on the methods we developed to achieve environment optimization and cost reduction.

Keywords: financial, management, observability, analysis, infrastructure, SRE, cost, performance, optimization, environment

Speaker Bio:

  • Matheus Bitencourt: Brazilian, Carioca, Technical Lead and Solution Architect @ IBM, a Certified Developer Advocate, RJ Meetups Leader in exponential areas such as AI, Blockchain, IoT and Data Science, in addition to First Patent Application Invention Achievement Award, Red Hat Certified Specialist in Containers and Kubernetes, IBM Cloud Solution Advisor and Oracle Java. Member of the IBM Brazil Technical Leadership Council and Chapter Leader of Slack RJ Community and Bluetalks@Rio. Hackathons Mentor, Hero Ambassador IBM Brazil 2020, TLC Rockstar 2019/2020, Winner of BlueHack @ 2017 in Blockchain and Initiative Star Contributor of IBM Academy of Technology. IBM Recognized Technical Speaker at hands-on sessions and enthusiastic about everything technology can achieve! Besides that I am from Campo Grande, Rio, a lover of everything from Marvel, a traveler in non-pandemic times and passionate about astronomy and its mysteries.
  • Filipe Cesar: Graduated in System Information Technology with an MBA in Business Process Management. Working at IBM since 2013, currently as Technical owner of the Pilot application in the worldwide SRE implementation with a cost reduction focus, Technical owner mentor and “sextapec” Guild Lead.
  • Danne Aguiar: Danne Aguiar is a certified IBM SRE L1 who has worked with monitoring since 2010 with a focus on APM, helping clients acquire the Observability level required to maintain their SLAs/SLOs using automation.
  • Victor Hugo: Has worked at IBM for 18 years, the last 4 of them as SME and Netcool specialist for GTS Brazil clients. For the last 9 months he has focused on APM monitoring and the Dynatrace Professional certification.


SRE-053: Assessing Change Risk in Cloud Pak for Watson AIOps

Presenter: Michael Nidd, Yu Deng, Mary Yost

Description: Change is among the biggest contributors to service outages. With more enterprises migrating their applications to cloud and using automated build and deployment, the volume and rate of changes has significantly increased. Furthermore, microservice-based architectures have reduced the turnaround time for changes, and increased the dependency between services. All of the above make it impossible for the SREs (Site Reliability Engineers) to rely solely on manual risk assessment for changes.

In order to mitigate change-induced service failures and ensure continuous improvement for cloud native services, it is critical to have an automated system for assessing the risk of change deployments. In this talk, we present an AI-based system for proactively assessing the risk associated with deployment of application changes in cloud operations. The risk assessment is accompanied with actionable risk explainability. We discuss how the system helps SREs quickly assess the risk level of a problematic change, leveraging features extracted from historical data. At the end of our talk, we will share feedback from SREs who have used a pilot deployment of our system, and present our plan for improvement.

Keywords: AIOps, Change Risk, Risk Mitigation

Speaker Bio:

  • Michael Nidd works for IBM Research at the Zurich Research Laboratory, focusing on extending IBM Cloud Pak® for Watson AIOps. Prior to this involvement, his most recent work was applying machine learning for automatic fault remediation in systems operations, and before that in fault prediction for firewalls. He received his doctorate in 2001 from the Swiss Federal Institute of Technology (EPFL) in Lausanne, with a thesis on service discovery in transient wireless ad-hoc networks, and earned his BMath and MMath degrees from the University of Waterloo (Canada) in 1993 and 1995.
  • Mary Yost
    Mary Yost became the Global Site Reliability Engineering (SRE) for Instana observability offerings at IBM in June 2021. Prior to this role she spent 6 years in leadership positions with the Watson AI Services SRE team. Her areas of expertise include incident management, change management, and root cause analysis processes. She holds a Master’s in Business Administration from Boston University and a Bachelor’s degree in Computer Science from the University of Illinois.
  • Yu Deng
    Yu is a Research Scientist and Manager at IBM T.J. Watson Research Center. Her research interests are in information extraction, question answering, knowledge graphs and semantic analysis. Her work at IBM has been focused on building AI solutions for IT operations and services. Yu is a member of the IBM Academy of Technology and an IBM Master Inventor. She received her Ph.D. in Computer Science from the University of Maryland, College Park.


SRE-055: SRE and Microservices

Presenter: Nagaraj Chinni, Adinarayana Haridas

Description: SREs must look into the various aspects of complexity involved in designing, running and managing microservices. In this presentation we will focus on providing insights into the following aspects of microservices that SREs must pay attention to right from the architecture phase:

  • Design for reliability
  • Securing microservices
  • Build to manage
  • DevSecOps strategy

This session will also bring out some of the best practices/case studies implemented in client projects and discuss the value it brings.

Keywords: microservices

Speaker Bio:

  • Nagaraj is a Senior Cloud Application Architect and GBS Lead SRE Offering Architect who comes with extensive experience in architecting solutions for the migration, building and management of applications on cloud platforms. As a cloud architect he is also leading the OpenShift everywhere program. In his career as a lead Retail Industry Architect he supported and delivered complex Retail engagements for major Fortune 100 clients. Nagaraj has also led the solutioning guidance and architecture kit for Cloud Move offerings. He has also worked with various clients in creating an SRE adoption roadmap.
  • Adi Adinarayana is a Site Reliability Engineer by profession and a technology enthusiast. Over the past years, with a good amount of travel, he has gained a large customer base and has had the privilege of being the first person to receive calls for troubleshooting activities. He has two client references and is a strong contributor to several Academy of Technology initiatives, an open source contributor and a regular Raspberry Pi/Arduino user. Spending time with his kids and at the tennis court are mandatory daily tasks.


SRE-056: How to Implant SRE DNA into Developers/Architects

Presenter: Nagaraj Chinni

Description: In this lightning talk, I will introduce the process and tricks that will help inculcate the very essence of SRE tenets, principles and procedures in the daily life of Architects and Developers.

Speaker Bio:

  • Nagaraj is a Senior Cloud Application Architect and GBS Lead SRE Offering Architect who comes with extensive experience in architecting solutions for the migration, building and management of applications on cloud platforms. As a cloud architect he is also leading the OpenShift everywhere program. In his career as a lead Retail Industry Architect he supported and delivered complex Retail engagements for major Fortune 100 clients. Nagaraj has also led the solutioning guidance and architecture kit for Cloud Move offerings. He has also worked with various clients in creating an SRE adoption roadmap.


SRE-058: Connecting dots: My journey from Support to SRE

Presenter: Neeraj Bhatt

Description: I would like to share my journey of how I successfully transitioned into an SRE role from a technical support background. In this talk, I will discuss the change of mindset which I adopted in order to think and operate in an SRE team and how I leveraged my existing support skills to contribute to the SRE methodology and also share insights on the importance of SRE in today’s era of rapid technology enhancements.

Keywords: supportToSRE, startingPointForSRE

Speaker Bio:

  • Neeraj Bhatt
    My name is Neeraj Bhatt, and I have been active in this industry for almost 7 years. I have been working with Red Hat for about 4.8 years on OpenShift technologies. I was in a support role at Red Hat for 4 years and have worked on many OCP versions since 3.0. Currently, I am part of the Red Hat SRE team, where I contribute with my previous knowledge and learn many new technologies that I did not get a chance to work with in the past.


SRE-061: Smith – OS compliance through automation

Presenter: Paddy Doyle, Farhan Ahmad, Paul Cullen, Guy Hindle

Description: IKS classic infrastructure comprises a large and growing number of virtual and physical machines, which must be kept OS compliant. Automation is essential to drive the process of patching the operating systems. In this session we will discuss the history of the project, where we are now, and changes we plan to implement to further reduce toil.

Keywords: operating system, patching, automation

Speaker Bio:

  • Farhan Ahmad – Joined IBM as SRE for IKS in November 2020. Has previous experience with implementing DevOps practices for various clients/organizations of different sizes. In his spare time, Farhan likes to play video games and recently has started skateboarding.
  • Paul Cullen is the SRE Security Compliance Lead and works in the SRE squad supporting the IKS offering. He has been with the team since 2015 and was previously the European SRE Squad lead.
    In his spare time, Paul enjoys mountain biking and football, as well as walking his rescue dog, a 7 year old English springer spaniel.
  • Paddy Doyle – Joined IBM as SRE for IKS in September 2020. Paddy is currently focal for the Smith Patching effort within IKS.
    In his spare time, Paddy likes to run away, and also back again.
  • Guy Hindle – Long term IBMer, SW Engineer and latterly SRE for IKS. From COBOL to golang, mainframes to mobile, Guy has developed and deployed on it. Outside work, badminton, music and recently kayaking keep him busy … when not playing along with one or more of his three kids.


SRE-064: AI-Powered Automation for your IT

Presenter: Pratik Gupta, Endre Sara

Description: Remove manual work from IT Operations with AI-powered Automation. Apply software analytics to make decisions when managing business applications, and automate these decisions to assure the level of service for each application while maintaining compliance at the lowest cost possible.

Keywords: Full-stack, observability, resource management, problem remediation, problem avoidance

Speaker Bio:

  • Endre Sara
    Endre Sara is the VP of Advanced Engineering at Turbonomic, directing a team of developers focusing on new technologies and opportunities to extend Turbonomic’s existing capabilities.
    Previously, Endre was a VP at Goldman Sachs, leading the Systems and Application management team and Network Management team driving the management strategy, design and implementation for Goldman Sachs globally. Endre holds an M.E. in Electrical Engineering from the Technical University of Budapest and a Ph.D. in Electrical Engineering from Stevens Institute of Technology.
  • Pratik Gupta
    Pratik Gupta is the Chief Technical Officer for IBM Hybrid Cloud Management in the IBM Automation Business Unit. His current areas of interest are Cloud Management, AI and Automation. Prior to that, he has held both business and technical executive positions in IBM spanning Product Development, Business Development, Strategy and Architecture in the area of Cloud and Service Management.


SRE-065: Kubernetes Operators in Practice

Presenter: Pavlos Ratis

Description: In Kubernetes, the Operator pattern helps capture good production practices and the expert knowledge of practitioners for managing a service or a set of services.

An Operator acts as an SRE for an application. Instead of manually following playbooks or scripts to deploy databases or mitigate issues in Kubernetes, SRE teams can use off-the-shelf Operators or develop their own to automate these processes and reduce toil work.

In this session, we will explore the Operator pattern with some examples of how we have used them at Red Hat to build OpenShift. We will discuss some lessons learned, common pitfalls running Operators, and when it makes sense to write one.

Keywords: sre, kubernetes operators, openshift

Speaker Bio:

  • Pavlos Ratis
    Pavlos Ratis is a Senior Site Reliability Engineer at Red Hat, where he works on the OpenShift team. He is the creator and curator of the awesome-sre and awesome-chaos-engineering GitHub repositories.


SRE-067: My journey to becoming a Watson SRE

Presenter: Paul Stroud

Description: The presenter details his educational path and skills and how they enabled him to achieve his goal of working with the Watson AI stack.

Keywords: SRE

Speaker Bio:

  • Paul Stroud
    Paul has been able to follow his passion for computers from helping people get their modems online for an ISP at the dawn of the internet, to working on today’s cutting edge artificial intelligence. With many steps along the way, including UNIX Systems Administration and Enterprise Network Management, he has been able to compile a broad array of skills that help him daily in his current position as a Watson Discovery Service SRE.


SRE-068: Availability, Reliability and Resiliency – KYC for SREs

Presenter: Raghu Srinivasan

Description: Know your client (KYC) is definitely good, but you should also know the definitive and distinctive meanings of Availability, Reliability and Resiliency. They are not all the same. Listen to this 5-minute presentation on how these terms get used interchangeably, and why it is important to know the specific differences, especially as an SRE. I will share a real-world example to drive home the need for the distinction. You will take away the definitions for each of these and learn how to apply them in your day-to-day actions.

Keywords: KYC, Reliability, Availability

Speaker Bio:

  • Raghuram (Raghu) Srinivasan is a Senior Technical Staff Member (STSM) at IBM (in support of Kyndryl) and the lead Client transformation site reliability engineer (SRE) for IBM services. He is a Redbooks thought leader and IBM senior certified IT Architect. He has over 26 years of experience in systems engineering, development, architecture, and operations.


SRE-069: In pursuit of progress – SRE journey at Air Canada

Presenter: Raghu Srinivasan, Thomas King and Greg Hamonic

Description: We will share our journey and lessons learned in transforming from Systems administration to SRE ways of thinking. We will provide assets that we created in order to implement the SRE tenets, such as developing a cross-functional squad with an SRE mindset, Operational Readiness Review Score cards, Observability to be able to work towards SLOs + actionable alerts, and how to reduce toil using automation. We will also reflect on why we did this. You will take away how you will be able to re-use these assets for your clients and accelerate adopting SRE tenets in pursuit of adding reliability as a business value.

Keywords: Operational Readiness, SLO/SLI, Actionable alerts

Speaker Bio:

  • Raghuram (Raghu) Srinivasan is a Senior Technical Staff Member (STSM) at IBM (in support of Kyndryl) and the lead Client transformation site reliability engineer (SRE) for IBM services/Kyndryl. He is a Redbooks thought leader and IBM senior certified IT Architect. He has over 26 years of experience in systems engineering, development, architecture, and operations.
  • Greg Hamonic is a senior AIX Admin and site reliability engineer (SRE) at Kyndryl. He has over 30 years of experience in AIX/UNIX, SAN, Networking and in supporting highly available systems. He supports the Air Canada account as a platform SRE.
  • Thomas King is a senior Linux admin and site reliability engineer (SRE) at Kyndryl. He has over 25 years of experience in Linux, Networking, NAS and LDAP. He is an IBM Cloud Associate Certified SRE working in the Quebec sector with a focus on the Air Canada account.


SRE-071: Transformation into SRE

Presenter: Randip Singh Rekhi, Ravi Yadav, Mahesh Kumar

Description: The journey of service line SMEs transforming into SREs: what mindset shift it brought to us as individuals, and what challenges and roadblocks we encountered as part of this transformation journey.


Speaker Bio:

  • Randip Singh Rekhi – Platform SRE coming from a Middleware and collaboration background, focusing on end user issues. Has a strong hold on Agile methodologies and a data-driven mindset.
  • Ravi Yadav – Platform SRE coming from a VMware background, with expertise in RFS/Projects.
  • Mahesh Kumar – Platform SRE coming from a VMware background. Has a good hold on BAU issues and challenges with respect to compute.


SRE-074: Watson AIOps & IT Insights

Presenter: Ricardo N Olivieri

Description: In this session, we will show how several WAIOps features can help reduce the time to diagnose and resolve IT operations problems. We will cover relevant topics such as log anomaly detection, grouping of events, ChatOps interface, etc.

Keywords: AIOps, AI, SRE, operations, IBM Cloud Pak for Watson AIOps

Speaker Bio:

  • Ricardo Olivieri is a Senior Technical Staff Member and AIOps Solutions Architect at the AIOps Elite Team. His areas of expertise include gathering and analyzing business requirements, architecting, designing, and developing software applications. Ricardo has extensive experience in the complete software development cycle and related processes, especially in Agile methodologies. Ricardo is now mainly focused on helping customers adopt the different products under the Watson AIOps umbrella so they can prevent critical issues before they occur. His background in application development and DevOps technologies makes him well suited to assist clients in their AIOps journeys.


SRE-078: Helping tenants help their SREs

Presenter: Rob Rati

Description: This talk will introduce a tool that SREs can use to evaluate tenant workloads running in their clusters for best deployment practices. It integrates with existing SRE monitoring systems like Prometheus and gives SREs a means to help tenants improve their deployments which in turn will help SREs do their job.

Speaker Bio:

  • Rob Rati has been contributing to Kubernetes and OpenShift for many years, and has experience running various workloads on different Kubernetes and OpenShift clusters. He is currently a Principal SRE at Red Hat managing services like and OpenShift Cluster Manager.


SRE-082: Localization of Operational Faults in Cloud Applications by Mining Causal Dependencies in Logs using Golden Signals

Presenter: Seema Nagar, Ajay Gupta and Pooja Aggarwal

Description: In cloud native applications, a large fraction of operational failures—or outages—result from violations of Service Level Objectives (SLOs) defined on either service errors or service latency, commonly referred to as two of the “golden signals.” A light-weight fault localization system can greatly reduce human effort and dependency on domain knowledge for localizing such golden signal-based operational failures. Our technique establishes causal relationships among the golden signal service errors and error logs emitted by the constituent micro-services (all modeled as time series data).

Speaker Bio:

  • Ajay Gupta
    Senior Research Engineer with IBM India Research Lab
    Ajay is a Senior Research Engineer with IBM India Research Lab. He completed his B.Tech in Computer Science from IIT Guwahati. He has worked in various research areas including Natural Language Processing, Conversation AI and Cloud. He has published more than 15 papers in conferences like ICDE, KDD, SIGMOD. His current research area is applying AI in IT operations.
  • Pooja Aggarwal
    Advisory Research Scientist at IBM Research, Bangalore, India
    Pooja is an Advisory Research Scientist at IBM Research, Bangalore, India. Her current research focuses on infusing artificial intelligence into IT technical service processes. She has led the design and implementation of advanced prototypes as well as production-ready features such as Action Recommendation, Event Grouping and Fault Localization, which are key parts of WatsonAIOps (IBM product for incident management). Prior to this, she worked on a variety of research problems in the domain of cognitive support services, including improving ticket quality, key component extraction, root cause analysis and resolution action recommendations. She has published 16 papers in prestigious conferences and journals such as IPDPS, TPDS, COLING, TOMACS, ICSOC, HiPC, IUI and AAMAS, and holds 6 technical patents. She is the recipient of a best paper award in the area of Multi-bot Conversation systems for Support Services from the AAMAS Conference.
  • Seema Nagar
    Advisory Research Scientist at IBM Research India
    Seema is an advisory research engineer at IBM Research India. She has been researching the field of computer science for the last thirteen years. She has over 35 publications in eminent conferences and journals and more than 100 patents filed. She has been named a master inventor for two consecutive terms. She has also been part of the reviewer committee of many eminent conferences, such as IJCAI, NAACL, and ACL. She has earned several Research Awards in IBM Research India, including three Outstanding Technical Achievement Awards for her work on social network analysis and trustworthy AI. She actively mentors several researchers and students at the research lab. She obtained her B. Tech. and M. Tech. in Computer Science and Engineering from IIT, Guwahati in 2007, and IIT, Delhi in 2011 respectively. Currently, she is also pursuing a part-time PhD at IIIT, Guwahati.


SRE-083: Successful SRE Manager – The Facts and the Myths

Moderator: Shobhna Bansal

Attendees: Gareth Holl (Panelist)
Mariko Hasumim (Panelist)
Martin Larson (Panelist)
Tom Schmidt (Panelist)

Description: An SRE manager is the commander in chief of an SRE battalion dedicated to building the reliability of a product or service. SRE managers work with a multi-talented, highly skilled team of SREs, and it is very important that the manager knows the SRE role and builds a high-performing team with a wide breadth of knowledge in Dev, Ops, and automation. To be a successful SRE manager, you don’t just take charge of your team; there are many other roles that you have to play.
In this panel discussion, we will break certain myths that revolve around the SRE manager role and restrain managers from achieving high performance.
We will also validate the facts that help in the making of a successful SRE manager.

Keywords: sre, manager

Speaker Bio:

  • Shobhna is the SRE manager for Watson AI and Planning Analytics for Bangalore SRE. She has 16 years of experience working on different projects. She has been with IBM for 10 years and has been in management for 6 years. Currently she is responsible for Incident Management of the projects, including collecting the monthly metrics for optimization, working with SREs on automation, and providing support for customer cases.
  • Gareth – IBM has been my work home for over 24 years, most of which has been in the US, with a variety of technical and leadership roles. I’m currently a senior manager for the Watson AI Global SRE team, which spans 5 countries and operates 24×7 to keep our Cloud based services available and compliant. One of my goals is to promote an inclusive team that values the differences in our people across our 5 locations. Another of my goals is to develop a highly technical team that strives to automate the daily toil and achieve a high degree of consistency and self-sufficiency. I performed the roles of an SRE and a Technical Architect prior to moving into my current leadership position. In fact I was one of the original members of the SRE team for Watson, helping define what SRE meant for the Watson org, eventually helping to build our global team, including working for a period of 2 years in Sydney, Australia. My previous roles were customer oriented which provides me with good insight for running an operational centre and performing incident management that is customer focused.
  • Mariko is a manager of the Watson AI Site Reliability Engineering (SRE) team in Japan and Australia. She joined the Watson AI SRE team in April 2019 and has been responsible for the Incident Commander role during Asia Pacific business hours, and for driving the CIE/RCA process and SRE projects in collaboration with SRE squad leads and global SRE managers. She is a focal contact for Watson AI partnership customers in Japan and also works closely with the Advanced Customer Support (ACS) team to support the success of Watson AI premium customers.
  • Martin is a third-year Site Reliability Manager within Business Analytics. His previous roles in a traditional DevOps model, analyst work in Technical Support, and work as an SRE evangelist have helped drive his team’s transition into SRE.
  • Tom is a certified Site Reliability Engineer Thought Leader and active member of the SRE certification review board. Tom Schmidt currently manages multiple Site Reliability Engineering teams in support of Business Analytics. With a diverse background developing common infrastructure, automation and test frameworks, and a passion for leadership development, Tom is an ardent evangelist for SRE within IBM. Over the last several years Tom has focused on influencing SRE practices within IBM Cloud DevOps Services, and more recently built a team to lead the SRE transformation across the Business Analytics organization.


SRE-086: How we adopt SRE skills to help the busiest airport in the world?

Presenter: Stanley SS Tam, Martin Chan

Description: Many enterprises have embarked on their cloud modernization journey, but they soon face challenges in monitoring and troubleshooting issues on the new platform.

We have solved those challenges by helping Hong Kong International Airport adopt SRE skills. If you’re looking for real-life examples of how enterprises adopt SRE, join us to learn more!

Keywords: SRE, CloudWatch, OSS, DevSecOps, OpenShift

Speaker Bio:

  • Stanley Tam
    Senior Architect, Hong Kong IBM GBS HCS
    Stanley is a Senior Architect with over 7 years of experience in the industry and has successfully delivered multiple large-scale projects across industries including banking, accounting, supply chain, and aviation. He specializes in Cloud Modernization, DevSecOps Establishment, and Site Reliability Engineering.
  • Martin Chan
    Senior Architect, Hong Kong IBM GBS HCS
    Martin is a Senior Architect with over 10 years of experience working on Cloud Architecture, Complex Application Development, and SRE, and 5 years of experience leading Cloud Modernization and SRE for quasi-governmental organizations in Hong Kong.


SRE-092: How to adopt SRE practice for 500 member’s outsourcing project.

Presenter: Takaaki Tsunoda

Description: How do you adopt SRE in a large project? I’ll share my story of adopting SRE across 500 project members in the insurance sector. In a large team, it is a very tough challenge to get all members to adopt a new concept. In this presentation, I will share the story and key tips for SRE enablement and delivery team transformation on a 500-member outsourcing project.

Keywords: delivery, large, transition, adopt, outsourcing

Speaker Bio:

  • Takaaki Tsunoda is a platform SRE and delivery team leader for a large outsourcing project.
    He also supports the company transition from IBM to Kyndryl as an IBM Slack Champion. He joined IBM GTS in 2015 and started his career delivering a shared ITSM tool for multiple customers. He has experience leading the Web Application Team, the Apply Hosting Service (MW) Team, and the Server (OS, Hypervisor, HW) Team of an outsourcing project with high reliability demands. He began studying SRE architecture in 2017 at the IBM Japan Study Group, focusing on AlwaysOn (Three Site Reliability Architecture).


SRE-095: Tips & Tricks to become SRE Certified

Presenter: Vandana Pandey

Description: In this session, I will share my experience of becoming the first female IBM Certified SRE, along with the tips and tricks I used to become certified.

Keywords: ibmcertifiedsre, sre, tips, tricks

Speaker Bio:

  • Vandana Pandey is a Client Transformation SRE helping Kyndryl customers adopt SRE tenets and methodology in their environments and helping them transform and modernize their delivery. She is an Integration Chief Engineer leading initiatives that focus on automation and life-cycle aspects of Hybrid Cloud Services. She has 16+ years of IT experience and was a member of the IBM Academy of Technology (AoT). She has been a Society of Women Engineers (SWE) Ambassador since 2016, was one of the founding members of the SWE Chennai Affiliate, and previously served as Affiliate President.


SRE-096: Mainframe Site Reliability Engineering in an Infrastructure Organization

Presenter: Viviane De Padua Diogo Sanches, Guilherme Cartier de Palma

Description: The Site Reliability Engineering role in a Mainframe Infrastructure scope. Understand the traditional Mainframe Infrastructure Delivery organization and how to apply the SRE concepts and tenets. Understand the SRE persona, skills, and their importance in the context of the Mainframe Organization.

Keywords: IBMZ, Mainframe, SRE, DevOps

Speaker Bio:

  • Viviane Sanches – Skill & Enablement Leader for Mainframe
    I have been growing my Mainframe career for 18 years: I started as a Mainframe computer operator and have moved through Leadership, Management, and Global Enablement positions in the Mainframe ecosystem. I graduated in Business Management, hold an MBA in Human Development, and am an Agile Advocate and Management 3.0 certified. I’m passionate about learning and sharing experience with professionals in the Mainframe area. I am very active in many Z ecosystem initiatives, working with key Business and Learning partners and joining global conferences and student forums.
  • Guilherme Cartier – Software developer/developer advocate
    I started my career as a console and batch operator. Eventually I moved to a Production Support Analyst role.
    Today I’m a certified Technical Specialist level 2 and I work as a software developer/developer advocate at the Global Mainframe Services Engineering team.
    I’m passionate about the Mainframe ecosystem, new technologies, software engineering, and helping our customers on their modernization journey.


SRE-097: ChatOps and the Modernization of IT Operations

Presenter: Wesley Stevens

Title: ChatOps and the Modernization of IT Operations

Description: ChatOps is a collaboration model that connects people, process, tooling, and automation to enable teams to interact directly with experts, access cognitive analytics, and use automated processing to solve problems quickly. ChatOps provides agile squads with contextually relevant information so they can interact in real time to address complex situations.

Keywords: ChatOps, ChatBot

Speaker Bio:

  • Wesley Stevens
    Wesley is a Chief Enterprise Architect with IBM Services. He has been with IBM for 20 years, serving in an array of Service and Enterprise Architect roles throughout IBM Services. He is currently responsible for driving innovation and automation through the use of ChatOps for IBM in order to facilitate higher levels of support and customer satisfaction for IBM and its customers.


SRE-100: Introducing FOCA and LogUtil: Tools for incident handling automation and log analysis

Presenter: Yosuke Tanaka

Description: This talk will introduce and demo two tools that the IBM Watson AI SRE team has created: FOCA (Fast On-Call Action) and LogUtil. FOCA is an incident handling automation framework that facilitates collaborative and incremental development of incident handler scripts. LogUtil is a tool built on top of Pandas (Python Data Analysis Library) to bring the power of tabular data analysis to application log analysis. It also provides minimal distributed tracing capability in environments where instrumenting apps or adding a side-car proxy is not easy.
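
As a rough illustration of what tabular log analysis with Pandas can look like, here is a minimal sketch. It does not use or reproduce LogUtil's actual API: the log format, the sample lines, the `req=` correlation ID, and the helper names (`logs_to_frame`, `errors_per_service`, `trace`) are all hypothetical.

```python
import re

import pandas as pd

# Hypothetical log format: "<timestamp> <LEVEL> <service> req=<id> <message>"
LOG_PATTERN = re.compile(
    r"(?P<ts>\S+) (?P<level>[A-Z]+) (?P<service>\S+) "
    r"req=(?P<request_id>\S+) (?P<message>.*)"
)

SAMPLE_LOGS = """\
2021-10-01T10:00:00Z INFO gateway req=a1 request received
2021-10-01T10:00:01Z INFO nlu req=a1 model loaded
2021-10-01T10:00:03Z ERROR store req=a1 write timeout
2021-10-01T10:00:04Z INFO gateway req=b2 request received
2021-10-01T10:00:05Z INFO nlu req=b2 model loaded
"""


def logs_to_frame(text):
    """Parse raw log text into a DataFrame with one row per matching line."""
    rows = [m.groupdict() for line in text.splitlines()
            if (m := LOG_PATTERN.match(line))]
    frame = pd.DataFrame(rows)
    frame["ts"] = pd.to_datetime(frame["ts"])
    return frame


def errors_per_service(frame):
    """Count ERROR-level lines per service."""
    errors = frame[frame["level"] == "ERROR"]
    return errors.groupby("service").size()


def trace(frame, request_id):
    """List the services a request touched, in time order (a minimal trace)."""
    hops = frame[frame["request_id"] == request_id].sort_values("ts")
    return hops["service"].tolist()


frame = logs_to_frame(SAMPLE_LOGS)
print(errors_per_service(frame))
print(trace(frame, "a1"))
```

Following a shared request ID across services, as `trace` does, is one simple way to approximate tracing from logs alone when the apps cannot be instrumented.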

Keywords: incident handling automation, application log analysis, distributed tracing, pandas

Speaker Bio:

  • Yosuke Tanaka is an SRE for IBM Watson AI Services at IBM. He leads the FOCA and LogUtil engineering projects within the team.


SRE-150: Introduction to Operate First

Presenter: Karsten ‘quaid’ Wade, Marcel Hild

Title: Introduction to Operate First


In the Operate First project, we are solving the problem of how to openly operate a 100% open source hybrid cloud. To do that, we are modeling, building, and running a community production cloud that provides operational insights as feedback into open source development. We are building out everything from an SRE knowledge pool and training, to a community team using those same tools and practices who run the actual community production cloud itself at

Keywords: open source, cloud native, open source hybrid cloud, operations, devops, SRE

Speaker Bio:

  • Karsten Wade
    For two decades, Karsten has been teaching about and working in the open source way. In Red Hat’s Open Source Program Office (OSPO), his community architect portfolio centers on the people, principles, and practices of open source communities. His current community management work includes the Open Source Way guidebook and community of practice, and Operate First, a community defining, building, and improving the open source hybrid cloud by open sourcing operations.
  • Marcel Hild
    Marcel Hild has 25+ years of experience in open source business and development. He co-founded a Linux consulting company and has worked as a freelance developer, a Solution Architect for Red Hat, and a core Developer for ManageIQ, a Hybrid Cloud Management tool. Now he researches the topic of AIOps in the Office of the CTO at Red Hat, proving how AI will help operate machines and applications.


SRE-200: Keynote 1

Presenter: Alan Peacock, IBM

Title: Keynote



Speaker Bio:

  • Alan Peacock, General Manager, IBM Cloud Delivery & Operations
    Alan is a globally recognised technology and business transformation leader with deep Financial Services experience. Alan recently joined IBM as the General Manager for IBM Cloud Delivery & Operations. Prior to joining IBM, Alan held senior executive positions, including CIO and CTO roles, at HSBC, Lloyds Banking Group and Royal Bank of Scotland. During his career he has worked in the UK, Europe, US and Asia.


SRE-201: Scaling Enterprise SRE: Accelerating Business Returns from Modern IT Operations

Presenter: Stephen Elliot, IDC

Title: Scaling Enterprise SRE: Accelerating Business Returns from Modern IT Operations



Speaker Bio:

  • Stephen Elliot, IDC Group Vice President, I&O, DevOps, and Cloud Operations
    Stephen Elliot manages multiple programs spanning IT Operations, Enterprise Management, ITSM, Agile and DevOps, Application performance, Virtualization, Multi-Cloud Management and Automation, Log Analytics, Container Management, DaaS, and Software Defined Compute. Mr. Elliot advises Senior IT, Business, and Investment Executives globally in the creation of strategy and operational tactics that drive the execution of Digital Transformation and business growth.
    Before his current role, Stephen helped build business for IDC’s IT Executive Program, where he advised enterprise IT organizations on DevOps best practices across people, process, and technology strategies and tactics. Prior to IDC, Stephen was Vice President of Strategy for CA Technologies’ Infrastructure Management and Data Center Automation business unit, providing direct strategic and tactical support to customers accelerating their path toward the cloud, software-defined datacenters, and Dev/Ops. He also led product management for CA’s VCE partnership, helping customers better understand the business value of converged fabric. Prior to his work at CA, Stephen was a noted software industry analyst at firms including IDC, Gartner, and Forrester, where he advised IT executives and the financial community on IT operations, virtualization, datacenter automation, and network and application management. He also served as a product marketing manager at Inteq, an early provider of SaaS-based network monitoring and service desk capabilities, helping customers understand the business value of the SaaS delivery model.


SRE-202: The evolving journey towards SRE at Discover

Presenter: Ed Calusinski, Discover Financial Services

Title: Client perspective on SRE

Description: In this session we will share the ongoing journey of how Discover Financial Services is maturing their SRE practices and professions within their organization. The session will share early experiences, challenges and opportunities that lie ahead. The session will start with a short presentation followed by an open dialogue with the audience.


Speaker Bio:

  • Ed Calusinski
    Ed Calusinski is the Vice President of Enterprise Architecture and Technology Strategy at Discover Financial Services. In this role he is responsible for setting the technology and architectural strategy across the business, application and infrastructure layers of the organization. He is driving the transformation of the architect profession with a focus on building high performance and resilient digital architectures at scale and the incubation of emerging technology for digital disruption. Ed has a Bachelor of Science in Metallurgy and Aviation from Lewis University and a Master of Science in Computer Science from the Illinois Institute of Technology. Ed is a former IBM Fellow and a member of several university industry advisory boards for the advancement of computer science and computer engineering.


SRE-203: Keynote 1

Presenter: John Granger, IBM

Title: Keynote



Speaker Bio:

  • John Granger, SVP, Hybrid Cloud Services & COO IBM Consulting
    John is responsible for operational discipline and profit performance across IBM Consulting, the consulting and professional services unit of IBM.
    He also leads the Cloud Application Innovation team, helping clients transform their businesses at scale with quality, speed and consistency across end-to-end enterprise application implementations, cloud application migration and modernization, and application maintenance. He leads the GBS network of global delivery centers and has championed delivery transformation through automation, agile and DevOps methodologies.
    Previously, John was General Manager, GBS Europe, responsible for the strategy, execution and business results of the business consulting, systems integration and application management services business across Europe. Prior to that, he held various leadership positions in GBS Europe. He joined IBM in 2002 as part of IBM’s acquisition of PricewaterhouseCoopers Consulting.


SRE-204: The Steady Ascendance of Site Reliability Engineering

Presenter: Craig Lowery, Gartner

Title: The Steady Ascendance of Site Reliability Engineering

Description: Gartner perspective on the growing importance and emerging opportunities for SRE in the marketplace.

Keywords: SRE, Outlook

Speaker Bio:

  • Craig Lowery
    Craig Lowery, Ph.D., is a VP Analyst in Gartner’s Technology and Service Provider group, focused on cloud computing in general, infrastructure as a service (IaaS), platform as a service (PaaS) and managed services for public cloud. Other areas of expertise include SRE, cloud native computing architectures, the emerging “container” ecosystem (lightweight virtualization), serverless computing, cloud application development models, cloud service expense management, cloud management platforms, internal cloud service brokerage and organizational models, and cloud service pricing models.


SRE-205: SRE Journey in IBM and Kyndryl

Presenter: Ingo Averdunk, IBM
Sunil Joshi, IBM
Gene Brown, Kyndryl

Title: SRE Journey in IBM and Kyndryl



Speaker Bio:

  • Ingo Averdunk
    Ingo Averdunk is Distinguished Engineer in the IBM Garage for Cloud. As part of the worldwide team, he is responsible for Architecture and Solutions for Cloud Service Management and Site Reliability Engineering (SRE). Mr. Averdunk develops architectures and consults with IBM’s strategic customers, leads Cloud Adoption and Transformation initiatives, and performs RedTeam reviews globally. Ingo Averdunk is a member of the IBM Academy of Technology (AoT), member of the AoT Technical Committee, the IBM D/A/CH Technical Leadership Team, the EMEA Technical Profession Council, and the global profession co-lead for SRE in IBM. He co-authored “The Cloud Adoption Playbook” (Wiley, 2018), documenting proven strategies for transforming an organization with the Cloud. Ingo is the Meetup Organizer for the SRE Meetup Munich.
  • Sunil Joshi
    Sunil is a Distinguished Engineer with IBM. He is currently CTO for North America Hybrid Cloud Services. Sunil is a regular speaker in the industry on Cloud, DevOps, Site Reliability Engineering, digital transformation and related concepts. His specializations are in the areas of hybrid cloud solutions, platform as a service and DevOps/SRE strategy. Sunil has authored several IBM Redbooks, blogs and articles. Sunil is passionate about international music, multi-cultural cuisine, active sports and mentoring school and college students on career paths and technology.
  • Gene Brown
    Gene Brown is a Distinguished Engineer and the Profession and Global Site Reliability Engineering Leader for Kyndryl. He is responsible for driving the enablement of SRE across Kyndryl’s countries, practices and strategic markets through a CoE model in collaboration with designated SRE focals across the services organization. He leads a global team of Client Transformation SREs to guide offerings teams, account resources and common services teams through a transformation to a new mindset and “new-ways-of-working” by adopting the core practices and tenets of Site Reliability Engineering. Gene was the co-founder of IBM’s new SRE Profession, with a focus on certifying SREs based on their applied experience in the field of Site Reliability Engineering.


SRE-300: Vault Reliability Engineering

Presenter: Rob Barnes, HashiCorp

Title: HashiCorp Vault

Description: This talk will look at the key role played by SREs embedded into dev teams to enable, empower and unblock their teams in their quest to secure applications with HashiCorp Vault.

Keywords: vault, hashicorp

Speaker Bio:

  • Rob Barnes
    Robert, also known as DevOps Rob, is a Senior Developer Advocate at HashiCorp. His focus is primarily on Cloud security. He comes from a Network engineering background and more recently in his career, he has been working as a Cloud Consultant, helping customers extract maximum value from the Cloud. His experience spans across multiple sectors, from Banking and Fintech to Transport, Charities and Cyber Security. He is a strong advocate for open source, security best practices and building diverse Communities.