Accelerating research and innovation

How NCHC uses AIOps to improve public network services and proactively prevent outages
by Rob Spencer

The speed of research matters. During the COVID-19 pandemic, it’s been the difference between life and death for millions.

In Taiwan, where the pandemic response has been exceptionally effective at limiting outbreaks and death, the National Center for High-performance Computing (NCHC) helps accelerate research and innovation nationwide by providing access to supercomputers and analytics and by facilitating nationwide networks for data sharing and collaboration.

Although NCHC supports research in all disciplines, the urgency of the pandemic inspired it to launch successive “Tech v Virus” programs, which call for universities, research organizations, enterprises and startups to find new ways to fight the spread of the SARS-CoV-2 coronavirus. One high-profile breakthrough so far is a stethoscope that visualizes a patient’s breathing, which helps doctors and nurses reduce close contact with potentially infected patients and so lowers the risk of transmission. Another is a map of the evolution of the virus’s genome, which helps predict routes of spread.

To support efforts like these, and hundreds of others in all fields, NCHC wants to ensure that research moves as fast as it can. That’s why it continues evolving its Taiwania series of supercomputers, which includes one of the 50 most powerful computers in the world. That’s why it provides AI services — including tools based on IBM Cloud Pak® for Data. And that’s why NCHC recently worked with the IBM Garage™ to implement the IBM Cloud Pak for Watson AIOps solution, applying AI-based automation to maximize resilience and performance.

Reduced mean time to detect (MTTD) by 55% for service-impacting issues

Identifies potential outages 25 hours earlier than before

Cutting through IT Ops complexity

Taiwan has several major public computing networks that crisscross the country and allow researchers to share information and collaborate. Some of the networks are specialized for academia, some for government and some for industry. But increasingly — especially in response to the COVID-19 pandemic — research initiatives have demanded cross-discipline efforts and cross-network collaboration. Fast information sharing between the public networks is crucial.

So NCHC began a new initiative: building a central network exchange. But bringing the networks together presented a new layer of challenges. The different networks were equipped with a disparate array of monitoring tools and data log sources and formats. This variety made management harder and kept NCHC from quickly filtering alarms to detect significant issues and prevent outages. Outages, in turn, would impede data sharing and collaboration across the networks.

To fulfill the purpose of the central exchange — accelerating nationwide research collaboration — NCHC needed a way to cut through the complexity of IT operations management. It turned to AIOps.

Predictive maintenance with AIOps

As part of its search for a solution, NCHC worked with the IBM Garage to run a proof of concept (POC) based on IBM Cloud Pak for Watson AIOps software.

The goal of the POC was to gauge the real-world impact of the potential solution. NCHC provided operations data and networking log data from real-life scenarios, such as networking equipment breaking down in ways that would cause outages.

The NCHC and IBM teams then used IBM Cloud Pak for Watson AIOps as a central integrator of the network exchange’s diverse array of IT operations tools, producing a holistic view of the entire infrastructure. And by feeding structured and unstructured data into the solution’s AI Manager component, NCHC and the IBM Garage team were able to train AI models to automatically, and proactively, manage problems and incidents.

The results were strong: the teams cut the mean time to detect (MTTD) issues that would affect service by 55%.
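For readers unfamiliar with the metric, MTTD is commonly computed as the average delay between when an issue begins and when it is detected. The sketch below shows that arithmetic in Python. The source does not describe how NCHC measured MTTD, so the function names and every timestamp here are hypothetical, chosen only so that the example works out to a 55% reduction.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents):
    """Average delay between issue start and detection.

    `incidents` is a list of (started, detected) datetime pairs.
    """
    delays = [detected - started for started, detected in incidents]
    return sum(delays, timedelta()) / len(delays)


def percent_reduction(before, after):
    """Percentage by which `after` is shorter than `before`."""
    return 100 * (before - after) / before


# Hypothetical incidents, NOT NCHC data: (issue started, issue detected).
t0 = datetime(2021, 1, 1, 9, 0)
baseline = [
    (t0, t0 + timedelta(minutes=40)),
    (t0, t0 + timedelta(minutes=60)),
]
with_aiops = [
    (t0, t0 + timedelta(minutes=18)),
    (t0, t0 + timedelta(minutes=27)),
]

mttd_before = mean_time_to_detect(baseline)    # 50 minutes
mttd_after = mean_time_to_detect(with_aiops)   # 22.5 minutes
reduction = percent_reduction(mttd_before, mttd_after)
print(f"MTTD reduced by {reduction:.0f}%")
```

In practice the hard part is deciding when an issue "started", which is usually reconstructed after the fact from logs or metrics.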

Based on the success of the POC, NCHC and the IBM® Customer Success Manager team deployed IBM Cloud Pak for Watson AIOps into the exchange center production environment. NCHC now uses the following components of IBM Cloud Pak for Watson AIOps:

  • AI Manager: to ingest structured and unstructured data and train AI models to proactively manage problems and incidents. Each alert generated by AI Manager is published as a story in a ChatOps interface that NCHC staff use as the single source of truth for monitoring the exchange center.
  • Event Manager: to import all network device logs via a pre-defined batch program, and to cut network noise through event grouping, which is expected to lower operational costs significantly.
  • Metric Manager: to ingest all network device metric data, such as CPU, memory and disk usage, and provide a holistic view of device statuses.
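
To make the idea of event grouping concrete, the following Python sketch collapses a stream of device events into groups, so that an operator sees one item per burst instead of one per log line. It is a generic illustration under stated assumptions, not a description of how Event Manager works internally: the `Event` type, the `group_events` function, the five-minute window and the device names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since some reference point
    device: str
    message: str


def group_events(events, window_seconds=300):
    """Group events per device.

    An event joins its device's current group if it arrives within
    `window_seconds` of the last event in that group; otherwise it
    starts a new group.
    """
    groups = []
    current = {}  # device name -> that device's most recent group
    for event in sorted(events, key=lambda e: e.timestamp):
        group = current.get(event.device)
        if group is not None and event.timestamp - group[-1].timestamp <= window_seconds:
            group.append(event)
        else:
            group = [event]
            groups.append(group)
            current[event.device] = group
    return groups


# Hypothetical events for illustration only.
events = [
    Event(0, "router-a", "link down"),
    Event(30, "router-a", "BGP session lost"),
    Event(45, "switch-b", "high CPU"),
    Event(900, "router-a", "link down"),
]

groups = group_events(events)
print(f"{len(events)} events collapsed into {len(groups)} groups")
```

Grouping only by device and time is the simplest possible rule. Real tools typically also correlate events across devices, for example by network topology or by learned patterns.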

Driving ongoing discovery and innovation

The MTTD reduction means that NCHC can detect potential outages 25 hours earlier than it could before, giving staff time to see and resolve problems before an outage occurs.

So far, these impressive results have come in response to common, known problems. NCHC knows that unique, unexpected issues will arise and provide new tests for the solution, but the organization expects similar results. Ultimately, NCHC expects that its adoption of AIOps will help keep information channels open so that research projects across Taiwan have the critical data they need to keep making progress toward discovery and innovation.

About National Center for High-performance Computing (NCHC)

With the mission of promoting scientific discovery and technological innovation, Taiwan’s NCHC provides the country’s government agencies, higher education institutions and industries with supercomputing services, high-quality networking, high-efficiency storage, big data analysis and scientific engineering simulations. NCHC is headquartered in Hsinchu City.
