AI-powered automation in the US’s largest 5G network

How T-Mobile uses AIOps to maximize efficiency and reliability
by Rob Spencer
6-minute read

In April 2020, T-Mobile made a competitive leap forward that transformed the telecommunications market in the US. The company acquired Sprint, starting a merger that formed the nation’s largest 5G network.

Of course, the expansion also carried significant risk. In the intensely competitive telco market, customers offered no grace period for the merger. T-Mobile needed to integrate two large networks while continuing to deliver consistent reliability and quality of service across the board.

Making it happen depends on network management. T-Mobile is using tools from IBM Cloud Pak® for Watson AIOps to integrate management, turning two vast networks into one and using AI-powered automation to maximize both reliability and efficiency.


When T-Mobile acquired Sprint, it had to correlate approximately 3 million additional faults per day

Average fault processing time with IBM Cloud Pak for Watson AIOps: 19 seconds, down from 5 minutes

Maximizing quality despite twice the complexity

“Network management is a major, major component of the quality of service we deliver to the people and the companies using the T-Mobile network,” says Tom Higdon, T-Mobile’s Principal Event Management Architect. “Network management is how you ensure that all the great services you have are actually available for customers and performing the way they should.”

And when you run one of the largest communications networks in the world, network management is also extremely complex. As Higdon explains, “It’s not simply the wireless network, because the wireless network depends on other supporting networks and infrastructure: your data centers and IP-based gear and applications and services, cell sites and switch sites and all the equipment involved.”

When T-Mobile acquired Sprint, the scope and complexity of network management roughly doubled. “Day one of merger,” says Higdon, “the scope of what we needed to provide network management for grew by very close to a half a million devices being monitored, which created about three million new faults per day to correlate.”

In addition, the two networks had distinct, mature management systems. “That increased the complexity for integration because we had systems that had been in existence for greater than 10 years each,” explains Higdon. The systems involved two separate core management platforms, one based on IBM® Netcool® solutions (now built into IBM Cloud Pak for Watson AIOps) and the other based on a third-party management solution, plus a multivendor array of monitoring tools feeding into the core platforms.

With a focus on providing the highest network performance, T-Mobile sought the best way to identify actionable items within the millions of daily faults. The company also wanted to build in more automation, making network management more efficient and the network itself more reliable.

Flexibility and efficiency in network management

“Almost immediately post-merger,” recounts Higdon, “we saw that IBM was investing in its product suite, and that IBM wanted a true partnership with us, so that we could both evolve, both succeed. So some technical as well as non-technical differentiators started to show themselves. That led us to the decision to move forward with the Watson AIOps solution.”

The immediate priority was integrating the disparate fault management platforms. It was a daunting task that needed to be done fast. Fortunately, Higdon and his colleagues found that the IBM solutions were flexible enough to incorporate data from the other management and monitoring tools. “We quickly integrated successfully. It’s one of the very first things we did as a new business,” says Higdon. “The connectors and adapters in the IBM product suite could be quickly deployed and configured, with minimal customizations, to start getting that data, enriching our fault and event information. That allowed us to merge our network operations teams.”

When staging the merged network management platform for its production release, T-Mobile achieved dramatic efficiency gains. “We processed 90% of the total alarm volume through a tenth the scale of the targeted production system. All SNMP alarms were coming through a single SNMP probe with no delays or failures, while the other management system that existed in the network required ten SNMP gateways for the same volume and experienced multiple failures and/or processing delays,” says Higdon.

Dramatically accelerating actionable insight

With operations between the acquired and original platforms merged, Higdon and his team are now focused on further streamlining management, reducing the “noise” of regular network operations and creating the fastest paths to issue detection and resolution or, ideally, prevention. “We create and implement what I call frameworks that are generic in nature, that help us address multiple types of issues and get the proper results,” says Higdon. For example, Higdon describes a recently implemented wireless core alarm correlation framework. “Correlations can be performed nearly instantaneously. That’s a differentiator. We’re talking about not having to go back and scan thousands of faults every minute. It’s instantaneous.”
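To make the contrast between instantaneous correlation and periodic rescanning concrete, here is a minimal sketch of event-driven correlation, in which each incoming fault updates a small per-site window as it arrives. It is an illustration only and not how Watson AIOps or T-Mobile's framework is implemented; the `Fault` fields, the `IncrementalCorrelator` class, the 60-second window, and the threshold of three are all hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    """A hypothetical fault record; the field names are illustrative only."""
    device: str
    site: str
    timestamp: float  # seconds since an arbitrary start


class IncrementalCorrelator:
    """Groups faults by site as they arrive instead of rescanning a backlog."""

    def __init__(self, window_seconds: float, threshold: int):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._recent = defaultdict(deque)  # site -> faults still inside the window

    def ingest(self, fault: Fault) -> bool:
        """Record one fault; return True if its site now meets the threshold."""
        bucket = self._recent[fault.site]
        bucket.append(fault)
        # Drop faults that have aged out of the window (assumes in-order arrival).
        cutoff = fault.timestamp - self.window_seconds
        while bucket and bucket[0].timestamp < cutoff:
            bucket.popleft()
        return len(bucket) >= self.threshold


correlator = IncrementalCorrelator(window_seconds=60.0, threshold=3)
stream = [
    Fault("radio-1", "site-A", 0.0),
    Fault("router-7", "site-B", 5.0),
    Fault("radio-2", "site-A", 10.0),
    Fault("power-1", "site-A", 20.0),
]
flags = [correlator.ingest(f) for f in stream]
print(flags)  # [False, False, False, True]
```

Each call to `ingest` touches only the faults for one site, so the cost per fault stays small no matter how many faults are open across the rest of the network.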

And creating such frameworks is relatively easy, according to Higdon. “The tools under Watson AIOps, again, they’re flexible, with adapters to help create these frameworks. That’s one of the biggest strengths. It doesn’t take extensive coding. Some adaptations are needed, but they tend to be very straightforward. The solution provides ways of talking generically with many, many different things. It provides internal capabilities that can be easily adapted to meet our needs.”

Higdon and his team are already seeing dramatic accelerations in delivering actionable insight to users: “We have designed our Watson AIOps deployment to provide full local and geographic redundancy at all levels. Further, we have scaled the target system such that it will be capable of processing required fault volumes with minimal delays. When processing 90% of the total fault volume using minimal AIOps components, we’ve seen that the average time from fault occurrence — through the system with all enrichment, correlation, and/or suppression — to the user display is 19 seconds. Legacy systems were closer to five minutes.”
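As a quick check on the scale of that improvement, the short calculation below compares the two figures. It treats the legacy time as exactly five minutes, which is an assumption, since Higdon describes it only as "closer to five minutes"; the function names are illustrative.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """How many times faster the new processing time is."""
    return before_seconds / after_seconds


def reduction_percent(before_seconds: float, after_seconds: float) -> float:
    """Percentage of the original processing time that was eliminated."""
    return 100.0 * (before_seconds - after_seconds) / before_seconds


legacy_seconds = 5 * 60  # assumed: "closer to five minutes" taken as 300 s
aiops_seconds = 19       # average reported in the article

print(f"{speedup(legacy_seconds, aiops_seconds):.1f}x faster")              # 15.8x faster
print(f"{reduction_percent(legacy_seconds, aiops_seconds):.1f}% less time")  # 93.7% less time
```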

Next steps: closed-loop automation with AI

Now, T-Mobile is testing how the AI and machine-learning capabilities of IBM Watson AIOps could help achieve even greater responsiveness, further strengthening network reliability.

“We’re looking for the AI to provide intelligence back to us. It could be intelligence in the form of what should my threshold be for correlation count, or is there a relationship here that’s not apparent to us? Things that we as humans might not pick up without extensive time and energy.”
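One way to picture a suggested correlation-count threshold is a simple statistical baseline computed from historical counts. The sketch below is a stand-in heuristic (the mean plus a multiple of the standard deviation), not the machine-learning method Watson AIOps uses; the function name, the multiplier `k`, and the sample history are all hypothetical.

```python
import math
import statistics


def suggest_threshold(counts_per_window: list[int], k: float = 3.0) -> int:
    """Suggest a fault-count threshold k standard deviations above the mean.

    This is a plain statistical heuristic used for illustration only.
    """
    mean = statistics.mean(counts_per_window)
    spread = statistics.pstdev(counts_per_window)
    return math.ceil(mean + k * spread)


# Hypothetical history: faults seen per time window at one site.
history = [2, 3, 2, 4, 3, 2, 3, 3]
threshold = suggest_threshold(history)
print(threshold)  # 5
```

A window with five or more faults would then stand out from this site's usual behavior, which is the kind of recommendation a learned model could refine with far more context.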

The goal in network management is to move from reactive to proactive and eventually to predictive management. Higdon looks to use AIOps to go another step and achieve closed-loop automation. “That’s where we’re ultimately trying to go. Have the tools bring in mass amounts of data, make sense of that data, make recommendations, and possibly even get to an intent-based orchestration or automation.”

This will be a valuable advantage with a network of such size, in an industry so competitive. “I need to increase capacity. I have the tools pass that information to an intelligent automation platform that’s going to effect the change. All without human intervention. That's our plan, that closed loop, that’s where we’re still charging towards right now.”
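The capacity example Higdon gives can be sketched as a small observe, decide, act, and verify loop. This is a conceptual illustration under stated assumptions, not T-Mobile's or IBM's automation platform: the `Cell` model, the 80% utilization target, and the `act` function standing in for an automation platform are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """A hypothetical network element with a capacity and a current load."""
    name: str
    capacity: float
    load: float

    @property
    def utilization(self) -> float:
        return self.load / self.capacity


def decide(cell: Cell, target: float) -> float:
    """Return the extra capacity needed to bring utilization down to target."""
    if cell.utilization <= target:
        return 0.0
    return cell.load / target - cell.capacity


def act(cell: Cell, extra: float) -> None:
    """Stand-in for an automation platform applying the change."""
    cell.capacity += extra


def closed_loop(cell: Cell, target: float = 0.8) -> bool:
    """Observe, decide, act, then verify. Returns True if a change was made."""
    extra = decide(cell, target)
    if extra == 0.0:
        return False
    act(cell, extra)
    # Verifying the outcome is what closes the loop.
    if cell.utilization > target + 1e-9:
        raise RuntimeError(f"{cell.name}: change did not reach the target")
    return True


cell = Cell("site-A", capacity=100.0, load=95.0)
changed = closed_loop(cell)
print(changed, round(cell.utilization, 2))
```

The verification step is the part that distinguishes a closed loop from simple scripted remediation: the system confirms its own change had the intended effect before it moves on.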


About T-Mobile

T-Mobile is one of the world’s leading and fastest growing mobile communications providers. A subsidiary of Deutsche Telekom AG headquartered in Bonn, Germany, T-Mobile serves consumer and enterprise customers in Europe, the US and the Caribbean. In 2020, it added 5.5 million net customers and achieved $68.4 billion in revenue.
