Fine-grained visual recognition for mobile AR technical support

When a hardware-related system disruption happens, such as an outage due to hard drive failure, the path to recovery includes checking hardware support information, describing the problem to a support representative, waiting for a field technician to arrive, and hoping the technician can resolve the issue in a timely manner, all during a single trip.

Our team of researchers recently published the paper “Fine-Grained Visual Recognition in Mobile Augmented Reality for Technical Support” at IEEE ISMAR 2020 [1, 2]. It outlines an augmented reality (AR) solution that our colleagues in IBM Technology Support Services (TSS) use to increase the rate of first-time fixes and reduce the mean time to recovery from a hardware disruption.

“The most recent industry surveys have shown that the average enterprise estimates that there is an impact of approximately $8,851 for every minute of unplanned downtime in their primary computing environment.” [3]

Unlike virtual reality, which immerses the user in a digital world, AR superimposes graphics and other media on top of a user’s real-world surroundings. By displaying guidance directly over the physical environment, AR support drastically reduces the effort needed to relay instructions, the number of errors, and even the time required to look up service information.

How Augmented Reality Is Revolutionizing Technical Support

Technical support service providers typically maintain tens of thousands of products to meet the needs of their clients. Today, IBM TSS, part of the IBM Services group, supports over 65K products and receives more than 6 million service calls per year [4]. Delivering service at this scale with shrinking support timelines requires innovation. As the product portfolio grows, having an expert technician available for any given hardware issue in every corner of the globe becomes a major challenge. IBM TSS addressed this issue using AR: a mobile application guides both technicians and clients through the repair process using 3D virtual annotations and instructions anchored to the hardware under repair. The remote support application [5] provides live expert guidance, whereas the self-enablement application empowers users to repair their hardware products without having to wait for a specialist. Together, these help IBM TSS scale its operations quickly to support a wide range of products.

A Fine-Grained Visual Recognition Approach to AR Support

Over the last decade, we have seen major progress and increased interest in AR thanks to AR software development kits (SDKs), such as ARKit and ARCore, which have lowered the barrier to entry for AR development. Recently, intelligent AR systems driven by artificial intelligence (AI) have begun to emerge and enhance AR experiences. Despite this progress, most AR user experiences remain primitive and lack intelligence and automation, which makes user interaction rather unintuitive.

Our research addresses this gap and provides enriched AR user experiences by enabling fine-grained visual recognition in AR, i.e., recognizing not only objects but also the visual state changes of an object (or its parts), which is desirable in a wide range of application scenarios, including technical support. Such a visual recognition system recognizes the changing visual state of 3D objects, even when the change is fairly subtle, and enables the AR system to present the right set of instructions to the user, matching their current context. For example, if the user has already completed the first five steps of the repair action before they needed assistance from the AR system, the visual recognition component uses the current state of the hardware to determine that the user now needs the instructions for step 6, instead of asking the user.
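The step-6 example can be pictured as a lookup from a recognized state to the remaining instructions. The following is a minimal sketch, not the authors' implementation: it assumes the recognizer outputs the number of steps already completed, and the step names and the function `remaining_instructions` are hypothetical.

```python
from typing import List

# Hypothetical procedure for illustration; the step names are not from the paper.
REPAIR_STEPS: List[str] = [
    "Power off the laptop",
    "Remove the bottom cover",
    "Disconnect the battery",
    "Unplug the antenna cables",
    "Remove the wireless card screw",
    "Slide out the wireless card",
    "Insert the replacement card",
]


def remaining_instructions(completed_steps: int, steps: List[str]) -> List[str]:
    """Return the instructions still to be shown for a recognized state.

    `completed_steps` stands in for the recognizer's output (an assumption):
    the number of steps whose effect is already visible on the hardware.
    """
    if not 0 <= completed_steps <= len(steps):
        raise ValueError("recognized state is outside the known procedure")
    return steps[completed_steps:]


if __name__ == "__main__":
    todo = remaining_instructions(5, REPAIR_STEPS)
    resume_at = len(REPAIR_STEPS) - len(todo) + 1
    print(f"Resume at step {resume_at}: {todo[0]}")
```

With five steps recognized as done, the user is taken straight to step 6 without having to describe what they have already finished.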

Although AR enables tracking of virtual objects and annotations in physical spaces through computer vision techniques, it is not inherently intelligent enough to recognize the semantics of what it sees. For example, in the technical support domain, traditional AR solutions can recognize a desktop motherboard in the form of a point cloud to enable tracked annotations on top of it, but do not necessarily know that it is a motherboard. Nor would such a system be able to tell whether a desktop computer’s cover is open or closed, whether the motherboard has its fan removed, or whether a specific connector is unplugged.

Such fine-grained recognition is critical for the technical support domain in order to understand the user’s current context and deliver the right set of instructions. For example, during a laptop repair attempt, the user may have removed the fan and now needs the instructions for the next step. Without an intelligent AR system that detects the user’s context, they would either have to choose from the entire set of repair instructions or explicitly state their context. This not only puts the burden on the user, but also assumes that the user is familiar with the repair steps, which is often not the case. In short, existing AR systems require the user to drive all interaction by identifying and specifying the state of an object before the relevant AR content can be projected into their view, which significantly limits the interaction.

Challenges with Fine-Grained Visual Recognition

Providing such fine-grained visual recognition capabilities involves several challenges, such as camera viewing distance, viewing angle, motion and occlusions. For very fine-grained visual recognition, e.g., recognizing a small screw within a motherboard, the camera must be close enough to the target, at a proper viewing angle, and free of occlusions. Thus, simply feeding random video frames from the camera into a machine learning model will not yield satisfactory results. In addition to the challenges above, the solution should work within the resource and power constraints of mobile devices.

Our Solution

The ideal solution mimics the process of human perception and reasoning. To detect state changes, it enables the camera to focus on discrete local areas that change appearance in different states, prompts the user to adjust to proper viewing angles to collect images from these local areas, and makes a prediction on state change only when sufficient visual data has been collected.
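The "predict only when sufficient visual data is collected" idea can be sketched as a small buffer that tracks which local areas still lack images. This is an illustrative sketch under assumptions: the class `RoiImageBuffer`, the RoI names, and the `min_images` threshold are hypothetical and do not come from the paper.

```python
from typing import Dict, List


class RoiImageBuffer:
    """Collects images per local area and reports when prediction may run."""

    def __init__(self, roi_ids: List[str], min_images: int = 1) -> None:
        # One image list per region; min_images is an assumed threshold.
        self.images: Dict[str, List[object]] = {rid: [] for rid in roi_ids}
        self.min_images = min_images

    def add(self, roi_id: str, image: object) -> None:
        """Store an image captured for a known region."""
        if roi_id not in self.images:
            raise KeyError(f"unknown region: {roi_id}")
        self.images[roi_id].append(image)

    def missing(self) -> List[str]:
        """Regions that still need more images before prediction."""
        return [rid for rid, imgs in self.images.items()
                if len(imgs) < self.min_images]

    def ready(self) -> bool:
        """True once every region has enough images."""
        return not self.missing()


if __name__ == "__main__":
    buf = RoiImageBuffer(["fan", "connector"])
    buf.add("fan", "frame_012_crop")  # placeholder for real image data
    print(buf.ready(), buf.missing())
    buf.add("connector", "frame_020_crop")
    print(buf.ready(), buf.missing())
```

The `missing()` list is what an app could use to prompt the user toward the areas it has not yet seen clearly.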

We propose a solution that takes advantage of AR-specific data, such as real-time generated 3D feature points and camera pose, to complement the images captured by the camera for fine-grained visual recognition. We first use a set of training video frames to learn Regions of Interest (RoIs), which have appearance changes that distinguish different states. We then actively track the camera position and orientation to ensure that the camera is kept at the right distance and viewing angle to the RoIs, and to minimize occlusions or other noise in the input images of the visual recognition model. To improve the robustness of recognition, we develop a discrete multi-stream Convolutional Neural Network (CNN), in conjunction with a bi-directional Long Short-Term Memory (LSTM) network, namely a Discrete-CNN-LSTM (DCL) model, to extract not only spatial but also temporal information to predict state changes.
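As an illustration of how a camera pose could be checked against an RoI, the sketch below computes the camera's distance to the RoI and its angle from the RoI's surface normal, then returns a user hint. It is not the authors' implementation: the function `viewing_hint`, the per-RoI normal, and the distance and angle thresholds are assumptions chosen for the example.

```python
import math
from typing import Sequence


def viewing_hint(
    camera_pos: Sequence[float],
    roi_center: Sequence[float],
    roi_normal: Sequence[float],
    min_dist: float = 0.15,        # metres, assumed threshold
    max_dist: float = 0.40,        # metres, assumed threshold
    max_angle_deg: float = 30.0,   # assumed threshold
) -> str:
    """Return 'ok' or a hint telling the user how to move the camera."""
    # Vector from the RoI to the camera, and its length.
    to_cam = [c - r for c, r in zip(camera_pos, roi_center)]
    dist = math.sqrt(sum(v * v for v in to_cam))
    if dist < min_dist:
        return "move back"
    if dist > max_dist:
        return "move closer"
    # Angle between the RoI normal and the direction to the camera.
    n_len = math.sqrt(sum(v * v for v in roi_normal))
    cos_a = sum(t * n for t, n in zip(to_cam, roi_normal)) / (dist * n_len)
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
    if angle > max_angle_deg:
        return "adjust angle"
    return "ok"


if __name__ == "__main__":
    # RoI at the origin, facing up along +z.
    print(viewing_hint((0.0, 0.0, 0.30), (0, 0, 0), (0, 0, 1)))
    print(viewing_hint((0.0, 0.0, 1.00), (0, 0, 0), (0, 0, 1)))
    print(viewing_hint((0.3, 0.0, 0.05), (0, 0, 0), (0, 0, 1)))
```

Only frames for which such a check passes would be worth cropping and passing to the recognition model; the rest can be skipped to save computation on the device.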

Figure 1. The off-line and on-line phases of system design.

The above figure illustrates the design of the proposed system, which consists of both off-line and on-line components.

In the off-line phase, we first harvest data from our AR remote collaboration sessions, or have a dedicated user scan the object with a mobile device, to construct the relevant object models, i.e., 3D point cloud representations of the object in different states. We also collect the camera poses, the corresponding feature points and the video frames. Next, the RoI extraction module generates a set of RoIs based on the video frames collected in different states. These RoIs determine which images should be generated from video frames to recognize object states.
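To make the notion of an RoI concrete, here is a deliberately simple stand-in: plain frame differencing over a grid. This is not the RoI extraction method from the paper. It assumes two grayscale frames of the same object in two states, already aligned to the same viewpoint, and the function `changed_cells`, the cell size, and the threshold are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]  # grayscale image as rows of 0-255 pixel values


def changed_cells(
    frame_a: Frame,
    frame_b: Frame,
    cell: int = 2,            # grid cell size in pixels, assumed
    threshold: float = 20.0,  # mean absolute difference, assumed
) -> List[Tuple[int, int, int, int]]:
    """Return (top, left, height, width) of grid cells that differ."""
    rows, cols = len(frame_a), len(frame_a[0])
    rois = []
    for top in range(0, rows, cell):
        for left in range(0, cols, cell):
            bottom = min(top + cell, rows)
            right = min(left + cell, cols)
            # Mean absolute pixel difference inside this cell.
            diffs = [abs(frame_a[r][c] - frame_b[r][c])
                     for r in range(top, bottom)
                     for c in range(left, right)]
            if sum(diffs) / len(diffs) > threshold:
                rois.append((top, left, bottom - top, right - left))
    return rois


if __name__ == "__main__":
    before = [[100] * 4 for _ in range(4)]
    after = [row[:] for row in before]
    for r in (2, 3):          # simulate a part removed in the lower-left
        for c in (0, 1):
            after[r][c] = 200
    print(changed_cells(before, after))
```

Real frames from different states are rarely aligned this neatly, which is one reason the system also records camera poses and 3D feature points alongside the video frames.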

In the on-line phase, we first detect the object and re-localize the mobile device with respect to the object, using the object model. The RoIs identified in the off-line phase are also mapped to the object, as described in the RoI identification section of the paper [1]. Next, we crop the images of these RoIs to keep only the relevant areas, and further process them to train the model for state recognition. During real-time recognition, the mobile app instructs the user to position the camera at the right distance and viewing angle to the object, and applies the trained visual recognition model to predict the current state. Based on the predicted state, the applicable object model is automatically selected for AR tracking, and the corresponding AR instructions are rendered accordingly.
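Cropping an RoI from a video frame requires knowing where the 3D region lands in the image. The sketch below uses a standard pinhole camera projection and is only an illustration: it assumes a world-to-camera rotation and translation, a camera that looks along +z (AR SDKs may use a different axis convention), and a square RoI of known physical size. The function `roi_crop_box` and all parameter names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def roi_crop_box(
    roi_center_world: Sequence[float],
    roi_size_m: float,
    rotation: Sequence[Sequence[float]],  # 3x3 world-to-camera rotation
    translation: Sequence[float],         # world-to-camera translation
    fx: float, fy: float, cx: float, cy: float,  # camera intrinsics
    image_w: int, image_h: int,
) -> Optional[Box]:
    """Project a 3D RoI into the image and return its crop box, if visible."""
    # Transform the RoI center from world to camera coordinates.
    x, y, z = (
        sum(rotation[i][j] * roi_center_world[j] for j in range(3))
        + translation[i]
        for i in range(3)
    )
    if z <= 0:
        return None  # RoI is behind the camera
    # Pinhole projection of the center, and the RoI half-extent in pixels.
    u, v = fx * x / z + cx, fy * y / z + cy
    half_w, half_h = fx * roi_size_m / (2 * z), fy * roi_size_m / (2 * z)
    # Clamp the box to the image bounds.
    left, top = max(0, round(u - half_w)), max(0, round(v - half_h))
    right = min(image_w, round(u + half_w))
    bottom = min(image_h, round(v + half_h))
    if left >= right or top >= bottom:
        return None  # RoI falls outside the frame
    return (left, top, right, bottom)


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    box = roi_crop_box((0.0, 0.0, 0.5), 0.05, identity, (0, 0, 0),
                       1000.0, 1000.0, 640.0, 360.0, 1280, 720)
    print(box)
```

Because the crop shrinks as the camera moves away, a box like this also gives a quick signal for whether the RoI covers enough pixels to be worth classifying.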

Demo for Laptop Wireless Card Replacement

The demo video showcases our visual recognition capability implemented within an iOS AR application for hardware maintenance. It shows that, with our fine-grained visual recognition, the AR system provides a more immersive and intuitive user experience, detects very subtle changes to tiny connectors, and guides users reliably through the repair process. This feature is key to enabling fully automatic and immersive AR-based self-assist user experiences.



[1] IEEE ISMAR Paper: https://ieeexplore.ieee.org/document/9199568

[2] IEEE ISMAR Presentation: https://youtu.be/3rZUWCWLYeY

[3] “IBM Augmented Reality Working To Support And Accelerate How Support Services Are Changing”, Forbes, April 2020. https://www.forbes.com/sites/davidteich/2020/04/28/ibm-augmented-reality-working-to-support-and-accelerate-how-support-services-are-changing/?sh=331b4882c567

[4] “Ensure Cost Balances With Risk in High-Availability Data Centers”, Gartner, April 2019.

[5] IBM Augmented Remote Assist. https://mediacenter.ibm.com/media/IBM+Augmented+Remote+AssistA+Augmented+reality+for+IT+remote+support/1_d2guskyu


Research Staff Member and Manager, Interactive and Immersive AI for Technology Support, IBM Research

Bing Zhou

Research Staff Member, IBM Research
