Published: 18 June 2024
Contributors: Mesh Flinders, Ian Smalley
Artificial intelligence (AI) inference is the ability of trained AI models to recognize patterns and draw conclusions from information that they haven’t seen before.
AI inference is critical to the advancement of AI technologies and underpins their most exciting applications, such as generative AI, the capability that powers the popular ChatGPT application. AI models rely on AI inference to imitate the way people think, reason and respond to prompts.
AI inference starts with training an AI model on a large dataset using decision-making algorithms. AI models consist of decision-making algorithms trained on neural networks, layered computing architectures loosely modeled on the human brain. For example, an AI model designed for facial recognition might be trained on millions of images of human faces. Eventually, it learns to accurately identify features like eye color, nose shape and hair color, and it can then use them to recognize an individual in an image.
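To make the idea concrete, here is a minimal sketch of the train-then-infer pattern in Python, using scikit-learn and its bundled digits dataset as stand-ins for whatever data and model an enterprise would actually use:

```python
# A minimal sketch: train a model on labeled examples, then run inference
# on data the model has never seen. Dataset and model are illustrative only.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()

# Hold back some data so the model never sees it during training.
X_train, X_unseen, y_train, y_unseen = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Training: the model learns patterns from labeled examples.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Inference: the trained model draws conclusions about unseen data.
predictions = model.predict(X_unseen)
print("Accuracy on unseen data:", model.score(X_unseen, y_unseen))
```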
Though closely related, AI inference and machine learning (ML) are two different steps in the AI model lifecycle.
While most organizations are clear about the outcomes they expect from generative AI, what’s not so well understood is the way to go about realizing these outcomes. Choosing the wrong model can severely impact your business.
If AI models aren’t trained on a robust dataset that’s appropriate to their application, they simply aren’t effective. Given the sensitive nature of the technology and how closely it’s scrutinized in the press1, enterprises need to be cautious. But with applications that span industries and the potential to drive digital transformation and scalable innovation, the benefits of AI inference are many:
While the benefits of AI inference are many, as a young, fast-growing technology it is not without its challenges. Here are some of the problems facing the industry that businesses considering an investment in AI should weigh:
AI inference is a complex process that involves training an AI model on appropriate datasets until it can infer accurate responses. This is a highly compute-intensive process, requiring specialized hardware and software. Before looking at the process of training AI models for AI inference, let’s explore some of the specialized hardware that enables it:
The central processing unit (CPU) is the primary functional component of a computer. In AI training and inference, the CPU runs the operating system and helps manage compute resources required for training purposes.
Graphics processing units (GPUs), or electronic circuits built for high-performance computer graphics and image processing, are used in various devices, including video cards, motherboards and mobile phones. However, due to their parallel processing capabilities, they are also increasingly being used in the training of AI models. One method is to connect many GPUs to a single AI system to increase that system’s processing power.
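As an illustration of how training code typically takes advantage of GPU hardware, here is a short PyTorch sketch; the tiny model and random batch are placeholders rather than a real workload:

```python
# Use a GPU when one is available; otherwise fall back to the CPU.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(128, 10).to(device)          # move the model's parameters onto the GPU
inputs = torch.randn(64, 128, device=device)   # create a batch of data directly on the GPU

outputs = model(inputs)                        # the forward pass now runs on the GPU
print(outputs.shape, "computed on", device)

# Several GPUs attached to one system can be used together, for example by
# wrapping the model with nn.DataParallel(model) before training.
```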
Field-programmable gate arrays (FPGAs) are highly customizable AI accelerators that depend on specialized knowledge to be reprogrammed for a specific purpose. Unlike other AI accelerators, FPGAs have a unique design that suits a specific function, often the processing of data in real time, which is critical to AI inference. FPGAs are reprogrammable on a hardware level, enabling a higher level of customization.
ASICs are AI accelerators designed with a specific purpose or workload in mind, like deep learning in the case of the WSE-3 accelerator produced by Cerebras. ASICs help data scientists speed up AI inference and lower its cost. Unlike FPGAs, ASICs cannot be reprogrammed, but because they are built for a single purpose, they typically outperform other, more general-purpose accelerators. One example is Google’s Tensor Processing Unit (TPU), developed for neural network machine learning using Google’s own TensorFlow software.
Enterprises interested in investing in AI applications as part of their digital transformation journey should educate themselves about the benefits and challenges of AI inference. For those who have thoroughly investigated its various applications and are ready to put it to use, here are five steps to establishing effective AI inference:
Preparing data is critical to creating effective AI models and applications. Enterprises can create datasets for AI models to train on by using data from within their organization, from external sources or, for optimal results, a combination of both. Another key part of assembling the data your AI will train on is data cleansing: removing duplicate entries and resolving formatting problems.
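As a simple illustration of that cleansing step, here is a hedged pandas sketch; the file names and column names are hypothetical stand-ins for an enterprise’s real internal and external data:

```python
# Combine internal and external records, remove duplicates and fix formatting.
import pandas as pd

internal = pd.read_csv("internal_records.csv")   # hypothetical internal data
external = pd.read_csv("external_records.csv")   # hypothetical external data
data = pd.concat([internal, external], ignore_index=True)

# Remove duplicate entries.
data = data.drop_duplicates()

# Resolve simple formatting problems, such as inconsistent casing and dates.
data["customer_name"] = data["customer_name"].str.strip().str.title()
data["signup_date"] = pd.to_datetime(data["signup_date"], errors="coerce")

data.to_csv("training_dataset.csv", index=False)
```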
Once a dataset has been assembled, the next step is selecting the right AI model for your application. Models range from simple to complex, with more complex models able to accommodate more inputs and infer at a subtler level than simpler ones. During this step, it’s important to be clear about your needs, as training more complex models can require more time, money and other resources than training simpler ones.
To get the desired outputs from an AI application, businesses usually need to go through many rigorous rounds of AI training. As a model trains, its inferences become more accurate, while the compute power and latency required to reach those inferences decrease. As the model matures, it shifts into a new phase where it can start to make inferences about new data based on what it has learned. This is an exciting step because you can see your model begin to operate the way it was designed to.
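One way to picture those repeated rounds is the following scikit-learn sketch, which trains incrementally and checks accuracy on held-out data after each round; the digits dataset and the choice of classifier are illustrative only:

```python
# Train in rounds and watch validation accuracy sharpen as the model matures.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = SGDClassifier(random_state=0)
classes = np.unique(y_train)

for round_num in range(1, 11):
    model.partial_fit(X_train, y_train, classes=classes)
    print(f"Round {round_num}: validation accuracy = {model.score(X_val, y_val):.3f}")
```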
Before your model is deemed operational, it’s important to check and monitor its outputs for any inaccuracies, biases or data privacy issues. Postprocessing, as this phase is sometimes called, is where you create a step-by-step process for verifying that your model gives accurate answers and functions the way it’s intended to.
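A postprocessing check can be as simple as comparing outputs with known-good answers and flagging slices of the data where accuracy drops, a crude screen for inaccuracy or bias. The sketch below uses made-up results and an arbitrary threshold purely for illustration:

```python
# Compare predictions with ground truth and flag underperforming groups.
import pandas as pd

results = pd.DataFrame({
    "prediction": [1, 0, 1, 0, 0, 1, 0, 1],
    "actual":     [1, 0, 1, 0, 0, 0, 1, 1],
    "region":     ["east", "east", "east", "east", "west", "west", "west", "west"],
})

overall = (results["prediction"] == results["actual"]).mean()
print(f"Overall accuracy: {overall:.2f}")

# Flag any region whose accuracy falls well below the overall figure.
for region, subset in results.groupby("region"):
    acc = (subset["prediction"] == subset["actual"]).mean()
    if acc < overall - 0.1:   # threshold chosen purely for illustration
        print(f"Review needed: accuracy in '{region}' is {acc:.2f}")
```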
After rigorous monitoring and postprocessing, your AI model is ready to be deployed for business use. This last step includes the implementation of the architecture and data systems that will enable your AI model to function, as well as the creation of any change management procedures to educate stakeholders on how to use your AI application in their day-to-day roles.
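As a final illustration, here is a minimal Flask sketch of what deployment can look like once the model is trained; the saved model file is hypothetical, and a production rollout would add the surrounding data systems and change-management steps described above:

```python
# Wrap a trained model in a small web service so other systems can request predictions.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("trained_model.joblib")   # hypothetical artifact saved after training

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]      # expects {"features": [[...], ...]}
    predictions = model.predict(features).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(port=8080)
```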
Depending on the kind of AI application enterprises require, there are different types of AI inference to choose from. If a business is building an AI model for an Internet of Things (IoT) application, streaming inference, which works from a continuous flow of sensor measurements, is likely the most appropriate choice. However, if an AI model is designed to interact with humans, as LLM-powered applications do, online inference would be a better fit. Here are the three types of AI inference and the characteristics that make them unique.
Dynamic inference, also known as online inference, is the fastest kind of AI inference and is used in the most popular LLM AI applications, such as OpenAI’s ChatGPT. Dynamic inference produces outputs and predictions the instant they are requested and, as a result, requires low latency and speedy access to data to function. Another characteristic of dynamic inference is that outputs can come so quickly that there isn’t time to review them before they reach an end user. This leads some enterprises to add a layer of monitoring between the output and the end user to ensure quality control.
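That monitoring layer can be as simple as a function that sits between the model’s raw output and the user, holding back low-confidence or policy-violating responses. The checks and the blocked-term list below are hypothetical:

```python
# A quality-control filter between dynamic-inference output and the end user.
BLOCKED_TERMS = {"confidential", "password"}   # hypothetical policy list

def moderate(output_text: str, confidence: float) -> str:
    """Return the model's output only if it passes basic quality checks."""
    if confidence < 0.5:
        return "The model was not confident enough to answer. Please rephrase."
    if any(term in output_text.lower() for term in BLOCKED_TERMS):
        return "This response was withheld by the quality-control layer."
    return output_text

# A raw model output and its confidence score pass through the layer on the way out.
print(moderate("The forecast for Q3 is cautiously positive.", confidence=0.92))
```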
Batch inference generates AI predictions offline by using large batches of data. With a batch inference approach, data that’s been previously collected is then applied to ML algorithms. While not ideal for situations where outputs are required in a few seconds or less, batch inference is a good fit for AI predictions that are updated regularly throughout the day or over the course of a week, like sales or marketing dashboards or risk assessments.
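In code, batch inference often amounts to scoring a file of previously collected records in one offline pass and writing the results somewhere a dashboard can read them. The file names, columns and saved model below are hypothetical:

```python
# Score a batch of collected records offline and persist the predictions.
import joblib
import pandas as pd

model = joblib.load("trained_model.joblib")          # model trained earlier
batch = pd.read_csv("overnight_transactions.csv")    # data collected since the last run

# Score the whole batch at once; per-request latency is not a concern here.
batch["risk_score"] = model.predict(batch[["amount", "account_age_days"]])

# Write the predictions for a sales, marketing or risk dashboard to pick up.
batch.to_csv("risk_scores_today.csv", index=False)
```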
Streaming inference uses a pipeline of data, usually supplied through regular measurements from sensors, and feeds it into an algorithm that uses the data to continually make calculations and predictions. IoT applications, such as AI used to monitor a power plant or traffic in a city via sensors connected to the internet, rely on streaming inference to make their decisions.
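A streaming pipeline can be pictured as a loop that consumes each new measurement as it arrives and feeds it straight into the model. In the sketch below the sensor stream is simulated with random readings, and the "model" is a simple threshold standing in for a trained one:

```python
# Feed a continuous stream of sensor readings into a model, one reading at a time.
import itertools
import random
import time

def sensor_stream():
    """Simulate an IoT sensor emitting a temperature reading every second."""
    while True:
        yield random.uniform(60.0, 110.0)
        time.sleep(1)

def predict_overheat(temperature: float) -> bool:
    """Placeholder model: flag readings that suggest equipment is overheating."""
    return temperature > 95.0

for reading in itertools.islice(sensor_stream(), 10):   # process ten readings for the demo
    if predict_overheat(reading):
        print(f"ALERT: reading {reading:.1f} predicts overheating")
    else:
        print(f"Normal reading: {reading:.1f}")
```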
AI on IBM Z® uses machine learning to convert data from every transaction into real-time insights.
IBM® watsonx.ai™ AI studio is part of the IBM watsonx™ AI and data platform, bringing together new generative AI (gen AI) capabilities powered by foundation models and traditional machine learning (ML) into a powerful studio spanning the AI lifecycle.
With a hybrid-by-design strategy, you can accelerate the impact of AI across your enterprise.
IBM Consulting™ is working with global clients and partners to co-create what’s next in AI. Our diverse, global team of more than 20,000 AI experts can help you quickly and confidently design and scale cutting-edge AI solutions and automation across your business.
Use of generative AI for business is on the rise, and it’s easy to see why.
Explore more about the transformative technology of AI that is already helping enterprises tackle business challenges.
Chat with a solo model to experience working with generative AI in watsonx.ai.
Artificial intelligence, or AI, is technology that enables computers and machines to simulate human intelligence and problem-solving capabilities.
Machine learning (ML) is a branch of artificial intelligence (AI) and computer science that focuses on using data and algorithms to enable AI to imitate the way that humans learn, gradually improving its accuracy.
An AI model is a program that has been trained on a set of data to recognize certain patterns or make certain decisions without further human intervention.
All links reside outside ibm.com
1 “Why Companies Are Vastly Underprepared For The Risks Posed By AI”, Forbes, June 15, 2023
2 “Onshoring Semiconductor Production: National Security Versus Economic Efficiency”, Council on Foreign Relations, April 2024