72% of top-performing CEOs agree that having the most advanced generative AI tools gives an organization a competitive advantage, according to the IBM Institute for Business Value. But if those generative AI tools are not grounded in an enterprise’s unique context, organizations might not get the full benefit from them.
As powerful as the big, general-purpose generative AI models such as ChatGPT and Google Gemini are, they aren’t trained on organization-specific data sets. When they’re plugged into an organization’s processes, they can lack important context, misread domain-specific language and produce suboptimal results.
“Every company has its own language,” explains Michael Choie, Senior Managing Consultant, AI and Analytics, IBM Consulting. “Take the word ‘dressing.’ For a grocery chain, that’s going to mean ‘salad dressing.’ For a hospital, that’s going to mean ‘wound dressing.’”
IBM partnered with The Harris Poll to publish AI in Action 2024, a survey of 2,000 organizations across the globe. The survey discovered that 15% of these organizations—called AI Leaders—are achieving quantifiable results with AI.
One thing that sets AI Leaders apart is confidence in their ability to customize their AI efforts for optimal value. This doesn’t mean an organization must build its own models from scratch to stand out from the crowd. Instead, it can adapt existing AI models by leveraging the one thing nobody else has: proprietary enterprise data.
“Every AI vendor, such as X or Google, has access to public information. They also have access to data from their own platforms,” explains Shobhit Varshney, Vice President and Senior Partner, Americas AI Leader, IBM Consulting. “What they don’t have access to is your enterprise data. That piece of the puzzle is missing.”
As Varshney elaborates in AI in Action 2024, “The next frontier is getting AI to cross the chasm and get inside an enterprise so it can absorb, learn and become your competitive advantage.”
There are three primary ways to feed proprietary data to an AI model: prompt engineering, retrieval augmented generation (RAG) and fine-tuning.
In this context, prompt engineering means including proprietary data in the prompt that is passed to the AI.
Say that a user wants an AI model to summarize call center conversations. The user can write a prompt—“Summarize this conversation”—and attach the call transcript as part of the prompt.
Prompt engineering doesn’t require altering the model itself. It is best suited for low-volume, generic tasks where it is reasonable to include the necessary context in every prompt.
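To make this concrete, here is a minimal sketch of prompt engineering in Python. It assumes the OpenAI Python SDK purely as an illustrative client; the model name and transcript file are placeholders, and any chat-style API would work the same way.

```python
# Minimal prompt engineering sketch: the proprietary data (a call
# transcript) travels inside the prompt itself. The SDK, model name
# and file name are illustrative placeholders, not a prescribed stack.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("call_transcript.txt") as f:  # proprietary context
    transcript = f.read()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You summarize call center conversations."},
        # The enterprise data is simply included in the prompt:
        {"role": "user", "content": f"Summarize this conversation:\n\n{transcript}"},
    ],
)
print(response.choices[0].message.content)
```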
Retrieval augmented generation (RAG) means hooking an AI model up to a proprietary database. The model can pull relevant information from this database when responding to prompts.
For example, an organization can give a customer service chatbot access to a database of company products. When users ask the chatbot questions about these products, it can look at the corresponding documentation and retrieve the correct answer.
RAG does not require any permanent changes to the model. It can improve accuracy and reduce hallucinations, but it can also increase response times.
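The retrieval step is the heart of RAG. Below is a stripped-down sketch in which TF-IDF similarity stands in for a production vector database; the product documents are invented for illustration, and the final `generate()` call is a hypothetical stand-in for any LLM client.

```python
# Stripped-down RAG sketch: retrieve the most relevant product document
# for a user question, then pass it to the model as grounding context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical proprietary documentation store
product_docs = [
    "Model X100 blender: 600W motor, 1.5L jar, two-year warranty.",
    "Model Z200 kettle: 1.7L capacity, auto shutoff, one-year warranty.",
]

question = "What is the warranty on the X100 blender?"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(product_docs)
query_vector = vectorizer.transform([question])

# Retrieve the best-matching document from the proprietary store
best_doc = product_docs[cosine_similarity(query_vector, doc_vectors).argmax()]

# Augment the prompt with the retrieved context before generation
prompt = f"Answer using only this documentation:\n{best_doc}\n\nQuestion: {question}"
# answer = generate(prompt)  # hypothetical LLM call
```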
Fine-tuning means giving an AI model enough additional data to change some of its parameters. Fine-tuning permanently changes the behavior of a model, adapting it to a particular use case or context. It’s also faster and cheaper than training a brand-new model.
“If you have a neural network that has 100 different layers, training it would mean that you’re modifying all 100 layers,” explains Choie. “Fine-tuning would mean that you’re really changing the last few layers. You’re still modifying the model, but you don’t have to change it entirely because it’s already performing well.”
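A minimal PyTorch sketch of the idea Choie describes follows. The tiny stand-in network is an assumption for illustration: its frozen “body” plays the role of the pretrained layers, while only the final head is updated.

```python
# Minimal fine-tuning sketch: freeze the pretrained body of a model
# and update only the last layers on task-specific data. TinyNet is a
# stand-in; in practice the body would be a large pretrained network.
import torch
from torch import nn

class TinyNet(nn.Module):
    """Stand-in for a pretrained model: a deep body plus a task head."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 64))
        self.head = nn.Linear(64, 2)  # the "last few layers"

    def forward(self, x):
        return self.head(self.body(x))

model = TinyNet()

# Freeze the body; only the head's parameters will receive gradients.
for param in model.body.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```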
Fine-tuning requires a little more upfront investment than prompt engineering and RAG. It is useful for turning a smaller model into an expert in a specialized domain. For example, an insurance company can fine-tune a model to master the art of processing new claims.
Varshney likens a fine-tuned model to an intensively trained new hire fresh out of school. They might not have the breadth of knowledge that a genius polymath (or big, general-purpose AI model) has, but they are much better at processing claims than the polymath would be.
“It can’t do your taxes or write a legal contract,” Varshney says. “But if I ask it to process a claim, it would know how to do it right away.”
Using proprietary data in these ways can offer a significant competitive advantage by familiarizing AI models with an enterprise’s unique processes, products, customers and other nuances.
“If you have an AI whose main users are from a particular enterprise, it is important that the AI uses data from that same enterprise,” Choie says.
When AI models have access to proprietary data, they are grounded in a specific business context, which means their outputs are also grounded in that context.
“I can take an open-source AI model, fine-tune it with my own proprietary data, and that copy is uniquely mine,” Varshney says. “I own the IP behind it. I run it on my own infrastructure.”
As a result, these models can produce more accurate and effective outputs than unaugmented, off-the-shelf models pulling from a general body of public data.
Organizations can use many different types of AI models to achieve results. But open-source models—such as IBM Granite™ models, which are available under an Apache 2.0 license for broad, unencumbered commercial usage—offer certain benefits.
“When training an AI model, there are many different parameters and techniques you need to adjust to ensure that the model learns effectively and efficiently. You need specialized data scientists and machine learning experts to set that up,” Choie explains. “The benefit of fine-tuning open models is that we have these models that some brilliant people have already put their hands on. All we need to do is feed the models additional task-specific data and adjust a few layers, which is a much simpler task than building a model.”
In addition to letting organizations benefit from the wisdom of the crowd, open-source models can enable them to experiment without the cost of failure being too high. This experimentation, in turn, helps organizations pursue a multimodel strategy, using many different—and differently tuned—models for domain-specific tasks.
This multimodel strategy is considered a best practice. AI in Action 2024 found that 62% of AI Leaders use multiple models, compared to 32% of AI Learners.
“It's almost a no-brainer to use open-source models,” Choie says. “They’re cost-effective, you have some of the best people in the industry working on them, and whenever there are updates or issues, the community works on them together.”
Effective data management is one of the key characteristics that separates AI Leaders from other organizations, according to AI in Action 2024. 61% of AI Leaders believe in their ability to access and effectively manage organizational data to support AI initiatives, versus 11% of AI Learners.
But feeding proprietary data to AI models isn’t as simple as it sounds. Data silos, quality control and other issues can all stand in the way.
In broad terms, the solution is to implement an integrated data fabric that knocks down silos, ensures interoperability and orchestrates fluid data movement across platforms.
But what does that look like in practice? Here are a few key considerations:
The first hurdles to many AI efforts are collecting and storing data, processes that aren’t as simple as they might seem.
Capturing data in traditional databases often leads to data silos, which can prevent an organization from aggregating all the data it needs to build an effective RAG database or fine-tune models. According to the IBM® Data Differentiator, 82% of enterprises experience data silos that stymie their key workflows.
Organizations need to implement pipelines to retrieve data from disparate sources, prepare it for use and deposit it in an accessible, centralized store.
Retrieving and preparing data might involve the use of stream-processing tools, such as Apache Kafka, or ETL- and ELT-ready data integration tools, such as IBM DataStage (a minimal pipeline sketch follows the list below). Organizations must also choose the right repositories for data, which can include:
Data lakes, which offer low-cost storage environments designed to handle massive amounts of raw structured and unstructured data.
Data warehouses, which are built to support data analytics, business intelligence and data science efforts.
Data lakehouses, such as watsonx.data, which merge the capabilities of warehouses and lakes into a single data management solution.
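As referenced above, here is a minimal pipeline sketch under stated assumptions: it uses the kafka-python client, a hypothetical “support-tickets” topic, and a local Parquet file standing in for a centralized store such as a lake or lakehouse.

```python
# Minimal extract-transform-load sketch: pull records from a stream,
# lightly prepare them, and deposit them in a central store. The topic,
# server address and field names are hypothetical.
import json

import pandas as pd
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "support-tickets",                       # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,                # stop when the stream is idle
)

records = [msg.value for msg in consumer]    # extract from the stream

df = pd.DataFrame(records)
df = df.dropna(subset=["ticket_id"])         # light preparation step
df.to_parquet("lake/support_tickets.parquet")  # load into the central store
```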
A hybrid cloud infrastructure is also an important component of data integration efforts. Many enterprises today have data distributed across on-premises data stores and multiple cloud services.
“You need to make sure you can aggregate all of this information, no matter where it is, and feed it into your AI models,” Choie says. “If you're not doing hybrid, you're going to be missing out on something.”
Bad inputs lead to bad outputs. Organizations need to ensure the proprietary data they’re feeding to AI models is reliable and accurate.
“You need to figure out the gold in your data—the differentiator—so you can amplify that,” Varshney says. “You want to reduce the noise in the data, and you want to provide high-quality data to fine-tune on.”
Data must be cleaned before it is passed to an AI model; otherwise, noisy inputs can make the model perform worse.
Varshney offers the example of a call center ticket with a not-so-obvious solution: “People might try five different ways to fix it before they find the one that works. You can’t send that ticket directly to the model. It will be very noisy. It will contain all the things people tried. The model might get confused about which is the right outcome. You want to clean the noise so the model sees only the real solution.”
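A toy illustration of that cleanup follows. The ticket structure and “resolved” flag are assumptions made for the example, but they show the idea: keep only the step that actually solved the problem.

```python
# Toy sketch of the cleanup Varshney describes: strip the failed
# attempts from a ticket so the model sees only the real solution.
# The ticket fields below are hypothetical.
ticket = {
    "issue": "Customer cannot log in after password reset.",
    "steps": [
        {"action": "Cleared browser cache", "resolved": False},
        {"action": "Reset MFA token", "resolved": False},
        {"action": "Unlocked account in identity provider", "resolved": True},
    ],
}

# Keep only the step that actually worked
solution = next(s["action"] for s in ticket["steps"] if s["resolved"])

training_example = {
    "input": ticket["issue"],
    "output": solution,
}
print(training_example)
```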
Cleaning, preparing and curating datasets involves some manual work on the part of data scientists and analysts, whether in-house or from external partners. It also involves tools such as the following (a short pandas example appears after the list):
AI-enabled data management tools, which can automatically validate data, flag errors and convert data to the proper format.
Synthetic data generators, which can help fill in missing values and augment human-prepared assets with larger corpora.
Data preprocessing and engineering tools, such as Apache Spark and the pandas Python library.
Data observability tools, which track the flow of data over time, monitor usage and data lineage, and detect anomalies.
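As noted above, here is a short pandas example of routine preparation steps; the file and column names are illustrative, and real pipelines would add domain-specific checks.

```python
# Small pandas example: validate, deduplicate and normalize a
# proprietary dataset before it reaches a RAG store or fine-tuning job.
import pandas as pd

df = pd.read_csv("raw_tickets.csv")            # hypothetical raw export

df = df.drop_duplicates(subset="ticket_id")    # remove exact repeats
df = df.dropna(subset=["issue", "resolution"]) # require key fields
df["issue"] = df["issue"].str.strip().str.lower()  # normalize text
df = df[df["resolution"].str.len() > 20]       # drop low-signal rows

df.to_parquet("clean_tickets.parquet")         # ready for downstream use
```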
Whatever competitive advantages proprietary data can bring to generative AI, lasting strategic advantage comes from deploying the correct mix of technology and business processes.
“The workflow itself is where the money is,” Varshney explains. “The model is a commodity, and we will keep getting better and better models. What we really need to figure out is the right surgical blend of bringing traditional AI, automation and generative AI together in a workflow.”
In other words, organizations cannot drop generative AI—even a fine-tuned model fit to their specifications—into their processes and expect results. Rather, they must evaluate their processes and adapt their workflows to the models as much as they adapt their models to their workflows.
Consider the humble dishwasher.
“When we developed dishwashers, we did not expect them to stand up and wash the way we do over a sink,” Varshney says. “We changed the process so that the dishwasher could really excel at washing. We set the problem in the correct format. We need to do the same here. We need to reengineer processes and figure out the right blend of traditional AI and generative AI. Then, you start to unlock value.”