Generative AI, also known as gen AI, is artificial intelligence (AI) that can create text, images, video, audio and even software code in response to a user request. These days, organizations are lining up to build new generative AI apps, but they often overlook the steps needed to craft an effective data strategy that supports them.
Generative AI models—computer programs trained to make decisions in ways that resemble the human brain—require massive volumes of data to train on. And while organizations might have a brilliant idea for an application, if the underlying data isn’t handled properly, the application will fail.
From the cost of collecting and processing data, to the underlying infrastructure needed to store it securely, to the evolving requirements of data governance, it is important that organizations take a strategic approach to data if their applications are to succeed.
In 2022, the launch of ChatGPT ushered in a new era of innovation in generative AI, prompting organizations to look at ways to leverage the technology for business applications. ChatGPT is an AI chatbot, built on large language models (LLMs), that engages with users in a conversational way. Since its launch, organizations have sought to apply its underlying technology to various business problems, including automation, productivity gains and customer insights.
Various risks and challenges have become apparent as well. In the medical field, for example, while generative AI has helped automate certain diagnoses, it has also raised privacy and security concerns.1 Furthermore, a phenomenon known as AI hallucination persists, in which generative AI models ‘make up’ facts when they can’t find the answer to a question.
But while these—and other—problems persist, organizations of all sizes and across various industries have continued to invest heavily in the space, seeking new ways to leverage its power. According to Menlo Ventures, from 2022 to 2023, enterprise investment in generative AI increased sixfold, from USD 2.3 billion to USD 13.8 billion.
AI infrastructure is a term that describes the hardware and software solutions required to build AI applications. In the age of generative AI, AI infrastructure must evolve to meet the higher demands on compute resources, data storage capacity, bandwidth and more associated with the technology. But organizations are in such a hurry to deploy new generative AI applications that they sometimes overlook AI and data infrastructure needs.
As organizations seek to leverage generative AI and all its potential for business purposes, they must rethink key aspects of their approaches to data infrastructure and strategy.
To build a successful generative AI business application, organizations typically need a combination of structured and unstructured data. Structured data, also known as quantitative data, is data that has been previously formatted so it can be easily processed by the machine learning (ML) algorithms that power generative AI applications.
Advanced ML algorithms simulate the way humans learn, training on large amounts of data (datasets) until they can understand questions about the data and respond by creating new content.
While some data collected by enterprises is already structured (for example, customer and financial information such as names, dates and transaction amounts), a large amount is unstructured. Unstructured data, also known as qualitative data, is data that doesn’t have a predefined format. Unstructured data is wide-ranging and can include video, audio and text files from emails, web pages, social media accounts and Internet of Things (IoT) sensors.
As the digital economy expands, the amount of unstructured data collected by enterprises is growing at an exponential rate. According to Forbes, 80% to 90% of the data collected by enterprises is unstructured. In its raw state, unstructured data is unfit for ML purposes and must be transformed before it can be used to train an AI model.
Converting unstructured data into data that can be processed by a computer and used for business purposes involves extracting relevant information and organizing it into a predefined format. The volume and complexity of the data create challenges, and managing it while adhering to data governance laws can be costly.
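For illustration, here is a minimal Python sketch of that conversion: a few relevant fields are extracted from a hypothetical, unstructured customer email and placed into a predefined, structured record. The email text, field names and patterns are invented for the example; production pipelines typically rely on dedicated document-processing and NLP tooling.

```python
import re

# Hypothetical raw, unstructured text (for example, the body of a customer email)
raw_email = """
Hi team, customer Jane Doe reported an issue with order #48213 on 2024-03-15.
She was charged $129.99 twice and would like a refund.
"""

def extract_record(text: str) -> dict:
    """Pull a few relevant fields out of free text and place them in a
    predefined, structured format suitable for downstream ML use."""
    order = re.search(r"#(\d+)", text)
    date = re.search(r"\d{4}-\d{2}-\d{2}", text)
    amount = re.search(r"\$(\d+\.\d{2})", text)
    return {
        "order_id": order.group(1) if order else None,
        "date": date.group(0) if date else None,
        "amount_usd": float(amount.group(1)) if amount else None,
        "source": "email",
    }

print(extract_record(raw_email))
# {'order_id': '48213', 'date': '2024-03-15', 'amount_usd': 129.99, 'source': 'email'}
```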
Data governance is the practice of helping ensure the quality, security and availability of data that belongs to an organization through sets of policies and procedures. The rise of generative AI and big data has brought data governance and all its requirements to the forefront of the modern enterprise.
Generative AI, with its capacity to create new content based on the data it has been trained on, creates new demands for the safe and lawful collection, storage and processing of data.
Quality
Because generative AI models are trained on massive datasets, the data within those sets must be of the highest quality, and its integrity must be unquestionable. Data governance plays an important role in helping ensure that the datasets generative AI models train on are accurate and complete, a key component in generating answers that can be relied upon.
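As a simple illustration of what such checks can look like in practice, the Python sketch below profiles a hypothetical training table for completeness, uniqueness and validity. The column names, values and rules are assumptions made for the example, not a prescribed governance standard.

```python
import pandas as pd

# Hypothetical training records; columns and rules are illustrative assumptions.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104, 105],
    "transaction_amount": [250.0, 75.5, 75.5, None, -10.0],
    "transaction_date": ["2024-01-05", "2024-01-06", "2024-01-06",
                         "2024-01-07", "2024-01-08"],
})

quality_report = {
    "rows": len(df),
    "missing_values": int(df.isna().sum().sum()),                    # completeness
    "duplicate_rows": int(df.duplicated().sum()),                    # uniqueness
    "negative_amounts": int((df["transaction_amount"] < 0).sum()),   # validity
}

print(quality_report)
# A governance process might block training until every count except 'rows' is zero.
```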
Compliance
Depending on industry and location, generative AI business applications face a rigorous compliance environment in how data can be used. GDPR (General Data Protection Regulation) rules, for example, govern how data belonging to EU residents can be used by organizations. Violations carry heavy fines and penalties when customer information is compromised in any way.
In 2021, Google and other companies were fined over a billion dollars for violating data protection rules stipulated in the GDPR.
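One common technical control in this area is pseudonymization of personal data before it is used for analytics or model training. The sketch below is illustrative only: the record and the list of personal fields are invented, and pseudonymized data can still fall within the scope of the GDPR, so this is one safeguard among many rather than a compliance guarantee.

```python
import hashlib

# Hypothetical customer record; field names are illustrative assumptions.
record = {"name": "Anna Schmidt", "email": "anna@example.com",
          "country": "DE", "purchase_total": 89.0}

PERSONAL_FIELDS = {"name", "email"}  # fields a compliance team might flag as personal data

def pseudonymize(rec: dict) -> dict:
    """Replace direct identifiers with irreversible hashes so the record can be
    used for model training with less exposure of personal data."""
    out = {}
    for key, value in rec.items():
        if key in PERSONAL_FIELDS:
            out[key] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            out[key] = value
    return out

print(pseudonymize(record))
```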
Transparency
For a generative AI application to be effective, the origin of its data and how the data has been transformed for business use must be clearly established and visible. Data governance helps ensure that documentation exists—and is transparent to users—at every step of the data lifecycle, from collection, through storage, processing and output, so users understand how an answer was generated.
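In practice, that documentation often takes the form of lineage metadata captured at each lifecycle step. The following minimal sketch, with invented dataset names and step descriptions, shows the general idea of an auditable record that accompanies the data.

```python
from datetime import datetime, timezone

# A minimal, illustrative lineage log: each entry records what happened to a
# dataset and when, so users can later trace how an answer was produced.
lineage: list[dict] = []

def log_step(dataset: str, step: str, detail: str) -> None:
    lineage.append({
        "dataset": dataset,
        "step": step,           # collection, storage, processing or output
        "detail": detail,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

log_step("support_emails_v1", "collection", "Exported from CRM; consent flag checked")
log_step("support_emails_v1", "processing", "PII pseudonymized; converted to structured records")
log_step("support_emails_v1", "output", "Used to fine-tune customer-support model v0.3")

for entry in lineage:
    print(entry)
```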
The success of generative AI applications depends on having the right data strategy and infrastructure in place to support it. Here are some best practices to help ensure success.
Due to the nature of unstructured data—where it comes from, how it's collected and stored—organizations tend to collect a lot of it.
But that doesn’t mean it’s all going to be useful to a generative AI application. “Start with a question,” advises Margaret Graves, Senior Fellow at The IBM Center for the Business of Government. “It doesn’t have to be just one question, it can be a few, but try to focus on specific ways the application you want to build is going to advance and support your mission.”
Since the debut of ChatGPT in 2022, enterprises have been in a rush to apply generative AI to a range of business problems, including increasing productivity, identifying insights and speeding digital transformation. While these are certainly areas the technology can address, they are also broad and might lead to an organization building an app that lacks specificity.
The more specific the business problem, the easier it is to identify the relevant datasets you’ll need to train your generative AI model on and the kind of AI infrastructure you’ll need to support the process.
Once an organization has decided which business questions it wants to focus a generative AI application on, it can start to look at the datasets relevant to training its AI models. Graves likens this part of the process to looking at a spectrum. “On one end,” she says, “you’ve got highly confidential, proprietary internal data that you need to train your model on. On the other, you’ve got more general data that isn’t proprietary but will help your application perform better.”
The world of requests for proposal (RFPs) is a good example, as it is one of the most compelling business use cases of generative AI to emerge in the last few years. A B2B enterprise looking to build a generative AI application to help automate aspects of its RFP process would need to train on internal data or it wouldn’t be able to present the business’s unique capabilities. But that same generative AI model would also need to train on more general data, such as how to craft a sentence and structure its answers grammatically, or its responses wouldn’t make sense.
“Both of these aspects need to be brought together in your data strategy—broad, general datasets and more proprietary, internal datasets as well,” Graves says. “Otherwise, you’re just building a tool and throwing a lot of data at it and seeing what happens, which is a waste of money and time.”
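One common pattern for bringing those two kinds of data together is retrieval-augmented generation, in which proprietary internal documents are retrieved at query time and passed to a general-purpose model that supplies the language ability. The sketch below is purely illustrative: the documents are invented and call_general_llm() is a placeholder, not a specific vendor API.

```python
# Illustrative retrieval-augmented generation sketch pairing a general-purpose
# model with proprietary internal data. All names here are hypothetical.

internal_docs = {
    "capabilities.md": "We support 24/7 onboarding and ISO 27001-certified hosting.",
    "pricing.md": "Enterprise tier includes dedicated support and custom SLAs.",
}

def retrieve(question: str, docs: dict, k: int = 1) -> list[str]:
    """Rank internal documents by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(docs.items(), key=lambda kv: -len(q_words & set(kv[1].lower().split())))
    return [text for _, text in scored[:k]]

def call_general_llm(prompt: str) -> str:
    """Placeholder for a general-purpose model that supplies language ability."""
    return f"[model answer based on a prompt of {len(prompt)} characters]"

question = "What support does the enterprise tier include?"
context = "\n".join(retrieve(question, internal_docs))
answer = call_general_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer)
```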
Using domain-specific data, data that’s relevant to a specific industry or field, can help businesses create AI models that are more focused on their particular business need. “There’s an emphasis on domain-specific data right now when it comes to training AI models, for example in the finance or HR fields,” says Jason Prow, Senior Partner at IBM Consulting. “With all the data that’s out there, organizing your model around a specific domain is becoming critical.”
Leveraging domain data in the creation of AI models helps tailor the models in ways that can make them more applicable to a specific business need. Domain-specific models tend to be more accurate and relevant to user needs and can lead to better overall performance of the associated generative AI applications.
Domain-specific data can be technical and complex, so organizations seeking to leverage it need to consider adding a “semantic layer,” a layer of abstraction in their AI models, to help translate it. “The pharmaceutical industry in particular does a lot of semantic description,” says Anthony Vachino, Associate Partner, IBM Consulting. “Different companies do different trials, and the semantic layer describes it in ways that can help make the research applicable to other companies so they don’t have to replicate it.”
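Conceptually, a semantic layer can be as simple as a mapping from domain-specific terms to plain-language descriptions that travels with the data. The sketch below uses invented pharmaceutical-style abbreviations purely for illustration; a real semantic layer would typically live in a catalog or ontology rather than a hard-coded dictionary.

```python
# Illustrative "semantic layer": a mapping that translates domain-specific terms
# into plain descriptions before the data is shared across teams or fed to a model.
# The terms and descriptions are invented for illustration, not real trial data.

semantic_layer = {
    "AE": "adverse event observed during a clinical trial",
    "ITT": "intention-to-treat analysis population",
    "PK": "pharmacokinetics: how the body absorbs and eliminates a compound",
}

def annotate(text: str, layer: dict) -> str:
    """Expand domain abbreviations in place so downstream users (or models)
    do not need specialist knowledge to interpret the record."""
    for term, description in layer.items():
        text = text.replace(term, f"{term} ({description})")
    return text

print(annotate("ITT results showed two AE cases; PK profile unchanged.", semantic_layer))
```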
Whether preparing for geopolitical shifts that can disrupt supply chains or natural disasters that threaten critical infrastructure, modern data leaders are starting to consider more than just talent and cost when choosing where they store and access data. According to the IBM Institute for Business Value, 60% of government leaders believe the frequency of supply chain and infrastructure shocks will increase in the future, while 70% believe that they will increase in intensity.
Different regions have different advantages, and factors such as talent, data ecosystems, infrastructure, governance and geopolitics all need to be considered. Executives are taking note: Last year, according to the same IBV report, nearly 70% of executives surveyed said they expected AI to change where they located key resources; this year, that percentage jumped to 96%.
Dan Chenok, Executive Director of the IBM Center for the Business of Government, is interested in the potential of using distributed data in training generative AI models because it allows for data to be stored and accessed in more than one location. “Distributed data allows you to train the model on data that's sitting in multiple locations,” he says, “while security and regulations are maintained through access control.”
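A highly simplified sketch of that idea follows: training data remains partitioned across locations, and a basic access-control check decides which partitions a given job may read. The locations, policy and records are illustrative assumptions; real deployments would enforce this through identity management and region-aware data services.

```python
# Minimal sketch: data stays in more than one location, and an access-control
# check governs which partitions a training job may read. All names are invented.

data_locations = {
    "eu-datacenter": {"region": "EU", "records": ["eu_record_1", "eu_record_2"]},
    "us-cloud": {"region": "US", "records": ["us_record_1"]},
}

access_policy = {"training-job-42": {"EU", "US"}}  # regions this job is cleared for

def gather_training_data(job_id: str) -> list[str]:
    allowed = access_policy.get(job_id, set())
    batch = []
    for name, store in data_locations.items():
        if store["region"] in allowed:        # enforce the security/regulatory boundary
            batch.extend(store["records"])    # in practice, data might never leave its region
    return batch

print(gather_training_data("training-job-42"))
```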
Modern, hybrid solutions help organizations build AI models that are better suited to solving specific business problems, saving money, time and other critical resources. “When you integrate across multiple platforms, you can provide better services, especially if you’re an enterprise working in multiple locations,” Chenok adds. “And the best solutions help you reconcile it all so your application will perform.”
Open, hybrid data lakehouses give users the ability to share data across both cloud and on-premises infrastructure—wherever data resides—so it can be accessed by generative AI applications. Data lakehouses are platforms that merge aspects of data warehouses and data lakes into a single, unified data management solution.
Data lakes are low-cost data storage solutions built to handle massive amounts of structured and unstructured data, and data warehouses are systems that collect data from multiple sources into a single location so it can be analyzed. While not as scalable as lakes or warehouses, data lakehouses tend to be more streamlined, higher performing and capable of supporting a wider range of workloads.
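To make the lakehouse idea concrete, the sketch below uses DuckDB and two local Parquet files standing in for on-premises and cloud storage, then joins them with a single SQL query. It assumes the duckdb, pandas and pyarrow packages are installed; an actual lakehouse deployment would involve object storage, catalogs and governance layers rather than temporary folders.

```python
# Self-contained illustration of the lakehouse idea: one SQL engine querying
# open-format (Parquet) files regardless of where they sit. Here both "locations"
# are temporary local folders standing in for on-prem and cloud object storage.
import os
import tempfile

import duckdb
import pandas as pd  # writing Parquet also requires pyarrow

tmp = tempfile.mkdtemp()
on_prem = os.path.join(tmp, "orders.parquet")   # stand-in for an on-prem warehouse table
cloud = os.path.join(tmp, "reviews.parquet")    # stand-in for a cloud object-store path

pd.DataFrame({"customer_id": [1, 2], "amount": [120.0, 80.0]}).to_parquet(on_prem)
pd.DataFrame({"customer_id": [1, 1, 2], "stars": [5, 4, 3]}).to_parquet(cloud)

print(duckdb.sql(f"""
    SELECT o.customer_id, o.amount, AVG(r.stars) AS avg_stars
    FROM read_parquet('{on_prem}') AS o
    JOIN read_parquet('{cloud}') AS r USING (customer_id)
    GROUP BY o.customer_id, o.amount
""").df())
```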
For enterprises in search of a more comprehensive solution, platforms such as Databricks, Snowflake and Amazon Redshift are becoming more popular because of the complexity of preparing data for generative AI and of developing and deploying the applications. Comprehensive solutions help with data management, model training and solution deployment, allowing organizations to launch a generative AI application with built-in scalability and governance for various use cases.
IBM watsonx.data is a fit-for-purpose data store built on an open data lakehouse that increases the scalability of generative AI workloads. The open, hybrid, fit-for-purpose approach improves integration with different kinds of databases, enabling enterprises to leverage data that’s spread across different ecosystems and environments, and not get locked into a single region or set of rules.
1. Generative AI in Medical Practice: In-Depth Exploration of Privacy and Security Challenges, National Institutes of Health, March 2024