Conquering the 3 core challenges of unstructured data

Authors

SVP

IBM Software

Staff Writer

IBM Think

Trusted data is critical for helping enterprises succeed in their generative AI initiatives. Yet many enterprises struggle to harness what could be a powerful source of insights: unstructured data. Some 90% of data produced by enterprises is unstructured, with valuable information stored in emails, PDF documents, video files and other formats.¹

The good news is that evolving solutions and approaches can empower enterprises to organize, access and derive intelligence from their unstructured data. Think contributor Alice Gomstyn sat down with Dinesh Nirmal, the senior vice president of IBM Software, to discuss how enterprises can unlock the potential of data troves once considered beyond their reach.

Gomstyn: What challenges do organizations face when it comes to using their unstructured data?

Nirmal: There are three core challenges with unstructured data. Scalability is one. How do you scale it and how do you govern it? Two, how do you make sure there is generative AI performance and accuracy associated with it? And the third one is all around how to correlate unstructured and structured data to derive value from that data.

Gomstyn: Can you elaborate on the scalability challenge and what it takes to address it?

Nirmal: Unstructured data is more complex in the sense that a document could have hundreds of fields, and some of them might be masked fields or secure fields. When you ingest those documents, it becomes critical that the ingestion is governed and that the data is stored in a governed store such as a data lakehouse.

You also need governance in your data pipeline. How do you bring observability and monitoring into it? If there's a drift in that pipeline or a change in that pipeline, how do you quickly identify it and resolve it? These pipelines could be complex and long, and you want to make sure that you are getting the correct results, execution time, performance and accuracy throughout. You need tools to make sure you can build, govern and observe pipelines.
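One simple way to picture the drift check described above is a statistical comparison between a known-good baseline and each new batch. The sketch below is illustrative only: the function name `drifted`, the choice of metric (tokens extracted per document) and the threshold of three standard deviations are all assumptions, not a description of any IBM tool.

```python
from statistics import mean, stdev
from typing import Sequence


def drifted(baseline: Sequence[float], batch: Sequence[float],
            threshold: float = 3.0) -> bool:
    """Flag drift when the batch mean is more than `threshold` baseline
    standard deviations away from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        # A constant baseline: any change in the mean counts as drift.
        return mean(batch) != mu
    return abs(mean(batch) - mu) / sigma > threshold


# Hypothetical metric: tokens extracted per ingested document.
baseline_tokens = [980, 1010, 1005, 995, 1020, 990]
healthy_batch = [1000, 985, 1015]
broken_batch = [120, 95, 140]  # e.g. a parser change that truncates documents

print(drifted(baseline_tokens, healthy_batch))
print(drifted(baseline_tokens, broken_batch))
```

A production pipeline would likely track several such metrics per stage (row counts, null rates, latency) and alert on any of them, but the comparison against a baseline is the core idea.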

For enterprises, it’s also about security. Data security becomes a critical element to make sure that they don’t lose that data. We have data security tools to make sure the data is encrypted. So, as you scale, you want to make sure that the governance and the security that you have on the structured side also comes to the unstructured side.

Gomstyn: What about the second core challenge: achieving generative AI model performance?

Nirmal: There’s a huge opportunity there because generative AI can only be successful if we can give governed, trusted data to these models for training and prompting.

Governance tools also enable access to data. Using governance tools like data catalogs, I can make unstructured data available to my data scientists and prompt engineers so they can prompt tune their models using the unstructured data.

Governance and innovation go hand in hand. If you’re really innovating to provide self-service of data, then governance needs to be in place for you to do the self-service. From a data products perspective, making that data self-service available is the first element you have to prioritize.

Gomstyn: How do you navigate the third challenge of correlating structured and unstructured data?

Nirmal: The current landscape is that if you have unstructured data in the form of a document, you must divide or subdivide the document into multiple pieces and store it as embeddings within a vector database.

The challenge that happens is that you lose accuracy because you don't know where you’re chunking the data. Let's say you chunked or cut off in the middle of a table. When you bring the table back, you're bringing half of the table, and you have lost the accuracy of it.
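The table problem can be reproduced in a few lines. The sketch below compares naive fixed-size chunking with a chunker that only cuts at blank lines, so a table block is never split. The sample document, the 50-character chunk size and both function names are made up for illustration; real chunkers typically work on tokens and parsed layout rather than raw characters.

```python
from typing import List


def fixed_size_chunks(text: str, size: int) -> List[str]:
    """Naive chunking: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def block_aware_chunks(text: str, size: int) -> List[str]:
    """Pack whole blocks (separated by blank lines) into chunks of up to
    `size` characters. A block is never split, so an oversized block
    simply becomes its own chunk."""
    chunks: List[str] = []
    current = ""
    for block in text.split("\n\n"):
        candidate = block if not current else current + "\n\n" + block
        if current and len(candidate) > size:
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


TABLE = "item | amount\nplan | 40.00\ndata | 15.00\ntax | 5.50"
DOC = "Your bill for March.\n\n" + TABLE + "\n\nPayment is due April 15."

naive = fixed_size_chunks(DOC, 50)
aware = block_aware_chunks(DOC, 50)

# The naive chunker cuts through the table; the block-aware one keeps it whole.
print(any(TABLE in chunk for chunk in naive))
print(any(TABLE in chunk for chunk in aware))
```

With the naive approach, a retriever that returns one chunk brings back only part of the table, which is the accuracy loss described above.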

What can we do? We not only store the data in a vector DB, but we also take the transactional aspects of that document and put them into a transactional database. And when you have a natural language query, you compare both sides to say, how do I bring the data together to get better accuracy and performance for that? That's where RAG SQL or Graph RAG come in: you can use them to get a higher level of accuracy. That's the whole point of making sure that you’re correlating the data between the transactional database and what you have on a vector DB.
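As a rough illustration of correlating the two stores, the sketch below keeps passages in a toy in-memory "vector store" and the structured line items in SQLite, then joins them on a shared document ID at query time. Everything here is an assumption made for the example: the bag-of-words `embed` function stands in for a real embedding model, the `line_items` schema and bill data are invented, and this is not an implementation of RAG SQL or Graph RAG.

```python
import math
import sqlite3
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# "Vector store": each passage is kept with the ID of its source document.
CHUNKS = [
    ("bill-001", "Your March bill includes a roaming charge for travel abroad."),
    ("bill-002", "Your April bill reflects the standard monthly plan."),
]
VECTOR_STORE = [(doc_id, text, embed(text)) for doc_id, text in CHUNKS]

# Transactional store: structured values extracted from the same documents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE line_items (doc_id TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO line_items VALUES (?, ?, ?)",
    [("bill-001", "plan", 40.0), ("bill-001", "roaming", 25.0),
     ("bill-002", "plan", 40.0)],
)


def retrieve(question: str) -> dict:
    """Find the most similar passage, then pull the exact figures for the
    same document from the transactional store."""
    q = embed(question)
    doc_id, text, _ = max(VECTOR_STORE, key=lambda row: cosine(q, row[2]))
    rows = conn.execute(
        "SELECT item, amount FROM line_items WHERE doc_id = ? ORDER BY item",
        (doc_id,),
    ).fetchall()
    return {"doc_id": doc_id, "passage": text, "line_items": rows}


result = retrieve("Why is there a roaming charge on my bill?")
print(result)
```

The passage supplies the narrative context while the SQL rows supply complete, exact figures, so an answer no longer depends on a table surviving the chunking step intact.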

Gomstyn: What are the most critical skills and competencies that IT leaders must develop to effectively manage unstructured data?

Nirmal: Data engineering is the most important piece in the unstructured data side of things. On the structured side, data engineering is a well-organized discipline, but on the unstructured side, it hasn't really taken off because there's a tremendous amount of data.

But now, governance, security and all those things are coming into the unstructured side of things. We need data engineers to literally engineer the data, to make it available as data pipelines. We need them to create data products for unstructured data and make self-service available for every data scientist and every engineer. The skills that data engineers use on the structured data side can be used on the unstructured side; they’ll just be applied at a much, much bigger scale.

Gomstyn: How do you measure the success of unstructured data pilot projects?

Nirmal: The real return on investment comes when there’s value to the end user at the enterprise. So, for example, I call my phone company, and a customer rep is on the line. When I ask a question, they must look up the answer before giving it to me.

Now, using generative AI, I can do it online. I can just go ask a simple question to an assistant or a chatbot, which can access an unstructured data format like a bill document. Within 15 seconds, I have an answer that summarizes my bill or something about my account. Look at the time I saved. I didn't need to spend 15 minutes waiting on a call for somebody to answer. I just have it at my fingertips. Generative AI has enabled that for me as an end user.

It's all about the productivity, time savings and optimization that generative AI is driving, especially on the unstructured data side of things.

This interview was edited and condensed for clarity and length.
