Conquering the 3 core challenges of unstructured data

Authors

Dinesh Nirmal

SVP

IBM Software

Alice Gomstyn

Staff Writer

IBM Think

Trusted data is critical for helping enterprises succeed in their generative AI initiatives. Yet many enterprises struggle to harness what could be a powerful source of insights: unstructured data. Some 90% of data produced by enterprises is unstructured, with valuable information stored in emails, PDF documents, video files and other formats.1

The good news is that evolving solutions and approaches can empower enterprises to organize, access and derive intelligence from their unstructured data. Think contributor Alice Gomstyn sat down with Dinesh Nirmal, the senior vice president of IBM Software, to discuss how enterprises can unlock the potential of data troves once considered beyond their reach.

Gomstyn: What challenges do organizations face when it comes to using their unstructured data?

Nirmal: There are three core challenges with unstructured data. Scalability is one. How do you scale it and how do you govern it? Two, how do you make sure there is generative AI performance and accuracy associated with it? And the third one is all around how to correlate unstructured and structured data together to derive value from that data.

Gomstyn: Can you elaborate on the scalability challenge and what it takes to address it?

Nirmal: Unstructured data is more complex in the sense that it could have hundreds of fields and some of them might be masked fields or secure fields. When you ingest those documents, it becomes critical that it’s a governed ingestion and that data is stored in a governed store such as a data lakehouse.
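
To make the idea of governed ingestion concrete, here is a minimal Python sketch that hides sensitive fields before a record is stored. It is an illustration only, not a description of any IBM product: the field names in `SENSITIVE_FIELDS` and the `mask_record` helper are hypothetical, and a real deployment would take its policy from a governance catalog and use a proper tokenization or encryption service rather than a bare hash.

```python
import hashlib

# Hypothetical policy: which fields must never be stored in the clear.
SENSITIVE_FIELDS = {"ssn", "account_number"}


def mask_record(record, sensitive=SENSITIVE_FIELDS):
    """Return a copy of `record` with sensitive values replaced by a short token."""
    masked = {}
    for key, value in record.items():
        if key in sensitive:
            # Replace the value with a truncated SHA-256 digest so the same
            # input always maps to the same token, without storing the input.
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            masked[key] = "masked:" + digest[:12]
        else:
            masked[key] = value
    return masked


incoming = {"name": "Pat", "ssn": "123-45-6789", "note": "Asked about roaming fees"}
stored = mask_record(incoming)
print(stored)
```

Applying the mask at ingestion, rather than at query time, means the store itself never holds the raw values.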

You also need governance in your data pipeline. How do you bring observability and monitoring into it? If there's a drift in that pipeline or a change in that pipeline, how do you quickly identify it and resolve it? These pipelines could be complex and long, and you want to make sure that you are getting the correct results, execution time, performance and accuracy throughout. You need tools to make sure you can build, govern and observe pipelines.
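
One simple way to picture pipeline drift detection is a statistical check on a metric the pipeline already emits. The Python sketch below assumes the metric is document length per batch and flags a batch whose mean sits more than a chosen number of baseline standard deviations from the baseline mean. The function name `detect_drift`, the threshold of 3.0 and the sample numbers are all assumptions made for illustration; production observability tools use richer tests.

```python
from statistics import mean, stdev


def detect_drift(baseline, current, threshold=3.0):
    """Return True when the mean of `current` is more than `threshold`
    baseline standard deviations away from the baseline mean."""
    base_mean = mean(baseline)
    base_std = stdev(baseline)
    if base_std == 0:
        # A constant baseline: any change in the mean counts as drift.
        return mean(current) != base_mean
    z_score = abs(mean(current) - base_mean) / base_std
    return z_score > threshold


# Hypothetical metric: characters per ingested document.
baseline_doc_lengths = [980, 1010, 1005, 995, 1020, 990]
todays_doc_lengths = [400, 420, 390, 410]

if detect_drift(baseline_doc_lengths, todays_doc_lengths):
    print("Drift detected: investigate the upstream source or parser")
```

A check like this can run after every pipeline stage, so a change is caught close to where it was introduced.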

For enterprises, it’s also about security. Data security becomes a critical element to make sure that they don’t lose that data. We have data security tools to make sure the data is encrypted. So, as you scale, you want to make sure that the governance and the security that you have on the structured side also come to the unstructured side.

Gomstyn: What about the second core challenge: achieving generative AI model performance?

Nirmal: There’s a huge opportunity there because generative AI can only be successful if we can give governed, trusted data to these models for training and prompting.

Governance tools also enable access to data. Using governance tools like data catalogs, I can make unstructured data available to my data scientists and prompt engineers so they can prompt tune their models using the unstructured data.

Governance and innovation go hand in hand. If you’re really innovating to provide self-service of data, then governance needs to be in place for you to do the self-service. From a data products perspective, making that data self-service available is the first element you have to prioritize.

Gomstyn: How do you navigate the third challenge of correlating structured and unstructured data?

Nirmal: The current landscape is that if you have unstructured data in the form of a document, you must divide or subdivide the document into multiple pieces and store it as embeddings within a vector database.

The challenge that happens is that you lose accuracy because you don't know where you’re chunking the data. Let's say you chunked or cut off in the middle of a table. When you bring the table back, you're bringing half of the table, and you have lost the accuracy of it.
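
The table problem can be illustrated with a small Python sketch of structure-aware chunking. It assumes a plain-text document in which table rows start with `|` and blocks are separated by blank lines; both conventions, and the helper names `split_blocks` and `chunk`, are hypothetical. The chunker packs whole blocks into chunks and never cuts inside a table, even when the table is larger than the size limit.

```python
def split_blocks(text):
    """Group lines into blocks. Consecutive lines starting with '|' form one
    table block; a blank line or a table/non-table switch ends a block."""
    blocks, current, in_table = [], [], False
    for line in text.splitlines():
        if not line.strip():
            if current:
                blocks.append("\n".join(current))
                current = []
            in_table = False
            continue
        is_table = line.lstrip().startswith("|")
        if current and is_table != in_table:
            blocks.append("\n".join(current))
            current = []
        current.append(line)
        in_table = is_table
    if current:
        blocks.append("\n".join(current))
    return blocks


def chunk(text, max_chars=200):
    """Pack whole blocks into chunks of at most `max_chars` characters.
    A block longer than the limit becomes its own chunk instead of being cut."""
    chunks, buffer = [], ""
    for block in split_blocks(text):
        candidate = block if not buffer else buffer + "\n\n" + block
        if buffer and len(candidate) > max_chars:
            chunks.append(buffer)
            buffer = block
        else:
            buffer = candidate
    if buffer:
        chunks.append(buffer)
    return chunks


table = "| item | amount |\n| data | 10 |\n| voice | 20 |"
document = "Intro paragraph about the bill.\n\n" + table + "\n\nClosing note."
for piece in chunk(document, max_chars=40):
    print(repr(piece))
```

The trade-off is that chunk sizes become uneven, which is usually acceptable when the alternative is retrieving half a table.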

What can we do? We not only store the data in a vector DB, but we also take the transactional aspects of that document and put them into a transactional database. And when you have a natural language query, you compare both sides to say, how do I bring the data together to get better accuracy and performance? That's where RAG SQL or Graph RAG come in: you can use them to get a higher level of accuracy. That's the whole point of making sure that you’re correlating the data between the transactional database and what you have on a vector DB.
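
The correlation between the two stores can be sketched in a few lines of Python. This is a toy under stated assumptions, not the RAG SQL or Graph RAG implementations mentioned above: `embed` is a bag-of-words stand-in for a real embedding model, the "vector store" is an in-memory list, the transactional side is an in-memory SQLite table, and the `doc_id` key, table name and sample bills are invented for illustration. The point it shows is that the passage is found by similarity while the tabular data is fetched whole through a shared document ID.

```python
import math
import sqlite3
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# "Vector store": each chunk keeps the ID of the document it came from.
passages = [
    ("bill-001", "your march bill includes a roaming charge"),
    ("bill-002", "your april bill includes a late payment fee"),
]
vector_store = [(doc_id, text, embed(text)) for doc_id, text in passages]

# "Transactional store": the tabular part of each document, kept intact.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE line_items (doc_id TEXT, item TEXT, amount REAL)")
db.executemany(
    "INSERT INTO line_items VALUES (?, ?, ?)",
    [
        ("bill-001", "plan", 40.0),
        ("bill-001", "roaming", 12.5),
        ("bill-002", "plan", 40.0),
        ("bill-002", "late fee", 5.0),
    ],
)


def answer_context(question):
    """Find the closest passage, then join to the structured rows by doc_id."""
    query = embed(question)
    doc_id, text, _ = max(vector_store, key=lambda row: cosine(query, row[2]))
    rows = db.execute(
        "SELECT item, amount FROM line_items WHERE doc_id = ? ORDER BY item",
        (doc_id,),
    ).fetchall()
    return {"doc_id": doc_id, "passage": text, "line_items": rows}


print(answer_context("why is there a roaming charge on my bill"))
```

Both the passage and the complete line items would then be passed to the model as context, so the answer does not depend on a table surviving the chunking step.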

Gomstyn: What are the most critical skills and competencies that IT leaders must develop to effectively manage unstructured data?

Nirmal: Data engineering is the most important piece in the unstructured data side of things. On the structured side, data engineering is a well-organized discipline, but on the unstructured side, it hasn't really taken off because there's a tremendous amount of data.

But now, governance, security and all those things are coming into the unstructured side of things. We need data engineers to literally engineer the data, to make it available as data pipelines. We need them to create data products for unstructured data and make self-service available for every data scientist and every engineer. The skills that data engineers use on the structured data side can be used on the unstructured side; they’ll just be applied at a much, much bigger scale.

Gomstyn: How do you measure the success of unstructured data pilot projects?

Nirmal: The real return on investment comes when there’s value to the end user at the enterprise. So, for example, I call my phone company, and a customer rep is on the line. When I ask a question, they must look up the answer before giving it to me.

Now, using generative AI, I can do it online. I can just go ask a simple question to an assistant or a chatbot, which can access an unstructured data format like a bill document. Within 15 seconds, I have an answer that summarizes my bill or something about my account. Look at the time I saved. I didn't need to spend 15 minutes waiting on a call for somebody to answer. I just have it at my fingertips. Generative AI has enabled that for me as an end user.

It's all about the productivity, time savings and optimization that generative AI is driving, especially on the unstructured data side of things.

This interview was edited and condensed for clarity and length.
