IBM watsonx.data plus Unstructured: Turning unstructured data into AI-ready fuel

Authors

Edward Calvesbert

Vice President, Product Management - watsonx.data

IBM

Brian Raymond

Founder and CEO of Unstructured

We’re excited to announce a new partnership between IBM and Unstructured, an IBM Ventures portfolio company. Together, we’re addressing one of the most significant barriers to scaling enterprise AI: the preparation of unstructured data for generative AI.

The unstructured data challenge

Approximately 80% of enterprise data is unstructured, residing in PDFs, emails, collaboration platforms and document repositories. Yet less than 1% of this data is in a format directly suitable for AI consumption. This gap represents both a massive opportunity and a critical challenge for organizations scaling AI initiatives.

Traditional approaches to unstructured data preparation are holding enterprises back. Manual pipelines require 6-12 months to build and remain brittle, breaking with each new document format or source system change. Engineering teams spend valuable time on data plumbing rather than AI innovation. Without proper structure and consistency, AI models deliver unreliable results, undermining trust and delaying time-to-value.

IBM watsonx.data addresses this challenge as the industry’s only hybrid, open data lakehouse built for AI and analytics. It simplifies access, preparation and governance across both structured and unstructured data, helping organizations establish a trusted data foundation for generative AI at scale.

The watsonx.data “Unstructured” advantage

Through this partnership, Unstructured extends the power of watsonx.data to access and transform unstructured data into AI-ready formats to fuel reliable, scalable and trusted generative AI.

Comprehensive connectivity and format support

Unstructured provides more than 30 pre-built connectors to enterprise data sources including SharePoint, Google Drive, Salesforce, Confluence, Box and Dropbox. With support for over 70 file types (from PDFs with complex layouts to scanned images, emails and Microsoft Office documents), organizations can access and transform their complete data estate.

Unlike basic text extraction tools, Unstructured’s intelligent document understanding preserves critical elements such as tables, hierarchies and semantic structure, ensuring AI models receive contextually rich data rather than just raw text.

Accelerated pipeline development

A no-code visual workflow builder empowers business and data teams to design and manage data pipelines without requiring specialized engineering resources. For organizations with development teams, a comprehensive API provides programmatic control and customization options.

Automatic incremental synchronization processes ingest only new and changed documents, reducing compute costs and keeping AI applications current. Multi-source orchestration coordinates data flows across multiple systems simultaneously, eliminating manual coordination overhead.
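
To make the idea of incremental synchronization concrete, here is a minimal Python sketch of one common way to detect new and changed documents: fingerprint each document's bytes and compare against the fingerprints recorded on the previous run. This is an illustration only, not Unstructured's or watsonx.data's implementation; the function names, the in-memory state dictionary and the sample documents are all hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a stable SHA-256 digest of a document's raw bytes."""
    return hashlib.sha256(content).hexdigest()


def plan_incremental_sync(
    source_docs: dict[str, bytes],
    seen: dict[str, str],
) -> tuple[list[str], dict[str, str]]:
    """Pick the documents that need (re)processing.

    source_docs maps a document ID to its current bytes; seen maps a
    document ID to the fingerprint recorded on the previous run.
    Returns the IDs to process and the updated fingerprint map.
    (Hypothetical helper; a real pipeline would persist this state.)
    """
    to_process = []
    updated = {}
    for doc_id, content in source_docs.items():
        digest = fingerprint(content)
        updated[doc_id] = digest
        if seen.get(doc_id) != digest:  # new or changed since last run
            to_process.append(doc_id)
    return sorted(to_process), updated


# First run: every document is new, so all of them are selected.
docs = {"a.pdf": b"quarterly report v1", "b.eml": b"hello"}
first_batch, state = plan_incremental_sync(docs, {})

# Second run: a.pdf changed and c.docx was added; b.eml is skipped.
docs = {"a.pdf": b"quarterly report v2", "b.eml": b"hello", "c.docx": b"memo"}
second_batch, state = plan_incremental_sync(docs, state)
```

Because unchanged documents are skipped before any parsing or embedding happens, the work on each run scales with the amount of change rather than with the size of the whole repository.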

Enterprise-grade governance and compliance

Unstructured is SOC 2 Type II, ISO 27001, HIPAA, and GDPR compliant, meeting the rigorous security and privacy standards enterprise IT organizations require. Together with watsonx.data, the solution provides version control, data lineage tracking and granular access controls that honor source system permissions throughout the data pipeline.

Optimized for AI workflows

Unstructured delivers semantically enriched, properly chunked, and embedded data optimized for modern AI architectures:

  • Retrieval-augmented generation (RAG): Contextually intelligent chunking improves retrieval accuracy and response quality
  • Vector database integration: Automatic embedding generation streamlines ingestion into vector stores
  • Agentic systems: Provides structured, actionable context that enables autonomous agents to reason, plan, and interact with data more effectively
  • Multi-modal AI: Coordinated processing of text and image content
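
As a rough illustration of what structure-aware chunking for RAG can look like, the Python sketch below groups parsed document elements so that a chunk never crosses a section heading and never exceeds a size budget. It is a simplified stand-in, not Unstructured's actual chunking logic; the `Element` class, the element labels and the `chunk_by_section` function are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str  # illustrative labels such as "Title", "NarrativeText", "Table"
    text: str


def chunk_by_section(elements: list[Element], max_chars: int = 200) -> list[str]:
    """Group elements into chunks that never span a section heading."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for el in elements:
        starts_section = el.kind == "Title"
        too_big = size + len(el.text) > max_chars
        # Close the open chunk at a heading or when the budget is exceeded.
        if current and (starts_section or too_big):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(el.text)
        size += len(el.text)
    if current:
        chunks.append("\n".join(current))
    return chunks


doc = [
    Element("Title", "Refund policy"),
    Element("NarrativeText", "Refunds are issued within 30 days."),
    Element("Title", "Shipping"),
    Element("NarrativeText", "Orders ship in 2 business days."),
    Element("Table", "Region | Days\nEU | 3\nUS | 2"),
]
chunks = chunk_by_section(doc, max_chars=200)
```

Each resulting chunk keeps a heading together with the text and table beneath it, so a retriever that matches "Shipping" also returns the delivery table instead of an arbitrary slice of characters.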

With watsonx.data and Unstructured, teams can move quickly with production-ready pipelines combining speed, flexibility and AI-readiness all in one integrated solution.

Better together: Fueling the watsonx engine

If watsonx.data is the data engine powering generative AI applications, Unstructured provides the fuel. Together, watsonx.data and Unstructured deliver AI-ready unstructured data and enable advanced retrieval-augmented generation patterns that improve the accuracy and reliability of AI. 

Enterprises can accelerate time-to-value by replacing manual document preparation with automated, intelligent processing. Governance policies flow from document source systems all the way to the AI applications, improving trust and transparency at every stage. By removing the bottleneck of unstructured data preparation, and providing a data foundation with unified data access, preparation, and governance, organizations can finally unlock the full potential of their unstructured content to power reliable, enterprise-grade AI.

To see watsonx.data and Unstructured in action, join our upcoming joint webinar or book a meeting. Together, we’ll help you move from spending time preparing messy, unstructured data to accelerating enterprise-grade AI agents and applications, powered by AI-ready data, at scale.
