What is unstructured data?

Authors

Alexandra Jonker

Staff Editor

IBM Think

Alice Gomstyn

Staff Writer

IBM Think

What is unstructured data?

Unstructured data is information that does not have a predefined format. Unstructured datasets are massive (often terabytes or petabytes of data) and account for roughly 90% of all enterprise-generated data.¹

The proliferation of unstructured data is driven by its diverse and extensive data sources—including text documents, social media, image and audio files, instant messages and smart devices. Almost all new data generated today is unstructured: every message sent, photo uploaded or sensor triggered adds to the growing volume.

Unlike structured data (which has a predefined data model), unstructured data does not easily conform to the fixed schemas of conventional databases. Instead, unstructured data is often stored in file systems, non-relational (or NoSQL) databases or data lakes.

Unstructured data’s complexity and nonuniform data structure also necessitate more sophisticated methods of data analysis. Technologies such as machine learning (ML) and natural language processing (NLP) are commonly leveraged to extract insights from unstructured datasets.

In the recent past, unstructured data was considered dark data. The challenges of unstructured data (that is, its volume and lack of uniformity) rendered it unusable for many business use cases.

Today, however, enterprises with abundant unstructured data possess a significant strategic asset. When combined, structured and unstructured data provide a complete view of data across an enterprise. And, especially relevant in this current moment, unstructured data can also help businesses unlock the full potential of generative AI (gen AI).

What are examples of unstructured data?

Most unstructured data is textual: email messages, Word documents, PDFs, blogs and social media posts. Textual unstructured data also encompasses call transcripts and message text files, such as those from Microsoft Teams or Slack.

However, unstructured data can also be nontextual. Common examples of nontextual unstructured data include image files (such as JPEG, GIF and PNG), multimedia files, video files, mobile activity and sensor data from Internet of Things (IoT) devices.

Unstructured vs. structured vs. semi-structured data

Data is often categorized as structured, unstructured or semi-structured based on its format and schema rules. As its name suggests, semi-structured data shares attributes of both structured and unstructured data. Here's a brief overview of each type of data:

Structured data

Has a clear, predefined schema
Fits neatly into rows and columns, such as those found in Excel spreadsheets or a relational database management system (RDBMS)
Examples include phone numbers, SEO tags and customer data

Unstructured data

Does not have a predefined schema
Does not conform to the rigid structure of a traditional relational database
Examples include text from web pages, call transcripts and media files

Semi-structured data

Does not have a predefined schema but has metadata—such as tags and semantic markers—that enable indexing and analysis
Does not conform to the rigid structure of a traditional relational database
Examples include JavaScript Object Notation (JSON), CSV and eXtensible Markup Language (XML) files
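To make the three categories concrete, here is a minimal sketch that pulls the same phone number out of each kind of data. The customer name, table layout, JSON shape and email text are all invented for illustration; only the Python standard library is used.

```python
import json
import re
import sqlite3

# Structured: a fixed schema, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Dana Lee', '555-0100')")
structured_phone = conn.execute(
    "SELECT phone FROM customers WHERE name = ?", ("Dana Lee",)
).fetchone()[0]

# Semi-structured: no fixed schema, but keys act as tags that describe the values.
record = json.loads(
    '{"name": "Dana Lee", "contact": {"phone": "555-0100"}, "notes": ["prefers email"]}'
)
semi_structured_phone = record["contact"]["phone"]

# Unstructured: free text, so the value has to be inferred from a pattern.
email_body = "Hi, this is Dana Lee. You can reach me at 555-0100 after 3pm."
match = re.search(r"\b\d{3}-\d{4}\b", email_body)
unstructured_phone = match.group(0) if match else None

print(structured_phone, semi_structured_phone, unstructured_phone)
```

The first two lookups rely on structure that travels with the data. The third depends on a hand-written pattern that would miss a number formatted differently, which is why unstructured data usually calls for more sophisticated analysis.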

Why is unstructured data important?

Unstructured data represents the lion’s share of all data generated in an enterprise. It’s diverse, flexible and flush with insights, some of which may not exist in structured data sets. While structured data is still immensely valuable, most companies today are sitting on vast stores of unstructured data that remain largely untapped.

Unstructured data is also instrumental to modern AI. Unstructured data (in the form of public and internal, proprietary data) can be used to train AI models and improve model performance.

What are use cases for unstructured data?

With the right tools, unstructured data can support a wide variety of use cases, such as:

Generative AI (gen AI)
Retrieval augmented generation (RAG)
Customer behavior and sentiment analysis
Predictive data analytics
Chatbot text analysis

Generative AI (gen AI)

Generative AI relies on deep learning models that identify and encode the patterns and relationships in huge amounts of data. Unstructured data, usually from the internet, is well-suited to provide the extremely high volume of rich, unlabeled data necessary for training.

Retrieval augmented generation (RAG)

RAG is an architecture for optimizing the performance of a gen AI model by giving it access to additional external knowledge bases, such as an organization’s internal unstructured data. This process helps adapt models to domain-specific use cases so they can provide better answers.

Customer behavior and sentiment analysis

Sentiment analysis examines large volumes of text to determine whether it expresses a positive, negative or neutral sentiment. As a tool to understand customer behavior, sentiment analysis uses the vast troves of unstructured textual data generated by customers across digital channels.
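As a rough illustration of the idea, the sketch below labels text by counting words from two small, hand-picked word lists. The lists and the function name are assumptions made for this example; production systems typically rely on trained models rather than fixed lexicons.

```python
import re

# Hypothetical word lists, chosen only for this example.
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "terrible", "refund"}

def classify_sentiment(text: str) -> str:
    """Label text as positive, negative or neutral by counting lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for review in [
    "I love the fast support",
    "The app is slow and broken",
    "I ordered on Tuesday",
]:
    print(review, "->", classify_sentiment(review))
```

A lexicon approach like this cannot handle negation or sarcasm ("not great"), which is one reason NLP models are generally preferred for real customer data.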

Predictive data analytics

Companies employ predictive analytics to forecast future outcomes and identify risks and opportunities using historical data. For instance, a healthcare organization could mine health records (unstructured text data) to learn how a specific disease has been diagnosed and treated, and create a predictive model based on the findings.

Chatbot text analysis

An enterprise-grade chatbot can analyze and extract insights from the unstructured text data in its conversations with customers or employees. Typically, analysis is performed using techniques such as natural language processing (NLP) and machine learning. Insights from the analyzed text data can help inform customer behavior and improve chatbot performance.

Unstructured data for AI: A closer look

Artificial intelligence-related use cases for unstructured data are increasingly a focal point for enterprises embracing AI innovation. Consider gen AI, the technology behind ChatGPT and other viral AI apps. It begins with a foundation model, commonly a large language model (LLM).

Creating a foundation model involves training a deep learning algorithm on huge volumes of unstructured data, usually from the internet. This unstructured data is rich and diverse, teaching AI models context and nuance.

However, unstructured training data can be quite general, rather than specific to a domain or organization, and potentially outdated. The final model might struggle to respond to prompts asking for domain-specific answers.

To address such challenges, organizations can adapt a pre-trained model to a specific use case or task in several ways. One method, fine-tuning, tailors a base model by training it on a smaller, task-specific dataset. It requires high-quality, structured data, often proprietary data or specialized, domain-specific knowledge.

However, a different method, retrieval augmented generation (RAG), can incorporate unstructured data. While LLMs typically source information from their training data, RAG adds an information retrieval component to the AI workflow, gathering relevant data and feeding it to the model to improve response quality. This data can include internal, unstructured datasets.
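The retrieval step can be sketched in a few lines. This toy version ranks documents by bag-of-words cosine similarity and pastes the best match into a prompt. The documents, function names and prompt wording are assumptions for illustration; real RAG systems typically use learned embeddings and a vector index, and then send the prompt to an LLM.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing words
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True
    )
    return ranked[:k]

def build_prompt(query: str, documents: list[str], k: int = 1) -> str:
    """Combine the retrieved context with the user's question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical internal documents.
docs = [
    "Refund requests must be filed within 30 days of purchase.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
    "VPN access requires a hardware token issued by IT.",
]
prompt = build_prompt("How many days do I have to request a refund?", docs)
print(prompt)
```

Because the context is fetched at question time, updating the document list changes the answers without retraining the model.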

Compared to fine-tuning, RAG can produce more timely and accurate outcomes because it retrieves current information during response generation. It can help transform AI initiatives from frozen in time and generic, to customized, relevant and impactful.

Like structured data, unstructured data also requires proper data governance and data management before being used for AI. It needs to be classified, assessed for data quality, filtered for PII and deduplicated.
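Two of these preparation steps, deduplication and PII filtering, can be sketched as follows. The regular expressions cover only simple email addresses and one US-style phone format, and the sample documents are invented; treat this as an illustration of the idea rather than a complete PII filter.

```python
import hashlib
import re

# Simplified patterns; real PII detection covers many more formats.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def redact_pii(text: str) -> str:
    """Replace matched emails and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first copy of each document, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

raw = [
    "Contact jane.doe@example.com or 555-123-4567 for details.",
    "contact  jane.doe@example.com or 555-123-4567 for details.",
    "Quarterly report attached.",
]
cleaned = [redact_pii(d) for d in deduplicate(raw)]
print(cleaned)
```

Hashing a normalized copy catches exact and near-exact repeats cheaply. Detecting paraphrased duplicates would need a similarity measure instead.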

With the right tools, and even the help of AI, businesses can transform their unstructured data and make it usable. Knowing how to effectively make order out of the data chaos is now a competitive differentiator—and catalyst for enterprise gen AI.

How is unstructured data stored?

Unstructured data is typically stored in its native format, which broadens storage options. Some common data storage environments for unstructured data include:

Object storage

Object storage (or object-based storage) stores data as objects: simple, self-contained units that each include the data, its metadata and a unique identifier. This architecture is ideal for storing, archiving, backing up and managing high volumes of static unstructured data. Cloud-based object storage is often used to optimize the storage costs and data usage of AI workloads.
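The object model can be sketched with an in-memory stand-in. The class and method names here are hypothetical and do not correspond to any cloud provider's API; the point is only that each object carries its data, its metadata and a unique identifier in a flat namespace.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class StoredObject:
    data: bytes
    metadata: dict
    # Each object gets its own unique identifier.
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))

class InMemoryObjectStore:
    """A toy, flat store: no folders or tables, just IDs mapped to objects."""

    def __init__(self) -> None:
        self._objects: dict[str, StoredObject] = {}

    def put(self, data: bytes, **metadata) -> str:
        obj = StoredObject(data=data, metadata=metadata)
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id: str) -> StoredObject:
        return self._objects[object_id]

    def find(self, **criteria) -> list[StoredObject]:
        """Return objects whose metadata matches every given key and value."""
        return [
            o for o in self._objects.values()
            if all(o.metadata.get(k) == v for k, v in criteria.items())
        ]

store = InMemoryObjectStore()
call_id = store.put(b"...audio bytes...", kind="call_recording", language="en")
store.put(b"%PDF-...", kind="contract", language="en")
print(store.get(call_id).metadata)
```

Since the store never inspects the bytes, any file type can be saved in its native format, and the metadata is what makes the objects searchable.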

Data lakes

Data lakes are data storage environments designed to handle large amounts of raw data in any data format—specifically, the flood of big data created by internet-connected apps and services. They use cloud computing to make data storage more scalable and affordable. And typically, data lakes use cloud-based object storage, such as Azure Blob Storage, Google Cloud Storage or IBM Cloud® Object Storage.

Data lakehouses

Data lakehouses are considered the next evolution of data management, combining the best parts of data lakes and data warehouses. They offer fast, low-cost storage with the flexibility to support data analytics and AI/ML workloads. Data lakehouses also support real-time data ingestion, which is critical for AI applications used to support real-time decision-making.

NoSQL databases

Structured query language (SQL) is a standardized, domain-specific programming language used for storing, manipulating and retrieving data. A NoSQL (non-SQL, or "not only SQL") database is designed to store data outside of traditional SQL database structures, without a fixed schema. NoSQL databases provide the speed and scalability necessary for managing large, unstructured datasets. Examples include MongoDB, Redis and HBase.

What are tools for processing unstructured data?

Once unstructured data is stored, it often requires processing to be effectively used for downstream use cases, such as for business intelligence or unstructured data analytics.

Some organizations use open-source frameworks to process large, unstructured data sets. For example, Apache Hadoop is often integrated into data lake architectures to enable batch processing of unstructured and semi-structured data (such as streaming audio and social media sentiment). Apache Spark is another open-source framework for big data processing. Unlike Hadoop's disk-based batch processing, Spark keeps working data in memory, which makes it considerably faster for iterative workloads and therefore better suited for machine learning and AI applications.
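The batch pattern these frameworks distribute across a cluster can be shown in a single process. The sketch below is plain Python, not Hadoop or Spark code: a map step counts words in each document independently, and a reduce step merges the partial counts. The sample documents are invented.

```python
import re
from collections import Counter
from functools import reduce

def map_phase(document: str) -> Counter:
    """Count words in one document; each document can be processed independently."""
    return Counter(re.findall(r"[a-z]+", document.lower()))

def reduce_phase(left: Counter, right: Counter) -> Counter:
    """Merge two partial counts into one."""
    return left + right

documents = [
    "late delivery late refund",
    "refund issued",
    "delivery on time",
]
totals = reduce(reduce_phase, map(map_phase, documents), Counter())
print(totals.most_common(3))
```

Because the map step needs no shared state, a framework can run it on many machines at once and combine the results afterward.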

There are also modern data integration platforms specifically designed to handle both structured and unstructured data. These multi-purpose integration tools automatically ingest raw data, organize it and then move processed data into target databases. These features significantly reduce the time-intensive manual work of data science teams tasked with preparing raw, unstructured data for AI.

Technology for unstructured data analysis

There are various tools and technologies organizations can use to uncover insights from their unstructured data.

AI analytics

AI analytics tools rely on the ability of artificial intelligence to quickly process large volumes of data, which is key for organizations that want to find valuable insights in massive unstructured data sets. With machine learning and natural language processing, AI algorithms sift through unstructured data to find patterns, make real-time predictions or provide recommendations. These analytical models can also integrate into existing dashboards or APIs to automate decision-making.

Text mining

Text mining uses machine learning algorithms such as Naïve Bayes and support vector machines (SVM), as well as deep learning algorithms, to help organizations explore and discover hidden relationships within unstructured data. A variety of techniques are deployed for text mining, such as information retrieval, information extraction, data mining and natural language processing.
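As an example of one such algorithm, here is a minimal multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing, written from scratch. The class name, the four training tickets and their labels are invented for illustration; in practice a tested library implementation and far more data would be used.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesClassifier:
    def __init__(self) -> None:
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, texts: list[str], labels: list[str]) -> None:
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)

    def predict(self, text: str) -> str:
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

model = NaiveBayesClassifier()
model.fit(
    [
        "invoice payment overdue",
        "refund charged twice on my card",
        "app crashes on login",
        "cannot reset password error",
    ],
    ["billing", "billing", "technical", "technical"],
)
print(model.predict("I was charged twice"))
print(model.predict("password reset fails with error"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps an unseen word from zeroing out a class.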

Natural language processing (NLP)

NLP uses machine learning to help computers understand and communicate with human language. In the context of unstructured data analysis, NLP enables the extraction of insights from unstructured text data, such as customer reviews and social media posts. It can be used to enhance text mining by offering advanced language processing and understanding, such as sentiment analysis.

Footnotes

¹ “Untapped value: What every executive needs to know about unstructured data,” IDC, Aug 2023.
