Unstructured data is information that does not have a predefined format. Unstructured datasets are massive (often terabytes or petabytes of data) and account for 90% of all enterprise-generated data.1
The proliferation of unstructured data is driven by its diverse and extensive data sources—including text documents, social media, image and audio files, instant messages and smart devices. Almost all new data generated today is unstructured: every message sent, photo uploaded or sensor triggered adds to the growing volume.
Unlike structured data (which has a predefined data model), unstructured data does not easily conform to the fixed schemas of conventional databases. Instead, unstructured data is often stored in file systems, non-relational (or NoSQL) databases or data lakes.
Unstructured data’s complexity and nonuniform data structure also necessitate more sophisticated methods of data analysis. Technologies such as machine learning (ML) and natural language processing (NLP) are commonly leveraged to extract insights from unstructured datasets.
In the recent past, unstructured data was considered dark data. The challenges of unstructured data (that is, its volume and lack of uniformity) rendered it unusable for many business use cases.
Today, however, enterprises with abundant unstructured data possess a significant strategic asset. When combined, structured and unstructured data provide a complete view of data across an enterprise. And, especially relevant in this current moment, unstructured data can also help businesses unlock the full potential of generative AI (gen AI).
Most unstructured data is textual: email messages, Word documents, PDFs, blogs and social media posts. Textual unstructured data also encompasses call transcripts and message text files, such as those from Microsoft Teams or Slack.
However, unstructured data can also be nontextual. Common examples of nontextual unstructured data include image files (such as JPEG, GIF and PNG), multimedia files, video files, mobile activity and sensor data from Internet of Things (IoT) devices.
Data is often categorized as structured, unstructured or semi-structured based on its format and schema rules. As its name suggests, semi-structured data shares attributes of both structured and unstructured data. Here's a brief overview of each type of data:
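Structured data: has a predefined data model and fits the fixed schemas of conventional databases.
Unstructured data: has no predefined format and is typically kept in file systems, NoSQL databases or data lakes.
Semi-structured data: shares attributes of both, carrying some organizing markers (as in JSON or XML files) without a rigid schema.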
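To make the three categories concrete, the sketch below shows one customer interaction in each form. The field names and sample values are invented for illustration and do not come from any real dataset.

```python
import csv
import io
import json

# Structured: every row follows the same fixed columns.
structured_csv = "customer_id,rating,channel\n1042,2,email\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: self-describing keys, but fields can vary per record.
semi_structured_json = (
    '{"customer_id": 1042, "tags": ["billing", "refund"], "note": "Called twice"}'
)
record = json.loads(semi_structured_json)

# Unstructured: free text with no fields; meaning has to be extracted.
unstructured_text = (
    "Hi, I was charged twice last month and still have not received my refund."
)

print(rows[0]["rating"])               # value addressed by column name
print(record["tags"])                  # value addressed by key
print(len(unstructured_text.split()))  # only raw tokens are available
```

The structured and semi-structured values can be looked up directly, while the free text needs further analysis before it yields anything comparable.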
Unstructured data represents the lion’s share of all data generated in an enterprise. It’s diverse, flexible and flush with insights, some of which may not exist in structured data sets. While structured data is still immensely valuable, most companies today are sitting on vast stores of unstructured data that remain largely untapped.
Unstructured data is also instrumental to modern AI. Unstructured data (in the form of public and internal, proprietary data) can be used to train AI models and improve model performance.
With the right tools, unstructured data can support a wide variety of use cases, such as:
Generative AI relies on deep learning models that identify and encode the patterns and relationships in huge amounts of data. Unstructured data, usually from the internet, is well-suited to provide the extremely high volume of rich, unlabeled data necessary for training.
Retrieval-augmented generation (RAG) is an architecture for optimizing the performance of a gen AI model by giving it access to additional external knowledge bases, such as an organization’s internal unstructured data. This process helps adapt models to domain-specific use cases so they can provide better answers.
Sentiment analysis examines large volumes of text to determine whether it expresses a positive, negative or neutral sentiment. As a tool to understand customer behavior, sentiment analysis uses the vast troves of unstructured textual data generated by customers across digital channels.
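As a rough illustration of the idea, the sketch below labels text by counting hits against two small word lists. The lists and sample reviews are hypothetical, and production systems generally rely on trained models rather than a fixed lexicon like this one.

```python
import re

# Hypothetical, tiny word lists for illustration only.
POSITIVE = {"great", "love", "fast", "helpful", "excellent"}
NEGATIVE = {"slow", "broken", "terrible", "refund", "rude"}

def sentiment(text: str) -> str:
    """Label text as positive, negative or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

reviews = [
    "Love the new app, support was fast and helpful!",
    "Checkout is broken and the agent was rude.",
    "I updated my shipping address yesterday.",
]
labels = [sentiment(r) for r in reviews]
print(labels)
```

A lexicon approach is easy to inspect but misses negation and sarcasm, which is one reason trained NLP models are usually preferred at scale.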
Companies employ predictive analytics to forecast future outcomes and identify risks and opportunities using historical data. For instance, a healthcare organization could mine health records (unstructured text data) to learn how a specific disease has been diagnosed and treated, and create a predictive model based on the findings.
An enterprise-grade chatbot can analyze and extract insights from the unstructured text data in its conversations with customers or employees. Typically, analysis is performed using techniques such as natural language processing (NLP) and machine learning. Insights from the analyzed text data can help inform customer behavior and improve chatbot performance.
Artificial intelligence-related use cases for unstructured data are increasingly a focal point for enterprises embracing AI innovation. Consider gen AI, the technology behind ChatGPT and other viral AI apps. It begins with a foundation model, commonly a large language model (LLM).
Creating a foundation model involves training a deep learning algorithm on huge volumes of unstructured data, usually from the internet. This unstructured data is rich and diverse, teaching AI models context and nuance.
However, unstructured training data can be quite general, rather than specific to a domain or organization, and potentially outdated. The final model might struggle to respond to prompts asking for domain-specific answers.
To address such challenges, organizations can adapt a pre-trained model to a specific use case or task in several ways. One method, fine-tuning, tailors a base model by training it on a smaller, task-specific dataset. It requires high-quality, structured data, which is often proprietary data or specialized, domain-specific knowledge.
However, a different method, retrieval-augmented generation (RAG), can incorporate unstructured data. While LLMs typically source information from their training data, RAG adds an information retrieval component to the AI workflow, gathering relevant data and feeding it to the model to improve response quality. This data can include internal, unstructured datasets.
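The retrieval step can be sketched in a few lines. The example below is a minimal illustration, assuming a simple bag-of-words cosine similarity in place of the embedding models and vector databases that real RAG systems typically use. The documents, function names and prompt wording are all hypothetical, and the call to an actual LLM is left out.

```python
import math
import re
from collections import Counter

def vectorize(text: str) -> Counter:
    """Turn text into a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query: str, documents: list, k: int = 1) -> list:
    """Return the k documents most similar to the query."""
    q = vectorize(query)
    ranked = sorted(documents, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, documents: list, k: int = 1) -> str:
    """Prepend retrieved context to the question before it goes to a model."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical internal, unstructured documents.
internal_docs = [
    "Refunds are issued within 14 days of a return request.",
    "The office is closed on public holidays.",
    "Enterprise plans include 24/7 phone support.",
]
prompt = build_prompt("How long do refunds take?", internal_docs)
print(prompt)
```

The model itself is unchanged in this pattern. Only the prompt is enriched, which is why the knowledge base can be updated without retraining.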
Compared to fine-tuning, RAG can deliver more timely and accurate outcomes because it retrieves current information at response time rather than relying only on what the model learned during training. It can help transform AI initiatives from generic and frozen in time to customized, relevant and impactful.
Like structured data, unstructured data also requires proper data governance and data management before being used for AI. It needs to be classified, assessed for data quality, filtered for PII and deduplicated.
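Two of these preparation steps, deduplication and PII filtering, can be sketched as follows. This is a simplified illustration: the email pattern is a rough assumption that covers only one kind of PII, and only exact duplicates (after normalizing case and whitespace) are detected.

```python
import hashlib
import re

# Rough email pattern for illustration; real PII detection covers far more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_pii(text: str) -> str:
    """Mask email addresses with a placeholder token."""
    return EMAIL.sub("[EMAIL]", text)

def deduplicate(texts: list) -> list:
    """Drop exact duplicates after normalizing case and whitespace."""
    seen, unique = set(), []
    for text in texts:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

raw = [
    "Contact jane.doe@example.com about the invoice.",
    "contact  jane.doe@example.com about the invoice.",
    "Shipment delayed by two days.",
]
cleaned = [redact_pii(t) for t in deduplicate(raw)]
print(cleaned)
```

Hashing the normalized text keeps the duplicate check cheap for large collections, although near-duplicates would need a fuzzier technique.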
With the right tools, and even the help of AI, businesses can transform their unstructured data and make it usable. Knowing how to effectively make order out of the data chaos is now a competitive differentiator—and catalyst for enterprise gen AI.
Unstructured data is typically stored in its native format, which broadens storage options. Some common data storage environments for unstructured data include:
Object storage (or object-based storage) stores data as objects. Each object is a simple, self-contained unit that includes the data, its metadata and a unique identifier. This architecture is ideal for storing, archiving, backing up and managing high volumes of static unstructured data. Cloud-based object storage is often used to optimize the storage costs and data usage of AI workloads.
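The object model described above can be illustrated with a small in-memory sketch. The class and method names here are hypothetical and do not mirror any cloud provider's API. The point is only that each object bundles data, metadata and a unique ID in a flat namespace.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class StoredObject:
    data: bytes      # the raw payload, kept in its native format
    metadata: dict   # free-form descriptive attributes
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)

class ObjectStore:
    """Flat namespace keyed by object ID, with no folders or table schema."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes, **metadata) -> str:
        obj = StoredObject(data=data, metadata=metadata)
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id: str) -> StoredObject:
        return self._objects[object_id]

store = ObjectStore()
oid = store.put(b"\x89PNG...", content_type="image/png", source="mobile-upload")
obj = store.get(oid)
print(obj.object_id, obj.metadata)
```

Because the metadata travels with each object, files of any type can be stored side by side and still be described and retrieved consistently.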
Data lakes are data storage environments designed to handle large amounts of raw data in any data format—specifically, the flood of big data created by internet-connected apps and services. They use cloud computing to make data storage more scalable and affordable. And typically, data lakes use cloud-based object storage, such as Azure Blob Storage, Google Cloud Storage or IBM Cloud® Object Storage.
Data lakehouses are considered the next evolution of data management, combining the best parts of data lakes and data warehouses. They offer fast, low-cost storage with the flexibility to support data analytics and AI/ML workloads. Data lakehouses also support real-time data ingestion, which is critical for AI applications used to support real-time decision-making.
Structured query language (SQL) is a standardized, domain-specific programming language used for storing, manipulating and retrieving data. A NoSQL (non-SQL, or "not only SQL") database is designed to store data outside of traditional relational table structures, without requiring a fixed schema. NoSQL databases provide the speed and scalability necessary for managing large, unstructured datasets. Examples include MongoDB, Redis and HBase.
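What storing data without a fixed schema looks like in practice can be shown with a small sketch. The `DocumentCollection` class below is a hypothetical in-memory stand-in for a document database collection, not the API of MongoDB or any other product.

```python
class DocumentCollection:
    """Hypothetical in-memory stand-in for a document database collection."""

    def __init__(self):
        self._docs = []

    def insert(self, doc: dict) -> None:
        # No schema check: each document may carry different fields.
        self._docs.append(dict(doc))

    def find(self, **criteria) -> list:
        """Return documents whose fields equal every given criterion."""
        return [
            d for d in self._docs
            if all(d.get(k) == v for k, v in criteria.items())
        ]

posts = DocumentCollection()
posts.insert({"type": "tweet", "user": "ana", "text": "Loving the new release"})
posts.insert({"type": "email", "sender": "bo@example.com", "subject": "Invoice", "attachments": 2})
posts.insert({"type": "tweet", "user": "cy", "text": "App keeps crashing", "likes": 14})

tweets = posts.find(type="tweet")
print([t["user"] for t in tweets])
```

The three records have different fields, yet all fit in one collection, which a relational table with fixed columns would not allow without schema changes.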
Once unstructured data is stored, it often requires processing to be effectively used for downstream use cases, such as for business intelligence or unstructured data analytics.
Some organizations use open-source frameworks to process large, unstructured datasets. For example, Apache Hadoop is often integrated into data lake architectures to enable batch processing of unstructured and semi-structured data (such as streaming audio and social media sentiment). Apache Spark is another open-source framework for big data processing. Because Spark processes data in memory, it is typically much faster than disk-based batch processing and is therefore often better suited to machine learning and AI applications.
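The batch-processing pattern associated with Hadoop, MapReduce, can be illustrated on a single machine. The sketch below is plain Python rather than Hadoop or Spark code, and the sample log lines are invented. It only shows the shape of the computation: map each document to partial counts, then reduce the partial results into totals.

```python
from collections import Counter
from functools import reduce

def map_phase(document: str) -> Counter:
    """Map step: count the words in one document independently."""
    return Counter(document.lower().split())

def reduce_phase(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return left + right

# Hypothetical batch of unstructured log lines.
batch = ["error timeout error", "login ok", "timeout on login"]
totals = reduce(reduce_phase, map(map_phase, batch), Counter())
print(totals.most_common(3))
```

Since each map call touches only its own document, a distributed framework can run those calls on many machines at once and combine the results afterward.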
There are also modern data integration platforms specifically designed to handle both structured and unstructured data. These multi-purpose integration tools automatically ingest raw data, organize it and then move processed data into target databases. These features significantly reduce the time-intensive manual work of data science teams tasked with preparing raw, unstructured data for AI.
There are various tools and technologies organizations can use to uncover insights from their unstructured data.
AI analytics tools rely on the ability of artificial intelligence to quickly process large volumes of data, which is key for organizations that want to find valuable insights in massive unstructured data sets. With machine learning and natural language processing, AI algorithms sift through unstructured data to find patterns, make real-time predictions or provide recommendations. These analytical models can also integrate into existing dashboards or APIs to automate decision-making.
Text mining uses machine learning algorithms such as Naïve Bayes and support vector machines (SVM), along with deep learning algorithms, to help organizations explore and discover hidden relationships within unstructured data. A variety of techniques are deployed for text mining, such as information retrieval, information extraction, data mining and natural language processing.
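To show how one of these algorithms operates on text, here is a minimal multinomial Naïve Bayes classifier with add-one (Laplace) smoothing, written from scratch. The training tickets and labels are hypothetical, and a real project would more likely use an established machine learning library with far more data.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.class_counts = Counter()            # label -> number of documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus summed log likelihoods, with add-one smoothing.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical support tickets with topic labels.
texts = [
    "invoice payment overdue",
    "payment refund request",
    "password reset link",
    "cannot login password",
]
labels = ["billing", "billing", "account", "account"]

model = NaiveBayes()
model.fit(texts, labels)
print(model.predict("refund my payment"))
print(model.predict("reset password"))
```

Smoothing keeps unseen words such as "my" from driving a class probability to zero, so the classifier still decides based on the words it does recognize.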
NLP uses machine learning to help computers understand and communicate with human language. In the context of unstructured data analysis, NLP enables the extraction of insights from unstructured text data, such as customer reviews and social media posts. It can be used to enhance text mining by offering advanced language processing and understanding, such as sentiment analysis.
1 “Untapped value: What every executive needs to know about unstructured data,” IDC, Aug 2023.