In this tutorial, you will use IBM’s open source Docling with Python to convert unstructured data contained in a group of scanned files into a structured format.
Structured data is information organized into fixed fields within a record or file. It resides in SQL databases, JSON from APIs, XML and CSV files, and Excel spreadsheets. Structured data is ready for efficient processing, analysis and management.
By contrast, unstructured data is information that does not conform to a predefined data model or schema. It lacks an organized, tabular form and is typically text-heavy. Examples include emails, social media posts and customer reviews as well as non-text formats such as audio recordings, video files and images.
Unstructured data makes up the vast majority (90%) of enterprise information, growing faster than any other type of data.1 Certain industries—like healthcare or logistics and supply chain—can have a plethora of scanned documents saved as images ready for processing. Although rich in information, unstructured data is challenging for conventional databases and data analysis tools to process directly.
The conversion from unstructured to structured data is vital because structured information is readily interpretable by machines and algorithms.
The goal of the unstructured to structured data conversion process is to transform raw, unstructured inputs into structured or semi-structured outputs that analytics and AI systems can consume.
After collecting the unstructured text or data sources, the data must be processed. This stage transforms raw data into usable data through a series of functions that help ensure every dataset maintains accuracy and structure throughout the process.
To run this tutorial, you need Python installed. This tutorial was tested with Python 3.13.
There are several ways to run the code in this tutorial. Either use IBM® watsonx.ai® to follow along step-by-step or clone our GitHub repository to run the full Jupyter Notebook.
Follow these steps to set up an IBM account and use a Jupyter Notebook.
1. Several Python versions can work for this tutorial. At the time of publishing, we recommend downloading Python 3.13, the latest version.
2. In your preferred IDE, clone the GitHub repository by using `git clone` followed by the repository URL.
3. Inside a terminal, create a virtual environment to avoid Python dependency issues.
4. Then, navigate to this tutorial’s directory.
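Steps 3 and 4 might look like the following in a terminal. This is a sketch: the virtual-environment name `.venv` is a common convention, not mandated by the tutorial, and the tutorial directory path is a placeholder you should replace with the actual folder from the cloned repository.

```shell
# Create and activate a virtual environment to isolate dependencies
python3 -m venv .venv
source .venv/bin/activate    # on Windows: .venv\Scripts\activate

# Then change into the tutorial's directory inside the cloned repo
# cd <tutorial-directory>
```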
We need a few libraries and modules for this tutorial. Make sure to import the following; if any are not installed, a quick pip installation can be performed. Helpful libraries here include docling, for the document conversion itself, and pandas, for working with the extracted tables.
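As a sketch, the setup might look like the following. The PyPI package names `docling` and `pandas` are assumed, and the stdlib modules shown (`glob`, `os`) are typical choices for gathering file paths rather than a requirement of the tutorial.

```python
# One-time setup if the packages are not already installed:
#   pip install docling pandas
import glob  # for collecting the scanned image file paths
import os    # for filesystem path handling

import pandas as pd  # dataframes for the extracted tables

# Docling's converter is imported where it is used, for example:
# from docling.document_converter import DocumentConverter
```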
We will use scanned images from the Kaggle "Scanned Document: Table Dataset."
In this example, from a set of source JPG files, we use Docling to convert the documents from scanned images into text and tables.
We will establish a list of the source image file paths to pass to Docling for conversion.
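A minimal sketch of this step, assuming the JPGs sit in a single folder: `collect_sources` builds the path list, and `convert_all` runs Docling's `DocumentConverter` over each file (the helper names are illustrative, not from the tutorial).

```python
from pathlib import Path


def collect_sources(folder: str, pattern: str = "*.jpg") -> list[str]:
    """Build a sorted list of scanned image paths to feed into Docling."""
    return sorted(str(p) for p in Path(folder).glob(pattern))


def convert_all(paths: list[str]):
    """Convert each scanned image with Docling's DocumentConverter.

    The import is done lazily so collect_sources() still works on a
    machine where Docling is not installed.
    """
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    return [converter.convert(p).document for p in paths]
```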
Next, we can see some of the text output from the conversions and how it has been organized and structured. Note how we now have text labels associated with the text blocks extracted.
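One way to inspect those labels, sketched here under the assumption that the converted Docling document exposes its text blocks via `doc.texts`, with each entry carrying a `label` (for example, a section header or paragraph) and the extracted `text`:

```python
def preview_texts(doc, limit: int = 5) -> None:
    """Print the label Docling attached to each extracted text block."""
    for item in doc.texts[:limit]:
        # Truncate long blocks so the preview stays readable
        print(f"{item.label}: {item.text[:80]}")
```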
If we were setting up a RAG pipeline here, we would chunk the text next before vectorizing and storing in a vector database.
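To illustrate the chunking idea, here is a naive fixed-size chunker with overlap. It is a stand-in for a real chunking strategy (Docling ships purpose-built chunkers for RAG pipelines), and the size and overlap values are arbitrary examples.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks.

    Overlap preserves context across chunk boundaries, which helps
    retrieval quality when the chunks are later embedded.
    """
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```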
Output:
Finally, we can export the extracted tables to dataframes for easier visualization. Now that the tables are in tabular form, they are, by definition, structured and easier for AI applications to consume. We can then save the extracted data in other structured formats for storage.
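The export step can be sketched as follows, assuming the converted document exposes its tables via `doc.tables` and each table item provides Docling's `export_to_dataframe()` helper, which returns a pandas DataFrame:

```python
def save_tables(doc, out_prefix: str = "table") -> list[str]:
    """Write each extracted table to its own CSV file.

    Returns the list of file paths written, one per table.
    """
    paths = []
    for i, table in enumerate(doc.tables):
        df = table.export_to_dataframe()
        path = f"{out_prefix}_{i}.csv"
        df.to_csv(path, index=False)
        paths.append(path)
    return paths
```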
If we were to continue with a RAG application, we would convert the table data to markdown format for passing into a large language model (LLM).
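One simple route for that markdown conversion, hand-rolled here to avoid extra dependencies (pandas' own `DataFrame.to_markdown()`, which requires the `tabulate` package, does the same job):

```python
def df_to_markdown(df) -> str:
    """Render a pandas DataFrame as a Markdown table for an LLM prompt."""
    header = "| " + " | ".join(map(str, df.columns)) + " |"
    divider = "| " + " | ".join("---" for _ in df.columns) + " |"
    rows = [
        "| " + " | ".join(map(str, row)) + " |"
        for row in df.itertuples(index=False)
    ]
    return "\n".join([header, divider, *rows])
```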
In this tutorial, you converted unstructured data held in scanned documents into an AI-ready structured output.
Although we only converted a few documents, the concepts explored in this simplified use case serve as the foundation for creating an automated extract, transform, load (ETL) workflow for enterprise data. For larger unstructured datasets, validation is an important step to perform after conversion to ensure high data quality and accuracy.
1 “Untapped value: What every executive needs to know about unstructured data,” IDC, Aug 2023.