How to convert unstructured data to structured data

In this tutorial, you will use IBM’s open source Docling with Python to convert unstructured data contained in a group of scanned files into a structured format.

Structured vs. unstructured data

Structured data is information organized into fixed fields within a record or file. It resides in SQL databases, JSON payloads from APIs, XML and CSV files and Excel spreadsheets. Because of this organization, structured data is ready for efficient processing, analysis and management.
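
To make "fixed fields" concrete, the same record can be serialized into several of these structured formats with Python's standard library alone. The invoice record below is a made-up example:

```python
import csv
import io
import json

# A single structured record: every value sits in a named, fixed field.
record = {"invoice_id": "INV-001", "customer": "Acme Corp", "total": 149.99}

# The same record as JSON (common for APIs).
as_json = json.dumps(record)

# The same record as a CSV row (common for spreadsheets and flat files).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```

Either representation can be loaded back into a program, queried or joined without any parsing guesswork, which is exactly what unstructured inputs lack.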

By contrast, unstructured data is information that does not conform to a predefined data model or schema. It lacks an organized, tabular form and is typically text-heavy. Examples include emails, social media posts and customer reviews as well as non-text formats such as audio recordings, video files and images.

Unstructured data makes up the vast majority, roughly 90%, of enterprise information and is growing faster than any other type of data.1 Certain industries, such as healthcare or logistics and supply chain, can have a plethora of scanned documents saved as images awaiting processing. Although rich in information, unstructured data is challenging for conventional databases and data analysis tools to process directly.

The importance of conversion

The conversion from unstructured to structured data is vital because structured information is readily interpretable by machines and algorithms, enabling efficient search, analytics and machine learning over content that would otherwise remain opaque.

The conversion process

The goal of the unstructured to structured data conversion process is to transform raw, unstructured inputs into structured or semi-structured outputs that analytics and AI systems can consume.

After collecting the unstructured text or data sources, the data must be processed. This stage transforms raw data into usable data through a series of functions that help ensure every dataset maintains accuracy and structure throughout the process. Common techniques include text cleaning and normalization, optical character recognition (OCR) for scanned images, and table extraction.
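
One of the simplest of these techniques, text cleaning and normalization, can be sketched in a few lines. The specific rules below are illustrative, not prescribed by this tutorial:

```python
import re

def normalize(text: str) -> str:
    """Collapse whitespace and strip control characters from raw extracted text."""
    text = re.sub(r"[\x00-\x1f]", " ", text)   # replace control characters with spaces
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()

raw = "Invoice\x0c  total:\n\n  $149.99 "
print(normalize(raw))  # Invoice total: $149.99
```

Real pipelines layer on many more rules (encoding repair, de-hyphenation, language detection), but the shape is the same: deterministic functions applied uniformly to every document.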


Prerequisites

To run this tutorial effectively, you need Python installed. This tutorial was tested with Python 3.13.

Steps

Step 1. Set up your environment

There are several ways to run the code in this tutorial. Either use IBM® watsonx.ai® to follow along step by step or clone our GitHub repository to run the full Jupyter Notebook.

Option 1: Use watsonx.ai

Follow these steps to set up an IBM account and use a Jupyter Notebook.

  1. You need an IBM Cloud® account to create a watsonx.ai project.
  2. Create a watsonx.ai project by using your IBM Cloud account.
  3. Create a Jupyter Notebook.
    This step opens a Jupyter Notebook environment where you can copy the code from this tutorial. Alternatively, you can download the notebook to your local system and upload it to your watsonx.ai project as an asset.

Option 2: Run the tutorial locally

1. Several Python versions can work for this tutorial. At the time of publishing, we recommend downloading Python 3.13, the latest version.

2. In your preferred IDE, clone the GitHub repository by using https://github.com/IBM/ibmdotcom-tutorials.git as the HTTPS URL. For detailed steps on how to clone a repository, refer to the GitHub documentation.

3. Inside a terminal, create a virtual environment to avoid Python dependency issues.

python3.13 -m venv myvenv
source myvenv/bin/activate

4. Then, navigate to this tutorial’s directory.

cd docs/tutorials/docling

Step 2. Install and import relevant libraries

We need a few libraries and modules for this tutorial. Import the following ones; if any are not installed, a quick pip installation takes care of it. The -q flag quiets or suppresses the progress bars.

Some helpful libraries here include docling and pandas. We will be using open source Docling's OCR support for parsing JPG files, but similar OCR tools are available to use with OpenAI and AWS. Pandas will be used to visualize the extracted data from scanned images as a structured dataframe.

We will use scanned images from this Kaggle Scanned Document: Table Dataset.

! pip install -q docling pandas
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

Step 3. Convert scanned images

In this example, we use Docling to convert a set of source JPG files from scanned images into text and tables.

We will establish a list, sources, containing the URLs of two JPG images. To convert from images, we will use PdfPipelineOptions, setting do_table_structure, do_ocr and generate_picture_images all to True. Once the format options are established in format_options, we can use DocumentConverter to convert the sources and save the results in a dictionary, conversions.


sources = ["https://storage.googleapis.com/kagglesdsdata/datasets/6696558/10791317/10.1.1.1.2019_2.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20260216%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20260216T162351Z&X-Goog-Expires=345600&X-Goog-SignedHeaders=host&X-Goog-Signature=452bf02297d455b8d09229a59e758b3bf268c45e67c97b9a237b36b2dc9d365bee8d0d83e21fa55df78ec9224e8a64edfc7f25312dc3bfc6a4026a62a79ebb8a72a85856e139b18753d5f60c795bb452c9176b7fd360d400f7af5399e22c615284001af5d68df20c8b25975b86452856391a2bd37ba0354453011dcba6757f6617f7d90d82d660df1777c618d7877b835fd848cfda7cbfae7e14f3566542d9108d8e3f53e8956779a1a6d1afedf42d680411bf280f9827a7410631d708e94c1d03d83f6b40226ccb0bcfa4e9a76fa62aa2c4ea5d8869566f0a4fe9dd9dd0a47a9d46714459173ce0182fcd4e9658c2b2e1a19c5ab27b454bab24e53ffba23cc2",
"https://storage.googleapis.com/kagglesdsdata/datasets/6696558/10791317/10.1.1.1.2019_3.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20260212%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20260212T215329Z&X-Goog-Expires=345600&X-Goog-SignedHeaders=host&X-Goog-Signature=afb95356cb9ab3c186c37380b3b5d01a80a68a25cf5793a0877dad8e7948e85ed0b31e9352c347f7ef6810490391998dd05ea4abe15993ff2041eafda3831ba7eb9e583e2e200dd330664f0210e389e4c3a8cedae0ba09ac48d4bb3b7e95b0b9c07abb21fa9d9149c6f936bad6a3e7c29de4d96bbe2da8962168c166a482d89a9257d21a19ccebf5c24e4248d0c87aafffd29039a95559a2bea28d30e89e08afce27d894755809b1e19813e7fec56d20bc7922af62152f3d35d180bf8bbb2a2ef36ae4d71cc443b3ca6a2b93ea40d452a81b795f2a098aaa9969798d31b125c43c50590a0b1a59158e96da682499e5d9e96e4103cad68c9185c522921f73c60f"]

pdf_pipeline_options = PdfPipelineOptions(
    do_table_structure=True,
    do_ocr=True,
    generate_picture_images=True,
)

format_options = {InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_pipeline_options),}

converter = DocumentConverter(format_options=format_options)

conversions = { source: converter.convert(source=source).document for source in sources }

Step 4. Clean text output

Next, we can see some of the text output from the conversions and how it has been organized and structured. Note how each extracted text block now carries a label, such as section header, text or caption.

If we were setting up a RAG pipeline here, we would chunk the text next before vectorizing and storing in a vector database.

conversions[sources[0]].texts

Output:

[SectionHeaderItem(self_ref='#/texts/0', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, meta=None, label=<DocItemLabel.SECTION_HEADER: 'section_header'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=164.80041162882412, t=959.773255788561, r=246.38477240931664, b=949.9796513245942, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 12))], source=[], comments=[], orig='2 Susan Bull', text='2 Susan Bull', formatting=None, hyperlink=None, level=1),
TextItem(self_ref='#/texts/1', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, meta=None, label=<DocItemLabel.TEXT: 'text'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=164.8004092391199, t=923.8633721730953, r=629.8312739300967, b=804.7078487680192, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 650))], source=[], comments=[], orig='17 students taking an MSc in Human Centred Systems took part. 8 had taken an MSc module in User Modelling. All had Pocket PCs. 10 undergraduate students taking a degree in Computer Interactive Systems, who had completed undergraduate modules on Personalisation and Adaptive Systems and Interactive Learning Environments, voluntarily took part. Data was obtained by anonymous questionnaire from all subjects, and anonymous logbooks on Pocket PC use over 6 weeks from MSc students. Due to the low numbers it is inappropriate to perform a statistical analysis of the results: the aim is to discover if initial data indicates further work to be valuable.', text='17 students taking an MSc in Human Centred Systems took part. 8 had taken an MSc module in User Modelling. All had Pocket PCs. 10 undergraduate students taking a degree in Computer Interactive Systems, who had completed undergraduate modules on Personalisation and Adaptive Systems and Interactive Learning Environments, voluntarily took part. Data was obtained by anonymous questionnaire from all subjects, and anonymous logbooks on Pocket PC use over 6 weeks from MSc students. Due to the low numbers it is inappropriate to perform a statistical analysis of the results: the aim is to discover if initial data indicates further work to be valuable.', formatting=None, hyperlink=None),
SectionHeaderItem(self_ref='#/texts/2', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, meta=None, label=<DocItemLabel.SECTION_HEADER: 'section_header'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=166.4027855785592, t=772.2323590576536, r=231.7289027527237, b=760.466768979679, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 11))], source=[], comments=[], orig='2.1 Results', text='2.1 Results', formatting=None, hyperlink=None, level=1),
TextItem(self_ref='#/texts/3', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, meta=None, label=<DocItemLabel.TEXT: 'text'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=163.16872994516027, t=742.7795250670322, r=629.8312719924684, b=621.8154305269595, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 654))], source=[], comments=[], orig='Location-Aware User Modelling System. Logbook data shows the most common location of Pocket PC use to be at home, followed by various rooms in EECE. Some students also used their Pocket PC in other parts of the campus and elsewhere. Results of 3 typical users are presented in Table 1, as an example of similarities and differences between Pocket PC use. 10 of the generally common activities are listed: reading, email, web browsing, notes, calendar, computer assisted learning, word processing, calculator, music, games. Each user also performed a few additional tasks in other categories, not shown (e.g. MSN Messenger, Excel, viewing lecture slides).', text='Location-Aware User Modelling System. Logbook data shows the most common location of Pocket PC use to be at home, followed by various rooms in EECE. Some students also used their Pocket PC in other parts of the campus and elsewhere. Results of 3 typical users are presented in Table 1, as an example of similarities and differences between Pocket PC use. 10 of the generally common activities are listed: reading, email, web browsing, notes, calendar, computer assisted learning, word processing, calculator, music, games. Each user also performed a few additional tasks in other categories, not shown (e.g. MSN Messenger, Excel, viewing lecture slides).', formatting=None, hyperlink=None),
TextItem(self_ref='#/texts/4', parent=RefItem(cref='#/tables/0'), children=[], content_layer=<ContentLayer.BODY: 'body'>, meta=None, label=<DocItemLabel.CAPTION: 'caption'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=164.80041598055863, t=605.5712206071299, r=484.61111464845345, b=594.1453492827321, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 66))], source=[], comments=[], orig='Table 1: Activities and location of use of Pocket PC by 3 students', text='Table 1: Activities and location of use of Pocket PC by 3 students', formatting=None, hyperlink=None)]
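
For the RAG pipeline mentioned earlier, this extracted text would be chunked before embedding. Docling offers its own chunking utilities, but the core idea can be sketched as a plain fixed-size splitter with overlap (the sizes below are illustrative):

```python
def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks for embedding."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + max_chars])
        # Step forward by the chunk size minus the overlap, so adjacent
        # chunks share some context at their boundary.
        start += max_chars - overlap
    return chunks

# A stand-in passage: ten repetitions of a 62-character sentence (620 chars).
passage = "17 students taking an MSc in Human Centred Systems took part. " * 10
chunks = chunk_text(passage, max_chars=200, overlap=20)
print(len(chunks))
```

Production chunkers typically split on document structure (sections, paragraphs, tables) rather than raw character counts, which is precisely where the labels Docling attaches to each block become useful.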

Step 5. Export tables to dataframes

Finally, we can export the extracted tables to dataframes for easier visualization. Now that the tables are in a tabular format, they are, by definition, structured and easier for AI applications to consume. We can also save the extracted data into other structured formats for storage.

If we were to continue with a RAG application, we would convert the table data to markdown format for passing into a large language model (LLM).

import pandas as pd

for source in sources:
    for table_ix, table in enumerate(conversions[source].tables):
        table_df: pd.DataFrame = table.export_to_dataframe(doc=conversions[source])
        print(f"## Source {source}")
        display(table_df)
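
Once a table is in a DataFrame, persisting it in another structured format is a one-liner. A sketch with a small hypothetical table standing in for the Docling output:

```python
import pandas as pd

# Hypothetical extracted table standing in for a Docling table export.
table_df = pd.DataFrame(
    {"Activity": ["Reading", "Email", "Web browsing"], "Users": [3, 3, 2]}
)

# Serialize to common structured formats; both methods also accept a
# file path to write straight to disk (e.g. "table_0.csv").
csv_text = table_df.to_csv(index=False)
json_text = table_df.to_json(orient="records")
print(csv_text)
print(json_text)
```

CSV suits spreadsheet workflows, while the records-oriented JSON maps directly onto API payloads or document stores.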

Conclusion

In this tutorial, you converted unstructured data held in scanned documents into an AI-ready structured output.

Although we only converted a few documents, the concepts explored in this simplified use case serve as the foundation for creating an automated extract, transform, load (ETL) workflow for enterprise data. For larger unstructured datasets, validation is an important step to perform after conversion to ensure high data quality and accuracy.
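
That validation step can start as simple programmatic checks on each exported DataFrame; the expected column names below are hypothetical:

```python
import pandas as pd

def validate_table(df: pd.DataFrame, required_columns: list[str]) -> list[str]:
    """Return a list of data-quality problems found in an extracted table."""
    problems = []
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if df.empty:
        problems.append("table has no rows")
    if df.isna().any().any():
        problems.append("table contains null values")
    return problems

# A deliberately flawed sample: one null cell, one expected column absent.
df = pd.DataFrame({"Activity": ["Reading", None], "Users": [3, 2]})
print(validate_table(df, ["Activity", "Users", "Location"]))
```

At enterprise scale, such checks are usually codified in a data-quality framework and run automatically after every conversion batch, flagging documents that need manual review or re-extraction.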

Author

Erika Russi

Data Scientist

IBM
