How to convert unstructured data to structured data

In this tutorial, you will use IBM’s open source Docling with Python to convert unstructured data contained in a group of scanned files into a structured format.

Structured vs. unstructured data

Structured data is information organized into fixed fields within a record or file. It resides in SQL databases, JSON payloads from APIs, XML and CSV files and Excel spreadsheets. Because of this organization, structured data is ready for efficient processing, analysis and management.
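
To make "fixed fields" concrete, the same record can be serialized into several of these structured formats with Python's standard library alone. The invoice record below is a made-up example:

```python
import csv
import io
import json

# A single structured record: every value sits in a named, fixed field.
record = {"invoice_id": "INV-001", "customer": "Acme Corp", "total": 149.99}

# The same record as JSON (common for APIs).
as_json = json.dumps(record)

# The same record as a CSV row (common for spreadsheets and flat files).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```

Either representation can be loaded back into a program, queried or joined without any parsing guesswork, which is exactly what unstructured inputs lack.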

By contrast, unstructured data is information that does not conform to a predefined data model or schema. It lacks an organized, tabular form and is typically text-heavy. Examples include emails, social media posts and customer reviews as well as non-text formats such as audio recordings, video files and images.

Unstructured data makes up the vast majority, roughly 90%, of enterprise information and is growing faster than any other type of data.1 Certain industries, such as healthcare or logistics and supply chain, can have a plethora of scanned documents saved as images awaiting processing. Although rich in information, unstructured data is challenging for conventional databases and data analysis tools to process directly.

The importance of conversion

The conversion from unstructured to structured data is vital because structured information is readily interpretable by machines and algorithms, enabling efficient search, analytics and machine learning over content that would otherwise remain opaque.

The conversion process

The goal of the unstructured to structured data conversion process is to transform raw, unstructured inputs into structured or semi-structured outputs that analytics and AI systems can consume.

After collecting the unstructured text or data sources, the data must be processed. This stage transforms raw data into usable data through a series of functions that help ensure every dataset maintains accuracy and structure throughout the process. Common techniques include text cleaning and normalization, optical character recognition (OCR) for scanned images, and table extraction.
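
One of the simplest of these techniques, text cleaning and normalization, can be sketched in a few lines. The specific rules below are illustrative, not prescribed by this tutorial:

```python
import re

def normalize(text: str) -> str:
    """Collapse whitespace and strip control characters from raw extracted text."""
    text = re.sub(r"[\x00-\x1f]", " ", text)   # replace control characters with spaces
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()

raw = "Invoice\x0c  total:\n\n  $149.99 "
print(normalize(raw))  # Invoice total: $149.99
```

Real pipelines layer on many more rules (encoding repair, de-hyphenation, language detection), but the shape is the same: deterministic functions applied uniformly to every document.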


Prerequisites

To run this tutorial effectively, you need Python installed. This tutorial was tested with Python 3.13.

Steps

Step 1. Set up your environment

There are several ways to run the code in this tutorial. Either use IBM® watsonx.ai® to follow along step by step or clone our GitHub repository to run the full Jupyter Notebook.

Option 1: Use watsonx.ai

Follow these steps to set up an IBM account and use a Jupyter Notebook.

  1. You need an IBM Cloud® account to create a watsonx.ai project.
  2. Create a watsonx.ai project by using your IBM Cloud account.
  3. Create a Jupyter Notebook.
    This step opens a Jupyter Notebook environment where you can copy the code from this tutorial. Alternatively, you can download the notebook to your local system and upload it to your watsonx.ai project as an asset.

Option 2: Run the tutorial locally

1. Several Python versions can work for this tutorial. At the time of publishing, we recommend downloading Python 3.13, the latest version.

2. In your preferred IDE, clone the GitHub repository by using https://github.com/IBM/ibmdotcom-tutorials.git as the HTTPS URL. For detailed steps on how to clone a repository, refer to the GitHub documentation.

3. Inside a terminal, create a virtual environment to avoid Python dependency issues.

python3.13 -m venv myvenv
source myvenv/bin/activate

4. Then, navigate to this tutorial’s directory.

cd docs/tutorials/docling

Step 2. Install and import relevant libraries

We need a few libraries and modules for this tutorial. Import the following ones; if any are not installed, a quick pip installation takes care of it. The -q flag quiets or suppresses the progress bars.

Some helpful libraries here include docling and pandas. We will be using open source Docling's OCR support for parsing JPG files, but similar OCR tools are available to use with OpenAI and AWS. Pandas will be used to visualize the extracted data from scanned images as a structured dataframe.

We will use scanned images from this Kaggle Scanned Document: Table Dataset.

! pip install -q docling pandas
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

Step 3. Convert scanned images

In this example, we use Docling to convert a set of source JPG files from scanned images into text and tables.

We will establish a list, sources, containing the URLs of two JPG images. To convert from images, we will use PdfPipelineOptions, setting do_table_structure, do_ocr and generate_picture_images all to True. Once the format options are established in format_options, we can use DocumentConverter to convert the sources and save the results in a dictionary, conversions.


sources = ["https://storage.googleapis.com/kagglesdsdata/datasets/6696558/10791317/10.1.1.1.2019_2.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20260216%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20260216T162351Z&X-Goog-Expires=345600&X-Goog-SignedHeaders=host&X-Goog-Signature=452bf02297d455b8d09229a59e758b3bf268c45e67c97b9a237b36b2dc9d365bee8d0d83e21fa55df78ec9224e8a64edfc7f25312dc3bfc6a4026a62a79ebb8a72a85856e139b18753d5f60c795bb452c9176b7fd360d400f7af5399e22c615284001af5d68df20c8b25975b86452856391a2bd37ba0354453011dcba6757f6617f7d90d82d660df1777c618d7877b835fd848cfda7cbfae7e14f3566542d9108d8e3f53e8956779a1a6d1afedf42d680411bf280f9827a7410631d708e94c1d03d83f6b40226ccb0bcfa4e9a76fa62aa2c4ea5d8869566f0a4fe9dd9dd0a47a9d46714459173ce0182fcd4e9658c2b2e1a19c5ab27b454bab24e53ffba23cc2",
"https://storage.googleapis.com/kagglesdsdata/datasets/6696558/10791317/10.1.1.1.2019_3.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20260212%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20260212T215329Z&X-Goog-Expires=345600&X-Goog-SignedHeaders=host&X-Goog-Signature=afb95356cb9ab3c186c37380b3b5d01a80a68a25cf5793a0877dad8e7948e85ed0b31e9352c347f7ef6810490391998dd05ea4abe15993ff2041eafda3831ba7eb9e583e2e200dd330664f0210e389e4c3a8cedae0ba09ac48d4bb3b7e95b0b9c07abb21fa9d9149c6f936bad6a3e7c29de4d96bbe2da8962168c166a482d89a9257d21a19ccebf5c24e4248d0c87aafffd29039a95559a2bea28d30e89e08afce27d894755809b1e19813e7fec56d20bc7922af62152f3d35d180bf8bbb2a2ef36ae4d71cc443b3ca6a2b93ea40d452a81b795f2a098aaa9969798d31b125c43c50590a0b1a59158e96da682499e5d9e96e4103cad68c9185c522921f73c60f"]

pdf_pipeline_options = PdfPipelineOptions(
    do_table_structure=True,
    do_ocr=True,
    generate_picture_images=True,
)

format_options = {InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_pipeline_options),}

converter = DocumentConverter(format_options=format_options)

conversions = { source: converter.convert(source=source).document for source in sources }

Step 4. Clean text output

Next, we can see some of the text output from the conversions and how it has been organized and structured. Note how each extracted text block now carries a label, such as section header, text or caption.

If we were setting up a RAG pipeline here, we would chunk the text next before vectorizing and storing in a vector database.

conversions[sources[0]].texts

Output:

[SectionHeaderItem(self_ref='#/texts/0', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, meta=None, label=<DocItemLabel.SECTION_HEADER: 'section_header'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=164.80041162882412, t=959.773255788561, r=246.38477240931664, b=949.9796513245942, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 12))], source=[], comments=[], orig='2 Susan Bull', text='2 Susan Bull', formatting=None, hyperlink=None, level=1),
TextItem(self_ref='#/texts/1', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, meta=None, label=<DocItemLabel.TEXT: 'text'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=164.8004092391199, t=923.8633721730953, r=629.8312739300967, b=804.7078487680192, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 650))], source=[], comments=[], orig='17 students taking an MSc in Human Centred Systems took part. 8 had taken an MSc module in User Modelling. All had Pocket PCs. 10 undergraduate students taking a degree in Computer Interactive Systems, who had completed undergraduate modules on Personalisation and Adaptive Systems and Interactive Learning Environments, voluntarily took part. Data was obtained by anonymous questionnaire from all subjects, and anonymous logbooks on Pocket PC use over 6 weeks from MSc students. Due to the low numbers it is inappropriate to perform a statistical analysis of the results: the aim is to discover if initial data indicates further work to be valuable.', text='17 students taking an MSc in Human Centred Systems took part. 8 had taken an MSc module in User Modelling. All had Pocket PCs. 10 undergraduate students taking a degree in Computer Interactive Systems, who had completed undergraduate modules on Personalisation and Adaptive Systems and Interactive Learning Environments, voluntarily took part. Data was obtained by anonymous questionnaire from all subjects, and anonymous logbooks on Pocket PC use over 6 weeks from MSc students. Due to the low numbers it is inappropriate to perform a statistical analysis of the results: the aim is to discover if initial data indicates further work to be valuable.', formatting=None, hyperlink=None),
SectionHeaderItem(self_ref='#/texts/2', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, meta=None, label=<DocItemLabel.SECTION_HEADER: 'section_header'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=166.4027855785592, t=772.2323590576536, r=231.7289027527237, b=760.466768979679, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 11))], source=[], comments=[], orig='2.1 Results', text='2.1 Results', formatting=None, hyperlink=None, level=1),
TextItem(self_ref='#/texts/3', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, meta=None, label=<DocItemLabel.TEXT: 'text'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=163.16872994516027, t=742.7795250670322, r=629.8312719924684, b=621.8154305269595, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 654))], source=[], comments=[], orig='Location-Aware User Modelling System. Logbook data shows the most common location of Pocket PC use to be at home, followed by various rooms in EECE. Some students also used their Pocket PC in other parts of the campus and elsewhere. Results of 3 typical users are presented in Table 1, as an example of similarities and differences between Pocket PC use. 10 of the generally common activities are listed: reading, email, web browsing, notes, calendar, computer assisted learning, word processing, calculator, music, games. Each user also performed a few additional tasks in other categories, not shown (e.g. MSN Messenger, Excel, viewing lecture slides).', text='Location-Aware User Modelling System. Logbook data shows the most common location of Pocket PC use to be at home, followed by various rooms in EECE. Some students also used their Pocket PC in other parts of the campus and elsewhere. Results of 3 typical users are presented in Table 1, as an example of similarities and differences between Pocket PC use. 10 of the generally common activities are listed: reading, email, web browsing, notes, calendar, computer assisted learning, word processing, calculator, music, games. Each user also performed a few additional tasks in other categories, not shown (e.g. MSN Messenger, Excel, viewing lecture slides).', formatting=None, hyperlink=None),
TextItem(self_ref='#/texts/4', parent=RefItem(cref='#/tables/0'), children=[], content_layer=<ContentLayer.BODY: 'body'>, meta=None, label=<DocItemLabel.CAPTION: 'caption'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=164.80041598055863, t=605.5712206071299, r=484.61111464845345, b=594.1453492827321, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 66))], source=[], comments=[], orig='Table 1: Activities and location of use of Pocket PC by 3 students', text='Table 1: Activities and location of use of Pocket PC by 3 students', formatting=None, hyperlink=None)]
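
For the RAG pipeline mentioned earlier, this extracted text would be chunked before embedding. Docling offers its own chunking utilities, but the core idea can be sketched as a plain fixed-size splitter with overlap (the sizes below are illustrative):

```python
def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks for embedding."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + max_chars])
        # Step forward by the chunk size minus the overlap, so adjacent
        # chunks share some context at their boundary.
        start += max_chars - overlap
    return chunks

# A stand-in passage: ten repetitions of a 62-character sentence (620 chars).
passage = "17 students taking an MSc in Human Centred Systems took part. " * 10
chunks = chunk_text(passage, max_chars=200, overlap=20)
print(len(chunks))
```

Production chunkers typically split on document structure (sections, paragraphs, tables) rather than raw character counts, which is precisely where the labels Docling attaches to each block become useful.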

Step 5. Export tables to dataframes

Finally, we can export the extracted tables to dataframes for easier visualization. Now that the tables are in a tabular format, they are, by definition, structured and easier for AI applications to consume. We can also save the extracted data into other structured formats for storage.

If we were to continue with a RAG application, we would convert the table data to markdown format for passing into a large language model (LLM).

import pandas as pd

for source in sources:
    for table_ix, table in enumerate(conversions[source].tables):
        table_df: pd.DataFrame = table.export_to_dataframe(doc=conversions[source])
        print(f"## Source {source}")
        display(table_df)
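
Once a table is in a DataFrame, persisting it in another structured format is a one-liner. A sketch with a small hypothetical table standing in for the Docling output:

```python
import pandas as pd

# Hypothetical extracted table standing in for a Docling table export.
table_df = pd.DataFrame(
    {"Activity": ["Reading", "Email", "Web browsing"], "Users": [3, 3, 2]}
)

# Serialize to common structured formats; both methods also accept a
# file path to write straight to disk (e.g. "table_0.csv").
csv_text = table_df.to_csv(index=False)
json_text = table_df.to_json(orient="records")
print(csv_text)
print(json_text)
```

CSV suits spreadsheet workflows, while the records-oriented JSON maps directly onto API payloads or document stores.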

Conclusion

In this tutorial, you converted unstructured data held in scanned documents into an AI-ready structured output.

Although we only converted a few documents, the concepts explored in this simplified use case serve as the foundation for creating an automated extract, transform, load (ETL) workflow for enterprise data. For larger unstructured datasets, validation is an important step to perform after conversion to ensure high data quality and accuracy.
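
That validation step can start as simple programmatic checks on each exported DataFrame; the expected column names below are hypothetical:

```python
import pandas as pd

def validate_table(df: pd.DataFrame, required_columns: list[str]) -> list[str]:
    """Return a list of data-quality problems found in an extracted table."""
    problems = []
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if df.empty:
        problems.append("table has no rows")
    if df.isna().any().any():
        problems.append("table contains null values")
    return problems

# A deliberately flawed sample: one null cell, one expected column absent.
df = pd.DataFrame({"Activity": ["Reading", None], "Users": [3, 2]})
print(validate_table(df, ["Activity", "Users", "Location"]))
```

At enterprise scale, such checks are usually codified in a data-quality framework and run automatically after every conversion batch, flagging documents that need manual review or re-extraction.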

Author

Erika Russi

Data Scientist

IBM
