Data Ingestion


Overview

Ingestion is the process of parsing information from source documents so that it can be embedded into a search space for later retrieval. While this is straightforward for plain text, complications arise when the source documents are in non-text formats, e.g., Microsoft Word or PDF, and when they contain complex formatting such as repeating headers and footers, text in multiple columns, or tables.

This section contains tips, techniques, accelerators, and pointers to assets that help overcome these challenges.

Where to Start?

 

If you are a non-technical user, your documents are relatively simple, and you need a no-code solution, use watsonx Orchestrate.

 

If you are a technical user, your documents are relatively simple, and you have access to Watson Discovery, start from there. Watson Discovery has a friendly UI and requires minimal coding. For complex documents, you can use Watson Discovery to annotate and extract specific parts of your documents, either for your RAG backend pipeline or for the front end, where the exact position of a paragraph within your documents can be used to highlight text.

 

However, Watson Discovery has its limitations. If your documents are larger than 50 MB, you will not be able to load them into Watson Discovery. If a document is too complex (e.g., it includes nested tables or irregular table formats), Watson Discovery may not capture the whole document structure. For such cases, you may want to implement a custom data ingestion pipeline using open source libraries.

 

With regards to open source libraries, LangChain and LlamaIndex are very similar in terms of data ingestion capabilities. However, LangChain is generally easier to use than LlamaIndex in terms of documentation and integration with vector databases and other applications. Both libraries provide helper functions that convert a Document object from one framework to the other, so you can switch between LangChain and LlamaIndex at any point in your pipeline, as in the sketch below.
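
A minimal sketch of the conversion, assuming a recent llama-index-core release where the Document class exposes the to_langchain_format() and from_langchain_format() helpers:

from langchain_core.documents import Document as LCDocument
from llama_index.core import Document as LIDocument

# LangChain -> LlamaIndex
lc_doc = LCDocument(page_content="some parsed text", metadata={"page": 1})
li_doc = LIDocument.from_langchain_format(lc_doc)

# LlamaIndex -> LangChain
lc_doc_again = li_doc.to_langchain_format()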

Ingesting Documents with Complex Tables

The following notes capture several lessons learned from ingesting documents with complex tables, along with the different approaches that were tested during one of our engagements. Some of the observations may differ from engagement to engagement; however, most of the conclusions should hold for engagements that involve documents with nested and complex tables.

Watson Discovery

Watson Discovery does not support documents larger than 50 MB. In cases where the large size is due to the fact that the original documents are in Word format, we need to convert the original Word documents to PDF in order to use Watson Discovery. However, with this approach Watson Discovery is sometimes unable to capture table formats properly, especially for complex or nested tables.

For ingestion using Watson Discovery, two different approaches can be tested: a pre-trained model, and a user-trained model built by manually annotating a few pages through Discovery's Smart Document Understanding (SDU). The following summarizes lessons learned from these two approaches in an engagement that included complex nested tables.

  • Pre-trained model
    • Pros:
      • The HTML output of Discovery could often capture the structure of the tables better than a user-trained model (SDU).
    • Cons:
      • Discovery still had difficulty detecting nested tables. In some cases it would only recognize the inner-most table as a table and capture the outer tables as text, losing the data structure.
    • Note: when using Watson Discovery, the raw text output may give better results than the HTML output.

  • User-trained model
    • Pros:
      • Very easy to annotate, and good for simple documents where you would like to filter specific sections later.
    • Cons:
      • Could not detect nested table structures even when we annotated all the tables within the tables.
      • Even for simple tables, it had problems detecting them in new documents, even after manually annotating 20 pages from more than 7 documents.

Although Watson Discovery's results were not as accurate as the custom libraries mentioned below for complex and nested tables, they included a lot of metadata, such as page numbers and coordinates of the texts, which is helpful for showing highlights in the UI. This metadata can also be useful for adding extra filters during retrieval or for splitting the documents based on certain HTML sections.

PyPDFLoader vs PyMuPDF

The PyMuPDF library is much faster and produces more accurate results, especially on larger data. Using the following two functions on a 50.6 MB data set, PyMuPDF took 0.131 seconds vs. 0.366 seconds for PyPDFLoader from LangChain. So for large files, and if you don't need to extract separate sections of the document, use the PyMuPDF library.

import logging
import re
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def pdf_to_text(path: str,
                start_page: int = 1,
                end_page: Optional[int] = None) -> tuple[list[str], list[str], list[dict]]:
    """
    Converts a PDF to plain text using LangChain's PyPDFLoader (pypdf).

    Params:
        path (str): Path to the PDF file.
        start_page (int): Page to start getting text from.
        end_page (int): Last page to get text from.
    """
    logger.debug("Processing PDF %s", path)

    from langchain_community.document_loaders import PyPDFLoader

    loader = PyPDFLoader(path)
    pages = loader.load()
    total_pages = len(pages)
    logger.debug(f'Total pages: {total_pages}')
    _ids = []
    _metadata = []
    _file_name = Path(path).name

    if end_page is None:
        end_page = len(pages)

    _text_list = []
    for index in range(start_page - 1, end_page):
        text = pages[index].page_content
        text = text.replace('\n', ' ')
        text = re.sub(r'\s+', ' ', text)
        _text_list.append(text)
        _ids.append(f'page_{index + 1}')
        _metadata.append({
            "id": index + 1,
            "file_name": _file_name,
            "page": f'page {index + 1}'
        })

    return _text_list, _ids, _metadata

def large_pdf_to_text(path: str,
                      start_page: int = 1,
                      end_page: Optional[int] = None,
                      remove_string_list=None) -> tuple[list[str], list[str], list[dict]]:
    """
    Converts a PDF to plain text using PyMuPDF. Its execution speed makes it
    suitable for large-scale PDF processing. While it does not offer dedicated
    table extraction features, it can still be used to analyze the PDF structure
    and extract tabular data with additional processing steps if needed.

    50.6 MB -> 0.131 seconds (PyMuPDF) vs. 0.366 seconds (pypdf)

    Params:
        path (str): Path to the PDF file.
        start_page (int): Page to start getting text from.
        end_page (int): Last page to get text from.
        remove_string_list (list[str]): Regex patterns to strip from the text
            (e.g. repeating headers and footers).
    """
    logger.debug("Processing PDF %s", path)
    if remove_string_list is None:
        remove_string_list = []

    import fitz  # PyMuPDF

    pdf_reader = fitz.open(path)
    if end_page is None:
        end_page = pdf_reader.page_count

    _text_list = []
    _ids = []
    _metadata = []
    _file_name = Path(path).name
    for index in range(start_page - 1, end_page):
        text = pdf_reader[index].get_text("text")
        text = text.replace('\n', ' ')
        text = re.sub(r'\s+', ' ', text)
        for remove_string in remove_string_list:
            text = re.sub(remove_string, '', text)
        _text_list.append(text)
        _ids.append(f'page_{index + 1}')
        _metadata.append({
            "id": index + 1,
            "file_name": _file_name,
            "page": f'page {index + 1}'
        })

    return _text_list, _ids, _metadata
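
Example usage of the two helpers above (the file paths and the footer pattern are illustrative placeholders):

# Smaller PDF: the pypdf-based loader is sufficient
texts, ids, metadata = pdf_to_text("docs/small_report.pdf")

# Large PDF: PyMuPDF-based loader, stripping a repeating footer
texts, ids, metadata = large_pdf_to_text(
    "docs/large_manual.pdf",
    remove_string_list=[r"Page \d+ of \d+"],
)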

Unstructured PDF loader for HTML ingestion (UnstructuredPDFLoader from LangChain)

  • Pros: 
    • Includes a lot of metadata, such as page numbers and x,y coordinates of the texts, similar to Watson Discovery, which could be useful for showing highlights down the line.
  • Cons: 
    • Unable to capture many nested or complex tables properly. Also, for this approach we would need to convert each of the original .docx files to PDF.
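
A minimal sketch of using this loader in element mode (the file name is a placeholder); with mode="elements", each returned Document carries element-level metadata such as the element category and page number:

from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader("example_data/contract.pdf", mode="elements")
elements = loader.load()

# Inspect the element-level metadata attached to each Document
for el in elements[:5]:
    print(el.metadata.get("category"), el.metadata.get("page_number"), el.page_content[:60])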

Unstructured Word loader for HTML ingestion (UnstructuredWordDocumentLoader from LangChain)

  • Pros: 
    • Using the Word documents directly captured the tables better, which gives LLMs better context for generating answers.
  • Cons: 
    • Metadata such as page numbers was not captured properly, even after adding page breaks, compared to the PDF loader.

HTML table summarization

In an attempt to preserve the structure of the tables, we can feed table elements from the extracted HTML (using the unstructured library) into an LLM (Mixtral is a good option for long tables due to its context size). We can then create a prompt that either summarizes the tables or uses few-shot learning to convert the tables into a format that we think is more understandable to large language models (a sketch follows the pros and cons below).

  • Pros: 
    • The output text is easier to chunk and ingest and preserves the structure of the tables.
  • Cons: 
    • For large, multi-page tables, the summarization prompt might lose some information, so using few-shot learning to convert the tables instead of summarizing them might produce better outputs.
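
A rough sketch of the summarization variant, assuming the unstructured library for table extraction and a placeholder llm client with a generate() method standing in for a Mixtral-style model:

from unstructured.partition.pdf import partition_pdf

# Extract elements, keeping table structure as HTML (requires the hi_res strategy)
elements = partition_pdf("example_data/report.pdf", strategy="hi_res", infer_table_structure=True)
tables = [el for el in elements if el.category == "Table"]

table_summaries = []
for table in tables:
    table_html = table.metadata.text_as_html  # structure-preserving HTML
    prompt = (
        "Summarize the following HTML table in plain sentences, "
        "keeping every row and column value intact:\n\n" + table_html
    )
    table_summaries.append(llm.generate(prompt))  # llm is a placeholder model client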

Other table reader libraries, such as Camelot and Tabula, were also tested; however, these libraries were not as accurate as unstructured for complex tables, and since they only detect tables within a document, more preprocessing might be needed to create meaningful chunks to pass to the large language models.

To conclude, while Watson Discovery is an easy and fast way to ingest documents and captures a lot of additional metadata that can be helpful for building the UI (for highlighting passages, for example), the unstructured library (which has wrappers in both LlamaIndex and LangChain) captures complex tables within documents more accurately than Discovery. A combination of both approaches can therefore be helpful for projects whose documents include complex tables.

IBM Tools

Ingestion with watsonx Orchestrate

Ingestion without a graphical user interface can be hard for business users, as it requires technical experience and knowledge. watsonx Orchestrate removes this friction by providing a drag-and-drop user interface (UI). Clicking the blue upload button lets you directly upload and store documents in watsonx Discovery (i.e., Elasticsearch).

This allows users to upload their documentation directly into Elasticsearch to create their own custom knowledge base.

Instead of ingesting the documents directly into watsonx Discovery, you also have the option to connect to Watson Discovery through watsonx Orchestrate. You can find more details about how to ingest data through Watson Discovery below.

Watson Discovery

Another IBM tool that can be used for ingestion is Watson Discovery. Watson Discovery is not purely an ingestion tool; with it, you can ingest, normalize, enrich, and search your unstructured data. The image below shows the core components and capabilities of Watson Discovery.

As you can see above, one of the core capabilities of Watson Discovery is ingesting structured and unstructured data in different formats, such as JSON, HTML, PDF, Word, and more, along with Smart Document Understanding.

If you have Watson Discovery provisioned in your environment, you can easily ingest documents by first creating a project and then creating a collection where you can upload your data from either local storage or a cloud location such as an object storage bucket.

Once you create your collection, you can go to the Manage collections tab from the hamburger menu and then, under Identify fields, choose which method you want to use to process the documents.

There are three options:

  • Text extraction only: extracts only text from the documents.
  • User-trained models: lets you annotate different parts of your documents with custom tags and train your own models based on repeated visual patterns within your documents.
  • Pre-trained models: extracts text and identifies tables, lists, and sections using pre-trained IBM models.

It might take a while until all your documents are processed. After that, you can use the Watson Discovery API to read and use your data in your code.

Below is sample code showing how you can use the API to read the data from your collection:

from langchain.docstore.document import Document
import os
import logging
import time
from ibm_watson import DiscoveryV2
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Watson Discovery connection details for your own service instance
WD_API_KEY = os.environ["WD_API_KEY"]
WD_SERVICE_URL = os.environ["WD_SERVICE_URL"]
WD_PROJECT_ID = os.environ["WD_PROJECT_ID"]
WD_COLLECTION_ID = os.environ["WD_COLLECTION_ID"]

WD_PAGE_SIZE = 200
MAX_RETRIES = 2


def get_documents_from_wd(collection_ids=[WD_COLLECTION_ID]):
    
    print("Fetching documents from Watson Discovery")
    documents = []
    
    print("Configuring the WD client")
    authenticator = IAMAuthenticator(WD_API_KEY)
    wd_client = DiscoveryV2(
        version="2023-03-31",
        authenticator=authenticator
    )
    
    print("Setting the WD service URL")
    wd_client.set_service_url(WD_SERVICE_URL)
    
    print("Fetching documents from WD")
    page_id = 0
    retries = 0
    while True:
        try:
            print("Fetching page: " + str(page_id))
            response = wd_client.query(
                project_id=WD_PROJECT_ID,
                collection_ids=collection_ids,
                return_=["text"],
                count=WD_PAGE_SIZE,
                offset=page_id*WD_PAGE_SIZE
            ).get_result()
            if response is None or not isinstance(response, dict):
                print("No query result")
                raise ValueError("No query result")
            if "results" not in response or response["results"] is None or not isinstance(response["results"], list):
                print("No query result 2")
                raise ValueError("No query result")
            results = response["results"]
            if len(results) == 0:
                print("No more results")
                break
            print("Fetched " + str(len(results)) + " documents")
            documents.extend(list(map(lambda result: Document(page_content=result["text"][0],  
                                                              metadata= {"collection_id": WD_COLLECTION_ID, "document_id" : result['document_id']}), results)))
            page_id += 1
        except Exception as error:
            logging.error("Failed to fetch documents from WD: %s", str(error))
            retries += 1
            time.sleep(5)
            if retries > MAX_RETRIES:
                break
            print("Retrying...")
    print("Fetched " + str(len(documents)) + " documents")
    return documents
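
Example of using the returned documents further down the pipeline, here chunking them with a LangChain splitter before embedding (the chunk sizes are illustrative):

from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = get_documents_from_wd()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)
print(f"Split {len(docs)} documents into {len(chunks)} chunks")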

 

Web Crawl in Watson Discovery

When creating a project in Watson Discovery, you can schedule a web crawl to fetch information from a specified URL. All you need to do is select Web crawl as the data source and then specify the URL and the crawl schedule.

After that, you can follow the steps mentioned above to manage your collection and use the API to read the data from your collection. To see more details on how to set up a web crawl in Watson Discovery, you can watch this video.

For more information and features you can refer to the Watson Discovery developer page.

Deep Search

Deep Search is IBM Research's open-source toolkit for ingestion. Deep Search leverages state-of-the-art AI methods to continuously collect, convert, enrich, and link large document collections. You can use it for both public and proprietary PDF documents.

Deep Search converts unstructured PDF documents into structured JSON files with accuracy and ease. It enables you to automate knowledge extraction as well as to fine-tune your proprietary Foundational Models and Large Language Models.

You can find more information on Deep Search here.

Deep Search can be a good IBM option for reading complex documents in place of unstructured or other open source libraries. You can also find very good documentation in the repo, with different types of examples as Python notebooks:

Deep Search Example Notebooks
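
As a rough sketch of a conversion run based on the toolkit's quick-start examples (the project key and file path are placeholders, and the exact API may differ between toolkit versions):

import deepsearch as ds
from deepsearch.cps.client.api import CpsApi

api = CpsApi.from_env()  # uses the profile configured with the deepsearch CLI
documents = ds.convert_documents(
    api=api,
    proj_key="<your-project-key>",
    source_path="example_data/report.pdf",
    progress_bar=True,
)
documents.download_all(result_dir="./converted_json")  # structured JSON output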

Table Ingestion from PDFs

Deep Search provides good results when extracting tables from PDFs in JSON format.

For testing, we used a partial PDF document from the CIPP consisting of three pages, each containing table data. The best RAG performance in this case was achieved by using Deep Search to extract the tables into JSON format and then converting them to HTML for use in Watson Discovery / watsonx Discovery, rather than using the PDFs directly, converting the PDFs to HTML without Deep Search, or using Deep Search's JSON output directly in Watson Discovery / watsonx Discovery. With the other approaches, Watson Discovery was faster but in some cases selected the answer from the wrong cell of the table; with the Deep Search JSON-to-HTML approach it could pinpoint the exact answer within the table grid in the same cases.

IBM Datacap

IBM Datacap is a comprehensive solution for document and data capture, offering fast, accurate, and cost-effective scanning, classification, recognition, validation, and export of data and document images. It captures documents, extracts relevant data, and integrates it into downstream business processes.

Datacap acquires paper documents via scanners, multifunction printers, or mobile devices and imports electronic documents from file systems, fax, or email servers. It enhances data extraction with image-processing features such as deskewing and removing lines, smears, and borders.

Datacap uses optical character recognition (OCR), intelligent character recognition (ICR) for handwriting, optical mark recognition (OMR) for marks, and bar code reading to efficiently extract data.

For detailed installation and usage instructions, see the IBM Datacap documentation.

Open Source Tools

LangChain Data Loaders

LangChain supports different types of document loaders, which load data from a source as LangChain Document objects. A Document object is a piece of text with associated metadata, which can be easily used for chunking and retrieval later in the pipeline.

Currently, LangChain supports loaders for loading either a whole directory or different types of files, such as:

  • CSV
  • HTML
  • JSON
  • Markdown
  • Microsoft Office
  • PDF

You can use these loaders in a few lines of Python. For example, to load a PDF you can use the following code:

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()

Please refer to LangChain's Documentation to see how to use these loaders.

Beyond the loaders mentioned in the link above, you can find additional loaders that load data directly from specific applications such as email, GitHub, Google Drive, etc. You can find the list of all the different loaders here.

In cases where you have documents with complex structures, formats, and tables, you can use the unstructured loaders from LangChain. Under the hood, these loaders use the unstructured library, which is one of the best libraries for handling complex documents (in formats such as PDF, Microsoft Word, PowerPoint, etc.) and for breaking documents down into their different components: LangChain UnstructuredLoader

LlamaIndex Data Loaders

Similar to LangChain, LlamaIndex also supports different types of loaders. One of the simplest ways to load data with LlamaIndex is to use the SimpleDirectoryReader, which supports the following types:

  • .csv - comma-separated values
  • .docx - Microsoft Word
  • .epub - EPUB ebook format
  • .hwp - Hangul Word Processor
  • .ipynb - Jupyter Notebook
  • .jpeg, .jpg - JPEG image
  • .mbox - MBOX email archive
  • .md - Markdown
  • .mp3, .mp4 - audio and video
  • .pdf - Portable Document Format
  • .png - Portable Network Graphics
  • .ppt, .pptm, .pptx - Microsoft PowerPoint

You can use the SimpleDirectoryReader with the following code snippet:

from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_dir="path/to/directory")
documents = reader.load_data()

You can also specify file extensions to read only files with those extensions from the directory:

SimpleDirectoryReader(
    input_dir="path/to/directory", required_exts=[".pdf", ".docx"]
)

For more details on how to use LlamaIndex loaders, please refer to its documentation.

Similar to Hugging Face, you can also create your own custom loaders or use other people's custom loaders on llamahub.ai.

One disadvantage of LlamaIndex compared to LangChain, though, is that its documentation and community support are not as good as LangChain's, so debugging your code might be harder.

Unstructured

In cases where you are loading complex documents, you can also use the unstructured library directly. It is particularly good at reading and parsing tables.

The unstructured library is also wrapped by both LangChain and LlamaIndex, and these wrappers might be easier to use for RAG pipelines down the line since they automatically create and return Document objects. However, if your project needs more customization and you want to use the unstructured library directly, you can refer to the following documentation: Unstructured.
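
A minimal sketch of calling unstructured directly (the file path is a placeholder); partition() dispatches on file type, or you can call a format-specific partitioner such as partition_docx or partition_pdf:

from unstructured.partition.auto import partition

elements = partition(filename="example_data/policy.docx")
for el in elements:
    # Element types include Title, NarrativeText, Table, ListItem, etc.
    print(type(el).__name__, ":", el.text[:80])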
