The following notes cover lessons learned from ingesting documents that contain complex tables, along with the different approaches tested during one of our engagements. Some observations may differ from engagement to engagement; however, most of the conclusions should hold for engagements whose documents contain nested and complex tables.
Watson Discovery
Watson Discovery does not support documents larger than 50 MB. When the large size is due to the original documents being in Word format, we need to convert the Word documents to PDF in order to use Watson Discovery (a conversion sketch is shown below). However, with this approach Watson Discovery is sometimes unable to capture table formats properly, especially for complex or nested tables.
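One common way to do the Word-to-PDF conversion in bulk is a headless LibreOffice run. The snippet below is a minimal sketch of that approach, not necessarily what was used in the engagement; it assumes LibreOffice is installed and on the PATH, and the paths are hypothetical.

import subprocess
from pathlib import Path

def convert_docx_to_pdf(docx_path: str, output_dir: str) -> Path:
    """Convert one .docx file to PDF with a headless LibreOffice run (assumes LibreOffice is installed)."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         str(docx_path), "--outdir", str(output_dir)],
        check=True,
    )
    return Path(output_dir) / (Path(docx_path).stem + ".pdf")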
For ingestion with Watson Discovery, two different approaches can be tested: a pre-trained model, and a user-trained model created by manually annotating a few pages through Discovery's Smart Document Understanding (SDU). The following summarizes the lessons learned from these two approaches on an engagement that included complex nested tables.
- Pre-trained model
  - Pros:
    - The HTML output of Discovery could often capture the structure of the tables better than a user-trained (SDU) model.
  - Cons:
    - Discovery still had difficulty detecting nested tables. In some cases it would only recognize the innermost table as a table and capture the outer tables as text, losing the data structure.
    - When using Watson Discovery, the raw text output may give better results than the HTML output.
- User-trained model
  - Pros:
    - Very easy to annotate, and good for simple documents where you want to filter specific sections later.
  - Cons:
    - Could not detect nested table structures even when we annotated all of the inner tables.
    - Even for simple tables it had problems detecting them in new documents, even after manually annotating 20 pages from more than 7 documents.
Although the Watson Discovery results were not as accurate for complex and nested tables as the custom libraries mentioned below, they included a lot of metadata, such as page numbers and text coordinates, which is helpful for showing highlights in the UI. This metadata can also be useful for adding extra filters during retrieval, or for splitting the documents based on certain HTML sections (a splitting sketch follows).
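As an illustration of splitting on HTML sections, the sketch below groups an HTML string (for example, Discovery's HTML output) into per-heading sections with BeautifulSoup. The exact structure of the HTML will vary, so treat this as a minimal sketch under that assumption rather than the method used in the engagement.

from bs4 import BeautifulSoup

def split_html_by_headings(html: str, heading_tags=("h1", "h2", "h3")) -> list[dict]:
    """Group an HTML document into sections keyed by the nearest preceding heading."""
    soup = BeautifulSoup(html, "html.parser")
    root = soup.body or soup
    sections, current = [], {"heading": "", "text": []}
    for element in root.find_all(recursive=False):
        if element.name in heading_tags:
            # Close the previous section and start a new one at each heading.
            if current["text"]:
                sections.append({"heading": current["heading"],
                                 "text": " ".join(current["text"])})
            current = {"heading": element.get_text(strip=True), "text": []}
        else:
            current["text"].append(element.get_text(" ", strip=True))
    if current["text"]:
        sections.append({"heading": current["heading"],
                         "text": " ".join(current["text"])})
    return sections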
PDF and Word document loaders
The PyMuPDF library is much faster and produces more accurate results, especially on larger files. Using the following two functions on a 50.6 MB data set, PyMuPDF took 0.131 seconds versus 0.366 seconds for PyPDFLoader from LangChain (a short usage example follows the two functions). So for large files, and if you do not need to extract separate sections of the document, use the PyMuPDF library.
import logging
import re
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def pdf_to_text(path: str,
                start_page: int = 1,
                end_page: Optional[int] = None) -> tuple[list[str], list[str], list[dict]]:
    """
    Converts a PDF to plain text using LangChain's PyPDFLoader.

    Params:
        path (str): Path to the PDF file.
        start_page (int): Page to start getting text from.
        end_page (int): Last page to get text from.
    """
    logger.debug("Processing PDF %s", path)
    from langchain_community.document_loaders import PyPDFLoader

    loader = PyPDFLoader(path)
    pages = loader.load()
    total_pages = len(pages)
    logger.debug(f"Total pages: {total_pages}")

    _ids = []
    _metadata = []
    _file_name = Path(path).name
    if end_page is None:
        end_page = len(pages)
    _text_list = []
    for index in range(start_page - 1, end_page):
        # Collapse line breaks and repeated whitespace into single spaces.
        text = pages[index].page_content
        text = text.replace('\n', ' ')
        text = re.sub(r'\s+', ' ', text)
        _text_list.append(text)
        _ids.append(f'page_{index + 1}')
        _metadata.append({
            "id": index + 1,
            "file_name": _file_name,
            "page": f'page {index + 1}'
        })
    return _text_list, _ids, _metadata
def large_pdf_to_text(path: str,
                      start_page: int = 1,
                      end_page: Optional[int] = None,
                      remove_string_list=None) -> tuple[list[str], list[str], list[dict]]:
    """
    Converts a PDF to plain text using PyMuPDF, whose execution speed makes it
    suitable for large-scale PDF processing. While it does not offer dedicated
    table extraction features, it can still be used to analyze the PDF structure
    and extract tabular data with additional processing steps if needed.

    50.6 MB -> 0.131 seconds (PyMuPDF) vs. 0.366 seconds (pypdf)

    Params:
        path (str): Path to the PDF file.
        start_page (int): Page to start getting text from.
        end_page (int): Last page to get text from.
        remove_string_list (list[str]): Regex patterns to strip from the extracted text.
    """
    logger.debug("Processing PDF %s", path)
    if remove_string_list is None:
        remove_string_list = []
    import fitz  # PyMuPDF
    # pdf_reader = fitz.open(stream=file_stream, filetype="pdf")  # alternative: open from a stream
    pdf_reader = fitz.open(path)
    if end_page is None:
        end_page = pdf_reader.page_count
    _text_list = []
    _ids = []
    _metadata = []
    _file_name = Path(path).name
    for index in range(start_page - 1, end_page):
        # Collapse line breaks and repeated whitespace into single spaces.
        text = pdf_reader[index].get_text("text")
        text = text.replace('\n', ' ')
        text = re.sub(r'\s+', ' ', text)
        # Strip any caller-supplied patterns (e.g. repeated headers or footers).
        for remove_string in remove_string_list:
            re_compile = re.compile(remove_string)
            text = re_compile.sub('', text)
        _metadata.append({
            "id": index + 1,
            "file_name": _file_name,
            "page": f'page {index + 1}'
        })
        _ids.append(f'page_{index + 1}')
        _text_list.append(text)
    return _text_list, _ids, _metadata
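A short usage example for the functions above; the file path and the pattern in remove_string_list are hypothetical.

import time

start = time.perf_counter()
texts, ids, metadata = large_pdf_to_text("data/large_report.pdf",
                                         remove_string_list=[r"CONFIDENTIAL"])
elapsed = time.perf_counter() - start
print(f"PyMuPDF extracted {len(texts)} pages in {elapsed:.3f} s")
print(ids[0], metadata[0])  # e.g. page_1 {'id': 1, 'file_name': 'large_report.pdf', 'page': 'page 1'}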
- Ingesting the PDF versions
  - Pros:
    - Includes a lot of metadata such as page numbers and the x, y coordinates of the text, similar to Watson Discovery, which could be useful for showing highlights down the line.
  - Cons:
    - Unable to capture many nested or complex tables properly. This approach also requires converting each of the original docx files to PDF.
- Ingesting the original Word documents
  - Pros:
    - Loading the docx files directly captured the tables better, which provides better context for LLMs to generate answers (see the sketch below).
  - Cons:
    - Metadata such as page numbers was not captured properly, even after adding page breaks, compared to the PDF loader.
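The notes do not name the loader used for the Word documents; one possible option (an assumption here) is LangChain's UnstructuredWordDocumentLoader in elements mode, which keeps tables as separate elements. A minimal sketch, with a hypothetical file path:

from langchain_community.document_loaders import UnstructuredWordDocumentLoader

# Hypothetical path; mode="elements" returns one Document per detected element,
# with the element type (e.g. "Table") under the "category" metadata key.
loader = UnstructuredWordDocumentLoader("docs/design_spec.docx", mode="elements")
elements = loader.load()
tables = [el for el in elements if el.metadata.get("category") == "Table"]
print(f"{len(elements)} elements, {len(tables)} tables detected")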
HTML table summarization
In an attempt to preserve the structure of the tables, we can feed the table elements from the extracted HTML (obtained with the unstructured library) into an LLM (Mixtral is a good option for long tables because of its context size). We can then prompt the model either to summarize the tables or, with few-shot examples, to convert the tables into a format that large language models may understand better (a sketch follows the list below).
- Pros:
  - The output text is easier to chunk and ingest and preserves the structure of the tables.
- Cons:
  - For large, multi-page tables the summarization prompt may lose some information, so using few-shot learning to convert the tables instead of summarizing them may produce better outputs.
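The sketch below illustrates the table-summarization step: it partitions an HTML file with unstructured, pulls out the table elements, and builds one prompt per table. The prompt wording and file path are assumptions, and whether text_as_html is populated depends on the unstructured version; sending the prompts to Mixtral (or another model) is left to whichever serving API the project uses.

from unstructured.partition.html import partition_html

SUMMARIZE_PROMPT = """You are given an HTML table. Restate it as short, self-contained
sentences so that every cell value appears together with its row and column labels.
Do not omit any values.

Table:
{table_html}
"""

def build_table_prompts(html_path: str) -> list[str]:
    """Extract table elements from an HTML file and build one prompt per table."""
    elements = partition_html(filename=html_path)
    prompts = []
    for el in elements:
        if el.category == "Table":
            # text_as_html keeps the cell structure when the library populates it;
            # fall back to the plain text otherwise.
            table_html = getattr(el.metadata, "text_as_html", None) or el.text
            prompts.append(SUMMARIZE_PROMPT.format(table_html=table_html))
    return prompts

# Each prompt is then sent to the LLM (e.g. Mixtral) and the response replaces the
# raw table text before chunking and ingestion.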
Other table-reader libraries such as Camelot and Tabula were also tested; however, they were not as accurate as unstructured for complex tables, and since they only detect tables within a document, more preprocessing might be needed to create meaningful chunks to pass to the large language models.
To conclude, while Watson Discovery is an easy and fast way to ingest documents and captures a lot of additional metadata that can be helpful for building the UI (for highlighting passages, for example), the unstructured library (which has wrappers in both LlamaIndex and LangChain) could capture the complex tables within the documents more accurately than Discovery. So a combination of both approaches can be helpful for projects whose documents contain complex tables.