Date: 2023-03-31
Ingestion is the process of parsing information from source documents so that it can be embedded into a search space for later retrieval. While this is straightforward for plain text, complications arise when the source documents are in non-text formats, e.g. Microsoft Word or PDF, and when they contain complex formatting such as repeating headers and footers, text in multiple columns, or tables.
This section contains tips, techniques, accelerators, and pointers to assets that help overcome these challenges.
If you are a non-technical user, your documents are relatively simple, and you need a no-code solution, use watsonx Orchestrate.
If you are a technical user, your documents are relatively simple, and you have access to Watson Discovery, start there. Watson Discovery has a UI that requires minimal coding. For complex documents, you can use Watson Discovery to annotate and extract specific parts of your documents, either for your RAG back-end pipeline or for the front end, where the exact position of a paragraph within a document can be used to highlight the text.
However, Watson Discovery has limitations. Documents larger than 50 MB cannot be loaded into Watson Discovery. If a document is too complex (for example, it includes nested tables or irregular table formats), Watson Discovery may not capture the whole document structure. In such cases, you may want to implement a custom data ingestion pipeline using open source libraries.
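The routing advice above can be sketched as a small helper. This is only an illustration: the function names, the `has_complex_tables` flag, the returned labels, and the reading of 50 MB as 50 × 1024 × 1024 bytes are all assumptions made for this example and are not part of any IBM API.

```python
import os

# Assumption: the 50 MB limit is read as 50 * 1024 * 1024 bytes.
DISCOVERY_MAX_BYTES = 50 * 1024 * 1024


def choose_ingestion_route(size_bytes: int, has_complex_tables: bool) -> str:
    """Return a (hypothetical) label for the ingestion route suggested above."""
    if size_bytes > DISCOVERY_MAX_BYTES:
        # Too large to load into Watson Discovery.
        return "custom-pipeline"
    if has_complex_tables:
        # Nested or irregular tables may not be captured fully.
        return "custom-pipeline"
    return "watson-discovery"


def route_for_file(path: str, has_complex_tables: bool = False) -> str:
    """Convenience wrapper that reads the size from disk."""
    return choose_ingestion_route(os.path.getsize(path), has_complex_tables)
```

Whether a document has complex tables usually has to be judged by inspecting a sample, so the flag is left for the caller to supply.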
With regard to open source libraries, LangChain and LlamaIndex are very similar in their data ingestion capabilities. However, LangChain tends to be easier to use than LlamaIndex in terms of documentation and integration with vector databases and other applications. Both LangChain and LlamaIndex have helper functions that convert a Document object from one library's format to the other's, so you can switch between them at any point in your pipeline.
The following notes include several lessons learned about ingesting documents that contain complex tables, along with the different approaches tested during one engagement. Some observations may differ from engagement to engagement; however, most of the conclusions should hold for engagements whose documents include nested and complex tables.
Watson Discovery does not support documents larger than 50 MB. In cases where the large size is due to the original documents being in Word format, we need to convert them to PDF in order to use Watson Discovery. However, with this approach Watson Discovery is sometimes unable to capture table formats properly, especially for complex or nested tables.
For ingestion using Watson Discovery, two approaches can be tested: a pre-trained model, and a user-trained model created by manually annotating a few pages through Discovery's Smart Document Understanding. The following discusses lessons learned from these two approaches in an engagement that included complex nested tables.
Although Watson Discovery's results were not as accurate as those of the custom libraries mentioned below for complex and nested tables, they included a lot of metadata, such as page numbers and text coordinates, which is helpful for showing highlights in the UI. This metadata can also be helpful for adding extra filters during retrieval or for splitting documents based on certain HTML sections.
The PyMuPDF library is much faster and produces more accurate results, especially on larger data. On a 50.6 MB data set, loading took 0.131 seconds with PyMuPDF versus 0.366 seconds with PyPDFLoader from LangChain. So for large files, and if you do not need to extract parts of the document as separate sections, use the PyMuPDF library.
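The functions used for the original measurement are not reproduced here. As a rough sketch of how such a comparison could be timed, the helper names below (`load_with_pymupdf`, `load_with_pypdfloader`, `time_loader`) are hypothetical, and the two loader functions assume that the `pymupdf` and `langchain-community` packages are installed. Timings will vary with hardware and document content.

```python
import time
from typing import Callable, List, Tuple


def load_with_pymupdf(path: str) -> List[str]:
    """Extract the text of every page with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the timer works without it

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


def load_with_pypdfloader(path: str) -> List[str]:
    """Extract the text of every page with LangChain's PyPDFLoader."""
    from langchain_community.document_loaders import PyPDFLoader

    return [doc.page_content for doc in PyPDFLoader(path).load()]


def time_loader(loader: Callable[[str], List[str]], path: str) -> Tuple[List[str], float]:
    """Run one loader once and return its pages and the elapsed seconds."""
    start = time.perf_counter()
    pages = loader(path)
    return pages, time.perf_counter() - start
```

For example, `time_loader(load_with_pymupdf, "my.pdf")` and `time_loader(load_with_pypdfloader, "my.pdf")` can be run on the same file and the elapsed times compared, ideally averaged over several runs.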
In an attempt to preserve the structure of the tables, we passed table elements from the extracted HTML (obtained with the Unstructured library) into an LLM; Mixtral is a good option for long tables because of its context size. We can then write a prompt that either summarizes the tables or uses few-shot learning to convert them into a format that large language models are more likely to understand.
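A minimal sketch of the few-shot prompting idea follows. The instruction wording, the `Table:` / `Rewritten:` layout, and the function name `build_table_prompt` are assumptions made for illustration; the prompts used in the engagement are not known. The resulting string would be sent to whichever LLM you use.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text; adjust it to your documents and model.
INSTRUCTION = (
    "Rewrite the HTML table as plain sentences, one per row, "
    "keeping every value together with its column header."
)


def build_table_prompt(
    table_html: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Build a few-shot prompt: instruction, worked examples, then the new table."""
    parts = [INSTRUCTION]
    for example_html, example_text in examples:
        parts.append(f"Table:\n{example_html}\nRewritten:\n{example_text}")
    # The prompt ends with an open "Rewritten:" slot for the model to complete.
    parts.append(f"Table:\n{table_html}\nRewritten:")
    return "\n\n".join(parts)
```

Passing an empty `examples` sequence gives a zero-shot prompt, which can serve as a baseline before adding worked examples.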
Other table reader libraries such as camelot and tabula were also tested. However, these libraries were not as accurate as unstructured for complex tables and, since they only detect tables within a document, more preprocessing might be needed to create meaningful chunks to pass to the large language models.
To conclude, Watson Discovery is an easy and fast way to ingest documents and captures a lot of additional metadata that can be helpful in building the UI (for example, for highlighting passages), while the unstructured library (which has wrappers in both LlamaIndex and LangChain) captured the complex tables within the documents more accurately than Discovery. A combination of both approaches can therefore be helpful for projects whose documents include complex tables.
Ingestion without a graphical user interface can be hard for business users because it requires technical experience and knowledge. watsonx Orchestrate removes this hurdle by providing a drag-and-drop user interface (UI). Clicking the blue upload button lets you upload and store documents directly in watsonx Discovery (aka Elasticsearch).
This allows users to upload their documentation directly into Elasticsearch to create their own custom knowledge base.
Instead of ingesting the documents directly into watsonx Discovery, you also have the option to connect to Watson Discovery through watsonx Orchestrate. You can find more details about how to ingest data through Watson Discovery below.
Another internal tool that can be used for ingestion is Watson Discovery. Watson Discovery is not purely an ingestion tool: with it, you can ingest, normalize, enrich, and search your unstructured data. The image below shows the core components and capabilities of Watson Discovery.
As you can see above, the core capabilities of Watson Discovery include Smart Document Understanding and the ingestion of structured and unstructured data in formats such as JSON, HTML, PDF, Word, and more.
If you have Watson Discovery provisioned in your environment, you can easily ingest documents by first creating a project and then creating a collection where you can upload your data from either local storage or a cloud location such as an object storage bucket.
Once you create your collection, you can go to the Manage Collections tab in the hamburger menu and then, under Identify Fields, choose which method you want to use to process the documents.
There are three options:
It might take a while until all your documents are processed. After that, you can use the Watson Discovery API to read and use your data in your code.
Below is sample code showing how you can use the API to read the data from your collection:
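The following is a minimal sketch using the `ibm-watson` Python SDK, not a definitive implementation. It assumes an IBM Cloud instance authenticated with an IAM API key (other deployments may need a different authenticator), and the function names, parameter names, and the choice of fields returned are illustrative. Passages are only present in the response if passage retrieval is enabled for the query.

```python
from typing import Any, Dict, List


def query_collection(
    api_key: str,
    service_url: str,
    project_id: str,
    collection_id: str,
    question: str,
    count: int = 5,
) -> Dict[str, Any]:
    """Run a natural language query against one collection and return the raw result."""
    # Imported lazily so the parsing helper below works without the SDK installed.
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
    from ibm_watson import DiscoveryV2

    discovery = DiscoveryV2(
        version="2023-03-31",  # API version date; check which one your instance supports
        authenticator=IAMAuthenticator(api_key),
    )
    discovery.set_service_url(service_url)
    response = discovery.query(
        project_id=project_id,
        collection_ids=[collection_id],
        natural_language_query=question,
        count=count,
    )
    return response.get_result()


def passages_from_result(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a query result into a list of {document_id, text} passages."""
    passages = []
    for doc in result.get("results", []):
        for passage in doc.get("document_passages", []):
            passages.append(
                {
                    "document_id": doc.get("document_id"),
                    "text": passage.get("passage_text", ""),
                }
            )
    return passages
```

The flattened passages can then be passed on to the chunking or retrieval stage of your RAG pipeline.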
When creating a project in Watson Discovery, you can schedule a web crawl to fetch information from a specified URL. All you need to do is select Web Crawl as the data source and then specify the URL and the crawl schedule.
After that, you can follow the steps mentioned above to manage your collection and use the API to read the data from your collection. To see more details on how to set up Web Crawl in Watson Discovery, you can watch this video.
For more information and features you can refer to the Watson Discovery developer page.
Deep Search is IBM Research's open-source toolkit for ingestion. Deep Search leverages state-of-the-art AI methods to continuously collect, convert, enrich, and link large document collections. You can use it for both public and proprietary PDF documents.
Deep Search converts unstructured PDF documents into structured JSON files with accuracy and ease. It enables you to automate knowledge extraction as well as to fine-tune your proprietary Foundational Models and Large Language Models.
You can find more information on Deep Search here.
Deep Search can be a good internal tool for reading complex documents in place of unstructured or other open source libraries. You can also find very good documentation in the repo, with different types of examples provided as Python notebooks:
Deep Search provides good results when extracting tables from PDFs in JSON format.
For testing, we used a partial PDF document from the CIPP consisting of three pages, each displaying table data. In this case, the best RAG performance was achieved by using Deep Search to extract the tables into JSON format and then converting them to HTML for use in Watson Discovery / watsonx Discovery. This worked better than directly using the PDFs, converting the PDFs to HTML without Deep Search, or using only Deep Search's JSON in Watson Discovery / watsonx Discovery. We found that Watson Discovery selects the wrong answer from the wrong cell in the table in some cases but faster, and in the same case, Watson Discovery can pinpoint the exact answer within the table grid.
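The JSON-to-HTML step can be sketched as follows. This example assumes the table has already been reduced to a simple grid of cell strings (a list of rows); the actual Deep Search JSON schema is richer and is not reproduced here, and the function name `grid_to_html` is hypothetical.

```python
from html import escape
from typing import List


def grid_to_html(grid: List[List[str]], header_rows: int = 1) -> str:
    """Render a grid of cell strings as an HTML table.

    The first `header_rows` rows are emitted as <th> cells, the rest as <td>.
    """
    lines = ["<table>"]
    for index, row in enumerate(grid):
        tag = "th" if index < header_rows else "td"
        # Escape cell text so characters like < and & do not break the markup.
        cells = "".join(f"<{tag}>{escape(cell)}</{tag}>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

Merged or nested cells would need `rowspan` / `colspan` handling, which this sketch deliberately leaves out.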
IBM Datacap is a comprehensive solution for document and data capture, offering fast, accurate, and cost-effective scanning, classification, recognition, validation, and export of data and document images. It captures documents, extracts relevant data, and integrates it into downstream business processes.
Datacap acquires paper documents via scanners, multifunction printers, or mobile devices and imports electronic documents from file systems, fax, or email servers. It enhances data extraction with image-processing features such as deskewing and removing lines, smears, and borders.
Datacap uses optical character recognition (OCR), intelligent character recognition (ICR) for handwriting, optical mark recognition (OMR) for marks, and bar code reading to efficiently extract data.
For detailed installation and usage instructions, see the IBM Datacap documentation.
LangChain supports different types of document loaders, which load data from a source as LangChain Document objects. A Document object is a piece of text with associated metadata that can easily be used for chunking and retrieval later in the pipeline.
Currently LangChain supports loaders for either loading a whole directory or different types of files such as:
You can use these loaders in a few lines of Python. For example, to load a PDF you can use the following code:
    from langchain_community.document_loaders import PyPDFLoader

    loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
    pages = loader.load_and_split()
Please refer to LangChain's Documentation to see how to use these loaders.
Other than the ones mentioned in the link above, you can find additional loaders that load data directly from specific applications such as email, GitHub, Google Drive, etc. You can find the list of all the different loaders here.
In cases where you have documents with complex structures, formats, and tables, you can use the unstructured loaders from LangChain. Under the hood, these loaders use the unstructured library, which is one of the best libraries for handling complex documents (in formats such as PDF, MS Word, PowerPoint, etc.) and for breaking documents down into their components: LangChain UnstructuredLoader
Similar to LangChain, LlamaIndex also supports different types of loaders. One of the simplest ways to load data using LlamaIndex is to use the SimpleDirectoryReader. SimpleDirectoryReader supports the following types:
You can use the SimpleDirectoryReader with the following code snippet:
    from llama_index.core import SimpleDirectoryReader

    reader = SimpleDirectoryReader(input_dir="path/to/directory")
    documents = reader.load_data()
You can also add specific extensions to read only those extensions from the directory:
    SimpleDirectoryReader(
        input_dir="path/to/directory",
        required_exts=[".pdf", ".docx"],
    )
For more details on how to use LlamaIndex loaders, please refer to its documentation.
Also, similar to Hugging Face, you can create your own custom loaders or use other people's custom loaders on llamahub.ai.
One disadvantage of LlamaIndex compared to LangChain, though, is that its documentation and community support are not as good as LangChain's, so debugging your code might be harder.
In cases where you are loading complex documents, you can also use the unstructured library directly. The unstructured library is really good at reading and parsing tables.
The unstructured library is also supported by wrappers in both LangChain and LlamaIndex, which might be easier to use in RAG pipelines down the line since they automatically create and return Document objects. However, if your project needs more customization and you want to use the unstructured library directly, you can refer to the following documentation: Unstructured.
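As a minimal sketch of direct use, the example below partitions a file with unstructured and separates table elements from the remaining text. It assumes the `unstructured` package (with the extras needed for your file type) is installed; the helper names `partition_file` and `split_tables_and_text` are hypothetical. Whether tables carry an HTML rendering in `metadata.text_as_html` depends on the partitioning strategy and options, so the code falls back to the plain text.

```python
from typing import Any, Iterable, List, Tuple


def partition_file(path: str) -> List[Any]:
    """Partition a document into unstructured elements (titles, text, tables, ...)."""
    # Imported lazily so the helper below can be used without the package installed.
    from unstructured.partition.auto import partition

    return partition(filename=path)


def split_tables_and_text(elements: Iterable[Any]) -> Tuple[List[str], List[str]]:
    """Separate table elements from all other elements.

    Tables are returned as HTML when available, otherwise as their plain text.
    """
    tables: List[str] = []
    texts: List[str] = []
    for element in elements:
        if getattr(element, "category", None) == "Table":
            metadata = getattr(element, "metadata", None)
            table_html = getattr(metadata, "text_as_html", None)
            tables.append(table_html or element.text)
        else:
            texts.append(element.text)
    return tables, texts
```

Keeping tables separate makes it possible to chunk narrative text normally while treating each table as a single unit.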