Gain insights into an ocean of documents with deep search
Does your organization have vast numbers of digital documents (such as PDF files, patents and corporate documents) piled up, far more than anyone could read and digest? Wouldn’t it be nice if you could query these document piles with questions like “List all the materials claimed by company X in the US patent office”? Deep search is an IBM Research® service that automatically analyzes enormous digital libraries and helps uncover previously unknown facts. It implements an AI-based approach to enable intelligent querying against document repositories. This capability has been demonstrated to aid innovation across industries such as materials science, insurance and drug discovery.
IBM deep search service
How does deep search work? First, as shown in figure 1, the digital documents are segmented into their components (heading, introduction, references and so on) using machine learning models and converted into structured data representations (such as HTML or JSON). These supervised learning models are customizable and highly accurate, drawing on large data sets and modern neural network architectures.
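To make the first step concrete, here is a minimal sketch of turning raw document lines into a structured JSON representation. It is not the deep search implementation: the `Segment` class, the `classify_line` function and the `to_structured` function are hypothetical names, and a few hand-written rules stand in for the trained segmentation models described above.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Segment:
    label: str  # component type, such as "heading" or "references"
    text: str   # text content of the component


def classify_line(line: str) -> str:
    """Toy rule-based stand-in for a trained segmentation model (assumption)."""
    stripped = line.strip()
    if stripped.lower().startswith("references"):
        return "references"
    if stripped.isupper():
        return "heading"
    return "paragraph"


def to_structured(lines: list[str]) -> str:
    """Label each non-empty line and serialize the result as JSON."""
    segments = [
        Segment(classify_line(line), line.strip())
        for line in lines
        if line.strip()
    ]
    return json.dumps({"segments": [asdict(s) for s in segments]}, indent=2)


if __name__ == "__main__":
    # Made-up sample input, for illustration only.
    sample = [
        "INTRODUCTION",
        "",
        "Alloy X resists corrosion.",
        "References: [1] placeholder entry",
    ]
    print(to_structured(sample))
```

In a real pipeline the labels would come from a model trained on annotated pages, but the output shape (a list of labeled components) is the kind of structure that later steps can link to concepts.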
The second step of deep search uses existing data sources (corporate databases, publicly available data sets and the like) to identify the concepts (such as alloy or material) and the relationships that are relevant to the knowledge discovery task. Finally, a searchable and queryable knowledge base is built by linking the structured representations of the documents to the identified concepts and relationships.
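The linking step can be pictured as a small graph of facts, each tied back to the document that supports it. The sketch below is an illustration under stated assumptions, not the service's data model or API: the `KnowledgeBase` class, its methods, the `claims_material` relation and the sample facts are all hypothetical.

```python
from collections import defaultdict


class KnowledgeBase:
    """Toy in-memory store of (subject, relation, object) facts."""

    def __init__(self) -> None:
        # (subject, relation) -> set of objects
        self._edges = defaultdict(set)
        # (subject, relation, object) -> ids of documents supporting the fact
        self._sources = defaultdict(set)

    def add_fact(self, subject: str, relation: str, obj: str, doc_id: str) -> None:
        """Record a fact and the document it was extracted from."""
        self._edges[(subject, relation)].add(obj)
        self._sources[(subject, relation, obj)].add(doc_id)

    def query(self, subject: str, relation: str) -> list[str]:
        """Return all objects linked to the subject by the relation, sorted."""
        return sorted(self._edges.get((subject, relation), set()))

    def sources(self, subject: str, relation: str, obj: str) -> list[str]:
        """Return the ids of the documents that support a given fact, sorted."""
        return sorted(self._sources.get((subject, relation, obj), set()))


if __name__ == "__main__":
    kb = KnowledgeBase()
    # Made-up facts, for illustration only.
    kb.add_fact("Company X", "claims_material", "alloy B", "patent-002")
    kb.add_fact("Company X", "claims_material", "alloy A", "patent-001")
    kb.add_fact("Company X", "claims_material", "alloy A", "patent-003")

    # "List all the materials claimed by company X"
    print(kb.query("Company X", "claims_material"))
    print(kb.sources("Company X", "claims_material", "alloy A"))
```

Keeping the source document id alongside each fact is what lets a query result be traced back to the page it came from.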
Deep search in action
The document processing techniques, coupled with the graph analytics provided by deep search, can accelerate novel discoveries from document repositories across industries. The chemical company Nagase & Co has put deep search to extensive use in developing new compounds. ENI, an oil and gas company, is using the service for upstream exploration. Deep search is also currently aiding drug discovery in COVID-19 research.
Knowledge discovery at scale
In addition to the knowledge engineering techniques described above, automatic analysis of a huge number of documents demands powerful storage, compute and network infrastructure. The deep search platform is currently available as a service through Red Hat® OpenShift® on IBM Cloud®. It can also be set up on premises in an OpenShift environment on IBM Power Systems or on Intel x86 servers. The software is designed as a group of cloud-based microservices that can scale with the number of documents and the available hardware resources for large search applications. This hardware-software co-designed platform has demonstrated the capability to ingest as many as 100,000 pages per day per core.
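For rough capacity planning, the quoted ingestion rate can be turned into a back-of-the-envelope estimate. The sketch below assumes that throughput scales linearly with the number of cores and that the peak rate of 100,000 pages per day per core is sustained, neither of which is guaranteed in practice. The function name `days_to_ingest` is hypothetical.

```python
PAGES_PER_DAY_PER_CORE = 100_000  # peak figure quoted in the text


def days_to_ingest(
    total_pages: int,
    cores: int,
    pages_per_day_per_core: int = PAGES_PER_DAY_PER_CORE,
) -> int:
    """Estimate whole days needed to ingest a corpus.

    Assumes linear scaling across cores and a sustained per-core rate.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if cores <= 0 or pages_per_day_per_core <= 0:
        raise ValueError("cores and pages_per_day_per_core must be positive")
    daily_capacity = cores * pages_per_day_per_core
    # Integer ceiling division: a partially used day counts as a full day.
    return -(-total_pages // daily_capacity)


if __name__ == "__main__":
    print(days_to_ingest(50_000_000, cores=16))
```

Under these assumptions, a corpus of 50 million pages on 16 cores works out to 32 days, since 16 cores give a capacity of 1.6 million pages per day.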
IBM Systems Lab Services can help your organization make better use of document repositories using the deep search platform. Our experienced consultants help you set up the OpenShift platform, work with your subject matter experts to build the knowledge bases and design queries to help you develop novel insights into your digital libraries.