Documents have always been (and continue to be) a significant data source for any business or corporation. No matter how the data is captured, it’s crucial to be able to scan and digitize physical documents, extract their information and represent it in a way that allows further analysis (e.g., in a bank’s mortgage or loan process). Even for documents created digitally (e.g., PDF documents), extracting information can be a challenge.
At IBM, we are treating this as a multi-disciplinary challenge spanning computer vision, natural language understanding, information representation and model optimization. With this approach, we are advancing the state of the art in document understanding, allowing our models to analyze the layout and reading order of complex documents and to represent their visuals in multimodal forms that capture plots, charts and diagrams.
This work led to IBM’s new, enhanced optical character recognition (OCR), which digitizes important, valuable business documents more easily and accurately so the enterprise can extract information for analysis.
Cleaner and more accurate extraction creates multiple benefits across the variety of use cases that rely on optical character recognition. From data extraction to automating big data processing workflows, OCR powers many systems and services used every day, and all of them stand to gain from the enhancements being made by IBM.
Document understanding is the ability to read these business documents—either programmatically or by OCR—and interpret their content so that it can feed into an automated business process. One example is automated insurance claims processing, where data is extracted from ID cards, claim forms and claim descriptions, among other documents.
To digitize documents, optical character recognition (OCR) is used. OCR is composed of two stages:

- Text detection, which locates where words and lines appear on the page
- Text recognition, which decodes those detected regions into the actual characters and words
This means that with OCR, we know where the words are on the document and what those words are. However, challenges arise when documents are captured under non-ideal conditions, including incorrect scanner settings, insufficient resolution, bad lighting (e.g., mobile capture), loss of focus, misaligned pages and artifacts from badly printed documents.
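The detection/recognition split can be sketched with a toy example. Here a page line is just a string standing in for an image, and words are runs of non-space characters; the function names (`detect_words`, `recognize`) are hypothetical and only illustrate the division of labor, since real OCR stages operate on pixels with learned models.

```python
from typing import List, Tuple

def detect_words(line: str) -> List[Tuple[int, int]]:
    """Stage 1 (detection): locate word regions as (start, end) spans."""
    spans, start = [], None
    for i, ch in enumerate(line):
        if ch != " " and start is None:
            start = i                      # a new word begins here
        elif ch == " " and start is not None:
            spans.append((start, i))       # the word ends before this gap
            start = None
    if start is not None:
        spans.append((start, len(line)))   # word running to the end of the line
    return spans

def recognize(line: str, span: Tuple[int, int]) -> str:
    """Stage 2 (recognition): decode the content inside a detected region."""
    return line[span[0]:span[1]]

line = "TOTAL DUE:  $1,250.00"
spans = detect_words(line)                      # where the words are
words = [recognize(line, s) for s in spans]     # what those words are
```

The two outputs mirror what OCR produces: `spans` answers "where are the words?" and `words` answers "what are they?".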
Our team focused on these two challenging areas to address how the next generation of OCR technology can detect and extract data from low-quality and natural-scene image documents.
Imagine for a moment that you are going to build a computer vision system for reading text in documents or extracting structure and visual elements. To train this system, you will undoubtedly need a lot of data that has to be correctly labeled and checked for human error. Furthermore, you might realize that you require a different granularity of classes to train a better model—but acquiring new labeled data is costly. The cost will likely force you to make compromises or use a narrower set of training regimens, which may reduce accuracy.
But what if you could quickly synthesize all of the data you need? How would that affect the way you approach the problem?
Synthetic data is at the core of our work in document understanding and our high-accuracy technology. As we developed our OCR model, we required significant amounts of data—data that is hard to acquire and annotate. As a result, we created new methods to synthesize data and applied optimization techniques that exploit our ability to alter that synthetic data, increasing the accuracy of our architecture.
Now we are synthesizing data for object segmentation, text recognition, NLP-based grammatical correction models, entity grouping, semantic classification and entity linkage.
Another advantage of synthetic data generation is the ability to control the granularity and format of the labels, including different colors, fonts, font sizes, background noise, etc. This enables us to design architectures that can recognize punctuation, layout, handwritten characters and form elements.
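The appeal of this approach is that labels come for free: the generator knows the ground truth it rendered, and every knob (length, punctuation, noise rate) can be turned at will. The sketch below is a minimal, hypothetical generator for a text-recognition task, with character-substitution noise standing in for the blur and print artifacts a real renderer would simulate; it is not IBM's actual pipeline.

```python
import random
import string

def synth_sample(rng: random.Random, max_len: int = 10, noise_prob: float = 0.1):
    """Return one (noisy_input, ground_truth_label) training pair."""
    # Ground-truth label: a random word, optionally ending in punctuation.
    label = "".join(rng.choice(string.ascii_letters)
                    for _ in range(rng.randint(3, max_len)))
    if rng.random() < 0.3:
        label += rng.choice(".,;:!?")
    # "Rendered" input: the label with simulated capture noise.
    chars = list(label)
    for i in range(len(chars)):
        if rng.random() < noise_prob:
            chars[i] = rng.choice("#*|~")   # stand-in for blur/artifacts
    return "".join(chars), label

rng = random.Random(7)                       # seeded for reproducibility
pairs = [synth_sample(rng) for _ in range(1000)]
```

Because the generator is parameterized, shifting the class granularity—say, adding punctuation-heavy or noisier samples—is a one-line change rather than a new labeling campaign.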
By leveraging synthetic data to train the models mentioned above, we have delivered a major update to our core OCR model, providing a significant boost in accuracy and reduced processing time.
Not all documents within an enterprise are of equal value. Business documents, for example, are central to a company’s operations and are at the heart of digital transformation. Such documents include contracts, loan applications, invoices, purchase orders, financial statements and many more. The information in these business documents is presented in natural language and is unstructured, and understanding them poses a challenge due to complex document layouts and poor-quality scans.
Now with IBM’s latest OCR technology, these critical documents can be read and the key information contained within can be extracted.
As data continues to provide the key insights enterprises need to analyze their business, understand their customers and automate workflows, document-understanding technology like optical character recognition (OCR) is more important than ever.
IBM’s latest research is leading the OCR revolution by pushing the boundaries of OCR capabilities and raising the standard for OCR in the development community. We’re committed to improving our product and providing our customers with the highest level of performance and accuracy possible.
This new OCR technology is being rolled out across all IBM products utilizing OCR and will allow users to digitize important, valuable business documents more easily and accurately so enterprises can extract information for analysis.
To learn more, check out the documentation and release notes.