How IBM’s latest research is leading the optical character recognition (OCR) revolution and pushing the boundaries of what the technology can do.

Documents have always been (and continue to be) a significant data source for any business or corporation. It’s crucial to be able to scan and digitize physical documents to extract their information and represent them in a way that allows for further analysis (e.g., for a mortgage or loan process for a bank) no matter how the data is captured. Even for documents created digitally (e.g., PDF documents) the process of extracting information can be a challenge.

At IBM, we treat this as a multi-disciplinary challenge spanning computer vision, natural language understanding, information representation and model optimization. With this approach, we are advancing the state of the art in document understanding: our models can analyze the layout and reading order of complex documents and represent their visual content, including plots, charts and diagrams, in a multimodal manner.

This work led to the new, enhanced optical character recognition (OCR) technology IBM has created to digitize important, valuable business documents more easily and accurately, so enterprises can extract their information for analysis.

Cleaner and more accurate extraction creates multiple benefits, including the following:

  • Accelerated workflows
  • Automated document routing and content processing
  • Reduced costs
  • Superior data security
  • Disaster recovery

A wide variety of use cases that rely on OCR will also benefit from IBM’s enhancements. From data extraction to automating big data processing workflows, OCR powers many systems and services used every day.

Document understanding

Document understanding is the ability to read business documents (either programmatically or via OCR) and interpret their content so it can take part in an automated business process. One example is automated insurance claims processing, where data is extracted from ID cards, claim forms and claim descriptions, among other documents.
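To make the "interpret" step of such a claims workflow concrete, here is a minimal Python sketch. The sample OCR text, the field names and the regex patterns are all invented for illustration and are not part of any IBM product; real document-understanding systems use learned models rather than hand-written rules, but the goal is the same: turn unstructured recognized text into structured fields.

```python
import re

# OCR output from a claims form (invented sample text for illustration).
ocr_text = """
CLAIM FORM
Claim ID: CL-2023-04871
Policy No: 99-532-116
Date of Loss: 2023-06-14
"""

# The structured fields an automated claims workflow might need.
FIELD_PATTERNS = {
    "claim_id": r"Claim ID:\s*(\S+)",
    "policy_no": r"Policy No:\s*(\S+)",
    "date_of_loss": r"Date of Loss:\s*(\d{4}-\d{2}-\d{2})",
}

def extract_fields(text: str) -> dict:
    """Map each field name to its first regex match (None if absent)."""
    out = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = re.search(pattern, text)
        out[name] = m.group(1) if m else None
    return out

print(extract_fields(ocr_text))
# {'claim_id': 'CL-2023-04871', 'policy_no': '99-532-116', 'date_of_loss': '2023-06-14'}
```

Once fields are structured like this, downstream systems can validate, route and act on the claim without a human re-keying the data.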

To perform the digitization of documents, optical character recognition (OCR) is utilized. OCR is composed of two stages:

  • Detection: Localize the various words in the document.
  • Recognition: Identify the characters that make up each detected word.

This means that with OCR, we know where the words are on the document and what those words are. However, when using OCR, challenges arise when documents are captured under any number of non-ideal conditions. This can include incorrect scanner settings, insufficient resolution, bad lighting (e.g., mobile capture), loss of focus, unaligned pages and added artifacts from badly printed documents.
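The two stages can be sketched as a simple pipeline. In this toy Python example the detector and recognizer are stubs that return invented bounding boxes and transcripts; in a real OCR engine each stage would be a trained vision model, but the control flow (first localize word boxes, then recognize the text inside each box) is the same.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels

@dataclass
class Word:
    box: Box
    text: str
    confidence: float

def detect_words(image) -> List[Box]:
    """Stage 1 (detection): localize word bounding boxes.
    Stubbed here; a real detector runs a vision model over the image."""
    return [(10, 12, 60, 18), (80, 12, 95, 18)]

def recognize_word(image, box: Box) -> Tuple[str, float]:
    """Stage 2 (recognition): identify the characters inside one box.
    Stubbed here with invented transcripts and confidences."""
    fake_transcripts = {
        (10, 12, 60, 18): ("Invoice", 0.98),
        (80, 12, 95, 18): ("#20394", 0.91),
    }
    return fake_transcripts[box]

def run_ocr(image) -> List[Word]:
    """Chain the two stages: detect boxes, then recognize each one."""
    return [Word(box, *recognize_word(image, box)) for box in detect_words(image)]

words = run_ocr(image=None)  # no real image needed for this stub
for w in words:
    print(w.box, w.text, w.confidence)
```

The output of such a pipeline (where each word is, and what it says) is exactly the input that higher-level document understanding builds on.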

Our team focused on these two challenging areas to address how the next generation of OCR technology can detect and extract data from low-quality and natural-scene image documents.

Better training and accuracy

Imagine for a moment that you are going to build a computer vision system for reading text in documents or extracting structure and visual elements. To train this system, you will undoubtedly need a lot of data that has to be correctly labeled and sanitized to remove human errors. Furthermore, you might realize that you require a different granularity of classes to train a better model—but acquiring new labeled data is costly. The cost will likely force you to make some compromises or use a narrower set of training regimens, which may hurt accuracy.

But what if you could quickly synthesize all of the data you need? How would that affect the way you approach the problem?

Synthetic data is at the core of our work in document understanding and our high-accuracy technology. As we developed our OCR model, we required significant amounts of data—data that is hard to acquire and annotate. As a result, we created new methods to synthesize data, and we apply optimization techniques that exploit the malleability of synthetic data to increase the accuracy of our architecture.

Now we are synthesizing data for object segmentation, text recognition, NLP-based grammatical correction models, entity grouping, semantic classification and entity linkage.

Another advantage of synthetic data generation is the ability to control the granularity and format of the labels, including different colors, fonts, font sizes, background noise, etc. This enables us to design architectures that can recognize punctuation, layout, handwritten characters and form elements.
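As a toy illustration of this idea, the following Python sketch generates labeled samples whose rendering attributes (font, size, degradation) are freely controllable. Every name here (the font pool, the noise types) is hypothetical and not drawn from IBM's pipeline, and a real generator would rasterize actual word images rather than emit metadata — but it shows the key property: because we generate the text ourselves, the ground-truth label is known exactly and costs nothing to produce.

```python
import random
import string

# Rendering attributes we can vary; a real system would rasterize images.
FONTS = ["serif", "sans", "mono", "handwritten"]      # hypothetical font pool
NOISE = ["none", "blur", "speckle", "low-contrast"]   # hypothetical degradations

def synth_sample(rng: random.Random) -> dict:
    """Generate one labeled training sample with controllable attributes."""
    alphabet = string.ascii_letters + string.digits + ".,;"
    text = "".join(rng.choices(alphabet, k=rng.randint(3, 10)))
    return {
        "label": text,                    # ground truth, free and error-free
        "font": rng.choice(FONTS),
        "font_size": rng.randint(8, 32),
        "noise": rng.choice(NOISE),
    }

rng = random.Random(42)  # seeded for reproducibility
dataset = [synth_sample(rng) for _ in range(5)]
for sample in dataset:
    print(sample)
```

Dialing the attribute pools up or down is how a synthetic pipeline controls label granularity: add a handwritten font to train handwriting recognition, add punctuation to the alphabet to train punctuation-aware models, and so on.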

We’re excited to announce that, by training the models mentioned above on this synthetic data, we have delivered a major update to our core OCR model, providing a significant boost in accuracy and a reduction in processing time.

Higher-level document understanding

Not all documents within an enterprise are of equal value. Business documents, for example, are central to business operations and are at the heart of digital transformation. Such documents include contracts, loan applications, invoices, purchase orders, financial statements and many more. The information in these business documents is presented in natural language and is unstructured. Understanding these documents poses a challenge due to complex document layouts and poor-quality scans.

Now with IBM’s latest OCR technology, these critical documents can be read and the key information contained within can be extracted.


As data continues to provide the key insights enterprises need to analyze their business, understand their customers and automate workflows, document-understanding technology like optical character recognition (OCR) is more important than ever.

IBM’s latest research is leading the OCR revolution by pushing the boundaries of OCR capabilities and raising the standard for OCR in the development community. We’re committed to improving our product and providing our customers with the highest level of performance and accuracy possible.

This new OCR technology is being rolled out across all IBM products utilizing OCR and will allow users to digitize important, valuable business documents more easily and accurately for the enterprise to extract information for analysis.

To learn more, check out the documentation and release notes.

The new OCR technology is already available in IBM Watson Discovery—try it out and get started today.
