How IBM’s latest research is leading the optical character recognition (OCR) revolution and pushing the boundaries of capabilities.

Documents have always been (and continue to be) a significant data source for any business or corporation. It’s crucial to be able to scan and digitize physical documents, no matter how they are captured, to extract their information and represent it in a way that allows for further analysis (e.g., for a bank’s mortgage or loan process). Even for documents created digitally (e.g., PDF documents), extracting information can be a challenge.

At IBM, we are treating this as a multidisciplinary challenge spanning computer vision, natural language understanding, information representation and model optimization. With this approach, we are advancing the state of the art in document understanding: our models can analyze the layout and reading order of complex documents, and they can interpret visual elements such as plots, charts and diagrams and represent them in a multimodal way.

This work led to IBM’s new, enhanced optical character recognition (OCR), which digitizes important, valuable business documents more easily and accurately so that enterprises can extract information for analysis.

Cleaner and more accurate extraction creates multiple benefits, including the following:

  • Accelerated workflows
  • Automated document routing and content processing
  • Reduced costs
  • Superior data security
  • Disaster recovery

A variety of use cases that rely on OCR will also benefit from the enhancements IBM is making. From data extraction to automating big data processing workflows, OCR powers many systems and services used every day.

Document understanding

Document understanding is the ability to read these business documents, either programmatically or by OCR, and interpret their content so it can take part in an automated business process. An example of such a process is automated insurance claims processing, where data is extracted from ID cards, claims forms and claim descriptions, among others.

Documents are digitized using OCR, which is composed of two stages:

  • Detection: Localize the various words in the document.
  • Recognition: Identify the characters that make up each detected word.
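
The two stages can be pictured as a small pipeline in which detection yields word locations and recognition turns each location into text. The following is only a minimal sketch of that split, not IBM’s implementation: the names `Word`, `run_ocr`, `toy_detect` and `toy_recognize` are hypothetical, and the "image" is a list of text rows standing in for real pixels so that the example runs without any OCR model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A bounding box as (left, top, right, bottom) coordinates.
Box = Tuple[int, int, int, int]


@dataclass
class Word:
    box: Box   # where the word is (detection output)
    text: str  # what the word says (recognition output)


def run_ocr(image,
            detect: Callable[[object], List[Box]],
            recognize: Callable[[object, Box], str]) -> List[Word]:
    """Run detection first, then recognition on each detected region."""
    return [Word(box, recognize(image, box)) for box in detect(image)]


# Toy stand-ins (assumptions for illustration only): the "image" is a list
# of text rows, and each character occupies one column of one row.
def toy_detect(image) -> List[Box]:
    boxes = []
    for row, line in enumerate(image):
        col = 0
        for token in line.split(" "):
            if token:
                boxes.append((col, row, col + len(token), row + 1))
            col += len(token) + 1
    return boxes


def toy_recognize(image, box: Box) -> str:
    left, top, right, _ = box
    return image[top][left:right]


page = ["Claim number 12345", "Policy holder Jane Doe"]
words = run_ocr(page, toy_detect, toy_recognize)
```

Keeping the detector and recognizer as separate callables mirrors the two-stage description: either stage could be swapped for a real model without changing the other.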

This means that with OCR, we know where the words are on the document and what those words are. However, challenges arise when documents are captured under non-ideal conditions, such as incorrect scanner settings, insufficient resolution, bad lighting (e.g., mobile capture), loss of focus, unaligned pages and added artifacts from badly printed documents.

Our team focused on two challenging areas, low-quality documents and natural-scene images, to address how the next generation of OCR technology can detect and extract data from them.

Better training and accuracy

Imagine for a moment that you are going to build a computer vision system for reading text in documents or extracting structure and visual elements. To train this system, you will undoubtedly need a lot of data that has to be correctly labeled and sanitized for human errors. Furthermore, you might realize that you require a different granularity of classes to train a better model—but acquiring new labeled data is costly. The cost will likely force you to make some compromises or use a narrower set of training regimens which may affect accuracy.

But what if you could quickly synthesize all of the data you need? How would that affect the way you approach the problem?

Synthetic data is at the core of our work in document understanding and our high-accuracy technology. As we developed our OCR model, we required significant amounts of data that is hard to acquire and annotate. As a result, we created new methods to synthesize data. Because synthetic data can be altered at will, we could also apply optimization techniques to increase the accuracy of our architecture.

Now we are synthesizing data for object segmentation, text recognition, NLP-based grammatical correction models, entity grouping, semantic classification and entity linkage.

Another advantage of synthetic data generation is the ability to control the granularity and format of the labels, as well as the appearance of the generated samples: colors, fonts, font sizes, background noise and so on. This enables us to design architectures that can recognize punctuation, layout, handwritten characters and form elements.
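
To make the idea of controllable labels concrete, here is a minimal sketch of a generator that samples a line of text together with randomized style parameters and emits labels at two granularities (word level and character level). It is an illustration under stated assumptions, not the method described in this article: the vocabulary, the font names, the value ranges and the names `SyntheticLine` and `make_line` are all hypothetical, and rendering the sample to an actual image is left out.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

FONTS = ["serif", "sans", "mono"]  # hypothetical font names
WORDS = ["invoice", "total", "loan", "date", "amount", "claim"]


@dataclass
class SyntheticLine:
    text: str
    font: str
    font_size: int
    color: Tuple[int, int, int]
    noise_level: float
    word_labels: List[Tuple[str, int, int]]  # (word, start, end) offsets
    char_labels: List[Tuple[str, int]]       # (char, offset), spaces skipped


def make_line(rng: random.Random, n_words: int = 4) -> SyntheticLine:
    """Sample one text line with random style and two label granularities."""
    words = [rng.choice(WORDS) for _ in range(n_words)]
    text = " ".join(words)

    # Word-level labels: each word with its character span in the line.
    word_labels, pos = [], 0
    for w in words:
        word_labels.append((w, pos, pos + len(w)))
        pos += len(w) + 1  # skip the separating space

    # Character-level labels: a finer granularity from the same source.
    char_labels = [(c, i) for i, c in enumerate(text) if c != " "]

    return SyntheticLine(
        text=text,
        font=rng.choice(FONTS),
        font_size=rng.randint(8, 24),
        color=(rng.randint(0, 255), rng.randint(0, 255), rng.randint(0, 255)),
        noise_level=rng.uniform(0.0, 0.3),
        word_labels=word_labels,
        char_labels=char_labels,
    )


rng = random.Random(0)  # fixed seed so the dataset is reproducible
dataset = [make_line(rng) for _ in range(1000)]
```

Because the generator knows exactly what it wrote and where, changing the label granularity is a code change rather than a new annotation effort.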

We’re excited to announce that training the models mentioned above on synthetic data has resulted in a major update to our core OCR model, providing a significant boost in accuracy and lower processing time.

Higher-level document understanding

Not all documents within an enterprise are of equal value. For example, business documents are central to the operation of business and are at the heart of digital transformation. Such documents include contracts, loan applications, invoices, purchase orders, financial statements and many more. The information in these business documents is presented in natural language and is unstructured. Understanding these documents poses a challenge due to complex document layouts and poor-quality scans.

Now with IBM’s latest OCR technology, these critical documents can be read and the key information contained within can be extracted.


As data continues to provide the key insights enterprises need to analyze their business, understand their customers and automate workflows, document-understanding technology like optical character recognition (OCR) is more important than ever.

IBM’s latest research is leading the OCR revolution by pushing the boundaries of OCR capabilities and raising the standard for OCR in the development community. We’re committed to improving our product and providing our customers with the highest level of performance and accuracy possible.

This new OCR technology is being rolled out across all IBM products utilizing OCR and will allow users to digitize important, valuable business documents more easily and accurately for the enterprise to extract information for analysis.

To learn more, check out the documentation and release notes.

The new OCR technology is already available in IBM Watson Discovery—try it out and get started today.
