Best practices on OCR with Document Processing Extension

The best practices for optical character recognition (OCR) with Document Processing Extension help you understand the factors that affect OCR results and performance.

The accuracy of heuristic algorithms in OCR is affected by different aspects of a document. The following points explain the factors that affect OCR results and performance, and suggest document formats that improve those results.

Quality

First, a document must have clear, machine-printed text. Machine text is created with a word processor, typewriter, or printer. The following points detail some of the factors that reduce OCR success and that you should avoid where possible.

Skewing and distortion: OCR can automatically handle rotation and a small amount of skewing, but recognition degrades as skewing and distortion increase.
Noise: Speckles, streaks, watermarks, stamps, and other marks that are not part of the text can interfere with OCR. Noise also includes handwritten notes, circled text, and other notations that are sometimes added to documents before scanning. When noise touches text, it can interfere with character recognition. Even when it does not, it can interfere with line recognition and block identification.
Background: As part of text extraction, OCR must determine what is text and what is background. The Document Processing Extension engine supports color and grayscale images, but too many colors or certain color combinations can interfere with its ability to identify foreground and background colors. Inverse text is supported by Document Processing Extension, but can be harder to recognize.

Size

The size of a file has two related factors:

The size of the page: For a PDF file, the size is expressed as the printable size of the page, which is found in the page properties. In an image, the size is expressed as the number of pixels in the image and the DPI (dots per inch) of the image. The color depth of the image also affects memory usage because color images require more RAM during processing than black and white images.
The physical size of the file on a storage device: While the page size and the file size are related, they are not always directly proportional. Because of differences in image file compressions, a larger file size does not directly determine how much RAM is used during recognition compared to another file with a different compression type.
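
As a rough illustration of why page dimensions, DPI, and color depth matter more than the size of the file on disk, the following Python sketch estimates the size of the uncompressed bitmap for one page. The function name and the bits-per-pixel values (1 for black and white, 8 for grayscale, 24 for color) are assumptions for illustration. The memory that the engine actually uses is not documented here and is likely to differ.

```python
import math

def uncompressed_page_bytes(width_in: float, height_in: float,
                            dpi: int, bits_per_pixel: int) -> int:
    """Estimate the raw bitmap size of one page, ignoring compression and overhead."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each pixel row is padded up to a whole number of bytes.
    row_bytes = math.ceil(width_px * bits_per_pixel / 8)
    return row_bytes * height_px

# Illustrative page: US Letter (8.5 x 11 inches) scanned at 300 DPI.
for label, bpp in [("black and white", 1), ("grayscale", 8), ("color", 24)]:
    size = uncompressed_page_bytes(8.5, 11, 300, bpp)
    print(f"{label}: {size / 1_000_000:.1f} MB uncompressed")
```

The same page can therefore need far more memory in color than in black and white, regardless of how small the compressed file is.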

PDF documents themselves can vary widely based on the type of data within them. A PDF document might be a simple depiction of text and fonts or might be a combination of those along with different images and image formats that are embedded in the file.

As a result of these complexities, each of these attributes helps to indicate the amount of resources that are needed, but the relationship is difficult to express with a simple formula.

Document Processing Extension supports ingestion of multipage files, and the transaction layer supports ingestion of files up to 250 MB. However, the size that is supported for a single page image is smaller and depends on the factors that are explained previously. To improve performance, use file types and formats that generate smaller files.

Format

Some file formats, such as JPG and other lossy-compressed formats, keep the file size smaller but blur the image, which reduces the clarity of character edges. Lossy compression discards information to reduce the size of a file. Ultimately, lossy compression formats make it harder for the recognition engine to maintain high accuracy. Use lossless compression formats where possible.

PDF documents can contain lossy or lossless images. Even if the PDF document contains lossless images, if the source images used to generate them are lower quality, low resolution, or are sourced from a lossy-compressed file, it can still make it harder for the recognition process to maintain high-quality output.

One of the best image formats for image processing is the TIFF format. It is a standard format that is widely used in document processing. TIFF files have two important advantages over other formats such as PNG or JPEG.

First, TIFF images support multiple types of compression, both lossy and lossless. Generally, G4 (fax) compression works very well for black and white images and is lossless. LZW compression works well for grayscale and color images and is also lossless. LZW can also be used for black and white images, but the file size can be larger than if fax compression were used. JPEG compression can produce smaller files, but it is lossy and degrades the quality of the image. JPEG was intended for the storage of photographs, not for preserving document integrity.
Second, TIFF is recommended over JPEG or PNG because of the DPI property. TIFF images maintain the source DPI of the image. If an image was scanned at 300 DPI, this information is retained with the image. The DPI provides a way to accurately re-create the original size of the document. JPEG and PNG files often do not carry reliable DPI information. When the DPI is missing, the application that processes the image typically uses a default DPI of 72 or 96, which does not allow it to accurately re-create the original image size.
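
The effect of an incorrect default DPI can be shown with simple arithmetic: the physical size in inches is the pixel count divided by the DPI. The following Python sketch uses a hypothetical helper name and illustrative page dimensions.

```python
def physical_size_inches(width_px: int, height_px: int, dpi: float) -> tuple[float, float]:
    """Convert pixel dimensions to a physical size in inches for a given DPI."""
    return (width_px / dpi, height_px / dpi)

# Illustrative image: a US Letter page scanned at 300 DPI is 2550 x 3300 pixels.
scanned = (2550, 3300)

# Compare the true scan DPI with the two common fallback values.
for dpi in (300, 96, 72):
    width_in, height_in = physical_size_inches(*scanned, dpi)
    print(f"{dpi} DPI -> {width_in:.2f} x {height_in:.2f} inches")
```

Only the 300 DPI interpretation reproduces the original 8.5 x 11 inch page. The fallback values make the same pixels appear to be a page several times larger.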

Fonts

The OCR engine is capable of recognizing text in many different fonts. However, standard fonts, such as Arial and Times New Roman, provide better recognition results than fonts that have more unusual character shapes. OCR can also handle different font sizes, although very small sizes might not produce enough pixels to clearly identify the characters. You need to test to determine where those limits are for your documents.

DPI settings

Optimal DPI for recognition is generally 200 or 300 DPI for both the X and Y axes. Higher DPIs of 400 or 500 DPI can be used for better recognition with small fonts or for languages that contain intricate characters. However, higher DPIs also increase file size and processing time. Recognition results for DPIs below 200 are lower because of the loss of character clarity. DPI for the X and Y axes should be the same. Fax images have a low DPI and also have unequal DPI values for the X and Y axes. As such, fax recognition is lower than for other types of documents.
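
The DPI guidance in this section can be sketched as a simple pre-check. The function name and the warning wording are hypothetical, and the thresholds are taken only from the guidance in this section, not from any documented engine behavior.

```python
def check_dpi(x_dpi: int, y_dpi: int) -> list[str]:
    """Return warnings for a DPI pair, based on the guidance in this section."""
    warnings = []
    if x_dpi != y_dpi:
        warnings.append("X and Y DPI differ; recognition is typically lower")
    if min(x_dpi, y_dpi) < 200:
        warnings.append("DPI below 200; character clarity is reduced")
    if max(x_dpi, y_dpi) > 300:
        warnings.append("DPI above 300; expect larger files and longer processing")
    return warnings

print(check_dpi(300, 300))  # typical scanner setting
print(check_dpi(204, 98))   # illustrative fax-like values
print(check_dpi(400, 400))  # higher DPI for small fonts
```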

Note: The following guidelines relate font size to DPI:

If the font size is larger than 12, 200 DPI usually gives a high level of confidence.
If the font size is below 12 or if the language has intricate characters, use a minimum of 300 DPI.
300 DPI is generally a good value to give you a balance of quality recognition and performance for typical document font sizes.
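
The font size guidelines in the note can be sketched as follows. The function names are hypothetical. The note does not say how to treat a font size of exactly 12, so this sketch assumes that it is handled like the larger sizes. The second helper assumes that font sizes are given in points (1 point = 1/72 inch) and shows how many pixels of height a font receives at a given DPI.

```python
def recommended_min_dpi(font_size_pt: float, intricate_characters: bool = False) -> int:
    """Suggest a minimum DPI from the guidelines in the note above."""
    # Assumption: exactly 12 pt is treated like sizes larger than 12 pt.
    if font_size_pt < 12 or intricate_characters:
        return 300
    return 200

def em_height_px(font_size_pt: float, dpi: int) -> float:
    """Nominal font height in pixels; actual glyphs are somewhat smaller than this."""
    return font_size_pt / 72 * dpi

for size in (8, 10, 12, 14):
    dpi = recommended_min_dpi(size)
    print(f"{size} pt -> at least {dpi} DPI ({em_height_px(size, dpi):.1f} px nominal height)")
```

Under these assumptions, an 8 pt font at 300 DPI has the same nominal pixel height as a 12 pt font at 200 DPI, which is one way to see why smaller fonts need a higher DPI.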

It is harder to quantify a minimum DPI for PDF documents. If a PDF document is generated directly from an electronic source, such as a Word document, the resulting document is typically composed of text and font information. This kind of document can scale very cleanly, providing high-quality results. These documents can also contain embedded images.

Source images that are used within a PDF must be of a high quality. This includes any images that are embedded in an electronic document, as described above. PDF documents can also be created directly from scanned images, either directly by using a scanner device or, in a secondary step, by converting images from a scanner to PDF. For such images, follow the same recommendations as for stand-alone images, such as TIFF files. After an image is embedded into a PDF document, it becomes harder to determine its quality. One simple test is to zoom in on a PDF page and see if the lines of the text remain crisp and clear. If the text remains clear with crisp edges after a reasonable amount of enlarging, then it is likely to be a good quality image. However, it can take some experience to effectively use this analysis approach.

When you use images from digital cameras or mobile phones, recognize that these devices are intended to produce photograph-quality images and that the pixel dimensions can vary based on the camera and its specific size settings. When a document is captured by using a camera, its original physical size cannot be determined. Where possible, use documents that are generated by scanners or other sources, which generally perform better than a camera.

Character substitution

Some characters are very difficult for an OCR engine to differentiate, for example, O (capital letter), o (lowercase letter), and 0 (number). The Document Processing Extension engine does perform semantic normalization, so a search for ‘cost’ also finds a match with ‘C0st’. You can also improve text accuracy for key values by selecting Mostly Alphabetic or Mostly Numeric when you define the keys in the ontology.
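
The idea of matching text despite confusable characters can be sketched in Python as follows. This is not the engine's actual normalization, which is not described here. The mapping covers only the O, o, and 0 group that is named above, and the function names are hypothetical.

```python
# Assumed mapping: only the O / o / 0 group named in the text.
CONFUSABLE = str.maketrans({"0": "o"})

def normalize(text: str) -> str:
    """Fold case, then map confusable characters to one canonical form."""
    return text.casefold().translate(CONFUSABLE)

def matches(expected: str, recognized: str) -> bool:
    """Compare two strings after normalization."""
    return normalize(expected) == normalize(recognized)

print(matches("cost", "C0st"))
print(matches("cost", "cast"))
```

A mapping of this kind trades precision for recall: it would also treat ‘100’ and ‘1oo’ as equal, which is one reason why a hint such as Mostly Numeric for a key can help.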

Performance

When you plan a system, one of the key metrics is throughput. In other words, how long does it take the average document to process, and by extension, how many documents per hour and per day can you process? You need to test to determine what your performance ultimately is. However, here are the primary factors, some of which have already been mentioned.

Hardware speed: This is important to consider when determining performance. If you are testing on the Development system and it does not match the Production system that you run on, the performance results differ.
System size: The Document Processing Extension engine can process documents in parallel. The exact number is based on the size of your system. So, if one document takes X seconds, it does not necessarily mean that 10 documents take 10 times X seconds. You need to test with many documents to determine total throughput.
File size: OCR analyzes images pixel by pixel. Therefore, large images have more pixels and take longer to process.
Complexity of the document: Images with long or numerous text blocks take longer to process than images with a small amount of text.
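
The relationship between per-document time and parallel processing can be sketched with an idealized model. The function names and the figures are hypothetical. The model assumes perfect parallel scaling with no shared bottlenecks, so it gives an upper bound on throughput and does not replace testing on your own system.

```python
import math

def batch_seconds(documents: int, seconds_per_document: float,
                  parallel_documents: int) -> float:
    """Idealized batch time: documents are processed in waves of `parallel_documents`."""
    waves = math.ceil(documents / parallel_documents)
    return waves * seconds_per_document

def documents_per_hour(seconds_per_document: float, parallel_documents: int) -> float:
    """Idealized upper bound on hourly throughput, assuming perfect parallel scaling."""
    return parallel_documents * 3600 / seconds_per_document

# Hypothetical figures: 6 seconds per document, 4 documents processed in parallel.
print(batch_seconds(10, 6, 4))
print(documents_per_hour(6, 4))
```

With these figures, 10 documents take 18 seconds in the idealized model rather than 60 seconds, which illustrates why a single-document timing cannot simply be multiplied.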

Color documents

Documents with colored text process better when the text colors are dark rather than light and when the letters have a solid fill. Light-colored or light gray text and lines might not be recognized accurately or might not be recognized at all.
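
One way to compare how dark different text colors are is to compute an approximate brightness value for each color. The following Python sketch uses the Rec. 601 luma weights, which are a common grayscale approximation. How the engine itself converts colors is not described here, so treat this only as a way to rank candidate colors, not as a prediction of recognition results. The color values are illustrative.

```python
def luma(rgb: tuple[int, int, int]) -> float:
    """Approximate brightness from 0 (black) to 255 (white), using Rec. 601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

# Illustrative text colors, listed from darkest to lightest after sorting.
colors = {
    "navy": (0, 0, 128),
    "dark red": (139, 0, 0),
    "light gray": (211, 211, 211),
    "yellow": (255, 255, 0),
}

for name, rgb in sorted(colors.items(), key=lambda item: luma(item[1])):
    print(f"{name}: {luma(rgb):.0f}")
```

Colors with lower values are darker and, on a light background, are closer to the dark text that the guidance above favors.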

Testing

When you evaluate the recognition engines and set up an application, use the actual pages that your application must process. It is not advisable to rely on a few sample text documents to see how well the engine performs. Results for “manufactured” test documents can differ from results for the actual documents that are ultimately processed. Even if they are similar, a problem might exist in a test document that does not exist in a “real” document, and conversely.

Summary

The information on this page should give you a better understanding of the recognition results and performance that you see for your documents. Ideally, it also suggests ways to modify your file ingestion methods to generate documents that provide the best results.