Extract table information from PDF files using OCR and analytics technology

From the developerWorks archives

Xu Hua Li, Xiao Yang, Douglas Burdick, Yuan Yuan Li, and Hai Ji

Date archived: February 26, 2018 | First published: February 11, 2015

Learn how to build a REST application on IBM Bluemix that provides a web service for converting PDF documents to text. The service accepts a PDF file, converts it to text while capturing the tables it identifies in the document, and returns the result to the user as XML or HTML. The XML version is the raw output of the OCR engine; the HTML version is the result of an error-correction process that fixes errors in the table structure identified by the OCR engine.
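To make the request side of such a service concrete, here is a minimal client-side sketch in Python. It only builds the HTTP request and does not send it. The endpoint URL, the `format` query parameter, and the function name `build_conversion_request` are assumptions made for illustration; none of them come from the article, whose actual interface is described only in the archived PDF.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint: the real service URL is not given in this abstract.
SERVICE_URL = "https://example.invalid/api/convert"


def build_conversion_request(pdf_bytes: bytes,
                             output_format: str = "html") -> urllib.request.Request:
    """Build (but do not send) a POST request that uploads a PDF.

    output_format selects between the two results the abstract mentions:
    "xml" for the raw OCR output, "html" for the error-corrected tables.
    The "format" query parameter itself is an assumed name.
    """
    if output_format not in ("xml", "html"):
        raise ValueError("output_format must be 'xml' or 'html'")
    query = urllib.parse.urlencode({"format": output_format})
    return urllib.request.Request(
        url=f"{SERVICE_URL}?{query}",
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


# Example: prepare a request for the raw OCR (XML) result.
request = build_conversion_request(b"%PDF-1.4 ...", output_format="xml")
```

Against a real deployment, the prepared request could be sent with `urllib.request.urlopen(request)`. Keeping request construction separate from sending makes the sketch easy to inspect without network access.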

This content is no longer being updated or maintained. The full article is provided "as is" in a PDF file. Given the rapid evolution of technology, some content, steps, or illustrations may have changed.


