How IBM is digitizing the world's text

The idea of digitizing all books and making them available in electronic libraries can be traced back to 1945, when Dr. Vannevar Bush wrote “As We May Think” in the July issue of The Atlantic. His visionary description of an information-centric application called “memex” influenced the development of the hypertext concept and the Internet. Projects in the 1970s – such as Michael Hart’s Project Gutenberg and futurist Ray Kurzweil’s Optical Character Recognition (OCR) technology – continued the effort toward the digitization of textual information. But while billions of people access the Internet today, the full digitization and availability of past textual information is still a work in progress.

Among the many efforts underway, IBM is working with the European Union on project IMPACT (IMProving ACcess to Text) to efficiently produce digital replicas of historically significant texts and make them widely available, editable and searchable online. As part of the project, IBM researchers in Haifa, Israel developed CONCERT (COoperative eNgine for Correction of ExtRacted Text). It automates simple, repetitious operations using an adaptive OCR engine that automatically learns from its text recognition errors.

Digitizing Japanese literature

The complexity of the Japanese writing system poses a serious challenge to digitizing the country’s literature. Unlike many other languages, which are written with a few dozen standard characters, Japanese combines two syllabaries – hiragana and katakana – with about 10,000 kanji characters (including old characters, variants and the 2,136 commonly used characters). It also uses ruby, small syllabary characters printed next to a kanji as a reading aid, and mixes vertical and horizontal text.

The National Diet is Japan’s bicameral legislature. It is composed of a lower house, called the House of Representatives, and an upper house, called the House of Councillors. Both houses of the Diet are directly elected under a parallel voting system. In addition to passing laws, the Diet is formally responsible for selecting the Prime Minister.

Last year, IBM researchers in Tokyo combined their Social Accessibility tool with CONCERT to create a full-text digitization system prototype for the National Diet Library (NDL) of Japan. Dr. Makoto Nagao, the director of the National Diet Library, wrote the book “Digital Library” in 1994, in which he argued that the digitization of books is the first step towards realizing an ideal electronic library. The next step is to create a system that allows users to take full advantage of digitized information.

“The system needs to have capabilities that are close to how we hold and utilize knowledge in our brain,” said Dr. Nagao.

In addition to helping members of the Diet perform their duties, the NDL preserves all materials published in Japan as part of the national cultural heritage, and makes them available to government institutions and the general public. (As part of this effort, the NDL also launched the International Library of Children’s Literature in 2000.)

The NDL is making recorded academic literature available online to the public, including in forms accessible to the visually impaired, and is lending the recordings to libraries throughout Japan.

The full-text digitization system prototype developed by the IBM Research – Tokyo team has three goals: to improve the digitization of Japanese literature printed during and after the Meiji Period (1868 – 1912); to improve accessibility for people who have difficulty reading printed text because of a disability; and to facilitate effective searching and viewing of full-text data. The prototype is also designed with an eye toward future international collaboration and standardization among libraries, including the digitization of historically significant literature, broad use of books for various academic activities, and online searching.

In a matter of years, all of our textual information will be fully digitized in a reusable way.
