Indexing the Dead Sea Scrolls
A Jesuit priest and an IBMer collaborated to decrypt the earliest writings of the Bible — and kicked off a revolution in natural language processing

There was a time not long ago when the utility of computers was thought to be confined to accounting. Data processing machines were designed and built to calculate numbers. That began to change in 1949 when a determined Jesuit priest, Father Roberto Busa, approached IBM Chairman and CEO Thomas J. Watson Sr. with a hugely ambitious idea.

A professor at the Jesuit College of the Aloisianum in Gallarate, Italy, Busa had been attempting to compile an analytical index, or concordance, of Catholicism’s greatest theologian and one of history’s most prolific writers, St. Thomas Aquinas. But extracting, defining, organizing and understanding the shades of meaning in some 13 million words from the oeuvre of a 13th-century philosopher was proving impossible to do manually. So Busa conducted a tour of American universities in search of help. He gained an audience with Watson, who was on the board of trustees at Columbia University and was actively searching for ways to expand the utility of IBM’s machines in humanities research. Watson was immediately taken by Busa’s quest. “Even if you had time to waste for the rest of your life, you couldn’t do a job like that,” Watson said. “You seem to be more go-ahead and American than we are!”

Watson assigned Paul Tasman, an IBM engineer, as the point person. The unlikely duo gathered a small team and eight IBM 705 tabulators. They developed a methodology to input text, broken down into phrases, on punched cards. Eventually each word had its own card, and these were then fed to the machines to compile and sort the data. In five years, the system produced a full concordance — a feat that, according to one estimate, would have otherwise required 50 scholars as long as four decades to produce.

As a reward for their toil, the team embarked on an even more ambitious project to index the Dead Sea Scrolls, a collection of 183 works thought to contain the oldest writings of the Old Testament. Busa and Tasman imagined the project might kick off a wave of linguistics research that could open the utility of computing more broadly. It accomplished far more. The Dead Sea Scrolls project can be considered one of the first digital humanities projects. It also, in many ways, serves as the foundation of the field of natural language processing, which underpins many of today’s conveniences, from internet search engines and smart speakers to instantaneous translation and automated customer service systems.

Reconstructing 40,000 fragments

In 1947, a Bedouin farm boy found a cache of Dead Sea Scrolls while chasing after his goats in a cave near Jordan. It would turn out to be just a portion of the 183 scrolls — comprising 40,000 fragments of 400 documents — thought to contain the oldest known works of the Old Testament and the foundations of Christianity.

Analyzing this trove of content was considered beyond human capability. For starters, the scrolls were in various states of destruction. “Many of them had crumbled into dust, and they had to be assembled in bits and pieces,” Tasman recalled in a 1968 interview. So the team set out to use “the same kind of programming format as we used on the St. Thomas project to artificially reconstruct the text that had been obliterated.”

On the Aquinas project, this meant identifying words and searching for repeated usage in the context of others. Tasman spent three months creating software to track word frequency, use and sequence. As difficult as it had been to ascribe meaning to words written centuries earlier, the scrolls project was even more complicated. The texts were a mix of dead and unknown languages: Ancient Hebrew, Aramaic and Nabatean. “We didn’t know, in effect, what constituted a word,” Tasman said. “I personally worked directly with the padre. I had to teach him, in effect, the IBM machines, and at the same time he was teaching me linguistics.”

Rather than hunting words to fill in what was missing, the team identified groups of letters and analyzed how many times a group appeared in the context of another group. “By doing this, we were able to reconstruct approximately 85% of the obliterated text,” Tasman said.
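The idea of counting how often one group of letters appears next to another can be illustrated with a short modern sketch. This is not Tasman’s program, which ran on punched cards and magnetic tape; it is a minimal Python illustration that assumes fixed-size letter groups, uses Latin letters as a stand-in for the scrolls’ scripts, and fills a gap with the group most often seen after the one that precedes it. The function names and the group size are hypothetical.

```python
from collections import Counter, defaultdict


def build_context_counts(text, n=2):
    """Count how often each n-letter group is immediately followed by each other n-letter group.

    Assumption: groups are fixed-size and adjacent; the article does not
    say how the team actually delimited its letter groups.
    """
    counts = defaultdict(Counter)
    for i in range(len(text) - 2 * n + 1):
        left = text[i:i + n]
        right = text[i + n:i + 2 * n]
        counts[left][right] += 1
    return counts


def guess_missing(counts, left_group):
    """Return the group most often seen after left_group, or None if it was never seen."""
    followers = counts.get(left_group)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy surviving text; a real corpus would be far larger.
surviving = "abcdabcdabxy"
counts = build_context_counts(surviving, n=2)
candidate = guess_missing(counts, "ab")  # most frequent follower of "ab"
```

In this toy text, "ab" is followed by "cd" twice and by "xy" once, so "cd" is the suggested fill. A suggestion like this is only a statistical candidate that a scholar would still need to judge.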

The method of mechanical indexing, Busa explained, involved 44 steps in all. The only manual work required was punching, verifying and checking cards at the outset. The scholars also used mark-sensing, a new technology that allowed for corrections before printing. Under Busa’s direction, a team at the newly formed literary data processing center in Gallarate prepared punched cards for each word. The cards were then flown to New York, where the complex layers of data (such as the location and order of words; the first letter of both the preceding and following words; and the graphic-semantic word family to which the word belongs) were converted to magnetic tape by the 705. The final alphabetical summary list was produced in Hebrew by the computer’s printing unit.
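The per-word record described above can be pictured as a small data structure. The following Python sketch is an illustration under stated assumptions, not a reconstruction of the 44-step process: the `WordCard` fields and function names are hypothetical, words are split on whitespace, neighbors are looked up only within the same line, and the graphic-semantic word family is omitted because assigning it required scholarly judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordCard:
    """One record per word, loosely mirroring the fields the article lists."""
    word: str
    line: int          # location: 1-based line number
    position: int      # order: 1-based position within the line
    prev_initial: str  # first letter of the preceding word ("" if none)
    next_initial: str  # first letter of the following word ("" if none)


def make_cards(lines):
    """Produce one WordCard for every word in the given lines of text."""
    cards = []
    for line_no, line in enumerate(lines, start=1):
        words = line.split()
        for pos, word in enumerate(words, start=1):
            prev_initial = words[pos - 2][0] if pos > 1 else ""
            next_initial = words[pos][0] if pos < len(words) else ""
            cards.append(WordCard(word, line_no, pos, prev_initial, next_initial))
    return cards


def alphabetical_index(cards):
    """Sort cards by word, then by location, like an alphabetical summary list."""
    return sorted(cards, key=lambda c: (c.word, c.line, c.position))


cards = make_cards(["in the beginning", "the word"])
index = alphabetical_index(cards)
```

Sorting on the word first and the location second groups every occurrence of a word together while keeping the occurrences in reading order, which is the basic shape of a concordance.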

A new dawn in linguistics research

The project’s intent wasn’t to determine the content of the Dead Sea Scrolls. Rather, Busa and Tasman wanted to give religious scholars a powerful new tool to shape their own interpretations. They hoped to democratize the study of important works for researchers working in various languages, open a new chapter in literary and linguistic research, and maybe even make an impact across science and industry.

“We have developed an exclusive approach that permits us to find … the frequencies and rhythms of all the elements of language,” Busa said. “One of the most pressing problems has always been the indexing of all basic recorded knowledge. In the chemical industry alone there is the problem of correctly recording an almost infinite number of patents and abstracts. This has led to the virtual impossibility of searching all available records on any given subject and has caused great loss of time.”

Tasman felt similarly. “These indexes … furnish the scholar with a most valuable research tool,” he wrote in a 1957 paper. “The impact of machine analysis will soon be felt in law, medicine, library science, chemical and oil industries, scientific and engineering research, and wherever the increasing volume of data makes its fingertip availability imperative.”

Decades later, those words seem remarkably prescient in today’s era of Big Data and AI. Automated language interpretation — now widely known as natural language processing — is among the most vibrant areas of scientific research, providing great benefits to industry and society. A blend of computer science, artificial intelligence and linguistics, NLP has given rise to many modern technologies, including smart speakers, modes of instantaneous translation, and automated customer service. Just as Busa anticipated, scientists use NLP to pore over millions of published papers. As Tasman foresaw, businesses harness NLP to identify patterns and insights in correspondence. And, of course, billions of people turn to the power of NLP every day to search the web.

And it was all born out of the mind of an Italian priest on an impossible mission.
