MALACH: Multilingual Access to Large Spoken Archives

Participating Institutions

University of Southern California Shoah Foundation Institute for Visual History & Education

IBM Thomas J. Watson Research Center

Johns Hopkins University

University of Maryland

Charles University

University of West Bohemia

AITIA International, Inc.

Digital archiving of the spoken word is emerging as an important method for capturing the human experience; in the future a great deal of our cultural heritage will be archived in this form. If we are to learn from our past, teachers, students, historians, and others will need effective access to these resources. The enormous scale of these collections and the tremendous expense of manually cataloging multilingual audiovisual materials will make it impractical to rely on manual techniques alone. At present, however, fully automatic techniques are far from adequate.

We will overcome these difficulties by using a unique collection assembled by the Shoah Visual History Foundation. Presently the world's largest coherent archive of videotaped oral histories, it contains 116,000 hours of digitized interviews in 32 languages from 52,000 survivors, liberators, rescuers, and witnesses of the Nazi Holocaust.

We propose to dramatically improve access to large multilingual collections of recorded speech by advancing the state of the art in technologies that work together to achieve this objective:

automatic speech recognition
computer-assisted translation of domain-specific multilingual thesauri
natural language processing techniques for automated creation of metadata
support for efficient professional cataloging
support for search and exploration.

In all of these efforts, we will automate the transfer of capabilities developed originally for English to other languages. We will provide access to multilingual materials by combining knowledge-based and corpus-based techniques to extend existing thesauri to new languages and by supporting cross-language searching of manually prepared segment-level summaries and automatic speech recognition transcripts. Advancing the state of the art in this technology will produce significantly improved access to this collection as well as to other artifacts of our cultural heritage.

Publications