Publication

Integration of Speech and Vision using Mutual Information

June 5, 2000

People

Deb Roy

Professor of Media Arts and Sciences

Abstract

We are developing a system which learns words from co-occurring spoken and visual input. The goal is to automatically segment continuous speech at word boundaries without a lexicon, and to form visual categories which correspond to spoken words. Mutual information is used to integrate acoustic and visual distance metrics in order to extract an audio-visual lexicon from raw input. We report results of experiments with a corpus of infant-directed speech and images.
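As a rough illustration of how mutual information could tie an acoustic distance metric to a visual one, the sketch below thresholds each distance into a binary "match" indicator and measures how much the two indicators tell us about each other. This is not the paper's implementation: the function names, the radii, and the toy distances are all hypothetical, chosen only to show the idea.

```python
import math
from typing import List, Sequence


def match_indicators(distances: Sequence[float], radius: float) -> List[bool]:
    """Mark each observation as a match if it lies within `radius` of a
    candidate prototype (hypothetical thresholding step)."""
    return [d <= radius for d in distances]


def mutual_information(a: Sequence[bool], v: Sequence[bool]) -> float:
    """Mutual information, in bits, between two binary indicator sequences,
    estimated from empirical frequencies."""
    if len(a) != len(v) or len(a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    mi = 0.0
    for x in (False, True):
        p_a = sum(1 for ai in a if ai == x) / n
        for y in (False, True):
            p_v = sum(1 for vi in v if vi == y) / n
            p_joint = sum(1 for ai, vi in zip(a, v) if ai == x and vi == y) / n
            # Zero-probability cells contribute nothing to the sum.
            if p_joint > 0:
                mi += p_joint * math.log2(p_joint / (p_a * p_v))
    return mi


# Toy data (assumed, not from the paper): distances from one candidate
# audio-visual prototype to four other observations.
acoustic_distances = [0.2, 0.3, 1.5, 1.8]
visual_distances = [0.1, 0.4, 2.0, 1.7]

audio_match = match_indicators(acoustic_distances, radius=0.5)
visual_match = match_indicators(visual_distances, radius=0.5)
score = mutual_information(audio_match, visual_match)
```

A high score means that observations which sound like the prototype also tend to look like it, which is the kind of evidence one would want before admitting a word-shape pairing into a lexicon. How the radii are chosen and how candidates are generated is left open here.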

icassp2000.pdf