Integration of Speech and Vision using Mutual Information

June 5, 2000

Deb Roy


We are developing a system that learns words from co-occurring spoken and visual input. The goal is to automatically segment continuous speech at word boundaries without a lexicon, and to form visual categories that correspond to spoken words. Mutual information is used to integrate acoustic and visual distance metrics to extract an audio-visual lexicon from raw input. We report results of experiments with a corpus of infant-directed speech and images.
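
As an illustration of how mutual information can tie an acoustic distance to a visual distance, the sketch below scores one candidate lexical item. It is a minimal example under stated assumptions, not the system described here: each observation is assumed to be reduced to a pair of distances from a candidate prototype, each distance is thresholded into a binary "match" indicator, and the mutual information between the two indicators is computed. The function names, the radii, and the sample distances are all hypothetical.

```python
import math
from collections import Counter


def match_indicators(acoustic_dists, visual_dists, acoustic_radius, visual_radius):
    """Turn paired distances into paired binary 'match' indicators.

    The radii are hypothetical thresholds: a distance at or below the
    radius counts as a match to the candidate prototype.
    """
    return [
        (da <= acoustic_radius, dv <= visual_radius)
        for da, dv in zip(acoustic_dists, visual_dists)
    ]


def mutual_information(pairs):
    """Mutual information, in bits, between two discrete variables.

    `pairs` is a list of (a, v) observations; probabilities are
    estimated by simple relative frequency.
    """
    n = len(pairs)
    joint = Counter(pairs)
    marginal_a = Counter(a for a, _ in pairs)
    marginal_v = Counter(v for _, v in pairs)
    mi = 0.0
    for (a, v), count in joint.items():
        p_av = count / n
        p_a = marginal_a[a] / n
        p_v = marginal_v[v] / n
        mi += p_av * math.log2(p_av / (p_a * p_v))
    return mi


# Hypothetical distances of four observations from one candidate prototype.
acoustic = [0.2, 0.3, 1.5, 1.8]
visual = [0.1, 0.4, 2.0, 1.7]
pairs = match_indicators(acoustic, visual, acoustic_radius=1.0, visual_radius=1.0)
print(mutual_information(pairs))
```

A high score means that acoustic matches and visual matches tend to occur together, which is the kind of evidence that would favor keeping a sound-and-shape pairing as a lexical item. A score near zero means the two indicators are close to independent.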
