As we move towards an increasingly IoT-enabled ecosystem, it is easier than ever before to capture vast amounts of audio data. However, there are many scenarios in which we may seek a "compressed" representation of an audio stream, consisting of an intentional curation of content to achieve a specific presentation—a background soundtrack for studying or working; a summary of salient events over the course of a day; or an aesthetic soundscape that evokes nostalgia for a time and place.
In this work, we present a novel, automated approach to the task of content-driven "compression," built upon the tenets of auditory cognition, attention, and memory. We expand upon our previous experimental findings, which demonstrate the relative importance of higher-level gestalt and lower-level spectral principles in determining auditory memory, to design corresponding computational implementations enabled by auditory saliency models, deep neural networks for audio classification, and spectral feature extraction. We demonstrate the approach by generating several 30-second binaural mixes from eight-hour recordings captured in three contrasting locations at the Media Lab, and we conduct a qualitative evaluation illustrating the relationship between our feature space and a user's perception of the resulting presentations. Through this work, we suggest rethinking traditional paradigms of compression in favor of an approach that is goal-oriented and modulated by human perception.
A brief demo and explanation of our work can be found here: https://resenv.media.mit.edu/audio-summarization/