
Project

WatchThis: A Wearable Point-and-Ask Interface powered by Vision-Language Models for Contextual Queries

Copyright

Cathy Mengying Fang


This paper introduces WatchThis, a novel wearable device that enables natural language interactions with real-world objects and environments through pointing gestures. Building upon previous work in gesture-based computing interfaces, WatchThis leverages recent advancements in Large Language Models (LLMs) and Vision Language Models (VLMs) to create a hands-free, contextual querying system. The prototype consists of a wearable watch with a rotating, flip-up camera that captures the area of interest when pointing, allowing users to ask questions about their surroundings in natural language. This design addresses limitations of existing systems that require specific commands or occupy the hands, while also maintaining a non-discreet form factor that keeps bystanders socially aware of the camera. The paper explores various applications of this point-and-ask interaction, including object identification, translation, and instruction queries. By utilizing off-the-shelf components and open-sourcing the design, this work aims to facilitate further research and development in wearable, AI-enabled interaction paradigms.
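The point-and-ask interaction described above can be sketched as a simple pipeline: capture an image when the user points, take a natural-language question, and forward both to a VLM. The sketch below is illustrative only; the function names, request format, and stubbed model call are assumptions, not the released firmware or any specific VLM API.

```python
import base64

def encode_image(image_bytes: bytes) -> str:
    """Base64-encode a captured JPEG so it can travel in a JSON request."""
    return base64.b64encode(image_bytes).decode("ascii")

def build_vlm_request(image_bytes: bytes, question: str) -> dict:
    """Pack the pointed-at scene and the user's question into one query."""
    return {
        "image_b64": encode_image(image_bytes),
        "question": question,
    }

def point_and_ask(capture, transcribe, query_vlm) -> str:
    """One interaction: capture on point, transcribe speech, query the VLM.

    `capture`, `transcribe`, and `query_vlm` are injected callables; on the
    real device they would wrap the flip-up camera, a speech-to-text step,
    and a hosted vision-language model, respectively.
    """
    image = capture()        # triggered when the user points the watch
    question = transcribe()  # e.g. "What plant is this?"
    return query_vlm(build_vlm_request(image, question))
```

In use, the same loop works for all three applications mentioned (identification, translation, instructions) because the question alone selects the task; only the `query_vlm` backend needs to speak to an actual model.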

Featured on:

Hackster.io: https://www.hackster.io/news/point-and-search-b5697e0b9c88

Seeed Studio project spotlight: https://www.seeedstudio.com/blog/2024/10/21/watchthis-a-wearable-point-and-ask-interface-powered-by-vision-language-models-and-xiao-esp32s3-sense/