Multimodal AI for Education: Expanding Learning Beyond Text

Steven-Shine Chen, Jimin Lee

Education is inherently multimodal: we learn not only through words but also through visuals, interactions, and hands-on experiences. Traditional AI tutoring systems rely heavily on text-based feedback, yet human learners often depend on sketches, diagrams, and interactive exploration to grasp complex ideas. To create AI that truly enhances education, we must move beyond text and develop multimodal AI systems that integrate vision, spatial reasoning, and interactivity into the learning process.

Our research explores how Large Multimodal Models (LMMs) can transform education by making learning more intuitive, interactive, and accessible. By integrating visual reasoning, sketch-based interaction, and real-time feedback, we aim to create AI-driven tools that support a wide range of learners and learning styles, from K-12 students to professionals.

Interactive Sketchpad: A Step Toward Multimodal AI in Education

One of our key contributions in this area is Interactive Sketchpad, an AI-powered tutoring system that enables students to solve math problems through interactive, visual collaboration with AI. Traditional tutoring systems struggle with geometry, calculus, and other spatial reasoning tasks because they provide feedback only in text. Interactive Sketchpad combines step-by-step explanations with AI-generated visualizations, allowing students to engage with mathematical concepts more naturally.

Built on a pre-trained Large Multimodal Model, Interactive Sketchpad is fine-tuned to dynamically generate problem-solving diagrams using code execution. Students receive AI-generated visual hints and can interact with the system through a shared digital sketchpad, where they can draw, annotate, and explore problems visually.
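
To make the idea of generating diagrams through code execution concrete, here is a minimal sketch. It is not the Interactive Sketchpad implementation. It assumes the model returns a short Python snippet as text (the `GENERATED_CODE` string is a hand-written, hypothetical stand-in) and that the snippet stores an SVG drawing in a variable named `svg`. The helper `render_diagram` is likewise a name invented for this illustration.

```python
import math

# Hypothetical stand-in for code a multimodal model might emit when asked
# for a visual hint about a right triangle with legs 3 and 4.
GENERATED_CODE = '''
a, b = 3, 4
scale = 40
pts = [(20, 20 + b * scale), (20 + a * scale, 20 + b * scale), (20, 20)]
points = " ".join(f"{x},{y}" for x, y in pts)
hyp = math.hypot(a, b)
svg = (
    f'<svg xmlns="http://www.w3.org/2000/svg" width="200" height="220">'
    f'<polygon points="{points}" fill="none" stroke="black"/>'
    f'<text x="90" y="100">c = {hyp:g}</text>'
    f'</svg>'
)
'''


def render_diagram(code: str) -> str:
    """Execute diagram code in a small namespace and return its `svg` string.

    Assumption: limiting builtins only narrows what the snippet can call.
    It is not a security sandbox; a real system would need process-level
    isolation before running model-written code.
    """
    allowed_builtins = {"range": range, "len": len, "min": min, "max": max, "round": round}
    namespace = {"__builtins__": allowed_builtins, "math": math}
    exec(code, namespace)
    svg = namespace.get("svg")
    if not isinstance(svg, str) or not svg.startswith("<svg"):
        raise ValueError("generated code did not produce an SVG string")
    return svg


if __name__ == "__main__":
    # Print the SVG markup; saving it to a .svg file shows the triangle.
    print(render_diagram(GENERATED_CODE))
```

In a tutoring loop, the returned SVG could be displayed next to the step-by-step explanation, and a snippet that fails to produce a drawing would raise an error the system can catch before falling back to a text-only hint.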

To learn more about Interactive Sketchpad, see the paper and code here, or watch the overview video below.

Tools like Interactive Sketchpad are a step toward a future where AI tutors can reason visually, understand sketches, and provide rich, interactive learning experiences. Our work highlights the potential of multimodal AI to make education more engaging and personalized, paving the way for multimodal human-AI collaboration in learning.