CVPR2025 Tutorial: Recent Advanced in Vision Foundation Models

CVPR 2025 Tutorial on "Recent Advances in Vision Foundation Models"

We present our CVPR tutorial proposal on Recent Advances in Vision Foundation Models, a topic that has garnered significant attention from the computer vision community. Our tutorial will cover the most advanced directions in designing and developing vision foundation models, including the state-of-the-art approaches and principles in (i) learning vision foundation models for multimodal understanding and generation, (ii) scaling test-time compute and enabling the self-training of foundation models to improve themselves on reasoning and perception, and (iii) physical and virtual agents based on vision foundation models that can take actions for robotics and in virtual environments.

Program (CDT)

You are welcome to join our tutorial either in-person or virtually via Zoom (Click into the CVPR2025 portal to find the Zoom link).

Afternoon Session
13:00 - 13:50	Advancing Multimodal LLMs: From Seeing to Understanding and Acting [Slides]	Zhe Gan
13:50 - 14:40	Multimodal Reasoning for Visual-Centric Long-Horizon Tasks [Slides]	Zhengyuan Yang
14:40 - 15:00	Coffee Break & QA
15:00 - 15: 50	See. Think. Act. Training Multimodal Agents with Reinforcement Learning [Slides]	Linjie Li
15:50 - 16: 40	Towards Multimodal AI Agent That Can See, Think and Act [Slides]	Jianwei Yang
16:40 - 17:00	Closing Remarks & QA