Recent Advances in Vision Foundation Models


In conjunction with CVPR 2023

June 19th 2023 (9:00 AM - 12:30 PM PDT)

Location: East 16, Vancouver Convention Center



Visual understanding at different levels of granularity has been a longstanding problem in the computer vision community. The tasks span image-level tasks (e.g., image classification, image-text retrieval, image captioning, and visual question answering), region-level localization tasks (e.g., object detection and phrase grounding), and pixel-level grouping tasks (e.g., image instance/semantic/panoptic segmentation). Until recently, most of these tasks have been tackled separately with specialized model designs, preventing the synergy across tasks of different granularities from being exploited.

In light of the versatility of transformers and inspired by large-scale vision-language pre-training, the computer vision community is now witnessing a growing interest in building general-purpose vision systems, also called vision foundation models, that can learn from and be applied to various downstream tasks, ranging from image-level and region-level to pixel-level vision tasks.

In this tutorial, we will cover the most recent approaches and principles at the frontier of learning and applying vision foundation models, including (1) Visual and Vision-Language Pre-training; (2) Generic Vision Interface; (3) Alignments in Text-to-image Generation; (4) Multimodal LLMs; and (5) Multimodal Agents.

The tutorial will be a half-day event (9:00 AM to 12:30 PM).

Program (PDT, UTC-7)

You are welcome to join our tutorial either in person or virtually via Zoom (please log in to the CVPR 2023 portal to find the Zoom link). Recordings of each talk are now posted on Bilibili [Playlist] and YouTube [Playlist].

[2023/09/19] Check out our latest survey paper on vision foundation models.

9:00 - 9:40 Opening Remarks & Visual and Vision-Language Pre-training   [Slides]   [Bilibili, YouTube] Zhe Gan
9:40 - 10:20 From Representation to Interface: The Evolution of Foundation for Vision Understanding   [Slides]   [Bilibili, YouTube] Jianwei Yang
10:20 - 11:00 Alignments in Text-to-Image Generation   [Slides]   [Bilibili, YouTube] Zhengyuan Yang
11:00 - 11:40 Large Multimodal Models   [Slides, Notes]   [Bilibili, YouTube] Chunyuan Li
11:40 - 12:10 Multimodal Agents: Chaining Multimodal Experts with LLMs   [Slides]   [Bilibili, YouTube] Linjie Li
12:10 - 12:30 Q&A

Organizers

Linjie Li

Microsoft

Zhe Gan

Apple

Chunyuan Li

Microsoft

Jianwei Yang

Microsoft

Zhengyuan Yang

Microsoft

Jianfeng Gao

Microsoft

Lijuan Wang

Microsoft

Contacts

Contact the Organizing Committee: vlp-tutorial@googlegroups.com