CVPR 2022 Tutorial: Recent Advances in Vision-and-Language Pre-training

CVPR 2022 Tutorial on "Recent Advances in Vision-and-Language Pre-training"

Humans perceive the world through many channels, such as images viewed by the eyes or voices heard by the ears. Though any individual channel might be incomplete or noisy, humans can naturally align and fuse information collected from multiple channels to grasp the key concepts needed for a better understanding of the world. One of the core aspirations in artificial intelligence is to develop algorithms that endow computers with the ability to learn effectively from multi-modality (or multi-channel) data, much as the sights and sounds gathered through vision and language help humans make sense of the world around us. For example, computers could mimic this ability by retrieving the images most similar to a text query (or vice versa) and by describing the content of an image in natural language.

Vision-and-Language (VL), a popular research area that sits at the nexus of Computer Vision and Natural Language Processing (NLP), aims to achieve this goal. Inspired by the great success of language model pre-training in NLP, Vision-and-Language Pre-training (VLP) has recently attracted rapidly growing attention from both communities. In this tutorial, we will cover the most recent approaches and principles at the frontier of VLP, including (1) region-feature-based and end-to-end image-text pre-training; (2) unified vision-language modeling; (3) extensions to video-language pre-training; (4) learning visual models from language supervision; and (5) visual synthesis. The tutorial will be a full-day event (9:00 am to 5:00 pm) with several breaks.

Program (CDT, UTC-5)

Our program is divided into morning and afternoon sessions. In the morning, we focus on image-text pre-training. In the afternoon session, we shift our discussion to other topics. Recordings of each talk are now posted on Bilibili [Playlist], YouTube [Playlist] and Microsoft Research Talk Series [Playlist].

Morning Session
9:00 - 9:15	Opening Remarks [Bilibili, YouTube]	Lijuan Wang
9:15 - 10:00	Overview of Image-Text Pre-training [Slides] [Bilibili, YouTube]	Jianfeng Wang
10:00 - 10:15	Coffee Break and Q & A
10:15 - 11:00	Unified Image-Text Modeling [Slides] [Bilibili, YouTube]	Zhengyuan Yang
11:00 - 11:45	Advanced Topics in Image-Text Pre-training [Slides] [Bilibili, YouTube]	Zhe Gan
11:45 - 12:00	Q & A
Afternoon Session
13:00 - 13:30	Overview of Video-Text Pre-training [Slides] [Bilibili, YouTube]	Kevin Lin
13:30 - 14:00	Learning from Multi-channel Videos: Methods and Benchmarks [Slides] [Bilibili, YouTube]	Linjie Li
14:00 - 14:30	Advanced Topics in Video-Text Pre-training [Slides] [Bilibili, YouTube]	Chung-Ching Lin
14:30 - 14:45	Coffee Break and Q & A
14:45 - 15:15	VLP for Image Classification [Slides] [Bilibili, YouTube]	Jianwei Yang
15:15 - 15:45	VLP for Object Detection [Slides] [Bilibili, YouTube]	Pengchuan Zhang
15:45 - 16:15	Benchmarks for Computer Vision in the Wild [Slides] [Bilibili, YouTube]	Chunyuan Li
16:15 - 17:00	VLP for Text-to-Image Synthesis [Slides] [Bilibili, YouTube]	Chenfei Wu
17:00 - 17:15	Q & A

Organizers