Recent Advances in Vision-and-Language Pre-training

In conjunction with CVPR 2022

June 19th 2022 (9:00 AM - 5:00 PM CDT)

Location: New Orleans, Louisiana



Humans perceive the world through many channels, such as images viewed by the eyes or voices heard by the ears. Though any individual channel might be incomplete or noisy, humans can naturally align and fuse the information collected from multiple channels to grasp the key concepts needed for a better understanding of the world. One of the core aspirations of artificial intelligence is to develop algorithms that endow computers with the ability to learn effectively from multi-modality (or multi-channel) data, much like the sights and sounds, attained through vision and language, that help humans make sense of the world around them. For example, computers could mimic this ability by retrieving the images most similar to a text query (or vice versa), or by describing the content of an image in natural language.

Vision-and-Language (VL), a popular research area that sits at the nexus of Computer Vision and Natural Language Processing (NLP), aims to achieve this goal. Inspired by the great success of language model pre-training in NLP, Vision-and-Language Pre-training (VLP) has recently attracted rapidly growing attention from both communities. In this tutorial, we will cover the most recent approaches and principles at the frontier of VLP, including (1) region-feature-based and end-to-end image-text pre-training; (2) unified vision-language modeling; (3) its extension to video-language pre-training; (4) learning visual models from language supervision; and (5) visual synthesis. The tutorial will be a full-day event (9:00 am to 5:00 pm) with several breaks in between.

Program (CDT, UTC-5)

Our program is divided into morning and afternoon sessions. In the morning, we focus on image-text pre-training; in the afternoon, we shift our discussion to other topics. Recordings of each talk are now posted on Bilibili [Playlist], YouTube [Playlist], and Microsoft Research Talk Series [Playlist].

Morning Session
9:00 - 9:15 Opening Remarks   [Bilibili, YouTube] Lijuan Wang
9:15 - 10:00 Overview of Image-Text Pre-training   [Slides]   [Bilibili, YouTube] Jianfeng Wang
10:00 - 10:15 Coffee Break & QA  
10:15 - 11:00 Unified Image-Text Modeling  [Slides]   [Bilibili, YouTube] Zhengyuan Yang
11:00 - 11:45 Advanced Topics in Image-Text Pre-training   [Slides]   [Bilibili, YouTube] Zhe Gan
11:45 - 12:00 Q & A  
Afternoon Session
13:00 - 13:30 Overview of Video-Text Pre-training   [Slides]   [Bilibili, YouTube] Kevin Lin
13:30 - 14:00 Learning from Multi-channel Videos: Methods and Benchmarks   [Slides]   [Bilibili, YouTube] Linjie Li
14:00 - 14:30 Advanced Topics in Video-Text Pre-training   [Slides]   [Bilibili, YouTube] Chung-Ching Lin
14:30 - 14:45 Coffee Break & QA  
14:45 - 15:15 VLP for Image Classification   [Slides]   [Bilibili, YouTube] Jianwei Yang
15:15 - 15:45 VLP for Object Detection   [Slides]   [Bilibili, YouTube] Pengchuan Zhang
15:45 - 16:15 Benchmarks for Computer Vision in the Wild   [Slides]   [Bilibili, YouTube] Chunyuan Li
16:15 - 17:00 VLP for Text-to-Image Synthesis   [Slides]   [Bilibili, YouTube] Chenfei Wu
17:00 - 17:15 Q & A  


Zhe Gan


Chunyuan Li


Linjie Li


Chung-Ching Lin


Kevin Lin


Jianfeng Wang


Chenfei Wu


Jianwei Yang


Zhengyuan Yang


Pengchuan Zhang

Meta AI

Lijuan Wang


Zicheng Liu


Jianfeng Gao


Juneteenth Holiday

This year, June 19 and 20 mark Juneteenth, a US holiday commemorating the end of slavery, and one of special significance in the US South. We encourage attendees to learn more about Juneteenth and its historical context, and to join the city of New Orleans in celebrating the holiday. You can find out more information about Juneteenth here:


Contact the Organizing Committee: