Humans perceive the world through many channels, such as images viewed by the eyes or voices heard by the ears.
Though any individual channel might be incomplete or noisy, humans can naturally align and fuse information collected from multiple channels, in order to grasp the key concepts needed for a better understanding of the world.
One of the core aspirations in artificial intelligence is to develop algorithms that endow computers with an ability
to learn effectively from multimodal (or multi-channel) data, much like the sights and sounds, drawn from vision and language, that help humans make sense of the world around them. For example, a computer could mimic this ability by retrieving the images most relevant to a text query (or vice versa), or by describing the content of an image in natural language.
Vision-and-Language (VL), a popular research area that sits at the nexus of Computer Vision and Natural Language Processing (NLP), aims to achieve this goal. Inspired by the great success of language model pre-training in NLP, Vision-and-Language Pre-training (VLP) has recently attracted rapidly growing attention from both communities.
In this tutorial, we will cover the most recent approaches and principles at the frontier of VLP, including
(1) region-feature-based and end-to-end image-text pre-training; (2) unified vision-language modeling; (3) extensions to video-language pre-training; (4) learning visual models from language supervision; and (5) visual synthesis.
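To make the cross-modal retrieval example above concrete, below is a minimal sketch (not part of the tutorial materials) of text-to-image matching with a contrastively pre-trained vision-language model, assuming the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; the image file names are placeholders.

```python
# Minimal sketch: score a text query against candidate images with a
# pre-trained CLIP model (Hugging Face `transformers`). File names are placeholders.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = "a dog catching a frisbee on the beach"
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # placeholder candidate images
images = [Image.open(p) for p in image_paths]

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_texts, num_images); a higher score means a better match.
scores = outputs.logits_per_text[0]
best = scores.argmax().item()
print(f"Best match for the query: {image_paths[best]} (score {scores[best].item():.2f})")
```

The same similarity scores can be used in the reverse direction (image-to-text retrieval) by ranking captions for a given image instead.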
The tutorial will be a full-day event (9:00 am to 5:00 pm) with several breaks in between.
Our program is divided into morning and afternoon sessions. The morning session focuses on image-text pre-training; the afternoon session shifts to the remaining topics. Recordings of each talk are now posted on Bilibili [Playlist], YouTube [Playlist], and the Microsoft Research Talk Series [Playlist].
| Time | Talk | Speaker |
| --- | --- | --- |
| **Morning Session** | | |
| 9:00 - 9:15 | Opening Remarks [Bilibili, YouTube] | Lijuan Wang |
| 9:15 - 10:00 | Overview of Image-Text Pre-training [Slides] [Bilibili, YouTube] | Jianfeng Wang |
| 10:00 - 10:15 | Coffee Break & Q&A | |
| 10:15 - 11:00 | Unified Image-Text Modeling [Slides] [Bilibili, YouTube] | Zhengyuan Yang |
| 11:00 - 11:45 | Advanced Topics in Image-Text Pre-training [Slides] [Bilibili, YouTube] | Zhe Gan |
| 11:45 - 12:00 | Q&A | |
| **Afternoon Session** | | |
| 13:00 - 13:30 | Overview of Video-Text Pre-training [Slides] [Bilibili, YouTube] | Kevin Lin |
| 13:30 - 14:00 | Learning from Multi-channel Videos: Methods and Benchmarks [Slides] [Bilibili, YouTube] | Linjie Li |
| 14:00 - 14:30 | Advanced Topics in Video-Text Pre-training [Slides] [Bilibili, YouTube] | Chung-Ching Lin |
| 14:30 - 14:45 | Coffee Break & Q&A | |
| 14:45 - 15:15 | VLP for Image Classification [Slides] [Bilibili, YouTube] | Jianwei Yang |
| 15:15 - 15:45 | VLP for Object Detection [Slides] [Bilibili, YouTube] | Pengchuan Zhang |
| 15:45 - 16:15 | Benchmarks for Computer Vision in the Wild [Slides] [Bilibili, YouTube] | Chunyuan Li |
| 16:15 - 17:00 | VLP for Text-to-Image Synthesis [Slides] [Bilibili, YouTube] | Chenfei Wu |
| 17:00 - 17:15 | Q&A | |
This year, June 19 and 20 mark Juneteenth, a US holiday commemorating the end of slavery in the US, and a holiday of special significance in the US South. We encourage attendees to learn more about Juneteenth and its historical context, and to join the city of New Orleans in celebrating the Juneteenth holiday. You can find out more information about Juneteenth here: https://cvpr2022.thecvf.com/recognizing-juneteenth.
Contact the Organizing Committee: vlp-tutorial@googlegroups.com