A comparative study of language Transformers for video question answering

Abstract

Visual question answering (VQA), which aims to correctly answer questions about images or videos, has developed rapidly in recent years. However, current VQA systems mainly focus on answering questions about a single image and still struggle with video-based questions. Video VQA must not only model the temporal evolution across video frames but also understand the corresponding subtitles. In this paper, we propose a language Transformer-based video question answering model that encodes the complex semantics of video clips. Unlike previous models, which represent visual features with recurrent neural networks, our model encodes visual concept sequences with a pre-trained language Transformer. We investigate the performance of our model with four language Transformers on two different datasets. The results demonstrate substantial improvements over previous work.
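As a rough illustration of the core idea, the sketch below serializes a clip's visual concepts (for example, detected object and action labels) into a token sequence and encodes it with a pre-trained language Transformer rather than a recurrent network. It uses BERT from the HuggingFace Transformers library; the concept labels, checkpoint, and [CLS] pooling are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch: encoding a clip's visual concept sequence with a
# pre-trained language Transformer (BERT) instead of an RNN.
# Concept labels, checkpoint, and pooling are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Hypothetical visual concepts detected in consecutive frames of a clip.
frame_concepts = ["man walking a dog", "dog jumping", "man throwing a ball"]

# Serialize the per-frame concepts into a single token sequence.
inputs = tokenizer(" [SEP] ".join(frame_concepts), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Take the [CLS] embedding as a clip-level representation of the
# visual concept sequence (shape [1, 768] for bert-base).
clip_embedding = outputs.last_hidden_state[:, 0, :]
print(clip_embedding.shape)
```

In the same spirit, the subtitles and the question can be fed to the Transformer as additional text segments, which is what makes a pre-trained language model a natural encoder for both modalities.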

Published in
Neurocomputing
Zekun Yang
Ph.D. Student
Noa Garcia
Specially Appointed Assistant Professor

Her research interests lie in computer vision and machine learning applied to visual retrieval and joint models of vision and language for high-level understanding tasks.

Chenhui Chu
Guest Associate Professor
中島悠太
Associate Professor

His research covers computer vision and pattern recognition, focusing on image and video recognition and understanding with deep neural networks, along with applied work that draws on natural language processing.
