Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization

概要

Video summarization aims to select a most informative subset of frames in a video to facilitate efficient video browsing. Unsupervised methods usually rely on heuristic training objectives such as diversity and representativeness. However, such methods need to bootstrap the online-generated summaries to compute the objectives for importance score regression. We consider such a pipeline inefficient and seek to directly quantify the frame-level importance with the help of contrastive losses in the representation learning literature. Leveraging the contrastive losses, we propose three metrics featuring a desirable key frame: local dissimilarity, global consistency, and uniqueness. With features pre-trained on an image classification task, the metrics can already yield high-quality importance scores, demonstrating better or competitive performance compared with past heavily-trained methods. We show that by refining the pre-trained features with contrastive learning, the frame-level importance scores can be further improved, and the model can learn from random videos and generalize to test videos with decent performance.

収録
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
Zongshang Pang
Zongshang Pang
博士後期課程学生
中島悠太
中島悠太
准教授

コンピュータビジョン・パターン認識などの研究。ディープニューラルネットワークなどを用いた画像・映像の認識・理解を主に、自然言語処理を援用した応用研究などに従事。

長原一
長原一
教授

コンピューテーショナルフォトグラフィ、コンピュータビジョンを専門とし実世界センシングや情報処理技術、画像認識技術の研究を行う。さらに、画像センシングにとどまらず様々なセンサに拡張したコンピュテーショナルセンシング手法の開発や高次元で冗長な実世界ビッグデータから意味のある情報を計測するスパースセンシングへの転換を目指す。

関連項目