VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples (CVPR2021)

Sep 8, 2021

Introduction

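VideoMoCo builds on the MoCo architecture and proposes two additions: temporally adversarial learning and temporal decay.
Temporally adversarial learning takes a fixed-length clip of video frames and, borrowing the idea of a GAN, lets a generator drop out some of the frames while the discriminator learns to recognize that the clip is still the same video after the drop out.
Temporal decay adds an index to the keys in the MoCo queue, so that the older a key is, the less influence it has as a negative sample.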

Method

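Unlike typical image generation, the generator here outputs an importance score for each frame. A drop-out mask is built for the top 25% of frames by that score and multiplied with the input video, producing the augmentation that contrastive learning (CL) needs. The discriminator is simply the CL encoder, so there are two branches: one takes the clip before drop out and the other takes the clip after it. These can be viewed as two different augmentations in CL, and the goal is for the two encoded representations to be close.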
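The masking step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the toy clip, and the importance scores are made up, and the block assumes that "top 25%" means the highest-scoring frames are the ones dropped. In VideoMoCo the scores would come from a learned generator rather than a hard-coded list.

```python
def temporal_dropout_mask(importance, drop_ratio=0.25):
    """Return a 0/1 mask per frame; 0 marks a dropped frame."""
    num_frames = len(importance)
    num_drop = int(num_frames * drop_ratio)
    # Rank frame indices by score, highest first (assumed direction).
    ranked = sorted(range(num_frames), key=lambda i: importance[i], reverse=True)
    dropped = set(ranked[:num_drop])
    return [0 if i in dropped else 1 for i in range(num_frames)]


def apply_mask(frames, mask):
    """Multiply every value of frame i by mask[i], zeroing dropped frames."""
    return [[value * m for value in frame] for frame, m in zip(frames, mask)]


# Toy clip: 8 frames, each flattened to 2 values (hypothetical data).
frames = [[float(i), float(i) + 0.5] for i in range(8)]
# Made-up per-frame importance, standing in for the generator output.
importance = [0.1, 0.9, 0.3, 0.2, 0.8, 0.4, 0.05, 0.6]

mask = temporal_dropout_mask(importance)
augmented = apply_mask(frames, mask)
print(mask)  # [1, 0, 1, 1, 0, 1, 1, 1]
```

The original `frames` and the masked `augmented` clip then play the role of the two views fed to the two encoder branches.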
The generator's loss is an L1 loss on the representations, while the discriminator's loss adds temporal decay on the keys; in practice t = 0.99999.
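A rough sketch of the two losses, written in plain Python for readability. Several things here are assumptions rather than the paper's exact formulation: the decay is applied as a weight `t ** age` on each negative term of an InfoNCE-style loss, `age` counts how long a key has been in the queue, and the temperature `tau = 0.07` and all function names are illustrative.

```python
import math


def l1_distance(a, b):
    """Mean absolute difference between two representation vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decayed_info_nce(query, positive_key, queue_keys, ages, t=0.99999, tau=0.07):
    """InfoNCE-style loss where queue key i is weighted by t ** ages[i]."""
    pos = math.exp(dot(query, positive_key) / tau)
    neg = sum(
        (t ** age) * math.exp(dot(query, key) / tau)
        for key, age in zip(queue_keys, ages)
    )
    return -math.log(pos / (pos + neg))


# Toy unit-length representations (hypothetical values).
query = [1.0, 0.0]
positive_key = [0.8, 0.6]
queue_keys = [[0.6, 0.8], [0.0, 1.0], [0.8, -0.6]]
ages = [0, 100, 200]  # larger age = enqueued longer ago

loss_decayed = decayed_info_nce(query, positive_key, queue_keys, ages, t=0.99)
loss_plain = decayed_info_nce(query, positive_key, queue_keys, ages, t=1.0)
print(loss_decayed, loss_plain)
```

Setting `t=1.0` recovers the undecayed loss, so the comparison shows how older negatives contribute less once decay is applied.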

Experiment

After training, as in other CL frameworks, only the encoder is kept and fine-tuned on downstream tasks. The encoder is C3D.
Below is the ablation study on the number of dropped frames. Decay denotes temporal decay and Adv denotes the generator's drop out.

As in MoCo, the queue size is 65536 and the encoder momentum is 0.999; setting t to 0.99999 gives the best performance.
Because the queue is so large, a small t makes the older keys effectively useless: with t = 0.999, t⁸⁰⁰⁰ ≈ 0.0003, whereas with t = 0.99999, t⁶⁵⁵³⁶ ≈ 0.52.
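The decay arithmetic is easy to check directly. The helper names below are illustrative, and the "half-life" (the age at which a key's weight falls to 0.5) is an extra quantity added here for intuition, not something taken from the paper.

```python
import math


def remaining_weight(t, age):
    """Weight t ** age left on a key that is `age` positions old."""
    return t ** age


def half_life(t):
    """Age at which the weight t ** age drops to 0.5."""
    return math.log(0.5) / math.log(t)


for t, age in [(0.999, 8000), (0.99999, 65536)]:
    weight = remaining_weight(t, age)
    print(f"t={t}: weight at age {age} = {weight:.4f}, "
          f"half-life = {half_life(t):.0f} keys")
```

With t = 0.999 the weight halves roughly every 700 keys, so most of a 65536-entry queue contributes almost nothing; with t = 0.99999 even the oldest key keeps about half of its weight.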

Finally, the comparison table of fine-tuning results.

Written by Balin

NTUST CSIE