Transfer Learning for Video Classification: Video Swin Transformer Research

I’m happy to share my recent research on video classification. This work explores how well the Video Swin Transformer (VST) generalizes across different video domains using transfer learning.

What We Studied

The computer vision field has shifted from convolutional networks to transformer architectures for video tasks. However, training transformers from scratch requires massive computational resources. We investigated whether VST, pre-trained on Kinetics-400, could effectively transfer to other video classification tasks with significantly reduced computational requirements.

We tested transfer learning performance on two datasets:

  • FCVID: Object-focused video classification
  • Something-Something: Action-focused video classification

Our approach required approximately 4x less memory than training from scratch, making state-of-the-art video classification more accessible.
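For readers who want a concrete picture of this setup, the sketch below shows one common way to realize it in PyTorch: load a Video Swin backbone pre-trained on Kinetics-400, freeze it, and train only a new classification head for the target dataset. It uses torchvision's `swin3d_b` Kinetics-400 weights as a stand-in for our exact VST checkpoint; the frozen-backbone strategy and the 239-class FCVID head are illustrative assumptions, not our precise training recipe.

```python
import torch
import torch.nn as nn
from torchvision.models.video import swin3d_b, Swin3D_B_Weights

# Video Swin backbone pre-trained on Kinetics-400
# (stand-in for the exact VST checkpoint used in the study).
model = swin3d_b(weights=Swin3D_B_Weights.KINETICS400_V1)

# Freeze the backbone so only the new head is trained; this is what keeps
# memory/compute low compared to training the whole transformer from scratch.
for param in model.parameters():
    param.requires_grad = False

# Replace the 400-way Kinetics head with a head for the target dataset
# (239 classes shown here as an FCVID-style example; adjust to your data).
num_target_classes = 239
model.head = nn.Linear(model.head.in_features, num_target_classes)

# Only the head's parameters go to the optimizer.
optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)

# Dummy clip with shape (batch, channels, frames, height, width).
clip = torch.randn(2, 3, 32, 224, 224)
logits = model(clip)  # shape: (2, num_target_classes)
```

Training only the head is what keeps the memory footprint small relative to full fine-tuning, at the cost of flexibility on target domains that differ substantially from Kinetics-400.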

Key Findings

The results revealed important insights about domain generalization:

Strong performance on similar domains: Transfer learning from Kinetics-400 to FCVID achieved 85% top-1 accuracy without retraining the entire model, matching state-of-the-art performance. Both Kinetics-400 and FCVID are primarily object-focused, so the source and target domains are closely aligned.

Poor performance on different domains: Transfer to Something-Something yielded only 21% accuracy, highlighting the challenge when transferring from object-focused to action-focused classification tasks.

Video duration impact: VST performance decreases as video duration increases, suggesting architectural limitations for longer videos.
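One plausible contributor to this drop, offered here as my own reading rather than a result from the study: like most clip-based video transformers, VST consumes a fixed number of frames per forward pass, so the share of a long video the model actually sees keeps shrinking. The toy calculation below (with a hypothetical 32-frame clip length) illustrates how the sampling stride grows with duration.

```python
import numpy as np

def uniform_frame_indices(num_video_frames: int, clip_len: int = 32) -> np.ndarray:
    """Pick `clip_len` evenly spaced frame indices from a video."""
    return np.linspace(0, num_video_frames - 1, num=clip_len).round().astype(int)

# Same 32-frame clip for a 10 s and a 3 min video at 30 fps:
# temporal coverage thins out as the video gets longer.
for seconds in (10, 180):
    total_frames = seconds * 30
    idx = uniform_frame_indices(total_frames)
    stride = total_frames / len(idx)
    print(f"{seconds:>4}s video -> one sampled frame every ~{stride:.1f} frames")
```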
