Hi, thanks for stopping by! I am a second-year Ph.D. student at The University of North Carolina at Chapel Hill, advised by Prof. Mohit Bansal. Previously, I completed my undergraduate studies at Shanghai Jiao Tong University.

While at UNC, I have spent my summers at Adobe Research (2024) and Amazon Alexa (2023). Prior to UNC, I did research projects at SenseTime Research (2021) and with the MIT-IBM Watson AI Lab (2021).

I am interested in a wide range of topics in computer vision, especially video, including video+X (language, audio, robotics) understanding and generation, trustworthy video reasoning, and robust video representation learning.

Find me here: shoubin -atsign- cs . unc . edu

πŸ”₯ News

  • 2024.09: πŸ““ One paper accepted to EMNLP 2024 (main conference). Check out LLoVi for long-video QA with LLMs.
  • 2024.07: πŸ“Ή One paper accepted to ACM MM 2024. Check out IVA-0 for controllable image animation.
  • 2024.06: πŸ’¬ Gave an invited talk at Google.
  • 2024.05: 🎬 Started my summer internship at Adobe as a Research Scientist Intern.
  • 2023.09: ⛓️ One paper accepted to NeurIPS 2023. Check out SeViLA for video localization + QA.
  • 2023.07: 🦴 One paper accepted to IEEE TCSVT. Check out MoPRL for skeletal anomaly detection.
  • 2023.05: 🌞 Started my summer internship at Amazon as a Research Scientist Intern.
  • 2022.09: β›ͺ️ Joined the UNC-CH MURGe-Lab.
  • 2022.06: πŸŽ“ Graduated from Shanghai Jiao Tong University (Outstanding Graduate).
  • 2021.10: 🌟 One paper accepted to NeurIPS 2021. Check out STAR for real-world situated reasoning.

πŸ“ Pre-print (*: equal contribution/co-first author)

Preprint

Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level

Andong Deng, Tongjia Chen, Shoubin Yu, Wenshuo Chen, Taojiannan Yang, Lincoln Spencer, Erhang Zhang, Yapeng Tian, Ajmal Saeed Mian, Mohit Bansal, Chen Chen

(To appear) Code | Project Page

  • We present GroundMoRe, a new benchmark for the novel task of motion-grounded video reasoning, designed to assess multimodal models’ reasoning and perception capabilities for motion understanding.
Preprint

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

Ziyang Wang*, Shoubin Yu*, Elias Stengel-Eskin*, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal

Code | Project Page

  • We present VideoTree, an adaptive tree-based video representation and prompting method that uses simple visual clustering for long-video reasoning with LLMs.
Preprint

RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives

Jaehong Yoon*, Shoubin Yu*, Mohit Bansal

Code | Project Page

  • We present RACCooN, a versatile and user-friendly video-to-paragraph-to-video framework that enables users to remove, add, or change video content by updating auto-generated narratives.
Preprint

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

Shoubin Yu*, Jaehong Yoon*, Mohit Bansal

Code | Project Page

  • We present CREMA, an efficient and modular modality-fusion framework for injecting any new modality into video reasoning.

πŸ“ Publications

EMNLP 2024 (main)

A Simple LLM Framework for Long-Range Video Question-Answering

Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius

Code

  • We present LLoVi, a simple yet effective LLM-based framework for long-range video question answering.
ACM MM 2024

Zero-Shot Controllable Image-to-Video Animation via Motion Decomposition

Shoubin Yu, Jacob Zhiyuan Fang, Skyler Zheng, Gunnar Sigurdsson, Vicente Ordonez, Robinson Piramuthu, Mohit Bansal

Code | Homepage

  • We present IVA-0, an image-to-video animator that enables precise user control through in-place and out-of-place motion decomposition.
NeurIPS 2023

Self-Chained Image-Language Model for Video Localization and Question Answering

Shoubin Yu, Jaemin Cho, Prateek Yadav, Mohit Bansal

Code | Demo | Talk

  • We propose SeViLA, which self-chains BLIP-2 for two-stage video question answering (localization + QA) and refines localization with QA feedback.
TCSVT 2023

Regularity Learning via Explicit Distribution Modeling for Skeletal Video Anomaly Detection

Shoubin Yu, Zhongyin Zhao, Haoshu Fang, Andong Deng, Haisheng Su, Dongliang Wang, Weihao Gan, Cewu Lu, Wei Wu

Code

  • We propose MoPRL, a transformer-based model that incorporates skeletal motion priors for efficient video anomaly detection.
NeurIPS 2021

STAR: A Benchmark for Situated Reasoning in Real-World Videos

Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B. Tenenbaum, Chuang Gan

Code | Project Page

  • We propose STAR, a benchmark for neural-symbolic video reasoning in real-world scenes.

πŸŽ– Honors and Awards

  • CN Patent CN114724062A, 2022
  • The Hui-Chun Chin and Tsung Dao Lee Scholar, 2020
  • CN Patent CN110969107A, 2019
  • Meritorious Award in Mathematical Contest in Modeling, 2019
  • Second Prize (Shanghai), China Undergraduate Mathematical Contest in Modeling, 2019

🧐 Service

  • Conference reviewer: CVPR, ECCV, NeurIPS, ACL, EACL, CoNLL, AAAI
  • Journal reviewer: IEEE Transactions on Circuits and Systems for Video Technology

πŸ“– Education

  • 2022.09 - Present: Ph.D. in Computer Science, The University of North Carolina at Chapel Hill
  • 2017.09 - 2022.06: B.Eng. in Information Security, Shanghai Jiao Tong University

πŸ’» Internships

  • 2024: Research Scientist Intern, Adobe Research
  • 2023: Research Scientist Intern, Amazon Alexa