Hi, thanks for stopping by! I am a third-year Ph.D. student at The University of North Carolina at Chapel Hill, advised by Prof. Mohit Bansal. Previously, I completed my undergraduate studies at Shanghai Jiao Tong University.
I have also worked at Amazon (2023), Adobe Research (2024), and Google DeepMind (2025).
My research focuses on multimodal AI, with a particular emphasis on video-centric AI modeling.
I develop and enhance models and systems that effectively and efficiently perceive and reason over the dynamic, diverse visual world. My work aims to enable AI to assist humans in understanding complex video content for advanced reasoning and manipulation, contributing to a broad spectrum of downstream applications (sports, security, medical, and educational domains) and fostering the development of more adaptable and intelligent video-based AI systems. My research spans three threads:
- Video Reasoning Benchmarks/Methods: SeViLA (NeurIPS23), CREMA (ICLR25), GroundMoRe (CVPR25), STAR (NeurIPS21)
- Faithful Video Editing/Generative Methods: VEGGIE (ArXiv25), SAFREE (ICLR25), RACCooN (ArXiv24)
- Efficient Video Representation / Feature Engineering: LLoVi (EMNLP24), VideoTree (CVPR25), MoPRL (TCSVT23)
Find me here: shoubin -atsign- cs . unc . edu
🔥 News
- 2025.03: 🥦 VEGGIE is on arXiv.
- 2025.02: 🎬 Gave an invited talk at Twelve Labs.
- 2025.02: 🎉 2 papers accepted to CVPR 2025. Check VideoTree for dynamic/adaptive keyframe selection with LLMs and GroundMoRe for a new motion-grounded video reasoning task.
- 2025.02: 🧠 Will be a summer intern at Google DeepMind.
- 2025.01: 📸🎬 3 papers accepted to ICLR 2025. Check ☕CREMA for video + any-modality reasoning, SAFREE for training-free safe visual generation, and ✈️SRDF for human-level VL-Navigation.
- 2024.09: 🎉 1 paper accepted to EMNLP 2024. Check LLoVi for long video QA with LLMs.
- 2024.07: 📹 1 paper accepted to ACM MM 2024. Check IVA-0 for controllable image animation.
- 2024.06: 🎬 Gave an invited talk at Google.
- 2024.05: 🎬 Summer intern at Adobe.
- 2023.09: ⛓️ 1 paper accepted to NeurIPS 2023. Check SeViLA for video localization + QA.
- 2023.07: 🦴 1 paper accepted to IEEE TCSVT. Check MoPRL for skeletal anomaly detection.
- 2023.05: 📦 Summer intern at Amazon.
- 2022.09: ⚪️ Joined the UNC-CH MURGe-Lab.
- 2022.06: 🎓 Graduated from Shanghai Jiao Tong University (Outstanding Graduate).
- 2021.10: 🎉 1 paper accepted to NeurIPS 2021. Check STAR for real-world situated reasoning.
📝 Preprints (*: equal contribution / co-first author)

VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation
Shoubin Yu*, Difan Liu*, Ziqiao Ma*, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, Mohit Bansal
- We propose VEGGIE, a unified and versatile video generative model that handles various tasks for both video concept grounding and editing according to user instructions.

RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives
Jaehong Yoon*, Shoubin Yu*, Mohit Bansal
- We present RACCooN, a versatile and user-friendly video-to-paragraph-to-video framework that enables users to remove, add, or change video content by updating auto-generated narratives.
📚 Publications

Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
Andong Deng, Tongjia Chen, Shoubin Yu, Taojiannan Yang, Lincoln Spencer, Yapeng Tian, Ajmal Saeed Mian, Mohit Bansal, Chen Chen
- We present GroundMoRe, a benchmark for the novel task of motion-grounded video reasoning, designed to assess multimodal models' reasoning and perception capabilities for motion understanding.

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang*, Shoubin Yu*, Elias Stengel-Eskin*, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal
- We present VideoTree, an adaptive tree-based video representation/prompting method that uses simple visual clustering for long-video reasoning with LLMs.

SAFREE: Train-free And Adaptive Guard For Safe Text-to-Image And Video Generation
Jaehong Yoon*, Shoubin Yu*, Vaidehi Patil, Huaxiu Yao, Mohit Bansal
- We propose SAFREE, a concept guard that transfers zero-shot to any visual diffusion model for safe generation.

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
Shoubin Yu*, Jaehong Yoon*, Mohit Bansal
- We present CREMA, an efficient & modular modality-fusion framework for injecting any new modality into video reasoning.

Bootstrapping Language-guided Navigation Learning with Self-refining Data Flywheel
Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, Limin Wang
- We present a Self-Refining Data Flywheel strategy for VLN that approaches or surpasses human performance on several benchmarks.

A Simple LLM Framework for Long-Range Video Question-Answering
Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius
- We present LLoVi, a simple yet effective framework with LLM for long-range video question-answering.

Zero-Shot Controllable Image-to-Video Animation via Motion Decomposition
Shoubin Yu, Jacob Zhiyuan Fang, Skyler Zheng, Gunnar Sigurdsson, Vicente Ordonez, Robinson Piramuthu, Mohit Bansal
- We present IVA-0, an image-to-video animator that enables precise user control through in-place and out-of-place motion decomposition.

Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu, Jaemin Cho, Prateek Yadav, Mohit Bansal
- We propose SeViLA, which self-chains BLIP-2 for two-stage video question answering (localization + QA) and refines localization with QA feedback.

Regularity Learning via Explicit Distribution Modeling for Skeletal Video Anomaly Detection
Shoubin Yu, Zhongyin Zhao, Haoshu Fang, Andong Deng, Haisheng Su, Dongliang Wang, Weihao Gan, Cewu Lu, Wei Wu
- We propose MoPRL, a transformer-based model that incorporates skeletal motion priors for efficient video anomaly detection.

STAR: A Benchmark for Situated Reasoning in Real-World Videos
Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B. Tenenbaum, Chuang Gan
- We propose STAR, a benchmark for neural-symbolic video reasoning in real-world scenes.
🏆 Honors and Awards
- Piepie's (1-year-old black Shiba Inu 🐶) Dad, 2024
- The Hui-Chun Chin and Tsung Dao Lee Scholar, 2020
- Meritorious Award in Mathematical Contest in Modeling, 2019
- Second Prize in Shanghai, China Undergraduate Mathematical Contest in Modeling, 2019
🔧 Service
- Conference reviewer: CVPR, ECCV, NeurIPS, ICLR, ICML, AISTATS, ARR (ACL, EMNLP, CoNLL, NAACL, EACL), AAAI
- Journal reviewer: IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), IEEE Transactions on Neural Networks and Learning Systems (TNNLS), IEEE Transactions on Multimedia (TMM)
🎓 Education

- 2022.09 - Present: The University of North Carolina at Chapel Hill, Ph.D. in Computer Science
- 2017.09 - 2022.06: Shanghai Jiao Tong University, B.Eng. in Information Security
💻 Internships

- 2024.05 - 2024.11, Research Scientist Intern, Adobe Research
- Worked with Difan Liu, Yicong Hong, Yang Zhou, Hao Tan

- 2023.05 - 2023.11, Research Scientist Intern, Amazon
- Worked with Jacob Zhiyuan Fang, Robinson Piramuthu