Hi, thanks for stopping by! I am currently a third-year Ph.D. student at The University of North Carolina at Chapel Hill, advised by Prof. Mohit Bansal. I completed my undergraduate studies at Shanghai Jiao Tong University.

I have also worked at Amazon (2023), Adobe Research (2024), and Google DeepMind (2025).

My research focuses on multimodal AI, exploring how we can enable AI models to perceive and understand the world in a way similar to humans.

I develop and enhance models and systems that effectively and efficiently perceive and reason over the dynamic, diverse multimodal world. My work aims to enable AI to assist humans in understanding complex multimodal content for advanced reasoning and manipulation, contributing to a broad spectrum of downstream applications (e.g., in the sports, security, medical, and educational domains) and fostering the development of more adaptable and intelligent video-based AI systems.

Find me here: shoubin -atsign- cs . unc . edu

🔥 News

  • 2025.06: 🌊 1 paper accepted to ICCV 2025. Check VEGGIE for MLLM + diffusion-based multi-skill instructional video editing.
  • 2025.05: 🧠 Summer intern at Google DeepMind.
  • 2025.03: 🥦 VEGGIE is on arXiv.
  • 2025.02: 💬 Gave an invited talk at Twelve Labs.
  • 2025.02: 👀 2 papers accepted to CVPR 2025. Check VideoTree for dynamic/adaptive keyframe selection with LLMs, and GroundMoRe for a new motion-grounded video reasoning task.
  • 2025.01: 🇸🇬 3 papers accepted to ICLR 2025. Check ☕ CREMA for video + any-modality reasoning, SAFREE for training-free safe visual generation, and ⛓️ SRDF for human-level VL navigation.
  • 2024.09: 📓 1 paper accepted to EMNLP 2024. Check LLoVi for long video QA with LLMs.
  • 2024.07: 📹 1 paper accepted to ACM MM 2024. Check IVA-0 for controllable image animation.
  • 2024.06: 💬 Gave an invited talk at Google.
  • 2024.05: 🎬 Summer intern at Adobe.
  • 2023.09: ⛓️ 1 paper accepted to NeurIPS 2023. Check SeViLA for video localization + QA.
  • 2023.07: 🦴 1 paper accepted to IEEE TCSVT. Check MoPRL for skeletal anomaly detection.
  • 2023.05: 🌞 Summer intern at Amazon.
  • 2022.09: ⛪️ Joined the UNC-CH MURGe-Lab.
  • 2022.06: 🎓 Graduated from Shanghai Jiao Tong University (Outstanding Graduate).
  • 2021.10: 🌟 1 paper accepted to NeurIPS 2021. Check STAR for real-world situated reasoning.

๐Ÿ“ Pre-print (*: equal contribution/co-first author)

Preprint

4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time

Ziqiao Ma, Xuweiyi Chen, Shoubin Yu, Sai Bi, Kai Zhang, Chen Ziwen, Sihan Xu, Jianing Yang, Zexiang Xu, Kalyan Sunkavalli, Mohit Bansal, Joyce Chai, Hao Tan

Code | Project Page

  • We introduce 4D-LRM, a data-driven 4D reconstruction model that takes sparse input views at any time and renders arbitrary novel view-time combinations.
Preprint

MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation

Shoubin Yu*, Yue Zhang*, Ziyang Wang, Jaehong Yoon, Mohit Bansal

Code

  • We introduce MEXA, a general and training-free multimodal reasoning framework via dynamic multi-expert skill selection, aggregation and deep reasoning.
Preprint

Movie Facts and Fibs (MF2): A Benchmark for Long Movie Understanding

Emmanouil Zaranis, António Farinhas, Saul Santos, Beatriz Canaverde, …, Shoubin Yu, et al.

Code

  • We propose MF2, a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies.
Preprint

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

Jialu Li*, Shoubin Yu*, Han Lin*, Jaemin Cho, Jaehong Yoon, Mohit Bansal

Code | Project Page

  • We propose Video-MSG, a training-free guidance method for T2V generation based on multimodal planning and structured noise initialization.
Preprint

RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives

Jaehong Yoon*, Shoubin Yu*, Mohit Bansal

Code | Project Page

  • We present RACCooN, a versatile and user-friendly video-to-paragraph-to-video framework that enables users to remove, add, or change video content by updating auto-generated narratives.

๐Ÿ“ Publications

ICCV 2025

VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation

Shoubin Yu*, Difan Liu*, Ziqiao Ma*, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, Mohit Bansal

Code | Project Page

  • We propose VEGGIE, a unified and versatile video generative model that handles various tasks for both video concept grounding and editing according to user instructions.
CVPR 2025

Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level

Andong Deng, Tongjia Chen, Shoubin Yu, Taojiannan Yang, Lincoln Spencer, Yapeng Tian, Ajmal Saeed Mian, Mohit Bansal, Chen Chen

Code | Project Page

  • We present GroundMoRe, a benchmark for the novel task of motion-grounded video reasoning, designed to assess multimodal models' reasoning and perception capabilities for motion understanding.
CVPR 2025

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

Ziyang Wang*, Shoubin Yu*, Elias Stengel-Eskin*, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal

Code | Project Page

  • We present VideoTree, an adaptive tree-based video representation for LLM prompting that uses simple visual clustering to enable long-video reasoning.
ICLR 2025

SAFREE: Train-free And Adaptive Guard For Safe Text-to-Image And Video Generation

Jaehong Yoon*, Shoubin Yu*, Vaidehi Patil, Huaxiu Yao, Mohit Bansal

Code | Project Page

  • We propose SAFREE, a training-free concept guard that transfers zero-shot to any visual diffusion model for safe generation.
ICLR 2025

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

Shoubin Yu*, Jaehong Yoon*, Mohit Bansal

Code | Project Page

  • We present CREMA, an efficient and modular modality-fusion framework for injecting any new modality into video reasoning.
ICLR 2025

Bootstrapping Language-guided Navigation Learning with Self-refining Data Flywheel

Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, Limin Wang

Code

  • We present a Self-Refining Data Flywheel strategy for VLN that approaches or surpasses human performance on several benchmarks.
EMNLP 2024

A Simple LLM Framework for Long-Range Video Question-Answering

Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius

Code

  • We present LLoVi, a simple yet effective LLM-based framework for long-range video question answering.
ACM MM 2024

Zero-Shot Controllable Image-to-Video Animation via Motion Decomposition

Shoubin Yu, Jacob Zhiyuan Fang, Skyler Zheng, Gunnar Sigurdsson, Vicente Ordonez, Robinson Piramuthu, Mohit Bansal

Code | Homepage

  • We present IVA-0, an image-to-video animator that enables precise user control through in-place and out-of-place motion decomposition.
NeurIPS 2023

Self-Chained Image-Language Model for Video Localization and Question Answering

Shoubin Yu, Jaemin Cho, Prateek Yadav, Mohit Bansal

Code | Demo | Talk

  • We propose SeViLA, which self-chains BLIP-2 for two-stage video question answering (localization + QA) and refines localization with QA feedback.
TCSVT 2023

Regularity Learning via Explicit Distribution Modeling for Skeletal Video Anomaly Detection

Shoubin Yu, Zhongyin Zhao, Haoshu Fang, Andong Deng, Haisheng Su, Dongliang Wang, Weihao Gan, Cewu Lu, Wei Wu

Code

  • We propose MoPRL, a transformer-based model that incorporates skeletal motion priors for efficient video anomaly detection.
NeurIPS 2021

STAR: A Benchmark for Situated Reasoning in Real-World Videos

Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B. Tenenbaum, Chuang Gan

Code | Project Page

  • We propose STAR, a benchmark for neural-symbolic video reasoning in real-world scenes.

๐Ÿง Service

  • Conference reviewer: CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, AISTATS, ARR (ACL, EMNLP, CoNLL, NAACL, EACL), AAAI
  • Journal reviewer: IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), IEEE Transactions on Neural Networks and Learning Systems (TNNLS), IEEE Transactions on Multimedia (TMM)

📖 Education

  • 2022.09 - Present
  • The University of North Carolina at Chapel Hill
  • Computer Science, Ph.D.
  • 2017.09 - 2022.06
  • Shanghai Jiao Tong University
  • Information Security, B.Eng.

💻 Internships

  • 2025.05 - Present, Student Researcher, Google DeepMind
  • 2024.05 - 2025.03, Research Scientist Intern, Adobe Research
  • 2023.05 - 2023.11, Research Scientist Intern, Amazon