Publications

You can also find my papers on my Google Scholar and Semantic Scholar profiles.

HawkEye: Training Video-Text LLMs for Grounding Text in Videos

Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, Dongyan Zhao

Published in arXiv, 2025

Proposed HawkEye, one of the first video-text LLMs that can perform temporal video grounding in a fully text-to-text manner, along with InternVid-G, a large-scale video-text dataset for video grounding training.

Download here

Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding

Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Qun Liu, Dongyan Zhao

Published in AAAI Conference on Artificial Intelligence, 2025

Defined tasks related to multimodal multi-party conversation understanding, collected the Friends-MMC dataset from TV series, and introduced baseline models.

Download here

VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Jiansheng Wei, Huishuai Zhang, Dongyan Zhao

Published in arXiv, 2025

Proposed MMDuet, a video-text MLLM for real-time interaction that autonomously decides when to respond during video playback, along with its training dataset MMDuetIT.

Download here

STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering

Yueqian Wang, Yuxuan Wang, Kai Chen, Dongyan Zhao

Published in AAAI Conference on Artificial Intelligence, 2025

A neural module network (NMN) based method for video question answering with long videos and complex questions.

Download here

Understanding Multimodal Hallucination with Parameter-Free Representation Alignment

Yueqian Wang, Jianxin Liang, Yuxuan Wang, Huishuai Zhang, Dongyan Zhao

Published in arXiv, 2025

Proposed a parameter-free, training-free method for analyzing the amount of information in image representations, and applied it to attributing multimodal hallucinations.

Download here

Overview of the NLPCC 2023 Shared Task 10: Learn to Watch TV: Multimodal Dialogue Understanding and Response Generation

Yueqian Wang, Yuxuan Wang, Dongyan Zhao

Published in Natural Language Processing and Chinese Computing, 2025

Hosted a shared task on multimodal dialogue understanding and response generation at NLPCC 2023.

Download here

SMASH: Improving SMAll Language Models’ Few-SHot Ability with Prompt-Based Distillation

Yueqian Wang, Chang Liu, Kai Chen, Xi Wang, Dongyan Zhao

Published in Findings of the Association for Computational Linguistics: EMNLP, 2025

Prompt-based learning and distillation for small transformer encoder-based language models.

Download here

LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding

Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Zilong Zheng

Published in arXiv:2402.16050, 2024

Integrating optical flow for relevant content selection to improve video-text LLMs' performance on video question answering.

Download here

VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions

Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, Dongyan Zhao

Published in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

A large-scale video dialogue corpus collected from TV series with scene and segment transition annotations.

Download here