Xinyi Chen

I am a first-year PhD student at Fudan University, jointly advised by Prof. Bowen Zhou and Prof. Xin Peng. Alongside my PhD, I am an intern at the Embodied AI Center, Shanghai AI Laboratory, supervised by Jiangmiao Pang and Yilun Chen. Before my PhD, I earned my B.Eng. degree with honors from Nanjing University.

I am actively exploring robotic manipulation and the integration of vision-language models (VLMs) with robotics. Always happy to chat, collaborate, or just make new friends—drop me a message anytime!

news

Jul 15, 2025	Joined the PhD program at Fudan University after graduating from Nanjing University.
Feb 27, 2025	Two papers have been accepted by CVPR 2025.
Jul 22, 2024	I started my internship at Shanghai AI Laboratory.

selected publications

CVPR 2025

RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

Haifeng Huang^*, Xinyi Chen^*, Yilun Chen, Hao Li, and 5 more authors

In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Jun 2025

Recent advancements in robotic manipulation have highlighted the potential of intermediate representations for improving policy generalization. In this work, we explore grounding masks as an effective intermediate representation, balancing two key advantages: (1) effective spatial guidance that specifies target objects and placement areas while also conveying information about object shape and size, and (2) broad generalization potential driven by large-scale vision-language models pretrained on diverse grounding datasets. We introduce RoboGround, a grounding-aware robotic manipulation system that leverages grounding masks as an intermediate representation to guide policy networks in object manipulation tasks. To further explore and enhance generalization, we propose an automated pipeline for generating large-scale, simulated data with a diverse set of objects and instructions. Extensive experiments show the value of our dataset and the effectiveness of grounding masks as intermediate guidance, significantly enhancing the generalization abilities of robot policies.

CVPR 2025

GENMANIP: LLM-driven Simulation for Generalizable Instruction-Following Manipulation

Ning Gao^*, Yilun Chen^*, Shuai Yang^*, Xinyi Chen^*, and 6 more authors

In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Jun 2025

Robotic manipulation in real-world settings remains challenging, especially regarding robust generalization. Existing simulation platforms lack sufficient support for exploring how policies adapt to varied instructions and scenarios. Thus, they lag behind the growing interest in instruction-following foundation models like LLMs, whose adaptability is crucial yet remains underexplored in fair comparisons. To bridge this gap, we introduce GenManip, a realistic tabletop simulation platform tailored for policy generalization studies. It features an automatic pipeline that uses an LLM-driven task-oriented scene graph to synthesize large-scale, diverse tasks using 10K annotated 3D object assets. To systematically assess generalization, we present GenManip-Bench, a benchmark of 200 scenarios refined via human-in-the-loop corrections. We evaluate two policy types: (1) modular manipulation systems integrating foundation models for perception, reasoning, and planning, and (2) end-to-end policies trained through scalable data collection. Results show that while data scaling benefits end-to-end methods, modular systems enhanced with foundation models generalize more effectively across diverse scenarios. We anticipate that this platform will facilitate critical insights for advancing policy generalization in realistic conditions. All code will be made publicly available.