Leading the research and development of next-generation, end-to-end Audio LLMs, specializing in advanced speech understanding, interactive systems, and reinforcement learning.
- Step-Audio R1 & Step MPS (Project Lead): Ushered in the “DeepSeek R1 moment” for audio LLMs by developing China’s #1 speech reasoning model, benchmarked directly against Gemini 2.5 Pro in perception and reasoning. Pioneered the Step MPS (Mind-Paced Speaking) “dual-brain” framework, a first-of-its-kind solution that enables complex CoT reasoning and highly empathetic, human-like interaction with zero additional latency, achieving true real-time “thinking-while-speaking.”
- Step-Audio 2 (Lead of Speech Understanding): Led the development of the world’s first industrial-grade end-to-end audio LLM with deep thinking capabilities. Introduced Chain-of-Thought (CoT) reasoning and audio reinforcement learning into speech models for the first time, achieving SOTA performance on ASR, paralinguistic understanding (emotion, tone, music), and reasoning tasks among both open-source and proprietary models. [arXiv:2507.16632]
- Step EditX (Co-Project Lead): Defined a new paradigm of instruction-based “conversational creation” for audio editing. Developed a groundbreaking model capable of zero-shot TTS, style transfer (30+ styles), emotion enhancement (14+ emotions), and one-click restoration. Delivered the industry’s first semantic-level, context-aware audio editing (insertion, deletion, modification) driven by natural language prompts, with faithful preservation of timbre and prosody.