Chronological Thinking in Full-Duplex Spoken Dialogue Language Models
Abstract: Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously …

Fei Tian is an Audio LLM Researcher at StepFun, pioneering the next generation of speech AI. He was instrumental in developing groundbreaking projects including Step-Audio, Step-Audio 2, Step-Audio R1, and Step-MPS. His work introduced China's leading speech reasoning model (benchmarked against Gemini 2.5 Pro), the "thinking-while-speaking" framework, and the integration of Chain-of-Thought (CoT) reasoning into the world's first industrial-grade audio LLM. Previously at ByteDance, he spearheaded the architectural evolution of speech models for core products such as TikTok and CapCut. Fei is passionately committed to contributing his expertise to the journey toward Artificial General Intelligence.
MS Speech Processing, Nanjing University
Visiting Scholar, University of Technology Sydney
BS Physics and Acoustics, Nanjing University
Abstract: This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating …
Abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to …
Abstract: Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face …
Leading the research and development of next-generation, end-to-end Audio LLMs, specializing in advanced speech understanding, interactive systems, and reinforcement learning.
The world's first industrial-grade end-to-end audio LLM with deep thinking capabilities, achieving SOTA performance across multiple understanding and dialogue tasks.