
Fei Tian

Audio LLM Researcher

StepFun

Professional Summary

Fei Tian is an Audio LLM Researcher at StepFun, working on the next generation of speech AI. He was instrumental in developing Step-Audio, Step-Audio 2, Step-Audio R1, and Step-MPS. His work produced China’s leading speech reasoning model (benchmarked against Gemini 2.5 Pro), the “thinking-while-speaking” framework, and the first integration of Chain-of-Thought (CoT) reasoning into an industrial-grade audio LLM. Previously at ByteDance, he led the architectural evolution of speech models for core products such as TikTok and CapCut. Fei is committed to applying his expertise to the pursuit of Artificial General Intelligence.

Education

MS Speech Processing

Nanjing University

Visiting Scholar

University of Technology Sydney

BS Physics and Acoustics

Nanjing University

Interests

• Speech Understanding
• Interactive Speech Systems
• Speech Generation
• Reinforcement Learning in Speech
• Large Language Models

Featured Publications

Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

Abstract: Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously …

Fei Tian

Step-Audio 2 Technical Report

Abstract: This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating …

Fei Tian

Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

Abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to …

Fei Tian

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Abstract: Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face …

Fei Tian
Recent Publications

Experience

  1. Researcher, Audio LLM

    StepFun

    Leading the research and development of next-generation, end-to-end Audio LLMs, specializing in advanced speech understanding, interactive systems, and reinforcement learning.

    • Step-Audio R1 & Step-MPS (Project Lead): Brought the “DeepSeek-R1 moment” to audio LLMs by developing China’s #1 speech reasoning model, benchmarked directly against Gemini 2.5 Pro in perception and reasoning. Pioneered the Step-MPS (Mind-Paced Speaking) “dual-brain” framework, a first-of-its-kind design that enables complex CoT reasoning and empathetic, human-like interaction with zero additional latency, achieving true real-time “thinking-while-speaking.”
    • Step-Audio 2 (Lead of Speech Understanding): Led the development of the world’s first industrial-grade end-to-end audio LLM with deep thinking capabilities. Introduced Chain-of-Thought (CoT) reasoning and audio reinforcement learning into speech models for the first time, achieving SOTA performance across ASR, paralinguistic understanding (emotion, tone, music), and reasoning tasks among both open-source and proprietary models. [arXiv:2507.16632]
    • Step-EditX (Co-Project Lead): Defined a new paradigm of instruction-based “conversational creation” for audio editing. Developed a model capable of zero-shot TTS, style transfer (30+ styles), emotion enhancement (14+ emotions), and one-click restoration. Achieved the industry’s first semantic-level, context-aware audio editing (insertion, deletion, modification) driven by natural-language prompts, while preserving timbre and prosody.
  2. Researcher, Audio LLM

    ByteDance
    • Led the R&D of the subtitle generation system for core products including TikTok, Douyin, CapCut and Jianying, serving millions of daily requests.
    • Spearheaded the integration of Seed Audio LLM to enhance features like personalized captions, context-aware adaptation, and text normalization (ITN).
    • Architected the evolution of speech models from Transformer to Seed Audio LLM, reducing error rates by over 20% and increasing user satisfaction scores to 4.5/5.0.
    • Pioneered a multi-modal speech translation model based on Seed Audio LLM that outperformed Google, Gemini & Qwen, raising dubbing quality scores from 40% to over 85%.

Selected Projects

Step-Audio 2

The world's first industrial-grade end-to-end audio LLM with deep thinking capabilities, achieving SOTA performance across multiple understanding and dialogue tasks.
