[Coming soon!] Step EditX: Next-Generation Conversational Speech Editing Model

Abstract
Step EditX is a next-generation speech editing model that replaces traditional tool-based audio post-processing with natural-language, instruction-based “conversational creation.” Using only text prompts, users can edit audio along several dimensions: content, style, emotion, and tone color. Step EditX offers zero-shot TTS, more than 14 types of emotion enhancement, and more than 30 style transfer options, along with a “one-click audio enhancement” function that repairs audio imperfections and extracts target voices. Its most significant advance is the model’s understanding of text-level addition, deletion, and modification instructions, which enables context-aware speech regeneration that corrects content while preserving the speaker’s timbre and prosody. This marks speech editing’s entry into the era of “semantic-level” intelligent operations.