Abstract
Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but its progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we propose a novel framework called MF-Speech, which consists of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. Subsequently, MF-SpeechGenerator functions as a conductor, achieving precise, composable, and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN). Experiments demonstrate that in the highly challenging multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER = 4.67%), superior style control (SECS = 0.5685, Corr = 0.68), and the highest subjective evaluation scores (nMOS = 3.96, sMOS_emotion = 3.86, sMOS_style = 3.78). Furthermore, the learned discrete factors exhibit strong transferability, demonstrating their significant potential as a general-purpose speech representation.
Proposed Approach
The MF-Speech framework is composed of MF-SpeechEncoder and MF-SpeechGenerator, working together to achieve disentanglement and compositional control.
Figure 1. The overall architecture of MF-Speech.
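To make the idea of style-conditioned normalization concrete, the following is a minimal NumPy sketch of a single, generic adaptive normalization step: features are normalized per channel and then rescaled with a gain and bias predicted from a style vector. This is not the HSAN module itself, whose details are not given on this page. The function name `style_adaptive_norm`, the projection matrices `w_gamma` and `w_beta`, and all shapes are assumptions chosen for illustration, and the hierarchical (multi-layer) aspect is not shown.

```python
import numpy as np


def style_adaptive_norm(x, style, w_gamma, w_beta, eps=1e-5):
    """Generic style-conditioned normalization (illustrative, not the paper's HSAN).

    x:       (channels, frames) feature map, e.g. content features
    style:   (style_dim,) style vector, e.g. a timbre or emotion embedding
    w_gamma: (channels, style_dim) hypothetical projection for the gain
    w_beta:  (channels, style_dim) hypothetical projection for the bias
    """
    # Normalize each channel over the time axis.
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    x_norm = (x - mean) / (std + eps)

    # Predict a per-channel gain and bias from the style vector.
    gamma = 1.0 + w_gamma @ style
    beta = w_beta @ style

    # Re-style the normalized features.
    return gamma[:, None] * x_norm + beta[:, None]


# Toy inputs: random stand-ins for content features and a timbre embedding.
rng = np.random.default_rng(0)
content = rng.normal(size=(4, 50))
timbre = rng.normal(size=8)
w_gamma = rng.normal(scale=0.1, size=(4, 8))
w_beta = rng.normal(scale=0.1, size=(4, 8))

out = style_adaptive_norm(content, timbre, w_gamma, w_beta)
```

After this step, the per-channel statistics of `out` are determined by the style vector rather than by the input features, which is the sense in which a style factor can steer generation independently of content.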
1. Speech Reconstruction
| No | Source | Target | StyleVC | DDDMVC | NS2VC | FAcodeC | MF-Speech |
|---|---|---|---|---|---|---|---|
| 1 | | | | | | | |
| 2 | | | | | | | |
2. Multi-Factor Compositional Speech Generation
Each output combines content, timbre, and emotion from different sources.
| No | Content | Timbre | Emotion | StyleVC | DDDMVC | NS2VC | FAcodeC | MF-Speech |
|---|---|---|---|---|---|---|---|---|
| 1 | | | | | | | | |
| 2 | | | | | | | | |
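The abstract reports WER and SECS for this task. As a reminder of what such metrics measure, here is a minimal sketch under stated assumptions: WER is computed as word-level edit distance divided by the reference length, and SECS is read here as cosine similarity between speaker embeddings (the page does not spell out the acronym, so that reading is an assumption). The function names are hypothetical, the embedding vectors are toy values, and the ASR system and speaker encoder used in the actual evaluation are not specified on this page.

```python
import numpy as np


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy example: one substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")

# Toy embeddings standing in for speaker-encoder outputs (made-up values).
target_speaker = [1.0, 0.0, 1.0]
generated_speaker = [1.0, 0.0, 0.0]
secs = cosine_similarity(target_speaker, generated_speaker)
```

Lower WER indicates that the generated speech preserves the intended content more faithfully, while a higher similarity score indicates that the generated voice is closer to the timbre reference.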