Voice Conversion Project Notes (with a performance comparison of frontier methods dug out of the anonymized VCC 2016 challenge)

Basic voice conversion architecture:
A voice conversion (VC) system consists of two main steps: feature extraction and feature-parameter conversion. Each step has its own set of commonly used techniques, and the various combinations of the two produce the many VC systems in the literature. A short summary follows.

STEP 1: Feature extraction

Feature | Extraction toolkit | Description
LSF (Line Spectral Frequencies) | STRAIGHT | STRAIGHT is a MATLAB library designed for VC.
MGC (mel-generalized coefficients) | STRAIGHT |
LF0 (log F0) | AHOCODER | Ahocoder parameterizes speech waveforms into three different streams.
MCP (mel-cepstral parameters) | AHOCODER |
MVF (maximum voiced frequency) | AHOCODER |

STEP 2: Feature conversion

GMM / JDGMM
DNN
RNN (BLSTM)
seq2seq (with attention)
GRU
DBN
VQ
Mixture of Factor Analyzers
Parameter generation algorithm with global variance (HTS, the HMM-based Speech Synthesis System)
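Whatever combination is picked from the table, a VC system reduces to the same analyze / convert-per-stream / resynthesize skeleton. A hypothetical Python sketch (the analyzer, converters, and synthesizer callables are placeholders for whichever toolkit and models are chosen, not any team's actual code):

# Hypothetical pipeline skeleton; `analyzer`, `converters`, and `synthesizer`
# stand in for whatever toolkit (STRAIGHT, Ahocoder, ...) and models are used.
def voice_convert(wav, analyzer, converters, synthesizer):
    streams = analyzer(wav)                    # e.g. {'mcep': ..., 'lf0': ..., 'mvf': ...}
    converted = {name: converters[name](feat)  # one trained mapper per stream
                 for name, feat in streams.items()}
    return synthesizer(**converted)            # vocoder resynthesis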
Here LF0 is the log-transformed fundamental frequency; it is the main conversion parameter and one of the most important features characterizing different speakers and genders.
The remaining parameters are higher-order components of the speech signal and control the fine detail of the synthesized voice.
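The standard baseline for converting LF0 itself is a per-speaker mean-variance (linear-Gaussian) transform on voiced frames; a minimal sketch, assuming the four statistics were precomputed on training data:

import numpy as np

def convert_lf0(lf0_src, mu_src, sd_src, mu_tgt, sd_tgt):
    # Shift and rescale source log-F0 so its mean/variance match the target speaker.
    return (np.asarray(lf0_src, dtype=float) - mu_src) / sd_src * sd_tgt + mu_tgt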

Evaluation metrics:
Two are commonly used: the subjective MOS and the mel-cepstrum-based MCD.
MOS (Mean Opinion Score): a subjective 1-5 rating given by listeners. A GMM system's MOS is usually taken as the baseline; across different listening tests, the GMM scores can be aligned so that separate MOS tests share a common reference.
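These notes do not spell out the alignment; one simple hypothetical realization is a linear rescaling that maps each test's GMM-baseline MOS onto a shared reference value:

def align_mos(scores, gmm_mos_this_test, gmm_mos_reference):
    # Rescale every score from one test so its GMM baseline lands on the reference.
    return [s * gmm_mos_reference / gmm_mos_this_test for s in scores]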

Scoring guide:
4-5: excellent. Very good; clearly intelligible, minimal delay, fluent communication.
3-4: good. Slightly worse; clearly intelligible, small delay, communication somewhat less fluent, a little noise.
2-3: fair. Acceptable; not very intelligible, some delay, communication still possible.
1-2: poor. Barely usable; hard to make out, large delay, frequent repetition needed.
Below 1: bad. Extremely poor; unintelligible, large delay, communication breaks down.

MCD (mel-cepstral distortion) [dB]: lower is better.
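For reference, the frame-level MCD between converted and target mel-cepstra (D coefficients, typically excluding the 0th energy term) is usually computed as:

\mathrm{MCD}\,[\mathrm{dB}] = \frac{10}{\ln 10}\sqrt{2\sum_{d=1}^{D}\left(mc_d^{(\mathrm{conv})} - mc_d^{(\mathrm{tgt})}\right)^2}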
Examples:

seq2seq with LSTM:

Variational Auto-encoder:

Deep sequence-to-sequence Attention Model:

VCC2016:
Overall scores of all teams:

Comparison of each team's approach (red circles: schemes with MOS > 2.5 and similarity > 60%; purple lines: teams that will not disclose their methods)

Teams G, L, and O:
Most of them used the STRAIGHT toolkit to extract MGC features.
For conversion, GMM or LSTM methods were the common choice.
Sample conversion results from the teams that released them:
Better-sounding:
G_SF1_TF1_200009.wav
L_SF1_TF1_200009.wav
O_SF1_TF1_200009.wav

Slightly worse:

D_SF1_TF1_200009.wav
F_SF1_TF1_200009.wav

Each team's report, including the parameters, toolkits, and methods used:
all_systems_descriptions (1).xlsx

Team affiliations:


Conversion methods available now:
Convert each parameter stream with a different neural network (see the sketch after the sample below).
Feature extraction with AHOcoder:
LF0: LSTM
MVF: DNN
MCP: GRU
Sample conversion result (based on VCC2016 data):
(mixedNN)200009.wav
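A minimal PyTorch sketch of the per-stream setup above (layer sizes and the 40-dim MCP are illustrative assumptions, not the configuration behind the sample):

import torch.nn as nn

class StreamRNN(nn.Module):
    """Recurrent regressor mapping a source feature sequence to the target's."""
    def __init__(self, dim, hidden=64, cell=nn.LSTM):
        super().__init__()
        self.rnn = cell(dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, dim)

    def forward(self, x):  # x: (batch, frames, dim)
        h, _ = self.rnn(x)
        return self.out(h)

lf0_net = StreamRNN(dim=1, cell=nn.LSTM)   # LF0 stream: LSTM
mcp_net = StreamRNN(dim=40, cell=nn.GRU)   # MCP stream: GRU
mvf_net = nn.Sequential(                   # MVF stream: frame-wise DNN
    nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))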

Method based on LSF, LF0, and U/V:


Sample conversion results:
(still downloading)

GMM-based conversion:
Extract MCEP parameters with SPTK and convert them with a 24-mixture GMM (a sketch follows the sample below):
Sample conversion result:
(GMM)200009_TF1_Predict.wav
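For comparison, a minimal sketch of the classic joint-density GMM mapping (assumes parallel, time-aligned MCEP frames and uses the plain frame-wise conditional mean, with no GV or MLPG postprocessing):

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_jdgmm(src_mcep, tgt_mcep, n_mix=24):
    # Fit one GMM on stacked joint vectors [x; y] of aligned frames, shape (T, 2D).
    joint = np.hstack([src_mcep, tgt_mcep])
    gmm = GaussianMixture(n_components=n_mix, covariance_type='full',
                          max_iter=200, random_state=0)
    gmm.fit(joint)
    return gmm

def convert_frame(gmm, x):
    # Conditional expectation E[y | x] under the joint GMM.
    D = x.shape[0]
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]
    S_xx = gmm.covariances_[:, :D, :D]
    S_yx = gmm.covariances_[:, D:, :D]
    # Posterior P(m | x) from the marginal GMM over the source frame.
    log_p = np.array([multivariate_normal.logpdf(x, mu_x[m], S_xx[m])
                      for m in range(gmm.n_components)]) + np.log(gmm.weights_)
    post = np.exp(log_p - log_p.max())
    post /= post.sum()
    # Mixture of per-component conditional means.
    y = np.zeros(D)
    for m in range(gmm.n_components):
        y += post[m] * (mu_y[m] + S_yx[m] @ np.linalg.solve(S_xx[m], x - mu_x[m]))
    return y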

Demos from various organizations:
Voice conversion methods, links below:
  • LyreBird: https://lyrebird.ai/demo
  • Microsoft: DNN VC
  • HKUST: BLSTM VC
  • OHSU (Oregon Health & Science University, USA): Joint AE VC
  • University of Tokyo, Japan: GMM VC
  • TUT (Tampere University of Technology, Finland): DKPLS-regression-based VC
  • Voice Morphing: Voice Morphing
  • Adobe VOCO: http://ieeexplore.ieee.org/abstract/document/7472761/

TTS methods, links below:
  • ModelTalker (records your speech for later speech synthesis): https://www.modeltalker.org/
  • Kobayashi Lab, Japan: Speaker-Independent HMM-Based Voice Conversion
  • University of Edinburgh: listening test materials for "A study of speaker adaptation for DNN-based speech synthesis"
  • TOKUDA and NANKAKU LABORATORY
  • Princeton VOCO: http://gfx.cs.princeton.edu/pubs/Jin_2017_VTI/ [published May 11] [finds the closest phonemes within ~10 minutes of recorded audio]


Recommendations:
In VCC2016, the better-performing teams all used the STRAIGHT toolkit for the feature extraction step; this is worth exploring further.
Schemes that released neither code nor papers may depend on cherry-picked hyperparameter tuning, so reproducing them could be difficult.
The open-source schemes mostly use LSTM as the main model to convert features such as LF0 and MCEP.
Among the methods available now, using GMM-based MCEP conversion as the baseline and an LSTM to convert LF0 performs fairly well; its MOS should be around 2-3, though the spectral resolution is low.
STRAIGHT could replace AHOcoder for feature extraction, with LSTMs converting both LF0 and MCEP (try studying team G's approach).

Reference papers and their reported performance:

Comparison of common traditional conversion techniques (not including neural-network methods):
https://arxiv.org/pdf/1612.07523.pdf


 
IEEE (skimmed only; includes team G's approach among them):
Voice conversion using deep neural networks with speaker-independent pre-training
http://ieeexplore.ieee.org/document/7078543/
Dictionary update for NMF-based voice conversion using an encoder-decoder network
http://ieeexplore.ieee.org/document/7918382/
 
Text-independent voice conversion using deep neural network based phonetic level features
http://ieeexplore.ieee.org/document/7918382/
 
F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential
http://ieeexplore.ieee.org/document/7846338/
 
Enhancing a glossectomy patient's speech via GMM-based voice conversion
http://ieeexplore.ieee.org/document/7820909/
 
Deep neural network based voice conversion with a large synthesized parallel corpus
http://ieeexplore.ieee.org/document/7820716/


SEQ2SEQ-style methods:
https://arxiv.org/pdf/1704.02360.pdf
Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities


 
 
 
Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks
https://arxiv.org/pdf/1704.00849.pdf


 
VAE:
Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder
https://arxiv.org/pdf/1610.04019.pdf



 
DNN:

(figure: "4 Proposed Framework: Deep sequence-to-sequence Attention Model")
Identifying the corresponding teams:

Team I: NII, National Institute of Informatics:
http://cs.joensuu.fi/pages/tkinnu/webpage/pdf/i-vector-VC_icassp2017.pdf
NON-PARALLEL VOICE CONVERSION USING I-VECTOR PLDA: TOWARDS UNIFYING SPEAKER VERIFICATION AND TRANSFORMATION




Team L: Northwestern Polytechnical University (Lei Xie):
http://www.nwpu-aslp.org/lxie/papers/ssw9_PS1-4_Huang.pdf
An Automatic Voice Conversion Evaluation Strategy Based on Perceptual Background Noise Distortion and Speaker Similarity




Team J: NU-NAIST:
F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential
http://ieeexplore.ieee.org/abstract/document/7846338/

Team C: USTC-NELSLIP (Jun Du):
https://www.researchgate.net/publication/307889222_The_USTC_System_for_Voice_Conversion_Challenge_2016_Neural_Network_Based_Approaches_for_Spectrum_Aperiodicity_and_F0_Conversion

Unknown: University College London voice conversion challenge:

Team Hulk [team G]: Phone-aware LSTM-RNN for voice conversion
http://ieeexplore.ieee.org/abstract/document/7877819/

Team B: Zhizheng Wu:
CSTR, The Centre for Speech Technology Research, The University of Edinburgh, UK
http://ieeexplore.ieee.org/abstract/document/7472761/
https://link.springer.com/article/10.1007/s11042-014-2180-2

Team F: HCCL-CUHK, Human-Computer Communications Laboratory,
The Chinese University of Hong Kong, Hong Kong
http://www1.se.cuhk.edu.hk/~lfsun/ICME2016_Lifa_Sun.pdf
A Japanese reimplementation:
https://github.com/sesenosannko/ppg_vc

[A team using DNNs (O, possibly C)]: VoiceKontrol, Center for Spoken Language Understanding (CSLU), Oregon Health & Science University, Portland, OR, USA
S.H. Mohammadi, A. Kain, Semi-supervised Training of a Voice Conversion Mapping Function using Joint-Autoencoder, Interspeech (To Appear), 2015.
S.H. Mohammadi, A. Kain, Voice Conversion Using Deep Neural Networks With Speaker-Independent Pre-Training, 2014 IEEE Spoken Language Technology Workshop (SLT), 2014.
Reimplementation:
https://github.com/shamidreza/dnnmapper



Not yet identified:
1.
CASIA-NLPR-Taogroup National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China

2.
Team Initiator Tsinghua University, Beijing, China

3.
AST Academia Sinica, Taipei, Taiwan
https://arxiv.org/pdf/1610.04019.pdf
https://github.com/JeremyCCHsu/vae4vc

GitHub:
RNN:
https://github.com/Pooja-Donekal/Voice-Conversion

LSTM (decent) (team F):
https://github.com/sesenosannko/ppg_vc

Pitch linear conversion:
https://github.com/cc786537662/VC
Phase acceleration:
https://github.com/PJunhyuk/voice-conversion

DNN + MFCC (team C/O):
https://github.com/shamidreza/dnnmapper

STRAIGHT, F0, GMM:
https://github.com/hiromu/VoiceConversion

GMM
https://github.com/r9y9/VoiceConversion.jl

VAE:
https://github.com/JeremyCCHsu/vae4vc


Update (May 18):
LyreBird's ICLR 2017 paper:
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
https://arxiv.org/pdf/1612.07837.pdf




French voice-conversion company CandyVoice:
https://candyvoice.com/demos/voice-conversion

festvox: voice transformation based on vocal-system modeling (flite)
http://www.festvox.org/transform/

Flite
TTS Demo:http://tts.speech.cs.cmu.edu:8083/
Princeton VOCO: http://gfx.cs.princeton.edu/pubs/Jin_2017_VTI/
1. First, synthesize the utterance with a TTS voice of the same gender/age range.
2. Then fit and edit the waveform using the CUTE technique, which shares its origin with Adobe VOCO.
This is VOCO; both are Zeyu Jin's work.
http://gfx.cs.princeton.edu/pubs/Jin_2016_CAC/CUTE-icassp_2016.pdf