Voice Conversion Challenge 2020
Thank you very much for submitting your VC systems!
We are glad to invite you to participate in the 3rd Voice Conversion Challenge to compare different voice conversion systems and approaches using the same voice data.
Voice conversion (VC) refers to the digital cloning of a person's voice; it can be used to modify an audio waveform so that it sounds as if spoken by someone (the target) other than the original speaker (the source). VC is useful in many applications, such as customizing audiobook and avatar voices, dubbing in the movie industry, teleconferencing, singing voice modification, voice restoration after surgery, and cloning the voices of historical persons. Since VC technology involves identity conversion, it can also be used to protect individuals' privacy, for instance in social media and sensitive interviews. For the same reason, VC also enables spoofing (fooling) voice biometric systems and therefore has potential security implications. The VCC2020 challenge, like the two earlier editions of the challenge, does not focus on any particular application but aims at improving the core VC technology itself using common data, metrics, and baseline systems provided by the organizers. The challenge is open to any interested individual or team. Any technological advances resulting from the challenge can be used in any of the above applications. We expect the results to be useful in defining future directions in both the security and privacy aspects of voice.
The previous challenges can be accessed below:
Tasks of the 3rd Challenge
The objective is speaker conversion, which is a well-known basic problem in voice conversion. We plan to prepare two tasks based on nonparallel training:
We focus on 24 kHz speech and signal-to-signal conversion strategies. No transcriptions will be provided for the test set, and the use of manual annotations is NOT allowed. Participants are free to use additional data for training purposes.
- 1st task: voice conversion within the same language
- In training, the sentence set uttered by the source speaker differs from that uttered by the target speaker, but both are in the same language. Moreover, only a small number of sentences are shared between the two sentence sets.
- In conversion, the source speaker's voice is converted so that it sounds as if uttered by the target speaker while keeping the linguistic content unchanged.
- We will provide voices of 4 source and 4 target speakers (both female and male) from fixed corpora as training data. Each speaker utters a sentence set consisting of 70 sentences in English. Only 20 sentences are parallel; the other 50 sentences are nonparallel between the source and target speakers.
- Using these data sets, each participant will develop voice conversion systems for all speaker-pair combinations (16 speaker pairs in total).
- 2nd task: cross-lingual voice conversion
- In training, the sentence set uttered by the source speaker is completely different from that uttered by the target speaker, because the source and target speakers speak different languages.
- In conversion, the source speaker's voice in the source language is converted so that it sounds as if uttered by the target speaker while keeping the linguistic content unchanged.
- We will also provide voices of 6 additional target speakers (both female and male) from fixed corpora as training data. The source speakers are the same as in the 1st task. Each target speaker utters another sentence set consisting of around 70 sentences in a different language: 2 target speakers utter in Finnish, 2 in German, and 2 in Mandarin.
- Using these nonparallel data sets, each participant will develop voice conversion systems for all speaker-pair combinations (24 speaker pairs in total).
- Other voices of the same source speakers in English will be provided later as test data, consisting of around 25 sentences per speaker. Each participant will generate converted voices from them using the 16 conversion systems developed for the 1st task or the 24 conversion systems developed for the 2nd task.
- The resulting converted voice sets will be evaluated in terms of perceived naturalness and similarity through listening tests.
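As a rough illustration of the required system counts described above, the speaker pairings can be enumerated as follows (the speaker IDs here are placeholders for illustration, not the actual corpus labels):

```python
from itertools import product

# Placeholder speaker IDs; the released corpora use their own labels.
sources = ["S1", "S2", "S3", "S4"]                          # 4 English source speakers
task1_targets = ["T1", "T2", "T3", "T4"]                    # 4 English target speakers
task2_targets = ["FI1", "FI2", "DE1", "DE2", "ZH1", "ZH2"]  # Finnish, German, Mandarin targets

# One conversion system is built per (source, target) combination.
task1_pairs = list(product(sources, task1_targets))  # 4 x 4 = 16 systems
task2_pairs = list(product(sources, task2_targets))  # 4 x 6 = 24 systems
print(len(task1_pairs), len(task2_pairs))  # 16 24
```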
Please check the rules section for more detailed information.
- In the 2020 challenge you are allowed to mix and combine data from different source speakers to train speaker-independent models.
- In the 2020 challenge you may use orthographic transcriptions of the released training data to train your voice conversion systems. Note that we will not provide orthographic transcriptions of speech data in the evaluation set.
- In the 2020 challenge you may perform manual annotations of the released training data. However, we will not allow you to perform manual annotations of speech data in the evaluation set.
- In the 2020 challenge listening tests will use natural speech at 24 kHz sampling frequency as the reference signal.
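Since the challenge works with 24 kHz speech, a simple sanity check of a submission's sampling rate can be done with the Python standard library alone. This is only an illustrative sketch; the file name below is hypothetical:

```python
import wave

def check_sample_rate(path: str, expected_hz: int = 24000) -> bool:
    """Return True if the WAV file at `path` has the expected sampling rate."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == expected_hz

# Example: write a short silent 24 kHz mono WAV file and verify it.
with wave.open("example_24k.wav", "wb") as out:
    out.setnchannels(1)                  # mono
    out.setsampwidth(2)                  # 16-bit PCM
    out.setframerate(24000)              # 24 kHz
    out.writeframes(b"\x00\x00" * 240)   # 10 ms of silence

print(check_sample_rate("example_24k.wav"))  # True
```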
The tentative schedule is as follows:
Our timeline has shifted to account for the date changes of INTERSPEECH.
- March 9th, 2020: release of training data
- May 22nd, 2020 (originally May 11th): release of evaluation data
- May 29th, 2020 (originally May 18th): deadline to submit the converted audio
- July 31st, 2020 (originally July 20th): notification of results
- Aug. 31st, 2020 (originally July 31st): deadline to submit workshop papers (midnight AoE)
- Sep. 30th, 2020 (originally Aug. 14th): notification of acceptance
- Oct. 25th-29th, 2020 (originally Sep. 14th-18th): INTERSPEECH 2020, Shanghai, China
- Oct. 30th, 2020 (originally Sep. 19th): Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020 at Shanghai International Studies University, 550 Dalian Road (W), Shanghai 200083, China.
How to Participate?
There is no fee for registration. Please register your team at the following page by March 9th, 2020 if you want to participate in the challenge.
- Registration has been closed. Thank you for your registration!
To measure the progress of VC technology, we have built a few baseline systems, including the top system of the previous challenge, on the new database. We have prepared a few sets of converted voice samples generated with these baseline systems so that all participants can understand how to build basic systems and have more time to improve their own systems.
- Top system of VCC 2018
- CycleVAE + Parallel WaveGAN
- Seq-to-Seq based on Cascade ASR + TTS w/ ESPnet
- Tomoki Toda & Wen-Chin Huang (Nagoya University)
- Junichi Yamagishi & Yi Zhao (National Institute of Informatics)
- Tomi Kinnunen (University of Eastern Finland)
- Zhenhua Ling (University of Science and Technology of China)
- Rohan Kumar Das & Xiaohai Tian (National University of Singapore)
Contact information: vcc2020__at__vc-challenge.org