Voice Conversion Challenge 2020
Thank you for participating in the Voice Conversion Challenge 2020!
- Freely available materials
- Voice Conversion Challenge 2020 database v1.0 [here]
- Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020
- Website [here]
- Proceedings [here]
- Challenge overview and results [here]
- Analysis of results [here]
We are glad to invite you to participate in the 3rd Voice Conversion Challenge to compare different voice conversion systems and approaches using the same voice data.
Voice conversion (VC) refers to digital cloning of a person's voice; it can be used to modify an audio waveform so that it appears as if spoken by someone else (the target) rather than the original speaker (the source). VC is useful in many applications, such as customizing audio book and avatar voices, dubbing in the movie industry, teleconferencing, singing voice modification, voice restoration after surgery, and cloning the voices of historical persons. Since VC technology involves identity conversion, it can also be used to protect individual privacy, for instance in social media and sensitive interviews. For the same reason, VC also enables spoofing (fooling) of voice biometric systems and therefore has potential security implications. Like the two earlier editions of the challenge, VCC2020 does not focus on any particular application but aims at improving the core VC technology itself, using common data, metrics, and baseline systems provided by the organizers. The challenge is open to any interested individual or team. Any technological advances resulting from the challenge can be applied to any of the above applications. We expect the results to be useful in defining future directions in both the security and privacy aspects of voice.
Tasks of the 3rd Challenge
The objective is speaker conversion, which is a well-known basic problem in voice conversion. We plan to prepare two tasks based on nonparallel training:
- 1st task: voice conversion within the same language
- In training, the sentence set uttered by the source speaker differs from that uttered by the target speaker, but both are in the same language. Only a small number of sentences are shared between the two sets.
- In conversion, the source speaker's voice is converted so that it sounds as if uttered by the target speaker while the linguistic content is kept unchanged.
- We will provide voices of 4 source and 4 target speakers (consisting of both female and male speakers) from fixed corpora as training data. Each speaker utters a sentence set consisting of 70 sentences in English. Only 20 sentences are parallel and the other 50 sentences are nonparallel between the source and target speakers.
- Using these data sets, voice conversion systems for all speaker-pair combinations (16 speaker-pairs in total) will be developed by each participant.
- 2nd task: cross-lingual voice conversion
- In training, the sentence set uttered by the source speaker is completely different from that uttered by the target speaker, because the source and target speakers speak different languages.
- In conversion, the source speaker's voice in the source language is converted so that it sounds as if uttered by the target speaker while the linguistic content is kept unchanged.
- We will also provide voices of 6 additional target speakers (both female and male) from fixed corpora as training data. The source speakers are the same as in the 1st task. Each target speaker utters a separate sentence set consisting of around 70 sentences in a different language: 2 target speakers in Finnish, 2 in German, and 2 in Mandarin.
- Using these nonparallel data sets, voice conversion systems for all speaker-pair combinations (24 speaker pairs in total) will be developed by each participant (see the sketch after this list).
- Other voices of the same source speakers in English will be provided later as test data, consisting of around 25 sentences for each speaker. Each participant will generate converted voices from them using the 16 conversion systems developed for the 1st task or the 24 conversion systems developed for the 2nd task.
- The resulting converted voice sets will be evaluated in terms of perceived naturalness and similarity through listening tests.
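For concreteness, here is a minimal Python sketch of the speaker-pair combinatorics described above: 4 source speakers paired with 4 English target speakers give the 16 conversion systems of the 1st task, and the same 4 sources paired with 6 cross-lingual target speakers give the 24 systems of the 2nd task. The speaker identifiers are illustrative placeholders, not necessarily the labels used in the released database.

```python
# Illustrative sketch of the per-participant conversion systems for both tasks.
# Speaker identifiers are placeholders, not necessarily the official database labels.
from itertools import product

sources = ["SEF1", "SEF2", "SEM1", "SEM2"]            # 4 English source speakers (shared by both tasks)
targets_task1 = ["TEF1", "TEF2", "TEM1", "TEM2"]      # 4 English target speakers (1st task)
targets_task2 = ["TFF1", "TFM1",                      # 2 Finnish target speakers
                 "TGF1", "TGM1",                      # 2 German target speakers
                 "TMF1", "TMM1"]                      # 2 Mandarin target speakers (2nd task)

task1_systems = list(product(sources, targets_task1))  # 4 x 4 = 16 source-target systems
task2_systems = list(product(sources, targets_task2))  # 4 x 6 = 24 source-target systems

print(len(task1_systems), len(task2_systems))          # 16 24
```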
We focus on 24 kHz speech and signal-to-signal conversion strategies (see the brief loading sketch after the list below). No transcriptions will be provided for the test set, and the use of manual annotations is NOT allowed. Participants are free to use additional data for training purposes.
- In the 2020 challenge you are allowed to mix and combine different source speakers' data to train speaker-independent models.
- In the 2020 challenge you may use orthographic transcriptions of the released training data to train your voice conversion systems. Note that we will not provide orthographic transcriptions of speech data in the evaluation set.
- In the 2020 challenge you may perform manual annotations of the released training data. However, we will not allow you to perform manual annotations of speech data in the evaluation set.
- In the 2020 challenge listening tests will use natural speech at 24 kHz sampling frequency as the reference signal.
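As a small illustration of the 24 kHz requirement, the sketch below loads one training utterance at 24 kHz and writes a converted waveform back out at the same rate. librosa and soundfile are just one possible tool choice (not mandated by the challenge), and the file paths are hypothetical.

```python
# Sketch: read and write audio at the 24 kHz sampling rate used throughout the challenge.
# librosa/soundfile are one possible tool choice; the paths below are hypothetical.
import librosa
import soundfile as sf

wav, sr = librosa.load("training_data/source_speaker/E10001.wav", sr=24000)  # resamples if needed
assert sr == 24000

# ... run your conversion system on `wav` here ...

sf.write("converted/E10001_converted.wav", wav, sr)  # submitted waveforms should also be 24 kHz
```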
Please check the rules section for more detailed information.
Timeline
The tentative schedule is as follows:
- March 9th, 2020: release of training data
- May 22nd, 2020 (originally May 11th): release of evaluation data
- May 29th, 2020 (originally May 18th): deadline to submit the converted audio
- July 31st, 2020 (originally July 20th): notification of the first preliminary results
- Aug. 25th, 2020: notification of the final results
- Sep. 7th, 2020 (originally July 31st, then Aug. 31st): deadline to submit workshop papers (midnight AoE)
- Sep. 30th, 2020 (originally Aug. 14th): notification of acceptance
- Oct. 25th-29th, 2020 (originally Sep. 14th-18th): INTERSPEECH 2020, Online
- Oct. 30th, 2020 (originally Sep. 19th): Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, Online
Our timeline has shifted to take into account the date changes of INTERSPEECH 2020.
How to Participate?
There is no fee for registration. Please register your team at the following page by March 9th, 2020 if you want to participate in the challenge.
- Registration has been closed. Thank you for your registration!
Baseline Systems
To measure the progress of VC technology, we have built a few baseline systems on the new database, including the top system of the previous challenge. We have prepared sets of converted voice samples generated with these baseline systems so that all participants can see how to build basic systems and have more time to improve their own.
- Top system of VCC 2018
- CycleVAE + Parallel WaveGAN
- Seq-to-Seq based on Cascade ASR + TTS w/ ESPnet
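To give an idea of what the cascade ASR + TTS baseline does at conversion time (this is a conceptual sketch, not the organizers' ESPnet recipe), the source utterance is first recognized into text, and the text is then re-synthesized with a TTS model of the target speaker. The `asr` and `tts` callables below are generic placeholders.

```python
# Conceptual sketch of the conversion step in a cascade ASR + TTS system.
# `asr` and `tts` are generic placeholders, not ESPnet API calls.
from typing import Callable
import numpy as np

def cascade_convert(
    source_wav: np.ndarray,
    asr: Callable[[np.ndarray], str],   # speech -> text: strips speaker identity, keeps content
    tts: Callable[[str], np.ndarray],   # text -> 24 kHz waveform in the target speaker's voice
) -> np.ndarray:
    text = asr(source_wav)              # recognize the linguistic content of the source utterance
    return tts(text)                    # re-synthesize it in the target speaker's voice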
Paper Submission
We plan to hold a joint workshop for the Blizzard Challenge and Voice Conversion Challenge 2020. All participants are invited to submit one paper that summarizes their system and shows some results. Each participant can select one of the following two paper categories.
- Technical paper: Original contributions are required.
- System description paper: Original contributions are NOT required.
Please follow the INTERSPEECH 2020 guidelines and templates (maximum 4 pages plus 1 page for references) when preparing your paper. All papers can be submitted via the following website until Sep. 7th, 2020 (extended from Aug. 31st).
- Paper submission: Easychair website
NOTE: Please indicate your paper category and your team index (T**) in the "Keywords" field on the submission page, e.g., "Technical paper of T01 in VCC2020" or "System description paper of T01 in VCC2020."
Since the joint workshop is an ISCA-approved workshop, our proceedings, including your papers, will be added to the ISCA archive. A DOI will also be assigned to each paper. Following the ISCA rules, we will review the submitted papers and return review comments to the authors.
Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020
The Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020 will be held online as a satellite workshop of INTERSPEECH 2020. The workshop is open to all, and we encourage participation from anyone interested in speech synthesis and voice conversion. If you are interested in attending, please visit the workshop website and register for the workshop.
Summary
- Overview and results
- Z. Yi, W.-C. Huang, X. Tian, J. Yamagishi, R.K. Das, T. Kinnunen, Z. Ling, T. Toda, "Voice Conversion Challenge 2020 -- intra-lingual semi-parallel and cross-lingual voice conversion --" Proc. Joint workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 80-98, Oct. 2020.
[Paper]
- R.K. Das, T. Kinnunen, W.-C. Huang, Z. Ling, J. Yamagishi, Z. Yi, X. Tian, T. Toda, "Predictions of subjective ratings and spoofing assessments of Voice Conversion Challenge 2020 submissions," Proc. Joint workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 99-120, Oct. 2020.
[Paper]
- Freely available materials
Organizers
- Tomoki Toda & Wen-Chin Huang (Nagoya University)
- Junichi Yamagishi & Yi Zhao (National Institute of Informatics)
- Tomi Kinnunen (University of Eastern Finland)
- Zhenhua Ling (University of Science and Technology of China)
- Rohan Kumar Das & Xiaohai Tian (National University of Singapore)
Contact information: vcc2020__at__vc-challenge.org