Voice Conversion Challenge 2020 Rules

Registration

Please register via the following page if you want to participate in the challenge.
- Registration page
There is no registration fee.

Voice Data Provided in the Challenge

For the 1st task, the organizers will provide a data set consisting of the voices of 4 source and 4 target speakers. Each speaker utters 70 sentences in English, so the data set contains a total of 560 utterances.
For the 2nd task, the organizers will provide another data set consisting of the voices of 6 other target speakers. Each target speaker utters 70 sentences in Finnish, German, or Mandarin, for a total of 420 utterances in the data set. For the source speakers, the same 70 utterances as in the 1st task will be used.
Both data sets include not only waveform files but also manual transcriptions of these utterances.
After registration, a password for downloading the data sets will be issued.
A text file (README) describing more detailed information will also be included in the data set. Please read it carefully.
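The utterance totals above follow directly from the speaker counts. A minimal sketch of that arithmetic (the variable names are illustrative and not part of the challenge materials):

```python
# Sanity check of the data set sizes stated above.
# All names here are illustrative; they do not come from the challenge data.
UTTERANCES_PER_SPEAKER = 70

task1_speakers = 4 + 4   # 4 source + 4 target speakers (English)
task2_targets = 6        # additional target speakers (Finnish, German, or Mandarin)

task1_total = task1_speakers * UTTERANCES_PER_SPEAKER
task2_total = task2_targets * UTTERANCES_PER_SPEAKER

print(task1_total)  # 560
print(task2_total)  # 420
```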

Tasks of the Challenge

The speaker conversion tasks of this challenge are:
- 1st task: non-parallel training in the same language (English)
- 2nd task: non-parallel training over different languages (English-Finnish, English-German, and English-Mandarin)
Participants may take part in either or both tasks.
Training step for the 1st task
- For the 1st task, each participant needs to develop voice conversion systems for all source and target speaker pairs, using as training data up to 70 English utterances per speaker (20 parallel and 50 non-parallel utterances).
- In total, 16 conversion systems (i.e., 4 sources by 4 targets) will be developed.
Training step for the 2nd task
- For the 2nd task, each participant needs to develop voice conversion systems for all source and target speaker pairs using up to 70 utterances for each speaker (i.e., in English for the source speakers, and in Finnish, German, or Mandarin for the target speakers) as training data.
- In total, 24 conversion systems (i.e., 4 sources by 6 targets) will be developed.
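The system counts for the two tasks can be reproduced by enumerating every source and target speaker pair. The sketch below uses hypothetical placeholder speaker IDs; the actual speaker names are given in the README distributed with the data sets.

```python
from itertools import product

# Hypothetical placeholder IDs (assumption); see the README for the real names.
sources = [f"source{i}" for i in range(1, 5)]               # 4 English source speakers
task1_targets = [f"task1_target{i}" for i in range(1, 5)]   # 4 English target speakers
task2_targets = [f"task2_target{i}" for i in range(1, 7)]   # 6 Finnish/German/Mandarin targets

# One conversion system is needed per (source, target) pair.
task1_pairs = list(product(sources, task1_targets))
task2_pairs = list(product(sources, task2_targets))

print(len(task1_pairs))  # 16
print(len(task2_pairs))  # 24
```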
Conversion step (for both the 1st and 2nd tasks)
- Another voice data set of the same 4 source speakers will be provided later. It consists of 25 utterances in English for each source speaker, for a total of 100 utterances.
- Each participant needs to convert these source speakers' voice samples into each target speaker's voice with the developed conversion systems, while keeping the linguistic content (i.e., English) unchanged.
- In total, 400 converted voice samples (25 utterances times 16 speaker pairs) will be generated for the 1st task, and 600 converted voice samples (25 utterances times 24 speaker pairs) will be generated for the 2nd task.
- These converted voice samples will be submitted to the organizers and then evaluated in listening tests in terms of naturalness and speaker similarity. They will also be evaluated with some objective evaluation measures.
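As a quick check before submission, the expected number of converted samples per task can be computed from the figures above. This is only a sketch; the function name and its parameters are illustrative assumptions, not part of any official tooling.

```python
EVAL_UTTERANCES_PER_SOURCE = 25  # evaluation utterances per source speaker


def converted_sample_count(n_sources, n_targets, n_utts=EVAL_UTTERANCES_PER_SOURCE):
    """Every evaluation utterance of every source is converted to every target."""
    return n_sources * n_targets * n_utts


eval_set_size = 4 * EVAL_UTTERANCES_PER_SOURCE   # source utterances provided
task1_samples = converted_sample_count(4, 4)     # 16 speaker pairs
task2_samples = converted_sample_count(4, 6)     # 24 speaker pairs

print(eval_set_size)  # 100
print(task1_samples)  # 400
print(task2_samples)  # 600
```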
Instructions
- No manual editing or modification is allowed in the conversion step. Participants can manually optimize individual conversion systems in the training step, but they cannot do so in the conversion step (e.g., even manual tuning of the system parameters is NOT allowed in the conversion step).
- The use of manual annotations (such as phoneme information, phoneme boundary, linguistic information, etc.) on the evaluation data sets is NOT allowed. Automatic speech recognition systems may be used to generate automatic transcriptions. On the other hand, manual annotations CAN be used for the training data sets.
- Any acoustic features including suprasegmental and duration features may be transformed.
- Participants are free to use additional data for training purposes. All speakers' voices in the data sets provided by the organizers can also be used to develop a conversion system for a certain speaker pair. However, the use of the original EMIME dataset is NOT allowed.
- Participants are also free to discard some utterances from the data set in the training step.
- A single participant may not submit multiple entries in each task, because the listening test would become unmanageable. Participants involved in joint projects or consortia who wish to submit multiple systems should ask the organizers in advance for confirmation.
- Participants need to complete a form giving the general technical specification of their developed conversion system to facilitate easy cross-system comparisons (e.g., is it a GMM-based system? Does it convert prosodic features?).
- If you have any doubt about how to apply these rules, please contact the organizers (vcc2020__at__vc-challenge.org) immediately.

Expert Listeners for Listening Tests

Each participant needs to recruit at least several volunteer listeners as expert listeners for each of the evaluation tests (on naturalness and speaker similarity). Native speakers are preferable but not necessary.
The organizers would also appreciate assistance in advertising the Challenge as widely as possible (e.g., to your students or colleagues).

Retention of Submitted Voice Samples

Any voice samples that you submit for evaluation will be retained by the Voice Conversion Challenge 2020 organizers for future use.
When participants submit the converted voices, they will be asked to give the organizers permission to publicly distribute the submitted voices and the corresponding listening test results in an anonymized form. We would really appreciate it if all participants approved this consent agreement!

Paper Submissions

We would like to ask each participant to submit a paper describing their entry. We are trying to arrange an opportunity for participants to present their papers. More information will be provided later.

Contact information: vcc2020__at__vc-challenge.org