Singing Voice Conversion Challenge 2023
Thank you for participating in the first Singing Voice Conversion Challenge (SVCC)!
The challenge has ended.
- Challenge overview and results [here]
- How to access the dataset? [here]
- Challenge special session at ASRU 2023 [here]
Voice conversion (VC) refers to the digital cloning of a person's voice; it can be used to modify an audio waveform so that it appears to have been spoken by someone (the target) other than the original speaker (the source). The voice conversion challenge (VCC) series aims to advance and compare different approaches to core VC technology using a common dataset, metrics, and baseline systems provided by the organizers. With the rapid progress in the essential modules of a VC system (including acoustic modeling and waveform synthesis), the top system in the latest VCC showed impressive performance, generating speech samples very close to the human voice in terms of naturalness and similarity. We feel it is time to move our focus from fundamental technologies to more sophisticated applications.
Therefore, we are pleased to announce the first singing voice conversion challenge (SVCC). Singing voice conversion (SVC), which extends the definition of conventional VC, aims to convert the singing voice of a source singer to that of a target singer without changing the contents. The main applications of SVC lie in entertainment: new tools for virtual YouTubers, singing voice beautification in karaoke, or even singing aids for the disabled. SVC is considered more challenging than VC, as the singing voice is generally harder to model than speech, and data collection is more difficult. Moreover, during conversion, while the music score is considered part of the contents that must not be changed, certain singing styles such as vibrato can be considered singer-dependent. Each of these prosody-related factors needs to be modeled properly. From the community point of view, SVC lies at the intersection of speech processing and music processing. We hope to attract attention from researchers in both communities to facilitate interdisciplinary research.
The previous VCCs can be accessed below:
Tasks of this Challenge
The objective is singer conversion. We plan to prepare two tasks:
- 1st task: Any-to-one, in-domain singing voice conversion
  - In training, we provide a training set for each of the 2 target singers (1 female and 1 male). There are around 130 to 170 singing utterances for each target singer. No training data of the source singers will be provided.
  - In conversion, the source singer's singing voice needs to be converted as if it were sung by the target singer while keeping the contents unchanged. The test data consists of around 24 song phrases for each of the 2 source singers. Participants need to generate converted samples for all singer-pair combinations (4 singer pairs in total).
- 2nd task: Any-to-one, cross-domain singing voice conversion
  - In training, we provide a training set for each of the 2 target speakers (1 female and 1 male). There are 130 to 170 speech utterances for each target speaker. No training data of the source singers will be provided.
  - In conversion, the source singer's singing voice needs to be converted as if it were sung by the target speaker while keeping the contents unchanged. The test data consists of around 24 song phrases for each of the 2 source singers. Participants need to generate converted samples for all singer-pair combinations (4 singer pairs in total).
We focus on 24 kHz singing voice and signal-to-signal conversion strategies. No transcriptions will be provided for the test set, and manual annotation of the test set is NOT allowed.
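The number of converted samples per task follows from the description above and can be sketched in a few lines of Python. The singer identifiers below are hypothetical placeholders (the real IDs come with the released data), and the phrase count of 24 is only the approximate figure quoted above, so the resulting total is an estimate rather than an official number.

```python
from itertools import product

# Hypothetical identifiers; the actual singer IDs are defined by the released data.
SOURCE_SINGERS = ["source_1", "source_2"]
TARGET_SINGERS = ["target_1", "target_2"]

# "Around 24" song phrases per source singer; treated here as exactly 24 for the estimate.
PHRASES_PER_SOURCE = 24


def conversion_pairs(sources, targets):
    """Return every (source, target) pair that has to be converted."""
    return list(product(sources, targets))


pairs = conversion_pairs(SOURCE_SINGERS, TARGET_SINGERS)

# Each pair converts all test phrases of its source singer.
approx_files_per_task = len(pairs) * PHRASES_PER_SOURCE

for source, target in pairs:
    print(f"{source} -> {target}")
print(f"approximate converted samples per task: {approx_files_per_task}")
```

Under these assumptions, each task yields 4 singer pairs and roughly 96 converted samples.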
Please note that for this challenge, to facilitate reproducible research, any additional data used for training needs to be publicly available. Please only use datasets described in a curated list maintained by the organizers.
Please check the rules section for more detailed information.
- In this challenge, you are allowed to mix and combine different singers' data to train singer-independent models.
- In this challenge, you may use orthographic transcriptions of the released training data to train your voice conversion systems. Note that we will not provide orthographic transcriptions of the data in the evaluation set.
- In this challenge, you may perform manual annotations of the released training data. However, we will not allow you to perform manual annotations of the data in the evaluation set.
- In this challenge, listening tests will use natural audio samples at a 24 kHz sampling frequency as the reference signal during the final evaluation.
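Since the reference audio in the listening tests is at 24 kHz, it may be worth checking the sampling rate of converted files before submission. The following is a minimal sketch using only the Python standard library. It assumes the converted samples are uncompressed PCM WAV files, which is not stated here; the function names and the file name are hypothetical.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 24000  # 24 kHz, as used for the reference signal


def has_expected_rate(path, expected=EXPECTED_RATE):
    """Return True if the PCM WAV file at `path` has the expected sampling rate."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == expected


def write_silence(path, rate, seconds=0.1):
    """Write a short mono 16-bit silent WAV file; used only to demonstrate the check."""
    n_frames = int(rate * seconds)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * n_frames)


# Demonstration on a temporary file; replace with real converted samples.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "converted_sample.wav")  # hypothetical name
    write_silence(demo_path, EXPECTED_RATE)
    print(demo_path, "is 24 kHz:", has_expected_rate(demo_path))
```

The standard `wave` module reads only uncompressed PCM WAV, so files in other formats would need a different reader.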
Timeline
The tentative schedule is as follows:
- Feb. 17th, 2023: release of training data
- Apr. 21st, 2023: release of evaluation data
- Apr. 28th, 2023: deadline to submit the converted audio
- Jun. 16th, 2023: notification of the results
- Jul. 3rd, 2023: deadline to submit workshop papers (midnight AoE)
- Jul. 28th, 2023: notification of acceptance
Baseline Systems
We provide baseline systems. Participants who are new to the singing voice conversion field are welcome to use the open-source starter kit for this challenge. We have prepared a few sets of converted samples generated using these baselines to help participants develop their systems.
Evaluation
Following previous VCCs, the main evaluation campaign will be a large-scale subjective evaluation conducted by recruiting human listeners to assess the quality of all the submitted systems. We will be evaluating the naturalness and similarity of the converted samples.
Challenge special session at ASRU 2023
SVCC 2023 is a challenge special session at ASRU 2023. Please attend ASRU 2023 and come to our poster to hear the challenge summary. Participating teams whose papers were accepted will also present their work there.
Organizers
- Tomoki Toda & Wen-Chin Huang & Lester Violeta (Nagoya University)
- Songxiang Liu (Tencent AI Lab)
- Jiatong Shi (Carnegie Mellon University)
Contact information: svcc2023__at__vc-challenge.org