Contents
- 1 【Explained in video】The Magic of Modern Times: How to use RVC WebUI – Create your own AI voice changer!
- 2 The speed of change is amazing… RVC WebUI – AI Voice Changer Tutorial
- 3 【RVC v2 model】How to start and install RVC WebUI
– Last update: September 3, 2024
- 3.1 【Step 1: Check the GPU】
- 3.2 【Step 2: Installation of dependencies】
- 3.3 【Step 3: Clone the RVC WebUI repository from GitHub】
- 3.4 【Step 4: Update to the latest status】
- 3.5 【Step 5: Preparation of Pre-trained models, etc.】
- 3.6 【Step 6: Download “hubert_base.pt” & “rmvpe.pt”】
- 3.7 【Step 7: Mount Google Drive】
- 3.8 【Step 8: Prepare the dataset and the audio file to be converted to voice in Google Drive.】
- 3.9 【Step 9: Conversion of duplicate file names】
- 3.10 【Step 10: Launch the RVC WebUI】
- 4 【RVC v2 model】How to use RVC WebUI: Training (Creating a trained model with original dataset) – Train
- 5 【RVC v2 model】How to use RVC WebUI: Inference (Voice Conversion) – Model inference
- 6 Q&A displayed on the RVC WebUI
:As of May 30, 2023
- 6.1 Q1:ffmpeg error/utf8 error.
- 6.2 Q2:Cannot find index file after “One-click Training”.
- 6.3 Q3:Cannot find the model in “Inferencing timbre” after training
- 6.4 Q4:How to share a model/How to use others’ models?
- 6.5 Q5:Connection Error.
- 6.6 Q6:WebUI popup ‘Expecting value: line 1 column 1 (char 0)’.
- 6.7 Q7:How to train and infer without the WebUI?
- 6.8 Q8:Cuda error/Cuda out of memory.
- 6.9 Q9:How many total_epoch are optimal?
- 6.10 Q10:How much training set duration is needed?
- 6.11 Q11:What is the index rate for and how to adjust it?
- 6.12 Q12:How to choose the gpu when inferring?
- 6.13 Q13:How to use the model saved in the middle of training?
- 6.14 Q14:File/memory error(when training)?
- 7 Comments on RVC WebUI explanatory videos and examples of responses
– As of Feb 8, 2024
- 7.1 Launch the RVC WebUI – already supported:RVC WebUI tutorial code for RVC v2 compatible version is not available
- 7.2 Train & Model inference:Can you create a tutorial video for the RVC WebUI that converts to a singing voice?(AI Cover)
- 7.3 Train & Model inference:How can I reuse a set of files related to previous trained models in the RVC WebUI?
【Explained in video】The Magic of Modern Times: How to use RVC WebUI
– Create your own AI voice changer!
【The Magic of Modern Times】
How to use RVC WebUI(RVC v2 model supported)- Tutorial for AI Beginners
Video viewing time: 35 min. 27 sec.
When textual information alone is difficult to understand, I hope you will find the explanatory video of the RVC WebUI helpful.
【Confirmed event: May 30, 2023】
The May 30, 2023 update:
1. The Vocal remover section was displayed in Chinese. (The translation had not yet been implemented when the English version was made.)
→ As of June 2, 2023, it has been translated into English.
2. “crepe” and the option “Protect voiceless consonant and breath, less artifact.” were added to inference.
At the time of confirmation:
1. “RuntimeError: Failed to load audio: ffmpeg error (see stderr output for detail)” was displayed when loading a dataset file during training (Train), but training was still possible.
2. In “Model Inference”, setting “select pitch extraction algorithm” to “crepe” seemed to cause an error.
→ I have modified the Google Colaboratory code for the May 31, 2023 update so that “crepe” can be used.
3.The “Export Onnx” function seemed to cause an error when trying to convert an RVC v2 model.
→ The bug was fixed in the June 16, 2023 update.
I will see how it progresses.
(The Google Colaboratory code has been modified because an update to the original RVC WebUI caused an error that prevented it from starting. If you were using the previous code, please use the latest version of the Google Colaboratory code.)
【Confirmed event: June 7, 2023】 → already settled
The June 7, 2023 update:
After the update, the following error seemed to appear when following the steps in this article.
Error Output:
File “/content/Retrieval-based-Voice-Conversion-WebUI/train/process_ckpt.py”, line 20, in savee
hpt.data.filter_length // 2 + 1,
NameError: name ‘hpt’ is not defined
The cause seemed to be that the file
・“process_ckpt.py”
had just been updated, and the variable defined as
・“hps”
was mistakenly referenced as
・“hpt”.
As a tentative workaround until the developers notice the typo: double-click the file
・“process_ckpt.py”
(location: “process_ckpt.py” in “train” of the “Retrieval-based-Voice-Conversion-WebUI” folder)
on Google Colaboratory to open the edit screen.
It seemed to work correctly when I changed
・“hps”
in the code
・“def savee(ckpt, sr, if_f0, name, epoch, version, hps):”
in line 11 to
・“hpt”
and saved the file.
If this explanation is difficult to follow, please check the current state of the original RVC WebUI code,
・process_ckpt.py – RVC-Project/Retrieval-based-Voice-Conversion-WebUI | GitHub
on GitHub to see whether the typo has been corrected before running the program.
【PS: Tutorial video for error handling is available】
I have explained the steps of the response example in the video.
Please find it useful in cases where textual information is difficult to understand.
:【Error handling example】RVC WebUI’s 「NameError: name ‘hpt’ is not defined」 – v2 Model Errors and Examples of Solutions by RehabC – デジタルで、遊ぶ。(YouTube video)
Viewing time: 1 min. 34 sec.
→ Eventually, the “hpt” description seemed to be corrected to “hps”.
【Confirmed event: June 16, 2023】
The following error when starting the RVC WebUI,
AttributeError: module ‘gradio.components’ has no attribute ‘FormComponent’. Did you mean: ‘IOComponent’?
has been fixed so that it no longer appears.
【Confirmed event: July 18, 2023】
The following error when starting the RVC WebUI,
ModuleNotFoundError: No module named ‘faiss’
has been fixed so that it no longer appears.
【Confirmatory event: Dec 26, 2023】
I have modified the code in 【Step 2: Installation of dependencies】.
→ According to a person who provided feedback in the comments, as of December 26, 2023, “Train” worked. (This depends on the type of runtime.)
*The free version cannot be used. At this time (as of December 2023), billing is required.
→ Addendum – as of December 27, 2023:
During “Train”, the successful-training output
・INFO:(Model_Name):Training is done. The program is closed.
・INFO:(Model_Name):saving final ckpt:Success.
appeared, along with the warning
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown
warnings.warn(‘resource_tracker: There appear to be %d ‘
but the pth, index, and npy files were not created.
【Confirmed event: Jan 28, 2024】
I have modified the code for steps 5 and 6 (modified the download location).
*The free version cannot be used. At this time (as of January 2024), billing is required.
【Confirmed event: Feb 7, 2024】
Corrected the code in step 2 for “gradio_client” related errors.
I have been informed by a viewer that both Train and Inference are possible.
*The free version cannot be used. At this time (as of February 2024), billing is required.
【Confirmed event: Mar 9, 2024】
Corrected the code in step 2 for “chex” & “numpy” related errors.
ERROR: pip’s dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
chex 0.1.85 requires numpy>=1.24.1, but you have numpy 1.23.5 which is incompatible.
*The free version cannot be used. At this time (as of Mar 2024), billing is required.
【Confirmed event: Mar 18, 2024】
Corrected the code in step 2 for “PyTorch” related errors.
*The free version cannot be used. At this time (as of Mar 2024), billing is required.
【Confirmed event: Apr 4, 2024】
Corrected the code in step 2 for “optax” & “pandas-stubs” related errors.
*The free version cannot be used. At this time (as of Apr 2024), billing is required.
【Confirmed event: Jun 13, 2024】
Corrected the code in step 2 for “numba” related errors.
*The free version cannot be used. At this time (as of Jun 2024), billing is required.
【Confirmed event: July 27, 2024】
Corrected the code in step 2 for “omegaconf” related errors (“ERROR: Cannot install fairseq and fairseq==0.12.2”).
*The free version cannot be used. At this time (as of July 2024), billing is required.
【Confirmed event: August 30・31, 2024】
Corrected the code in step 2 for “typing-extensions==4.6.0” related errors.
And added installation of “aria2” and “torchfcpe”.
*The free version cannot be used. At this time (as of August 2024), billing is required.
【Confirmed event: September 1, 2024】
The version of “numpy” had been updated compared to a month ago, so the code was modified to downgrade to “numpy==1.23.5”.
I also pinned the dependency libraries to the same versions used by the original RVC WebUI code.
*The free version cannot be used. At this time (as of September 2024), billing is required.
【Confirmed event: Sep 3, 2024】
Fixed the code to downgrade to “numpy==1.24.4”, since dependency conflict errors could not be avoided with “numpy==1.23.5”.
Also, “tensorflow” had been updated to “tensorflow==2.17.0” compared to one month ago, so the code was modified to downgrade to “tensorflow==2.15.0”.
*The free version cannot be used. At this time (as of September 2024), billing is required.
→ As of September 9, 2024:
According to information from viewers of the tutorial video, it is recommended to perform the following instead of using “One-click Training” when training:
・Run the process step by step by clicking each processing button individually.
* Steps: Process data → Feature extraction → Train Model
I have also received a report that voice conversion worked when
・using a one-minute audio file
for inference.
The speed of change is amazing…
RVC WebUI – AI Voice Changer Tutorial
Some of you may want to use the RVC WebUI, the AI voice changer released in April 2023, but are stuck because you have no idea how to use it….
In the less than two months since the RVC WebUI was released to the public, its features have been added and removed at a remarkable pace.
This is a glimpse of the ingenuity and enthusiasm of Japanese and other developers.
I myself have been publishing tutorial articles and videos in an effort to lower the barriers to voice AI, but I struggle daily to keep up with the changes while thinking about the articles, videos, and other content.
The RVC v1 model used a pre-trained model trained on an English dataset, which inevitably tended to produce an English accent. However, the RVC v2 model (retrieval-based-voice-conversion version 2) has been designed to discriminate high-resolution speech, and Japanese pronunciation is now clearer than in the RVC v1 model, even when using a pre-trained model trained on an English dataset.
The RVC v2 model, which is more powerful than the RVC v1 model, has also been released, and I have been spending my days thinking,
“I wonder if there should be an RVC WebUI tutorial program, article, or video for the RVC v2 model…”
I finally have the energy to get to work.
Therefore, to make it easier for those interested in the RVC WebUI with the official RVC pre-trained model version 2 (RVC v2 model), released in mid-May 2023, to enjoy this high-performance AI voice changer, I will summarize how to launch the RVC WebUI and create an original AI voice changer using Google Colaboratory, which can be used on any type of computer, including Windows and Mac.
I hope this series of information will help you get started with the “AI Voice Changer”.
Sample Code Links & Program Licenses
I have released sample code with explanations on how to use it so that you can easily try RVC WebUI.
I hope this will be of some help to first-time AI students around the world who find the original RVC WebUI Google Colaboratory code difficult to understand.
【RVC v2 model supported】
English Version
RVC WebUI Tutorial:
v2-RVC-WebUI-for-AI-beginners.ipynb(The MIT License)| Google Colaboratory
* The Gradio-related error that appeared on June 16, 2023 has already been addressed, so please run “Save a copy in Drive” and launch the RVC WebUI with a tutorial code version last updated after June 16, 2023.
License for the sample code “RVC-WebUI-for-AI-beginners.ipynb”:
The MIT License
Copyright 2023 child programmer
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Download the audio material used in this tutorial
【About the audio material used in this tutorial】
Before creating your own original AI voice changer of your choice, many of you may first want to learn a series of operating procedures to try out.
I searched for information and found voice materials whose creators permit this kind of use, so I have used the following audio for training in this tutorial.
AI Voice Changer Training Voice (female voice) audio download:
Batch download of Amitaro’s voice materials (dialogue materials) | あみたろの声素材工房 (Amitaro’s Voice Material Studio)
(PCM44,100Hz/16-bit/monaural WAV format)
Credit Information:あみたろの声素材工房 https://amitaro.net/
In addition, I have made the pre-conversion audio material (before conversion to Amitaro’s voice) available for download from this page, so please use it if necessary.
Download the audio material of a voice (male voice) for testing inference:
Download: Voice material (male voice) to try model inference (AI voice change)
(Sample rate 48,000Hz/24-bit/stereo WAV format)
* License of the sample audio (the male-voice material for testing inference):
Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
The voice material before conversion was partially extracted from this audio.
The source of the voice material for testing model inferences:
People’s Speech Dataset(CC-BY and CC-BY-SA)| MLCommons
【RVC v2 model】How to start and install RVC WebUI
– Last update: September 3, 2024
【Step 1: Check the GPU】
If you are unable to check your GPU with the following commands, go to the Google Colaboratory menu
「Runtime – Change runtime type – Hardware accelerator」
and select “GPU”, then save and try running the code again.
Run Code
!nvidia-smi
!nvcc -V
!free -h
Output Result
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 51C P8 10W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
total used free shared buff/cache available
Mem: 12Gi 636Mi 9.0Gi 1.0Mi 3.1Gi 11Gi
Swap: 0B 0B 0B
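As an optional cross-check (my own addition, not part of the original steps), you can also confirm from Python that the GPU is visible, since PyTorch comes preinstalled on Colab:
import torch
if torch.cuda.is_available():
    print("GPU available:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected; check the runtime type setting")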
【Step 2: Installation of dependencies】
As of December 26, 2023, I have modified the code for installing dependencies using the information in “requirements.txt” in the original version of the RVC WebUI to avoid errors when installing dependencies.
→ As of September 3, 2024:
The code has been modified to the specification to downgrade to “numpy==1.24.4” because the dependency conflict error cannot be avoided with “numpy==1.23.5”.
Run Code
# As of September 3, 2024:Specify “tensorflow==2.15.0” (downgrade from “tensorflow==2.17.0”) and adjust dependencies
!pip3 install tensorflow==2.15.0 tf-keras==2.15.1 tensorstore==0.1.45 orbax-checkpoint==0.4.4
# As of July 27, 2024:Downgrade Google Colaboratory pip version pip==24.1.2 to counter errors such as “fairseq 0.12.2 depends on omegaconf<2.1"
!python3 -m pip install --upgrade pip==22.3
# As of July 27, 2024:Specify the version of "omegaconf"
!pip3 install omegaconf==2.0.6
!pip3 install jedi==0.19.1
!apt-get -y install build-essential python3-dev ffmpeg
!pip3 install torch==2.1.0 torchtext==0.16.0 torchvision==0.16.0 torchaudio==2.1.0 # Downgrade PyTorch version to 2.1.0 + adjusted dependencies
!pip3 install optax==0.2.1 # Specify the version of optax that chex can use
!pip3 install chex==0.1.7 # Specify the version of chex that numpy can use
!pip3 install pandas-stubs==2.0.1.230501 # Specify the version of pandas-stubs that numpy can use
# As of September 1, 2024:Disable the following codes
# # As of July 27, 2024:"typeguard" related error countermeasures
# # As of August 30, 2024: Code position changed to address “typing-extensions=4.6.0” related errors
# !pip3 install typeguard==3.0.2 inflect==6.0.5 albumentations==1.3.1 albucore==0.0.5
# As of September 3, 2024:Specify “numpy==1.24.4”, “numba>=0.57.0”, and “llvmlite==0.43.0” to avoid dependency conflict errors
!pip3 install numpy==1.24.4 "numba>=0.57.0" llvmlite==0.43.0 # quote specifiers containing ">=" so the shell does not treat ">" as output redirection
# As of September 1, 2024: albucore removed due to “numpy==1.23.5” conflict error
# As of September 3, 2024: → Restored albucore with changes to “numpy==1.24.4”.
!pip3 install typeguard==3.0.2 inflect==6.0.5 albumentations==1.3.1 albucore==0.0.5
# As of September 1, 2024:The following measures
# rmm-cu12 24.4.0 requires numba>=0.57, but you have numba 0.56.4 which is incompatible.
# cudf-cu12 24.4.1 requires numba>=0.57, but you have numba 0.56.4 which is incompatible.
# albucore 0.0.5 requires numpy>=1.24.4, but you have numpy 1.23.5 which is incompatible.
# Uninstall rmm-cu12, cudf-cu12, and albucore
# As of September 3, 2024:When uninstalled, the following display will appear when starting RVC WebUI
# Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
# Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
# Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
# Therefore, disable the following code to stop uninstallation
# !pip3 uninstall rmm-cu12 cudf-cu12 albucore -y
!pip3 install kaleido==0.2.1 fastapi==0.88 cohere==4.39 tiktoken==0.5.2 tensorflow-probability==0.20.1 typing-extensions==4.6.0 openai==1.5.0 python-multipart==0.0.6 uvicorn==0.25.0
# As of September 1, 2024:Disable the following code
# # Changed "numba==0.56.4" to "numba" to avoid numba conflict error. numpy version changed to unspecified.
# !pip3 install faiss-cpu==1.7.3 fairseq gradio==3.34.0 ffmpy==0.3.1 ffmpeg-python>=0.2.0 praat-parselmouth>=0.4.2 pyworld==0.3.2 numpy numba librosa==0.9.1 tensorboardX tensorboard onnxruntime-gpu==1.16.3 onnx==1.15.0 torchcrepe==0.0.20 python-dotenv>=1.0.0 av
# As of September 1, 2024:Since the version of numpy has been upgraded since the end of August 2024, “numpy==1.23.5” and “numba==0.56.4” should be specified as the version (downgrade).
# As of September 3, 2024:I get the following conflict error
# rmm-cu12 24.4.0 requires numba>=0.57
# cudf-cu12 24.4.1 requires numba>=0.57
# albucore 0.0.5 requires numpy>=1.24.4
# Remove numpy/numba description from here and move code
!pip3 install faiss-cpu==1.7.3 fairseq gradio==3.34.0 ffmpy==0.3.1 "ffmpeg-python>=0.2.0" "praat-parselmouth>=0.4.2" pyworld==0.3.2 librosa==0.9.1 tensorboardX tensorboard onnxruntime-gpu==1.16.3
!pip3 install onnx==1.15.0 torchcrepe==0.0.20 "python-dotenv>=1.0.0" av
!pip3 install gradio_client==0.8.1 # Downgrade gradio_client==0.9.0
# 【Nullification】Removed "llvmlite==0.39.0" for numba conflict error avoidance purposes
# → As of September 1, 2024:Since “numba==0.56.4” was specified, “llvmlite==0.39.0” was adopted.
# 【Nullification:As of September 3, 2024】As of September 1, 2024:nullification -
# !pip3 install llvmlite==0.39.0
!pip3 install "joblib>=1.1.0" scipy fairseq==0.12.2 Cython "pydub>=0.25.1" "soundfile>=0.12.1" "Jinja2>=3.1.2" json5 Markdown "matplotlib>=3.7.0" "matplotlib-inline>=0.1.3" "Pillow>=9.1.1" "resampy>=0.4.2" scikit-learn
!pip3 install "tqdm>=4.63.1" "tornado>=6.1" "Werkzeug>=2.2.3" "uc-micro-py>=1.0.1" "sympy>=1.11.1" "tabulate>=0.8.10" "PyYAML>=6.0" "pyasn1>=0.4.8" "pyasn1-modules>=0.2.8" "fsspec>=2022.11.0" "absl-py>=1.2.0" audioread "uvicorn>=0.21.1"
!pip3 install "colorama>=0.4.5" httpx
# As of August 30・31, 2024:“aria2”and "torchfcpe" added due to code changes in the original RVC WebUI
!pip3 install aria2 torchfcpe
print('(Version at the time of execution)') # As of execution on July 27, 2024:python 3.10.12
import platform
print('python ' + platform.python_version())
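Because this step pins many versions, it can help to confirm which versions actually ended up installed before moving on. The following check is my own sketch, not part of the original notebook; the package list is just a sample of the pins above:
from importlib.metadata import version, PackageNotFoundError
# Print the installed versions of a few of the pinned packages.
for pkg in ("numpy", "tensorflow", "torch", "gradio", "gradio_client", "fairseq", "omegaconf"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "is not installed")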
【Step 3: Clone the RVC WebUI repository from GitHub】
Copy the program “Retrieval-based-Voice-Conversion-WebUI” from GitHub to Google Colaboratory.
The program uses the original version of “Retrieval-based-Voice-Conversion-WebUI”, which is under active daily development.
Run Code
!git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI
%cd /content/Retrieval-based-Voice-Conversion-WebUI
【Step 4: Update to the latest status】
Run Code
!git pull
【Step 5: Preparation of Pre-trained models, etc.】
In the “Preparation of Pre-trained models, etc.” section of this program, as of May 2023, only “40kHz” RVC v2 models are supported.
When new sample frequency models are released, I plan to enable the relevant codes as appropriate.
Run Code
!apt -y install -qq aria2
#pretrained (v1) pre-trained models
# January 28, 2024: I have corrected the placement of the download files.
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/D32k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained -o D32k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/D40k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained -o D40k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/D48k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained -o D48k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/G32k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained -o G32k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/G40k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained -o G40k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/G48k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained -o G48k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0D32k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained -o f0D32k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0D40k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained -o f0D40k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0D48k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained -o f0D48k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0G32k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained -o f0G32k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0G40k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained -o f0G40k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0G48k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained -o f0G48k.pth
#pretrained_v2 pre-trained models: As of May-June 2023, only 40 kHz was supported. As models for new sample frequencies are released, I will activate the relevant code as appropriate.
#→ It appeared that other sample frequency models were released on June 18, 2023, so I activated them.
# January 28, 2024: I have corrected the placement of the download files.
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/D32k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2 -o D32k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/D40k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2 -o D40k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/D48k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2 -o D48k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/G32k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2 -o G32k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/G40k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2 -o G40k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/G48k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2 -o G48k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/f0D32k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2 -o f0D32k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/f0D40k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2 -o f0D40k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/f0D48k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2 -o f0D48k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/f0G32k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2 -o f0G32k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/f0G40k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2 -o f0G40k.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/f0G48k.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2 -o f0G48k.pth
#trained models for sound source separation (vocal removal)
# January 28, 2024: I have corrected the placement of the download files.
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/uvr5_weights/HP2-人声vocals+非人声instrumentals.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/uvr5_weights -o HP2-人声vocals+非人声instrumentals.pth
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/uvr5_weights/HP5-主旋律人声vocals+其他instrumentals.pth -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/uvr5_weights -o HP5-主旋律人声vocals+其他instrumentals.pth
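To confirm that the aria2c downloads landed where the WebUI expects them, a quick listing like the following can help. This is my own sketch; the folder paths match the -d options above:
import os
base = "/content/Retrieval-based-Voice-Conversion-WebUI/assets"
for sub in ("pretrained", "pretrained_v2", "uvr5_weights"):
    d = os.path.join(base, sub)
    print(sub, ":", sorted(os.listdir(d)) if os.path.isdir(d) else "folder missing")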
【Step 6: Download “hubert_base.pt” & “rmvpe.pt”】
*Note:
As of its initial release (confirmed on July 11, 2023), selecting “rmvpe” (the latest RMVPE vocal pitch extraction algorithm) in “Model Inference: select the pitch extraction algorithm” resulted in an error.
I will see how it progresses.
*Note:
As of the July 30, 2023 confirmation, I have confirmed that “Train: select the pitch extraction algorithm” is usable with “rmvpe” and “rmvpe_gpu”.
Run Code
# January 28, 2024: I have corrected the placement of the download files.
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/hubert/ -o hubert_base.pt
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/rmvpe.pt -d /content/Retrieval-based-Voice-Conversion-WebUI/assets/rmvpe/ -o rmvpe.pt # "resolve" (not "blob") so aria2c downloads the file itself rather than the HuggingFace web page
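As with the previous step, a small existence and size check (my own addition) can catch a failed download early; an HTML error page saved by mistake would show up as a file of only a few KiB:
import os
for p in ("/content/Retrieval-based-Voice-Conversion-WebUI/assets/hubert/hubert_base.pt",
          "/content/Retrieval-based-Voice-Conversion-WebUI/assets/rmvpe/rmvpe.pt"):
    if os.path.exists(p):
        print(p, round(os.path.getsize(p) / (1024 * 1024), 1), "MiB")
    else:
        print(p, "is missing")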
【Step 7: Mount Google Drive】
You can use folders and files on Google Drive.
After executing the following code, you will be asked for permission; click “Allow” with the Google account for the Google Drive you want to use.
Run Code
from google.colab import drive
drive.mount('/content/drive')
【Step 8: Prepare the dataset and the audio file to be converted to voice in Google Drive.】
In “MyDrive” of Google Drive, prepare a folder named
・”dataset”
containing the audio files for training.
Also, upload the audio file (WAV or MP3 format) on which you would like to try inference (voice conversion).
【Folder Structure】
dataset
|— 〜1.wav
|— 〜2.wav
|— 〜3.wav
|— 〜4.wav
・
・
・
|— 〜10.wav
*As an example, the “dataset” folder should contain several audio files in WAV format, separated into short sentences (up to a punctuation point).
*In this tutorial, I trained with “10” audio files of “1-3 seconds” each. If you want to convert audio in earnest, increasing the number of audio files and the number of training epochs may reduce the machine-like sound and other problems.
(A GPU with more memory seems to be better.)
*The official RVC Q&A states that
the recommended audio duration for a dataset
”appears to be 10-50 minutes.”
”For high quality datasets, 5-10 minutes seems to be recommended.”
For the number of training epochs,
”20 to 30 epochs (number of training epochs) seems to be recommended for low-quality, noisy audio.”
(Increasing the number of training epochs with low quality audio does not improve the quality of the audio.)
”For high quality audio, the acceptable training frequency is 200 epochs.”
To create the audio files for training, I have seen many people use audio editing applications such as
・Audacity
to prepare the dataset.
How long should each audio file in the dataset be? I am not sure, but the developer explains,
“Use less than 10 minutes vocal to fast train a voice conversion model!”
Looking up related information on the Internet and examining currently distributed audio corpora, I found that
・each file typically contains only a few seconds of audio.
So, after learning how to use the RVC WebUI to some extent, please experiment by trial and error. As one idea, see the sketch below.
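As one programmatic alternative to manual editing in Audacity, here is a minimal sketch (my own addition, not part of the tutorial) that uses pydub, which is installed in Step 2, to split a longer recording into short clips at silent points. The input file name and the silence thresholds are hypothetical and will need tuning for your own recording.
from pydub import AudioSegment
from pydub.silence import split_on_silence

# Hypothetical long recording placed in MyDrive; adjust the path to your own file.
audio = AudioSegment.from_wav("/content/drive/MyDrive/long_recording.wav")

# Split wherever at least 0.5 s of audio is 16 dB quieter than the average level.
chunks = split_on_silence(audio, min_silence_len=500,
                          silence_thresh=audio.dBFS - 16, keep_silence=200)

for i, chunk in enumerate(chunks, start=1):
    chunk.export(f"/content/drive/MyDrive/dataset/voice_{i}.wav", format="wav")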
【Step 9: Conversion of duplicate file names】
Rename duplicate files in the dataset (“dataset” folder).
Run Code
!ls -a /content/drive/MyDrive/dataset
!rename 's/(\w+)\.(\w+)~(\d*)/$1_$3.$2/' /content/drive/MyDrive/dataset/*.*~*
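If the rename utility is unavailable or behaves differently in your environment, the following pure-Python equivalent (my own sketch, assuming the same “~”-suffixed duplicate names as the Perl pattern above) performs the same renaming:
import os
import re

dataset_dir = "/content/drive/MyDrive/dataset"
for name in os.listdir(dataset_dir):
    # Matches names like "voice.wav~1" and renames them to "voice_1.wav".
    m = re.fullmatch(r"(\w+)\.(\w+)~(\d*)", name)
    if m:
        new_name = f"{m.group(1)}_{m.group(3)}.{m.group(2)}"
        os.rename(os.path.join(dataset_dir, name), os.path.join(dataset_dir, new_name))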
【Step 10: Launch the RVC WebUI】
Training and model inference (voice conversion) are performed on the web interface.
You can use the RVC WebUI by clicking on the
Running on public URL: https://〜.gradio.live
URL that appears after running the following code
Run Code
%cd /content/Retrieval-based-Voice-Conversion-WebUI
# %load_ext tensorboard
# %tensorboard --logdir /content/Retrieval-based-Voice-Conversion-WebUI/logs
!python3 infer-web.py --colab --pycmd python3
【RVC v2 model】How to use RVC WebUI: Training
(Creating a trained model with original dataset) – Train
*Click on image to enlarge. The settings are described in the description below. Please refer to them as necessary.
Click on the “Train” tab and configure as follows
(This is an example.)
【Step:1】
Fill in the experimental configuration. Experimental data is stored in the ‘logs’ folder, with each experiment having a separate folder. Manually enter the experiment name path, which contains the experimental configuration, logs, and trained model files.
Enter the experiment name:
(Name of the output trained model)
amitaro
*This is an example.
Target sample rate:
(Sampling Rates)
40k
*To use the RVC v2 model, select “40k”.
(As of May 2023)
→ The tutorial code since the last updated version on June 20, 2023 already supports “32k” and “48k” RVC v2 models.
Whether the model has pitch guidance (required for singing, optional for speech):
(Setting Pitch Guidance)
true
*Note: Cautions when attempting a real-time voice change
If the setting here is set to “false”, the model will be “pitchless” and the speech conversion accuracy will be slightly inferior when performing real-time speech conversion. If you want to improve the voice conversion accuracy, try selecting “true”. However, if you want to reduce the load on your computer when using the VC Client, the “pitchless” model seems to be one of the candidates. Please adjust according to the specifications of your computer.
It appears that the “pitchless” model will be supported starting with VC Client v.1.5.2.5 (April 26, 2023).
If you want to do real-time voice conversion with VC Client, please try to download the latest version of VC Client at the time you try real-time voice change.
Model architecture version (v2 version only supports 40k sample rate for testing purposes):
(Set the version of the RVC pre-trained model – RVC v2 only supports the 40kHz sample frequency)
v2
*The RVC v2 model can only be used if “40k” is selected in the “Target sample rate:” (sampling rate/sample frequency) setting.
(As of May 2023)
→ The tutorial code since the last updated version on June 20, 2023 already supports “32k” and “48k” RVC v2 models.
Number of CPU processes used for pitch extraction and data processing:
2
*This is an example.
【Step:2a】
Automatically traverse all files in the training folder that can be decoded into audio and perform slice normalization. Generates 2 wav folders in the experiment directory. Currently, only single-singer/speaker training is supported
Enter the path of the training folder:
(Specify the path of the dataset folder to be trained)
/content/drive/MyDrive/dataset
*Example of placing a dataset folder named “dataset” in “MyDrive” in Google Drive
Please specify the speaker/singer ID:
(Identification number setting)
0
*When training a second time, set a different ID from the first time.
(e.g., 1st training – ID 0; 2nd training – ID 1)
【Step:2b】
Use CPU to extract pitch (if the model has pitch), use GPU to extract features (select GPU index)
Enter the GPU index(es) separated by ‘-‘, e.g., 0-1-2 to use GPU 0, 1, and 2:
0
GPU information:
0 Tesla T4
Select the pitch extraction algorithm (‘pm’: faster extraction but lower-quality speech; ‘dio’: improved speech but slower extraction; ‘harvest’: better quality but slower extraction):
harvest
*This is an example.
*Confirmed as of July 30, 2023: the following options were added to this setting:
・”rmvpe” (the latest RMVPE vocal pitch extraction algorithm)
・”rmvpe_gpu” (GPU-specified version of the latest RMVPE vocal pitch extraction algorithm)
*RMVPE:
RMVPE: A Robust Model for Vocal Pitch Estimation in Polyphonic Music – Published on June 27, 2023 | arxiv
It was confirmed that both “rmvpe” and “rmvpe_gpu” could be used.
Training seemed to be faster with either of them than with “harvest”.
When set to “rmvpe_gpu”, a GPU index needed to be specified; the default setting of “0-0” seemed to work. Please try the different algorithms.
【Step:3】
Fill in the training settings and start training the model and index
Save frequency (save_every_epoch):
(Frequency of storage of trained status)
5
*This is an example.
Total training epochs (total_epoch):
(Number of training – Number of epochs)
10
*This is an example.
*If you have more time, increasing the number of training may improve the quality of voice conversion.
*If you want to improve the quality of the audio conversion, try increasing the “number of audio files” and “audio duration”.
*According to the official RVC Q&A, up to “200” epochs is acceptable for high-quality audio.
Batch size per GPU:
(Batch size for each graphics card/GPU)
*The number of training samples processed at once per GPU.
3
*This is an example.
Save only the latest ‘.ckpt’ file to save disk space:
No
*This is an example.
Cache all training sets to GPU memory. Caching small datasets (less than 10 minutes) can speed up training, but caching large datasets will consume a lot of GPU memory and may not provide much speed improvement:
No
*This is an example.
Save a small final model to the ‘weights’ folder at each save point:
No
*This is an example.
Load pre-trained base model G path.:
pretrained_v2/f0G40k.pth
*Example setting when RVC v2 model (40 kHz) is specified.
The file is automatically reflected according to the setting.
Load pre-trained base model D path.:
pretrained_v2/f0D40k.pth
*Example setting when RVC v2 model (40 kHz) is specified.
The file is automatically reflected according to the setting.
Enter the GPU index(es) separated by ‘-‘, e.g., 0-1-2 to use GPU 0, 1, and 2:
0
Once you have made the settings, click the
・“One-click training.”
button.
After a few moments, you should see output like the following.
Output results on RVC WebUI
* “Output message” in “Train”
step 1: processing data
python3 trainset_preprocess_pipeline_print.py /content/drive/MyDrive/dataset 40000 2 /content/Retrieval-based-Voice-Conversion-WebUI/logs/amitaro False
step2a:正在提取音高 (extracting pitch)
python3 extract_f0_print.py /content/Retrieval-based-Voice-Conversion-WebUI/logs/amitaro 2 harvest
step 2b: extracting features
python3 extract_feature_print.py cuda:0 1 0 0 /content/Retrieval-based-Voice-Conversion-WebUI/logs/amitaro v2
step 3a: model traning started
write filelist done
python3 train_nsf_sim_cache_sid_load_pretrain.py -e amitaro -sr 40k -f0 1 -bs 3 -g 0 -te 10 -se 5 -pg pretrained_v2/f0G40k.pth -pd pretrained_v2/f0D40k.pth -l 0 -c 0 -sw 0 -v v2
Training complete. Logs are available in the console, or the 'train.log' under experiment folder
(1072, 768),27
training index
adding index
成功构建索引 (index built successfully), added_IVF27_Flat_nprobe_1_amitaro_v2.index
all processes have been completed!
Output results on v2-RVC-WebUI-for-AI-beginners.ipynb
*Output result of “【Step 10: Launch the RVC WebUI】”
INFO:amitaro:====> Epoch: 1 [2023-05-28 11:56:54] | (0:00:31.598615)
/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
INFO:amitaro:====> Epoch: 2 [2023-05-28 11:56:56] | (0:00:02.127939)
INFO:amitaro:====> Epoch: 3 [2023-05-28 11:56:59] | (0:00:02.087378)
INFO:amitaro:====> Epoch: 4 [2023-05-28 11:57:01] | (0:00:02.215266)
INFO:amitaro:Saving model and optimizer state at epoch 5 to ./logs/amitaro/G_25.pth
INFO:amitaro:Saving model and optimizer state at epoch 5 to ./logs/amitaro/D_25.pth
INFO:amitaro:====> Epoch: 5 [2023-05-28 11:57:12] | (0:00:10.780564)
INFO:amitaro:====> Epoch: 6 [2023-05-28 11:57:14] | (0:00:02.139899)
INFO:amitaro:====> Epoch: 7 [2023-05-28 11:57:16] | (0:00:02.086639)
INFO:amitaro:====> Epoch: 8 [2023-05-28 11:57:18] | (0:00:02.081577)
INFO:amitaro:====> Epoch: 9 [2023-05-28 11:57:21] | (0:00:02.653771)
INFO:amitaro:Saving model and optimizer state at epoch 10 to ./logs/amitaro/G_50.pth
INFO:amitaro:Saving model and optimizer state at epoch 10 to ./logs/amitaro/D_50.pth
INFO:amitaro:====> Epoch: 10 [2023-05-28 11:57:28] | (0:00:07.700858)
INFO:amitaro:Training is done. The program is closed.
INFO:amitaro:saving final ckpt:Success.
The output trained model can be downloaded on Google Colaboratory.
The trained model named
・「amitaro.pth」
*Example when the name “amitaro” is set in “Experiment name:”.
*As of February 2024, the “weights” folder was located in the “assets” folder of the “Retrieval-based-Voice-Conversion-WebUI” folder.
is output in the “weights” folder.
If necessary, download it to your local environment (your own computer) and try real-time voice conversion, etc.
VC Client – client software for AI-based real-time voice changer:
w-okada/voice-changer(The MIT License)| GitHub
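Note that files on the Colab runtime disappear when the session is recycled. As a minimal sketch (my own addition; the paths assume the tutorial’s “amitaro” experiment name and the folder layout as of February 2024), you can back up the trained model and its index file to Google Drive:
!mkdir -p /content/drive/MyDrive/rvc_backup
!cp /content/Retrieval-based-Voice-Conversion-WebUI/assets/weights/amitaro.pth /content/drive/MyDrive/rvc_backup/
!cp /content/Retrieval-based-Voice-Conversion-WebUI/logs/amitaro/added_*.index /content/drive/MyDrive/rvc_backup/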
Let’s try Text-To-Speech synthesis using the RVC trained model (only “RVC v1 model” is supported)
:【VG WebUI】How to TTS(Text-To-Speech) with the RVC WebUI trained model – Introduction to TTS with RVC
【RVC v2 model】How to use RVC WebUI: Inference (Voice Conversion)
– Model inference
*Click on image to enlarge. The settings are described in the description below. Please refer to them as necessary.
Click on the “Model inference” tab and configure as follows
(This is an example.)
After clicking the “Refresh voice list and index path” button, you can set
Inferencing voice:
amitaro.pth
*Example when the name “amitaro” was set in “Experiment name:”.
Select Singer/Speaker ID:
(Identification number setting)
0
*Example when “Specify Singer/Speaker ID:” (identification ID setting) is set to “0” during training
Recommended +12 key for male to female conversion, and -12 key for female to male conversion. If the sound range goes too far and the voice is distorted, you can also adjust it to the appropriate range by yourself.
Transpose (integer, number of semitones, raise by an octave: 12, lower by an octave: -12):
+12
*In this tutorial, the voice is set to +12 to convert from a male voice to a female voice.
*If you use the RVC v2 model and set “harvest” in the “Select the pitch extraction algorithm:” section below, the voice may sometimes become higher, so tweak these settings as necessary.
Enter the path of the audio file to be processed (default is the correct format example):
/content/drive/MyDrive/originalvoice.wav
*Example of placing an audio file named “originalvoice.wav” in “MyDrive” of Google Drive.
Select the pitch extraction algorithm (‘pm’: faster extraction but lower-quality speech; ‘harvest’: better bass but extremely slow; ‘crepe’: better quality but GPU intensive):
crepe
*As of May 30, 2023, selecting “crepe” seemed to result in an error.
→ I have modified the Google Colaboratory code so that “crepe” can be used.
(Updated version of May 31, 2023 – )
*At the time of the July 11, 2023 update to support “rmvpe” (the latest RMVPE vocal pitch extraction algorithm), selecting “rmvpe” seemed to cause an error.
I will monitor the progress in the future.
If >=3: apply median filtering to the harvested pitch results. The value represents the filter radius and can reduce breathiness.:
3
*f0:fundamental frequency
*median filter:Probably noise reduction
Path to the feature index file. Leave blank to use the selected result from the dropdown:
Leave blank (empty)
*This is an example.
*Alternatively, you can copy and paste the path of the “added〜.index” file in the “logs” folder of the “Retrieval-based-Voice-Conversion-WebUI” folder into the input field.
Auto-detect index path and select from the dropdown:
logs/amitaro/added_IVF27_Flat_nprobe_1_amitaro_v2.index
*This is an example.
*Example when “Experiment name:” (the name of the output trained model) was set to “amitaro” during training.
*The path of the “added〜.index” file in the “logs” folder of the “Retrieval-based-Voice-Conversion-WebUI” folder is reflected automatically.
*If the “Experiment name:” (the name of the output trained model) is “amitaro2” for the second training, select the file
“logs/amitaro2/added_~_amitaro2_v2.index”
Search feature ratio:
0.5
*The default setting seems to be “0.76”. (“0.88” was the default in the May 30, 2023 update.)
What happens to the audio by changing the value? Try different things.
As far as I tried a few times, setting it to “1” seemed to introduce some noise in the middle…
When I turned it down to around “0.5”, it no longer bothered me.
*The closer the “Search feature ratio” is to “1”, the more the result seems to be biased toward the features (tone quality) of the newly trained model.
Resample the output audio in post-processing to the final sample rate. Set to 0 for no resampling:
48000
*This is an example.
*If you do not want to resample (i.e., you want the converted audio to be “40kHz” with the tutorial settings), set the setting here to “0”.
Use the volume envelope of the input to replace or mix with the volume envelope of the output. The closer the ratio is to 1, the more the output envelope is used:
1
*This is an example.
*Perhaps, when this is set to less than 1, the volume envelope (the volume of the audio over time) is mixed closer to that of the input audio. In this tutorial, I set it fully toward the output.
What happens to the audio when you change the value? Please try it out.
Protect voiceless consonants and breath sounds to prevent artifacts such as tearing in electronic music. Set to 0.5 to disable. Decrease the value to increase protection, but it may reduce indexing accuracy:
0.33
(This is an example.)
*The default setting seems to be set to “0.33”.
What happens to the audio by changing the value? Try different things.
Note: voiceless consonant
:a consonant produced without vocal fold vibration.
When settings are complete, click the
・Convert
button.
Inference completes in a few seconds to about 20 seconds.
*The first inference takes about 20 seconds; the second and subsequent inferences take a few seconds.
I trained with only 10 files of about 1-3 seconds each, but on listening to the result, the voice was indeed converted to a feminine voice.
It seemed to depend on the occasion, but as far as I tested with Amitaro’s voice, when using the RVC v2 model and converting with “harvest”, the key was sometimes higher.
In such cases, it seemed necessary to set the key lower.
Please try to create your own original voice changer through trial and error, referring to the instructions in this article.
The output voice file can be downloaded from
「Export audio (Click on the three dots in the bottom right corner to download)」
In addition, it is saved in the “TEMP” folder with a file name like “audio.wav” (example file name) below,
so download it to your local environment (your computer) if necessary and use it.
A new voice-changed audio file is added to this folder each time a new inference is made.
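If you prefer to collect the converted files from the notebook side rather than the browser, a copy like the following (my own sketch; the “TEMP” location is as described above and may change between versions) sends them to Google Drive:
!mkdir -p /content/drive/MyDrive/rvc_output
!cp /content/Retrieval-based-Voice-Conversion-WebUI/TEMP/*.wav /content/drive/MyDrive/rvc_output/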
【Confirmed event: as of June 20, 2023】
There seemed to be a bug on Google Colaboratory where files were not output.
On the RVC WebUI, it seemed to be possible to download the audio file.
The above is a lengthy explanation of how to use it.
I am also interested in the progress of future development.
Q&A displayed on the RVC WebUI
:As of May 30, 2023
For reference, RVC WebUI-related information is posted on this page for easy searching and research.
The official RVC Q&A below (the FAQ shown when you start the original version of the RVC WebUI) is based on the text displayed when I ran the code
・as of May 30, 2023.
I hope that this will provide you with a starting point for solving any questions you may have about the RVC WebUI.
Q1:ffmpeg error/utf8 error.
It is most likely not an FFmpeg issue, but rather an audio path issue;
FFmpeg may encounter an error when reading paths containing special characters like spaces and “()”, which may cause an “FFmpeg error”;
and when the training set’s audio contains Chinese paths, writing them into filelist.txt may cause a “utf8 error”.
Q2:Cannot find index file after “One-click Training”.
If it displays “Training is done. The program is closed,” then the model has been trained successfully, and the subsequent errors are fake;
The lack of an ‘added’ index file after “One-click training” may be due to the training set being too large, causing the addition of the index to get stuck;
this has been resolved by using batch processing to add the index, which solves the problem of memory overload when adding the index.
As a temporary solution, try clicking the “Train Index” button again.
Q3:Cannot find the model in “Inferencing timbre” after training
Click “Refresh timbre list”(“Refresh voice list and index path”)and check again;
if still not visible, check if there are any errors during training and send screenshots of
・the console
・web UI, and logs/experiment_name/*.log
to the developers for further analysis.
Q4:How to share a model/How to use others’ models?
The pth files stored in rvc_root/logs/experiment_name are not meant for sharing or inference, but for storing the experiment checkpoints for reproducibility and further training. The model to be shared should be the 60+MB pth file in the weights folder;
In the future, weights/exp_name.pth and logs/exp_name/added_xxx.index will be merged into a single weights/exp_name.zip file to eliminate the need for manual index input; so share the zip file, not the pth file, unless you want to continue training on a different machine;
Copying/sharing the several hundred MB pth files from the logs folder to the weights folder for forced inference may result in errors such as missing f0, tgt_sr, or other keys. You need to use the ckpt tab at the bottom to manually or automatically (if the information is found in logs/exp_name) select whether to include pitch information and the target audio sampling rate, and then extract the smaller model. After extraction, there will be a 60+ MB pth file in the weights folder, and you can refresh the voices to use it.
Q5:Connection Error.
You may have closed the console (black command line window).
Note:
In the case of Google Colaboratory, the program may have been disconnected from Google Colaboratory or the program in 【Step 10: Launch the RVC WebUI】 may have been stopped.
Q6:WebUI popup ‘Expecting value: line 1 column 1 (char 0)’.
Please disable system LAN proxy/global proxy and then refresh.
Q7:How to train and infer without the WebUI?
Training script:
You can run training in WebUI first, and the command-line versions of dataset preprocessing and training will be displayed in the message window.
Inference script:
https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/myinfer.py
e.g.
runtime\python.exe myinfer.py 0 "E:\codes\py39\RVC-beta\todo-songs\1111.wav" "E:\codes\py39\logs\mi-test\added_IVF677_Flat_nprobe_7.index" harvest "test.wav" "weights/mi-test.pth" 0.6 cuda:0 True
f0up_key = sys.argv[1]
input_path = sys.argv[2]
index_path = sys.argv[3]
f0method = sys.argv[4]  # harvest or pm
opt_path = sys.argv[5]
model_path = sys.argv[6]
index_rate = float(sys.argv[7])
device = sys.argv[8]
is_half = bool(sys.argv[9])
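On Google Colaboratory, assuming myinfer.py has been downloaded into the “Retrieval-based-Voice-Conversion-WebUI” folder, an equivalent invocation might look like the following. This is my own illustration of the argument order above; the file paths reuse this tutorial’s examples and are not from the official FAQ:
# f0up_key  input_path  index_path  f0method  opt_path  model_path  index_rate  device  is_half
!python3 myinfer.py 0 "/content/drive/MyDrive/originalvoice.wav" "logs/amitaro/added_IVF27_Flat_nprobe_1_amitaro_v2.index" harvest "opt.wav" "assets/weights/amitaro.pth" 0.6 cuda:0 True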
Q8:Cuda error/Cuda out of memory.
There is a small chance that there is a problem with the CUDA configuration or the device is not supported; more likely, there is not enough memory (out of memory).
For training, reduce the batch size (if reducing it to 1 is still not enough, you may need to change the graphics card); for inference, adjust the x_pad, x_query, x_center, and x_max settings in the config.py file as needed. Cards with less than 4GB of memory (e.g. the 1060 (3GB) and various 2GB cards) can be given up on, while 4GB cards still have a chance.
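As a purely hypothetical illustration of that advice (the variable names follow the Q&A text; the values are starting points to experiment with, not tested recommendations), the relevant lines in config.py might be reduced like this for a card with little memory:
# Hypothetical low-memory values; smaller numbers trade inference speed for GPU memory.
x_pad = 1
x_query = 5
x_center = 30
x_max = 32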
Q9:How many total_epoch are optimal?
If the training dataset’s audio quality is poor and the noise floor is high, “20-30 epochs” are sufficient. Setting it too high won’t improve the audio quality of your low-quality training set.
If the training set audio quality is high, the noise floor is low, and there is sufficient duration, you can increase it. “200” is acceptable (since training is fast, and if you’re able to prepare a high-quality training set, your GPU likely can handle a longer training duration without issue).
Q10:How much training set duration is needed?
A dataset of around “10min to 50min” is recommended.
With guaranteed high sound quality and a low noise floor, more can be added if the dataset’s timbre is uniform.
For a high-quality training set (clean + distinctive timbre), “5min to 10min” is fine.
There are some people who have trained successfully with 1min to 2min data, but the success is not reproducible by others and is not very informative.
This requires that the training set has a very distinctive timbre (e.g. a high-frequency airy anime girl sound) and the quality of the audio is high; Data of less than 1min duration has not been successfully attempted so far. This is not recommended.
Q11:What is the index rate for and how to adjust it?
Note:
I am not sure, but this may refer to the “Search feature ratio:” setting in model inference.
I have experienced symptoms like “tone leakage” depending on this setting…
If the tone quality of the pre-trained model and inference source is higher than that of the training set, they can bring up the tone quality of the inference result, but at the cost of a possible tone bias towards the tone of the underlying model/inference source rather than the tone of the training set, which is generally referred to as “tone leakage”.
The index rate is used to reduce/resolve the timbre leakage problem. If the index rate is set to 1, theoretically there is no timbre leakage from the inference source and the timbre quality is more biased towards the training set. If the training set has a lower sound quality than the inference source, then a higher index rate may reduce the sound quality. Turning it down to 0 does not have the effect of using retrieval blending to protect the training set tones.
If the training set has good audio quality and long duration, turn up total_epoch; the model itself is then less likely to refer to the inference source and the pre-trained base model, so there is little “tone leakage”, the index_rate is not important, and you can even skip creating/sharing the index file.
Q12:How to choose the gpu when inferring?
In the “config.py” file, select the card number after “device cuda:”.
The mapping between card number and graphics card can be seen in the graphics card information section of the training tab.
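As a hypothetical illustration (the variable naming follows the Q&A wording; check your own config.py for the exact line), selecting the second graphics card would look like:
device = "cuda:1"  # card number 1, i.e. the second GPU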
Q13:How to use the model saved in the middle of training?
Save via model extraction at the bottom of the ckpt processing tab.
Q14:File/memory error(when training)?
Too many processes and your memory is not enough. You may fix it by:
1. Decrease the value in the “Threads of CPU” field.
2. Pre-cut the training set into shorter audio files.
Comments on RVC WebUI explanatory videos and examples of responses
– As of Feb 8, 2024
If questions and similar feedback come in from viewers of the tutorial video, I will add them here.
Launch the RVC WebUI – already supported:RVC WebUI tutorial code for RVC v2 compatible version is not available
【As of February 8, 2024】
The tutorial code had been unavailable for a while since 2023, when use of the RVC WebUI on Google Colaboratory became limited to paying users.
However, in the last update on January 28, 2024, the tutorial code in “Step 5” and “Step 6” was modified (the download locations were changed).
After that, it became usable again.
Then, on February 7, 2024, the code in “Step 2” was fixed for
・errors related to “gradio_client”
(gradio_client was downgraded).
After the code in “Step 2” was fixed, the RVC WebUI was able to perform both “Train” and “Inference” again.
Train & Model inference:Can you create a tutorial video for the RVC WebUI that converts to a singing voice?(AI Cover)
– Information as of July 25, 2023 –
If you want to perform a singing voice change, set the
・”Train”
setting
・”Whether the model has pitch guidance (required for singing, optional for speech):”
to
・”True”.
Also, from what I know at this point, if you want to convert singing voices, it seems better to create the dataset from singing voices.
Specifically, when creating such a dataset, it is better to include voice data in both the low and high (falsetto) registers.
The reason why I have not released a demo of the singing voice conversion method publicly in the tutorial video is that there is a rights-related issue.
In the future, I will check to see if there are any materials that clear the rights issue, and if it seems possible, I will consider a tutorial video.
【RVC WebUI Tutorial】
Reuse of previously trained models from RVC WebUI – How to use AI Voice changer
Video viewing time: 4 min. 47 sec.
I have found a way to reuse a set of files related to previous trained models in the RVC WebUI, and I have summarized the procedure in a video.
【Contents:Reuse of previously trained models from RVC WebUI】
0:00 Introduction
0:31 Step 1:How to use past trained model
1:06 Step 2:Launch the RVC WebUI
1:36 Step 3: Place files
2:50 Step 4:Perform Model inference
by 子供プログラマー – Child Programmer
How to use the RVC WebUI – AI Voice Changer Tutorial by Child Programmer
An Introductory Course for Japanese Artificial Intelligence Programmers (Machine Learning) by Child Programmer
How to use the Vocal Remover in the RVC WebUI
:【RVC WebUIの使い方】ボーカルリムーバー編 – Vocal Remover: Separating Vocals and Music for Creating a Training Dataset for an AI Voice Changer