【VG WebUI】How to TTS(Text-To-Speech) with the RVC WebUI trained model – Introduction to TTS with RVC

Text-To-Speech with RVC AI Voice Changer Tutorial

【TTS with RVC】How to Text To Speech with the RVC WebUI trained model
– Tutorial for Voice AI beginners

Video viewing time： 15 min. 11 sec.

If the text alone is hard to follow, please refer to the explanatory video as well.
＊【Status as of July 28, 2024】I was able to modify the code so that VG WebUI could be used again.

【Additional information：November 17&18, 2023】

I have modified the RVC v1&v2 model compatible version of the notebook so that the VG WebUI can be used.

【Additional information：July 28, 2024】

VG WebUI had become unusable due to changes in the Google Colaboratory environment.
I have since modified the RVC v1 & v2 model compatible version of the notebook so that the VG WebUI can be used again.

Voice AI is hot! But is it still too expensive to get started?
VG WebUI – Tutorial on Text To Speech with AI Voice Changer

This is a tutorial on how to use VG WebUI to perform Text-To-Speech (TTS: making a computer speak a specified sentence) with a trained model created with RVC WebUI, the AI voice changer released in April 2023.

RVC WebUI：
RVC-Project（previously known as liujing04）/Retrieval-based-Voice-Conversion-WebUI（The MIT License）| GitHub

VG WebUI：
log1stics/voice-generator-webui（The MIT License）| GitHub

I hope that this series of information will help you get started with text-to-speech with a trained model created by AI Voice Changer.

Since April 2023, people have been trying real-time voice changing and sharing it on social networking sites, YouTube videos, and online articles, which shows a high level of interest in voice AI.
I imagine that when people look back decades after April 2023, when this article was first written,

✔︎They will recognize that April 2023 was the turning point when the general public began to enjoy the benefits of voice AI.
（Democratization of voice AI?）

I hope that this series of information will encourage you to try Text-To-Speech using trained data created with the RVC WebUI.

Sample Code Links & Program Licenses

To make it easy for you to try VG WebUI, I have released sample code with explanations on how to use it.
I hope this will be helpful for AI beginners in English-speaking countries who want to start Text-To-Speech synthesis with a trained model created with the AI Voice Changer.

English Version
VG WebUI Tutorial：
VG-WebUI-TTS-for-AI-beginners.ipynb（The MIT License）| Google Colaboratory

【Additional information：August 13, 2023】

Tutorial code for the VG WebUI, which also supports the use of “RVC v2 model”, is now available.

RVC v2 model supported – Tutorial code for how to use VG WebUI （For text-to-speech）：
v2-VG-WebUI-TTS-for-AI-beginners.ipynb（The MIT License）| Google Colaboratory

【How to run the code on Google Colaboratory】

To run the program on Google Colaboratory, first follow these steps:

①Sign in with your Google account
②Save a copy in Drive（Save “VG-WebUI-TTS-for-AI-beginners.ipynb”）
＊ From the Google Colaboratory menu “File – Save a copy in Drive”

Rename the copy as you like, and then run the code.

To run the code, simply do the following:

・Run all
＊ From the Google Colaboratory menu “Runtime – Run all”

Once you get the gist of it, the procedure to start the VG WebUI is easy.
If the trained model is not reflected when VG WebUI is launched, place the trained model in the correct folder and then launch VG WebUI again.

License for the sample code “VG-WebUI-TTS-for-AI-beginners.ipynb”：

The MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

How to start the VG WebUI
– Last update: July 28, 2024

【Step 1: Creating a trained model】

In advance, create a trained model with the RVC WebUI.
If you do not know how to use RVC WebUI, please use this tutorial video.

RVC WebUI tutorial video：
RVC WebUI tutorial video for RVC v2 model supported version
【The Magic of Modern Times】How to use RVC WebUI（RVC v2 model supported）- Tutorial for AI Beginners by RehabC（YouTube）
Video viewing time： 35 min. 27 sec.
＊If you want to use your own trained model in the VG WebUI, select the “RVC v1 model” in the train settings of the RVC WebUI.

【Step 2: Mount Google Drive】

Mounting Google Drive lets the notebook use the folders and files stored there.
After executing the following code, you will be asked for permission, so select “Allow” with the Google account whose Google Drive you want to use.

Run Code：

from google.colab import drive
drive.mount('/content/drive')
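Before moving on, it can help to confirm that the mount actually succeeded. The following is a minimal sketch, assuming the mount point `/content/drive` used above; the helper name `drive_is_mounted` is made up for this illustration and is not part of Google Colaboratory.

```python
import os


def drive_is_mounted(mount_point="/content/drive"):
    """Return True if the "MyDrive" folder is visible under the mount point."""
    return os.path.isdir(os.path.join(mount_point, "MyDrive"))


# On Google Colaboratory this should report True after drive.mount() has finished.
print("Google Drive mounted:", drive_is_mounted())
```

If this reports False, run the mount code again and check that permission was granted.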

【Step 3: Prepare a folder with the trained model in Google Drive.】

In Google Drive’s “MyDrive”, prepare a folder called

・“amitaro”

that contains the trained model created with RVC WebUI.

【Folder Structure】
amitaro
　|— amitaro.pth（Required file）
　|— added.index（Not a required file）
　|— total_fea.npy（Not a required file）

＊Example using a trained model named “amitaro.pth”
＊Download the file named “〜.pth” from the “weights” folder of the “Retrieval-based-Voice-Conversion-WebUI” folder.
＊Download the file named “added〜.index” from the “logs” folder of the “Retrieval-based-Voice-Conversion-WebUI” folder. After downloading, rename the file “added〜.index” to “added.index”.
＊Download the file “total_fea.npy” from the “logs” folder of the “Retrieval-based-Voice-Conversion-WebUI” folder.
＊Watch the RVC WebUI tutorial video to see where each file is located.
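The folder structure above can be checked with a few lines of Python before launching anything. This is only a sketch: it assumes, as in the “amitaro” example, that the “.pth” file has the same name as its folder, and the function name `check_model_folder` is hypothetical.

```python
from pathlib import Path


def check_model_folder(folder):
    """Report whether the required and optional model files are present."""
    folder = Path(folder)
    # Assumption: the required file is "<folder name>.pth", as in the amitaro example.
    required = folder / f"{folder.name}.pth"
    optional = [folder / "added.index", folder / "total_fea.npy"]
    return {
        "required_ok": required.is_file(),
        "missing_optional": [p.name for p in optional if not p.is_file()],
    }


# Path used in this tutorial; adjust it if your folder has a different name.
print(check_model_folder("/content/drive/MyDrive/amitaro"))
```

A result with `required_ok` set to False means the “.pth” file is missing or named differently from the folder.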

After the upload to Google Drive is complete, move the “amitaro” folder (the folder containing the trained model, etc.) to the “models” folder.
＊Location of the “models” folder:
The “models” folder is inside the “vc” folder of the “voice-generator-webui” folder. If you run the code cells one by one, the “voice-generator-webui” folder appears after you run the “Step 4-1: Setup” program.
＊If you “Run all”, you will see the folder “voice-generator-webui” after the “Step 4-1: Setup” program runs.
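Instead of moving the folder by hand in the file browser, the same placement can be done from a code cell. The sketch below copies the folder rather than moving it, so the original stays in Google Drive; the function name `install_model` is made up for this example, and the paths in the comment are the ones used in this tutorial.

```python
import shutil
from pathlib import Path


def install_model(src, models_dir):
    """Copy a trained-model folder into the VG WebUI "models" folder."""
    src, models_dir = Path(src), Path(models_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"Model folder not found: {src}")
    models_dir.mkdir(parents=True, exist_ok=True)
    dest = models_dir / src.name
    # dirs_exist_ok lets the cell be re-run without an error (Python 3.8+).
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest


# Example call on Google Colaboratory (run it after the setup step has created the folder):
# install_model("/content/drive/MyDrive/amitaro",
#               "/content/voice-generator-webui/vc/models")
```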

【Step 4：Setup （GPU required）】

If the following commands do not show a GPU, go to the Google Colaboratory menu

「Runtime – Change runtime type – Hardware accelerator」

and select “GPU”, then save and run the code again.
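The GPU check can also be done from Python, which is handy if you want a cell to stop early when no GPU is attached. This is a rough sketch that relies only on the `nvidia-smi` command being present; the function name `gpu_visible` is hypothetical.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if nvidia-smi exists and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    # "nvidia-smi -L" prints one line per GPU, each starting with "GPU".
    result = subprocess.run([exe, "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


if not gpu_visible():
    print("No GPU found: select GPU under Runtime - Change runtime type.")
```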

Run Code：

!nvidia-smi
!nvcc -V
!free -h

# As of July 28, 2024: Downgrade pip from the Google Colaboratory default "pip==24.1.2" to avoid errors such as "fairseq 0.12.2 depends on omegaconf<2.1"
!python3 -m pip install --upgrade pip==22.3
# As of July 28, 2024: Pin the "omegaconf" version
!pip3 install omegaconf==2.0.6
# As of July 28, 2024: Countermeasure for "typeguard" related errors and "Cannot install fairseq and fairseq==0.12.2"
!pip3 install typeguard==3.0.2 inflect==6.0.5 albumentations==1.3.1 albucore==0.0.5
# As of July 28, 2024: Downgrade "PyTorch" to version 2.0.0 and adjust dependencies (the Google Colaboratory default as of July 28, 2024 is torch 2.3.1+cu121)
!pip3 install torch==2.0.0 torchtext==0.15.1 torchvision==0.15.1 torchaudio==2.0.1


!pip uninstall -y cmake
!wget https://github.com/Kitware/CMake/releases/download/v3.22.1/cmake-3.22.1-linux-x86_64.tar.gz
!tar xf cmake-3.22.1-linux-x86_64.tar.gz
!rm cmake-3.22.1-linux-x86_64.tar.gz
# Each "!" command runs in its own subshell, so PATH has to be updated through os.environ to persist.
import os
os.environ["PATH"] += ":/content/cmake-3.22.1-linux-x86_64/bin"

!apt-get install -y espeak
# As of July 28, 2024: Changed to use the latest version of “langchain” and added installation of “langchain-community” to prevent error “No module named ‘langchain_community’”.
!pip3 install langchain langchain-community

# As of July 28, 2024: Changed "typing-extensions==4.5.0" to "typing-extensions==4.7.1" as a countermeasure for the errors "pydantic-core 2.19.0 requires typing-extensions!=4.7.0,>=4.6.0" and "ImportError: cannot import name 'cast' from 'typing_extensions'"
!pip3 install kaleido cohere openai tiktoken typing-extensions==4.7.1 fastapi python-multipart uvicorn tensorflow-probability==0.20.1 gradio==3.34.0 tensorboardX

!git clone https://github.com/log1stics/voice-generator-webui
%cd voice-generator-webui

folder = r"/content/voice-generator-webui/vc/models"
# Create the "models" folder if it does not exist yet
os.makedirs(folder, exist_ok=True)

!chmod +x setup.sh
!./setup.sh
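One detail of the setup code is worth a closer look: the CMake folder is added to PATH through `os.environ`, and a plain `+=` appends the same entry again every time the cell is re-run. The following sketch shows one way to add a directory only once; the helper name `add_to_path` is an assumption for illustration and does not appear in the notebook.

```python
import os


def add_to_path(directory, env=os.environ):
    """Append a directory to PATH unless it is already listed."""
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.append(directory)
        env["PATH"] = os.pathsep.join(parts)
    return env.get("PATH", "")


# Example call for the CMake folder downloaded in the setup step:
# add_to_path("/content/cmake-3.22.1-linux-x86_64/bin")
```

Because the change is made in `os.environ`, later `!` commands in the notebook inherit it.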

As of July 28, 2024, when 【Step 4: Setup (GPU required)】 is executed, the following output appears at the end.

# Output
performance hint: core.pyx:7:5: Exception check on ‘maximum_path_each’ will always require the GIL to be acquired.
Possible solutions:
1. Declare ‘maximum_path_each’ as ‘noexcept’ if you control the definition and you’re sure you don’t want the function to raise exceptions.
2. Use an ‘int’ return type on ‘maximum_path_each’ to allow an error code to be returned.
performance hint: core.pyx:38:6: Exception check on ‘maximum_path_c’ will always require the GIL to be acquired.
Possible solutions:
1. Declare ‘maximum_path_c’ as ‘noexcept’ if you control the definition and you’re sure you don’t want the function to raise exceptions.
2. Use an ‘int’ return type on ‘maximum_path_c’ to allow an error code to be returned.
performance hint: core.pyx:42:21: Exception check after calling ‘maximum_path_each’ will always require the GIL to be acquired.
Possible solutions:
1. Declare ‘maximum_path_each’ as ‘noexcept’ if you control the definition and you’re sure you don’t want the function to raise exceptions.
2. Use an ‘int’ return type on ‘maximum_path_each’ to allow an error code to be returned.

These messages are performance hints rather than errors, and it seemed that VG WebUI could be used without any special measures.
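Since the setup depends on several pinned versions, a quick check that the pins actually took effect can save time when something fails later. The sketch below uses the standard `importlib.metadata` module; the `PINS` table repeats a few versions from the setup code above, and the function name `check_pins` is hypothetical.

```python
from importlib import metadata

# A few of the versions pinned in the setup step.
PINS = {
    "torch": "2.0.0",
    "omegaconf": "2.0.6",
    "gradio": "3.34.0",
    "typing-extensions": "4.7.1",
}


def check_pins(pins):
    """Map each package name to (installed version or None, matches pin?)."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore a local build suffix such as "+cu118" when comparing.
        ok = found is not None and found.split("+")[0] == wanted
        report[name] = (found, ok)
    return report


for name, (found, ok) in check_pins(PINS).items():
    print(f"{name}: installed={found} expected={PINS[name]} ok={ok}")
```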

【Step 5: Launch the VG WebUI】

Text-To-Speech is performed in a web interface. After executing the following code, a URL of the form

Running on public URL: https://〜.gradio.live

is displayed. Click this URL to open the VG WebUI.

If the trained model is not reflected, try running this code again.

【Confirmation】
When I tried to use an “RVC v2 model” (for example, one trained with the original version of RVC WebUI), a “vc” related error occurred and the model seemed to be unusable. If you wish to use a trained model created with the RVC WebUI, please use an “RVC v1 model”.

Run code：

!python3 webui.py --colab
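If you save the launch output as text, the public URL can be picked out with a regular expression. This is a small sketch that assumes the log line has the form shown above (“Running on public URL: https://〜.gradio.live”); the function name `find_public_url` and the sample log with its URL are made up for illustration.

```python
import re


def find_public_url(log_text):
    """Return the first URL after "Running on public URL:", or None."""
    match = re.search(r"Running on public URL:\s*(https://\S+)", log_text)
    return match.group(1) if match else None


# Made-up sample log; the real subdomain differs on every launch.
sample_log = "Running on public URL: https://example-subdomain.gradio.live"
print(find_public_url(sample_log))
```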

How to use the VG WebUI：Perform TTS with sample voice data

Click on the “Generate” tab and configure as follows
(This is an example.)

Text：
（Enter the text you want to convert to speech）
Decades from now, when we look back on the history of AI, the year 2023 will be recognized as a turning point for the spread of voice AI.
（English example sentence）

Languages：
（Language of the Text-To-Speech output）
en
＊Setting for English
＊Set “ja” for Japanese

Speaker：
（Specify the audio data of the speaker/speakers of your choice）
speaker1

Speed：
（Speed setting）
1
＊With “1” as the baseline, values closer to “0” make the speech slower and values closer to “2” make it faster.

Voice Conversion：
No conversion
（Settings without AI voice changer）

Once the settings are made, click

・「Generate From Text」
（Execute Text-To-Speech synthesis）

・「Output Audio」
（Text-To-Speech results）

Click the triangular playback button in the “Output Audio” section to listen to the audio.
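The example settings above can also be written down as a small record and checked before you type them into the web interface. This is purely an illustrative sketch: the dictionary keys and the function `validate_settings` are hypothetical and are not part of VG WebUI, and the speed range is an assumption based on the description of the Speed setting.

```python
# Languages described in this article: English ("en") and Japanese ("ja").
SUPPORTED_LANGUAGES = {"en", "ja"}


def validate_settings(settings):
    """Return a list of problems found in a settings record (empty if none)."""
    problems = []
    if not settings.get("text", "").strip():
        problems.append("text is empty")
    if settings.get("language") not in SUPPORTED_LANGUAGES:
        problems.append("language must be 'en' or 'ja'")
    speed = settings.get("speed", 1)
    # Assumption: usable speeds are above 0 and at most 2, with 1 as the baseline.
    if not 0 < speed <= 2:
        problems.append("speed should be above 0 and at most 2")
    return problems


example = {
    "text": "Hello from VG WebUI.",
    "language": "en",
    "speaker": "speaker1",
    "speed": 1,
    "voice_conversion": "No conversion",
}
print(validate_settings(example))
```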
Next, let’s try Text-To-Speech using the AI Voice Changer trained model that you have prepared.

How to use the VG WebUI：AI Voice Changer

Click on the “Generate” tab and configure as follows
(This is an example.)

Languages：
（Language of the Text-To-Speech output）
en
＊Setting for English
＊Set “ja” for Japanese

Speaker：
（Specify the audio data of the speaker/speakers of your choice）
speaker1

Speed：
（Speed setting）
1
＊With “1” as the baseline, values closer to “0” make the speech slower and values closer to “2” make it faster.

Voice Conversion：
（Select your own AI voice changer model）
amitaro
（Setting for using the AI voice changer）
＊Example configuration for using a trained model named “amitaro”（“amitaro.pth”）

VC Settings：
（Selection of various settings for voice conversion）

Pitch：
（Pitch and key settings）
12
＊Specify “0” when the key of the voice stays almost the same, for example when converting from a female voice to a female voice or from a male voice to a male voice.
＊Specify “-12” to lower the key when converting from a female voice to a male voice. (An example)
＊Specify “12” to raise the key when converting from a male voice to a female voice. (An example)
＊However, for converting to a much higher voice like Amitaro’s, it seemed better to choose a female speaker and raise the key by 12.
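To get a feel for what these Pitch values mean, the arithmetic can be worked out in a few lines. The sketch below assumes that the Pitch value is counted in semitones, as the transpose setting in RVC WebUI is, so that 12 steps correspond to one octave; the function name `pitch_ratio` is made up for this example.

```python
def pitch_ratio(semitones):
    """Frequency multiplier for a shift of the given number of semitones."""
    # Twelve semitones make one octave, and one octave doubles the frequency.
    return 2.0 ** (semitones / 12.0)


for shift in (-12, 0, 12):
    print(f"Pitch {shift:+d}: frequency x {pitch_ratio(shift):.2f}")
```

Under this assumption, “12” doubles the fundamental frequency of the voice, “-12” halves it, and “0” leaves it unchanged.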

Pitch Method pm: speed-oriented, harvest: accuracy-oriented：
harvest
＊Please select “pm” or “harvest” according to your preference

Once the settings are made, click

・「Generate From Text」
（Execute text-to-speech synthesis）

・「Output Audio」
（Text-To-Speech results）

Click the triangular playback button in the “Output Audio” section to listen to the audio.

If you want to save the audio, click the dotted symbol on the right side of the audio player to download it.

I also tried

・“the latest RMVPE vocal pitch extraction algorithm”

implemented in the original version of RVC WebUI as of July 2023.
Personally, I felt that the combination of a “40kHz” sample rate and the “harvest” pitch extraction algorithm came closer to Amitaro’s voice quality.

After much trial and error, I found that changing the speaker selected in “Speaker” also changes the converted voice. Please experiment with various settings to see if you can convert the voice to your liking.

If this is your first time trying voice synthesis, the series of operations may be puzzling, but once you have confirmed how to operate the program, you will be fine the next time.

I thank the developer for releasing such a useful program.

Voice AI is still in its infancy, but I believe that being able to try out this kind of program gives people a chance to experience the advances in AI firsthand, and may lead some of them to become interested in machine learning programming.

I’m looking forward to future developments in the field of voice AI.

VG WebUI Q&A：Comments on the Tutorial Video and Examples of Responses
– Last updated：August 3, 2023

If you leave a comment or question, I will add to this page as necessary.

TTS with RVC Q&A – About RVC models：
What is the difference between RVC v1 and RVC v2 model? Is RVC V2 better than V1？ Or are they the same？

According to one of the RVC WebUI developers, the RVC v2 model is

・”Theoretically better quality.”

【Source】
What is the difference between the v1 and v2 models? #531 – RVC-Project/Retrieval-based-Voice-Conversion-WebUI | GitHub

Also, according to a person who contributed to the development of the RVC WebUI,

・”The RVC v2 model has better Japanese pronunciation.”

【Source】
続・RVCのモデルを日本語向けに事前学習する – @nadare | Qiita
（Continued: Pre-training the RVC model for Japanese）

TTS with RVC Q&A – Error and Response Example：
NameError: name ‘vc’ is not defined

The current VG WebUI supports only the

・“RVC v1 model”

As an example, if possible, use the following RVC WebUI tutorial code

・RVC WebUI tutorial code for the RVC v2 model described in the video：
v2-RVC-WebUI-for-AI-beginners.ipynb（The MIT License）| Google Colaboratory

and try training with the “RVC v1 model” selected in the Train settings.
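If you are not sure whether an existing “.pth” file is a v1 or v2 model, the checkpoint itself may tell you. The sketch below rests on an assumption about the file format: as far as I know, RVC WebUI stores a “version” entry in the checkpoint dictionary and treats a missing entry as “v1”. The function names are hypothetical, and the loading step is left as a comment because it needs PyTorch and a real model file.

```python
def rvc_model_version(checkpoint):
    """Return the model version recorded in a loaded RVC checkpoint dict."""
    # Assumption: older checkpoints have no "version" entry and count as v1.
    return checkpoint.get("version", "v1")


def usable_in_vg_webui(checkpoint):
    """True if the checkpoint looks like an RVC v1 model."""
    return rvc_model_version(checkpoint) == "v1"


# Loading a real file on Google Colaboratory would look roughly like this:
# import torch
# checkpoint = torch.load("/content/drive/MyDrive/amitaro/amitaro.pth", map_location="cpu")
# print(rvc_model_version(checkpoint))
```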

TTS with RVC Q&A – About VG WebUI：
Does it support multiple languages, or does it only support JP and ENG?

This program supports JP and ENG only.

TTS with RVC Q&A – About VG WebUI：
How to do it for song. What are the steps?

I had never considered having the voice sing in this program.
I still have a lot to learn about using TTS for singing.
I have not tried it yet, but the developer has put together an article

日本語の音素・アクセントに関して – log1stics/voice-generator-webui | GitHub
（On Japanese phonemes and accents）
＊I don’t know if this will help.

on how to adjust the pitch, etc., so please refer to it if necessary.

TTS with RVC Q&A – About VG WebUI：
If you have a tutorial on how to use it with “Lip Movement”

I had never thought about utilizing “Lip Movement”.
I guess I still need to learn more about this one myself.
I have looked into it and found that there are various studies related to lip movement.

An Application to Convert Lip Movement into Readable Text – 2022 | arxiv

by 子供プログラマー – Child Programmer

RVC WebUIの使い方 – AIボイスチェンジャーチュートリアル | 一覧ページ

How to use the RVC WebUI – AI Voice Changer Tutorial | List

日本人のための人工知能プログラマー入門講座（機械学習）| 一覧ページ

An Introductory Course for Japanese Artificial Intelligence Programmers (Machine Learning) by Child Programmer
