A VITS-Based Low-Resource Speech Synthesis System with Pitch Control
TTS, Low-resourced, VITS, PESQ, Objective
In recent years, end-to-end text-to-speech (TTS) models have made significant progress. These models enable single-stage training and parallel sampling. However, their audio quality still falls short of traditional TTS systems, and generating speech with fine-grained prosody control remains a challenge. This work presents a pitch-controllable end-to-end TTS system based on the VITS architecture. The proposed method produces more natural-sounding audio than existing two-stage models. A vocoder is employed for efficient, high-fidelity speech synthesis: since speech signals consist of sinusoidal components with different periods, the vocoder enhances the overall audio quality. The model is trained on 13 hours of phonetically balanced single-speaker Bangla speech data. In objective evaluation, it achieves a PESQ score of 1.26 on a scale of -0.5 to 4.5. Compared to existing non-commercial Bangla TTS systems, the proposed VITS-based approach demonstrates superior naturalness.
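To put the reported PESQ figure in context, the following minimal sketch computes where a score sits within the quoted range. Only the score (1.26) and the range (-0.5 to 4.5) come from the abstract; the helper name `pesq_scale_position` is hypothetical and used purely for illustration, and this is plain arithmetic rather than the paper's evaluation code.

```python
# Range quoted in the abstract for PESQ scores.
PESQ_MIN = -0.5
PESQ_MAX = 4.5


def pesq_scale_position(score: float,
                        lo: float = PESQ_MIN,
                        hi: float = PESQ_MAX) -> float:
    """Return the position of `score` within [lo, hi] as a fraction in [0, 1].

    Hypothetical helper for illustration; it does not compute PESQ itself.
    """
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside the range [{lo}, {hi}]")
    return (score - lo) / (hi - lo)


if __name__ == "__main__":
    reported = 1.26  # PESQ score reported in the abstract
    print(f"{pesq_scale_position(reported):.3f}")  # prints 0.352
```

Under these assumptions, the reported score lies roughly 35% of the way from the bottom to the top of the quoted range.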
"A VITS-Based Low-Resource Speech Synthesis System with Pitch Control", IJSDR - International Journal of Scientific Development and Research (www.IJSDR.org), ISSN: 2455-2631, Vol. 10, Issue 9, page no. a577-a580, September-2025, Available: https://ijsdr.org/papers/IJSDR2509069.pdf
Volume 10
Issue 9
September-2025
Pages: a577-a580
Paper Reg. ID: IJSDR_304920
Published Paper Id: IJSDR2509069
Research Area: Science and Technology
Country: Sylhet, Bangladesh
Publisher: IJSDR (IJ Publication) Janvi Wave