Creating Custom Voice Text-to-Speech with Kvoicewalk and Kokoro-Web

Introduction

Modern text-to-speech systems have evolved beyond generic robotic voices to enable personalized voice synthesis. This tutorial demonstrates how to create a custom TTS system using two powerful open-source tools: kvoicewalk for voice cloning and kokoro-web for local TTS deployment. Starting with just a 30-second voice recording, you can generate a custom voice model and integrate it into a fully functional local TTS service.

The workflow combines PyTorch-based voice cloning with ONNX-optimized inference, resulting in a system capable of generating speech in your own voice from any text input. This approach is particularly valuable for content creation, accessibility tools, personal assistants, and anyone seeking voice synthesis without relying on cloud services.

Prerequisites

Before beginning, ensure you have the following installed:

- Python 3.12 or higher with virtual environment support
- Node.js (for running kokoro-web)
- ffmpeg for audio processing (install via `winget install ffmpeg` on Windows)
- Recording software (Windows Sound Recorder or equivalent)
- Hardware: any modern CPU; no GPU required

Basic familiarity with command-line operations and Python package management is helpful but not required.

Part 1: Voice Cloning with Kvoicewalk

Recording Your Voice Sample

The quality of your final voice model depends heavily on your source recording. For this tutorial, a 30-second sample was recorded reading a passage with diverse phonemes:

“The old lighthouse keeper never imagined that one day he’d be guiding ships from the comfort of his living room, but with modern technology and the array of cameras, he did just that. Sipping tea while the storm raged outside, the gulls shrieked overhead. His weathered hands, once callous from climbing the tower stairs daily, now rested gently on the keyboard as he monitored cargo vessels and fishing boats navigating the treacherous coastal waters.”

Record in a quiet environment with consistent speaking pace and clear pronunciation. Save the file in any common format (M4A, WAV, MP3).

Setting Up Kvoicewalk

Clone the kvoicewalk repository and create a virtual environment:

```bash
cd kvoicewalk
python -m venv .venv_312
.venv_312\Scripts\activate
pip install -r requirements.txt
```

Converting Audio Format

Kvoicewalk requires WAV format at a 24kHz sample rate. If your recording is in M4A or another format, convert it with librosa, which delegates M4A decoding to ffmpeg:

```bash
python -c "import librosa; import soundfile as sf; y, sr = librosa.load('./in/my_voice.m4a', sr=24000, mono=True); sf.write('./in/my_voice.wav', y, 24000)"
```

This conversion addresses a common issue where the `soundfile` library cannot directly process M4A files, requiring ffmpeg-based transcoding through librosa.

Running Voice Cloning

Execute the transcription and voice cloning process:

```bash
python main.py --target_text "placeholder" --target_audio ./in/my_voice.wav --transcribe_start
```

The transcription uses OpenAI’s Whisper model to extract text from your audio, saving it to `./texts/my_voice.txt`. Next, run the voice cloning with iterative optimization:

```bash
python main.py --target_text ./texts/my_voice.txt --target_audio ./out/converted_audio/my_voice.wav --step_limit 200
```

Understanding Training Metrics

The voice cloning process generates multiple `.pt` (PyTorch tensor) files with quality metrics in their filenames. Here’s the progression from iterative training:

- First run (392 steps): Score 76.87, Similarity 0.68
- Second run (+200 steps): Score 79.99, Similarity 0.72
- Third run (+100 steps): Score 81.64, Similarity 0.74

Higher scores indicate better alignment with your target voice. The similarity metric (0.0-1.0) represents cosine similarity between voice embeddings. Each training iteration takes approximately 20-40 minutes depending on system performance.
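The similarity metric can be illustrated with a tiny stdlib-only sketch. The `cosine_similarity` helper below is ours for demonstration, not part of kvoicewalk, and it treats voice embeddings as plain lists of floats:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0,
# which is why a 0.74 similarity indicates a fairly close voice match.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0
```

Because the measure depends only on direction, not magnitude, a quieter recording of the same voice still scores high.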

Gender Alignment Considerations

An interesting discovery during training involved the model’s alignment with existing voice profiles. Training with all 57 base voices resulted in female-aligned outputs (closest to `af_heart` at 79.9% similarity), while training exclusively with 25 male voices produced male-aligned outputs (closest to `pm_santa` at 78.9% similarity). For more accurate results matching your actual voice characteristics, use the `–voice_folder` parameter to limit training to appropriate gender profiles:

```bash
python main.py --target_text ./texts/my_voice.txt --target_audio ./out/converted_audio/my_voice.wav --step_limit 200 --voice_folder "./voices_male"
```

Part 2: Local TTS with Kokoro-Web

Installing Kokoro-Web

Kokoro-web is an ONNX-based TTS system supporting 60+ voices across multiple languages. Clone the repository and install dependencies:

```bash
cd kokoro-web
npm install
```

The development server provides full TTS functionality with automatic model downloading and browser-based or API-based execution modes.

Converting Voice Format

Kvoicewalk outputs PyTorch `.pt` tensors, while kokoro-web requires Float32 binary `.bin` files. Create a conversion script:

```python
# scripts/convert-kvoicewalk.py
import sys

import numpy as np
import torch

pt_path = sys.argv[1]
bin_path = sys.argv[2]

voice = torch.load(pt_path, weights_only=True)
voice_np = voice.cpu().numpy().astype(np.float32)

with open(bin_path, 'wb') as f:
    f.write(voice_np.tobytes())

print(f"Converted {pt_path} to {bin_path}")
print(f"Shape: {voice_np.shape}, Size: {voice_np.nbytes} bytes")
```

Run the conversion:

```bash
python scripts/convert-kvoicewalk.py my_new_voice_91_81.64_0.74_my_voice.pt static/voices/my_voice.bin
```

Kokoro voice files must contain exactly 510 chunks of 256-dimensional Float32 embeddings (522,240 bytes total).
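That layout can be sanity-checked with pure arithmetic before wiring the file into kokoro-web. The constants and the `looks_like_kokoro_voice` helper below are illustrative, based on the figures above:

```python
# Expected layout of a kokoro-web voice file, per the figures above:
# 510 chunks x 256 float32 values x 4 bytes each = 522,240 bytes.
CHUNKS = 510
DIM = 256
FLOAT32_BYTES = 4
EXPECTED_BYTES = CHUNKS * DIM * FLOAT32_BYTES

def looks_like_kokoro_voice(data: bytes) -> bool:
    """Check that a candidate .bin payload has exactly the expected size."""
    return len(data) == EXPECTED_BYTES

print(EXPECTED_BYTES)  # 522240
print(looks_like_kokoro_voice(b"\x00" * EXPECTED_BYTES))  # True
print(looks_like_kokoro_voice(b"\x00" * 1024))            # False
```

A size mismatch usually means the source tensor had a different shape or was saved in a different dtype than float32.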

Integrating Custom Voice

Modify `src/lib/shared/resources/voices.ts` to register your custom voice:

```typescript
{
  id: "my_voice",
  name: "My Custom Voice",
  lang: englishUs,
  gender: genderMale,
  targetQuality: "A",
  overallGrade: "A",
}
```

Update `src/lib/shared/resources/index.ts` to load local voice files:

```typescript
const localPath = `/voices/${voice.id}.bin`;
try {
  const response = await fetch(localPath);
  if (response.ok) {
    console.log(`Using local voice file: ${localPath}`);
    return await response.arrayBuffer();
  }
} catch (e) {
  console.log(`Local voice not found, downloading from Hugging Face`);
}
```

Running the TTS Service

Start the development server:

```bash
npm run dev
```

Access the web interface at `http://localhost:5173`. Select “My Custom Voice” from the dropdown, enter text, and click generate. The first generation downloads the Kokoro-82M model (~86-326 MB depending on quantization) and caches it in the browser’s Cache API storage.

Troubleshooting Common Issues

M4A File Handling

If encountering errors with M4A files, ensure ffmpeg is properly installed and accessible in your system PATH. The `soundfile` library lacks native M4A support, requiring ffmpeg-based transcoding.
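A quick way to confirm that precondition from Python is a stdlib-only PATH check; the `ffmpeg_available` helper name is ours, not part of either project:

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is discoverable on PATH."""
    return shutil.which("ffmpeg") is not None

# Prints True when ffmpeg is installed and on PATH, False otherwise.
print(ffmpeg_available())
```

Running this before the conversion step saves a confusing traceback from deep inside librosa's audio backends.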

Windows Numba Crashes

Kvoicewalk may crash on Windows due to numba JIT compilation issues with `librosa.feature.chroma_stft`. The solution involves commenting out problematic feature extraction or disabling numba JIT:

```python
import os

# Must be set before librosa/numba are imported, or the JIT is already active.
os.environ['NUMBA_DISABLE_JIT'] = '1'
```

Voice File Validation

Kokoro-web requires voice files to be exact multiples of 4 bytes (Float32 size). Validate your converted file:

```javascript
const fs = require('fs');

const data = fs.readFileSync('my_voice.bin');
// Respect the Buffer's byteOffset: data.buffer may be a larger shared pool.
const float32Array = new Float32Array(data.buffer, data.byteOffset, data.length / 4);
console.log(`Valid: ${data.length % 4 === 0}`);
console.log(`Chunks: ${float32Array.length / 256}`);
```

Files containing NaN or Infinity values will cause 0kb generation errors. Ensure proper conversion from PyTorch tensors using `.cpu().numpy().astype(np.float32)`.
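The same scan can be done in Python with only the standard library; `find_bad_floats` below is an illustrative helper that unpacks a little-endian float32 buffer and reports the offending indices:

```python
import math
import struct

def find_bad_floats(data: bytes):
    """Return indices of NaN or Infinity values in a little-endian float32 buffer."""
    count = len(data) // 4
    values = struct.unpack(f"<{count}f", data[:count * 4])
    return [i for i, v in enumerate(values) if math.isnan(v) or math.isinf(v)]

# A healthy buffer reports nothing; injected NaN/Infinity values are flagged.
good = struct.pack("<4f", 0.1, -0.2, 0.3, 0.4)
bad = struct.pack("<4f", 0.1, float("nan"), 0.3, float("inf"))
print(find_bad_floats(good))  # []
print(find_bad_floats(bad))   # [1, 3]
```

An empty result means the file is at least numerically sound, even if the voice itself still needs listening tests.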

Conclusion

This workflow demonstrates how a 30-second voice recording can be transformed into a fully functional custom TTS system. The iterative training process with kvoicewalk reached a quality score of 81.64 with 0.74 similarity after 692 cumulative training steps (392 + 200 + 100), while gender-aligned training with male-only voices produced a 77.88 score optimized for masculine characteristics.

The resulting system provides local, privacy-preserving text-to-speech without cloud dependencies. Applications include content creation for videos and podcasts, accessibility tools for individuals with speech impairments, personalized virtual assistants, and educational materials with consistent narration.

Future improvements could involve longer training durations (1000-2000 steps), higher quality source recordings with diverse emotional ranges, and fine-tuning on specific vocabulary or speaking styles. The modular architecture allows easy experimentation with different voice profiles and quality optimization techniques.
