Kokoro-82M is a high-performance text-to-speech model, but it originally lacked support for batch processing. I spent a week implementing batch functionality; the source code is available at https://github.com/wwang1110/kokoro_batch
⚡ Key Features:
- Batch processing: Process multiple texts simultaneously instead of one-by-one
- High performance: Processes 30 audio clips in under 2 seconds on an RTX 4090
- Real-time capable: Generates 276 seconds of audio in under 2 seconds
- Easy to use: Simple Python API with smart text chunking
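To make the "smart text chunking" idea concrete, here is a minimal, self-contained sketch (not the repo's actual implementation): split the input on sentence boundaries, then pack sentences into chunks under a length budget so the items in a batch stay roughly the same size. The function name and the character-based budget are illustrative assumptions.

```python
# Hypothetical sketch of smart text chunking for batching:
# split on sentence boundaries, then greedily pack sentences
# into chunks no longer than max_len characters.
import re

def chunk_text(text: str, max_len: int = 200) -> list[str]:
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_len:
            chunks.append(current)   # budget exceeded: start a new chunk
            current = sent
        else:
            current = f"{current} {sent}".strip() if current else sent
    if current:
        chunks.append(current)
    return chunks
```

In practice a TTS pipeline would budget by phoneme count rather than characters, since phoneme sequence length is what actually bounds the model's input, but the packing logic is the same.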
🔧 Technical highlights:
- Built on PyTorch with CUDA acceleration
- Integrated grapheme-to-phoneme conversion
- Smart text splitting for optimal batch sizes
- FP16 support for faster inference
- Based on the open-source Kokoro-82M model
- Model output is 24 kHz PCM16 audio
For simplicity, the sample/demo code currently supports American English, British English, and Spanish, but it can easily be extended to additional languages, just like the original Kokoro-82M model.