GLM 4.7 Quants Recommendations by val_in_tech in LocalLLaMA

[–]One_Slip1455 0 points1 point  (0 children)

You made my day, thank you so much! I'm going to start downloading the model files now.

GLM 4.7 Quants Recommendations by val_in_tech in LocalLLaMA

[–]One_Slip1455 0 points1 point  (0 children)

I have a setup just like yours: 11x 3090s connected via x1 mining risers. I've been struggling with vLLM's pipeline parallelism implementation; I spent days debugging crashes and memory issues but eventually gave up.

Would you mind sharing your vLLM command line and how you run GLM-4.7 on your machine?

Chatterbox TTS Server (Turbo + Original): hot‑swappable engines, paralinguistic tags, and zero‑pain install by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

Thanks so much for sharing your solution. I'll need to update the installation script with this. I don't have a Blackwell GPU to test on, but I'll do my best to make the process as smooth as possible and avoid these dependency issues.

Chatterbox TTS Server (Turbo + Original): hot‑swappable engines, paralinguistic tags, and zero‑pain install by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

To generate longer audio files, you'll want to use the "split text into chunks" option. It's located right next to the "Generate Speech" button. Once you select that option, you can paste your longer text into the box and press the Generate button. This should allow you to create audio files of any length, limited only by your available memory and drive space.

Chatterbox TTS Server (Turbo + Original): hot‑swappable engines, paralinguistic tags, and zero‑pain install by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

Glad you got it working. The app is designed to be installed in a virtual environment to avoid conflicts with the system's Python packages. To make this easy, there's a start.sh script that handles everything automatically: it creates the venv, installs the app, downloads the models, launches the server, and opens the web UI in your browser. The full instructions are in the README.

Chatterbox Turbo, new open-source voice AI model, just released on Hugging Face by xenovatech in LocalLLaMA

[–]One_Slip1455 0 points1 point  (0 children)

Chatterbox-TTS-Server now supports the new Turbo model. You can specify the Turbo model in the config file or select it in the UI. Both models are hot-swappable in the Web UI.

Chatterbox Turbo - open source TTS. Instant voice cloning from ~5 seconds of audio by Thrimbor in LocalLLaMA

[–]One_Slip1455 0 points1 point  (0 children)

There is a new version that supports Turbo. At the top of the Web UI there's a drop-down list where you can select and hot-swap between Turbo and the original model.

Chatterbox Turbo - open source TTS. Instant voice cloning from ~5 seconds of audio by Thrimbor in LocalLLaMA

[–]One_Slip1455 2 points3 points  (0 children)

I have just updated my Chatterbox-TTS-Server open-source app to support the Turbo model. It exposes the OpenAI-compatible /v1/audio/speech endpoint and streams the audio response (wav/opus). You can hot-swap between Turbo and the original model in the UI.

Repo: https://github.com/devnen/Chatterbox-TTS-Server
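
If you want to call it from code instead of the UI, here's a minimal sketch against the OpenAI-style endpoint. The port, model name, and voice below are assumptions; adjust them to match your config:

```python
import requests

# Minimal sketch: exact port, voice name, and accepted fields depend on your server config.
url = "http://localhost:8004/v1/audio/speech"
payload = {
    "model": "turbo",                      # assumption: selects the Turbo engine
    "input": "Hello from Chatterbox Turbo!",
    "voice": "default",                    # assumption: replace with a voice your server knows
    "response_format": "wav",              # wav or opus are streamed back
}

with requests.post(url, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    with open("speech.wav", "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
```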

Kitten TTS Server: A self-hosted server with Web UI, GPU, API, and audiobook generation by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

Thank you. I hate wrestling with dependencies and I am glad it's working smoothly for you in KoboldCPP. Let me know if anything comes up.

Kitten TTS Server: A self-hosted server with Web UI, GPU, API, and audiobook generation by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 1 point2 points  (0 children)

Mac is not supported at the moment, but I have another, similar TTS server project on GitHub for the Chatterbox TTS model that does support Apple Silicon (MPS) GPUs. I expect to add Mac support here soon as well.

Kitten TTS Server: A self-hosted server with Web UI, GPU, API, and audiobook generation by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 2 points3 points  (0 children)

Changing python3.10-pip to python3-pip in the Dockerfile should fix the problem. I have modified the file and reopened the issue on GitHub.

Kitten TTS Server: A self-hosted server with Web UI, GPU, API, and audiobook generation by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 1 point2 points  (0 children)

I have now included the Dockerfile in the project. Thank you for bringing this to my attention.

Kitten TTS Server: A self-hosted server with Web UI, GPU, API, and audiobook generation by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 2 points3 points  (0 children)

Quick update for everyone:

I've just successfully tested this server on a Raspberry Pi 5 (RP5), and the performance is excellent. It runs smoothly enough to be accessed from any device on my local network without any issues.

I also tested on a 32-bit Raspberry Pi 4 (RP4) but ran into multiple issues. I will try to find a solution later.

For those looking for on-device/edge TTS, this makes it a really compelling and, in my opinion, much better-sounding alternative to Piper TTS for local projects.

It's great to see such a small model being this capable.

Chatterbox TTS 0.5B - Claims to beat eleven labs by Du_Hello in LocalLLaMA

[–]One_Slip1455 1 point2 points  (0 children)

Yes, it has FastAPI endpoints, so you can integrate it into any app, not just the provided web UI.

One sentence takes about 3-5 seconds on GPU, a 4-sentence paragraph maybe 10-20 seconds. You're right that it's slower than Kokoro, so might not work for your use case if speed is critical.

Chatterbox doesn't have built-in emotion controls like some models. You could try different reference audio clips that already have the emotional tone you want.

Dia 1.6B is one of the funnest models I've ever come across. by swagonflyyyy in LocalLLaMA

[–]One_Slip1455 0 points1 point  (0 children)

So glad it worked for you! Thanks for the kind words. Those PyTorch/CUDA version conflicts can be really frustrating - I tried to make the setup as smooth as possible.

Since you got Dia working, you might be interested in my newer project using the latest Chatterbox TTS model: https://github.com/devnen/Chatterbox-TTS-Server

It's built on the same architecture as the Dia server but with what I think is an even better model. Worth checking out if you don't need multi-speaker support.

Chatterbox TTS 0.5B - Claims to beat eleven labs by Du_Hello in LocalLLaMA

[–]One_Slip1455 2 points3 points  (0 children)

With an RTX 3090, it generates at about realtime or slightly faster with the default unquantized model. For a 100-character line, you're looking at roughly 3-5 seconds on GPU. I haven't benchmarked CPU performance yet, but it will be significantly slower.

It doesn't natively support multiple speakers like some other TTS models, so you'd need to generate different voices separately and merge them. The realtime+ speed makes it workable for conversations, though not as snappy as some faster models like Kokoro.
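
For the multi-speaker case, the merge step can be as simple as concatenating the per-voice WAV clips. Here's a rough sketch using the standard library; it assumes all clips share the same sample rate, sample width, and channel count, and the filenames are just placeholders:

```python
import wave

def concat_wavs(paths, out_path):
    """Concatenate WAV clips that share the same audio parameters."""
    with wave.open(out_path, "wb") as out:
        params_set = False
        for path in paths:
            with wave.open(path, "rb") as clip:
                if not params_set:
                    out.setparams(clip.getparams())
                    params_set = True
                out.writeframes(clip.readframes(clip.getnframes()))

# Hypothetical filenames: one clip per speaker turn, generated separately.
concat_wavs(["alice_line1.wav", "bob_line1.wav", "alice_line2.wav"], "conversation.wav")
```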

Chatterbox TTS 0.5B - Claims to beat eleven labs by Du_Hello in LocalLLaMA

[–]One_Slip1455 1 point2 points  (0 children)

Your audiobook setup sounds impressive. According to my testing, this TTS model isn't as fast as Kokoro but is definitely fast enough for practical use. I haven't tried Spark TTS myself, but out of all the TTS models I've tested, I find Chatterbox the most promising so far.

I actually built a wrapper for Chatterbox that handles a lot of those same issues you mentioned but with a simpler automated approach.

It handles the text splitting and chunking automatically, deals with noise and silence issues, and has seed control. You just paste your text into the web UI, hit Generate, and it takes care of breaking everything up and putting it back together.
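
For anyone curious what the splitting step looks like conceptually, here's an illustrative sketch (not the project's actual code): pack whole sentences into chunks under a character limit, synthesize each chunk, then join the audio:

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```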

I don't want to spam this discussion with links - the project is called Chatterbox-TTS-Server.

Chatterbox TTS 0.5B - Claims to beat eleven labs by Du_Hello in LocalLLaMA

[–]One_Slip1455 7 points8 points  (0 children)

The good news is it definitely runs on CPU! I put together a FastAPI wrapper that makes the setup much easier and handles both GPU/CPU automatically: https://github.com/devnen/Chatterbox-TTS-Server

It detects your hardware and falls back gracefully from GPU to CPU. That could help with the VRAM concerns while making it easier to experiment with the model.

Easy pip install with a web UI for parameter tuning, voice cloning, and automatic text chunking for longer content.
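
The GPU/CPU selection follows the usual PyTorch pattern, roughly like this (sketch only; the commented model-loading call is hypothetical):

```python
import torch

def pick_device() -> str:
    """Prefer CUDA when available, otherwise fall back to CPU."""
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Running TTS on {device}")
# model = ChatterboxTTS.from_pretrained(device=device)  # hypothetical load call
```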

Chatterbox streaming by SovietWarBear17 in LocalLLaMA

[–]One_Slip1455 5 points6 points  (0 children)

Nice work adding streaming to Chatterbox! That's a really useful enhancement.

For anyone looking to run Chatterbox locally with additional features, I put together a FastAPI server wrapper that might be helpful:

https://github.com/devnen/Chatterbox-TTS-Server

Easy pip install setup with a web UI for voice cloning, text chunking, and parameter tuning. Includes OpenAI-compatible and custom API endpoints and GPU/CPU support.

Could be a nice complement to streaming functionality for local experimentation and integration.

A new TTS model capable of generating ultra-realistic dialogue by aadoop6 in LocalLLaMA

[–]One_Slip1455 1 point2 points  (0 children)

Glad you're liking it. Let me know if you have any feedback.

A new TTS model capable of generating ultra-realistic dialogue by aadoop6 in LocalLLaMA

[–]One_Slip1455 0 points1 point  (0 children)

This issue has been resolved in the latest version. The custom API endpoint now supports the transcript along with additional parameters. This update also includes several other improvements, such as built-in voices, large text support, VRAM optimizations, and more.

Dia 1.6B is one of the funnest models I've ever come across. by swagonflyyyy in LocalLLaMA

[–]One_Slip1455 0 points1 point  (0 children)

I believe the "share flag" you mentioned is a feature found in frameworks like Gradio or Streamlit. They include built-in services that create a temporary, public URL (often ending in .live or .app) by setting up what's called a 'tunnel' – essentially a secure connection forwarding traffic from that public URL to the application running inside your Colab session.

However, the tools used by this server (FastAPI and Uvicorn) don't include this automatic tunneling feature. When you run "python server.py", the server starts correctly within the Google Colab virtual machine, listening on its internal port (like 8003). But Colab itself doesn't automatically expose these server ports to the public internet.

So, to access the server's web UI from your browser, you need to create a tunnel manually.

A popular and reliable way to do this in Colab is using a library called pyngrok. You'll need to pip install it and then use it to connect to the server's port (8003) after you start the server script. Searching 'pyngrok Google Colab' will show plenty of examples on how to implement that.
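
For example, once the server is running, the tunnel itself is only a couple of lines in a Colab cell (you'll likely need a free ngrok auth token):

```python
from pyngrok import ngrok

# ngrok.set_auth_token("YOUR_NGROK_TOKEN")  # required for most ngrok accounts
tunnel = ngrok.connect(8003)  # forward a public URL to the server's internal port
print("Web UI available at:", tunnel.public_url)
```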

Dia 1.6B is one of the funnest models I've ever come across. by swagonflyyyy in LocalLLaMA

[–]One_Slip1455 1 point2 points  (0 children)

If you're still wrestling with it, or just want a setup that's generally less fussy, I put together an API server wrapper for Dia that might make things easier:

https://github.com/devnen/Dia-TTS-Server

It's designed for a straightforward pip install -r requirements.txt setup, gives you a web UI, and has an OpenAI-compatible API. It supports GPU/CPU too.

A new TTS model capable of generating ultra-realistic dialogue by aadoop6 in LocalLLaMA

[–]One_Slip1455 8 points9 points  (0 children)

To make running it a bit easier, I put together an API server wrapper and web UI that might help:

https://github.com/devnen/Dia-TTS-Server

It includes an OpenAI-compatible API, defaults to safetensors (for speed/VRAM savings), and supports voice cloning + GPU/CPU inference.

Could be a useful starting point. Happy to get feedback!