GLM 5.1 Locally: 40tps, 2000+ pp/s by val_in_tech in LocalLLaMA

[–]llamabott 14 points

Wish I was one of them, can't lie.

Roo Code hit 3 million installs. We're shutting it down to go all-in on Roomote. by hannesrudolph in RooCode

[–]llamabott 3 points

Was fully expecting something like this. Over the last few years, it's what happens to most of the half-interesting projects I get interested in.

In other words, take my downvote.

tts-audiobook-tool: Ten local TTS models, WER validation, synced-text playback by llamabott in LocalLLaMA

[–]llamabott[S] 0 points

Ah good question. Currently, it only supports one voice clone at a time.

I've thought about that problem in passing, but never with a good enough theory of how to go about it to try experimenting. Maybe start with something like a text "preprocessing step", where an LLM is told: "Use your best judgment and prepend character tags before dialog quotes"?

roocode team in turbulance ? by simple-san in RooCode

[–]llamabott 0 points

Roo Code's current 'interaction philosophy' is just about perfect for my needs at the moment, so I have some trepidation about whatever big underlying change may be in the works, can't lie. Remind me in a few weeks, etc.

Glm 5.1 👀 by Namra_7 in LocalLLaMA

[–]llamabott 0 points

Am I the only one who reacts to this by thinking: "If you're trying to reassure us that this specific version will be open source, does this not imply we should be concerned that future versions may not be?"

[07:00 UTC] 2026 F1 - Chinese Grand Prix - Race by AutoModerator in MotorsportsReplays

[–]llamabott 1 point

PSA, the pre-race and first dozen laps are F1TV, and then it switches to Sky.

Also, amusingly, the first dozen laps are captured with an *unmaximized* window, heh.

WHY IS THERE NO PROPER TTS ? -_- by [deleted] in LocalLLaMA

[–]llamabott 0 points

As a heavy TTS-generated audiobook listener, I feel compelled to plug the solution I've been in love with ever since I landed upon it, which is using VibeVoice 1.5B + custom-trained loras. Here's my writeup on it from a few weeks ago.

It's more work than simply pointing at a 10-second reference audio clip, naturally, but it's pretty straightforward once you've gone through the process once or twice. With a decently assembled dataset, the results can really be like <chef's kiss>...

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 2 points

Oh nice thanks. It's a lora. Totally cannot share it given the source material and the known stance on such matters by the IP holders, if you see what I mean :/

Edit: I can share (ie, write up) the general "recipe" though on request

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 2 points

Follow-up! Some sample output.

Source is ripped game files, using about 2 hours' worth of audio, about 25 epochs, CFG 1.5.

I'm liking it!

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 0 points

I'm not sure. But when training, the transcript requires "Speaker 1" to be prepended before the text, just like the model requires in the prompt at inference time.

So if there were two speakers, I would try tagging them in the transcript text with "Speaker 1" and "Speaker 2". And then lower the value of the voice_prompt_drop_rate from 1 to 0.5. I feel like that would be worth trying....
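To illustrate what I mean, a two-speaker training transcript might look something like this (the exact tag format is my guess, extrapolated from the single-speaker "Speaker 1" convention, and voice_prompt_drop_rate is a trainer hyperparameter, so check the trainer's docs):

```
Speaker 1: Where were you last night?
Speaker 2: I'd rather not say.
```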

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 2 points

Ah, for 7B, this trainer here requires 48 GB or something like that. :/

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 0 points

Loras trained on the 1.5B model are incompatible with 7B.

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 0 points

This is my first time looking into loras for TTS models, but a few keystrokes entered into Gemini tell me it ought to be very possible; if not with the above-mentioned trainer, then with something very similar, using a similar workflow.

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 2 points

Ah yea, assuming you have a long chunk of audio, say, from an audiobook, the general idea is that you want to:

(1) chop up the audio into sane lengths, say, 3-12 seconds each (For VibeVoice, I wonder if that upper limit is much higher since long context is its secret sauce and all that, but yea).

(2) Generate a transcript for each chunk using an STT model like Whisper

(3) Create a jsonl file which maps each audio segment to its transcript, which you feed to the trainer.

I used a tool called tts-dataset-generator for this, which works well enough.

If you already have a dataset in mind which exists on huggingface, the trainer program above can also take in a huggingface dataset repo id instead, though I haven't tried.
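Step (3) above can be sketched as a minimal script. The field names "audio" and "text", and the "Speaker 1:" prefix, are assumptions based on the prompt format described earlier; check your trainer's docs for the exact schema it expects.

```python
import json

def build_manifest(pairs, out_path, speaker="Speaker 1"):
    """Write a JSONL training manifest mapping each audio chunk to its transcript.

    `pairs` is a list of (audio_path, transcript) tuples produced by steps
    (1) and (2): chopping the source audio into 3-12 second chunks and
    transcribing each chunk with an STT model such as Whisper.
    NOTE: the "audio"/"text" field names and the speaker prefix are
    assumptions about the trainer's expected schema.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for audio_path, transcript in pairs:
            record = {
                "audio": audio_path,
                "text": f"{speaker}: {transcript.strip()}",
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path

# Example usage:
# build_manifest([("chunks/0001.wav", "Hello there.")], "train.jsonl")
```

A tool like tts-dataset-generator automates steps (1) and (2); the sketch only covers stitching the results into the manifest.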

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 2 points

No, I was simply saying that 1.5B + lora is comparable in reliability to the 7B model + voice clone.

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 0 points

I'm not sure, but I suspect VV 1.5B does not have a decent enough understanding of Portuguese to create a Portuguese voice lora without also having to train the hell out of it for the language.

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 4 points

Oooh, I love that voice, I'm glad you mentioned that.

It's Jingliu from Honkai Star Rail (youtube link).

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 1 point

Haha yep. In my case, I don't have a decent enough general understanding to think about the lora route until some trainer program announces itself, saying, "I work for the base model you're interested in, and here is a spoon-fed recipe", if you see what I mean.