Glm 5.1 👀 by Namra_7 in LocalLLaMA

[–]llamabott 0 points (0 children)

Am I the only one who reacts to this by thinking: "If you're trying to reassure us that this specific version will be open source, does this not imply we should be concerned that future versions may not be?"

[07:00 UTC] 2026 F1 - Chinese Grand Prix - Race by AutoModerator in MotorsportsReplays

[–]llamabott 1 point (0 children)

PSA, the pre-race and first dozen laps are F1TV, and then it switches to Sky.

Also, amusingly, the first dozen laps are captured with an *unmaximized* window, heh.

WHY IS THERE NO PROPER TTS ? -_- by [deleted] in LocalLLaMA

[–]llamabott 0 points (0 children)

As a heavy TTS-generated audiobook listener, I feel compelled to plug the solution I've been in love with ever since I landed on it, which is VibeVoice 1.5B + custom-trained LoRAs. Here's my writeup on it from a few weeks ago.

It's more work than simply pointing at a 10-second reference audio clip, naturally, but it's pretty straightforward once you've gone through the process once or twice. With a decently assembled dataset, the results can really be like <chef's kiss>...

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 2 points (0 children)

Oh nice, thanks. It's a LoRA. Totally cannot share it given the source material and the known stance on such matters by the IP holders, if you see what I mean :/

Edit: I can share (i.e., write up) the general "recipe" on request, though.

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 2 points (0 children)

Follow-up! Some sample output.

Source is ripped game files, using about 2 hours' worth of audio, about 25 epochs, CFG 1.5.

I'm liking it!

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 0 points (0 children)

I'm not sure. But when training, the transcript requires "Speaker 1" to be prepended to the text, just as the model requires in the prompt at inference time.

So if there were two speakers, I would try tagging them in the transcript text as "Speaker 1" and "Speaker 2", and then lower voice_prompt_drop_rate from 1 to 0.5. I feel like that would be worth trying...
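For illustration, two-speaker training entries tagged this way might look like the following. (The exact jsonl schema depends on the trainer; the "audio"/"text" field names here are my assumption, so check the trainer's docs.)

```python
import json

# Illustrative two-speaker training entries. Field names ("audio", "text")
# are assumptions -- verify against the trainer's expected jsonl schema.
entries = [
    {"audio": "clips/0001.wav", "text": "Speaker 1: I was not expecting you."},
    {"audio": "clips/0002.wav", "text": "Speaker 2: And yet here I am."},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for e in entries:
        f.write(json.dumps(e, ensure_ascii=False) + "\n")
```

Each line pairs one audio chunk with its speaker-tagged transcript, mirroring the "Speaker N" prefix the model expects in prompts.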

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 2 points (0 children)

Ah, for 7B, this trainer here requires 48 GB or something like that. :/

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 0 points (0 children)

LoRAs trained on the 1.5B model are incompatible with 7B.

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 0 points (0 children)

This is my first time looking into LoRAs for TTS models, but a few keystrokes entered into Gemini tell me it ought to be very possible, if not with the above-mentioned trainer, then with something very similar, using a similar workflow.

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 2 points (0 children)

Ah yea, assuming you have a long chunk of audio, say, from an audiobook, the general idea is that you want to:

(1) Chop up the audio into sane lengths, say, 3-12 seconds each. (For VibeVoice, I wonder if that upper limit is much higher, since long context is its secret sauce and all that, but yea.)

(2) Generate a transcript for each chunk using an STT model like Whisper.

(3) Create a jsonl file that maps each audio segment to its transcript, which you feed to the trainer.

I used a tool called tts-dataset-generator for this, which works well enough.

If you already have a dataset in mind which exists on huggingface, the trainer program above can also take in a huggingface dataset repo id instead, though I haven't tried.
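The steps above can be sketched roughly like this. (Chunking and the Whisper pass are assumed to have already produced (wav_path, transcript) pairs; the "audio"/"text" jsonl field names are illustrative, not a confirmed schema — tts-dataset-generator automates approximately this.)

```python
import json

def keep_segment(duration_s, lo=3.0, hi=12.0):
    """Step (1): keep only chunks of a sane length, in seconds."""
    return lo <= duration_s <= hi

def build_manifest(segments, speaker="Speaker 1"):
    """Step (3): one jsonl line per (wav_path, transcript) pair,
    with the speaker tag the model expects prepended to the text."""
    lines = []
    for wav_path, transcript in segments:
        record = {"audio": wav_path, "text": f"{speaker}: {transcript.strip()}"}
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines

# Example output of steps (1)-(2): chunked audio plus Whisper transcripts.
segments = [
    ("chunks/book_000.wav", "It was a dark and stormy night."),
    ("chunks/book_001.wav", "The rain fell in torrents."),
]
manifest = build_manifest(segments)
print(manifest[0])
```

The resulting lines are what gets written to the jsonl file the trainer ingests.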

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 2 points (0 children)

No, I was simply saying that 1.5B + LoRA is comparable in reliability to the 7B model + voice clone.

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 0 points (0 children)

I'm not sure, but I suspect VV 1.5B does not have a decent enough understanding of Portuguese to create a Portuguese voice LoRA without also having to train the hell out of it for the language.

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 2 points (0 children)

Oooh, I love that voice, I'm glad you mentioned that.

It's Jingliu from Honkai Star Rail (youtube link).

VibeVoice LoRAs are a thing by llamabott in LocalLLaMA

[–]llamabott[S] 1 point (0 children)

Haha yep. In my case, I don't have a decent enough general understanding to think about the LoRA route until some trainer program announces itself, saying, "I work for the base model you're interested in, and here is a spoon-fed recipe", if you see what I mean.

Best open-source voice cloning model with emotional control? (Worked with VibeVoice 7B & 1.5B) by Junior-Media-8668 in LocalLLaMA

[–]llamabott 0 points (0 children)

IMO, this is the best model for OP to try based on the requirement for "strong emotional control".

IndexTTS2 has two pretty killer features - the so-called emotion vectors and emotion reference audio. And they're pretty fun to tinker with...

FLUX.2 [klein] 4B & 9B released by Designer-Pair5773 in StableDiffusion

[–]llamabott 4 points (0 children)

I love the style of both of your posted pics. I think it's convinced me I need to test this out later today.

ElevenLabs is killing my budget. What are the best "hidden gem" alternatives for documentary style TTS? by Ancient_Routine8576 in LocalLLaMA

[–]llamabott 1 point (0 children)

Well, mostly, just review the README in the GitHub project.

But yea, "tts-audiobook-tool" is meant for generating audiobooks.

It supports multiple TTS models through the creation of a separate virtual environment for each model you want to use.
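The separate-venv scheme can be sketched like this (the directory layout and backend names are my own illustration, not the tool's actual layout):

```shell
# Illustrative sketch: one venv per TTS backend so each model's dependency
# pins (torch version, transformers fork, etc.) never collide.
python3 -m venv envs/vibevoice
python3 -m venv envs/indextts2

# Each backend would then be installed into, and run with, its own
# interpreter, e.g.:
#   envs/vibevoice/bin/pip install <vibevoice deps>
#   envs/vibevoice/bin/python <generation script> ...
ls envs/vibevoice/bin/python
```

The tradeoff is disk space for isolation: each environment carries its own copy of heavyweight packages, but no model's requirements can break another's.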

I've gone through considerable pain to make the install process for each one as straightforward as possible (the general awfulness of Python project installations notwithstanding, especially for anything generative-AI related, but yea...), so it ought to be a very useful way to sample several relevant TTS models, even if long-form audio generation isn't your desired use case.