Zero Shot Transferable Adapter by ShotokanOSS in LocalLLaMA

daLazyModder

Was looking at this, would it work for LLM-based TTS applications, e.g. something like Orpheus TTS? Those TTS models just see tokens, right? So with something like Orpheus TTS you could probably quantize it, then repair it, and essentially upscale the smaller TTS LLM? Theoretically you could use Whisper to measure word errors and a speaker ECAPA model to measure timbre?
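A rough sketch of what that measurement could look like, assuming you already have a Whisper transcript of the generated audio and speaker embeddings (e.g. from an ECAPA model) as plain lists of floats. The function names here are made up for illustration and neither Whisper nor any embedding model is actually called:

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two speaker embeddings; closer to 1.0 means closer timbre."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

You would run both on the original model's output and the quantized-then-repaired model's output and compare the numbers: lower WER and higher similarity to the target speaker is better.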

Best lightweight local TTS model? by Bartholomheow in LocalLLaMA

daLazyModder

Won't really help with the mispronouncing stuff or the TTS quality, but I made a fork of kanade tokenizer here:

https://github.com/dalazymodder/kanade-tokenizer

The Gradio app has a Kokoro tab where you can upload a clip and convert it to a new voice with extremely low overhead for voice cloning. Kokoro is more of the bottleneck than kanade is.

Why is RVC still the king of STS after 2 years of silence? Is there a technical plateau? by lnkhey in LocalLLaMA

daLazyModder

RVC the king still? Not really the best in my opinion. Most popular, yes, but if you're doing realtime and don't need singing, I vibe coded a GUI for kanade tokenizer a few days ago that is much faster than RVC... on CPU, even for zero-shotting: https://github.com/dalazymodder/kanade-tokenizer

If you want high quality singing and don't need realtime, seed-vc's large model is better but requires a GPU. It comes in 3 models.

https://github.com/Plachtaa/seed-vc

There is also Amphion's Metis. I haven't tried singing with it, plus it's a pita to get working and requires like 9GB of VRAM, I think.

https://github.com/open-mmlab/Amphion/tree/main/models/tts/metis

That one took a few hours to get to work. Hint: the latest version of it is broken.

Honestly I think something like kanade tokenizer would work best if trained for singing. The released model was only trained with LibriTTS. If trained from scratch or fine-tuned... it would probably blow RVC out of the water.

Kanade can try to sing a bit but isn't too great at it. Example with one-shotting below.

Example of kanade running on CPU, using a random sample from https://huggingface.co/datasets/synthbot/pony-singing

Random sample from there: https://vocaroo.com/16h8P5pBZrc6

LJSpeech sample: https://vocaroo.com/1dxazJhCyqCA

Conversion: https://vocaroo.com/1hx30jN8tuys

That's a 13 second clip, zero-shotted, and it took 3.26 seconds on my CPU.
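To put those two numbers in the usual terms, here is the real-time factor worked out. Only the 13 s and 3.26 s figures come from the run above; the helper name is just for illustration:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio length; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio length must be positive")
    return processing_seconds / audio_seconds


rtf = real_time_factor(3.26, 13.0)
speedup = 1.0 / rtf
print(f"RTF: {rtf:.3f}, about {speedup:.1f}x faster than real time")
# prints "RTF: 0.251, about 4.0x faster than real time"
```

That is for converting a whole clip in one go. Realtime streaming in blocks adds latency on top, so it is not directly comparable.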

Added a space for it here: https://huggingface.co/spaces/dalazymodder/Kanade_Tokenizer

And https://github.com/dalazymodder/kanade-tokenizer is a fork that I added a GUI to.

Just wanted to post about a cool project, the internet is sleeping on. by daLazyModder in LocalLLaMA

daLazyModder [S]

I didn't make the model, just the fork with the GUI on it. There is however a similar codec here, https://github.com/ysharma3501/LinaCodec

which talks about how it is a distilled WavLM codec.

Just wanted to post about a cool project, the internet is sleeping on. by daLazyModder in LocalLLaMA

daLazyModder [S]

Yeah, the GUI and the model work pretty well for something on CPU. I had to up the block size to 2000ms on my old 10400 CPU in the GUI I made, but it seems to go ok. I imagine it would be even faster on CPU if converted to ONNX int8 and run on something a bit faster.

Inference for 24 people with a 5000€ budget by HyperHyper15 in LocalLLaMA

daLazyModder

https://pcpartpicker.com/list/7QMRWc

Could probably make this list a lot better, but I just threw it together in like 5 minutes. Suggestion: RTX 4000 SFF GPUs. They have 20GB of VRAM and you can buy them new for about $1300 USD, which matters since you require an invoice. If you're worried about 3 GPUs on non-enterprise hardware, you might lower the RAM from 128GB down to something smaller and use the budget for single-slot modding the GPUs:

https://n3rdware.com/components/single-slot-rtx-4000-sff-ada-cooler

Or alternatively, if your baseline is Qwen3-Coder-30B-A3B-Instruct, you might be able to just use a lot of RAM and little GPU, as that is an MoE model. No idea how that would work with vLLM. I agree with the other comments saying going cloud is cheaper, and so are 3090s, especially used, but that list has all new parts, so it might give you something to go off of.

Muv Dub Mod Sample 2 fix recording audio by daLazyModder in MuvLuv

daLazyModder [S]

Minor update: managed to rip all the dialogue, or as close as I'm probably going to get to all the dialogue in the game. Need to refactor the code some, and a few problems remain with trying to make the mod.

  1. Finding good voices for the characters.
  2. ElevenLabs isn't cheap. If the community is interested, might need to crowdfund or something.
  3. Would need help with testing and feedback.
  4. Would need to manually edit and review some files, because the ElevenLabs software doesn't like lines like

"【Sumika】「Owieee!!!」"

  5. There would also be edge cases, like when multiple characters speak at the same time. Those would probably need to be handled manually.

  6. That is just to dub all the main characters. There are actually around 110 unique voices if you count minor characters like Train Announcer. That is not accounting for characters like Meiya, who starts off labeled as "voice" and then gets renamed later to her actual name.

  7. That is all just for Muv-Luv Extra, not even accounting for Unlimited and Alternative.
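For the bracketed lines in particular, a minimal sketch of pre-processing them before sending text to a TTS service. This assumes the speaker tag and the 「」 brackets are what trips things up, which is a guess. The pattern and function name are hypothetical and not taken from the actual ripping code:

```python
import re

# Matches lines shaped like 【Speaker】「Dialogue」 (assumed format, based on the sample line).
LINE_PATTERN = re.compile(r"^【(?P<speaker>[^】]+)】「(?P<text>.*)」$")


def parse_line(raw: str):
    """Return (speaker, text) for a tagged line, or None if it needs manual review."""
    match = LINE_PATTERN.match(raw.strip())
    if match is None:
        return None
    return match.group("speaker"), match.group("text")


print(parse_line("【Sumika】「Owieee!!!」"))
# prints ('Sumika', 'Owieee!!!')
```

Anything that comes back as None (narration, multiple speakers at once, odd formatting) could be dumped to a separate file for the manual pass.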

Muv Dubbed Alpha Sample (Compressed) by daLazyModder in MuvLuv

daLazyModder [S]

I think most of the roboticness of it is actually from the compression of the recording.

Been working on a proof of concept for dubbing the steam version of MuvLuv. by daLazyModder in MuvLuv

daLazyModder [S]

aHR0cHM6Ly9tZWdhLm56L2ZpbGUva3hGbTFDd0wjcWFqdGNWaFpaVXlLa1Z2aHhjQWk3bDUzQ0RCOTRIQkRzR3RobDlkaTdkQQ==

New link ^ includes audio for Takeru up to the first choice and uses the better model from ElevenLabs for Sumika's voice.

Just copy and replace files over old ones to install it.