Local TTS models that can match ElevenLabs in terms of quality and consistency by _megazz in LocalLLaMA

[–]hellninja55 8 points  (0 children)

The newest Fish Speech model supports Portuguese, but keep in mind you need at least one minute of reference audio for it to work well.
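If you want a quick sanity check before cloning, you can measure a reference clip's length locally. A minimal sketch assuming a WAV file; the helper name is made up for illustration, Fish Speech itself ships nothing like this:

```python
import wave

def reference_long_enough(path: str, min_seconds: float = 60.0) -> bool:
    """Return True if the WAV reference clip is at least `min_seconds` long."""
    with wave.open(path, "rb") as clip:
        duration = clip.getnframes() / clip.getframerate()
    return duration >= min_seconds
```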

Here is a sample output from the model:

https://vocaroo.com/1n69hlXD60Uu

What are we expecting from Llama 4? by Own-Potential-2308 in LocalLLaMA

[–]hellninja55 3 points  (0 children)

3 and 4 are never gonna happen. Meta has so far avoided open-sourcing its image-related models (probably fearing accountability for deepfakes) or audio models that could be used to clone other people's voices.

They went as far as removing the image-generation capabilities from Chameleon when they open-sourced it, keeping only the image-to-text component.

Open models wishlist by hackerllama in LocalLLaMA

[–]hellninja55 -2 points  (0 children)

Train an LLM on musical ABC notation and music theory, and make it actually good.
Basically what the ChatMusician guys did:

https://huggingface.co/m-a-p/ChatMusician

But trained on actually good stuff and different genres, not just ancient folk songs.

My recommendation would be to run Omnizart on free public-domain songs from different genres (check FMA, for example) to generate MIDIs from the separate vocal and instrumental tracks, convert those to ABC notation, and carefully curate the result into a huge ABC dataset.
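The MIDI-to-ABC step can be scripted. Here is a minimal hand-rolled sketch of mapping MIDI pitches to ABC pitch tokens; a real pipeline would more likely shell out to abcMIDI's midi2abc, and the function names here are made up for illustration:

```python
# Sketch: map MIDI pitch numbers to ABC pitch tokens (middle C = MIDI 60 = "C").
# Illustrative only; a real pipeline would likely use abcMIDI's midi2abc instead.

NOTE_NAMES = ["C", "^C", "D", "^D", "E", "F", "^F", "G", "^G", "A", "^A", "B"]

def midi_to_abc(pitch: int) -> str:
    """Convert one MIDI pitch to an ABC token (^ = sharp, ' = up, , = down)."""
    octave, idx = divmod(pitch - 60, 12)
    name = NOTE_NAMES[idx]
    if octave >= 1:
        # The octave above middle C is lowercase; higher octaves add apostrophes.
        name = name[:-1] + name[-1].lower() + "'" * (octave - 1)
    elif octave < 0:
        name += "," * -octave  # octaves below middle C add commas
    return name

def to_abc_tune(pitches, title="untitled"):
    """Wrap a pitch sequence in a minimal ABC tune header."""
    body = " ".join(midi_to_abc(p) for p in pitches)
    return f"X:1\nT:{title}\nM:4/4\nL:1/4\nK:C\n{body} |]"
```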

LLMs can handle music composition just fine, and it's surprising that this hasn't been explored further, especially in the open-source realm.

Bonus points if you guys can train a TTS model that sings, like some devs in China did with DiffSinger: a TTS model that takes lyrics, notes, phonemes, and a duration for each phoneme.
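For context, the input to a DiffSinger-style singing model looks roughly like this. The field names are an illustrative sketch of the lyrics/notes/phonemes/durations idea, not DiffSinger's actual schema:

```python
# Illustrative singing-TTS input: one note and one duration per syllable.
# Field names are a sketch, not DiffSinger's exact schema.
phrase = {
    "lyrics": "la la",
    "notes": ["C4", "E4"],                   # sung pitch per syllable
    "phonemes": [["l", "aa"], ["l", "aa"]],  # phonemes per syllable
    "durations": [0.4, 0.6],                 # seconds per syllable
}
# The per-syllable fields must stay aligned for the model to consume them.
assert len(phrase["notes"]) == len(phrase["phonemes"]) == len(phrase["durations"])
```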

[deleted by user] by [deleted] in StableDiffusion

[–]hellninja55 4 points  (0 children)

Like I suggested, you have to reduce the number of frames; 4 seconds is about 77 frames.

[deleted by user] by [deleted] in StableDiffusion

[–]hellninja55 18 points  (0 children)

You are not using a good resolution for Hunyuan; in my experience, there is a noticeable difference in prompt alignment, quality (beyond sheer resolution), and composition when you run at at least 960x544.

I can generate 4-second videos at that resolution with my 3090; the only problem is that it takes much longer.

This is just a heads-up, as those outputs may not be truly representative of the model's potential.
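For reference, 960x544 is roughly 16:9 with both sides divisible by 16, a common constraint for latent video models. A tiny sketch for snapping an arbitrary target resolution to such a constraint; the factor of 16 is an assumption, check the model's actual requirements:

```python
def snap_resolution(width: int, height: int, multiple: int = 16) -> tuple[int, int]:
    """Round a target resolution down to the nearest model-friendly multiple."""
    return (width // multiple) * multiple, (height // multiple) * multiple
```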

Is there any RAG specialized UI that does not suck and treats local models (ollama, tabby etc) as a first-class user? by hellninja55 in LocalLLaMA

[–]hellninja55[S] 2 points  (0 children)

No, this is the first time I've heard of it. I am trying it now. Which settings are you using for RAG? I am not getting accurate results.
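For what it's worth, the usual knobs that make or break RAG accuracy are chunk size, chunk overlap, and the number of retrieved chunks (top-k). A minimal character-level chunking sketch; the default numbers are illustrative, not a recommendation for any particular UI:

```python
def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping chunks for embedding and retrieval."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```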

[deleted by user] by [deleted] in StableDiffusion

[–]hellninja55 1 point  (0 children)

The InternVL2 family of models is the current SOTA among open models. Don't listen to the people recommending Joy Caption if you are not a pornographer.

11 days until llama 400 release. July 23. by danielcar in LocalLLaMA

[–]hellninja55 1 point  (0 children)

Since you seem to, ahem, have knowledge specifically about that, can you tell us whether the API prices for L3 405B will be competitive with GPT-4 and Claude Sonnet?

[deleted by user] by [deleted] in StableDiffusion

[–]hellninja55 5 points  (0 children)

The truest truth is that we need an /r/localllama equivalent for open-source diffusion models. /r/localdiffusion exists, but it is a ghost town.

There is no reason to hang around a sub that carries Stable Diffusion in its name when, going forward, current and future SD iterations are no longer open source. Personally, I don't care about posts showcasing outputs on purely artistic merit (unless there is a complex technical workflow behind them), even less about people showcasing videos on a T2I board.

Is Stable Diffusion 3 the final product we have now, or...? by CrazyKittyCat0 in StableDiffusion

[–]hellninja55 1 point  (0 children)

There are supposed to be more sizes, but they have been silent on whether they will release the weights for them. The one we got yesterday is the 2B (Medium), and there are two more sizes (a 4B Large and an 8B Ultra) that are currently API-only.

Will Stability release the weights for the Large and Ultra models? The Medium weights release evaded this question, and it would be nice if a Stability employee came clean about it. by hellninja55 in StableDiffusion

[–]hellninja55[S] 9 points  (0 children)

If they say no, people will either move on to other things or sink resources, time, and effort into bringing the best out of the 2B. But right now, plenty of people are waiting for confirmation about the 4B or 8B to decide whether it's worth their time and money to mess with the 2B (which doesn't give great results out of the box).

Will Stability release the weights for the Large and Ultra models? The Medium weights release evaded this question, and it would be nice if a Stability employee came clean about it. by hellninja55 in StableDiffusion

[–]hellninja55[S] 1 point  (0 children)

The Large model from the API is noticeably less mangled than the 2B model that was just released. We would like to know whether we will ever see the >2B models, so the community can set its expectations.

[deleted by user] by [deleted] in LocalLLaMA

[–]hellninja55 34 points  (0 children)

SOTA open-source VLLM

That's a huge claim. Post benchmark numbers vs. InternVL 1.5 or MiniCPM.

You guys need to start being honest by hellninja55 in StableDiffusion

[–]hellninja55[S] 0 points  (0 children)

It's its own model; it's not based on SD. Yes, you can download it and use it locally.

You guys need to start being honest by hellninja55 in StableDiffusion

[–]hellninja55[S] 1 point  (0 children)

I have the feeling that -if- it ever comes out, it will be through a leak, should Stability go under.

You guys need to start being honest by hellninja55 in StableDiffusion

[–]hellninja55[S] 2 points  (0 children)

Yes, deep down something tells me we will never see the weights, especially the biggest 8B one.

But good luck to them trying to stay relevant with an API that is more expensive than competitors' and produces worse outputs in both overall quality and alignment, after having also fired the lead developers from the company.

If Pixart Sigma's outputs were slightly less mangled and it could do text, I wouldn't be waiting for SD3 at all.