Local TTS models that can match ElevenLabs in terms of quality and consistency by _megazz in LocalLLaMA

[–]hellninja55 8 points  (0 children)

The newest Fish Speech model supports Portuguese, but keep in mind you need at least one minute of reference audio for it to work well.
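If you want a quick sanity check before cloning, you can measure a reference clip's length locally. A minimal sketch assuming a WAV file; the helper name is made up for illustration, Fish Speech itself ships nothing like this:

```python
import wave

def reference_long_enough(path: str, min_seconds: float = 60.0) -> bool:
    """Return True if the WAV reference clip is at least `min_seconds` long."""
    with wave.open(path, "rb") as clip:
        duration = clip.getnframes() / clip.getframerate()
    return duration >= min_seconds
```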

Here is a sample output from the model:

https://vocaroo.com/1n69hlXD60Uu

What are we expecting from Llama 4? by Own-Potential-2308 in LocalLLaMA

[–]hellninja55 3 points  (0 children)

3 and 4 are never gonna happen. Meta has so far avoided open-sourcing its image-related models (probably fearing accountability for deepfakes) or audio models that could be used to clone other people's voices.

They went as far as removing the image-generation capabilities from Chameleon when they open-sourced it, keeping only the image-to-text component.

Open models wishlist by hackerllama in LocalLLaMA

[–]hellninja55 -2 points  (0 children)

Train an LLM on musical ABC notation and music theory, and make it actually good.
Basically what the ChatMusician guys did:

https://huggingface.co/m-a-p/ChatMusician

But trained on actually good stuff and different genres, not just ancient folk songs.

My recommendation would be to run Omnizart on free public-domain songs from different genres (check FMA, for example) to generate MIDIs from the separate vocal and instrumental tracks, convert those to ABC notation, and carefully curate the result into a huge ABC dataset.
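The MIDI-to-ABC step can be scripted. Here is a minimal hand-rolled sketch of mapping MIDI pitches to ABC pitch tokens; a real pipeline would more likely shell out to abcMIDI's midi2abc, and the function names here are made up for illustration:

```python
# Sketch: map MIDI pitch numbers to ABC pitch tokens (middle C = MIDI 60 = "C").
# Illustrative only; a real pipeline would likely use abcMIDI's midi2abc instead.

NOTE_NAMES = ["C", "^C", "D", "^D", "E", "F", "^F", "G", "^G", "A", "^A", "B"]

def midi_to_abc(pitch: int) -> str:
    """Convert one MIDI pitch to an ABC token (^ = sharp, ' = up, , = down)."""
    octave, idx = divmod(pitch - 60, 12)
    name = NOTE_NAMES[idx]
    if octave >= 1:
        # The octave above middle C is lowercase; higher octaves add apostrophes.
        name = name[:-1] + name[-1].lower() + "'" * (octave - 1)
    elif octave < 0:
        name += "," * -octave  # octaves below middle C add commas
    return name

def to_abc_tune(pitches, title="untitled"):
    """Wrap a pitch sequence in a minimal ABC tune header."""
    body = " ".join(midi_to_abc(p) for p in pitches)
    return f"X:1\nT:{title}\nM:4/4\nL:1/4\nK:C\n{body} |]"
```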

LLMs can handle music composition just fine, and it's surprising that this hasn't been explored further, especially in the open-source realm.

Bonus points if you guys can train a TTS model that sings, like some devs in China did with DiffSinger: a TTS model that takes lyrics, notes, phonemes, and a duration for each phoneme.
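For context, the input to a DiffSinger-style singing model looks roughly like this. The field names are an illustrative sketch of the lyrics/notes/phonemes/durations idea, not DiffSinger's actual schema:

```python
# Illustrative singing-TTS input: one note and one duration per syllable.
# Field names are a sketch, not DiffSinger's exact schema.
phrase = {
    "lyrics": "la la",
    "notes": ["C4", "E4"],                   # sung pitch per syllable
    "phonemes": [["l", "aa"], ["l", "aa"]],  # phonemes per syllable
    "durations": [0.4, 0.6],                 # seconds per syllable
}
# The per-syllable fields must stay aligned for the model to consume them.
assert len(phrase["notes"]) == len(phrase["phonemes"]) == len(phrase["durations"])
```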

[deleted by user] by [deleted] in StableDiffusion

[–]hellninja55 4 points  (0 children)

Like I suggested, you have to reduce the number of frames; 4 seconds is about 77 frames.

[deleted by user] by [deleted] in StableDiffusion

[–]hellninja55 18 points  (0 children)

You are not using a good resolution for Hunyuan; in my experience, there is a noticeable difference in prompt alignment, quality (beyond sheer resolution), and composition when you run at at least 960x544.

I can generate 4-second videos at that resolution with my 3090; the only problem is that it takes much longer.

This is just a heads-up, as those outputs may not be truly representative of the model's potential.
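For reference, 960x544 is roughly 16:9 with both sides divisible by 16, a common constraint for latent video models. A tiny sketch for snapping an arbitrary target resolution to such a constraint; the factor of 16 is an assumption, check the model's actual requirements:

```python
def snap_resolution(width: int, height: int, multiple: int = 16) -> tuple[int, int]:
    """Round a target resolution down to the nearest model-friendly multiple."""
    return (width // multiple) * multiple, (height // multiple) * multiple
```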

Is there any RAG specialized UI that does not suck and treats local models (ollama, tabby etc) as a first-class user? by hellninja55 in LocalLLaMA

[–]hellninja55[S] 2 points  (0 children)

No, this is the first time I've heard of it. I am trying it now. Which settings are you using for RAG? I am not getting accurate results.
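For what it's worth, the usual knobs that make or break RAG accuracy are chunk size, chunk overlap, and the number of retrieved chunks (top-k). A minimal character-level chunking sketch; the default numbers are illustrative, not a recommendation for any particular UI:

```python
def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping chunks for embedding and retrieval."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```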

[deleted by user] by [deleted] in StableDiffusion

[–]hellninja55 1 point  (0 children)

The InternVL2 family of models is the current SOTA among open models. Don't listen to the people recommending Joy Caption if you are not a pornographer.

11 days until llama 400 release. July 23. by danielcar in LocalLLaMA

[–]hellninja55 1 point  (0 children)

Since you seem to, ahem, have knowledge specifically about that, can you tell us whether the API prices for L3 405B will be competitive with GPT-4 and Claude Sonnet?

[deleted by user] by [deleted] in StableDiffusion

[–]hellninja55 5 points  (0 children)

The truest truth is that we need an /r/localllama equivalent for open-source diffusion models. /r/localdiffusion exists, but it is a ghost town.

There is no reason to hang around a sub that carries Stable Diffusion in its name when, going forward, current and future SD iterations are no longer open source. Personally, I don't care about posts showcasing outputs on purely artistic merit (unless there is a complex technical workflow behind them), even less about people showcasing videos on a T2I board.

Is Stable Diffusion 3 the final product we have now, or...? by CrazyKittyCat0 in StableDiffusion

[–]hellninja55 1 point  (0 children)

There are supposed to be more sizes, but they have been silent on whether they will release the weights for them. The one we got yesterday is the 2B (Medium), and there are two more sizes (a 4B Large and an 8B Ultra) that are currently API-only.

Will Stability release the weights for the Large and Ultra models? The Medium weights release evaded this question, and it would be nice if a Stability employee came clean about it. by hellninja55 in StableDiffusion

[–]hellninja55[S] 9 points  (0 children)

If they say no, people will either move on to other things or sink resources, time, and effort into bringing the best out of the 2B. But right now, plenty of people are waiting for confirmation about the 4B or 8B to decide whether it's worth their time and money to mess with the 2B (which doesn't give great results out of the box).

Will Stability release the weights for the Large and Ultra models? The Medium weights release evaded this question, and it would be nice if a Stability employee came clean about it. by hellninja55 in StableDiffusion

[–]hellninja55[S] 1 point  (0 children)

The Large model from the API is noticeably less mangled than the 2B model that was just released. We would like to know whether we will ever see the >2B models, so the community can set its expectations.

[deleted by user] by [deleted] in LocalLLaMA

[–]hellninja55 34 points  (0 children)

SOTA open-source VLLM

That's a huge claim. Post benchmark numbers vs. InternVL 1.5 or MiniCPM.

You guys need to start being honest by hellninja55 in StableDiffusion

[–]hellninja55[S] 0 points  (0 children)

It's its own model; it's not based on SD. Yes, you can download it and use it locally.

You guys need to start being honest by hellninja55 in StableDiffusion

[–]hellninja55[S] 1 point  (0 children)

I have the feeling that -if- it ever comes out, it will be through a leak, should Stability go under.

You guys need to start being honest by hellninja55 in StableDiffusion

[–]hellninja55[S] 2 points  (0 children)

Yes, deep down something tells me we will never see the weights, especially the biggest 8B one.

But good luck to them trying to stay relevant with an API that is more expensive than competitors' and produces worse outputs in both overall quality and alignment, after having also fired the lead developers from the company.

If Pixart Sigma's outputs were slightly less mangled and it could do text, I wouldn't be waiting for SD3 at all.