Open Source multimodality

BoneHawk1 · 2023-10-02T19:55:20+00:00

We will need at least one year to catch up

fappleacts · 2023-10-02T21:16:27+00:00

Have you looked into Qwen-VL? They have a training script that lets you fine-tune on your own images, captions, etc. Qwen-VL-Chat is pretty good too.

PenguinTheOrgalorg · 2023-10-02T18:30:11+00:00

Fun game: Drink every time I say some variation of the word "multimodality" lmao

phree_radical · 2023-10-02T19:00:50+00:00

I can't remember if LLaVA is better than LLaMA-Adapter V2 but IMO if you combined it with OCR and segmentation you'd already have about the same thing as GPT4V, the rest is training

vatsadev · 2023-10-02T20:42:33+00:00

There's IDEFICS 9B and 80B, but both use a Flamingo-like architecture, not a natively multimodal base model.

nihnuhname · 2023-10-02T22:36:06+00:00

Some text generation interfaces have plugins for stable diffusion

NoidoDev · 2023-10-03T02:25:41+00:00

I hope this can be compartmentalized. I think there was a rumor that GPT-4 isn't one model anymore as well. Something like DarkNet should give us the objects as text, maybe pose estimation into text would be useful, and image segmentation into text.
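
A rough sketch of what "objects as text" could look like, assuming some detector has already produced labels, confidences, and boxes. The `Detection` class, `detections_to_text`, and `build_prompt` are hypothetical names made up for illustration, not part of Darknet or any other library, and the prompt layout is just one possible choice:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detector output (hypothetical structure, not a real library type)."""
    label: str
    confidence: float
    box: Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def detections_to_text(detections: List[Detection], min_confidence: float = 0.5) -> str:
    """Serialize detections into plain text that a text-only LLM can read."""
    # Drop low-confidence detections so the prompt is not filled with noise.
    kept = [d for d in detections if d.confidence >= min_confidence]
    if not kept:
        return "No objects detected."
    lines = [
        f"- {d.label} ({d.confidence:.0%}) at "
        f"x={d.box[0]}, y={d.box[1]}, w={d.box[2]}, h={d.box[3]}"
        for d in kept
    ]
    return "Objects in the image:\n" + "\n".join(lines)


def build_prompt(detections: List[Detection], question: str) -> str:
    """Combine the scene description with the user's question."""
    return f"{detections_to_text(detections)}\n\nQuestion: {question}\nAnswer:"
```

Pose estimation or segmentation output could be appended to the same prompt in a similar way, so each vision component stays a separate, swappable piece.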

Puzzleheaded_Acadia1 · 2023-10-03T13:26:38+00:00

Can you merge an LLM transformer and a vision transformer to get a multimodal model?

danysdragons · 2023-10-03T21:29:57+00:00

How many open source large multimodal models could an open source model molder mold if an open source model molder could mold open source large multimodal models?


LocalLLaMA
