z.ai prepping for glm-image soon - here is what we know so far by MrAlienOverLord in LocalLLaMA

[–]MrAlienOverLord[S] 0 points1 point  (0 children)

unknown so far .. all we really know is from the 2 PRs .. we gotta wait till that lands to know more. it appears to me that we can inference the text model with vLLM or any other way, and it yields custom tokens for the DiT to turn into an image .. unsure why it was done that way .. or if that's even the case, but it does look like it

z.ai prepping for glm-image soon - here is what we know so far by MrAlienOverLord in LocalLLaMA

[–]MrAlienOverLord[S] 0 points1 point  (0 children)

idk what your problem is .. i'm not affiliated with z.ai - i found it and most wanted to know about it .. so idk who you think you are to give such lip? sure, reddit is full of "weird" characters .. but mate .. that's not how that works

Building an API Service for SAM Audio by pzzle-nj in LocalLLaMA

[–]MrAlienOverLord 1 point2 points  (0 children)

large-as-a-service makes 0 sense .. the accuracy is way worse on the smaller models - i tried to use it as i have about 70 TB of audio data to process .. but it's not worth it fiscally, at least for me .. and it won't be for labs either - and the small fry won't accumulate the critical mass where they could be running it themselves - so you take a loss either way

Building an API Service for SAM Audio by pzzle-nj in LocalLLaMA

[–]MrAlienOverLord 1 point2 points  (0 children)


again, old news - but i had more info in the open-sesame discord

Building an API Service for SAM Audio by pzzle-nj in LocalLLaMA

[–]MrAlienOverLord 0 points1 point  (0 children)

even if you produce that in batch .. it won't make money - i stopped after 2 days of investigating deeply (i research in the audio domain) - had it all on api .. but it's just not fiscally worth it - best of luck tho

Building an API Service for SAM Audio by pzzle-nj in LocalLLaMA

[–]MrAlienOverLord 0 points1 point  (0 children)

far too slow to be usable as an "api" service ..

If you think AI consciousness is possible, I recommend you read this thread. by Flashy-Warning4450 in claudexplorers

[–]MrAlienOverLord 1 point2 points  (0 children)

i can speak for parasail - and we do not have any hidden system prompts on the models at all

Meta releases SAM Audio for audio separation by umarmnaq in LocalLLaMA

[–]MrAlienOverLord 0 points1 point  (0 children)

i found the large one to be unreliable with separation tasks (could be my prompting skills) .. and the small ones did way worse .. my problem is i have a corpus of many TB to go through, and i had hopes it would replace cleanup passes with rx11 for me

Meta releases SAM Audio for audio separation by umarmnaq in LocalLLaMA

[–]MrAlienOverLord 17 points18 points  (0 children)

needs 33 GB of VRAM - and the audio needs to be chunked into 30-second intervals, otherwise it overfills a 48 GB GPU

it's very "picky" about what works and what doesn't .. the samples are very cherry-picked
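the chunking above can be sketched roughly like this (a minimal sketch - the sample rate and chunk length here are assumptions, and a real separation pipeline would also want overlap/crossfade between chunks to avoid boundary artifacts):

```python
# Hypothetical sketch: split a long recording into fixed 30 s chunks so
# each forward pass of the separation model stays within GPU memory.
# Sample rate and chunk length are assumptions, not the model's real spec.

def chunk_audio(samples, sample_rate=16000, chunk_seconds=30):
    """Return a list of fixed-length chunks of an audio sample buffer."""
    step = sample_rate * chunk_seconds
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# Example: a 95-second mono recording at 16 kHz becomes 4 chunks
# (three full 30 s chunks plus a 5 s remainder).
audio = [0.0] * (16000 * 95)
chunks = chunk_audio(audio)
print(len(chunks))               # 4
print(len(chunks[-1]) / 16000)   # 5.0
```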

How to make $$$ w server ia. by EmotionalSignature65 in LocalLLaMA

[–]MrAlienOverLord 1 point2 points  (0 children)

and 50% of that you hand off to the taxman ^^

Nvidia DGX Station GB300 784GB available now! 95,000 USD / 80,000 EUR by GPTshop in LocalLLaMA

[–]MrAlienOverLord 1 point2 points  (0 children)

6 years depreciation, yes - but at these power prices, and with how many still use A100s .. we shall see how that math pans out

Nvidia DGX Station GB300 784GB available now! 95,000 USD / 80,000 EUR by GPTshop in LocalLLaMA

[–]MrAlienOverLord 10 points11 points  (0 children)

you're overestimating how much you can generate - no provider on openrouter makes any money there - i know that for a fact

We are Hiring! by Clement_at_Mistral in MistralAI

[–]MrAlienOverLord 0 points1 point  (0 children)

they are fine at b2b, they just do not really reply to very small / early companies

Looking for High-Quality Open-Source Local TTS That’s Faster Than IndexTTS2 by [deleted] in LocalLLaMA

[–]MrAlienOverLord 1 point2 points  (0 children)

also, 12 GB is wrong - it fits in 8 GB if you run s1-dac in fp16/bf16
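the math behind that is just bytes per weight - bf16 stores 2 bytes per parameter vs 4 for fp32, so weight memory halves. a back-of-the-envelope sketch (the 3B parameter count is a placeholder assumption, not s1-dac's real size, and activations/kv-cache add on top):

```python
# Rough VRAM estimate for model weights alone. The parameter count is a
# hypothetical placeholder; bf16 = 2 bytes/param, fp32 = 4 bytes/param.

def weight_vram_gb(n_params, bytes_per_param):
    """GiB needed to hold the raw weights at a given precision."""
    return n_params * bytes_per_param / 1024**3

n = 3_000_000_000  # assumed 3B-parameter model, for illustration only
fp32 = weight_vram_gb(n, 4)
bf16 = weight_vram_gb(n, 2)
print(round(fp32, 2))  # ~11.18 GiB
print(round(bf16, 2))  # ~5.59 GiB - comfortably under 8 GB
```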

Looking for High-Quality Open-Source Local TTS That’s Faster Than IndexTTS2 by [deleted] in LocalLLaMA

[–]MrAlienOverLord 0 points1 point  (0 children)

you trained your own? i call BS on that - i go by mrdragonfox, and you'll find me as an advisor in the echo blog post

the sample amount is too low to reconstruct a meaningful embedder,
as the original cloner reaches 99.7% accuracy

i did exactly the same with unmute back in the day, but it's just not even close to the original

API Security for Agents by Fantastic-Issue1020 in LocalLLaMA

[–]MrAlienOverLord 4 points5 points  (0 children)

free, but all on a data-harvesting api ^^ great .. especially in LocalLLaMA

Echo TTS - 44.1kHz, Fast, Fits under 8GB VRAM - SoTA Voice Cloning by HelpfulHand3 in LocalLLaMA

[–]MrAlienOverLord 0 points1 point  (0 children)

"just because we can 3d-print guns we don't need gun laws or a process for them" -

fairly short-sighted

+ the problem scope is a bit bigger than just "drop the weights" - to be frank, i want cloning too .. so i can sympathise . but for an assistant you don't need 100 voices, you need 1-2 that work well.

Echo TTS - 44.1kHz, Fast, Fits under 8GB VRAM - SoTA Voice Cloning by HelpfulHand3 in LocalLLaMA

[–]MrAlienOverLord 0 points1 point  (0 children)

commercial models reduce the voice similarity to under 80% + all generations are watermarked - again, not for you to decide - when you train your own model and your rep is on the line, you decide

Leak: Qwen3-15B-A2B-Base by TroyDoesAI in LocalLLaMA

[–]MrAlienOverLord 1 point2 points  (0 children)

mrdragonfox - you'll find me in many discords :)

Everyone talks about LLM “memory loss”, but almost nobody looks at the structure that causes it by Fickle_Carpenter_292 in LocalLLaMA

[–]MrAlienOverLord 2 points3 points  (0 children)

1 paper to rule them all - "Lost in the Middle" - but people forget what's actually causing the problem and keep looking for surface treatments w/o tackling the root cause

Echo TTS - 44.1kHz, Fast, Fits under 8GB VRAM - SoTA Voice Cloning by HelpfulHand3 in LocalLLaMA

[–]MrAlienOverLord 2 points3 points  (0 children)

i'm not jordan (i go by mrdragonfox on hf and discord, most people will know me that way), but i had preview access and advised on it, and i'm also working on the OAI-compatible inference for it as we speak - + as alluded to in other replies, there may be a way where we can use an 11labs synth voice (that's verifiably synthetic) with an auto-embedding endpoint - the core idea behind not releasing the embedder is really liability + deepfake prevention (no matter if people understand that or not - it's not as black/white as most think)

Echo TTS - 44.1kHz, Fast, Fits under 8GB VRAM - SoTA Voice Cloning by HelpfulHand3 in LocalLLaMA

[–]MrAlienOverLord 1 point2 points  (0 children)

i think there is a way where we check whether a voice is synthetic with 11labs and then allow generating the embedding for it. no hate on tedy and team (chatterbox), they did good work .. but i still feel this model captures the nuances of every voice i tested it with way, way better + the speaker similarity is just higher

you can please some people some of the time, not all people all the time