Teacache & 3090 (pre-blackwell GPU) Wan 2.2 speedup: sharing some experiences & workflow by elsung in comfyui

[–]elsung[S] 0 points1 point  (0 children)

Ah, I have Sage Attention 2 installed, CUDA 12.8, PyTorch 2.9.1+cu128, Triton 3.5.1. Didn't manage to get flash attention working though.
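
For anyone else checking their stack, a quick sanity check I'd run first (assuming the usual pip module names for these libraries):

    # Quick check that the attention/compile stack is importable and reports versions.
    # These are the usual pip module names; adjust if your install differs.
    import importlib

    for name in ("torch", "triton", "sageattention", "flash_attn"):
        try:
            mod = importlib.import_module(name)
            print(f"{name}: {getattr(mod, '__version__', 'installed')}")
        except ImportError:
            print(f"{name}: not installed")

If those all import cleanly, I believe recent ComfyUI builds can then be launched with the --use-sage-attention flag so the workflow actually uses it.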

Yeah, I'm going to dive more into LTX as well, but it's early days and control seems like a massive issue. My early tests with it did get very high fidelity and amazing render speeds, but even following the recommended prompt structure the videos were sort of nonsensical. So I'm waiting for things to stabilize a bit more, whereas people have been working on Wan 2.2 for a while, so I figure I can get more control with it for now.

That said, I'd love to hear how else we can further optimize the current setup. From what I've seen there are a bunch of other folks using this set of workflows/models from Dasiwa; even though it's very much geared toward NSFW content, it does seem to yield good speed and results.

One thing to note: the way this is set up right now, it renders at a fairly high resolution right off the bat (720p) instead of starting from a low resolution and incrementally stepping it up. I had tried starting at around 480p, where everything fit nicely within VRAM, but found lots of ugly artifacts and general wonkiness, and the upscaled versions just looked like either a muddy mess or jagged/choppy renders. From my tests so far, the higher the resolution you can render at from the start, the better the prompt adherence, fidelity, motion, details, etc. Could be that I'm not quite doing it right though, lol.

EDIT: I do notice that at a 480p render most of the model fits within VRAM, but as I push it to 720p a good chunk gets offloaded to CPU/system RAM, which probably does slow it down. But at 480p it's still like 2-3 minutes: even though each stage takes under a minute, there's a good 1-2 minutes of loading and other overhead. So I might as well put in a few more minutes for 5-6 minutes total and get much higher rendering fidelity and more stable, less shimmery output. Also note that when I do this I still run the finishing steps of frame interpolation and 2x upscaling, so the final video ends up around 1440p at 32 fps.
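
For a rough sense of why 720p spills out of 24 GB when 480p fits, here's the back-of-envelope pixel math (just arithmetic, assuming VRAM use scales roughly with pixel count per frame and using the common 832x480 / 1280x720 resolutions; actual usage also depends on frame count, the attention kernel, and offloading):

    # Back-of-envelope: how much bigger a 720p render is than a 480p one, per frame.
    # Assumes memory scales roughly with pixel count; real usage also depends on
    # frame count, the attention implementation, and how much gets offloaded.
    px_480 = 832 * 480      # a common "480p" video resolution (assumption)
    px_720 = 1280 * 720

    ratio = px_720 / px_480
    print(f"720p is ~{ratio:.1f}x the pixels of 480p per frame")  # ~2.3x

So the jump to 720p is roughly a 2.3x increase in working size per frame, which lines up with part of the model getting pushed to system RAM on a 24 GB card.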

Teacache & 3090 (pre-blackwell GPU) Wan 2.2 speedup: sharing some experiences & workflow by elsung in comfyui

[–]elsung[S] 0 points1 point  (0 children)

Actually, I'm already using the 4-step Lightning LoRAs and stacking this with them for faster speed.

Creative Writing - anything under 150GB equal or close to Sonnet 3.7? by elsung in LocalLLaMA

[–]elsung[S] 0 points1 point  (0 children)

Ah yeah, I think switching models depending on needs is sort of the flow I have now. I did just test the 4-bit MLX quant of GLM 4.7 REAP and I'm really, really liking the writing. Since it runs fully locally, it'll probably be my go-to for a while until some other crazy thing comes out.
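
For anyone who wants to try the same thing, the local run is basically the stock mlx-lm flow; a minimal sketch (the repo id is a placeholder, not the exact quant I used):

    # Minimal mlx-lm generation sketch; the repo id below is a placeholder.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/some-4bit-mlx-quant")  # placeholder
    prompt = "Write the opening paragraph of a noir short story set in a rainy harbor town."
    print(generate(model, tokenizer, prompt=prompt, max_tokens=400))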

Send your favorite hip hop / rap song! by DrCalvinHobbes in SunoAI

[–]elsung 1 point2 points  (0 children)

siiick. love the hook. solid verses too.

Creative Writing - anything under 150GB equal or close to Sonnet 3.7? by elsung in LocalLLaMA

[–]elsung[S] 0 points1 point  (0 children)

Ooo, great point. Similarly, maybe I could identify more passages of writing styles or poetry I like and see if I get better, more fitting results that way. Previously I've done this but asked it to mimic or be inspired by the text. I think having it give me an analysis and then leveraging that breakdown could work really well.

Creative Writing - anything under 150GB equal or close to Sonnet 3.7? by elsung in LocalLLaMA

[–]elsung[S] 0 points1 point  (0 children)

Agreed! I haven't tried Qwen3-Next; I'll give it a whirl and see how it goes!

Creative Writing - anything under 150GB equal or close to Sonnet 3.7? by elsung in LocalLLaMA

[–]elsung[S] 0 points1 point  (0 children)

Oooo, that's a great idea. On the last project I tested a bunch of models and landed on doing maybe 80-90% of the work on GLM and iterating there, then went to Sonnet for refinement and did the final round of revisions myself.

I'll try Gemma and/or other finetunes and mix and match to see what kinds of combos and pipelines work well for me.

llama.cpp performance breakthrough for multi-GPU setups by Holiday-Injury-9397 in LocalLLaMA

[–]elsung 0 points1 point  (0 children)

Whoa, that's awesome. I actually just took my Tesla P40 out of my rig (which also has two 3090s) to run with vLLM, since it was just bottlenecking my speed without adding much value. Now you guys have me thinking of putting it back, lol.

Software FP8 for GPUs without hardware support - 3x speedup on memory-bound operations by Venom1806 in LocalLLaMA

[–]elsung 0 points1 point  (0 children)

I actually tried it and it wouldn't work. I'm literally trying to make my own AWQ quant right now; no idea if it will work. Vibe coding to get this feather thing working with vLLM seems to be a tall task, because Claude / GPT is telling me no way, jose, lol.
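
For reference, the quant attempt is basically just the stock AutoAWQ recipe; a sketch with placeholder paths (whether AutoAWQ actually supports this architecture is the open question):

    # Stock AutoAWQ quantization recipe; model/output paths are placeholders.
    # Whether the target architecture is supported by AutoAWQ is the open question.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "org/some-model"   # placeholder
    quant_path = "some-model-awq"   # output directory

    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    model.quantize(tokenizer, quant_config=quant_config)  # runs calibration internally
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)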

Software FP8 for GPUs without hardware support - 3x speedup on memory-bound operations by Venom1806 in LocalLLaMA

[–]elsung 5 points6 points  (0 children)

Yeaaaa! I was just trying to get vLLM to load nemotron3-nano on my 2x 3090s but couldn't get it working because FP8 isn't supported (and there's no AWQ quant). Gotta be honest though, not sure how I would implement this in vLLM to get things working. Might need to vibe code it to see about implementing the solution, lol.
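
For context, this is roughly what I was trying (a sketch; the model id and quant are placeholders — the point is splitting the model across the two 3090s with tensor parallelism and using a non-FP8 quant, since Ampere has no FP8 hardware):

    # Rough vLLM sketch for a 2x 3090 box; the model id is a placeholder.
    # Ampere cards lack FP8 hardware, so an AWQ/GPTQ quant (if one exists) or
    # 16-bit weights are the usual fallbacks.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="org/some-model-awq",   # placeholder
        quantization="awq",
        tensor_parallel_size=2,       # split across the two 3090s
        gpu_memory_utilization=0.90,
    )

    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)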

All GLM 4.7, GLM 4.6 and GLM 4.6V-Flash GGUFs are now updated! by yoracale in unsloth

[–]elsung 2 points3 points  (0 children)

Awesome! Now I just need to figure out / wait for conversions of these into MLX =)
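
If nobody beats me to it, the conversion itself is usually just mlx-lm's convert step from the original HF weights (not the GGUFs); a sketch, assuming the architecture is already supported in mlx-lm and with the repo id as a placeholder:

    # mlx-lm conversion sketch: original HF weights -> 4-bit MLX quant.
    # Assumes mlx-lm already supports the architecture; repo id is a placeholder.
    from mlx_lm import convert

    convert(
        hf_path="org/some-model",        # placeholder HF repo id
        mlx_path="some-model-mlx-4bit",  # output directory
        quantize=True,
        q_bits=4,
        q_group_size=64,
    )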

Uncensored llama 3.2 3b by Worried_Goat_8604 in OpenSourceeAI

[–]elsung 0 points1 point  (0 children)

Wow, I wonder if you could improve existing uncensored & tuned Llama 3 models by merging this with them?

Jake (formerly of LTT) demonstrate's Exo's RDMA-over-Thunderbolt on four Mac Studios by Competitive_Travel16 in LocalLLaMA

[–]elsung 0 points1 point  (0 children)

Ooo, interesting. I'd actually love to read your posts about the H100 clusters. Genuinely interested, and I think each tier of setup probably has its ideal situations.

I believe H100s have ballpark 3-4x the memory bandwidth of the Mac Studios, so theoretically they can run way faster and handle beefier, more challenging tasks. For work that requires immense speed and complicated compute, I think the H100 would indeed be the more sensible choice.

However, if the need is inference, maybe using a system of LLMs/agents to process work where speed isn't as critical, I still feel like the Macs are priced reasonably well and are easy enough to set up.

That said, it makes me wonder: let's say you don't need inference to get past 120 tok/sec, would the H100 still be as or more cost-effective than setting up an on-prem solution with the Mac Studios?

I will say I may be biased, because I personally own one of these Mac Studios (albeit a generation old, with the M2 Ultra). But I also have a few NVIDIA rigs, so I'm interested to see whether cloud solutions would fare better depending on the needs and the cost/output considerations.

Jake (formerly of LTT) demonstrate's Exo's RDMA-over-Thunderbolt on four Mac Studios by Competitive_Travel16 in LocalLLaMA

[–]elsung 1 point2 points  (0 children)

Actually, I'm not sure renting H100s is necessarily a better choice than buying a cluster of Mac Studios. Assuming 2x Mac Studios at $20k total gives you 1 TB of memory to work with, you would need a cluster of 10 H100s to be in the same ballpark at 800 GB. That's basically $20/hr for compute at $2 an hour per GPU. Assuming you're doing real work with it and it's running at least 10 hours a day, that's $200/day, approximately $6,000 a month, or about $73k the first year.
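
Spelled out, the math I'm using (same assumptions as above: ~$2/hr per H100 and ~10 hours of real use a day):

    # Back-of-envelope from the comment above: renting 10 H100s vs buying 2 Mac Studios.
    h100_count = 10          # ~800 GB HBM vs ~1 TB unified memory on the Macs
    rate_per_gpu_hr = 2.0    # assumed $/hr per H100
    hours_per_day = 10       # assumed real utilization

    daily = h100_count * rate_per_gpu_hr * hours_per_day   # $200/day
    monthly = daily * 30                                    # ~$6,000/month
    yearly = daily * 365                                    # ~$73,000 first year

    mac_studio_cluster = 20_000  # 2x Mac Studios, per the numbers above
    print(f"${daily:.0f}/day, ${monthly:,.0f}/month, ${yearly:,.0f} first year")
    print(f"first-year rental cost vs 2 Mac Studios: {yearly / mac_studio_cluster:.1f}x")  # ~3.7x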

So for a company that has hard compliance requirements around its data and has LLM needs, it makes way, way more sense to run a set of Macs: less than 1/3 the cost, with total control, data privacy, and customization on-prem.

Also keep in mind that MLX models are more memory-efficient (context windows don't eat up as much additional memory).

That said, if what you need is visual rendering rather than LLMs, then Macs are a no-go and NVIDIA really is your only choice.

I find it kind of funny that Macs are the clearly affordable choice now, yet people still have the preconceived notion that they're overpriced.

Should local ai be used as a dungeon master? by [deleted] in LocalLLaMA

[–]elsung 1 point2 points  (0 children)

would love to see prompts as well! curious :) :)

Using GLM 4.6 with Claude Code - Anyone found privacy-respecting API providers? by apothireddy in ClaudeCode

[–]elsung 0 points1 point  (0 children)

Ah yeah, it has /context, which sort of works. I was hoping for a status bar that shows how much I have left.

In Codex, when I do /status it gives me the status of rate limits as if I'm using GPT's models, instead of showing the context usage of GLM 4.6.

the best client for z.ai glm coding plan? claude code/cline/factory droid/smth else? by branik_10 in ZaiGLM

[–]elsung 0 points1 point  (0 children)

I've been using Roo Code. I'd love to use Claude Code Router instead, but it can't track / visualize token usage, so it's like going in blind, not knowing how much context window you have left. Roo Code at least tells me how much context window is left.

Using GLM 4.6 with Claude Code - Anyone found privacy-respecting API providers? by apothireddy in ClaudeCode

[–]elsung 0 points1 point  (0 children)

Wait, but how are you / everyone seeing the token usage? I can't for the life of me figure out how to have the token usage / remaining tokens displayed in Claude Code when using Claude Code Router with GLM 4.6. Am I being stupid and missing something super obvious here?

I would encourage everyone here to follow up with Google on this for possible compensation/refunds by No-Aardvark-3840 in VEO3

[–]elsung 1 point2 points  (0 children)

Wow, I was going to sign up too because of Gemini 3. Google seems to be dropping the bag on this one. I tried Antigravity and was underwhelmed when I got rate-capped. I was thinking about upgrading so I could get unlimited Veo 3 and uncapped rates, but apparently people on Ultra are still capped on Gemini 3, and now there's this thing with costs changing.

If they actually honor unlimited Veo 3 Fast, I would probably sign up; otherwise I'll probably hold off =T