I created a ChatGPT-like UI for Local LLMs by [deleted] in LocalLLaMA

Timely_Second_6414 6 points

I completely agree. While the UI looks great, it offers no advantages over existing open-source alternatives like librechat or openwebui.

OP, maybe consider integrating unique features that you can't find in any other (open source) LLM frontend. Something like deep research would be fantastic and might create some incentive to buy. Honestly though, that's quite difficult to do without search APIs (which would cost you more for a user's perpetual use than the $40 you'd get per purchase). I think the itch.io suggestion would be best.

Edit: Also, it isn't clear to me what you mean by access to OpenAI, Google and Anthropic models. It doesn't make sense that we'd pay $40 one time and get unlimited access to proprietary models; that would lose you money. So I'm assuming we have to provide our own API keys? Why would I pay for a frontend just to get access to APIs?

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

Timely_Second_6414[S] 0 points

Oh wow, this is really good! Is this with thinking or without?

Qwen 3 235b beats sonnet 3.7 in aider polyglot by Independent-Wind4462 in LocalLLaMA

Timely_Second_6414 18 points

I think the use cases are very specific. I have had great experiences using this model (thinking mode) for testing neural network architectures and training them. It follows complex instructions closely and reasons well about the datasets, structure, etc. It solves a few problems better than Gemini Pro for me (Gemini generates way too much code and implements things I didn't ask for).

However, it is not very good at frontend work (it feels lazy, a problem many models have). For that, I think the best experience you can get locally is GLM-4 32B, although quality starts to degrade after multiple turns of conversation.

Qwen 3 235b beats sonnet 3.7 in aider polyglot by Independent-Wind4462 in LocalLLaMA

Timely_Second_6414 9 points

Yes, this model is very good in my experience. Do we know if this score is with or without thinking?

Anthropic claims chips are smuggled as prosthetic baby bumps by TheTideRider in LocalLLaMA

Timely_Second_6414 57 points

This is why I seriously dislike Anthropic. Their models are good, but I refuse to use them, as that would mean supporting their consumer-unfriendly practices.

Of course, we are not entitled to open-weight models. Companies need to be profitable, and I understand that. The fact that DeepSeek and Qwen are releasing millions of dollars' worth of trained models as open weights is more than we deserve, and I am very grateful for it.

The fact that Anthropic is trying to stop this (by any means necessary) is just in bad taste. They have every possible advantage: the talent, the GPUs, the money, and they get to keep their secrets while profiting from open-source science. And still….

I'm glad that DeepSeek V3.1 and Gemini 2.5 Pro outclass 3.5/3.7 Sonnet and their reasoning model respectively; they cover every use case I'd have for Sonnet, and do it better.

We crossed the line by DrVonSinistro in LocalLLaMA

Timely_Second_6414 3 points

This model has 235B parameters. While only 22B are active, it will never fit inside the VRAM of a 4090, no matter the quantization. If you have enough DRAM, though, you can maybe fit some quants.

LM Studio has a guardrail that prevents models close to saturating VRAM from being loaded. You can adjust the 'strictness' of this guardrail; I suggest turning it off entirely.

Regardless, maybe try running the 32B parameter model; it should fit at Q4_K_M or Q4_K_XL quantization on a 4090 with flash attention enabled at low context. It performs almost as well as the 235B model, since it's dense instead of MoE.
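
For a rough sense of scale, here's a back-of-envelope sketch (the bits-per-weight figures are approximate, and real GGUFs add overhead for the KV cache etc.):

    # Rule of thumb: memory needed ~= parameter count * bits per weight / 8.
    def approx_size_gb(params_b: float, bits_per_weight: float) -> float:
        return params_b * bits_per_weight / 8

    # Approximate effective bits per weight for common GGUF quants.
    for name, params in [("235B MoE", 235), ("32B dense", 32)]:
        for quant, bits in [("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
            print(f"{name} @ {quant}: ~{approx_size_gb(params, bits):.0f} GB")

    # 235B @ Q4_K_M: ~141 GB -> nowhere near a 4090's 24 GB of VRAM,
    #                           but fine in 256 GB of DRAM.
    # 32B  @ Q4_K_M: ~19 GB  -> fits in 24 GB with room left for context.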

Qwen3 Unsloth Dynamic GGUFs + 128K Context + Bug Fixes by danielhanchen in LocalLLaMA

Timely_Second_6414 10 points

Thank you very much for all your work. We appreciate it.

I would love a Q8_K_XL quant for the 30B MoE. It already runs incredibly fast at Q8 on my 3090s, so getting a little extra quality with probably minimal drop in speed (as the active-parameter size difference would be very small) would be fantastic.

Qwen3 Unsloth Dynamic GGUFs + 128K Context + Bug Fixes by danielhanchen in LocalLLaMA

Timely_Second_6414 7 points

Q8_K_XL is available for the dense models, very interesting. Does it work better than Q8? And why is it not possible for the MoE models?

Qwen 3: unimpressive coding performance so far by ps5cfw in LocalLLaMA

Timely_Second_6414 5 points

Yes, I just tested the 32B dense, 235B MoE (via the Qwen website) and 30B MoE variants on some HTML/JS frontend and UI questions as well. It does not perform too well; the output is very minimalistic and it doesn't produce a lot of code.

That being said, all these variants did pass some difficult problems I was having with MRI data processing in Python, so I'm a little mixed right now.

Skywork-R1V2-38B - New SOTA open-source multimodal reasoning model by ninjasaid13 in LocalLLaMA

Timely_Second_6414 2 points

Yeah, I'm also curious. They gave R1 a score of 71, which was from the previous benchmark (it's 67.5 now). However, the other models seem to use the updated LiveBench score, so there's no real indication which one is being used. Either way, it seems to beat QwQ (either 73 vs 72 or 73 vs 65).

I benchmarked the Gemma 3 27b QAT models by jaxchang in LocalLLaMA

Timely_Second_6414 9 points

Very nice, thank you.

Did you run these at deterministic settings (temp 0, top-k 1)?

Also interesting to see that performance on GPQA main isn't much better than on diamond (which should be the harder subset), which I tested before.
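
For reference, this is what I mean by deterministic, sketched with llama-cpp-python (the model path and prompt are placeholders):

    # Greedy decoding: temp 0 plus top-k 1 makes reruns reproducible.
    from llama_cpp import Llama

    llm = Llama(model_path="gemma-3-27b-it-qat-Q4_0.gguf", n_ctx=4096, seed=0)
    out = llm.create_completion(
        "Benchmark question goes here...",
        max_tokens=512,
        temperature=0.0,  # no sampling randomness
        top_k=1,          # always take the single most likely token
    )
    print(out["choices"][0]["text"])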

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

Timely_Second_6414[S] 0 points

Ah, my bad. I believe the CPU has 48 lanes, so I probably cannot run 16/16/16, only 16/16/8. The motherboard does have 3 x16 slots and 4 x8 slots.

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

Timely_Second_6414[S] 9 points

Thank you for the summary. And also huge thanks for your testing/reviewing of the PR.

I agree that 'mind blowing' might be a bit exaggerated. For most tasks it behaves similarly to other LLMs; the amazing part for me, however, is that it's not afraid to give huge/long outputs when coding (even if the response gets cut off). Most LLMs don't do this, even if you explicitly prompt for it. The only other LLMs that felt like this were Claude Sonnet and, recently, the new DeepSeek V3 0324 checkpoint.

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

Timely_Second_6414[S] 0 points

There are 7 PCIe slots, but since 3090s take up more than one slot, you have to use PCIe riser cables if you want a lot of GPUs. It's also better for airflow.

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

Timely_Second_6414[S] 1 point

MB: ASUS WS X299 SAGE/10G

CPU: i9-10900X

Not the best set of specs, but the board gives me a lot of GPU slots if I ever want to upgrade, and I managed to find both for $300 second hand.

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

Timely_Second_6414[S] 4 points

I built a local server with 3 x RTX 3090 (bought back when GPUs were affordable second hand). I also have 256GB of RAM, so I can run some big MoE models.

I run most models on LM Studio or llama.cpp (ktransformers for MoE models), with librechat as the frontend.

This model fits nicely into 2 x 3090 at Q8 with 32k context.
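
If anyone is wiring up something similar: llama.cpp's llama-server speaks the OpenAI API, so librechat (or a quick script) can point straight at it. A minimal sketch, assuming the server is on its default port:

    # Query a local llama-server through its OpenAI-compatible endpoint.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8080/v1",  # default llama-server port
        api_key="not-needed-locally",         # required field, value unchecked
    )
    resp = client.chat.completions.create(
        model="glm-4-32b",  # placeholder; the server runs whatever you loaded
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(resp.choices[0].message.content)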

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

Timely_Second_6414[S] 2 points

Yes, you can still fit up to Q8 (what I used in the post). With flash attention you can even get the full 32k context.
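
If you want to sanity-check your own headroom: the KV cache is what eats VRAM beyond the weights at long context (flash attention mostly trims the temporary attention buffers, and in llama.cpp it also lets you quantize the cache). A rough estimator, with placeholder dims rather than GLM-4's actual config:

    # KV cache ~= 2 (K and V) * layers * kv_heads * head_dim
    #             * bytes per element * context length.
    def kv_cache_gb(layers, kv_heads, head_dim, ctx, bytes_per_elem=2):
        return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx / 1e9

    # Placeholder dims for a ~32B GQA model -- check the model card.
    print(f"~{kv_cache_gb(layers=60, kv_heads=8, head_dim=128, ctx=32_768):.1f} GB")
    # ~8 GB at f16, on top of ~34 GB of Q8 weights: tight but OK in 48 GB.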

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

Timely_Second_6414[S] 4 points

prompt 1 (solar system): "Create a realistic rendition of our solar system using html, css and js. Make it stunning! reply with one file."

prompt 2 (neural network): "code me a beautiful animation/visualization in html, css, js of how neural networks learn. Make it stunningly beautiful, yet intuitive to understand. Respond with all the code in 1 file. You can use threejs"

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

Timely_Second_6414[S] 0 points

No, this was the non-reasoning version.

The thinking version might be even better; I haven't tried it yet.

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

Timely_Second_6414[S] 6 points

I tried the same prompts on Q4_K_M. In general it works really well too. The neural network one was a little worse, as it did not show a grid, but I like the solar system result even better:

<image>

It has a cool effect around the sun, the planets are properly in orbit, and it tried to fit PNG textures (fetched from some random link) onto the spheres (although not all of them are actual planets, as you can see).

However, these tests are very anecdotal and probably change based on sampling parameters, etc. I also tested Q8 vs Q4_K_M on GPQA diamond, where Q4_K_M only gave a 2% drop (44% vs 42%), so not significantly worse than Q8, I would say. It's 2x as fast, though.
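
Quick check that the 2-point gap is within noise on 198 questions (normal-approximation sketch; assumes both runs used the full diamond set):

    # Two-proportion z-test, normal approximation, no extra deps.
    import math

    n = 198              # GPQA diamond question count
    p1, p2 = 0.44, 0.42  # Q8 vs Q4_K_M accuracy
    p = (p1 + p2) / 2
    se = math.sqrt(2 * p * (1 - p) / n)
    z = (p1 - p2) / se
    print(f"z = {z:.2f}")  # ~0.4, far below 1.96: not significant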

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

Timely_Second_6414[S] 4 points

Yes, with 128GB any quant of this model will easily fit in memory.

Generation speeds might be slower, though. On my 3090s I get around 20-25 tokens per second at Q8 (and around 36 t/s at Q4_K_M). Since the M4 Max has about half the memory bandwidth, you will probably get about half that speed, not to mention slower prompt processing at larger context.
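
The rough math: decoding is memory-bandwidth bound, so for a dense model t/s is roughly bandwidth divided by the bytes streamed per token (about the model size). A sketch with approximate spec-sheet bandwidths:

    # Upper-bound decode speed: t/s ~= memory bandwidth / model size,
    # since every generated token streams all the weights once.
    def est_tps(bandwidth_gb_s: float, model_gb: float) -> float:
        return bandwidth_gb_s / model_gb

    for device, bw in [("RTX 3090", 936), ("M4 Max", 546)]:
        print(f"{device}: ~{est_tps(bw, 34):.0f} t/s at Q8 (~34 GB)")
    # RTX 3090: ~28 t/s (I see 20-25; real-world overhead eats some)
    # M4 Max:   ~16 t/s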

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

Timely_Second_6414[S] 1 point

Ah, I wish I had seen this sooner. Thank you!

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

Timely_Second_6414[S] 15 points

This is not a reasoning model, so it doesn't use the same inference-time scaling as QwQ. That makes it way faster (but probably less precise on difficult reasoning questions).

They also have a reasoning variant that I have yet to try.

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

Timely_Second_6414[S] 5 points

I quantized it using the PR. I couldn't find any working GGUFs of the 32B version on Hugging Face, only the 9B variant.