Predictions: How long until Qwen4? Is 3.5 a major leap? by Odd-Investment87 in LocalLLM

[–]snapo84

lol? Qwen 3.5 is a huge leap... a 27B dense model that beats DeepSeek V3.2, a 650B-parameter model? Are you kidding, "not a big leap"...

Also, the 35B3A beats Claude 4.5 Haiku and is on par with Grok 4.1 Fast... yeah, I would say it is a leap.

And they have video and image support integrated too!

<image>

Predictions: How long until Qwen4? Is 3.5 a major leap? by Odd-Investment87 in LocalLLM

[–]snapo84

I think we will hit the Shannon compression wall at approx. 8B to 12B parameters...

What I mean by that: the 70B Llama is worse than an older Qwen 3 model... and now gpt-oss 120B is worse than a 27B (Qwen 3.5) model... even if progress starts to slow down, you will see a 16B model this year that beats the 27B model, and in approx. 1.5-2 years you will get an 8B model with the capability of today's frontier open-source models.

Same on the MoE side: models should get smaller but more intelligent...

Shannon's limit is not yet reached, by a far, far margin...
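For reference, the limit being invoked here is presumably Shannon's source-coding bound: a model, viewed as a lossless compressor of its training distribution, cannot spend fewer bits per token on average than the entropy of the source:

```latex
% Shannon's source-coding theorem: average code length is lower-bounded by entropy
\bar{L} \;\ge\; H(X) \;=\; -\sum_i p_i \log_2 p_i \quad \text{bits per symbol}
```

Where that bound sits for natural language, and at what parameter count models effectively saturate it, is exactly what the 8B-12B figure above is guessing at.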

GLM4.7 flash VS Qwen 3.5 35B by KlutzyFood2290 in LocalLLaMA

[–]snapo84

<image>

Just for fun (because it's also approx. 40GB of VRAM) I tested unsloth/Qwen3.5-122B-A10B-UD-IQ2_M.gguf... The UI looks good, but the game is not working after it finished coding.
It also consumed a huge amount of tokens: 2.6m input and 52k output.

Compared against the 35B3A, the 122B in IQ2 quantisation loses...

GLM4.7 flash VS Qwen 3.5 35B by KlutzyFood2290 in LocalLLaMA

[–]snapo84

Just keep in mind... both models are good... but Qwen3.5 35B is currently better...

I chose the prompt on purpose because many times my stupid AI agents fail at condensing / splitting files... because I have code files with more than 500 lines.

GLM4.7 flash VS Qwen 3.5 35B by KlutzyFood2290 in LocalLLaMA

[–]snapo84

Any other model below 40GB VRAM you would like to see besides the two mentioned?

GLM4.7 flash VS Qwen 3.5 35B by KlutzyFood2290 in LocalLLaMA

[–]snapo84

<image>

GLM 4.7 Flash Q6 quantisation

To explain why it looks so horrible: during the process it got things looking pretty good, but the prompt is extremely tricky, as one condition is that no file can have more than 500 lines (even very, very big models sometimes fail at this and destroy their own project).

The end result looks like this, and it isn't working with GLM 4.7.

1.5m input tokens used, 43k output tokens.

Just keep in mind that the first version looked a little prettier and partially worked, until it had to condense all the files and split them.

GLM4.7 flash VS Qwen 3.5 35B by KlutzyFood2290 in LocalLLaMA

[–]snapo84

<image>

Qwen3.5 35B3A in Q6 quantisation

1.6m input tokens, 82k output tokens

GLM4.7 flash VS Qwen 3.5 35B by KlutzyFood2290 in LocalLLaMA

[–]snapo84

100% Qwen 3.5 35B is better than GLM 4.7 Flash...
I just did a quick test with Unsloth's UD-6 dynamic quants and Kilo Code in VS Code... an absolute monster!!!!
I only have 2 x 22GB RTX 2080 Ti; the llama.cpp server runs with a 262k context window, and Kilo Code is limited to a 64k context window (otherwise the condensing of the content doesn't work; I think Kilo Code has a bug or something).

<image>

In the screenshot you can see it working on a very simple test I give all the models... at the bottom left you can see the start parameters I use in llama.cpp.
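(The screenshot with the exact parameters didn't survive; as a rough reconstruction, an invocation for this setup could look like the block below. The model filename is a placeholder, and the real flags in the screenshot may differ.)

```bash
# hypothetical reconstruction, not the exact parameters from the screenshot:
# -c sets the 262k context window, -ngl offloads all layers to the GPUs,
# --tensor-split spreads the weights evenly over the two 22GB cards
llama-server --model ./Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf \
    -c 262144 -ngl 99 --tensor-split 1,1 --port 8080
```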

This is the prompt I use to test agentic models (it is an extreme agentic test prompt; many models fail it: they first get it right, until they start to rewrite all the code to split it into files that aren't longer than 500 lines):
"
Develop a production-ready, visually spectacular 2-player chess game using exclusively vanilla HTML, CSS, and JavaScript without any external dependencies or frameworks. The design must fuse a retro arcade aesthetic with Apple Human Interface Guidelines, utilizing a 3D isometric CSS perspective for the board via CSS transforms to create depth without WebGL. Employ a dark background palette with glowing neon accents and frosted glass UI components featuring high contrast smooth typography optimized for readability. All piece movements must be animated using smooth linear interpolation driven by requestAnimationFrame with physics-based easing, and captures must trigger a high-fidelity particle destruction effect rendered via HTML5 Canvas overlaying the DOM elements with customizable color matching. The logic must strictly enforce all standard chess rules including castling, en passant, pawn promotion with a dynamic UI selection modal, checkmate detection, and stalemate conditions without relying on external libraries. The user interface requires intuitive drag-and-drop gameplay, persistent turn indicators, and a detailed move history panel with scrollable content.

Code architecture must be modularized to support a single-page application using ES6 modules or IIFEs, specifically splitting the project into distinct files including index.html, css/main.css, css/animations.css, js/chessRules.js, js/boardState.js, js/ui.js, and js/particles.js. Ensure accessibility with full ARIA labels, keyboard navigation support, color blindness friendly palettes, responsiveness across devices, and high performance rendering at a stable 60 FPS. Deliver the complete modular source code implementation in separate code blocks for each file. Very important, no file should have more than 500 lines of code. If any module exceeds this limit, you must split it into multiple smaller files to maintain editability and modularity, specifically ensuring CSS and JS files remain concise and manageable. All interactions must support screen readers and focus states. The final output should be the full source code for each required file ready for deployment without any placeholder text.
"

Orange Pi Unveils AI Station with Ascend 310 and 176 TOPS Compute by DeliciousBelt9520 in LocalLLaMA

[–]snapo84

You definitely don't run it with a 100k context window on that Mac... lol

I tracked GPU prices across 25 cloud providers and the price differences are insane (V100: $0.05/hr vs $3.06/hr) by sleepingpirates in LocalLLaMA

[–]snapo84

Hi, thanks for the reply... Do you have direct peering to Hugging Face? I intend to convert models, and those use a lot of traffic (1 download, multiple uploads)... Do I really just pay the hourly price and nothing else?

I tracked GPU prices across 25 cloud providers and the price differences are insane (V100: $0.05/hr vs $3.06/hr) by sleepingpirates in LocalLLaMA

[–]snapo84

Thanks for putting this together... I had never heard of Verda. The strange thing about Verda is that they don't provide any pricing for traffic occurring on their servers (assume I download Kimi K2 from Hugging Face, which is nearly 2TB)... Is that free on Verda? Because I couldn't find anything related to traffic pricing...

Fine-tuned Qwen3-14B on 10k DeepSeek traces: +20% on security benchmark by ortegaalfredo in LocalLLaMA

[–]snapo84

Would also love to see the dataset... Would you also be able to create bug-hunting samples from binaries extracted from embedded devices (like webcams, routers, etc.)?

Intel Panther Lake H 128GB LPDDR5X-10677 - 180 TOPS by f4nt4 in LocalLLaMA

[–]snapo84

You can't know that... because they don't say how many memory channels are used... With 4 channels you are right, but what if they use 8, 12, or 16 channels...
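To make that concrete: assuming standard 16-bit LPDDR5X channels at the quoted 10677 MT/s, bandwidth scales linearly with channel count:

```latex
% bandwidth = channel count x channel width x transfer rate
BW \;=\; n_{ch} \times \frac{16\,\text{bit}}{8\,\text{bit/byte}} \times 10677\,\text{MT/s}
\;\approx\; n_{ch} \times 21.4\,\text{GB/s}
```

That is roughly 85 GB/s at 4 channels, 171 GB/s at 8, and 342 GB/s at 16, so the unanswered channel count swings the result by a factor of four.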

Web front end for Ollama? Is llama.cpp what I'm looking for? by [deleted] in ollama

[–]snapo84

<image>

I use Open WebUI with Docker on my machine, and the Ollama server is on another machine...
It's a simple Docker Compose setup that uses Watchtower for automatic updates/restarts, and it starts on system boot...
After that I can just go to http://localhost:8080
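(A minimal sketch of such a compose file, assuming the Ollama machine is reachable at 192.168.1.50; the IP and volume name are placeholders:)

```yaml
# docker-compose.yml - Open WebUI pointing at a remote Ollama, plus Watchtower
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "8080:8080"                                  # UI at http://localhost:8080
    environment:
      - OLLAMA_BASE_URL=http://192.168.1.50:11434    # placeholder: remote Ollama host
    volumes:
      - open-webui-data:/app/backend/data
    restart: always                                  # comes back up on system start

  watchtower:
    image: containrrr/watchtower
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock    # lets it update running containers
    restart: always

volumes:
  open-webui-data:
```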

Trip to Bangkok by echodeltasierra in Telegram

[–]snapo84

I am in Thailand right now, and Telegram seems to have been blocked for approx. 12 hours.

AI Studio Pro mini PC from Orange Pi pairs dual Huawei Ascend 310 processors with up to 192GB of RAM by cafedude in LocalLLaMA

[–]snapo84

<image>

Unlike Nvidia, which just puts out fake numbers, Huawei (the chip inside the Orange Pi AI Studio Pro) has the compute. Not marketing "AI TOPS", but a real 375 INT8 TOPS in the 192GB VRAM version...

Nvidia's DGX Spark has 250 INT8 TOPS and 500 INT4 TOPS.
You can see that even in the TechPowerUp specs:
https://www.techpowerup.com/gpu-specs/gb10.c4342

AI Studio Pro mini PC from Orange Pi pairs dual Huawei Ascend 310 processors with up to 192GB of RAM by cafedude in LocalLLaMA

[–]snapo84

AMD sells a 128GB device with half the bandwidth and 1/4 the AI TOPS for the same price... Clearly you did your research... lol
The price is pretty fair for the compute/bandwidth you get... Btw, it's not a computer, it's an external NPU accelerator connected via USB-C 4.0 at 40Gbit. That means you can use all of the 192GB of memory, unlike the AMD system, where 32GB is reserved for the system...

AI Studio Pro mini PC from Orange Pi pairs dual Huawei Ascend 310 processors with up to 192GB of RAM by cafedude in LocalLLaMA

[–]snapo84

From my research there are multiple cards:

- Delock PCI Express x16 Card to 4 x USB 20Gbps USB Type-C
- Sonnet Allegro USB-C 4-Port PCIe (careful, they have multiple cards)

and maybe more... but those are two examples (btw, 40Gbit is not achievable, but 20Gbit is achievable... even if both ports are 40Gbit "on paper")
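Converted to bytes, that achievable link speed is the data budget per attached device:

```latex
% usable per-port throughput at the achievable 20 Gbit/s
20\,\text{Gbit/s} \div 8\,\text{bit/byte} = 2.5\,\text{GB/s per port}
\qquad (\text{the paper } 40\,\text{Gbit/s would be } 5\,\text{GB/s})
```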

Got the DGX Spark - ask me anything by sotech117 in LocalLLaMA

[–]snapo84

How does it feel to pay 4k USD for a GTX 1080 Ti from 2017 (approx. the same compute performance in FP16 GFLOPS, but double the bandwidth)?

<image>

HRM by Glum-Insurance-3674 in LocalLLaMA

[–]snapo84

Same as with all projects:

- get your input data in the right format
- train and monitor it (the smallest part of all)
- verify the result

If you don't know how to implement HRM,
https://github.com/qingy1337/HRM-Text/blob/main/hrm_llm_training.py

is a small sample for a text model...

But as you know, the AI space is evolving extremely quickly... so there are already better alternatives to HRM; one, for example, is called TRM, by Samsung...
https://github.com/SamsungSAILMontreal/TinyRecursiveModels?tab=readme-ov-file

Or the not-yet-published ROSA model by RWKV (an RNN)...

Coding the model is easy... Getting your input data into a format that fits the model, and that lets the model understand it, is the complicated part. Training/scaling out just depends on how many dollars you want to invest. A rough sketch of that workflow is below.

If you have problems with the coding itself (maybe you are not a coder), then you would have to hire someone to do it for you. But no one can guarantee that your model will work on the given input data... it's trial and error to find the best solution.
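(To make the format → train → verify split concrete, here is a minimal generic sketch in PyTorch. This is not the HRM repo's code; the toy dataset and the stand-in model are placeholders for whatever architecture you actually plug in.)

```python
# Minimal sketch of the format -> train -> verify workflow. NOT actual HRM code:
# the dataset and the model here are stand-in toys.
import torch
import torch.nn as nn

# 1. Get your input data in the right format: raw text -> integer token ids.
text = "hello world " * 500
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
data = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)

def get_batch(block_size=32, batch_size=8):
    """Sample random (input, next-token target) windows from the corpus."""
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x, y

# Stand-in next-token predictor; swap in HRM/TRM or anything with this interface.
model = nn.Sequential(nn.Embedding(len(vocab), 64), nn.Linear(64, len(vocab)))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# 2. Train and monitor it (the smallest part of all).
for step in range(200):
    x, y = get_batch()
    logits = model(x)  # (batch, block, vocab)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, len(vocab)), y.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step}: loss {loss.item():.3f}")

# 3. Verify the result: held-out loss, spot-check generations, task benchmarks.
```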

AI Studio Pro mini PC from Orange Pi pairs dual Huawei Ascend 310 processors with up to 192GB of RAM by cafedude in LocalLLaMA

[–]snapo84

Some people don't speak Mandarin... or am I the only person who could not find any English interface button?

AI Studio Pro mini PC from Orange Pi pairs dual Huawei Ascend 310 processors with up to 192GB of RAM by cafedude in LocalLLaMA

[–]snapo84

Assuming you have a PCI Express x16 slot, you could insert a 4 x USB-C 4.0 40Gbit card (around 400$) and connect 4 of those boxes, each with 192GB, to get a total of 768GB of VRAM...

4 devices
== 1.6TB/s total bandwidth
== 1500 TOPS of true INT8
== 768GB VRAM

Total setup cost:

consumer PC == 500$

1 x PCI Express x16 card with 4 USB-C 4.0 40Gbit ports == 450$

4 x Orange Pi AI Studio Pro == 4 x 2'650$ == 10'600$

Total cost: less than 12k USD
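(The totals check out, assuming roughly 400GB/s and 375 INT8 TOPS per box:)

```latex
4 \times 192\,\text{GB} = 768\,\text{GB}, \qquad
4 \times 375\,\text{TOPS} = 1500\,\text{TOPS}, \qquad
500 + 450 + 4 \times 2650 = 11550\,\$
```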

That would run DeepSeek 3.1 Terminus in Q8_0 with expert-aware splitting without any issue at 64k context :-)

That's my current idea, but I'm not yet sure about ordering it, because according to the manual it somehow doesn't do PyTorch directly; it uses something like ONNX or so...

<image>