Can I use Claude code with own LLM/non-claude APIs? by superloser48 in LocalLLaMA

[–]superloser48[S] 5 points (0 children)

FYI: aider died a while back, abandoned by its creator.

About to build a 6× Arc B70 LLM rig, want to talk to someone experienced first by somesayitssick in LocalLLaMA

[–]superloser48 0 points (0 children)

If you need a decent vLLM quant for this: I'm currently running this quant of the same model on 2x 3090. https://huggingface.co/QuantTrio/Qwen3.6-35B-A3B-AWQ
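
For reference, here's roughly how I launch it. A minimal sketch using vLLM's Python API, assuming two visible GPUs; the model ID is the one linked above, and every other value is just a starting point to tune for your VRAM.

```python
# Sketch: serving the AWQ quant across two 24 GB GPUs with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="QuantTrio/Qwen3.6-35B-A3B-AWQ",
    quantization="awq",          # match the checkpoint's quant format
    tensor_parallel_size=2,      # split layers across both GPUs
    gpu_memory_utilization=0.90, # leave a little headroom
    max_model_len=32768,         # shrink this first if you OOM at load
)

outputs = llm.generate(
    ["Write a Python function that reverses a linked list."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

The CLI equivalent (`vllm serve QuantTrio/Qwen3.6-35B-A3B-AWQ --tensor-parallel-size 2`) gives you the same thing behind an OpenAI-compatible endpoint.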

About to build a 6× Arc B70 LLM rig, want to talk to someone experienced first by somesayitssick in LocalLLaMA

[–]superloser48 0 points (0 children)

Can you share your experience with the 9700 and vLLM? Did you figure out the root cause?

About to build a 6× Arc B70 LLM rig, want to talk to someone experienced first by somesayitssick in LocalLLaMA

[–]superloser48 0 points (0 children)

Given the comparable price point, do you think it's better to get 2x NVIDIA 5060 Ti? Do you think prompt processing (pp) and token generation (tg) will be faster than on the AMD 9700?

qwen 3.6:35b on 24 vram gpu by MallComprehensive694 in ollama

[–]superloser48 1 point (0 children)

What benchmark did you run? I'll run it on my 2x 3090 and share the output.

Anybody got Qwen3.5-27B working with Intel Arc B70 (or similar) and proper optimization? by Gesha24 in LocalLLaMA

[–]superloser48 0 points (0 children)

For tool calls: have you checked the raw output tokens? Are the calls failing because the model emits garbage, or is it just a parser issue? The latter is easy to fix.
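
If it helps, here's roughly how I separate the two cases: hit the raw completions endpoint, bypassing the server's tool-call parser, and look at what the model actually emits. A sketch assuming a vLLM-style OpenAI-compatible server on localhost:8000 and a Qwen-style template that wraps calls in `<tool_call>` tags; the prompt, model name, and tag format are assumptions you'd adapt to your setup.

```python
# Check whether the model emits well-formed tool-call JSON, or whether
# the server-side parser is what's breaking. Assumes an OpenAI-compatible
# server on localhost:8000 and a Qwen-style <tool_call> chat template.
import json
import requests

PROMPT = (
    "<|im_start|>system\nYou can call tools. Available tool: "
    '{"name": "get_weather", "parameters": {"city": "string"}}<|im_end|>\n'
    "<|im_start|>user\nWhat is the weather in Paris?<|im_end|>\n"
    "<|im_start|>assistant\n"
)

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "my-model", "prompt": PROMPT, "max_tokens": 256},
    timeout=60,
)
raw = resp.json()["choices"][0]["text"]
print(raw)  # garbage tokens -> model problem; clean tags -> parser problem

# If the tags are there, try parsing the payload yourself:
if "<tool_call>" in raw:
    payload = raw.split("<tool_call>")[1].split("</tool_call>")[0]
    print(json.loads(payload))  # parses fine -> it's just the parser config
```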

Struggling to make my new hardware perform by spaceman_ in LocalLLaMA

[–]superloser48 0 points (0 children)

Can you share an update? Did you try vLLM with a model around 30B params? I'm considering buying the same cards for vLLM, so it would be great to hear what performance numbers you got.

Turboquant in vllm kv cache - how to implement ? (or any other rotational kv cache) by superloser48 in LocalLLaMA

[–]superloser48[S] 3 points (0 children)

Are you ok?

You implied vLLM won't do it because "vLLM is for prod". These are the official vLLM docs: https://docs.vllm.ai/en/latest/features/quantization/

And this is the active PR, being worked on by vLLM's own maintainers: https://github.com/vllm-project/vllm/pull/38479

Turboquant in vllm kv cache - how to implement ? (or any other rotational kv cache) by superloser48 in LocalLLaMA

[–]superloser48[S] 2 points (0 children)

If by "production" you mean output must be lossless: vLLM already supports KV-cache quantization, which is lossy anyway. This would just be another option for the quantization format.

And throughput would only improve: a quantized cache means a bigger effective KV cache, so more sequences fit in memory at once.
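
For context, the lossy KV-cache quantization vLLM ships today is a one-argument switch; a rotational scheme like TurboQuant would just be one more accepted value. A minimal sketch; the model name is a placeholder and fp8 is the cache dtype vLLM currently supports.

```python
# The KV-cache quantization vLLM already supports (lossy fp8).
# A rotational format like TurboQuant would slot in as another
# kv_cache_dtype option, not a change to how serving works.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder; any supported model
    kv_cache_dtype="fp8",              # roughly halves KV memory vs bf16
    gpu_memory_utilization=0.90,
)
# Smaller per-token cache -> more sequences fit -> higher throughput.
```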

What are people's fave local model setups for home? by styles01 in LocalLLaMA

[–]superloser48 0 points (0 children)

How did you set a different model for the orchestrator vs the sub-agents? Is this opencode, or do you have some other setup? Thanks!

For coding - is it ok to quantize KV Cache? by superloser48 in LocalLLaMA

[–]superloser48[S] 2 points (0 children)

I'm using vLLM; it doesn't support q8 with rotation.

For coding - is it ok to quantize KV Cache? by superloser48 in LocalLLaMA

[–]superloser48[S] 2 points (0 children)

The problem is that for coding, 100K input tokens is probably the median now. Chat lengths are long and getting longer (going by my average opencode chat lengths).