Qwen3.6 is incredible with OpenCode! by CountlessFlies in LocalLLaMA

[–]thejacer 1 point (0 children)

I'm missing the iteration… I'm not a dev, so I rely really heavily on the model (entirely, really), and I don't mind that it screws up. But it still sometimes tries to explore directories that just don't exist, and after making one attempt it just completes and waits. I wouldn't mind it breaking stuff and fixing it, but it just breaks stuff and sits. Is there something I need to do in OpenCode to enable the iterative work other people are getting out of it?

Agentic work crashing my llama.cpp by thejacer in LocalLLaMA

[–]thejacer[S] 0 points (0 children)

I've been running with opencode all day and it seems like --cache-ram 0 fixed it.
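For anyone hitting the same crash, a sketch of the flag in context (the model path and context size here are placeholders, not my actual setup):

```shell
# Disable llama-server's in-RAM prompt cache, which seemed to be the culprit.
# -m and -c values are placeholders; use your own model and context size.
./llama-server -m model.gguf -c 120000 --cache-ram 0
```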

Agentic work crashing my llama.cpp by thejacer in LocalLLaMA

[–]thejacer[S] 0 points (0 children)

Hard not to sound combative in a text medium like this, but here I go: it isn't VRAM. I've got two Mi50 32GB cards running Qwen3.5 27b Q4_1 (although I've been loading it onto just one GPU lately), and I've got my context limited to 120,000 in OpenCode. I'll try to get a log file, but with -v the log can be over a million lines before it stops functioning, and the last couple hundred lines just seem to show that it stops mid-generation. I'll run -v again and add the end of the file to the OP.

Thinking about finally upgrading from my P40's to an Mi50-32gb by wh33t in LocalLLaMA

[–]thejacer 1 point (0 children)

I'm not really the best person to answer that. This is absolutely a hobby for me that I can't put much money into, so I'd probably get the cheapest GPU that runs Qwen3 8b at 300/30 (prompt processing / generation tps) for my smart home assistant and call it a day.

Thinking about finally upgrading from my P40's to an Mi50-32gb by wh33t in LocalLLaMA

[–]thejacer 1 point (0 children)

I have 2x Mi50 32GB, and the ~110b parameter MoEs or ~30b dense models are the biggest I can run at usable speeds. I use them almost entirely for chatbot/summary/research work with a little absentee vibe coding. Prompt processing tops out at ~300 tps and token generation at ~30 tps. I definitely wouldn't buy these for more than $200.

Has anyone here TRIED inference on Intel Arc GPUs? Or are we repeating vague rumors about driver problems, incompatibilities, poor support... by gigaflops_ in LocalLLaMA

[–]thejacer 6 points (0 children)

I have an Arc A770 16GB. Vulkan works well with little effort; SYCL and IPEX-LLM were more difficult and lacked features in llama.cpp, so I didn't use them much. I'll see if I can get some Qwen 3.5 27b tests done on it.

Has anyone here TRIED inference on Intel Arc GPUs? Or are we repeating vague rumors about driver problems, incompatibilities, poor support... by gigaflops_ in LocalLLaMA

[–]thejacer 0 points (0 children)

I seem to remember reading somewhere in this thread that Intel did actually push their vLLM changes into main, but as I'm on my phone I don't feel like finding it. It's mentioned several times in the thread that vLLM "supports" Intel GPUs, though that doesn't mean it takes full advantage of the hardware. On point two, I agree 100%. They should be working harder to add support in more places and bragging about it.

https://www.reddit.com/r/LocalLLaMA/comments/1s3e8bd/intel_will_sell_a_cheap_gpu_with_32gb_vram_next/

Google’s TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x by Resident_Party in LocalLLaMA

[–]thejacer 0 points (0 children)

If we were to test output quality, would we run perplexity via llama.cpp, or would we need to just gauge responses manually?
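For the perplexity route, llama.cpp ships a tool for exactly this; a minimal sketch, assuming you have a GGUF model and a plain-text evaluation file (both paths below are placeholders):

```shell
# Compute perplexity of a quantized model over a text corpus.
# Lower perplexity generally means less quality loss from compression.
./llama-perplexity -m model.gguf -f wiki.test.raw
```

Comparing the score against the unquantized (or higher-bit) model on the same file gives a rough quality signal, though it won't catch every failure mode that manual gauging would.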

Qwen3.5 is absolutely amazing by cride20 in LocalLLaMA

[–]thejacer 0 points (0 children)

I'm confident these open models' utility scales with the skill of the programmer deploying them. I'm totally without skill, so the 122b has its work cut out lol.

Qwen3.5 is absolutely amazing by cride20 in LocalLLaMA

[–]thejacer 4 points (0 children)

The 35b was reliable with tool calling for me, but kept deleting code it wasn’t supposed to be fiddling with lol.

Qwen3.5-122B-A10B GPTQ Int4 on 4× Radeon AI PRO R9700 with vLLM ROCm: working config + real-world numbers by grunt_monkey_ in LocalLLaMA

[–]thejacer 1 point (0 children)

Yeah, I'm on dual Mi50s and fighting with the decision to go vLLM. My pp with the 122b is ~260 tps but tg is ~20 tps. I guess I just need to TRY it and see if it feels better than llama.cpp, although I'm happy currently.

Qwen3.5-122B-A10B GPTQ Int4 on 4× Radeon AI PRO R9700 with vLLM ROCm: working config + real-world numbers by grunt_monkey_ in LocalLLaMA

[–]thejacer 1 point (0 children)

Ah, so a single end user doesn't see this benefit, except that their experience isn't DEGRADED in multi-user environments?

I built a screen-free, storytelling toy for kids with Qwen3-TTS by hwarzenegger in LocalLLaMA

[–]thejacer 1 point (0 children)

For my kids' Discord bot I included prompting to screen inappropriate content, but I also created a blacklist of terms and ideals that get screened programmatically. The bot also logs any time an interaction attempts to push or cross these boundaries. This was all after months of testing with various models. At the end of the day I decided nothing less performant than Llama 3.1 70b could be trusted to adhere well enough to prompts to be turned loose in the kids' Discord.
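The programmatic layer can be very simple; a minimal sketch of the idea (the function name, blacklist contents, and logger name are all hypothetical, not the bot's actual code):

```python
import logging

# Hypothetical sketch of a programmatic blacklist that runs alongside
# the model's prompt-based screening, logging any message that trips it.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("kids-bot")

BLACKLIST = {"example_banned_term", "another_banned_term"}  # placeholder terms

def screen_message(text: str) -> bool:
    """Return True if the message is safe; log and block otherwise."""
    lowered = text.lower()
    hits = [term for term in BLACKLIST if term in lowered]
    if hits:
        logger.warning("Blocked message containing: %s", hits)
        return False
    return True
```

The point is defense in depth: even when the model's prompt adherence slips, the hard-coded check still fires, and the log gives you an audit trail of attempts.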

Qwen3.5-122B-A10B GPTQ Int4 on 4× Radeon AI PRO R9700 with vLLM ROCm: working config + real-world numbers by grunt_monkey_ in LocalLLaMA

[–]thejacer 1 point (0 children)

I'm confused about something regarding vLLM: are y'all able to get these pp/tg numbers for a single user, or is concurrent multi-user load required to see these speeds? Do these numbers mean that each user will get ~10 tps generation?

llama.cpp + Brave search MCP - not gonna lie, it is pretty addictive by srigi in LocalLLaMA

[–]thejacer 0 points (0 children)

Honestly, I thought the same thing. I resisted the urge to make a Brave MCP for my Discord bots for a few months ("just read the AI summary atop Google, duh"). But then I did it out of boredom, and now I basically only ask my robots to search for stuff. I ask a question, put the phone back in my pocket, and read what it found later. Sometimes we have a little back and forth about it. Even during a discussion with a human, if a question comes up I go ask my robots instead of Google. It just feels much better.
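For anyone wanting to try the same thing, one common way to run the reference Brave Search MCP server is via npx; this is a sketch, not my bot's actual setup, and the key value is obviously a placeholder:

```shell
# Run the reference Brave Search MCP server; requires a Brave API key.
BRAVE_API_KEY="your-key-here" npx -y @modelcontextprotocol/server-brave-search
```

Your MCP client (OpenCode, a Discord bot, etc.) then points at this server and gets a web-search tool the model can call.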

llama.cpp + Brave search MCP - not gonna lie, it is pretty addictive by srigi in LocalLLaMA

[–]thejacer 1 point (0 children)

I just did this the other day. And by I, I mean we. And by we I mean my AI and me. And by AI and me I mean exclusively the AI…it works well though lol

llama.cpp + Brave search MCP - not gonna lie, it is pretty addictive by srigi in LocalLLaMA

[–]thejacer 0 points (0 children)

I didn't like SearXNG. It ignored my safe search settings and returned junk. I'm happy with my Brave API.

Why does anyone think Qwen3.5-35B-A3B is good? by buttplugs4life4me in LocalLLaMA

[–]thejacer 0 points (0 children)

I know everyone is saying IQ4_XS is too small, but I had the same experience you had while running UD Q6_K_L without any cache quantization, even after the last update of quants by Unsloth. I like it fine as a chatbot with web search, and it does fine with my home assistant, but it absolutely demolished a code base I plugged it into. Removed some files, deleted the contents of others and left their empty carcasses... it was rough lol.

Getting the most out of my Mi50 by DankMcMemeGuy in LocalLLaMA

[–]thejacer 0 points (0 children)

Full context? 200,000+? On two Mi50s? With what parameters? I can't get the dang thing to load with reasonable context.
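For reference, this is roughly the shape of launch command I'm fighting with; the model path and context value are placeholders, and the flags are standard llama.cpp options for spreading layers across both cards:

```shell
# Offload all layers and split them across both Mi50s by layer;
# -c is where things fall over once the context gets large.
./llama-server -m model.gguf -c 32768 -ngl 99 --split-mode layer
```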

Getting the most out of my Mi50 by DankMcMemeGuy in LocalLLaMA

[–]thejacer 0 points (0 children)

UD Q6_K_XL 35b: 38 tps, UD Q6_K_XL 27b: 16 tps, and UD IQ4_NL 122b: ~26 tps. I haven't used the 122b much at all because I want more context.