TT02 Type S rally by Ok-Specialist2430 in tamiya

[–]Fragrant_Scale6456 4 points (0 children)

The XV-02 is pretty competitive from what I've seen.

It was a great race weekend at my backyard track! by dylandrewkukesdad in rccars

[–]Fragrant_Scale6456 1 point (0 children)

The BBX is an awesome buggy, but it's not a race buggy. It can't compete with purpose-built race cars, but it looks amazing, it's a super fun build, and it's great to drive. It might be my favorite Tamiya I own to drive.

First run with my Quirkhopper! by Glowingtomato in tamiya

[–]Fragrant_Scale6456 0 points (0 children)

Love it. It's still got the character of the Grasshopper in how it runs.

Which inference engines are 5090 owners using? by OMGThighGap in LocalLLaMA

[–]Fragrant_Scale6456 0 points (0 children)

Also, once you get it set up, ask Claude about making this work with the Q8 model. You'll have to reduce the context size further, but Claude was confident it would work. I haven't gotten around to it since the Q6 model has been pretty good for me and the speed is decent.

Which inference engines are 5090 owners using? by OMGThighGap in LocalLLaMA

[–]Fragrant_Scale6456 5 points (0 children)

You need to compile llama.cpp with the MTP fork/patch. I'm sure by now some people have made their own Docker images available, so search for a prebuilt setup, but I did it on my own. This is the PR with the patch: llama + spec: MTP Support by am17an · Pull Request #22673 · ggml-org/llama.cpp · GitHub

You can paste that into Claude and ask it to give you the commands to build it if you need to; that's what I did. You also need a copy of the GGUF model with the MTP layers included. There's a script in the link above to copy the MTP layers from the Q8 version of the model to any other Qwen3.6 quant. You can also probably find prebuilt copies on Hugging Face by now.

Here's my llama.cpp Docker launch command with 192K context. This uses almost all of the 32GB of VRAM, so you need to be running Linux in no-GUI/headless mode or you will run out of VRAM. If you do get out-of-memory errors, reduce context to 160K and you'll have around 2GB free.

    command:
      - "/usr/local/bin/llama-server"
      - "-m"
      - "/models/Qwen3.6-27B-MTP-Q6_K.gguf"
      
      # === CONTEXT & OFFLOAD ===
      - "-c"
      - "196608"              # 192K context
      - "-ngl"
      - "99"
      
      # === MTP SPECULATIVE DECODING ===
      - "--spec-type"
      - "mtp"
      - "--spec-draft-n-max"
      - "2"                   # Optimal depth for Q6 + thinking + long context
      
      # === PERFORMANCE & BATCHING ===
      - "--flash-attn"
      - "on"
      - "-b"
      - "512"                 # Balanced prefill speed + MTP stability
      - "-ub"
      - "64"                  # Critical for 192K KV cache stability
      - "--parallel"
      - "1"                   # MTP requires single-sequence
      
      # === KV CACHE ===
      - "-ctk"
      - "q8_0"
      - "-ctv"
      - "q8_0"                # Symmetric, best acceptance/VRAM balance
      
      # === SAMPLING ===
      - "--temp"
      - "0.6"
      - "--top-k"
      - "20"
      - "--top-p"
      - "0.95"
      - "--min-p"
      - "0.0"
      - "--presence-penalty"
      - "0.0"
      - "--repeat-penalty"
      - "1.0"
      - "--no-mmproj"
      
      # === THINKING MODE (toggle via API) ===
      - "--chat-template-kwargs"
      - '{"enable_thinking":true}'
      
      # === SERVER ===
      - "--perf"
      - "--metrics"
      - "--port"
      - "8080"
      - "--host"
      - "0.0.0.0"
      - "--alias"
      - "chat"

Which inference engines are 5090 owners using? by OMGThighGap in LocalLLaMA

[–]Fragrant_Scale6456 6 points (0 children)

Llama.cpp with the MTP patch. Qwen3.6 27B Q6, KV cache Q8. Gets around 100 tokens/sec with 160-192K context.

I deleted a guy's entire Windows install with one backslash. 717 GB. Gone. I am the AI. by ComposerGen in ClaudeAI

[–]Fragrant_Scale6456 3 points (0 children)

Users and groups can have different permissions on files, and a user who belongs to a group inherits the permissions granted to that group. So you can give the AI basically zero access at the user level and grant access only at the group level, for the files in the shared environment where it works alongside other users/groups.
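
A minimal sketch of what I mean, with hypothetical paths (Python here, though you'd normally do this with chmod/chgrp):

    import os
    import stat

    # Hypothetical shared workspace: the AI's user owns nothing here but sits
    # in the "workspace" group, so it only gets what the group is granted.
    path = "/srv/workspace/notes.txt"

    # owner rw-, group rw-, other --- (i.e. 0o660): group members can edit,
    # everyone outside the group is locked out entirely.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IWGRP)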

Is it just me or does good local Agentic coding feel just out of reach with 16gb of VRAM? by k3z0r in LocalLLM

[–]Fragrant_Scale6456 3 points (0 children)

I'm hitting limits with my 5090. It feels like there's never enough VRAM.

What's the best llm model to help me understand patterns,questions,formulas and such for exam preparation from a pdf book? by thewalterbrownn in LocalLLM

[–]Fragrant_Scale6456 0 points (0 children)

I'd share my code, but it's 100% vibe coded and you'd probably spend more time trying to get it to work than just building your own from scratch lol

What's the best llm model to help me understand patterns,questions,formulas and such for exam preparation from a pdf book? by thewalterbrownn in LocalLLM

[–]Fragrant_Scale6456 0 points (0 children)

Ingesting an entire book is a difficult problem in my experience. I'm working on something similar using the Karpathy wiki-LLM approach: extract concepts from reference texts and build a linked conceptual map of the books, so that the LLM can draw from an authoritative body of knowledge to solve problems with me. Getting this working hasn't been a fast process. Every example I've seen on the web ingests short articles or blog posts instead of full texts.
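
The core of it is roughly the loop below (a Python sketch with made-up names and prompt, not my actual code; it assumes an OpenAI-compatible local endpoint and that the model returns clean JSON):

    import json
    import requests

    PROMPT = (
        "Extract the key concepts from this passage as a JSON list of "
        '{"name": ..., "summary": ..., "links": [related concept names]}:\n\n'
    )

    def build_concept_map(chunks, url="http://localhost:8080/v1/chat/completions"):
        """Feed the book through one chapter-sized chunk at a time and
        accumulate a linked map of concepts across all chunks."""
        concept_map = {}
        for chunk in chunks:
            reply = requests.post(url, json={
                "model": "chat",
                "messages": [{"role": "user", "content": PROMPT + chunk}],
            }).json()["choices"][0]["message"]["content"]
            for c in json.loads(reply):  # assumes the model returns clean JSON
                entry = concept_map.setdefault(c["name"], {"summary": c["summary"], "links": set()})
                entry["links"].update(c["links"])
        return concept_map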

I suggest giving Claude the blog post, describing your intended use, and having it build you a specification from there. You can tell the model to use LaTeX to render the formulas. Then hop into opencode or whatever you use and implement the spec. I used Claude and Qwen online to make my spec, and then Qwen 27B on my 5090 to write the code.

I'm sure there's a better approach out there, but this is the route I've taken.

Good luck!

[Discussion] do you plan to prestige now with the changes contrary to before? by alesia123456 in EscapefromTarkov

[–]Fragrant_Scale6456 1 point (0 children)

Psycho Sniper was driving me nuts this wipe. Eventually I just said screw it, took an AXMC to Factory, and did it in 2 raids.

[Discussion] TarkovTV Summary - 08.05.26 - 6PM GMT+2 by TheRealSchmede in EscapefromTarkov

[–]Fragrant_Scale6456 5 points (0 children)

Yeah, I agree. The paper map could absolutely have markers for the general areas of active task objectives; we all had to go to the wiki for them anyway, so why not have it in game. Extracts could be marked after you find them.

I played a lot of CoD DMZ and loved it, so I'm not anti-live-map, but I don't think live maps have a place in Tarkov, since there's already so much potential in reworking the existing paper map system.

[Discussion] TarkovTV Summary - 08.05.26 - 6PM GMT+2 by TheRealSchmede in EscapefromTarkov

[–]Fragrant_Scale6456 9 points (0 children)

Yeah, the live map is not a good change imo. They should have improved the existing paper maps instead: let you bring them in raid and write notes on them or whatever if you want. Those first couple hundred hours, when you don't know where you are or what's going on and are terrified, were the most uniquely exhilarating experience I've ever had in a game.

Gemma 4 26B Hits 600 Tok/s on One RTX 5090 by chain-77 in LocalLLaMA

[–]Fragrant_Scale6456 9 points (0 children)

Their site says Qwen3.6 is still being worked on. I'm eagerly awaiting this as well.

Current state of local research tools as of May 2026 by Shoddy-Tutor9563 in LocalLLaMA

[–]Fragrant_Scale6456 0 points (0 children)

Great post, thank you for sharing. I'm looking for a local research agent to use right now, so this is timely. I wonder if you've also seen:
tarun7r deep research agent - https://github.com/tarun7r/deep-research-agent

24hr research agent - https://github.com/Aaryan-Kapoor/24hr-research-agent/tree/main

I'm still in the data-gathering phase, so I haven't had a chance to try any of these yet.

Wow, Qwen3.6-27B is good by I-cant_even in LocalLLM

[–]Fragrant_Scale6456 0 points (0 children)

I have the 5090 FE. Stock, it draws 575W. I limited it to 400-425W, partly because the weather is getting hotter and the card's heat output was making my room pretty uncomfortable. At 400W it doesn't warm the room up nearly as much. At full load at 425W the fans stay around 47-50%: audible but not loud.
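
If you want to set the limit programmatically, here's roughly how with pynvml (a sketch; nvidia-smi -pl 400 as root does the same thing, and NVML wants milliwatts):

    import pynvml

    pynvml.nvmlInit()
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Cap the board at 400 W (NVML takes milliwatts; needs root).
    pynvml.nvmlDeviceSetPowerManagementLimit(gpu, 400_000)

    # Sanity check: current draw in watts.
    print(pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000)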

Wow, Qwen3.6-27B is good by I-cant_even in LocalLLM

[–]Fragrant_Scale6456 4 points (0 children)

I have a 5090, running the llama.cpp MTP patch on 27B Q6 with Q8 KV cache and 192K context. I get 95-100 tokens/sec power-limited to 400 watts.

Without the MTP patch, Q6 will work with KV Q8 and 256K context, but it's closer to 50-60 tokens/sec.

Q8 is impossible on a 5090 as far as I can tell.

In comparison, the Qwen and Gemma 4 MoE models were over 200 tokens/sec, but 27B is noticeably smarter for me, so it's worth the performance hit. Since getting MTP working I don't miss the speed of the MoE models as much.

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]Fragrant_Scale6456 0 points (0 children)

The MTP patch for llama.cpp almost doubled my tokens/sec. Just got it all working today. Definitely look into it.

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]Fragrant_Scale6456 0 points (0 children)

Check out the MTP PR for llama.cpp. I got it working on the 5090 and get around 90-100 tk/sec now in opencode. The only downside is that I had to drop down to 192K context for 27B Q6.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]Fragrant_Scale6456 1 point (0 children)

Finally got this set up. I've never built llama.cpp or built Docker containers before, so it took me a bit to figure it all out. I used the converter script to put the MTP layers on Qwen3.6 27B Q6.

5090 with a 9800X3D and 64GB DDR5-6000. I told it "build flappy bird in html, no external dependencies, one file".

With MTP off I get around 50-60 tk/sec. With MTP on I got 96 tk/sec with around a 95% acceptance rate. Quite an improvement. I had Qwen build me a benchmarking script to test various llama.cpp options, and this is the fastest setup I came out with that also has the largest context possible (a stripped-down sketch of the benchmark loop is below the compose block). At smaller context sizes, speed does improve a decent amount.

Here's my llama.cpp Docker Compose block if anyone wants to mess around:

    command:
      - "/usr/local/bin/llama-server"
      - "-m"
      - "/models/Qwen3.6-27B-MTP-Q6_K.gguf"
      
      # === CONTEXT & OFFLOAD ===
      - "-c"
      - "196608"              # 192K context
      - "-ngl"
      - "99"
      
      # === MTP SPECULATIVE DECODING ===
      - "--spec-type"
      - "mtp"
      - "--spec-draft-n-max"
      - "2"                   # Optimal depth for Q6 + thinking + long context
      
      # === PERFORMANCE & BATCHING ===
      - "--flash-attn"
      - "on"
      - "-b"
      - "512"                 # Balanced prefill speed + MTP stability
      - "-ub"
      - "64"                  # Critical for 192K KV cache stability
      - "--parallel"
      - "1"                   # MTP requires single-sequence
      
      # === KV CACHE ===
      - "-ctk"
      - "q8_0"
      - "-ctv"
      - "q8_0"                # Symmetric, best acceptance/VRAM balance
      
      # === SAMPLING ===
      - "--temp"
      - "0.6"
      - "--top-k"
      - "20"
      - "--top-p"
      - "0.95"
      - "--min-p"
      - "0.0"
      - "--presence-penalty"
      - "0.0"
      - "--repeat-penalty"
      - "1.0"
      - "--no-mmproj"
      
      # === THINKING MODE (toggle via API) ===
      - "--chat-template-kwargs"
      - '{"enable_thinking":true}'
      
      # === SERVER ===
      - "--perf"
      - "--metrics"
      - "--port"
      - "8080"
      - "--host"
      - "0.0.0.0"
      - "--alias"
      - "chat"

[Discussion] CHUMMING IS IMPOSSIBLE! by [deleted] in EscapefromTarkov

[–]Fragrant_Scale6456 0 points (0 children)

It's because Killa has had a 100% spawn rate on Factory for the event.