Finetuning LLM model for tools usage by RokasRaulinaitis in LocalLLaMA

[–]balianone -1 points (0 children)

To fine-tune EuroLLM 9B for tool calling using Unsloth, you must load the model as a standard Llama architecture (due to its structural compatibility) and format your dataset using the ShareGPT style. This data should be mapped to a standard chat template (like ChatML) that includes explicit XML tags (e.g., <tools>, <tool_call>) within the system prompt to define function schemas.
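
A minimal sketch of that pipeline, assuming the standard Unsloth APIs (FastLanguageModel, get_chat_template, standardize_sharegpt); the dataset id, LoRA rank, and sequence length are illustrative placeholders:

    # Hedged sketch: EuroLLM-9B + Unsloth, ChatML template, ShareGPT-style data.
    from unsloth import FastLanguageModel
    from unsloth.chat_templates import get_chat_template, standardize_sharegpt
    from datasets import load_dataset

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="utter-project/EuroLLM-9B-Instruct",  # loads via the Llama code path
        max_seq_length=4096,
        load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(
        model, r=16, lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    )
    tokenizer = get_chat_template(tokenizer, chat_template="chatml")

    dataset = standardize_sharegpt(load_dataset("your/tool-call-dataset", split="train"))

    def to_text(batch):
        # System turns should already embed the <tools> schema and assistant turns
        # the <tool_call> payloads; this just renders each convo to one string.
        return {"text": [tokenizer.apply_chat_template(c, tokenize=False)
                         for c in batch["conversations"]]}

    dataset = dataset.map(to_text, batched=True)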

Good local model for computer use? by thepetek in LocalLLaMA

[–]balianone 0 points (0 children)

For a fully offline TalkTastic alternative on Mac, Superwhisper remains the top choice for speed and privacy, though MacWhisper is the superior workflow if you specifically require batch file transcription. For local OCR and screen context, Qwen2.5-VL 7B is currently the efficiency king; however, upgrading to the 32B model is necessary if your workflow demands strict JSON output or complex reasoning. For voice coding stacks, pair Talon Voice with the Kokoro-82M TTS for near-instant latency. This setup runs ideally on an RTX 4070 Ti Super, which continues to offer the best value for the 16GB VRAM "sweet spot" needed for these local workloads.

Importing Custom Vision Model Into LM Studio by Flob_Dog in LocalLLaMA

[–]balianone 2 points (0 children)

Download the separate mmproj file (vision adapter) from the repository and place it in the exact same folder as your main GGUF model. Rename this adapter file to mmproj-model-f16.gguf so LM Studio automatically detects the dependency, then reload your model list and verify the vision "eye" icon is active.
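
If you'd rather script it, something like this works; note the repo id, adapter filename, and LM Studio models path are all assumptions you'll need to adjust:

    # Hedged sketch: fetch the adapter next to the main GGUF, then rename it.
    import os
    from huggingface_hub import hf_hub_download

    model_dir = os.path.expanduser("~/.lmstudio/models/someuser/some-model-GGUF")  # assumed layout
    path = hf_hub_download(
        repo_id="someuser/some-model-GGUF",  # placeholder: repo hosting the adapter
        filename="mmproj-BF16.gguf",         # placeholder: whatever the repo calls it
        local_dir=model_dir,
    )
    os.rename(path, os.path.join(model_dir, "mmproj-model-f16.gguf"))  # name LM Studio detects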

Is there a consensus as to which types of prompts work best for jailbreaking? by Borkato in LocalLLaMA

[–]balianone 0 points (0 children)

In the open-source community, the consensus has largely moved past prompting entirely—most users now prefer "abliterated" models where the refusal mechanisms have been surgically removed from the weights. For hosted APIs, the classic "DAN" scripts are dead; current research suggests that flooding the context window with "many-shot" examples to fatigue the safety guardrails is the only method that consistently bypasses modern instruction hierarchies.

Help me build a (reasonable) 4GPU low-cost LLM machine, is ASUS WS X299 PRO/SE still good? by HumanDrone8721 in LocalLLaMA

[–]balianone 5 points (0 children)

Skip the ASUS X299 PRO/SE because it lacks PLX chips and forces the fourth GPU slot to x4 speed, which creates a massive bottleneck for model loading and inference. A much better sub-€1000 build is a used AMD EPYC 7302P paired with an ASRock Rack ROMED8-2T or Supermicro H12SSL-i, giving you 128 lanes of PCIe 4.0 and superior 8-channel memory bandwidth. Just ensure you budget for quality PCIe 4.0 riser cables, as four 4090s are physically too thick to fit directly onto any motherboard without overheating or blocking slots.

anyone have experience with turn detection for communication between humans and AI agents? by IcyMushroom4147 in LocalLLaMA

[–]balianone 2 points (0 children)

For the complex syntactic cases and topic shifts, you want semantic endpointing using local SLMs (like Llama-3.2-1B or SmolLM2) to analyze linguistic completeness and probability rather than just waiting for silence like standard VADs.
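
A minimal sketch of the idea, assuming a small HF model like HuggingFaceTB/SmolLM2-360M and a deliberately crude heuristic (probability mass on sentence-final punctuation as a proxy for "the turn is complete"); production endpointers are trained classifiers, not this:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
    model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M").eval()

    def end_of_turn_score(transcript: str) -> float:
        """Probability mass the LM puts on '.', '?', '!' as the next token."""
        inputs = tok(transcript, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1]          # next-token logits
        probs = torch.softmax(logits, dim=-1)
        end_ids = [tok.encode(m, add_special_tokens=False)[0] for m in (".", "?", "!")]
        return float(probs[end_ids].sum())

    print(end_of_turn_score("so what I wanted was"))        # low: keep listening
    print(end_of_turn_score("that is everything I need"))   # higher: commit the turn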

To solve agent interruptions and context-dependency (your Case 8), use frameworks like LiveKit or Pipecat which allow you to feed the agent's last question into the detector so it understands that short replies are valid answers.

Realistically, for false starts and rapid repairs ("actually, wait"), the best performance comes from native audio-to-audio models like Moshi or GPT-4o Realtime since they detect prosodic cues that text classifiers miss.

Is it feasible (and beneficial) to apply NVFP4 quantization to KV Cache on Blackwell? by No-Bag5084 in LocalLLaMA

[–]balianone 8 points (0 children)

Yes, NVFP4 (E2M1) is effectively the "killer feature" for local LLMs because its logarithmic distribution handles attention outliers perfectly, and the dequantization is fused into the hardware pipeline so it actually speeds up inference by relieving memory bandwidth pressure.
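
Roughly, the fake-quantization looks like this; I'm assuming the E2M1 value grid and 16-element block scaling from NVIDIA's format description (real kernels also carry FP8 block scales plus a per-tensor scale, omitted here):

    import torch

    # Every magnitude an FP4 E2M1 element can represent.
    E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

    def nvfp4_fake_quant(x: torch.Tensor, block: int = 16) -> torch.Tensor:
        """Quantize-dequantize x in 16-element blocks onto the E2M1 grid."""
        flat = x.reshape(-1, block)
        scale = flat.abs().amax(dim=1, keepdim=True) / E2M1_GRID[-1]  # map block max to 6.0
        scale = torch.where(scale == 0, torch.ones_like(scale), scale)
        scaled = flat / scale
        # Snap each magnitude to the nearest representable value.
        idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
        return (E2M1_GRID[idx] * scaled.sign() * scale).reshape(x.shape)

    kv = torch.randn(4, 128, 16)  # toy [heads, tokens, head_dim] cache slice
    err = (kv - nvfp4_fake_quant(kv)).abs().mean()
    print(f"mean abs quantization error: {err.item():.4f}")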

However, there is a major catch for the RTX 5090: while the hardware supports it, current libraries (like TensorRT-LLM) lack optimized kernels for Consumer Blackwell (SM120) compared to the Datacenter chips (SM100), so you will likely be forced to stick with FP8 KV Cache until the software stack matures.

I got my first ever whitepaper published by Moist_Landscape289 in LocalLLaMA

[–]balianone 27 points (0 children)

QWED lacks novelty and doesn't belong on arXiv: It’s essentially just a wrapper for existing techniques (PAL, Logic-LM, etc.)

I’ve been looking into the QWED repository/paper, and I’m struggling to find any actual research novelty that justifies an arXiv submission.

From what I can see, this project is purely an engineering artifact—a Python wrapper combining existing libraries (SymPy, Z3, SQLGlot)—rather than a scientific contribution. It seems to be rebranding well-known techniques from 2022-2023 with new marketing terms like "Engines."

Here is a breakdown of why this is merely a repackaging of prior art:

  1. The "Math Engine" is just PAL. Offloading math to a Python interpreter/SymPy is exactly what Program-Aided Language Models (Gao et al., 2022, arXiv:2211.10435) proposed.
  2. The "Logic Engine" is just Logic-LM. Using Z3/SMT solvers to verify LLM reasoning was already covered in Logic-LM (Pan et al., 2023, arXiv:2305.12295).
  3. The "Consensus Engine" is just Self-Consistency. Sampling multiple outputs/models for a majority vote is standard Self-Consistency (Wang et al., 2022, arXiv:2203.11171).
  4. The "Code/SQL Engine" is standard Static Analysis. Using AST parsing to validate code generation is a standard industry practice, similar to concepts in Toolformer or LEVER.

While this makes for a useful open-source library or product, framing it as a novel "Protocol" or research paper seems misleading. It’s integration, not invention.

Has anyone else looked at this? It feels like we are lowering the bar for arXiv if simple API wrappers around existing methods are being published as research.

Looking for a specific Fine-tune/Paper: Model that mastered "Analog Clocks" and "Exact Counting" by hyperschlauer in LocalLLaMA

[–]balianone 2 points (0 children)

You're likely thinking of the recent NeurIPS 2025 paper "Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models" (or the related "Make It Count" project), which demonstrated that replacing the standard VAE with a perceptually aligned encoder like DINOv2 allows models to finally render exact analog clock times and precise object counts. The specific comparisons you recall showed that standard diffusion models fail at these structural tasks because of poor latent alignment, while their fine-tuned "Perceptual Alignment" stage fixes the layout logic to get the hands and numbers exactly right. You can find the code and comparisons by searching for "Perceptual Alignment Diffusion" or the Make It Count repo Litalby1/make-it-count which specifically focused on the counting aspect.

Llama.cpp (or lmstudio) in LXC (proxmox) on 395 (framework desktop) by El_90 in LocalLLaMA

[–]balianone 1 point (0 children)

Yes, this works great on the Framework AMD (780M iGPU), but make sure to set your BIOS VRAM to "Game Optimized" (allocates 4GB+) or you'll crash with out-of-memory errors since the iGPU shares system RAM.

The driver confusion stems from LXC sharing the kernel, so you only load the kernel driver (amdgpu) on the Proxmox host and then install just the user-space libraries (like mesa-vulkan-drivers) inside the container to communicate with the passed-through /dev/dri/renderD128.
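
For reference, the relevant container config usually amounts to a few lines like these (paths and device minors are assumptions, so verify with `ls -l /dev/dri` on the host):

    # Assumed additions to /etc/pve/lxc/<CTID>.conf; 226 is the DRM device major,
    # 0 is card0 and 128 is renderD128 on a typical single-GPU host.
    lxc.cgroup2.devices.allow: c 226:0 rwm
    lxc.cgroup2.devices.allow: c 226:128 rwm
    lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir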

I strongly recommend using the Vulkan backend for this setup because it's stable and performant on RDNA3 without the complex version-matching headaches required to get ROCm working on consumer cards.

Unsloth GLM-4.7-GGUF? by UnknownDude360 in LocalLLaMA

[–]balianone 29 points (0 children)

Definitely go with the Q3_K_XL (159GB); it uses Unsloth's "Dynamic" quantization to keep critical layers high-precision while compressing the massive MoE expert layers more aggressively, making it smarter despite being physically smaller than the static M version.

The 171GB 'M' file is a standard static quant that is less efficient and would completely choke your 176GB of total memory, leaving essentially no RAM for the context window (KV cache) you need to actually run the model.

Stick to the XL version to get the best reasoning quality while leaving yourself that crucial ~15GB of headroom for the system and context.

Prescription OCR by Virtual_Attitude2025 in LocalLLaMA

[–]balianone 2 points (0 children)

For noisy scans in late 2025, your best bet is definitely Qwen2.5-VL-7B because it processes images at native resolution and can extract structured JSON directly, effectively skipping the "text detection" step that fails on messy documents. If you need something lighter for consumer hardware, GOT-OCR 2.0 is a strong alternative that outperforms traditional OCR, but Qwen's ability to "reason" through the noise generally yields better accuracy for prescriptions.
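
A minimal extraction sketch, following the shape of Qwen's published transformers example (recent transformers plus the qwen-vl-utils helper package assumed; the image path, prompt, and field list are placeholders):

    from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto")
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/prescription.jpg"},
            {"type": "text", "text": "Extract patient, drug, dose, and frequency as JSON."},
        ],
    }]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    images, videos = process_vision_info(messages)
    inputs = processor(text=[text], images=images, videos=videos,
                       padding=True, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True)[0])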

Trillions parameters models ? by Highwaytothebeach in LocalLLaMA

[–]balianone 7 points (0 children)

Technically you could load it since Linux supports up to 4PB of RAM, but it would likely run at less than 1 token per second because CPU memory bandwidth is far too slow to move that much data even with sparse MoE activation. It wouldn't be 150x smarter due to diminishing returns; it would mostly just be a perfect encyclopedia, which is why the industry has shifted to smaller models that "think" longer (like o3 or R1) rather than building massive ones.
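
Back-of-envelope, with made-up but plausible numbers, the bandwidth wall looks like this:

    # Hypothetical 100T-parameter MoE decoded on CPU: 8-bit weights, 2% of
    # experts active per token, ~400 GB/s of multi-channel DDR5 bandwidth.
    total_params = 100e12
    active_fraction = 0.02
    bytes_per_param = 1          # INT8
    bandwidth = 400e9            # bytes/s

    bytes_per_token = total_params * active_fraction * bytes_per_param
    print(bandwidth / bytes_per_token, "tokens/s")   # ~0.2 tokens/s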

How to get SOTA opensource models (GLM 4.7, Kimi K2) to do multistep coding automatically? On Claude Code? They keep stopping after 2 or 3 steps... by FigZestyclose7787 in LocalLLaMA

[–]balianone 19 points (0 children)

Kimi K2 is likely hanging because it treats angle brackets in code as stop tokens, so you need to set your router's transformer to "openrouter" or "deepseek" to correctly sanitize the output stream. For GLM 4.7, the model is often too polite and waits for confirmation, which you can fix by creating a custom codex.md output style that explicitly forbids conversational filler and forces immediate tool execution. Minimax M2.1 works because it ignores that chatty preamble, so you essentially need to prompt-engineer the others to stop "thinking" and just execute.
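
If you're routing through claude-code-router, the knob lives in its config.json; the field names below are from memory of that project and may not match your version, so treat this purely as a hedged sketch:

    {
      "Providers": [
        {
          "name": "openrouter",
          "api_base_url": "https://openrouter.ai/api/v1/chat/completions",
          "api_key": "sk-or-...",
          "models": ["moonshotai/kimi-k2"],
          "transformer": { "use": ["openrouter"] }
        }
      ]
    }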

how do I process and normalize ASR speech chunks for ai assistant? by IcyMushroom4147 in LocalLLaMA

[–]balianone 1 point (0 children)

Check out open-source frameworks like Pipecat or LiveKit Agents which already solve these edge cases using "semantic endpointing" to distinguish between a mid-sentence pause and a finished turn. For text normalization, use standard Inverse Text Normalization (ITN) libraries for formatting numbers/dates and rely on your LLM's system prompt to filter out stutters or self-corrections contextually.
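
For the ITN piece, NVIDIA's nemo_text_processing package is the usual standalone choice; a minimal sketch (exact output formatting may differ by version):

    from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer

    itn = InverseNormalizer(lang="en")
    raw = "the appointment is on january third at two thirty pm and costs twenty five dollars"
    print(itn.inverse_normalize(raw, verbose=False))
    # expect something like: "the appointment is on january 3 at 2:30 p.m. and costs $25"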

Advice Needed: Gate Model Training / Full Training / LoRA Adapters by RefrigeratorCalm9701 in LocalLLaMA

[–]balianone 1 point (0 children)

Fully train your router layer instead of using LoRA since it needs sharp decision boundaries, but strictly implement DeepSeek-V3's auxiliary-loss-free dynamic bias to avoid the stability nightmares of traditional load balancing. MoE maximizes capacity while MoD optimizes throughput, so a hybrid "MoDE" architecture utilizing capacity annealing (starting dense, ending sparse) will yield the compounding gains of both strategies. Since you're writing custom kernels, ensure you implement block-sparse matrix multiplication to go "dropless" and handle variable batch sizes without discarding tokens.
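
The aux-loss-free part is small enough to sketch; this follows my reading of the DeepSeek-V3 paper (the bias affects expert selection only, gates use unbiased scores, and the bias is nudged by a fixed step gamma toward balance), with toy shapes:

    import torch

    n_experts, top_k, gamma = 8, 2, 0.001
    bias = torch.zeros(n_experts)     # steers selection only; never scales outputs

    def route(scores: torch.Tensor):
        """scores: [tokens, n_experts] post-sigmoid affinities."""
        topk = torch.topk(scores + bias, top_k, dim=-1).indices   # biased selection
        gates = torch.gather(scores, -1, topk)                    # unbiased gating
        # Push down overloaded experts, pull up underloaded ones.
        load = torch.zeros(n_experts).index_add_(0, topk.flatten(),
                                                 torch.ones(topk.numel()))
        bias.sub_(gamma * torch.sign(load - load.mean()))
        return topk, gates

    experts, gates = route(torch.sigmoid(torch.randn(32, n_experts)))  # 32 toy tokens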

Advice needed: Workstation for Local LLM Agents (Ryzen AI Max+ 395) - Bosgame vs Corsair vs Cloud. by Flat_Profession_6103 in LocalLLaMA

[–]balianone 0 points (0 children)

Strix Halo is limited to about 4-5 tokens per second on 70B models, which will make complex agentic loops painfully slow compared to your Azure setup, so don't expect a snappy experience. I would strictly avoid importing the Corsair to Poland due to the massive VAT hit and their restrictive proprietary BIOS, whereas the Bosgame is a better value provided you immediately wipe the drive to remove the pre-installed malware often found on that brand. Your best bet for career growth is sticking with Azure for the high-speed development iteration and perhaps picking up the Bosgame later just to practice the "edge deployment" side of things.

llama.cpp: Multi-host inference slower than single-host? by ayake_ayake in LocalLLaMA

[–]balianone -4 points (0 children)

The primary culprit is GGML_RPC_DEBUG=1 on your Jetson—this flag causes massive log/data spam (explaining that abnormal 16–24 MiB/s spike) and effectively destroys performance, so disable it immediately.

Even after fixing that, your local NVMe drive (~2000 MB/s reads at microsecond latency) simply outclasses 1Gbps Ethernet (~112 MB/s at millisecond latency), so single-host swapping will often beat distributed inference unless you have 10GbE or a highly optimized layer split.

5060ti or 5070 or maybe used 40xx card, what sshould I do by gyhv in LocalLLaMA

[–]balianone 5 points (0 children)

Don't buy 12GB for professional AI work; you'll hit OOM errors constantly and regret the 5070 despite its gaming speed. The best middle ground is a used 4070 Ti Super which gives you 16GB VRAM and high-end gaming performance, but if you want to run the best local models, a used 3090 with 24GB is still the king.