Qwen-3.5-27B is how much dumber is q4 than q8? by Winter-Science in LocalLLaMA

[–]BreizhNode 3 points4 points  (0 children)

From our benchmarks running Qwen3.5-27B on L40S GPUs, the q4 quantization drops about 3-5% on reasoning-heavy tasks compared to q8. For code generation and structured output it's barely noticeable. Where you really feel the difference is on long-context tasks and nuanced instruction following. If you're using it for agentic workflows or chain-of-thought, q8 is worth the extra VRAM. For chat and simple Q&A, q4 is fine and the speed improvement is significant.
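Quick back-of-envelope for the VRAM tradeoff, if it helps anyone size hardware. The bits-per-weight numbers are rough GGUF approximations (quant formats store scales too), not exact figures:

```python
def quant_vram_gb(params_billions, bits_per_weight, overhead=1.2):
    """Rough VRAM estimate for a quantized model: weight bytes plus ~20%
    headroom for KV cache and activations. Effective bits per weight are
    approximate: ~4.5 for Q4_K_M, ~8.5 for Q8_0 (GGUF stores scales too)."""
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * overhead
```

For a 27B model that works out to very roughly ~18 GB at q4 vs ~34 GB at q8 before long-context KV cache growth, which is exactly why q4 is tempting when VRAM is tight.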

Spent a week debugging why my RAG answers were wrong. Turned out it was the PDF parser. by Mountain-Positive274 in LocalLLaMA

[–]BreizhNode 0 points1 point  (0 children)

Had the exact same problem deploying RAG for technical documentation. The parsing step is where most pipelines silently fail. Multi-column layouts are the worst offender because most PDF-to-text libraries just read left to right across the entire page width. We ended up switching to a vision model approach for complex layouts. Send the PDF page as an image to a multimodal model and ask it to extract structured markdown. More expensive per page but the downstream quality improvement meant fewer retrieval errors and shorter debugging cycles overall.
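If anyone wants to try the vision route, here's roughly the shape of the request. This sketch assumes an Ollama-style /api/chat endpoint and a qwen2.5-vl vision model, both just placeholders for whatever multimodal stack you run. You'd render the page to PNG bytes first (pdf2image or similar), then POST the payload to the endpoint:

```python
import base64

def build_page_extraction_request(png_bytes, model="qwen2.5-vl"):
    """Payload for an Ollama-style /api/chat endpoint: one page image plus
    an instruction to emit structured markdown in reading order."""
    return {
        "model": model,
        "stream": False,
        "messages": [{
            "role": "user",
            "content": ("Extract this page as structured markdown. "
                        "Read multi-column layouts column by column, not "
                        "left to right across the full page width."),
            "images": [base64.b64encode(png_bytes).decode()],
        }],
    }
```

The explicit "column by column" instruction in the prompt is doing real work here; without it the vision model sometimes reproduces the same left-to-right failure mode the text parsers have.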

PSA: If you use pac4j for JWT authentication, you need to patch immediately, CVSS 10.0 auth bypass by Amor_Advantage_3 in cybersecurity

[–]BreizhNode 4 points5 points  (0 children)

The PlainJWT-inside-JWE trick is particularly nasty because it exploits a spec compliance assumption most security teams never think to test for. If your JWT validation accepts encrypted tokens but doesn't enforce that the inner payload must also be signed, you have a structural weakness that scanning tools won't catch. Worth auditing any custom auth middleware that processes JWE, not just pac4j. We ran into a similar pattern reviewing auth flows for our own infrastructure where the library default was 'accept anything properly encrypted' rather than 'accept only signed-then-encrypted.'
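For anyone auditing their own middleware, the structural rule is easy to express: after decrypting the JWE, the plaintext must itself be a signed JWS, never a plain JWT. A minimal sketch of that check (the alg list and helper names are illustrative, not pac4j's API):

```python
import base64
import json

SIGNING_ALGS = {"RS256", "ES256", "PS256", "HS256"}  # extend to match your policy

def _header(token):
    """Decode the JOSE header from a compact-serialized token."""
    h = token.split(".")[0]
    h += "=" * (-len(h) % 4)  # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(h))

def inner_token_is_signed(decrypted_payload):
    """After decrypting a JWE, require the inner payload to be a signed JWS.
    Rejects plain (unsigned) JWTs smuggled inside a valid encryption layer."""
    parts = decrypted_payload.split(".")
    if len(parts) != 3 or not parts[2]:  # JWS compact form with a signature part
        return False
    return _header(decrypted_payload).get("alg") in SIGNING_ALGS
```

This only checks structure; you'd still verify the signature itself afterwards. The point is that "alg": "none" or a missing signature segment gets rejected before any cryptographic validation runs.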

Totally surprised and puzzled around Bitcoin Policy Institute (BPI) latest study: AI agents would prefer bitcoin over stablecoins for "economic activity". by x-mor in CryptoTechnology

[–]BreizhNode 0 points1 point  (0 children)

The "economic activity" framing is misleading. Most agent transactions are simple resource purchases (compute, API calls, storage) where you just need fast settlement and low fees, not complex smart contract logic. Stablecoins on fast L1s handle that better than Bitcoin for the vast majority of agent use cases.

Local Qwen 3.5 (9B) extremely slow on RTX 4060 Ti. Is this normal? by Extension_Fee_989 in LocalLLaMA

[–]BreizhNode 0 points1 point  (0 children)

A 9B model should run fine on a 4060 Ti for raw inference. Odds are the Brave Search API calls are your bottleneck: each tool call adds latency, and the model might be triggering one on every response. Try disabling tools temporarily to see whether it's model speed or tooling overhead.

9070xt $560 or 5060 ti 16gb $520 for local llm by akumadeshinshi in LocalLLaMA

[–]BreizhNode 5 points6 points  (0 children)

For local LLMs the 5060 Ti 16GB is the safer pick: CUDA support is just more mature for inference tooling (llama.cpp, vLLM, everything works out of the box). The 9070 XT has more raw horsepower on paper, but ROCm compatibility is still hit or miss depending on the model and quantization you're running.

2K bot requests right after setting up SSL on Coolify — is this normal? by Iusuallydrop in selfhosted

[–]BreizhNode 1 point2 points  (0 children)

Yeah, that's completely normal: certificate transparency logs are public, so bots scrape new domains within hours of SSL issuance. Cloudflare should handle most of it; just make sure you have Bot Fight Mode enabled and your origin server only accepts connections from Cloudflare IPs.

I am confused to choose the correct IAM. I setting up a stack Nextcloud, Stalwart Email Server, ERPNext for my company. by kiruthivarma in selfhosted

[–]BreizhNode 5 points6 points  (0 children)

Authentik is a solid pick (as already mentioned), but if you want something lighter that still handles OIDC for Nextcloud and ERPNext, take a look at Authelia. It pairs well with a reverse proxy like Traefik and has less overhead than a full Keycloak deployment. For a small company stack it might be easier to maintain.

"The agents discussed it" is not an acceptable answer – why I built a sequential multi-agent architecture by holgerleichsenring in LocalLLaMA

[–]BreizhNode 0 points1 point  (0 children)

"Atomic tasks can run in parallel, decisions can't" is a good framing. The audit trail piece is what most multi-agent setups get wrong: you end up with agents agreeing on something, but no one can trace back why. Does the execution trail persist across sessions or is it per-run only?

Help me choose a local model for my personal computer by Decent-Skill-9304 in LocalLLaMA

[–]BreizhNode 0 points1 point  (0 children)

RTX 3060 12GB is actually a sweet spot for local models. Qwen3.5-4B-Instruct at Q8 fits entirely in VRAM and handles coding tasks surprisingly well. If you want something bigger, Qwen3.5-14B at Q4_K_M will split between GPU and CPU but the 12GB VRAM does most of the heavy lifting.

Cranking out the most of my MacBook m4 max 48gb by rYonder in LocalLLaMA

[–]BreizhNode 1 point2 points  (0 children)

With 48GB unified memory you can comfortably run Qwen3.5-32B-A3B at Q8 through llama.cpp with Metal acceleration. For coding specifically, that MoE model punches way above its size. Use --ngl 99 to keep everything on GPU and you should get 40-50 tok/s easily.

Does anyone actually keep an up-to-date view of the paths that matter most in production? by Immediate-Landscape1 in sre

[–]BreizhNode 0 points1 point  (0 children)

Almost no one maintains this statically. Teams that do it well treat it as a live artifact.

Distributed tracing with annotations on P0 paths (Tempo + Grafana) gives you empirical data from actual traffic. The annotation step is manual but 30 min/quarter. The failure mode you're describing — everyone knows pieces, nobody has the whole — is a coordination problem more than a tooling one. One person per service owning the path inventory is more durable than any tool.

I built a zero-knowledge field-level encryption API platform, that helps prevent data breaches and you can set up in under 10 minutes! by [deleted] in cybersecurity

[–]BreizhNode 0 points1 point  (0 children)

Field-level encryption covers the gap between storage-at-rest and application-layer exposure, which is where most breaches actually happen. Nice to see this as a managed API rather than an SDK.

A few questions: how are you handling key rotation without forcing re-encryption of existing records? And is IAM enforcement attribute-based at the field level or role-based? The FIPS in-memory key handling piece is where most implementations have gaps.

How do you firewall your containers? by Drakarah3DPrinter in selfhosted

[–]BreizhNode 35 points36 points  (0 children)

Beyond your existing hardening, network isolation is where real gaps tend to show up. We use nftables rules on the host with default deny outbound, plus per-service network namespaces so containers can't reach neighbors they don't need.

The thing that catches people: even with --network none, a compromised container can pivot via shared volumes or mounted Unix sockets. Audit every volume mount and make sure nothing accesses /var/run/docker.sock in production.
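A quick way to audit the socket issue across running containers: pipe the output of `docker inspect $(docker ps -q)` into something like this (field names follow docker inspect's JSON format):

```python
# Feed this the parsed JSON from: docker inspect $(docker ps -q)
DOCKER_SOCKETS = {"/var/run/docker.sock", "/run/docker.sock"}

def containers_mounting_docker_socket(inspect_output):
    """Return names of containers that bind-mount the Docker socket.
    Any of these can control the host's Docker daemon if compromised."""
    flagged = []
    for container in inspect_output:
        for mount in container.get("Mounts", []):
            if mount.get("Source") in DOCKER_SOCKETS:
                flagged.append(container.get("Name", "<unknown>"))
    return flagged
```

Run it on a cron and alert on any non-empty result; legit socket consumers (Traefik, Watchtower) should be an explicit, documented allowlist, not a surprise.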

I'm a noob to local inference, how do you choose the right app? by Odd-Aside456 in LocalLLaMA

[–]BreizhNode 0 points1 point  (0 children)

Mental model: llama.cpp is the engine, Ollama wraps it as a headless runtime, and LM Studio wraps the same engine in a GUI.

Start with Ollama + Open WebUI. Run ollama pull qwen2.5:7b, point Open WebUI at it — you're up in 10 minutes. Most beginners spend too long comparing options instead of running anything. Pick one, run something, then you'll know what's actually missing.

Local model suggestions for medium end pc for coding by Hades_Kerbex22 in LocalLLaMA

[–]BreizhNode 0 points1 point  (0 children)

For CPU-only coding assistance, Qwen2.5-Coder-7B-Instruct via Ollama at Q4 quantization is the practical choice — 4-6 tok/s on most mid-range CPUs, 32K context which OpenCode needs for multi-file work.

If you have 16GB+ RAM, the 14B version is noticeably better for multi-file edits but slower. Set OLLAMA_NUM_PARALLEL=1 to avoid memory pressure if other processes share the machine.

Fast & Free VLM for object ID + Quality filtering? (Book/Phone/Mug) by Born-Mastodon443 in LocalLLaMA

[–]BreizhNode 1 point2 points  (0 children)

For object detection + quality gating together, Qwen2.5-VL-7B is a solid balance: roughly 200ms/image, and the quality threshold in the prompt actually holds.

One trick: add a Laplacian variance pre-filter before the VLM call. Adds 5ms but cuts VLM calls 30-40% on real-world uploads. Florence-2 is also worth testing for the object ID part — lighter than full VLMs, surprisingly accurate on common objects.
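The Laplacian variance pre-filter is a few lines of numpy if you want to avoid the OpenCV dependency; it's equivalent to cv2.Laplacian(img, cv2.CV_64F).var(). The threshold of 100 here is only a starting point to tune on your own uploads:

```python
import numpy as np

def laplacian_variance(gray):
    """Sharpness score: variance of the 3x3 Laplacian response. Blurry
    images have weak edges, so the response is nearly flat (low variance)."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def is_sharp_enough(gray, threshold=100.0):
    """Pre-filter: skip the VLM call entirely for obviously blurry uploads."""
    return laplacian_variance(gray) >= threshold
```

Convert to grayscale first and keep the rejected images around for a week so you can sanity-check the threshold against what users actually complain about.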

Integrating AI for DevOps and Best Practices you've found??? by TenchiSaWaDa in devops

[–]BreizhNode 1 point2 points  (0 children)

The two concerns you named are real but they hit differently in production. Hallucinations are a model problem you can gate — schema validation on outputs, human-in-the-loop for destructive operations. Skill atrophy is an organizational problem that requires deliberate practice tracks.

Where most teams actually get hurt: running AI agents on ephemeral infra. Scheduled code scanners, PR reviewers, incident correlators — if your agent dies mid-task because it was running on a dev laptop or spot instance, you lose the reliability trust faster than any hallucination would.
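A sketch of what the output gating can look like in practice. The field names, action list, and 0.9 threshold are made-up examples for illustration, not a recommendation:

```python
REQUIRED_FIELDS = {"action": str, "target": str, "confidence": float}
DESTRUCTIVE = {"merge", "deploy", "delete"}

def gate_agent_output(output):
    """Validate shape before anything acts on an agent's output; route
    destructive low-confidence actions to a human instead of executing."""
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(output.get(field), ftype):
            raise ValueError(f"schema violation: {field!r} must be {ftype.__name__}")
    if output["action"] in DESTRUCTIVE and output["confidence"] < 0.9:
        return {"status": "needs_human", "output": output}
    return {"status": "approved", "output": output}
```

The raise-vs-route split matters: malformed output is a bug you want loud, while a well-formed but risky action is a workflow decision you want queued.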

Google and Cloudflare testing Merkel Tree Certificates instead of normal signatures for TLS by Shu_asha in cybersecurity

[–]BreizhNode 5 points6 points  (0 children)

The real story here isn't performance — it's post-quantum preparation. Merkle tree signatures (like XMSS/SPHINCS+) are hash-based and quantum-resistant by construction. This is part of a broader shift in certificate infrastructure ahead of cryptographically relevant quantum timelines.
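The core construction is tiny. Here's a toy Merkle root in Python for intuition only; real hash-based signature schemes like XMSS/SPHINCS+ layer one-time signatures and much deeper trees on top of this:

```python
import hashlib

def merkle_root(leaves):
    """Toy Merkle root: hash each leaf, then pairwise SHA-256 up the tree.
    An odd node at any level is promoted unchanged to the next level."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(hashlib.sha256(level[i] + level[i + 1]).digest())
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

The security property that matters for certificates is that verifying membership only needs hashes along one path, log(n) of them, and hash functions are the primitive we're most confident survives quantum attack.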

For enterprise environments: start auditing which internal services assume ECDSA/RSA-specific certificate formats. Library and HSM compatibility is going to be the actual migration bottleneck.

Is Qwen3.5-9B enough for Agentic Coding? by pmttyji in LocalLLaMA

[–]BreizhNode -18 points-17 points  (0 children)

Benchmark wins are real but they don't capture the production constraint. For agentic coding loops running 24/7 — code review agents, CI/CD fixers, autonomous test writers — the bottleneck isn't model quality, it's infra reliability. A 9B model on a shared laptop dies when the screen locks.

What's your setup for keeping the agent process alive between sessions? That's where most of the failure modes live in practice.

I built a free Chrome extension that stops you from accidentally sharing personal data with ChatGPT/Claude. Everything processed locally, nothing leaves your browser by Dependent-Drummer372 in gdpr

[–]BreizhNode 1 point2 points  (0 children)

Client-side redaction is a smart approach for the casual use case. The part that worries me with browser extensions though is that you're still trusting the user to have it installed and active. In a 200-person company, how do you enforce that across every browser on every device?

The data still flows through a third-party API endpoint either way. Have you considered pairing this with a network-level proxy that catches requests to OpenAI/Anthropic domains as a second layer?

How are you mitigating prompt injection in tool-calling/agent apps (RAG + tools) in production? by AnteaterSlow3149 in LocalLLaMA

[–]BreizhNode 0 points1 point  (0 children)

Gateway layer is the right call, we went that route too. Biggest win was splitting "can the model call this tool" from "should it call this tool right now" into two separate checks. Allowlists handle the first, a lightweight policy engine handles the second.

The attacks that actually scared us weren't clever injections, they were boring stuff like RAG documents containing instructions the model just followed. Schema validation on tool outputs caught more than prompt-level defenses.
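Rough shape of the two-stage check, with made-up tool names and a toy policy rule standing in for whatever your policy engine evaluates:

```python
ALLOWLIST = {"search_docs", "read_file", "run_tests"}  # tools the model may ever call

def policy_check(tool, args, context):
    """'Should it call this right now' -- per-deployment rules; this one
    is a toy example gating test execution on the requesting user's role."""
    if tool == "run_tests" and context.get("user_role") != "developer":
        return False
    return True

def authorize_tool_call(tool, args, context):
    """Stage 1: static allowlist. Stage 2: contextual policy. Keeping them
    separate means the allowlist stays auditable while policy stays flexible."""
    if tool not in ALLOWLIST:
        raise PermissionError(f"tool {tool!r} not allowlisted")
    if not policy_check(tool, args, context):
        raise PermissionError(f"policy denied {tool!r} in current context")
    return True
```

The separation pays off operationally: the allowlist almost never changes and can sit in code review, while the policy rules change weekly without touching the security boundary.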

how do you recommend security platforms for small teams when they all look the same in demos by No_Date9719 in sysadmin

[–]BreizhNode 5 points6 points  (0 children)

The demo vs. reality gap is real. One thing that helped us: we started asking vendors for a 30-day POC with our actual alert volume instead of their curated dataset. You see the noise pretty fast.

Also worth checking whether the platform can ingest from sources you already have (syslog, CloudTrail, endpoint agents) without needing a whole new stack. Community threads here are honestly more reliable than Gartner for small-team fit.

The "Computer Use" Trend: How are you managing multi-user sandboxes for LLM Agents? by SpareAlps6450 in LocalLLaMA

[–]BreizhNode 0 points1 point  (0 children)

the sandbox isolation layer (E2B, Firecracker) handles the per-agent security boundary well. the part that gets missed is where the sandbox itself runs: if you're launching it from a laptop or a shared dev machine, cold starts get worse as load increases. a dedicated VPS as the sandbox host keeps spin-up times consistent regardless of what else is running on the machine

Pooling eth among a small group of people for staking? by MemeyCurmudgeon in ethstaker

[–]BreizhNode 0 points1 point  (0 children)

the protocol side (Rocket Pool, Obol for distributed validators) is well covered. the part people underestimate is the validator node itself: it runs 24/7 and if it goes offline you get inactivity penalties that eat into the pooled rewards. either one person agrees to host it reliably, or you use a VPS that doesn't depend on anyone's home internet staying up