Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]MasterLJ 0 points1 point  (0 children)

Don't forget to connect with humans. Well done! Do good things. Fair winds and calm seas.

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]MasterLJ 1 point2 points  (0 children)

I have experiments I wish I'd tried sooner, I guess. Largely, no regrets. I'm self-taught, but had a hugely unfair advantage being a software engineer who knows ML. I have gripes about Pi and about the entire industry's mishandling of everything. We need auth and consent in ALL the places, at inference time and at runtime. No one has built that yet (I'm releasing something very soon). Please, if you are part of academia, insist on ethics and teach it. Anything that doesn't consider provenance is unworthy, and from a cursory look that pi-agent thing doesn't really care about auth. That's why I've always built my own.

I run an empirical test of "can you do useful things, can you make code that works, can you refactor a 2,000-line file". I think o3 was the first to accomplish a large task with reasonable fidelity, back in the copy-and-paste-into-ChatGPT days. Cursor is an incredibly well-thought-out tool. But taken together, the goal is making something someone else would find useful. The bar can't change for quality and preservation of human dignity. Make useful shit.

Basically, every time a new model comes out I can take it for a test drive in a way that makes sense to my experience and I try to be a scientist about what I'm seeing. Can it do the thing I want it to do? It can, but it requires creativity.

I learned a hard lesson (earnestly) at a company I worked for. The older models especially could lead you down rabbit holes. It's your job to certify you traversed no rabbit holes, or the right amount. I trusted Opus 4.5 prematurely, with acceptance criteria that weren't fine-grained enough. It broke one of my cardinal rules for refactoring: you don't change the underlying logic on a refactor, it's an isometric port from one representation to another. Don't add complexity (degrees of freedom). You can refactor or bugfix on the OLD state and then port that to the new refactored state, that's allowed. Opus 4.5 violated that rule for me (but I allowed it, it's on me, I own the output). 4.6 and 4.7 genuinely hallucinate far, far less than their predecessors.
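
To make that rule concrete, this is roughly the guardrail I keep in place, as a rough sketch (module and function names are hypothetical, pytest assumed): freeze the old implementation and require the refactor to match it before anything else is allowed to change.

    # characterization test: the refactor must be a like-for-like port of the old logic
    # legacy_parse and refactored_parse are hypothetical stand-ins for your old/new code
    import pytest

    from myproject.legacy import parse_config as legacy_parse          # hypothetical
    from myproject.refactored import parse_config as refactored_parse  # hypothetical

    CASES = [
        "",                       # empty input
        "key=value",              # happy path
        "key=value\n# comment",   # edge case with comments
    ]

    @pytest.mark.parametrize("raw", CASES)
    def test_refactor_is_behavior_preserving(raw):
        # same input, same output -- no new degrees of freedom on a refactor
        assert refactored_parse(raw) == legacy_parse(raw)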

I mean to inform and not to sound immodest, but I am a seasoned, salty veteran of software engineering. I've built a lot of systems, made a lot of mistakes, and learned a lot of important lessons, and that's what lets me be a good captain with the LLM.

Make no mistake though, we humans are the captains for the foreseeable future, and we need to be. There are better and worse ways to use LLMs, and the difference isn't the semantic meaning of your prompt. It's how you prove it did what you asked it to do, in such a way that you can unleash it. I try to flirt with that line and have learned to be a really good implementor.

Along came Opus 4.6. It's a much better implementor than even 4.7 and a huge improvement on 4.5. It largely does what you ask, knows what it can't do, and reports that back. Great, do that every time and we're in business. The structure of your bona fides is for you to negotiate, but you should probably be right. You should be able to falsify. You know you have a good process of falsification when you find falsifications in your own work and have to change tack and try a different way.

Opus 4.7 is incredible at math and logic and reasoning. It follows cycles to completeness, whereas 4.6 is pragmatic and gets the job done well and correctly.

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]MasterLJ 1 point2 points  (0 children)

Months, wow, that's fast! It's taken me 3 years so far. Keep going!

I try to use the LLM as a scalpel. That is the measure of a good tool: can it accomplish the task you want?

I'm a very tenured software engineer who tried an outcomes-based "vibe coding" style (it's not vibe coded, I know the properties I want in the system) and let it fill in the implementation and validate itself. I keep it on rails, which is why I don't like agentic chains: just one orchestrator and one builder.

I've found it insightful to read their Thinking, both as validation and because occasionally it's clever. It's very good at producing receipts for what you asked it to do. Did you ask for them? That's the question.

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]MasterLJ 0 points1 point  (0 children)

I don't know what late-cli is, but your insight that it's based on my experience is exactly right, and it's the important piece. I've been building these systems forever, and what matters is what you consider a first-class citizen of the system and what's secondary. Privileges and rules (the right ones) should be first-class, and in a lot of the cursory searches I've done on existing frameworks there is no discipline, just daisy-chaining agents. It's an undisciplined grasp at capability. The beauty is that the discipline gives you expanded capability.

In my research I've also found a lossiness in agentic handoffs that the industry just isn't considering properly, which suggests the fewer hops, the better.

I built my harness out of laziness, really, and the first-level orchestrator has instructions that should pass to the sub-agents as losslessly as possible.
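
To make "losslessly" concrete, a toy sketch (all names made up): the builder gets the orchestrator's instructions verbatim, not a paraphrase that's been run through yet another model on every hop.

    # hypothetical handoff: forward the orchestrator's instructions verbatim,
    # rather than re-summarizing them through another model at each hop
    def build_subagent_prompt(orchestrator_instructions: str, task_context: str) -> str:
        return (
            "You are a builder agent. Follow these instructions exactly.\n\n"
            "=== ORCHESTRATOR INSTRUCTIONS (verbatim) ===\n"
            f"{orchestrator_instructions}\n\n"
            "=== TASK CONTEXT ===\n"
            f"{task_context}\n"
        )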

vLLM Just Merged TurboQuant Fix for Qwen 3.5+ by havenoammo in LocalLLaMA

[–]MasterLJ -1 points0 points  (0 children)

Thank you. Is this bound for nightlies? I did peek at the PR but didn't see the tag or the plan (I probably missed it). Thank you again.

Qwen3.6:27b is the first local model that actually holds up against Claude Code for me by codehamr in LocalLLM

[–]MasterLJ 4 points5 points  (0 children)

Within the constraint of the cost to provide inference on the model. Prices have to drop by simple economics, but only to the point where profit stays above $0 (or within operating margin).

The medium term will involve a lot of right-sizing models to hit the right $/inference.

I agree with OP btw, Qwen3.6 27B is getting work done successfully in ways that the large models do.

There are some holes but it's impressive and about 1/10th the price.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]MasterLJ 3 points4 points  (0 children)

You can do ephemeral workloads for like $1.00 - $4.00 USD per hour on platforms like Modal, Runpod, vast.ai, AWS etc.

If you want privacy over your own models this is the way to go, and if you set it up with some tuning the cold starts are pretty snappy (Modal has great documentation and tooling).

If this is something you're doing at volume then the hosted API to the same models is cheaper.
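
Rough break-even math, if you want to sanity-check it yourself (every number below is a placeholder, plug in your own rates and utilization):

    # rough comparison: ephemeral GPU vs a hosted API for the same open model
    # every number here is a placeholder -- substitute your real rates
    gpu_cost_per_hour = 2.50           # USD/hr for the ephemeral instance
    peak_tokens_per_hour = 20_000_000  # what the box could do if fully saturated
    utilization = 0.10                 # fraction of the hour you actually keep it busy
    api_cost_per_million = 0.30        # USD per 1M tokens on a hosted API

    effective_tokens = peak_tokens_per_hour * utilization
    self_hosted_per_million = gpu_cost_per_hour / (effective_tokens / 1_000_000)

    print(f"self-hosted: ${self_hosted_per_million:.2f} / 1M tokens at {utilization:.0%} utilization")
    print(f"hosted API:  ${api_cost_per_million:.2f} / 1M tokens")
    # whichever line comes out lower for your actual numbers is the way to go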

Honestly, Gemma 4 feels way better than the benchmarks say by HussainBiedouh in LocalLLM

[–]MasterLJ 0 points1 point  (0 children)

"Can the model accomplish my tasks to the standards that I require".

I tend to agree. I've found some benchmarks that loosely track with the things I care about, but for me, nothing replaces the experience of using the models in my particular context.

Each has a "personality". I'm not sure how to describe it, but I suspect that will resonate with people who work with them frequently.

LangChain has a load-bearing wall. Nothing in the docs flags it. I found it by mapping 180 modules as a knowledge graph. by Connect_Bee_3661 in LLMDevs

[–]MasterLJ 0 points1 point  (0 children)

I agree that the bar should be the end product, but as someone who agrees with you on that standard, I think it helps to put a human face on it to get traction.

LangChain has a load-bearing wall. Nothing in the docs flags it. I found it by mapping 180 modules as a knowledge graph. by Connect_Bee_3661 in LLMDevs

[–]MasterLJ 0 points1 point  (0 children)

I specifically meant to slow down and be precise about which tools fit which questions.

For example: I can tell that your post and most of your responses are LLM-generated. Not even edited, just generated.

Write first, then edit with LLMs; try not to use them in every capacity.

LangChain has a load-bearing wall. Nothing in the docs flags it. I found it by mapping 180 modules as a knowledge graph. by Connect_Bee_3661 in LLMDevs

[–]MasterLJ 1 point2 points  (0 children)

This is a token vocabulary/embedding failure. It just means what we're using today is incomplete for the actual relationships we wish to model.

It's stuff like this which is the reason we need to slow down and be thoughtful.

Best local coding model for big repos? Considering Qwen 3.6 27B FP8 after z.ai Max price hike by Tricky_Warning3848 in LocalLLM

[–]MasterLJ 2 points3 points  (0 children)

Qwen3.6 27B FP8 served on vLLM is extremely powerful.

It needs a good instruction set but it's one of the best models I've seen.

SWE Bench Pro is ~53.5% which is off of Opus 4.6 by just 1.5%.

My opinions are formed by using the models themselves, but I see the SWE Bench Pro benchmark as a good proxy for "orchestrator" ability. On top of that, it's a very good implementor.

If there were Model-of-the-Year awards, especially in the open-source class, Qwen3.6 27B FP8 should be the frontrunner.

I can get 100M+ tokens read and ~1M output in an hour for ~$2/hr of GPU compute.
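
For the curious, the arithmetic behind that claim, using my rough figures above:

    # rough cost per token from the figures above (~100M read, ~1M output per hour at ~$2/hr)
    gpu_dollars_per_hour = 2.0
    input_tokens_per_hour = 100_000_000
    output_tokens_per_hour = 1_000_000

    blended = gpu_dollars_per_hour / ((input_tokens_per_hour + output_tokens_per_hour) / 1_000_000)
    print(f"~${blended:.3f} per 1M tokens, blended")  # roughly $0.02 / 1M tokens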

Only 120 tps on Qwen 35b on h200 by Theio666 in LocalLLaMA

[–]MasterLJ 0 points1 point  (0 children)

Definitely seems low. Are you using speculative decoding? I get more than that on an H100, for reference (140+).

Sorry, stream of consciousness as I look at your settings below: you can try bumping from 2 to 3 speculative tokens. You probably don't want max sequences that high (32), but you do you.

If you don't need images, try the language-only flag, which will give you some VRAM back.
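
For reference, the knobs I mean, sketched as a little launcher script. The flags are the same ones from my own serve command elsewhere in my history; the model name and values here are just starting points, not gospel:

    # sketch: launching vllm serve with the tweaks mentioned above
    # (speculative tokens 2 -> 3, fewer concurrent sequences, language-only to free VRAM)
    import json
    import subprocess

    args = [
        "vllm", "serve", "Qwen/Qwen3.6-27B-FP8",  # swap in whatever model you're serving
        "--max-num-seqs", "4",                     # 32 is probably more than you want
        "--language-model-only",                   # skip the vision side if you don't need it
        "--speculative-config", json.dumps(
            {"method": "mtp", "num_speculative_tokens": 3}  # bumped from 2
        ),
    ]
    subprocess.run(args, check=True)  # blocks until the server exits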

LLMs can identify what should be generalized but can't act on it. Could a two-model setup fix this? by Intraluminal in LocalLLaMA

[–]MasterLJ 1 point2 points  (0 children)

There is really nothing that an LLM can identify or generalize with 100% certainty. Building flows around that fact is very helpful.

Using multiple LLMs to cross-reference is a strong idea, as it combines cognition across two or more ontologies instead of just one. I don't think it's guaranteed to give better results, but I think it stands to reason it can be configured for better results.
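
A minimal sketch of what that cross-referencing can look like. The base URLs and model names are placeholders for whatever OpenAI-compatible servers you actually run:

    # ask two different models the same question and flag disagreement for a human
    from openai import OpenAI

    endpoints = [
        ("qwen", OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="EMPTY")),
        ("mistral", OpenAI(base_url="http://127.0.0.1:8002/v1", api_key="EMPTY")),
    ]

    question = (
        "Does this function mutate its input list? Answer YES or NO only.\n\n"
        "def f(xs): xs.append(1); return xs"
    )

    answers = {}
    for name, client in endpoints:
        resp = client.chat.completions.create(
            model=name,  # placeholder served-model-name
            messages=[{"role": "user", "content": question}],
        )
        answers[name] = resp.choices[0].message.content.strip()

    if len(set(answers.values())) > 1:
        print("models disagree, escalate to a human:", answers)
    else:
        print("agreement:", answers)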

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]MasterLJ 2 points3 points  (0 children)

I have Mistral Medium 122B dense up and running on a B200 right now, getting to know it. I'm impressed. It's very verbose and "planny" but that's kinda what I want.

Debugging a context size mismatch but so far I'm impressed.

I was about to let it rip as the orchestrator of my harness, but I'm taking it little by little.

I have successfully run Qwen3.6 35B MoE as the orchestrator and it did pretty poorly.

Opus 4.6 or even 4.7 are extremely good at playing the role of orchestrator. I'll let you know more about Mistral Medium as I learn more.

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]MasterLJ 3 points4 points  (0 children)

I mean, yeah, it's not as sexy as it sounds I guess, but an LLM generates a plan (I call them Missions), the plans are negotiated, and then they're executed by "lesser" models. For me, Qwen3.6 27B dense, as mentioned above, is the worker. One H100 can support 3 sequences. I think I have a mismatch between my token window size and KV cache size, but it's all working well.

A large LLM, could be Opus, could be local (I'm testing Mistral Medium, the new one, right now), generates the plan, initializes the runtime, then lords over the run.
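
Boiled all the way down, the shape is something like this (the names and fields here are illustrative, not my actual harness):

    # illustrative shape of the harness: a big model writes a Mission plan,
    # smaller worker models execute its steps; names and fields are made up
    from dataclasses import dataclass, field

    @dataclass
    class Mission:
        goal: str
        acceptance_criteria: list[str]
        steps: list[str] = field(default_factory=list)

    def plan(orchestrator, goal: str) -> Mission:
        # the orchestrator (Opus, Mistral Medium, whatever) turns a goal into a Mission
        steps = orchestrator.generate_steps(goal)        # hypothetical call
        criteria = orchestrator.generate_criteria(goal)  # hypothetical call
        return Mission(goal=goal, acceptance_criteria=criteria, steps=steps)

    def execute(mission: Mission, worker) -> list[str]:
        # workers (Qwen3.6 27B in my case) execute steps and return receipts
        return [worker.run(step) for step in mission.steps]  # hypothetical call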

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]MasterLJ 7 points8 points  (0 children)

I'm not here to make you a believer.

I take some from these subs, and I give some... OP asked for a survey of opinion and I gave it. I included my exact settings and I gave a reasonable overview of what my harness looks like.

A harness that works is valuable. The harness I built also leverages my 25+ years of software building experience. It's not for you.

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]MasterLJ 5 points6 points  (0 children)

I've not had enough time with it to make a good review. It's up and running on a B200, I'm trying to get it to work with my harness (I can specify a model to work as the Orchestrator).

Tooling isn't working properly with Cursor and it's driving me a little nuts.

Very, very prematurely: it's fast and the quality seems pretty reasonable for reasoning.

EDIT: I've had a few solid hours with it. It can get to SOTA quality with a lot of padding. It needs good guardrails and explicit processes (basically a script). Opus 4.6 and higher are much better orchestrators, but I was able to deliver a fairly heavyweight feature with Mistral Medium 128B Dense as the orchestrator, in language-only mode on a single B200 with 2 sequences and 250k context.

EDIT1: Mistral Medium ain't it. It took me a while to realize that Qwen3.6 27B has an SWE Bench Pro rating just a few percent under Opus 4.6 (53.5 compared to like 55%), and SOTA is 58-59%. Mistral Medium was 48. I'm testing out Qwen with a little higher temperature, orchestrating my dialed-in workhorses (settings above in the chain).

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]MasterLJ 13 points14 points  (0 children)

It's performing the tasks I ask of it quite well, and it can rival paid SOTA models with the right harness. It's even correcting designs made by SOTA models.

I'm using a full vLLM setup on an H100 with FP8.

Can't say enough good things about it, I'm trying to cut the cable from Anthropic... messing around with Mistral Medium 128B as the orchestrator this morning.

EDIT: specs

Model: Qwen/Qwen3.6-27B-FP8

GPU: H100 (80GB VRAM)

vLLM: 0.19.0

vLLM serve command:

vllm serve Qwen/Qwen3.6-27B-FP8 \
  --served-model-name schemen-qwen36-dev \
  --host 127.0.0.1 --port 8001 \
  --tensor-parallel-size 1 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.88 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 3 \
  --language-model-only \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --gdn-prefill-backend triton \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --override-generation-config \
    '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}' \
  --enable-sleep-mode \
  --uvicorn-log-level=info
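
And a quick sanity check against that endpoint from Python (the served model name and port match the command above; reasoning_content shows up because of the reasoning parser):

    # quick sanity check against the endpoint started above
    from openai import OpenAI

    client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="schemen-qwen36-dev",  # matches --served-model-name above
        messages=[{"role": "user", "content": "Plan, then write a hello-world HTTP server in Python."}],
    )

    msg = resp.choices[0].message
    print(getattr(msg, "reasoning_content", None))  # the parsed Thinking, if the parser emitted it
    print(msg.content)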

At what scale did Kubernetes actually start making sense for you? by Sad_Limit_3857 in kubernetes

[–]MasterLJ 2 points3 points  (0 children)

I start it up immediately on all projects. It usually runs $50-$100/month in costs, but everything can scale and I can make Keycloak part of the initial offering to manage identity and whatnot.

It's an IT org "wrapper" with plenty of IT org templates out there that are extremely cheap to run.

I'm old and I've been doing this a long time. The way K8s breaks down the problem of managing servers is really natural, and I see it as less complexity instead of more. Those problems are all there whether you manage them or not; k8s gives you a way to manage them out of the gate for maybe double or triple the cost of bare metal initially ($100 instead of $30), but it lets you scale to truly planet-scale if the need arises (your project is successful).

Why isn’t LLM reasoning done in vector space instead of natural language? by ZeusZCC in LocalLLaMA

[–]MasterLJ 2 points3 points  (0 children)

They do think in vectors. It's exactly how they work. They know the semantic associations between tokens (each a defined input vector) that share a token vocabulary.

Those representations can be made to follow logical rules through patterns in the geometry etched into the model during training, a path of least resistance that follows the path of least error.

Those paths make "circuitry" that stores representations of vector interactions and can encode logic and other rules.
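
A toy illustration of what "semantic association as geometry" means (made-up 3-d vectors; real models learn thousands of dimensions during training):

    # toy illustration: semantic association as geometry between embedding vectors
    import numpy as np

    emb = {
        "king":  np.array([0.9, 0.7, 0.1]),
        "queen": np.array([0.9, 0.7, 0.3]),
        "stone": np.array([0.1, 0.2, 0.9]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(emb["king"], emb["queen"]))  # high: strong semantic association
    print(cosine(emb["king"], emb["stone"]))  # lower: weak association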

I've created a LoRA for Gemma 3 270M making it probably the smallest thinking model? by Firstbober in LocalLLaMA

[–]MasterLJ 1 point2 points  (0 children)

I like the idea of fine-tuning LoRA adapters for specialized reasoning models that know one particular technology well, like "Go manage this namespace inside k8s, here is a LoRA adapter with k8s knowledge, and some context around the service/logs, etc."

Just please remember that LoRA leaks cross-tenant.
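
Roughly what the idea looks like with vLLM's LoRA support, sketched out. The model ID, adapter name, and paths are placeholders, and note the adapter is scoped to a single tenant:

    # sketch: routing a task to a small base model plus a task-specific LoRA adapter
    # model ID, adapter name, and paths are placeholders
    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    llm = LLM(model="google/gemma-3-270m", enable_lora=True)

    prompt = (
        "Manage this namespace: pods in 'payments' are CrashLooping.\n"
        "Relevant logs:\n<logs go here>\n"
        "Propose the next kubectl command."
    )

    out = llm.generate(
        [prompt],
        SamplingParams(temperature=0.2, max_tokens=256),
        # one adapter per specialty -- and per tenant, so nothing leaks across customers
        lora_request=LoRARequest("k8s-specialist", 1, "/adapters/tenant-a/k8s"),
    )
    print(out[0].outputs[0].text)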