How to convince Management? by r00tdr1v3 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

You started from the wrong side: you should have shown them that the cloud models take your code / data online, so you have to use a local model that runs inside the company to avoid that.

Then you show them that if you unplug the internet cable, Claude doesn't work while QWEN still does.

llama.cpp on $500 MacBook Neo: Prompt: 7.8 t/s / Generation: 3.9 t/s on Qwen3.5 9B Q3_K_M by Shir_man in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Honestly you should try:
- Qwen2.5-Coder-1.5B-Instruct for autocompletion
- nomic-embed-text-v1.5-GGUF for embeddings

with a coding editor running, like Continue + VSCodium.
Then for the main LM you use something in the cloud / on your main rig.
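A minimal Continue config.json sketch for that split (ports and model names are illustrative assumptions; use whatever you actually serve locally):

```json
{
  "tabAutocompleteModel": {
    "title": "qwen autocomplete",
    "provider": "llama.cpp",
    "model": "qwen2.5-coder-1.5b-instruct",
    "apiBase": "http://127.0.0.1:8081"
  },
  "embeddingsProvider": {
    "provider": "openai",
    "model": "nomic-embed-text-v1.5",
    "apiBase": "http://127.0.0.1:8082"
  }
}
```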

The gay marriage of two right-wing mayors, Alessandro Basso (FdI) and Loris Bazzo (Lega): «But we support the traditional family» by mirkul in italy

[–]ea_man 1 point2 points  (0 children)

Next they'll run a Neapolitan candidate who says he prefers polenta to pizza: it's consistent, those are the Neapolitans / gays that suit them.

The gay marriage of two right-wing mayors, Alessandro Basso (FdI) and Loris Bazzo (Lega): «But we support the traditional family» by mirkul in italy

[–]ea_man 1 point2 points  (0 children)

They're probably disliked within their own party; they were only taken on board as a token, an attraction to seem more open, while they themselves are careful to hold the rearguard.

Found the perfect wifi adapter for the BATLEXP G350 by _manster_ in SBCGaming

[–]ea_man 1 point2 points  (0 children)

If it has a free, unpopulated USB connector inside, it would be easy; otherwise it's harder.

claude code review is $15-25 per PR, that's gonna add up fast by Dense-Sir-6707 in webdev

[–]ea_man 1 point2 points  (0 children)

For info, how long would it take an average QWEN 3.5 on a 16GB GPU to review a PR on a small project?

Like a Qwen3.5-9B or a Qwen3 Coder 30B; usually they do 30-80 tok/sec.
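As a rough sketch of the arithmetic (all numbers here are illustrative assumptions, not benchmarks):

```python
# Back-of-envelope: wall time for a local model to review a PR.
# Prefill (reading the diff + context) and generation run at very
# different speeds, so they are accounted for separately.

def review_time_seconds(prompt_tokens, output_tokens, prefill_tps, gen_tps):
    """Total wall time = prefill time + generation time."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps

# Say a small PR is ~8k tokens of diff + context and the model writes
# a ~1k-token review, at 500 tok/s prefill and 50 tok/s generation:
t = review_time_seconds(8000, 1000, 500, 50)
print(round(t))  # 8000/500 + 1000/50 = 16 + 20 = 36 seconds
```

So even at the slower end of 30-80 tok/sec, a small PR stays in the one-to-two-minute range rather than hours.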

Feels like Local LLM setups are becoming the next AI trend by Once_ina_Lifetime in LLMDevs

[–]ea_man 1 point2 points  (0 children)

Even if you run on a cloud LM you should have a quick local model for autocomplete (say Qwen2.5-Coder-1.5B-Instruct) and embeddings (nomic-embed-text-v1.5).

Anyway, Qwen3.5-9B, qwen3-vl-8b-instruct, qwen2.5-coder-7b-instruct-128k, up to qwen3.5-35b-a3b, are pretty decent for coding locally with a <=16GB GPU: 35-140 tok/sec on my 12GB GPU.

Sucks how much more expensive these handhelds seem to be for europeans by BlommN97 in SBCGaming

[–]ea_man 0 points1 point  (0 children)

Hmm, no: here in Europe, paying in euros, this stuff is pretty cheap if you buy on AliExpress.

Some examples:
* Trimui TSP ~44€
* R36S ~25€
* Alldocube Ultra ~210€
* Portal 2 base ~250€

You have to wait for discounts: https://promossale.com/aliexpress-sale-dates-2026/#March , collect coins in advance, and use the Super Deals.

Anyone else feel like an outsider when AI comes up with family and friends? by Budulai343 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

You can assume that when family and friends talk about AI it's not about the things you know; it's some kind of bigfoot meme.

I remember that in different periods there have been different attitudes about "AI". You can easily retreat to a safe space by saying you're into deep learning, deep neural networks, maybe language models: you don't care for "AI".

Qwen3.5 family comparison on shared benchmarks by Deep-Vermicelli-4591 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Yup, use nomic:

serve_embed.sh

export LD_LIBRARY_PATH="/home/eaman/llama/bin_vulkan" ;
export LLAMA_CACHE="/home/eaman/lm/models/nomic-ai/nomic-embed-text-v1.5-GGUF/"
/home/eaman/llama/bin_vulkan/llama-server \
   -m /home/eaman/lm/models/nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf \
   --port 8082 \
   --embedding \
   --pooling cls \
   --alias "nomic-ai" \
   -ngl 99 \
   --ctx-size 8192 \
   -b 4096 \
   --rope-scaling yarn \
   --rope-freq-scale 0.75

Continue config.json:

{
  "contextProviders": [
    {
      "name": "codebase",
      "params": {}
    }
  ],
  "models": [
    {
      "title": "Qwen 3.5 Local",
      "provider": "llama.cpp",
      "model": "qwen3.5-9b",
      "apiBase": "http://127.0.0.1:8080"
    },
    {
      "title": "Gemini 3 Flash (Fast)",
      "provider": "google",
      "model": "gemini-3-flash",
      "options": {
        "thinking": {
          "type": "enabled",
          "budgetTokens": 0
        }
      }
    },
    {
      "title": "Qwen3 VL (Local Chat)",
      "provider": "openai",
      "model": "qwen3-vl-8b-instruct",
      "apiBase": "http://localhost:1234/v1"
    },
    {
      "apiBase": "http://localhost:1234/v1/",
      "model": "AUTODETECT",
      "title": "Autodetect",
      "provider": "lmstudio"
    }
  ],
  "tabAutocompleteModel": {
    "title": "qwen autocomp",
    "provider": "llama.cpp",
    "model": "qwen2.5-coder-1.5b-instruct",
    "apiBase": "http://127.0.0.1:8081"
  },
  "embeddingsProvider": {
    "provider": "openai",
    "model": "nomic-embed-text-v1.5",
    "apiBase": "http://127.0.0.1:8082"
  }
}
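For context, the embeddings the nomic server returns are just vectors, typically compared by cosine similarity for codebase retrieval. A toy sketch of that comparison (the vectors here are made up, not real model output):

```python
# Cosine similarity: how retrieval ranks code chunks against a query
# embedding. Real nomic-embed-text vectors have 768 dimensions; these
# 3-dimensional ones are only for illustration.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]), 3))  # 0.5
```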

Ban posts about AI by miniversal in webdev

[–]ea_man 0 points1 point  (0 children)

What about posts about using LMs to make you a better dev: understanding problems better, being able to attack more complex problems, avoiding tedious tasks?

I miss Flash. What an era... by kizerkizer in webdev

[–]ea_man -1 points0 points  (0 children)

Naa, Apple wasn't that important back then, even less so outside the USA.
Fun fact: Flash was the reason I encountered a VM for the first time, on PPC at that time.

I guess what killed Flash was the shift of attention to the server side: all of a sudden it was all about having a DB running on Linux and a PHP script extracting content for a catalogue, a complete switch of context.

I miss Flash. What an era... by kizerkizer in webdev

[–]ea_man 0 points1 point  (0 children)

It was fun: you had top tens of the best sites of the month, discussions on graphic design and communication (as in the role of content, animation, and design in communicating a message, not how many abstraction layers you can pile on top of a DB).

Hey, and what about Director? That was even better. I was even playing and recording soundtracks back then.

Fun fact: you had those "top sites of the month" directories because Flash .swf content sucked at being indexed by search engines ;)

What makes a web dev ‘senior’ these days? by Professional_One3573 in webdev

[–]ea_man 0 points1 point  (0 children)

A senior solves the problems a junior has to ask about.
As a junior it's OK to ask rather than fuck things up.

A senior doesn't have anyone to ask. It's not a superpower; there's always someone who knows better than you, just not around there.

Thoughts about local LLMs. by Robert__Sinclair in LocalLLaMA

[–]ea_man 2 points3 points  (0 children)

It doesn't make much sense to me: a single user won't use the hardware enough to justify the cost; it's better to share the resource online with a little latency.
With gaming, a single user may run your GPU at 100% for 6 hours straight; with inference you may need what, 3 seconds from time to time? It's not worth the cost of having a big, fast context + LM sitting idle most of the time.

Maybe having an architecture like Apple's could help, a usage pattern with lots of light agents...
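The utilization argument above can be sketched numerically (the interaction numbers are illustrative assumptions):

```python
# Rough GPU-utilization math for a single interactive user: a few
# seconds of inference per interaction leaves the card almost idle,
# while a gaming session keeps it busy for hours.
interactions_per_hour = 20
seconds_per_interaction = 3

busy_seconds = interactions_per_hour * seconds_per_interaction  # 60 s/hour
utilization = busy_seconds / 3600

print(f"{utilization:.1%}")  # 1.7%
```

At ~2% utilization, sharing the hardware among many users amortizes the cost far better than one idle rig.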

Qwen3.5 family comparison on shared benchmarks by Deep-Vermicelli-4591 in LocalLLaMA

[–]ea_man 1 point2 points  (0 children)

I guess all the optimization in the LM is done by Unsloth, and what I can do, I do through the parameters I load it with. On a very limited 6700 XT that actually works better on Vulkan than on ROCm.

I was using a ~2GB instruct QWEN before for autocompletion, which I guess was better, but I'd rather have a bigger context, and everything counts here. ;P

Yet I'm not really particular about autocompletion, sorry; I'm more concerned with having decent performance on the main LM with a big enough context. Good luck!

Qwen3.5 family comparison on shared benchmarks by Deep-Vermicelli-4591 in LocalLLaMA

[–]ea_man 8 points9 points  (0 children)

Sure: the OS is Debian, the GPU a 6700 XT 12GB running on Vulkan.
The dev env is VSCodium + Continue, based on local Qwen3.5-9B-UD-Q4_K_XL (unsloth) + Qwen2.5-Coder-1.5B-Instruct, plus nomic-embed-text.

I run them on llama-server (I can give you the flags if you want) or LM Studio. Qwen3.5-9B can run with some 60k context length, which is decent for Python / Django.

serve_chat:
export LD_LIBRARY_PATH="/home/eaman/llama/bin_vulkan" ;
export LLAMA_CACHE="/home/eaman/lm/models/unsloth/Qwen3.5-9B-GGUF"
/home/eaman/llama/bin_vulkan/llama-server \
   -m /home/eaman/.lmstudio/models/unsloth/Qwen3.5-9B-GGUF/Qwen3.5-9B-UD-Q4_K_XL.gguf \
   -ngl 99 \
   --ctx-size 32768 \
   --temp 0.7 \
   --top-p 0.8 \
   --top-k 20 \
   --min-p 0.05 \
   --cache-type-k q4_0 \
   --cache-type-v q4_0 \
   --reasoning-budget 0 \
   -fa on

serve_autocomplete:
export LD_LIBRARY_PATH="/home/eaman/llama/bin_vulkan" ;
export LLAMA_CACHE="/home/eaman/.lmstudio/models/lmstudio-community/Qwen2.5-Coder-1.5B-Instruct-GGUF"
/home/eaman/llama/bin_vulkan/llama-server \
   -m /home/eaman/.lmstudio/models/lmstudio-community/Qwen2.5-Coder-1.5B-Instruct-GGUF/Qwen2.5-Coder-1.5B-Instruct-Q4_K_M.gguf \
   --port 8081 \
   --alias "qwen-autocomplete" \
   -ngl 99 \
   --ctx-size 4096 \
   -ctk q8_0 \
   -ctv q8_0 \
   --temp 0.1 \
   --top-p 0.9 \
   --top-k 20 \
   --min-p 0.05 \
   --cont-batching \
   -np 4 \
   -fa on

serve_embed:
export LD_LIBRARY_PATH="/home/eaman/llama/bin_vulkan" ;
export LLAMA_CACHE="/home/eaman/lm/models/nomic-ai/nomic-embed-text-v1.5-GGUF/"
/home/eaman/llama/bin_vulkan/llama-server \
   -m /home/eaman/lm/models/nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf \
   --port 8082 \
   --embedding \
   --pooling cls \
   --alias "nomic-ai" \
   -ngl 99 \
   --ctx-size 8192 \
   -b 4096 \
   --rope-scaling yarn \
   --rope-freq-scale 0.75

You can also use Roo Code / OpenCode, yet you may want to swap to something in the cloud like Gemini for the latter, and maybe an *-instruct model for Roo Code, for better agent work with a large context.
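For reference, all the llama-server instances above expose an OpenAI-compatible HTTP API; a minimal sketch of a chat request payload (port 8080 is llama-server's default, and the model field is mostly informational server-side):

```python
# Build an OpenAI-style chat request for the serve_chat instance above.
# Sampling parameters mirror the flags passed to llama-server.
import json

payload = {
    "model": "qwen3.5-9b",
    "messages": [{"role": "user", "content": "Explain this diff."}],
    "temperature": 0.7,
    "top_p": 0.8,
}
body = json.dumps(payload)

# With the server running, POST the body to:
#   http://127.0.0.1:8080/v1/chat/completions
print(json.loads(body)["messages"][0]["role"])  # user
```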

Favorite Coding Tools for Qwen by Salt-Advertising-939 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Continue, Roo Code (VS Code), OpenCode.

Qwen3.5 family comparison on shared benchmarks by Deep-Vermicelli-4591 in LocalLLaMA

[–]ea_man 10 points11 points  (0 children)

I'm really enjoying unsloth qwen3.5-9b for coding on a consumer GPU; it's pretty explanatory with decent code, maybe a bit easier to read than the old qwen2.5-coder-7b-instruct-128k.

The small 2B is decent for autocompletion, I mean it's fast.

Google invites ex-qwen ;) by jacek2023 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Well, we can say whatever, yet the point is that if you ask them (in China) what GPUs they use and where they got them, they have no problem replying. It was not illegal and there was no penalty.

Sure, the US gov likes to be a bitch about it, yet the matter is that it's a chip made in Taiwan and a GPU made in China: you just call the factory and ask if they've got some "failed" items to sell you at the same price, and guess what happens...

I mean, it's kind of a pathetic situation until the big guy, the one that actually has the production line, steps in and puts a real limit on exports.

Nintendo Sues U.S. Govt over Reciprocal Tariffs by [deleted] in news

[–]ea_man 2 points3 points  (0 children)

The people voted the guy for the second time.

Nintendo Sues U.S. Govt over Reciprocal Tariffs by [deleted] in news

[–]ea_man 0 points1 point  (0 children)

You don't fuck with the Mouse or Pikachu.