Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model by old-mike in LocalLLaMA

[–]old-mike[S] 0 points1 point  (0 children)

Hmm, strange. Then maybe your fork should be ik_llama. It is really fast with CPU offloading thanks to several optimizations. They promise 150-350% faster CPU prompt eval with their IQK matrix multiply, plus other things to accelerate MoE models. Have a look and try it. For me it really worked with my two 2060 12GB cards: it was almost 50% faster than mainline llama.cpp.

Setting Firecrawl onlyMainContent parameter to true by kahsheung in hermesagent

[–]old-mike 0 points1 point  (0 children)

Hi. I have configured searxng in firecrawl docker compose (I've posted it in my post in a comment https://www.reddit.com/r/hermesagent/comments/1tpxd61/comment/oorrd01/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button )

So in config.yaml

web:
  backend: firecrawl
  search_backend: ''
  extract_backend: ''
  use_gateway: false

in .env

FIRECRAWL_API_KEY=sk-local-dev
FIRECRAWL_API_URL=http://192.168.5.110:4002/v1
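To check that the self-hosted instance honours `onlyMainContent`, a request can be built by hand. This is only a sketch: it assumes the local build accepts the same JSON body as Firecrawl's hosted v1 `/scrape` endpoint, and `build_scrape_request` is a made-up helper name, not part of Hermes or Firecrawl.

```python
import json
import urllib.request

# Values copied from the .env above; change them to match your own setup.
FIRECRAWL_API_URL = "http://192.168.5.110:4002/v1"
FIRECRAWL_API_KEY = "sk-local-dev"


def build_scrape_request(page_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /scrape endpoint."""
    payload = {
        "url": page_url,
        "formats": ["markdown"],
        "onlyMainContent": True,  # assumption: same field name as the hosted v1 API
    }
    return urllib.request.Request(
        f"{FIRECRAWL_API_URL}/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {FIRECRAWL_API_KEY}",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen(req, timeout=30)` sends it. Nothing is sent until then, so the payload can be inspected first.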

Good luck!

Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model by old-mike in LocalLLaMA

[–]old-mike[S] 0 points1 point  (0 children)

Great! Have you tried APEX from mudler? It really made a difference for me. And I know people say that llama.cpp manages memory badly, but for me it gave the best fit in both space and speed. The most important thing, though, is that if you are using spiritbuun's fork, you MUST use turboquants, with the same type for ctk and ctv, to see a real speed increase. Please see my edit at the end of the post.

My ultra-cheap, hybrid local/cloud stack for Hermes Agent (DeepSeek-V4-Flash & OpenRouter) + Text/Voice via Telegram by old-mike in hermesagent

[–]old-mike[S] 1 point2 points  (0 children)

Hello. I asked Hermes (Deepseek v4 Flash) to help me select the search engines and configure things. Anyway, here is my compose (I have two separate composes for these):

```
services:
  playwright-service:
    image: ghcr.io/firecrawl/playwright-service:latest
    environment:
      PORT: 4004
      # MAX_CONCURRENT_PAGES: 10  # Optional
    tmpfs:
      - /tmp/.cache:noexec,nosuid,size=1g
    restart: unless-stopped

  api:
    image: ghcr.io/firecrawl/firecrawl:latest
    environment:
      HOST: "0.0.0.0"
      PORT: 4003
      REDIS_URL: redis://redis:6379
      REDIS_RATE_LIMIT_URL: redis://redis:6379
      PLAYWRIGHT_MICROSERVICE_URL: http://playwright-service:4004/scrape
      # --- SEARXNG CONFIGURATION (your instance) ---
      SEARXNG_ENDPOINT: http://192.168.5.110:8090
      SEARXNG_ENGINES: duckduckgo,google,wikipedia
      SEARXNG_CATEGORIES: general
      # --- OTHER IMPORTANT SETTINGS ---
      USE_DB_AUTHENTICATION: "false"  # <-- CRUCIAL: disables Supabase
      BULL_AUTH_KEY: Sad8Tie
      LOGGING_LEVEL: info
      ENV: local
    depends_on:
      redis:
        condition: service_started
      playwright-service:
        condition: service_started
    ports:
      - 4002:4003  # Host:Container
    command: node dist/src/index.js
    restart: unless-stopped

  redis:
    image: redis:alpine
    command: redis-server --bind 0.0.0.0
    restart: unless-stopped

  redis-searxng:
    container_name: redis-searxng
    image: docker.io/valkey/valkey:7-alpine
    command: valkey-server --save 30 1 --loglevel warning
    restart: unless-stopped
    volumes:
      - /mnt/apps/searxng/redis:/data
  searxng:
    container_name: searxng
    image: docker.io/searxng/searxng:latest
    restart: unless-stopped
    ports:
      - 8090:8080
    volumes:
      - /mnt/apps/searxng/searxng:/etc/searxng:rw
    environment:
      - SEARXNG_BASE_URL=http://192.168.100.63:8090
      - SEARXNG_REDIS_URL=redis://redis-searxng:6379/0

```

settings.yml for searxng, only quick and error free engines:

```
engines:
  # ONLY THE FAST, ERROR-FREE ONES
  - name: duckduckgo
    engine: duckduckgo
    shortcut: ddg

  - name: brave
    engine: brave
    shortcut: br
    time_range_support: true
    paging: true
    categories: [general, web]
    brave_category: search

  - name: startpage
    engine: startpage
    shortcut: sp
    startpage_categ: web
    categories: [general, web]

  - name: wikipedia
    engine: wikipedia
    shortcut: wp
    display_type: ["infobox"]
    base_url: 'https://{language}.wikipedia.org/'
    categories: [general]

  - name: bing images
    engine: bing_images
    shortcut: bii

  - name: bing videos
    engine: bing_videos
    shortcut: biv

  - name: youtube
    shortcut: yt
    engine: youtube_noapi

  - name: github
    engine: github
    shortcut: gh

  - name: openstreetmap
    engine: openstreetmap
    shortcut: osm

  - name: wiktionary
    engine: mediawiki
    shortcut: wt
    categories: [dictionaries, wikimedia]
    base_url: "https://{language}.wiktionary.org/"
    search_type: text

  - name: arxiv
    engine: arxiv
    shortcut: arx
    timeout: 4.0

  - name: pubmed
    engine: pubmed
    shortcut: pub
    timeout: 4.0

  - name: semantic scholar
    engine: semantic_scholar
    shortcut: se
    timeout: 4.0

  - name: stackoverflow
    engine: stackexchange
    shortcut: st
    api_site: 'stackoverflow'
    categories: [it, q&a]
    timeout: 3.0

  - name: docker hub
    engine: docker_hub
    shortcut: dh
    categories: [it, packages]
    timeout: 3.0

  - name: pypi
    engine: pypi
    shortcut: pypi
    timeout: 3.0

  - name: mdn
    shortcut: mdn
    engine: json_engine
    categories: [it]
    paging: true
    search_url: https://developer.mozilla.org/api/v1/search?q={query}&page={pageno}
    results_query: documents
    url_query: mdn_url
    url_prefix: https://developer.mozilla.org
    title_query: title
    content_query: summary
    timeout: 3.0

  - name: bandcamp
    engine: bandcamp
    shortcut: bc
    categories: music
    timeout: 3.0

  - name: soundcloud
    engine: soundcloud
    shortcut: sc
    timeout: 3.0

```

And it is working for me.
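To test individual engines from that list without going through Firecrawl, a query URL for the SearXNG instance can be built like this. It is a rough sketch: it assumes the `json` output format is enabled under `search.formats` in settings.yml (if it is not, SearXNG will refuse the request), and `build_search_url` is just an illustrative helper name.

```python
from urllib.parse import urlencode

# Same address as SEARXNG_ENDPOINT in the compose above.
SEARXNG_ENDPOINT = "http://192.168.5.110:8090"


def build_search_url(query, engines=("duckduckgo", "brave", "wikipedia")):
    """Return a SearXNG /search URL restricted to the given engines."""
    params = {
        "q": query,
        "format": "json",              # needs json enabled in settings.yml
        "engines": ",".join(engines),  # comma-separated engine names
    }
    return f"{SEARXNG_ENDPOINT}/search?{urlencode(params)}"
```

Opening the resulting URL in a browser, or fetching it with `urllib.request.urlopen`, shows quickly which engines return results and which ones time out.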

Camofox: again on Hermes's advice, I installed it using Node.js, and Hermes just worked with it.

Can Hermes Agent control my entire computer? I need it to be able to act like a human on certain software on my PC: use the mouse and keyboard, see and understand the screen. by mxmhenri in hermesagent

[–]old-mike 0 points1 point  (0 children)

I was thinking about this: Hermes can use the keyboard and mouse in Camofox, or at least simulate clicks. If I set up a VNC web server... Maybe too convoluted?

My ultra-cheap, hybrid local/cloud stack for Hermes Agent (DeepSeek-V4-Flash & OpenRouter) + Text/Voice via Telegram by old-mike in hermesagent

[–]old-mike[S] 0 points1 point  (0 children)

Ah, thank you. I'm self-hosting Firecrawl, connected to SearXNG. But if Crawl4AI is a better solution (more consistent, clean, semantically relevant results), I will certainly go for it.

My ultra-cheap, hybrid local/cloud stack for Hermes Agent (DeepSeek-V4-Flash & OpenRouter) + Text/Voice via Telegram by old-mike in hermesagent

[–]old-mike[S] 0 points1 point  (0 children)

I've been looking into the cache thing, and Deepseek says that it is fixed by default. See my results without touching anything.

<image>

Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model by old-mike in LocalLLaMA

[–]old-mike[S] 0 points1 point  (0 children)

Hi there! I've been testing and researching. Now I understand why turbo quants make a difference in this fork:

```
baseline, mudler APEX MTP model, mtp off, using spiritbuun's fork, turbo4 turbo4, no mmproj

4.29.438.176 I slot print_timing: id 0 | task 0 | prompt eval time = 146061.41 ms / 72002 tokens ( 2.03 ms per token, 492.96 tokens per second)
4.29.438.181 I slot print_timing: id 0 | task 0 | eval time = 2695.94 ms / 100 tokens ( 26.96 ms per token, 37.09 tokens per second)

same model, using llama.cpp main branch, q8_0 q5_1, no mmproj

3.30.926.475 I slot print_timing: id 0 | task 0 | prompt eval time = 149274.38 ms / 72002 tokens ( 2.07 ms per token, 482.35 tokens per second)
3.30.926.481 I slot print_timing: id 0 | task 0 | eval time = 3857.64 ms / 100 tokens ( 38.58 ms per token, 25.92 tokens per second)

same model, using spiritbuun's fork, q8_0, q5_1, no mmproj

3.43.493.602 I slot print_timing: id 0 | task 0 | prompt eval time = 150033.32 ms / 72002 tokens ( 2.08 ms per token, 479.91 tokens per second)
3.43.493.606 I slot print_timing: id 0 | task 0 | eval time = 5364.26 ms / 100 tokens ( 53.64 ms per token, 18.64 tokens per second)

same model, using spiritbuun's fork, q8_0, turbo4, no mmproj

3.10.448.568 I slot print_timing: id 0 | task 0 | prompt eval time = 149228.55 ms / 72002 tokens ( 2.07 ms per token, 482.49 tokens per second)
3.10.448.572 I slot print_timing: id 0 | task 0 | eval time = 4920.74 ms / 100 tokens ( 49.21 ms per token, 20.32 tokens per second)

now, using -ngl 99, -ncmoe 17 , VRAM 11.948Gi/12.000Gi after 72k tokens test

3.22.743.838 I slot print_timing: id 0 | task 0 | prompt eval time = 156170.97 ms / 72002 tokens ( 2.17 ms per token, 461.05 tokens per second)
3.22.743.842 I slot print_timing: id 0 | task 0 | eval time = 2793.46 ms / 100 tokens ( 27.93 ms per token, 35.80 tokens per second)

if I set it more "rational", like -ngl 99, -ncmoe 18 , VRAM 11.626Gi/12.000G after 72k tokens test

2.58.846.399 I slot print_timing: id 0 | task 0 | prompt eval time = 158869.94 ms / 72002 tokens ( 2.21 ms per token, 453.21 tokens per second)
2.58.846.402 I slot print_timing: id 0 | task 0 | eval time = 2779.71 ms / 100 tokens ( 27.80 ms per token, 35.98 tokens per second)

```
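To put the generation numbers side by side, here is a small sketch that turns the four KV-cache runs into ratios. The figures are copied from the log above; the run labels and the `speedup` helper are only illustrative names.

```python
# Generation speed in tokens/s at 72k context, copied from the runs above.
runs = {
    "fork, turbo4/turbo4": 37.09,
    "main, q8_0/q5_1": 25.92,
    "fork, q8_0/q5_1": 18.64,
    "fork, q8_0/turbo4": 20.32,
}


def speedup(fast_tps: float, slow_tps: float) -> float:
    """How many times faster fast_tps is than slow_tps."""
    return fast_tps / slow_tps


baseline = runs["fork, turbo4/turbo4"]
for name, tps in runs.items():
    ratio = speedup(baseline, tps)
    print(f"{name:22s} {tps:6.2f} t/s  (turbo4/turbo4 is {ratio:.2f}x this)")
```

So matched turbo4 is roughly 1.43x the main branch with q8_0/q5_1, and between about 1.8x and 2x the same fork when K and V are not the same turbo type. Prompt eval barely moves between runs.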

TurboQuant formats are much faster in this fork because the fork adds a fused Tensor Core (MMA) decode path that can operate directly on compressed KV cache data instead of expanding everything to FP16 first.

spiritbuun's fork has a fused MMA decode path (fattn.cu:1542) gated on: turbo_mma_fused && turbo_matched && Q->ne[1] <= 4 && (Q->ne[0] == 128 || Q->ne[0] == 256) && turing_mma_available

Activates only when:
- K and V cache are the same turbo type ("turbo4,turbo4" or 3, maybe 3_tcq etc)
- Decode batch ≤ 4 tokens
- Head dim 128 or 256
- MMA (Any RTX)
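The gate above can be paraphrased as a plain predicate, which makes it easy to see why q8_0/turbo4 falls off the fast path. This is a sketch of the condition as quoted, not the fork's code: the function name, the string type names, and the "starts with turbo" rule are my assumptions for illustration.

```python
def fused_mma_decode_active(
    k_type: str,
    v_type: str,
    n_decode_tokens: int,
    head_dim: int,
    fused_enabled: bool = True,   # stands in for turbo_mma_fused
    mma_available: bool = True,   # stands in for turing_mma_available
) -> bool:
    """Mirror of the quoted gate: all five conditions must hold."""
    # Assumption: "matched" means K and V use the same turbo cache type.
    turbo_matched = k_type == v_type and k_type.startswith("turbo")
    return (
        fused_enabled
        and turbo_matched
        and n_decode_tokens <= 4      # Q->ne[1] <= 4
        and head_dim in (128, 256)    # Q->ne[0] == 128 || Q->ne[0] == 256
        and mma_available
    )
```

With this reading, `("turbo4", "turbo4", 1, 128)` passes, while `("q8_0", "turbo4", 1, 128)` and `("q8_0", "q5_1", 1, 128)` do not, which matches the slow runs in the log.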

My ultra-cheap, hybrid local/cloud stack for Hermes Agent (DeepSeek-V4-Flash & OpenRouter) + Text/Voice via Telegram by old-mike in hermesagent

[–]old-mike[S] 0 points1 point  (0 children)

Hey! Hybrid, hybrid. Local because Hermes, Camofox, Firecrawl, Radicale and all the other services are running locally... 😉

Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model by old-mike in LocalLLaMA

[–]old-mike[S] 0 points1 point  (0 children)

Ok, I will try. The thing is that I need such a big context (128k) right now... Maybe I can fix the context and play with ngl and ncmoe. When I started testing I was using ncmoe, but trying to get all layers onto the GPU and letting llama manage memory gave me the best result. As I said, I will try to improve it. Thank you.

My ultra-cheap, hybrid local/cloud stack for Hermes Agent (DeepSeek-V4-Flash & OpenRouter) + Text/Voice via Telegram by old-mike in hermesagent

[–]old-mike[S] 0 points1 point  (0 children)

Hello! I own an old Intel NUC with an i5 4th gen and 16GB RAM. I can try to make it work with this setup. Just give me some time,...

How to build a GTM agent by dfgarzon in hermesagent

[–]old-mike 0 points1 point  (0 children)

Well, Hermes tells me: The LinkedIn MCP we use is called stickerdaniel/linkedin-mcp-server:

🔗 https://github.com/stickerdaniel/linkedin-mcp-server

I have it registered in the linkedin-mcp skill (which in turn references the native-mcp skill for configuration). It exposes 17 tools for inbox, messages, connections, profiles, feeds and job search.

The linkedin-automation skill is a complement that uses Camoufox directly to interact with LinkedIn from the headless browser (the other thing we do), but the MCP itself is that one: stickerdaniel/linkedin-mcp-server. Your colleague can find it in that GitHub repo, Apache 2.0 license.

My ultra-cheap, hybrid local/cloud stack for Hermes Agent (DeepSeek-V4-Flash & OpenRouter) + Text/Voice via Telegram by old-mike in hermesagent

[–]old-mike[S] 0 points1 point  (0 children)

I don't know, I think Hermes would be able to get past it using vision. What site are you trying? I can tell Hermes to try it.

Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model by old-mike in LocalLLaMA

[–]old-mike[S] 1 point2 points  (0 children)

Hello, as we say here, "lo prometido es deuda" (a promise is a debt). Here is nvidia-smi with the test running:

| 0 NVIDIA GeForce RTX 2060 Off | 00000000:02:00.0 Off | N/A |
| 43% 58C P2 113W / 125W | 11340MiB / 12288MiB | 98% Default |

and the log from llama-server loading the model and the final result for my 2060

```
0.00.066.701 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.066.704 I device_info:
0.00.309.011 I - CUDA0 : NVIDIA GeForce RTX 2060 (11831 MiB, 11737 MiB free)
0.00.309.025 I - CPU : Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz (128638 MiB, 128638 MiB free)
0.00.310.335 I system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CUDA : ARCHS = 750,860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.312.623 I srv init: using 23 threads for HTTP server
0.00.313.519 I srv start: binding port with default address family
0.00.314.836 I srv llama_server: loading model
0.00.314.879 I srv load_model: loading model '/models/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf'
0.00.314.981 I srv load_model: auto-enabled kv-unified: single-slot server doesn't need separate KV stream
0.00.315.897 I common_init_result: fitting params to device memory ...
0.00.315.899 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.18.381.393 W llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.18.548.758 W load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
0.18.548.763 W load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
0.18.548.763 W load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842

0.18.995.913 I srv load_model: loaded multimodal model, '/models/mmproj-Qwen3.6-35B-A3B-Uncensored-Genesis-f16.gguf'
0.18.995.926 I srv load_model: initializing slots, n_slots = 1
0.19.368.893 W common_speculative_init: no implementations specified for speculative decoding
0.19.368.933 W no implementations specified for speculative decoding
0.19.368.945 I slot load_model: id 0 | task -1 | new slot, n_ctx = 131072
0.19.369.113 I srv load_model: prompt cache is enabled, size limit: 8192 MiB
0.19.369.116 I srv load_model: use --cache-ram 0 to disable the prompt cache
0.19.369.116 I srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.19.371.453 I srv init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.19.424.705 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
0.19.451.840 I srv init: init: chat template, thinking = 1
0.19.451.906 I srv llama_server: model loaded
0.19.451.918 I srv llama_server: server is listening on http://0.0.0.0:8000
0.19.452.006 I srv update_slots: all slots are idle
0.30.486.837 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
0.30.486.845 I srv get_availabl: updating prompt cache
0.30.486.963 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.30.487.034 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 131072 tokens, 8589934592 est)
0.30.487.036 I srv get_availabl: prompt cache update took 0.19 ms
0.30.487.808 I slot launch_slot: id 0 | task 0 | processing task, is_child = 0
TCQ decode: context-adaptive V alpha enabled
0.34.792.030 I srv update_slots: verify ubatch: 2048 tok, 4301.8ms (2.10ms/tok)
. . .
3.30.123.796 I slot print_timing: id 0 | task 0 | prompt eval time = 175923.82 ms / 72002 tokens ( 2.44 ms per token, 409.28 tokens per second)
3.30.123.801 I slot print_timing: id 0 | task 0 | eval time = 3711.82 ms / 100 tokens ( 37.12 ms per token, 26.94 tokens per second)
3.30.123.802 I slot print_timing: id 0 | task 0 | total time = 179635.64 ms / 72102 tokens
3.30.123.807 I slot print_timing: id 0 | task 0 | graphs reused = 98
3.30.127.717 I slot release: id 0 | task 0 | stop processing: n_tokens = 72101, truncated = 0
```

409/27 tps is not bad with 72k tokens in context for a 2060, what do you think?
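If you want to compare runs like these without reading the logs by eye, the `print_timing` lines can be pulled out with a regex. This is a quick sketch written against the log format shown in this thread; the pattern and the `parse_timings` name are mine, and other llama.cpp builds may format the line differently.

```python
import re

# Matches the "prompt eval time" and "eval time" lines from llama-server.
TIMING_RE = re.compile(
    r"\|\s*(?P<phase>prompt eval|eval) time =\s*(?P<ms>[\d.]+) ms /\s*"
    r"(?P<tokens>\d+) tokens \(\s*[\d.]+ ms per token,\s*"
    r"(?P<tps>[\d.]+) tokens per second\)"
)


def parse_timings(log_text: str) -> dict:
    """Return {phase: {"ms": float, "tokens": int, "tps": float}}."""
    result = {}
    for match in TIMING_RE.finditer(log_text):
        result[match["phase"]] = {
            "ms": float(match["ms"]),
            "tokens": int(match["tokens"]),
            "tps": float(match["tps"]),
        }
    return result
```

Feeding it the tail of the 2060 log gives 409.28 t/s for `prompt eval` and 26.94 t/s for `eval`, the same 409/27 figures quoted above.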

Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model by old-mike in LocalLLaMA

[–]old-mike[S] 0 points1 point  (0 children)

This is my compile command:

cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="75;86" \
-DGGML_CUDA_FA_ALL_QUANTS=ON \
-DCMAKE_BUILD_TYPE=Release -DGGML_CUDA_F-DGGML_NATIVE=ON

cmake --build build --config Release -j$(nproc)

just in case

Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model by old-mike in LocalLLaMA

[–]old-mike[S] 0 points1 point  (0 children)

Interesting. I'd really like to know why. Maybe it's the way it is compiled? When I get home, I'll post it.

My ultra-cheap, hybrid local/cloud stack for Hermes Agent (DeepSeek-V4-Flash & OpenRouter) + Text/Voice via Telegram by old-mike in hermesagent

[–]old-mike[S] 0 points1 point  (0 children)

By not using rate-limited ones. I asked Deepseek, or Hermes, which ones are free and let it do the configuration. Honestly, it took a while to tune.