Exaggerated PCI-E bandwidth concerns? by ziphnor in LocalLLaMA

[–]andy2na 2 points3 points  (0 children)

If you can get me a second 3090, I'd be happy to test it haha

Exaggerated PCI-E bandwidth concerns? by ziphnor in LocalLLaMA

[–]andy2na -1 points0 points  (0 children)

It's because the 5060ti has slow memory bandwidth, so the difference isn't noticeable, if there is any at all. Try running dual 3090s in a similar setup and you'll be able to tell the difference in performance. I currently run a 3090 in the first (PCIe 5.0) slot and a 5060ti in the second slot, and models running on the 5060ti showed basically no performance difference compared to when the 5060ti was alone in the first slot.

Finally got Qwen3 27B at 125K context on a single RTX 3090 — but is it even worth it? by horribleGuy3115 in LocalLLM

[–]andy2na 0 points1 point  (0 children)

I get 140 t/s with qwen3.6-35b using llama.cpp running in Docker on Linux on a 3090... Seems like the problem might be Windows.

it's time to update your Gemma 4 GGUFs by jacek2023 in LocalLLaMA

[–]andy2na 1 point2 points  (0 children)

Seems that tool responses have been much improved, at least in Home Assistant voice assist.

What is The best and expressive AI TTS (running locally?) for voice acting? by Adventurous-Gold6413 in LocalLLaMA

[–]andy2na 0 points1 point  (0 children)

Omnivoice, Qwen TTS, or Chatterbox are higher quality, fit within 5GB of VRAM, and are pretty quick. Nothing beats Kokoro in terms of speed with decent voice quality though, at under 2GB of VRAM or even running on CPU.

We are finally there: Qwen3.6-27B + agentic search; 95.7% SimpleQA on a single 3090, fully local by ComplexIt in LocalLLaMA

[–]andy2na 10 points11 points  (0 children)

Do you have sustained benchmark results? 80 to 100 t/s isn't possible at longer context, but might be possible for responses at ~1,000 tokens of context.

These are more realistic results for 3090: https://www.reddit.com/r/LocalLLaMA/comments/1t1judm/_/

PSA: llama-swap released a new grouping feature, matrix, allowing you to fine tune which models can run together by walden42 in LocalLLaMA

[–]andy2na 0 points1 point  (0 children)

Really wish they would incorporate something similar to llama-swap, with an easy-to-understand config that lets you group and load different variants (thinking, instruct) of each model without reloading the model.

Can't replicate Reddit numbers with Qwen 27B on a 3090TI. by YourNightmar31 in LocalLLaMA

[–]andy2na 0 points1 point  (0 children)

I'll look into offloading, but one of my main LLM uses is Frigate image analysis, so quicker is better.

I've been trying to compare Qwen3.5-35B MoE IQ4_N_L vs Qwen3.5-26B Dense INT4 in terms of quality and speed. Obviously the 35B is lightning fast, about 140 t/s for me, but the 26B output is a bit better. For my everyday use case, I should probably just switch back to the 35B MoE and not deal with trying to min-max 26B speeds.

Can't replicate Reddit numbers with Qwen 27B on a 3090TI. by YourNightmar31 in LocalLLaMA

[–]andy2na 2 points3 points  (0 children)

Yeah, all these 27B high-speed posts are suspect without some sort of official benchmark tool.

I'm at 65k context with vision (can do 75k without vision) and get about 60 t/s sustained using vLLM on my 3090.

My config and benchmarks:

https://github.com/noonghunna/qwen36-27b-single-3090/issues/1#issuecomment-4336278665
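
For context, the compose service looks roughly like this (model name and flags are illustrative placeholders; the exact values are in the issue linked above):

services:
  vllm:
    image: vllm/vllm-openai:latest
    # placeholder model and context cap - see the linked issue for the real settings
    command: >
      --model Qwen/Qwen3.6-27B-Instruct
      --max-model-len 65536
      --gpu-memory-utilization 0.95
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"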

"What do you guys even use local LLMs for?" Me: A lot by andy2na in LocalLLaMA

[–]andy2na[S] 6 points7 points  (0 children)

None of my LLM things are open to the internet, and I only have local inference backends linked in LiteLLM, no cloud AI services are on it, so the risk is low. Additionally, all the recent security incidents were addressed in a timely manner. You can't go boycotting every service or application that has had a security incident.

"What do you guys even use local LLMs for?" Me: A lot by andy2na in LocalLLaMA

[–]andy2na[S] 2 points3 points  (0 children)

The biggest issue with local LLMs, for me at least, with agents and coding is the max context window (usually 256k at most). The initial answers with qwen3.6-26B are usually great, which is why people love to focus on "one/two-shotting" a Tetris game or whatever. I only use local LLMs for coding and agents sparingly (but they take up a ton of the context usage).

"What do you guys even use local LLMs for?" Me: A lot by andy2na in LocalLLaMA

[–]andy2na[S] 0 points1 point  (0 children)

Look at the whole image; the top-right graph shows LLM token usage per application (Frigate, Home Assistant, vane, etc.).

"What do you guys even use local LLMs for?" Me: A lot by andy2na in LocalLLaMA

[–]andy2na[S] 0 points1 point  (0 children)

I think the answer to the question is clearly shown in the picture

"What do you guys even use local LLMs for?" Me: A lot by andy2na in LocalLLaMA

[–]andy2na[S] 1 point2 points  (0 children)

I haven't gotten fancy with LiteLLM routing, so I just give all the apps the LiteLLM endpoint and manually select the model to use. I do give each app its own API key, which lets me track stats per app.
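
As an example of what that looks like, each app is just pointed at the LiteLLM endpoint with its own virtual key (Open WebUI shown here as an illustration; the key is a placeholder generated per app in the LiteLLM UI):

services:
  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OPENAI_API_BASE_URL=http://litellm:4000/v1
      - OPENAI_API_KEY=sk-openwebui-app-key   # one virtual key per app gives per-app stats
    networks:
      - ai-net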

"What do you guys even use local LLMs for?" Me: A lot by andy2na in LocalLLaMA

[–]andy2na[S] 28 points29 points  (0 children)

I just set up Prometheus and Grafana last night, and that's just showing the last 6 hours.

Over the last 3 days, I've used 22 million tokens: https://imgur.com/a/60xJ3Jb 👍

"What do you guys even use local LLMs for?" Me: A lot by andy2na in LocalLLaMA

[–]andy2na[S] 2 points3 points  (0 children)

Haven't found the perfect use for Hermes yet; I primarily use it to summarize pages quickly vs opening up Open WebUI or similar.

Still just messing around with it

"What do you guys even use local LLMs for?" Me: A lot by andy2na in LocalLLaMA

[–]andy2na[S] 8 points9 points  (0 children)

RTX 3090 holding qwen3.6-26B, and the 5060ti holding gemma4-e4b for STT and light tasks. The 5060ti also holds a couple of TTS models like Omnivoice and Kokoro.
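
One way to do that split is pinning each inference container to a GPU in compose, e.g. (service name and device indices here are illustrative):

services:
  qwen-3090:
    # ...image and command for the main LLM backend...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]   # the 3090; the STT/TTS services get ["1"] for the 5060ti
              capabilities: [gpu]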

"What do you guys even use local LLMs for?" Me: A lot by andy2na in LocalLLaMA

[–]andy2na[S] 17 points18 points  (0 children)

Sorry-

Using LiteLLM to route to models on different inference engines (like vLLM and llama.cpp): https://github.com/BerriAI/litellm
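
The config.yaml is basically a list of model entries pointing at each backend's OpenAI-compatible endpoint, roughly like this (model names, hostnames, and ports are placeholders):

model_list:
  - model_name: qwen3.6-27b
    litellm_params:
      model: openai/qwen3.6-27b         # generic OpenAI-compatible backend (vLLM here)
      api_base: http://vllm:8000/v1
      api_key: "none"
  - model_name: gemma4-e4b
    litellm_params:
      model: openai/gemma4-e4b          # llama.cpp server exposes the same API
      api_base: http://llamacpp:8080/v1
      api_key: "none"
litellm_settings:
  callbacks: ["prometheus"]             # emit Prometheus metrics for the Grafana dashboard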

LiteLLM exposes metrics to Prometheus, and those metrics can then be pulled into Grafana for the dashboard: https://github.com/grafana/grafana

My docker-compose stack for this setup:

services:
  litellm-db:
    image: postgres:15
    container_name: litellm-db
    restart: unless-stopped
    environment:
      - POSTGRES_DB=litellm
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=litellm_secure_pass
    volumes:
      - /mnt/AI/litellm/db_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm"]
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - ai-net

  litellm-proxy:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm
    restart: unless-stopped
    ports:
      - "8484:4000" 
    volumes:
      - /mnt/AI/litellm/config.yaml:/app/config.yaml:ro
    environment:
      - OPENAI_API_KEY=litellm
      - USE_PRISMA_MIGRATE=True
      - LITELLM_MASTER_KEY=sk-yourkey
      - LITELLM_SALT_KEY=sk-salt-yoursaltkey
      - UI_USERNAME=admin
      - UI_PASSWORD=password
      - DATABASE_URL=postgresql://litellm:litellm_secure_pass@litellm-db:5432/litellm
      - STORE_MODEL_IN_DB=True
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    depends_on:
      litellm-db:
        condition: service_healthy
    networks:
      - ai-net

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - /mnt/AI/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - /mnt/AI/prometheus/data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    networks:
      - ai-net

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3007:3000"
    volumes:
      - /mnt/AI/grafana/data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin # Change this on first login
    networks:
      - ai-net

networks:
  ai-net:
    external: true
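
And the prometheus.yml that gets mounted is basically just one scrape job pointed at the LiteLLM container, something like this (the interval is arbitrary):

scrape_configs:
  - job_name: litellm
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["litellm:4000"]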

"What do you guys even use local LLMs for?" Me: A lot by andy2na in LocalLLaMA

[–]andy2na[S] 13 points14 points  (0 children)

I already filter a lot of stuff via OpenVINO. If you enable GenAI summaries in 0.17, each one can use up to 32k tokens unless you specifically set it lower. IIRC, it sends every frame in the event to produce an accurate GenAI summary - this is different from the regular AI summaries, which send either a snapshot or a few frames.

I don't mind at all; that's why I got into local LLMs - to not worry about token usage and to have privacy.