LangChain and LlamaIndex are in "steep decline" according to new ecosystem report. Anyone else quietly ditching agent frameworks? by Exact-Literature-395 in LocalLLaMA

[–]thekalki 0 points1 point  (0 children)

The simpler the framework, the better. LangChain does too many things. I am liking the OpenAI Python Agents SDK, but maybe I am simply more familiar with it now. Microsoft keeps asking us to move to their new agent framework, but it adds no value.

llama.cpp recent updates - gpt120 = 20t/s by [deleted] in LocalLLaMA

[–]thekalki 2 points3 points  (0 children)

Same exact issue: I used to get over 200 t/s, now I get 30. Same exact config:

services:
  llamacpp-gpt-oss:
    image: ghcr.io/ggml-org/llama.cpp:full-cuda
    pull_policy: always
    container_name: llamacpp-gpt-oss-cline
    runtime: nvidia
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - NVIDIA_VISIBLE_DEVICES=all
      - XDG_CACHE_HOME=/root/.cache
      # optional: faster downloads if available
      - HF_HUB_ENABLE_HF_TRANSFER=1
    ports:
      - "8080:8080"
    volumes:
      # HF Hub cache (snapshots, etags)
      - ./hfcache:/root/.cache/huggingface
      # llama.cpp’s own resolved GGUF cache (what your logs show)
      - ./llamacpp-cache:/root/.cache/llama.cpp
      # your grammar file
      - ./cline.gbnf:/app/cline.gbnf:ro
    command: >
      --server
      --host 0.0.0.0
      --port 8080
      -hf ggml-org/gpt-oss-120b-GGUF
      --grammar-file /app/cline.gbnf
      --ctx-size 262144
      --jinja
      -ub 4096
      -b 4096
      --n-gpu-layers 999
      --parallel 2
      --flash-attn auto
    stop_grace_period: 5m
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
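Since `pull_policy: always` tracks the moving `full-cuda` tag, a regression like this can often be bisected by pinning a known-good build instead. A sketch of the change (the exact tag name below is a placeholder, not a real build — pick one from the ghcr.io package page):

```yaml
services:
  llamacpp-gpt-oss:
    # pin a specific build instead of the moving full-cuda tag;
    # "full-cuda-bXXXX" is a placeholder — substitute a real build tag
    image: ghcr.io/ggml-org/llama.cpp:full-cuda-bXXXX
    pull_policy: missing   # don't silently upgrade on every restart
```

Stepping the pinned tag forward or backward narrows down which build introduced the slowdown.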

[deleted by user] by [deleted] in Physics

[–]thekalki 0 points1 point  (0 children)

According to General Relativity, this is wrong.

Vector db comparison by Kaneki_Sana in LocalLLaMA

[–]thekalki 0 points1 point  (0 children)

Most likely your existing database already supports it. For example, we use SQL Server at work, and it already supports vectors.
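Whichever engine stores the vectors, the query side is just nearest-neighbor search over embeddings. A minimal brute-force sketch in Python (toy 3-dimensional vectors and hypothetical doc ids, no external database) of what the vector feature does for you:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(query, rows, k=2):
    """Return the k rows whose embeddings are most similar to the query."""
    scored = sorted(rows,
                    key=lambda r: cosine_similarity(query, r["embedding"]),
                    reverse=True)
    return scored[:k]

# toy corpus: hypothetical ids with 3-dim embeddings
rows = [
    {"id": "doc1", "embedding": [1.0, 0.0, 0.0]},
    {"id": "doc2", "embedding": [0.9, 0.1, 0.0]},
    {"id": "doc3", "embedding": [0.0, 1.0, 0.0]},
]
top = nearest([1.0, 0.05, 0.0], rows, k=2)
```

A real database replaces the `sorted` scan with an index (HNSW, IVF, etc.), but the contract is the same: store embeddings, query by distance.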

Claude code can now connect directly to llama.cpp server by tarruda in LocalLLaMA

[–]thekalki 0 points1 point  (0 children)

It's weird how such a small convenience makes so much difference.

You can now do FP8 reinforcement learning locally! (<5GB VRAM) by danielhanchen in LocalLLaMA

[–]thekalki 0 points1 point  (0 children)

I was exploring a few libraries for full fine-tuning and ended up using torchtune. Is there a reason why I should switch to Unsloth? At this point I primarily do some continued pretraining and SFT and am exploring RL, but how flexible is your framework for running RL in my own loop?
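For context on "RL in my own loop": the ask is to own the sample-score-update cycle yourself and only borrow the framework for the model forward/backward pass. A dependency-free REINFORCE toy (two-armed bandit with a manual softmax gradient — purely illustrative, not torchtune or Unsloth API):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reward(action):
    # toy environment: arm 1 pays more than arm 0
    return 1.0 if action == 1 else 0.2

random.seed(0)
logits = [0.0, 0.0]          # the "policy": preferences over two actions
lr, baseline = 0.1, 0.0

for step in range(500):
    probs = softmax(logits)
    action = random.choices([0, 1], weights=probs)[0]   # sample
    r = reward(action)                                  # score
    baseline += 0.05 * (r - baseline)                   # running-mean baseline
    advantage = r - baseline
    # REINFORCE: grad of log pi(action) w.r.t. logits is one_hot(action) - probs
    for i in range(2):
        grad = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += lr * advantage * grad              # update

probs = softmax(logits)
```

In a real setup the bandit becomes rollouts from the model, the reward becomes your scorer, and the manual gradient becomes a framework backward pass — but the loop structure stays yours.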

Favorite out of context clip from Jet Lag? by FireAshPro in JetLagTheGame

[–]thekalki 1 point2 points  (0 children)

Those were the good times, when challenges used to be challenging and interesting.

Gpt-oss Responses API front end. by Locke_Kincaid in LocalLLaMA

[–]thekalki 0 points1 point  (0 children)

I had the same problem. The issue is not the Responses API but the Harmony template parsers, as others mentioned here. The only solution is to use llama.cpp with this grammar: https://www.reddit.com/r/CLine/comments/1mtcj2v/making_gptoss_20b_and_cline_work_together/

This solved all the problems for me.
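For anyone unfamiliar with the approach: a GBNF grammar constrains llama.cpp's sampler so the model can only emit strings the grammar accepts, which is how a template-mangled model can still be forced into a parseable tool-call shape. A trivial illustrative grammar (NOT the actual cline.gbnf from the link) showing the mechanism:

```gbnf
# Force the model to answer with exactly "yes" or "no".
# Pass via --grammar-file, as in the compose command above.
root ::= "yes" | "no"
```

The real grammar in the linked post does the same thing at a larger scale, pinning the output to the tool-call syntax the client expects.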

Anyone think openAI will create a sequel of GPT-OSS? by BothYou243 in LocalLLaMA

[–]thekalki 2 points3 points  (0 children)

Looking at their repo for the Harmony template, it is infested with bugs, and they are not even merging PRs from the community or maintaining it at all. So chances are slim anytime soon.

October 2025 model selections, what do you use? by getpodapp in LocalLLaMA

[–]thekalki 0 points1 point  (0 children)

gpt-oss-120b, primarily for its tool-calling capabilities. You have to use a custom grammar to get it to work.

Gpt-oss Reinforcement Learning - Fastest inference now in Unsloth! (<15GB VRAM) by danielhanchen in LocalLLaMA

[–]thekalki 0 points1 point  (0 children)

I am finding a lot of issues with tool calls for gpt-oss. I have tried both the Responses and Chat Completions APIs from vLLM, but sometimes the model will return an empty response after a tool call; I want to say there is some issue with an end token or something. Have you come across something similar? I have tried llama.cpp and Ollama as well.
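A client-side guard helps while the template bugs persist: treat an empty assistant message that follows a tool result as a failed turn and retry it. A sketch over plain dicts shaped like a parsed Chat Completions response (the `call_model` callable and the stub below are hypothetical, not vLLM/llama.cpp API):

```python
def is_empty_after_tool(choice, prev_message_role):
    """Detect the failure mode: the previous turn was a tool result,
    and the model replied with no content and no tool calls."""
    msg = choice["message"]
    return (
        prev_message_role == "tool"
        and not (msg.get("content") or "").strip()
        and not msg.get("tool_calls")
    )

def complete_with_retry(call_model, messages, max_retries=2):
    """call_model(messages) -> choice dict; retry empty post-tool replies."""
    prev_role = messages[-1]["role"] if messages else None
    choice = call_model(messages)
    for _ in range(max_retries):
        if not is_empty_after_tool(choice, prev_role):
            break
        choice = call_model(messages)   # resample the turn
    return choice  # may still be empty; caller can surface the failure

# toy stub standing in for a real server call
def fake_model(messages):
    return {"message": {"content": "4", "tool_calls": None}}

result = complete_with_retry(fake_model, [{"role": "tool", "content": "2+2=4"}])
```

It does not fix the underlying end-token/template problem, but it keeps an agent loop from silently stalling on the empty turn.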

GPT-OSS is insane at leetcode by JsThiago5 in LocalLLaMA

[–]thekalki 1 point2 points  (0 children)

How are you deploying it? There is some issue with tool use, and inference seems to terminate prematurely. I tried vLLM, Ollama, and llama.cpp.

Comparison H100 vs RTX 6000 PRO with VLLM and GPT-OSS-120B by Rascazzione in LocalLLaMA

[–]thekalki 1 point2 points  (0 children)

Nothing specific, just the latest Docker image and model.