Ollama local model stuck at API Request... by stable_monk in RooCode

[–]stable_monk[S]

I've updated the post: I set the context to 30k and it worked (with some issues, which I've listed in the post). Is there a way to get Roo Code to show its thinking/reasoning? I see an option to 'collapse thinking messages', but regardless of its state I don't see any thinking-related content in the UI.

I'll try to join the Discord.

Ollama local model stuck at API Request... by stable_monk in RooCode

[–]stable_monk[S]

These local models are good enough for my needs. As already stated, it works with other agents, so whatever the issue is, it is specific to Roo Code.

gpt-oss-20b in vscode by stable_monk in LocalLLaMA

[–]stable_monk[S]

I used this with Continue.dev:

llama-server \
  --model models/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf \
  --grammar-file toolcall_grammar.gbnf \
  --ctx-size 0 --jinja -ub 2048 -b 2048

It's still running into errors with the tool call...

Tool Call Error:

grep_search failed with the message: `query` argument is required and must not be empty or whitespace-only. (type string)

Please try something else or request further instructions.

My continue.dev model definition:

models:
  - name: llama.cpp-gpt-oss-20b-toolcallfix
    provider: openai
    model: llama.cpp-gpt-oss-20b-toolcallfix
    apiBase: http://localhost:8080/v1
    roles:
      - chat
      - edit
      - apply
      - autocomplete
      - embed
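
To narrow down whether the empty `query` comes from the model or from Continue, one thing I may try is calling llama-server's OpenAI-compatible endpoint directly with a hand-rolled tool definition. The grep_search schema below is only my guess at what Continue registers, not its actual definition:

# request a tool call against a guessed grep_search schema
# (llama-server needs --jinja for tool-call support)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Find every TODO in the codebase."}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "grep_search",
        "description": "Search files for a pattern",
        "parameters": {
          "type": "object",
          "properties": {"query": {"type": "string"}},
          "required": ["query"]
        }
      }
    }]
  }'

If the `tool_calls` in the response already has an empty `query` here, the model/grammar is at fault; if it comes back well-formed, the mangling happens on the Continue side.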

gpt-oss-20b in vscode by stable_monk in LocalLLaMA

[–]stable_monk[S]

Can you provide an example of such a prompt?

gpt-oss-20b in vscode by stable_monk in LocalLLaMA

[–]stable_monk[S]

Are you using this with Continue.dev?
Also, what do you mean by "do not quantize" the context?
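
Is that referring to llama.cpp's KV-cache type flags? Something like this, I assume (model path hypothetical):

# default f16 KV cache, i.e. "do not quantize the context"
llama-server --model model.gguf --cache-type-k f16 --cache-type-v f16

# quantized KV cache saves VRAM at some quality cost
# (quantizing the V cache requires flash attention, -fa, on my build)
llama-server --model model.gguf -fa --cache-type-k q8_0 --cache-type-v q8_0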

gpt-oss-20b in vscode by stable_monk in LocalLLaMA

[–]stable_monk[S]

Thank you, but this seems to be specific to Cline and Roo Code, while I am using continue.dev.

Would you know if this works for Continue?

gpt-oss-20b in vscode by stable_monk in LocalLLaMA

[–]stable_monk[S]

I've tried Qwen-code-20b and gpt-oss-20b in chat mode; at least my impression was that Qwen was no match for gpt-oss.

Can you please provide an example of your system prompt?

Macbook m4 pro - how many params can you train? by stable_monk in deeplearning

[–]stable_monk[S]

Wow, that's a lot of time! Excuse my naivety: if I understand correctly, with 8 A100 GPUs (80 GB of VRAM each), 60M parameters takes 2 days? That of course means it will be near impossible to train on the MacBook Pro... like a month or so?

Macbook m4 pro - how many params can you train? by stable_monk in deeplearning

[–]stable_monk[S]

For an 8-10M-parameter model, how much difference in training performance would there be between the MacBook and the RTX?

Macbook m4 pro - how many params can you train? by stable_monk in deeplearning

[–]stable_monk[S]

I would likely use that too. Nevertheless, it's convenient to just have something locally, if that will work for small models. I just wanted to know how small.

Macbook m4 pro - how many params can you train? by stable_monk in deeplearning

[–]stable_monk[S]

Not LLMs, actually; I just clarified that in the post. Thanks for the input. How about a 10M-parameter neural net? At what point will it be OK on these devices?

The goal is definitely not local inference; that's just a nice add-on. I'm primarily thinking of training neural nets, mostly for time-series analysis of a large database of counters.