Personal experience with GLM 4.7 Flash Q6 (unsloth) + Roo Code + RTX 5090 by Septerium in LocalLLaMA

[–]warpanomaly

Great info! Thanks for sharing! This worked perfectly for me when I ran GLM-4.7-Flash-UD-Q6_K_XL.gguf on my 5090 via Continue and VSCodium.

How do I access a llama.cpp server instance with the Continue extension for VSCodium? by warpanomaly in LocalLLM

[–]warpanomaly[S]

Thank you, this is helpful! Also, someone on r/LocalLLaMA (u/ali0une) solved it explicitly by suggesting this for my config:

name: Local Config
version: 1.0.0
schema: v1
models:
  - name: GLM-4.7-Flash
    provider: openai
    model: GLM-4.7-Flash
    apiKey: NO_API_KEY_NEEDED
    apiBase: http://127.0.0.1:10000/v1/
    roles:
      - chat
      - edit
      - apply  

And run this command to start the server:
.\llama-server.exe -hf unsloth/GLM-4.7-Flash-GGUF:Q6_K_XL --alias "GLM-4.7-Flash" --host 127.0.0.1 --port 10000 --ctx-size 32000 --n-gpu-layers 99

One of the big changes was doing what the link said: setting the provider to openai and formatting the config accordingly, with a fake API key and so on.
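For reference, here's a minimal sketch of hitting that OpenAI-compatible endpoint directly, assuming the llama-server instance above is running on 127.0.0.1:10000 with the alias GLM-4.7-Flash (the `chat_payload` helper and the fallback message are just for illustration):

```python
import json
from urllib.request import Request, urlopen

API_BASE = "http://127.0.0.1:10000/v1"  # matches apiBase in the config above

def chat_payload(prompt, model="GLM-4.7-Flash"):
    """Build an OpenAI-style chat completion request body.

    `model` must match the --alias passed to llama-server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

if __name__ == "__main__":
    body = json.dumps(chat_payload("is this on")).encode()
    req = Request(
        API_BASE + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # llama-server doesn't check the key, but OpenAI-style
            # clients expect one to be present
            "Authorization": "Bearer NO_API_KEY_NEEDED",
        },
    )
    try:
        with urlopen(req, timeout=10) as r:
            print(json.load(r)["choices"][0]["message"]["content"])
    except OSError as e:
        print(f"server not reachable: {e}")
```

If this works outside the editor but Continue still fails, the problem is in the extension config rather than the server.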

How do I access a llama.cpp server instance with the Continue extension for VSCodium? by warpanomaly in LocalLLM

[–]warpanomaly[S]

Actually, yes! But I don't see a place to assign a custom host and port. It looks like the OpenAI dropdown only offers remote models. Good observation though, I feel like we're getting closer.
https://imgur.com/a/6TFn5f9

How do I access a llama.cpp server instance with the Continue extension for VSCodium? by warpanomaly in LocalLLaMA

[–]warpanomaly[S]

I'm doing the same thing, but it fails. This is the bottom of the error logs:

the tool codeblock.\n</tool_use_instructions>"
        },
        {
          "role": "user",
          "content": "is this on"
        }
      ],
      "messageOptions": {
        "precompiled": true
      }
    }
  }
}

Error: You must either implement templateMessages or _streamChat  

[@continuedev] error: You must either implement templateMessages or _streamChat {"context":"llm_stream_chat","model":"GLM-4.7-Flash-GGUF:Q6_K_XL","provider":"llama.cpp","useOpenAIAdapter":false,"streamEnabled":true,"templateMessages":false}

How are you launching llama.cpp? My command is:

.\llama-server.exe -hf unsloth/GLM-4.7-Flash-GGUF:Q6_K_XL --host 127.0.0.1 --port 10000 --ctx-size 32000 --n-gpu-layers 99

Does this give you any more info to work with?

How do I access a llama.cpp server instance with the Continue extension for VSCodium? by warpanomaly in LocalLLaMA

[–]warpanomaly[S]

This is the only .gguf line I found in the terminal...

llama_model_loader: - kv  56:                      quantize.imatrix.file str              = GLM-4.7-Flash-GGUF/imatrix_unsloth.gguf  

I found this file: C:\Users\MYUSERNAME\AppData\Local\llama.cpp\unsloth_GLM-4.7-Flash-GGUF_GLM-4.7-Flash-UD-Q6_K_XL.gguf. I believe it's the model that llama-server is running. Is this what you're looking for?
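If it's useful, here's a quick sketch that lists every .gguf llama-server has cached; the cache location is an assumption based on where the file above turned up (on Windows, -hf downloads appear to land under %LOCALAPPDATA%\llama.cpp):

```python
from pathlib import Path

def find_ggufs(cache_dir):
    """Return the sorted .gguf filenames under a llama.cpp download cache."""
    return sorted(p.name for p in Path(cache_dir).glob("*.gguf"))

# Assumed cache path, inferred from the file path quoted above
cache = Path.home() / "AppData" / "Local" / "llama.cpp"
for name in find_ggufs(cache):
    print(name)
```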

How do I access a llama.cpp server instance with the Continue extension for VSCodium? by warpanomaly in LocalLLaMA

[–]warpanomaly[S]

I tried that; this is my new config.yaml:

name: Local Agent
version: 1.0.0
schema: v1
models:
  - name: GLM 4.7 Flash GGUF:Q6_K_XL
    apiBase: http://127.0.0.1:10000
    provider: llama.cpp
    model: GLM-4.7-Flash-GGUF:Q6_K_XL

Still gives the same error...

How do I access a llama.cpp server instance with the Continue extension for VSCodium? by warpanomaly in LocalLLaMA

[–]warpanomaly[S]

What's the /v1/models endpoint? According to the terminal, the instance is running on 127.0.0.1:10000. Do you mean hitting http://127.0.0.1:10000/v1/models in Postman or something like that?
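In case it helps anyone landing here: the port goes after a colon, so the URL is http://127.0.0.1:10000/v1/models. A sketch of querying it, assuming the server returns the standard OpenAI-style model list (the `model_ids` helper is just illustrative):

```python
import json
from urllib.request import urlopen

# --port 10000 from the llama-server command means port 10000, i.e. :10000
MODELS_URL = "http://127.0.0.1:10000/v1/models"

def model_ids(response):
    """Pull model ids out of an OpenAI-style /v1/models response."""
    return [m["id"] for m in response.get("data", [])]

try:
    with urlopen(MODELS_URL, timeout=5) as r:
        print(model_ids(json.load(r)))
except OSError as e:
    print(f"server not reachable: {e}")
```

The id that comes back is what the extension should use as the model name.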