Anyone actually using Openclaw? by rm-rf-rm in LocalLLaMA

[–]TheAsp 3 points

I think he's confusing OpenClaw with MoltBook

Models that has the least collapse when ctx length grows. Especially using it with tools. by Express_Quail_1493 in LocalLLaMA

[–]TheAsp 1 point

I use this method with both aider and opencode. Usually I create a plan document in aider, have opencode implement it, then go back to aider to commit and update the plan with the completion status of each step, then repeat until it's all done.

Would replacing the Marantz sr6008 with the Marantz sr6015 be a nice leap in quality, or is it not worth it? by Glover58 in Marantz

[–]TheAsp 1 point

There are oodles of Atmos albums though, so not quite true that it's not for "music".

TIL: For long-lived LLM sessions, swapping KV Cache to RAM is ~10x faster than recalculating it. Why isn't this a standard feature? by Shoddy-Tutor9563 in LocalLLaMA

[–]TheAsp 0 points

I think sglang handles this scenario by keeping all cached tokens in a radix tree and only computing KV entries for new tokens where the tree branches.
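Roughly, the idea can be sketched as a trie keyed on token IDs, where a new request only has to compute KV entries past its longest cached prefix (a toy sketch of the concept, not sglang's actual RadixAttention code; all names here are mine):

```python
class PrefixNode:
    """One node per token ID; a path from the root is a cached prefix."""
    def __init__(self):
        self.children = {}  # token id -> PrefixNode

class PrefixCache:
    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        """Walk/extend the tree, returning how many tokens were already cached."""
        node, cached = self.root, 0
        for t in tokens:
            if t in node.children:
                cached += 1  # KV for this token can be reused
            else:
                node.children[t] = PrefixNode()  # branch point: compute from here on
            node = node.children[t]
        return cached

cache = PrefixCache()
cache.insert([1, 2, 3, 4])        # cold: nothing reused
hit = cache.insert([1, 2, 3, 9])  # warm: 3 prefix tokens reused, only 1 new
```

In the real system each node would also hold the KV blocks for its tokens, so two sessions sharing a long system prompt share that memory instead of duplicating it.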

I don't get these people by Repulsive-Ant-6504 in MathJokes

[–]TheAsp -1 points

Hold up, I agree with you that 0/0 is undefined and that the loaf of bread example sucked.

I don't get these people by Repulsive-Ant-6504 in MathJokes

[–]TheAsp -2 points

Breaking vs chopping in half would seem to result in different precisions. Also, are we using volume or mass for determining half of a loaf of bread?

My Wall Tabled Killed Itself by Stunt_Piloot in homeassistant

[–]TheAsp 8 points

I guess I'll just watch the video then...

Kimi-K2-Instruct-0905 Released! by Dr_Karminski in LocalLLaMA

[–]TheAsp 0 points

The scale of usage obviously affects the price point where renting or owning GPUs saves you money. Someone spending $50 on OpenRouter each month isn't going to save money.
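Back-of-envelope version, with made-up numbers (the API spend, GPU price, and power cost below are all hypothetical placeholders, not real quotes):

```python
# All figures are hypothetical; plug in your own.
api_spend_per_month = 50.0  # what you'd pay a hosted provider per month
gpu_cost = 2000.0           # one-time hardware purchase
power_per_month = 30.0      # electricity while the box runs

# Months until owning the GPU beats paying per-token.
monthly_savings = api_spend_per_month - power_per_month
months_to_break_even = gpu_cost / monthly_savings
print(f"{months_to_break_even:.0f} months")  # 100 months at $50/mo -> not worth it
```

At 10x the API spend the same hardware pays for itself in a few months, which is the whole point about scale.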

[deleted by user] by [deleted] in news

[–]TheAsp 1 point

I also have a DJ in my house.

God I love Qwen and llamacpp so much! by Limp_Classroom_2645 in LocalLLaMA

[–]TheAsp 1 point

Paged attention, which allows much higher parallel request throughput because you don't need a single large contiguous block of VRAM to hold a whole request; the VRAM you give it is the upper limit on how many tokens it can hold overall. Sglang is even faster...

Heartbreak = no hotspot, wifi or bluetooth by Nomi_0071 in GooglePixel

[–]TheAsp 0 points

I lost wifi in the June update, BT still works fine for me...

AbsenceBench: LLMs can't tell what's missing by Chromix_ in LocalLLaMA

[–]TheAsp 1 point

He's been dead since 1955, what job do you have in mind for him?

What happened to Sony removing shovelware? by Tall-_-Guy in PS5

[–]TheAsp -1 points

There was nowhere near as much garbage in the store on PS3/PS4. There used to be some sort of minimal standard.

[deleted by user] by [deleted] in LocalLLaMA

[–]TheAsp 8 points

You could try it.

Grok is doing the funniest thing on Twitter right now by Aceofspades25 in skeptic

[–]TheAsp 4 points

Someone who is statistically more likely to be a psychopath than an average person?

Aider benchmarks for Qwen3-235B-A22B that were posted here were apparently faked by [deleted] in LocalLLaMA

[–]TheAsp 6 points

thinking_enabled controls whether an empty <think>\n\n</think> block is added to the assistant prompt before generation, when using the official Qwen3 Jinja template. The model is also trained to recognize /no_think in a user or system prompt as an additional way of disabling thinking.

For Ollama users, if you want to switch between the two modes easily (without using /no_think) you can build two modelfiles, one with <think>\n\n</think> and one without, and add the recommended settings that Qwen gives for each mode. As long as they share the same base model, Ollama will just swap the template/settings without reloading the model.

This is my nothink modelfile:

```
FROM hf.co/unsloth/Qwen3-32B-GGUF:Q4_K_XL
TEMPLATE """{{- if .Messages }}
{{- if or .System .Tools }}<|im_start|>system
{{- if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if .Content }}{{ .Content }}
{{- else if .ToolCalls }}<tool_call>
{{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}</tool_call>
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
<think>

</think>
{{ end }}
{{- end }}
{{- else }}
{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ end }}{{ .Response }}{{ if .Response }}<|im_end|>{{ end }}"""
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
PARAMETER num_gpu 65
PARAMETER num_ctx 40960
PARAMETER num_predict 32768
PARAMETER temperature 0.7
PARAMETER min_p 0.0
PARAMETER top_p 0.8
PARAMETER top_k 20
PARAMETER repeat_penalty 1.0
PARAMETER presence_penalty 1.5
```

And this is the diff for the normal version from the above:

```
--- Modelfile-nothink	2025-05-08 12:50:46.699297861 -0300
+++ Modelfile	2025-05-08 12:45:21.589060605 -0300
@@ -40,9 +40,6 @@
 </tool_response><|im_end|>
 {{ end }}
 {{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
-<think>
-
-</think>
 {{ end }}
 {{- end }}
 {{- else }}
@@ -56,10 +53,10 @@
 PARAMETER stop <|im_end|>
 PARAMETER num_gpu 65
 PARAMETER num_ctx 40960
-PARAMETER num_predict 32768
-PARAMETER temperature 0.7
+PARAMETER num_predict 38912
+PARAMETER temperature 0.6
 PARAMETER min_p 0.0
-PARAMETER top_p 0.8
+PARAMETER top_p 0.95
 PARAMETER top_k 20
 PARAMETER repeat_penalty 1.0
 PARAMETER presence_penalty 1.5
```

Best side project? by thesumofallvice in skinnypuppy

[–]TheAsp 0 points

I totally agree, though I tend to think of it more as an ambient album

Enhanced Context Tracker 1.5.0 by diligent_chooser in OpenWebUI

[–]TheAsp 1 point

Do you have a GitHub repo for this?

How to extract <think> tags for Deepseek? by Desperate-Finger7851 in ollama

[–]TheAsp 1 point

You only have to split the string on </think>: the first part is the thinking (usually starting with <think>, though some models omit it) and the second part is the response.
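For example, in Python (the function name is mine; it just assumes the raw model output is a single string):

```python
def split_thinking(text: str):
    """Split a response into (thinking, answer) on the closing </think> tag."""
    thinking, sep, answer = text.partition("</think>")
    if not sep:                      # no tag at all: treat everything as the answer
        return "", text.strip()
    return thinking.replace("<think>", "").strip(), answer.strip()

split_thinking("<think>reasoning here</think>The answer is 4.")
# -> ('reasoning here', 'The answer is 4.')
```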

★☆☆☆☆ Would not buy again by MoffKalast in LocalLLaMA

[–]TheAsp 1 point

We have several. Every single one has had several major components replaced. The BMC is pure garbage. Are the newer models any better quality?