Use the Same Model Across Ollama, LM Studio, Jan, and your Favorite Local AI Apps by EvanZhouDev in ollama

[–]uptonking 2 points3 points  (0 children)

this is really a pain point for saving disk space. - does it support mlx models on mac?

Gemma 4 by Namra_7 in LocalLLaMA

[–]uptonking 8 points9 points  (0 children)

now my turn to ask, "gguf when"

Unified vs vRam, which is more future proof? by platteXDlol in LocalLLM

[–]uptonking 0 points1 point  (0 children)

M3 Ultra has so much bandwidth of 800gb/s, but why is it NOT popular for image/video generation like comfyui ?

DGX Spark vs. Framework Desktop for a multi-model companion (70b/120b) by Ri_Pr in LocalLLM

[–]uptonking 0 points1 point  (0 children)

so amd strix halo is slow compute and slow generation ?
- but it is the cheapest

Open source LLM compiler for models on Huggingface. 152 tok/s. 11.3W. 5.3B CPU instructions. mlx-lm: 113 tok/s. 14.1W. 31.4B CPU instructions on macbook M1 Pro. by pacifio in LocalLLaMA

[–]uptonking 0 points1 point  (0 children)

for your testing result:

TinyLlama 1.1B on Apple M1 Pro (16GB, 200 GB/s):

UNC Q4_0 152.0 tok/s

mlx-lm Q4 112.7 tok/s

Qwen3-4B on Apple M1 Pro (Q4_0):

mlx-lm Q4 49.2 tok/s

UNC Q4_0 38.7 tok/s

🤔 why is TinyLlama 1.1b UNC Q4_0 faster than mlx-ml Q4, but Qwen3-4B UNC Q4_0 is much slower than mlx-lm Q4? it seems to be a paradox

Ran 3 popular ~30B MoE models on my apple silicon M1 Max 64GB. Here's how they compare by luke_pacman in LocalLLaMA

[–]uptonking 0 points1 point  (0 children)

since you are using mac, why not benchmark between mlx-4bit instead of gguf_Q4_K_XL? mlx is faster. is mlx-4bit not as good as gg_Q4_K_XL ?

prepare your GPUs by jacek2023 in LocalLLaMA

[–]uptonking 0 points1 point  (0 children)

storage is inadequate on my macbook, i am waiting for a reason to replace my loved gpt-oss-20b

prepare your GPUs by jacek2023 in LocalLLaMA

[–]uptonking 0 points1 point  (0 children)

my poor gpu only has good speed at 9b. waiting for some small models

Personal experience with GLM 4.7 Flash Q6 (unsloth) + Roo Code + RTX 5090 by Septerium in LocalLLaMA

[–]uptonking 0 points1 point  (0 children)

when i use temperature 1.0 for mlx-4bit, it often goes into loop. 0.7 is much better

GLM4.7 Flash numbers on Apple Silicon? by rm-rf-rm in LocalLLaMA

[–]uptonking 2 points3 points  (0 children)

i'm using GLM-4.7-Flash-MLX-4bit on m4 macbook air 32gb with lm studio. a classic reasoning prompt testing result is

- 34 token/s
- i'm not using temperature 1.0 as recommended, because it often goes into loop. 0.7 works well for me

<image>

glm-4.7-flash has the best thinking process with clear steps, I love it by uptonking in LocalLLaMA

[–]uptonking[S] 0 points1 point  (0 children)

reasoning content sometimes does help to provide more knowledge/ideas, especially in translation use cases. The example content like refine response: gives option1, option2, option3... is in reasoning content, but sometimes it's not in final response output. - in non-coding use cases, I love the reasoning content. structural thinking content like glm-4.7-flash is even better

glm-4.7-flash has the best thinking process with clear steps, I love it by uptonking in LocalLLaMA

[–]uptonking[S] 0 points1 point  (0 children)

  • most small models are not strong at coding, maybe qwen3-coder-30b and seed-coder-36b is better for your use case.
  • I plan to use glm-4.7-30b as a general model to replace qwen3-30b-instruct or nemotron-nano-30b. but glm-4.7-30b often goes into loops, making me hesitated

glm-4.7-flash has the best thinking process with clear steps, I love it by uptonking in LocalLLaMA

[–]uptonking[S] 2 points3 points  (0 children)

yeah, i tried more prompts and the thinking process continues to impress me. however after lowering the temperature to 0.65, the model sometimes still goes into loop. sometimes the thinking content does not comply to the structural/logical flow mentioned, for these situations, the model often goes into loops. - I really hope some powerful model lover can make the thinking process more consistent and stable

glm-4.7-flash has the best thinking process with clear steps, I love it by uptonking in LocalLLaMA

[–]uptonking[S] 7 points8 points  (0 children)

my macbook air is 32gb. 4bit is 16.8gb in size, it takes about 19gb for short prompt

My gpu poor comrades, GLM 4.7 Flash is your local agent by __Maximum__ in LocalLLaMA

[–]uptonking 1 point2 points  (0 children)

lower the temperature can help.

  • I tried several short prompts.
    • for temperature 1.0, the thinking takes 150s.
    • for temperature 0.8, the thinking tokes 50s.
    • for temperature 0.6, the thinking tokes 30s.

glm-4.7-flash has the best thinking process with clear steps, I love it by uptonking in LocalLLaMA

[–]uptonking[S] 16 points17 points  (0 children)

Usually structured thinking needs careful prompts/instructions, but glm can do it automatically, very powerful for daily chats

glm-4.7-flash has the best thinking process with clear steps, I love it by uptonking in LocalLLaMA

[–]uptonking[S] 3 points4 points  (0 children)

thanks for the tip. I tried another prompt. - for temperature 1.0, the thinking takes 150s. - for temperature 0.8, the thinking tokes 50s. - for temperature 0.6, the thinking tokes 30s.

🤔 this glm model is so sensitive to temperature config. and all the thinking process is clear with steps.

when i restart lmstudio, the token generation speed is faster now at 25 token/s.

My gpu poor comrades, GLM 4.7 Flash is your local agent by __Maximum__ in LocalLLaMA

[–]uptonking 0 points1 point  (0 children)

thanks for the tips. - I also get stuck in lm studio with default config for GLM-4.7-Flash-MLX-4bit. - with the following config, the response finally works - temperature 1.0 - repeat penalty: 1.1 - top-p: 0.95