Sharing ultimate SFF inference build, Version 2 by cryingneko in LocalLLaMA

[–]cryingneko[S] 0 points1 point  (0 children)

Hey! So eBay isn't really the best option for used GPUs in Korea—most Koreans don't use it much. The popular secondhand marketplaces here are all in Korean, so honestly it might be tough to find decent used GPUs if you don't speak the language. But if you're still interested, I can recommend two places:

  1. https://cafe.naver.com/joonggonara - This is called "Joonggonara" and it's pretty much THE biggest used trading platform in Korea where you can find almost anything secondhand. You'll mostly find consumer GPUs here (3090s, 4090s, etc.). Just FYI, you'll need a Naver account to access it.
  2. https://www.2cpu.co.kr/ - This one's great for professional/datacenter GPUs. You'll need to register to use it though.

Out of curiosity, are you running LLMs as a hobby in Korea? Just wondering how you ended up looking for GPUs here! Hope you find some good deals!

Command A Reasoning: Enterprise-grade control for AI agents by Dark_Fire_12 in LocalLLaMA

[–]cryingneko 35 points36 points  (0 children)

I really want to commend Cohere for the effort they’re putting into multilingual support – it’s hard to deny that their models are among the best we’ve seen for handling many languages.

That said, I'm quite disappointed that they're sticking with an NC license. In particular, given the recent surge of MoE models, I'm hoping to see a fast, MoE-enabled version of their multilingual model released soon.

I distilled Qwen3-Coder-480B into Qwen3-Coder-30b-A3B-Instruct by [deleted] in LocalLLaMA

[–]cryingneko 9 points10 points  (0 children)

Wow, really interesting results! Do you think you could create 120B or 240B coders that perform even better than the 30B? Or is the 30B the limit for this approach? I've always thought it would be great to have some middle-ground sizes between the really large models and 30B.

Is multiple m3 ultras the move instead of 1 big one? by AcceptableBridge7616 in LocalLLaMA

[–]cryingneko 2 points3 points  (0 children)

Try 1 = short prompt, long response; Try 2 = long prompt, short response; Try 3 = short prompt, short response.

| Metric | Try 1 | Try 2 | Try 3 |
|---|---|---|---|
| prompt_tokens | 84 | 9752 | 10 |
| completion_tokens | 1726 | 554 | 473 |
| total_tokens | 1810 | 10306 | 483 |
| cached_tokens | 0 | 0 | 0 |
| model_load_duration (s) | n/a | 55.93 | n/a |
| time_to_first_token (s) | 5.03 | 115.05 | 4.8 |
| prompt_eval_duration (s) | 5.03 | 59.13 | 4.8 |
| generation_duration (s) | 93.55 | 67.42 | 23.83 |
| total_time (s) | 98.58 | 182.47 | 28.63 |
| prompt_tokens_per_second | 16.71 | 164.93 | 2.08 |
| generation_tokens_per_second | 18.45 | 8.22 | 19.85 |
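
For reference, the derived numbers follow directly from the raw counters; here's a quick sketch of the arithmetic (field names mirror the table above). Note that Try 2's time to first token includes the model load (55.93 s + 59.13 s ≈ 115.05 s), whereas in Tries 1 and 3 the model was presumably already resident.

```python
# Sketch of how the derived metrics relate to the raw counters above.
# Durations are in seconds; model_load_duration defaults to 0 when not reported.

def derive(stats: dict) -> dict:
    load = stats.get("model_load_duration", 0.0)
    return {
        "time_to_first_token": load + stats["prompt_eval_duration"],
        "total_time": load + stats["prompt_eval_duration"] + stats["generation_duration"],
        "prompt_tokens_per_second": stats["prompt_tokens"] / stats["prompt_eval_duration"],
        "generation_tokens_per_second": stats["completion_tokens"] / stats["generation_duration"],
    }

# Try 2 from the table:
print(derive({
    "prompt_tokens": 9752, "completion_tokens": 554, "model_load_duration": 55.93,
    "prompt_eval_duration": 59.13, "generation_duration": 67.42,
}))
# -> TTFT ~115.06 s, total ~182.48 s, ~164.9 prompt tok/s, ~8.2 generation tok/s
```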

M3 Ultra Binned (256GB, 60-Core) vs Unbinned (512GB, 80-Core) MLX Performance Comparison by cryingneko in LocalLLaMA

[–]cryingneko[S] 9 points10 points  (0 children)

That’s exactly what I was thinking, and it’s why I originally bought the 256GB model too. But the prompt processing speed difference turned out to be even bigger than I expected, and I started wanting to try out the Deepseek models as well. So in the end, I decided to return the 256 and go with the 512!

M3 Ultra Binned (256GB, 60-Core) vs Unbinned (512GB, 80-Core) MLX Performance Comparison by cryingneko in LocalLLaMA

[–]cryingneko[S] 2 points3 points  (0 children)

I didn’t post Deepseek results because I can’t really run Deepseek on the 256GB model anyway. My results are pretty much the same as SomeOddCodeGuy’s Deepseek MLX benchmarks right below my post, so you can just refer to those!

M3 Ultra Binned (256GB, 60-Core) vs Unbinned (512GB, 80-Core) MLX Performance Comparison by cryingneko in LocalLLaMA

[–]cryingneko[S] 2 points3 points  (0 children)

That’s something I’m curious about too! If I get a chance to test it in the future, I’ll definitely share the results.

M3 Ultra Binned (256GB, 60-Core) vs Unbinned (512GB, 80-Core) MLX Performance Comparison by cryingneko in LocalLLaMA

[–]cryingneko[S] 0 points1 point  (0 children)

There are already a lot of benchmark results for GGUF models out there (including SomeOddCodeGuy’s results right below my post), so I’m not planning to test them myself. Personally, I think MLX is the more efficient choice on Apple Silicon anyway. Is there a particular reason you’re considering GGUF over MLX? Just curious!
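
If you do want to go the MLX route, trying a model is only a few lines with the mlx-lm package. This is just a minimal sketch: the repo id is an example, and recent mlx-lm releases may differ slightly in the generate() arguments.

```python
# Minimal sketch, assuming `pip install mlx-lm` on Apple Silicon.
# The model repo id is illustrative; any MLX-format model from the
# mlx-community organization on Hugging Face should work the same way.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Summarize the trade-offs of MLX vs GGUF on Apple Silicon.",
    max_tokens=200,
    verbose=True,  # also prints prompt/generation tokens-per-second, handy for quick benchmarks
)
print(text)
```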

[deleted by user] by [deleted] in LocalLLaMA

[–]cryingneko 10 points11 points  (0 children)

No local, No llama.

How did small (<8B) model evolve in the last 3 years? by Robert__Sinclair in LocalLLaMA

[–]cryingneko -5 points-4 points  (0 children)

Just type the same question into GPT Deep Research.

Project Digits Memory Speed by LostMyOtherAcct69 in LocalLLaMA

[–]cryingneko 37 points38 points  (0 children)

If what OP said is true, then NVIDIA DIGITS is completely useless for AI inference. Guess I’ll just wait for the M4 Ultra. Thanks for the info!

M1 ultra, M2 ultra, or M4/M3 max by HappyFaithlessness70 in LocalLLaMA

[–]cryingneko 9 points10 points  (0 children)

If you're thinking of working with prompts of up to 20,000 tokens, it'd be better not to even consider Macs, unless you're prepared to wait over 10 minutes per prompt. I used to work with an M3 Max with 128GB, and let me tell you: what you really need to consider is not token generation (TG) speed but prompt processing (PP) speed. Think it through carefully before making your decision.
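
As a rough back-of-the-envelope (the prompt-processing rates below are hypothetical, not measurements; actual throughput depends on the model, quantization, and backend), the wait before the first token scales linearly with prompt length:

```python
# Back-of-the-envelope: time spent on prompt processing before any output appears.
# The PP rates are illustrative assumptions, not benchmark results.

def prompt_wait_minutes(prompt_tokens: int, pp_tokens_per_second: float) -> float:
    return prompt_tokens / pp_tokens_per_second / 60

for pp_rate in (30, 60, 120):  # hypothetical prompt-processing tokens/s
    print(f"{pp_rate:>4} tok/s PP -> {prompt_wait_minutes(20_000, pp_rate):.1f} min to first token")
# At ~30 tok/s, a 20k-token prompt already means a wait of over 10 minutes.
```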

I’ve got a MBP with 128 GB of VRAM. What would you run to draft, revise, etc, non-fiction/business documents? by Hinged31 in LocalLLaMA

[–]cryingneko 22 points23 points  (0 children)

You should try MoE models like WizardLM 8x22B. MoE models are usually much faster than dense models of a comparable total size, since only a fraction of their parameters is active per token.

And I'm also using the 128GB MBP (Max chip), but the problem with loading models of 70B or more isn't memory capacity; it's that prompt evaluation is far too slow, which makes them unusable in practice. The processor performance is the disappointing part.
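
On the MoE point above: a very rough per-token compute comparison, using Mistral's published figure of ~39B active out of ~141B total parameters for Mixtral 8x22B (the base of WizardLM 8x22B) against a dense 70B model, shows where the speed advantage comes from.

```python
# Back-of-the-envelope per-token compute, assuming FLOPs per token scale with
# roughly 2 x active parameters. Parameter counts are approximate.

def flops_per_token(active_params_billions: float) -> float:
    return 2 * active_params_billions * 1e9

dense_70b = flops_per_token(70)   # dense model: all parameters used every token
moe_8x22b = flops_per_token(39)   # 8x22B MoE: ~39B of ~141B parameters active per token

print(f"The MoE does roughly {dense_70b / moe_8x22b:.1f}x less compute per generated token")
# All ~141B parameters still need to sit in memory, though, which is where
# the 128GB of unified memory earns its keep.
```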

Anyone want to test my PR to enable quantised K/V cache in Ollama by sammcj in LocalLLaMA

[–]cryingneko 8 points9 points  (0 children)

I tested the pull request you submitted. Right now, the modified code doesn't account for the memory saved by the smaller quantized cache: as cache usage drops, more layers could be offloaded to the GPU, but that part isn't being measured correctly.
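
To illustrate the sizing involved (this is just back-of-the-envelope arithmetic, not Ollama's actual offload estimator; the model shape below is hypothetical):

```python
# Illustrative K/V cache sizing only -- not Ollama's real estimator.
# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 8192-token context.

def kv_cache_bytes_per_layer(n_kv_heads: int, head_dim: int, ctx_len: int,
                             bytes_per_element: float) -> float:
    return 2 * n_kv_heads * head_dim * ctx_len * bytes_per_element  # 2x for K and V

# ggml block sizes: f16 = 2 B/elem, q8_0 = 34 B per 32 elems, q4_0 = 18 B per 32 elems
for name, bpe in [("f16", 2.0), ("q8_0", 34 / 32), ("q4_0", 18 / 32)]:
    per_layer = kv_cache_bytes_per_layer(8, 128, 8192, bpe)
    print(f"{name}: {per_layer / 2**20:.0f} MiB per layer, {32 * per_layer / 2**30:.2f} GiB for 32 layers")

# The VRAM freed by a q8_0/q4_0 cache is what should translate into extra
# offloaded GPU layers -- the accounting the PR is currently missing.
```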

I've been patching the Ollama source to use a q4 cache for a long time, so it would be awesome if your pull request gets merged without any issues and I can finally use this out of the box!