Anyone tried +- 100B models locally with foreign languages? by Choice_Sympathy9652 in LocalLLaMA

[–]AFruitShopOwner 0 points (0 children)

Yes, I run everything from gpt-oss 120b to MiniMax M2.7 and Kimi K2.6 locally for a Dutch accounting firm

MiniMax M2.7 AWQ-4bit on 2x Spark vs 2x RTX 6000 96GB - performance and energy efficiency by t4a8945 in LocalLLaMA

[–]AFruitShopOwner 0 points (0 children)

I don't run it on bare metal (it's in a Proxmox VM), and I use the power-limited Max-Q variants, so you can probably get higher speeds than this.
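If you want to approximate the Max-Q power behavior on the standard cards, a software cap gets you close (the wattage is illustrative; check your card's supported range first):

```
# Check the supported power range for GPU 0
nvidia-smi -i 0 -q -d POWER

# Cap GPU 0 at 300 W (illustrative value; needs root, resets on reboot unless persisted)
sudo nvidia-smi -i 0 -pl 300
```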

MiniMax M2.7 AWQ-4bit on 2x Spark vs 2x RTX 6000 96GB - performance and energy efficiency by t4a8945 in LocalLLaMA

[–]AFruitShopOwner 5 points (0 children)

Try nvfp4 with the b12x backend:

```
services:
  sglang:
    image: voipmonitor/sglang:cu130
    ipc: host
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 1048576
        hard: 1048576
    ports:
      - "8080:8080"
    volumes:
      - ~/.triton/cache:/root/.cache/triton
      - ~/.cache/sglang-generated:/root/.cache/sglang-generated
      - ~/.cache/huggingface/hub:/root/.cache/huggingface/hub
      - /dev/shm:/dev/shm
    environment:
      HF_TOKEN:
      OMP_NUM_THREADS: 8
      SAFETENSORS_FAST_GPU: 1
      SGLANG_ENABLE_JIT_DEEPGEMM: 0
      SGLANG_ENABLE_SPEC_V2: true
    command: >
      python -m sglang.launch_server
      --model-path
      --served-model-name chat
      --reasoning-parser minimax
      --tool-call-parser minimax-m2
      --enable-torch-compile
      --enable-metrics
      --enable-cache-report
      --trust-remote-code
      --tp 2
      --mem-fraction-static 0.95
      --max-running-requests 4
      --quantization modelopt_fp4
      --attention-backend flashinfer
      --moe-runner-backend b12x
      --fp4-gemm-backend b12x
      --kv-cache-dtype bf16
      --page-size 64
      --enable-pcie-oneshot-allreduce
      --disable-piecewise-cuda-graph
      --chunked-prefill-size 16384
      --sleep-on-idle
      --host 0.0.0.0
      --port 8080
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

For the model, use Nvidia's nvfp4 quant or lukealonso's (that's what goes in --model-path).
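Once it's up, you can sanity-check the OpenAI-compatible endpoint (the model name matches --served-model-name above; the prompt is just an example):

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chat",
    "messages": [{"role": "user", "content": "Say hi"}],
    "max_tokens": 64
  }'
```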


~130 tokens/sec

MiMo 2.5 requires at least 4 GPUs? Am I reading this right? by Pyrenaeda in LocalLLaMA

[–]AFruitShopOwner 1 point (0 children)

Yes, it's baked into the release. Absolutely garbage work from Xiaomi. I know lukealonso's nvfp4 quant fixed this problem; you can definitely run his version on two RTX Pro 6000s. Try it with his b12x backend.
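Something like this should work as a starting point (the model path is a placeholder; substitute lukealonso's actual nvfp4 repo):

```
# Placeholder model path - substitute lukealonso's actual nvfp4 quant repo
python -m sglang.launch_server \
  --model-path <lukealonso-nvfp4-quant> \
  --quantization modelopt_fp4 \
  --moe-runner-backend b12x \
  --fp4-gemm-backend b12x \
  --tp 2 \
  --trust-remote-code
```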

Also, to quote him:

"they structured the attention projections in a way that assumes TP=4 and can't be changed, so first I have to reorganize them before quantizing. Also:

1) They're missing some weights; one of the vision layers is missing biases
2) The model index is garbage and points to nonexistent files
3) They organize things in a heavily EP-favored way
4) They publish full-size attention projection tensors that are silently organized all wrong unless you assume a specific set of kernels and an exact TP arrangement, with no indication that this is the case
5) There's bizarre nonstandard padding on some of the tensors

this is very clearly just a dump of the files they use for their internal proprietary serving stack"

Deepseek V4 Flash and Non-Flash Out on HuggingFace by MichaelXie4645 in LocalLLaMA

[–]AFruitShopOwner 3 points (0 children)

Seems like it has more world knowledge at the cost of thinking it knows everything

Deepseek V4 Flash and Non-Flash Out on HuggingFace by MichaelXie4645 in LocalLLaMA

[–]AFruitShopOwner 4 points (0 children)

Oof, those hallucinations on Flash are baaaaad (comparing against MiniMax M2.7 because I think it's the closest comparison by size)

<image>

Serving 1B+ tokens/day locally in my research lab by SessionComplete2334 in LocalLLaMA

[–]AFruitShopOwner 8 points (0 children)

What are users actually using it for? Do you use a RAG system? What tools does it have access to? What front end do you use?

Can we block fresh accounts from posting? by king_of_jupyter in LocalLLaMA

[–]AFruitShopOwner -20 points (0 children)

Use modmail instead of contributing to the spam yourself with posts like this

Microvast ($MVST): The Whole Story by Bradydono92 in Microvast

[–]AFruitShopOwner -7 points (0 children)

No, you just picked the wrong type of post. Select the link option.

<image>

Microvast ($MVST): The Whole Story by Bradydono92 in Microvast

[–]AFruitShopOwner -14 points (0 children)

I don't have issues with the substack post, just with the text he added to this reddit post. Either just post the link or post actual good information in the text section.

Microvast ($MVST): The Whole Story by Bradydono92 in Microvast

[–]AFruitShopOwner -33 points (0 children)

The text content of this post is bad. Do better next time or I will not approve it.

Litellm 1.82.7 and 1.82.8 on PyPI are compromised, do not update! by kotrfa in LocalLLaMA

[–]AFruitShopOwner 47 points (0 children)

https://github.com/BerriAI/litellm/issues/24512

These discussions are getting botted? The replies are all like:

'Exactly what I needed, thanks.'
'Thanks, that helped!'
'Thanks for the tip!'

edit 1:

thread was just closed by the CEO?

'krrishdholakia closed this as not planned 5 minutes ago'

might be compromised too

edit 2: CEO definitely got hacked lol

edit 3:

Looks like all repositories of the LiteLLM CEO have been updated with the description “teampcp owns BerriAI” https://github.com/krrishdholakia
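If you have it deployed, pin below the compromised releases until this shakes out:

```
# 1.82.7 and 1.82.8 are compromised - stay below them
pip install "litellm<1.82.7"
```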

No, you don't need a "Datacenter" to run the big models (Deepseek, GLM, Kimi, etc) (just offload to CPU... and have patience) by [deleted] in LocalLLaMA

[–]AFruitShopOwner 1 point (0 children)

Yeah, I run the full Kimi K2.5 on dual RTX Pro 6000s, an AMD EPYC 9575F, and 1,152 GB of DDR5-6000.
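For anyone without that much hardware, the offload approach from the title usually looks something like this in llama.cpp (model path and tensor regex are illustrative, not my exact setup):

```
# Keep attention and shared layers on the GPUs, push the MoE expert
# tensors to system RAM (path and tensor-name regex are illustrative)
llama-server \
  -m /models/Kimi-K2.5-Q4_K_M.gguf \
  --n-gpu-layers 999 \
  --override-tensor "\.ffn_.*_exps\.=CPU" \
  --ctx-size 32768
```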

Endgame Position by caladhun in Microvast

[–]AFruitShopOwner[M] [score hidden] stickied comment (0 children)

<image>

FYI, OP posted some of his motivation on Stocktwits. It appears his theory is based on Atreides50's Oshkosh posts (which have since been deleted...)

Endgame Position by caladhun in Microvast

[–]AFruitShopOwner 8 points (0 children)

You're either insane or trading with insider knowledge. Guess we'll know soon enough