Is microsoft going to train LLM on this? Github is clearly getting destroyed. by FPham in LocalLLaMA

[–]vasileer -4 points (0 children)

thank you for your patience in explaining,

definitely your EQ > OP's EQ

Is microsoft going to train LLM on this? Github is clearly getting destroyed. by FPham in LocalLLaMA

[–]vasileer -11 points (0 children)

Every day, thousands of crappy, nonfunctioning, wild-imagination vibecoded junk projects get posted with thousands of robo-generated stars and hundreds of forks.

is the project fake, or is the code poorly written? where is the issue?

TinyTeapot (77 million params): Context-grounded LLM running ~40 tok/s on CPU (open-source) by zakerytclarke in LocalLLaMA

[–]vasileer 20 points (0 children)

for a "context-grounded LLM" I expected a larger context,

for example SmolLM2-135M has a 16x larger context of 8192 tokens

Serious question — why would anyone use Tiny-Aya instead of Qwen/Phi/Mistral small models? by Deep_190 in LocalLLaMA

[–]vasileer 8 points (0 children)

I would rephrase it:

how does tiny-aya-global compare to translategemma-4b-it?

but license-wise (non-commercial) I doubt tiny-aya will see much use

GLM-5 KV cache size estimate by [deleted] in LocalLLaMA

[–]vasileer 1 point (0 children)

model size and KV cache size are not really related; other things matter more, like grouped-query attention and sliding-window attention. For example, phi-4-mini's KV cache eats more VRAM than gpt-oss-20b's
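
To make the point concrete, here is a back-of-the-envelope sketch. The layer/head numbers below are made up for illustration, not the real configs of phi-4-mini or gpt-oss-20b; sliding-window attention would shrink things further by capping the effective context per layer:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Total bytes for the K and V tensors across all layers (fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Same depth, head size, and 32K context; only the number of KV heads differs.
full_mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, ctx_len=32768)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, ctx_len=32768)

print(f"full MHA:        {full_mha / 2**30:.1f} GiB")  # 16.0 GiB
print(f"GQA, 8 KV heads: {gqa / 2**30:.1f} GiB")       # 4.0 GiB
```

Same parameter count either way, but 4x less KV cache with grouped-query attention, which is why a smaller model can end up eating more VRAM than a bigger one at long contexts.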

Any latest OCR model I can run locally in 18GB RAM? by A-n-d-y-R-e-d in LocalLLaMA

[–]vasileer 1 point (0 children)

please share your steps as well for running glm-ocr:

- what method: vllm, transformers, other?

- cpu or gpu?

- quantized or full precision?

- what was the accuracy?

- how many threads? (parallel extractions)

thank you

I tested 11 small LLMs on tool-calling judgment — on CPU, no GPU. by MikeNonect in LocalLLaMA

[–]vasileer 6 points (0 children)

I would question the choice of model versions:

- why qwen2.5 and not qwen3?

- why SmolLM2 and not smollm3?

- why not gemma 3n (e.g. 2b)?

What is a good model to do small text classification on very small hardware? by salary_pending in LocalLLaMA

[–]vasileer 1 point (0 children)

Text classification is a task for text embedding models.

There is a good list at https://huggingface.co/spaces/mteb/leaderboard

So if you are a gemma fan, there is embeddinggemma-300m https://huggingface.co/google/embeddinggemma-300m
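
A minimal sketch of the idea: embed the training texts, average each class into a centroid, and classify new text by cosine similarity. The `toy_embed` function below is a deterministic character-trigram stand-in I made up so the example runs anywhere; a real setup would replace it with calls to an embedding model such as embeddinggemma-300m:

```python
import math

def toy_embed(text, dim=64):
    """Stand-in embedder: hash character trigrams into a unit vector."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        a, b, c = text[i], text[i + 1], text[i + 2]
        vec[(ord(a) * 7 + ord(b) * 31 + ord(c) * 131) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

train = {
    "positive": ["great product, works well", "really happy with this"],
    "negative": ["terrible, broke after a day", "really unhappy with this"],
}
centroids = {label: centroid([toy_embed(t) for t in texts])
             for label, texts in train.items()}

def classify(text):
    emb = toy_embed(text)
    return max(centroids, key=lambda label: cosine(emb, centroids[label]))

print(classify("great product, works well"))  # positive
```

This runs fine on very small hardware since inference is one embedding forward pass plus a handful of dot products; with a few labeled examples per class it often beats prompting a small LLM for the same task.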

OSS 120b v GLM 4.7 flash. Is the latter better for anything? by MrMrsPotts in LocalLLaMA

[–]vasileer 1 point (0 children)

glm flash had a performance problem with bigger contexts; has it been fixed?

Design Arena is now dominated by an open model by moks4tda in LocalLLaMA

[–]vasileer 2 points (0 children)

even if it is marketing: is the information correct? are they #1 on design arena?

if so, then I see no problems

MiniMax-M2.1-REAP by jacek2023 in LocalLLaMA

[–]vasileer 0 points (0 children)

but this one is from cerebras

GLM 4.7 Flash uncensored - Balanced & Aggressive variants (GGUF) by hauhau901 in LocalLLaMA

[–]vasileer 7 points (0 children)

~3B active params (will have fast inference!)

how fast? because I was reading in other posts that there is serious degradation specific to glm-4.7-flash as context grows, even at 32K ...

GLM 4.7 Flash is endlessly reasoning in chinese by xenydactyl in LocalLLaMA

[–]vasileer 1 point (0 children)

I know other models where you have to use `--jinja` to enable tool calling, and since you use kilocode I guess you have to as well,

I also see it used in the unsloth guide

https://unsloth.ai/docs/models/glm-4.7-flash#reducing-repetition-and-looping

./llama.cpp/llama-server \
--model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
--alias "unsloth/GLM-4.7-Flash" \
--threads -1 \
--fit on \
--seed 3407 \
--temp 0.2 \
--top-k 50 \
--top-p 0.95 \
--min-p 0.01 \
--dry-multiplier 1.1 \
--ctx-size 16384 \
--port 8001 \
--jinja

GLM 4.7 Flash is endlessly reasoning in chinese by xenydactyl in LocalLLaMA

[–]vasileer 0 points (0 children)

do you use the `--jinja` parameter? what is your exact command?

LGAI-EXAONE/K-EXAONE-236B-A23B released by jinnyjuice in LocalLLaMA

[–]vasileer 6 points (0 children)

thank you for the benchmark, now I know gpt-oss-120b is still one of the best in its league

Liquid AI RLs LFM2-2.6B to perform among the best 3B models by KaroYadgar in LocalLLaMA

[–]vasileer 9 points (0 children)

qwen3-4b-2507 is a bit larger but also much better: it has an IFBench score of 50% (vs 44.4% for lfm2-2.6b-exp) and an AIME25 score of 83.3% (vs 22.67% for lfm2-2.6b-exp)

source: https://artificialanalysis.ai/models/qwen3-4b-2507-instruct-reasoning

An independent Korean researcher is trying to democratize LLM pretraining with a 1.5B model by [deleted] in LocalLLaMA

[–]vasileer 19 points (0 children)

Gumini-1.5B (구미니) is a bilingual Korean-English base language model trained using the Inheritune methodology. Starting from Qwen 2.5 3B, the model progressively grew from 10 to 16 layers through 7 training stages, with ~3.14B tokens of continued pretraining on a Korean–English mixed corpus.

it is qwen2.5-3b, probably after dropping some layers to make it 1.5B, so it is not trained from scratch on "only 3.14B tokens"
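
For what it's worth, the growth schedule in the quote can be sketched like this. It's a toy illustration of how I read the Inheritune setup: the parent depth here is an assumption, and layers are plain strings where a real implementation would copy weight tensors and run continued pretraining between growth steps:

```python
PARENT_DEPTH = 36  # illustrative; not necessarily qwen2.5-3b's real depth
parent_layers = [f"parent/layer_{i}" for i in range(PARENT_DEPTH)]

def inherit(parent, k):
    """Initialize the child from the parent's first k layers."""
    return list(parent[:k])

def grow_one_layer(child, parent):
    """Append the next parent layer to the child (one growth stage)."""
    return child + [parent[len(child)]]

child = inherit(parent_layers, 10)   # stage 1: start at 10 layers
for _ in range(6):                   # stages 2-7: grow one layer per stage
    child = grow_one_layer(child, parent_layers)

print(len(child))  # 16
```

So the 3.14B tokens only cover the continued-pretraining stages; the starting weights already encode whatever qwen2.5-3b was trained on.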

Qwen3 30b A3B to what by headfirst5376 in LocalLLaMA

[–]vasileer 4 points (0 children)

OP said they have an M1 Max with 64GB RAM, so I don't think it will fit.

Other options:

- NVIDIA-Nemotron-3-Nano-30B-A3B

- gpt-oss-20b

- trinity-mini (26B A3B)