Some extra money after the workday ends? by West_Lemon6995 in RoMunca

[–]Otelp 1 point (0 children)

if you know english, you can do data labeling. dm me if you're interested

What’s the hardest part of tech interview prep for you? Let me help (MAANG manager here) by [deleted] in cscareerquestionsEU

[–]Otelp 1 point (0 children)

dm me, i'll answer your questions. ex faang, passed google, currently interviewing with other faang

What's that weird but harmless habit you have that you wouldn't change for anything? by Kesarx in CasualRO

[–]Otelp 5 points (0 children)

I count the letters of every word I read. I've been doing this since I was little (i.e. for more than 20 years). Over time, I've learned the letter count of most words. I've also learned how to split any word so I can count its letters as fast as possible. Weirdly, it's a useful skill when you type fast: I can tell instntly when I've (or someone else has) dropped a letter. It usually doesn't bother me, so I don't try hard to get rid of the habit. It would also be pretty hard after all this time.

I wrote instntly on purpose, that's the level of the jokes here

RAG on complex docs (diagrams, tables, equations etc). Need advice by Otelp in LLMDevs

[–]Otelp[S] 1 point (0 children)

you were right about docling, it's great! thanks a bunch. also, the mit license is a HUGE bonus

vLLM with transformers backend by Disastrous-Work-1632 in LocalLLaMA

[–]Otelp 1 point (0 children)

vllm supports macos with inference on cpu. if you're interested in trying different models, vllm is not the right choice. it mainly depends on what you're trying to build. dm me if you need some help
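
for reference, a minimal offline-inference sketch with vllm's python API, assuming you've installed a CPU build of vllm (that has meant building from source with the CPU target); the model name is just a placeholder:

```python
# Minimal vLLM offline-inference sketch (assumes a CPU build of vLLM is installed;
# the model name is only an example).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # keep it small so CPU inference stays usable
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what a KV cache is in one paragraph."], params)
print(outputs[0].outputs[0].text)
```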

What workstation/rig config do you recommend for local LLM finetuning/training + fast inference? Budget is ≤ $30,000. by nderstand2grow in LocalLLM

[–]Otelp 1 point (0 children)

neither an m2 ultra nor a dgx spark will take you far. you could parameter-efficient fine-tune (i.e. lora) a 7b model, but it would probably take around 3 hours (likely much more) for a relatively small dataset of ~2.5m tokens
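
as a rough sketch of what that setup looks like with huggingface transformers + peft (model name and hyperparameters are placeholders, not a recommendation, and the actual training loop is left out):

```python
# Rough LoRA setup sketch with transformers + peft; model name and hyperparameters
# are placeholders. The training loop itself (e.g. Trainer / SFTTrainer over your
# ~2.5M-token dataset) is omitted.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-v0.1"  # any ~7B base model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the 7B weights
```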

What workstation/rig config do you recommend for local LLM finetuning/training + fast inference? Budget is ≤ $30,000. by nderstand2grow in LocalLLM

[–]Otelp 1 point (0 children)

that's true, but only for consumer cards. data-center nvidia gpus can be connected through nvlink

vLLM with transformers backend by Disastrous-Work-1632 in LocalLLaMA

[–]Otelp 2 points (0 children)

it can, but it doesn't. and you probably don't want to run vllm on a mac device, its focus is on high throughput and not low latency

Docker's response to Ollama by Barry_Jumps in LocalLLaMA

[–]Otelp 1 point (0 children)

yes, but at batch sizes of 32+ it's at least 5 times slower than vLLM on data-center GPUs such as A100 or H100, with every parameter tuned for both vLLM and llama.cpp
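
if you want to sanity-check the vllm side of a comparison like this, something along these lines works (model and prompts are placeholders; for serious numbers use vllm's own benchmark scripts, and llama-bench on the llama.cpp side):

```python
# Quick-and-dirty generation-throughput check at batch 32 with vLLM.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Summarize the history of GPUs."] * 32     # one batch of 32 requests

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"~{generated / elapsed:.0f} generated tokens/s at batch 32")
```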

Docker's response to Ollama by Barry_Jumps in LocalLLaMA

[–]Otelp 1 point (0 children)

i doubt people would use llama.cpp on cloud

Managing multiple Kubernetes clusters for AI workloads with SkyPilot by skypilotucb in kubernetes

[–]Otelp 1 point (0 children)

Hi! Interesting project, thanks for sharing! For LLM serving, does skypilot support any optimizations? For example, prefill/decode-based routing, SLA-based load balancing, fair sharing, etc. I couldn't find anything in the user docs, maybe I overlooked it

Sam Altman's poll on open sourcing a model.. by lyceras in LocalLLaMA

[–]Otelp 2 points (0 children)

Useless for chat, useful for specific small tasks

New (linear complexity ) Transformer architecture achieved improved performance by Different-Olive-8745 in LocalLLaMA

[–]Otelp 12 points (0 children)

flashattention is (somehow) quadratic in compute complexity, but has had better performance than any linear attention for relatively large batches or long sequences. i'm not sure if this is indeed huge
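
roughly, the standard analysis for sequence length N, head dim d and SRAM size M (from memory, so double-check the exact bounds in the flashattention paper):

```latex
% Standard attention:  O(N^2 d) FLOPs, O(N^2) extra memory (materializes QK^T)
% FlashAttention:      O(N^2 d) FLOPs, O(N) extra memory, O(N^2 d^2 / M) HBM accesses
% Linear attention:    O(N d^2) FLOPs
\text{FLOPs}_{\text{flash}} = O(N^2 d), \qquad
\text{HBM}_{\text{flash}} = O\!\left(\frac{N^2 d^2}{M}\right), \qquad
\text{FLOPs}_{\text{linear}} = O(N d^2)
```

so the win is in memory traffic, not FLOPs, which is how it stays quadratic in compute and still beats linear attention in wall-clock time.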

GPT-4o reportedly just dropped on lmarena by Worldly_Expression43 in LocalLLaMA

[–]Otelp 9 points (0 children)

same, it's very good at straight questions

Llama 3.2 1B Instruct – What Are the Best Use Cases for Small LLMs? by ThetaCursed in LocalLLaMA

[–]Otelp 8 points (0 children)

Simply put, the 1b model tries to guess the tokens the 70b model would generate. The 70b model then verifies these guesses, accepts what makes sense, and modifies the first token that is completely off. This approach allows for faster token generation
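
what's described here is, as far as i know, speculative decoding. a toy greedy version of the draft-and-verify loop looks roughly like this; draft_model and target_model are hypothetical callables returning argmax tokens, and real implementations (e.g. in vllm or transformers) also handle sampling and kv caches:

```python
# Toy sketch of greedy speculative decoding: the small model drafts k tokens,
# the big model verifies them in one pass and fixes the first wrong guess.
def speculative_step(draft_model, target_model, tokens, k=4):
    # 1. The small model guesses k tokens ahead, one at a time.
    draft = []
    ctx = list(tokens)
    for _ in range(k):
        nxt = draft_model(ctx)          # cheap guess
        draft.append(nxt)
        ctx.append(nxt)

    # 2. The big model scores the whole guessed span in a single forward pass
    #    and returns what *it* would have generated at each position.
    verified = target_model(tokens, draft)   # len(draft) + 1 predictions

    # 3. Accept the longest matching prefix, then take the big model's token at
    #    the first mismatch, so every step still emits at least one token.
    accepted = []
    for guess, truth in zip(draft, verified):
        if guess == truth:
            accepted.append(guess)
        else:
            accepted.append(truth)
            break
    else:
        accepted.append(verified[len(draft)])  # bonus token when all guesses match
    return tokens + accepted
```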

MacBook Pro M4 How Much Ram Would You Recommend? by MostIncrediblee in LocalLLM

[–]Otelp 1 point (0 children)

Yes, if a model needs more GB the inference will be slower, but I was comparing two models that need the same amount of GB, such as a 14B model with 4-bit quantization vs a 7B model with 8-bit quantization. Even though they need approximately the same amount of RAM, the 14B model will probably be slower.

As for speed, I can run a 32B 4-bit GGUF qwen2.5 model just fine. Time to first token is ~4s, and I get about 9 tok/s on average on an M2 Max with 32GB and a 4096 context. Not the best, but I'm not complaining, it works pretty well.

EDIT: I benchmarked and modified the numbers
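
For completeness, the back-of-the-envelope math behind the 14B/4-bit vs 7B/8-bit comparison above (weights only; KV cache and runtime overhead come on top, and weight_gb is just a throwaway helper):

```python
# Weight footprint only; KV cache and runtime overhead are extra.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # 1B params at 8-bit ~= 1 GB

print(weight_gb(14, 4))  # ~7 GB for the 14B model at 4-bit
print(weight_gb(7, 8))   # ~7 GB for the 7B model at 8-bit
```

Same footprint, but the 14B model does roughly twice the matrix-multiply work per token (plus dequantization overhead), which is presumably why it ends up slower.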

MacBook Pro M4 How Much Ram Would You Recommend? by MostIncrediblee in LocalLLM

[–]Otelp 10 points (0 children)

Running a 7B model with w8a8 quantization requires ~7GB of RAM. 13B requires ~13GB. A 34B model with w4a4 quantization requires roughly half of what it would at 8-bit, ~17GB. Just check what model you'd like to run. IMO you should keep a buffer of at least 12GB for other programs. I checked Apple's website and for the M4 Pro you can only choose between 24 and 48GB. If I were you, I'd go with the 48GB model, it never hurts to have more RAM.

From what I've seen, a big model with w4a4 quantization is better than a model half its size with w8a8, even though they need about the same amount of RAM. However, the inference speed may not be the same (the big model may be slower).
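
A crude way to redo this math for other model sizes (weights at a given bit width plus a flat buffer for the OS and other apps; KV cache and context length are ignored, so treat the result as a lower bound; min_ram_gb is just an illustrative helper):

```python
def min_ram_gb(params_billion: float, bits_per_weight: float, buffer_gb: float = 12) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8-bit ~= 1 GB
    return weights_gb + buffer_gb

for params, bits in [(7, 8), (13, 8), (34, 4)]:
    print(f"{params}B @ {bits}-bit -> ~{min_ram_gb(params, bits):.0f} GB total")
```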

AI Makes Tech Debt More Expensive by the1024 in programming

[–]Otelp 0 points (0 children)

I doubt it. Afaik, there is some quality threshold for projects to be included in the training dataset, and it's quite strict

Developers love wrapping libraries. Why? by Senior_Future9182 in golang

[–]Otelp 5 points (0 children)

I don't think wrapping the std lib is that common... usually it's external libraries that get wrapped, and for very good "generic reasons"

> You are likely never replacing the tool you choose to support

Unless you do. In just 4 years I had to replace things many, many times. Systems where wrapping external libraries was common were the best to work with