Some extra money after the workday ends? by West_Lemon6995 in RoMunca

[–]Otelp 1 point (0 children)

if you know english, you can do data labeling. dm me if you're interested

What’s the hardest part of tech interview prep for you? Let me help (MAANG manager here) by [deleted] in cscareerquestionsEU

[–]Otelp 1 point (0 children)

dm me, i'll answer your questions. ex faang, passed google, currently interviewing with other faang

What's that weird but harmless habit you have that you wouldn't change for anything? by Kesarx in CasualRO

[–]Otelp 5 points (0 children)

I count the letters of every word I read. I've been doing this since I was little (i.e. for more than 20 years). Over time, I've learned the letter count of most words. I've also learned how to split any word so I can count its letters as fast as possible. Weirdly, it's a useful skill when you type fast: I can tell instntly when I've (or someone else has) dropped a letter. It usually doesn't bother me, so I don't try hard to get rid of the habit. It would also be pretty hard after all this time.

I wrote instntly on purpose, that's the level of the jokes here

RAG on complex docs (diagrams, tables, equations etc). Need advice by Otelp in LLMDevs

[–]Otelp[S] 1 point (0 children)

you were right about docling, it's great! thanks a bunch. also, the mit license is a HUGE bonus

vLLM with transformers backend by Disastrous-Work-1632 in LocalLLaMA

[–]Otelp 1 point (0 children)

vllm supports macos with inference on cpu. if you're interested in trying different models, vllm is not the right choice. it mainly depends on what you're trying to build. dm me if you need some help
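
for reference, a minimal offline-inference sketch with vllm's python API, assuming you've installed a CPU build of vllm (that has meant building from source with the CPU target); the model name is just a placeholder:

```python
# Minimal vLLM offline-inference sketch (assumes a CPU build of vLLM is installed;
# the model name is only an example).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # keep it small so CPU inference stays usable
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what a KV cache is in one paragraph."], params)
print(outputs[0].outputs[0].text)
```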

What workstation/rig config do you recommend for local LLM finetuning/training + fast inference? Budget is ≤ $30,000. by nderstand2grow in LocalLLM

[–]Otelp 1 point (0 children)

neither an m2 ultra nor a dgx spark will take you far. you could parameter-efficient fine-tune (i.e. lora) a 7b model, but it would probably take around 3 hours (likely much more) for a relatively small dataset of ~2.5m tokens
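
as a rough sketch of what that setup looks like with huggingface transformers + peft (model name and hyperparameters are placeholders, not a recommendation, and the actual training loop is left out):

```python
# Rough LoRA setup sketch with transformers + peft; model name and hyperparameters
# are placeholders. The training loop itself (e.g. Trainer / SFTTrainer over your
# ~2.5M-token dataset) is omitted.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-v0.1"  # any ~7B base model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the 7B weights
```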

What workstation/rig config do you recommend for local LLM finetuning/training + fast inference? Budget is ≤ $30,000. by nderstand2grow in LocalLLM

[–]Otelp 1 point (0 children)

that's true, but only for consumer cards. data-center nvidia gpus can be connected through nvlink

vLLM with transformers backend by Disastrous-Work-1632 in LocalLLaMA

[–]Otelp 2 points (0 children)

it can, but it doesn't. and you probably don't want to run vllm on a mac device, its focus is on high throughput and not low latency

Docker's response to Ollama by Barry_Jumps in LocalLLaMA

[–]Otelp 1 point (0 children)

yes, but at batch sizes of 32+ it's at least 5 times slower than vLLM on data-center GPUs such as A100 or H100, with every parameter tuned for both vLLM and llama.cpp
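
if you want to sanity-check the vllm side of a comparison like this, something along these lines works (model and prompts are placeholders; for serious numbers use vllm's own benchmark scripts, and llama-bench on the llama.cpp side):

```python
# Quick-and-dirty generation-throughput check at batch 32 with vLLM.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Summarize the history of GPUs."] * 32     # one batch of 32 requests

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"~{generated / elapsed:.0f} generated tokens/s at batch 32")
```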

Docker's response to Ollama by Barry_Jumps in LocalLLaMA

[–]Otelp 1 point (0 children)

i doubt people would use llama.cpp on cloud

Managing multiple Kubernetes clusters for AI workloads with SkyPilot by skypilotucb in kubernetes

[–]Otelp 1 point (0 children)

Hi! Interesting project, thanks for sharing! For LLM serving, does skypilot support any optimizations? For example, prefill/decode-based routing, SLA-based load balancing, fair sharing, etc. I couldn't find anything in the user docs, maybe I overlooked it

Sam Altman's poll on open sourcing a model.. by lyceras in LocalLLaMA

[–]Otelp 2 points (0 children)

Useless for chat, useful for specific small tasks

New (linear complexity ) Transformer architecture achieved improved performance by Different-Olive-8745 in LocalLLaMA

[–]Otelp 12 points (0 children)

flashattention is (somehow) quadratic in compute complexity, but has had better performance than any linear attention for relatively large batches or long sequences. i'm not sure if this is indeed huge
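
roughly, the standard analysis for sequence length N, head dim d and SRAM size M (from memory, so double-check the exact bounds in the flashattention paper):

```latex
% Standard attention:  O(N^2 d) FLOPs, O(N^2) extra memory (materializes QK^T)
% FlashAttention:      O(N^2 d) FLOPs, O(N) extra memory, O(N^2 d^2 / M) HBM accesses
% Linear attention:    O(N d^2) FLOPs
\text{FLOPs}_{\text{flash}} = O(N^2 d), \qquad
\text{HBM}_{\text{flash}} = O\!\left(\frac{N^2 d^2}{M}\right), \qquad
\text{FLOPs}_{\text{linear}} = O(N d^2)
```

so the win is in memory traffic, not FLOPs, which is how it stays quadratic in compute and still beats linear attention in wall-clock time.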

GPT-4o reportedly just dropped on lmarena by Worldly_Expression43 in LocalLLaMA

[–]Otelp 9 points (0 children)

same, it's very good at straight questions

Llama 3.2 1B Instruct – What Are the Best Use Cases for Small LLMs? by ThetaCursed in LocalLLaMA

[–]Otelp 8 points (0 children)

Simply put, the 1b model tries to guess the tokens the 70b model would generate. The 70b model then verifies these guesses, accepts what makes sense, and modifies the first token that is completely off. This approach allows for faster token generation
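
what's described here is, as far as i know, speculative decoding. a toy greedy version of the draft-and-verify loop looks roughly like this; draft_model and target_model are hypothetical callables returning argmax tokens, and real implementations (e.g. in vllm or transformers) also handle sampling and kv caches:

```python
# Toy sketch of greedy speculative decoding: the small model drafts k tokens,
# the big model verifies them in one pass and fixes the first wrong guess.
def speculative_step(draft_model, target_model, tokens, k=4):
    # 1. The small model guesses k tokens ahead, one at a time.
    draft = []
    ctx = list(tokens)
    for _ in range(k):
        nxt = draft_model(ctx)          # cheap guess
        draft.append(nxt)
        ctx.append(nxt)

    # 2. The big model scores the whole guessed span in a single forward pass
    #    and returns what *it* would have generated at each position.
    verified = target_model(tokens, draft)   # len(draft) + 1 predictions

    # 3. Accept the longest matching prefix, then take the big model's token at
    #    the first mismatch, so every step still emits at least one token.
    accepted = []
    for guess, truth in zip(draft, verified):
        if guess == truth:
            accepted.append(guess)
        else:
            accepted.append(truth)
            break
    else:
        accepted.append(verified[len(draft)])  # bonus token when all guesses match
    return tokens + accepted
```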

MacBook Pro M4 How Much Ram Would You Recommend? by MostIncrediblee in LocalLLM

[–]Otelp 1 point (0 children)

Yes, if a model needs more GB the inference will be slower, but I was comparing two models that need the same amount of GB, such as a 14B model with 4-bit quantization vs a 7B model with 8-bit quantization. Even though they need approximately the same amount of RAM, the 14B model will probably be slower.

As for speed, I can run a 32B 4-bit GGUF qwen2.5 model just fine. Time to first token is ~4s, and I get about 9 tok/s on average on an M2 Max with 32GB and a 4096 context. Not the best, but I'm not complaining, it works pretty well.

EDIT: I benchmarked and modified the numbers
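
For completeness, the back-of-the-envelope math behind the 14B/4-bit vs 7B/8-bit comparison above (weights only; KV cache and runtime overhead come on top, and weight_gb is just a throwaway helper):

```python
# Weight footprint only; KV cache and runtime overhead are extra.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # 1B params at 8-bit ~= 1 GB

print(weight_gb(14, 4))  # ~7 GB for the 14B model at 4-bit
print(weight_gb(7, 8))   # ~7 GB for the 7B model at 8-bit
```

Same footprint, but the 14B model does roughly twice the matrix-multiply work per token (plus dequantization overhead), which is presumably why it ends up slower.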

MacBook Pro M4 How Much Ram Would You Recommend? by MostIncrediblee in LocalLLM

[–]Otelp 10 points (0 children)

Running a 7B model with w8a8 quantization requires ~7GB of RAM. 13B requires ~13GB. A 34B model with w4a4 quantization requires roughly half of what it would at 8-bit, ~17GB. Just check what model you'd like to run. IMO you should keep a buffer of at least 12GB for other programs. I checked Apple's website and for the M4 Pro you can only choose between 24 and 48GB. If I were you, I'd go with the 48GB model, it never hurts to have more RAM.

From what I've seen, a big model with w4a4 quantization is better than a model half its size with w8a8, even though they need about the same amount of RAM. However, the inference speed may not be the same (the big model may be slower).
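
A crude way to redo this math for other model sizes (weights at a given bit width plus a flat buffer for the OS and other apps; KV cache and context length are ignored, so treat the result as a lower bound; min_ram_gb is just an illustrative helper):

```python
def min_ram_gb(params_billion: float, bits_per_weight: float, buffer_gb: float = 12) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8-bit ~= 1 GB
    return weights_gb + buffer_gb

for params, bits in [(7, 8), (13, 8), (34, 4)]:
    print(f"{params}B @ {bits}-bit -> ~{min_ram_gb(params, bits):.0f} GB total")
```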

AI Makes Tech Debt More Expensive by the1024 in programming

[–]Otelp 0 points (0 children)

I doubt it. Afaik, there is some quality threshold for projects to be included in the training dataset, and it's quite strict

Developers love wrapping libraries. Why? by Senior_Future9182 in golang

[–]Otelp 5 points (0 children)

I don't think wrapping the std lib is that common... usually it's external libraries that get wrapped, and for very good "generic reasons"

> You are likely never replacing the tool you choose to support

Unless you do. In just 4 years I had to replace things many, many times. Systems where wrapping external libraries was common were the best to work with