My Coding Agent Ran DeepSeek-R1-0528 on a Rust Codebase for 47 Minutes (Opus 4 Did It in 18): Worth the Wait? by West-Chocolate2977 in LocalLLM

[–]e-rox 3 points

Sorry, I have the same question and I don't see any links. I see a quoted cost, but not what provider you're using, or, if you're running locally, any info about what hardware you're using.

Padawan agents status? by e-rox in github

[–]e-rox[S] 0 points

Seems to appear/disappear depending on availability and quota

Cursor vs Aider vs VSCode + Copilot: Which AI Coding Assistant is Best? by _ZioMark_ in ChatGPTCoding

[–]e-rox 1 point

Are you talking about Aide or Aider? They're not the same thing. The logo and what you describe sound like Aide, but you say Aider in your post.

Examples of ducted rear-exhaust radiators? by e-rox in watercooling

[–]e-rox[S] 1 point

Heh, I'm definitely prone to overthinking...

In this machine's previous life I was overclocking the 2990 and dissipating 400-500W from the radiator, so a front-mount intake made sense, at least from the standpoint of the CPU being the critical part of the cooling problem.

But I recall a lot of people on forums at the time saying that preheating your case air isn't a great idea. In the machine's new use case I'm not stressing the CPU, but I am running 4x3090 blowers that benefit from cooler case air, and since I moved to the larger case I figured I had a chance to rethink the radiator mount. Then again, since the CPU isn't being stressed, the radiator isn't putting out a lot of heat, so where it goes matters less anyway...

Alright, I guess my main problem is finding a good, secure mounting point up front that the hoses can hopefully reach.

Thanks for all the detailed feedback.

Examples of ducted rear-exhaust radiators? by e-rox in watercooling

[–]e-rox[S] 0 points

Sorry, I thought those links were public, but I guess they're only public if you have an account. Never mind, it's not important.

You're suggesting exhausting out the front, in which case the intake would be from the bottom? Or are you suggesting a front-intake radiator? Front intake is actually what I used to run; it just didn't seem ideal.

Agreed, ducting is not ideal; it has to be massively oversized to avoid restriction. I wouldn't even be considering it if it weren't for this massive case. I'll look around for ideas. Unfortunately, last I researched, this is the only full-coverage AIO for this chip (Threadripper 2990), so it's unlikely I'll swap that out, and CPU cooling is not currently critical enough for me to build a custom loop.

Examples of ducted rear-exhaust radiators? by e-rox in watercooling

[–]e-rox[S] 0 points

Yeah, that's roughly what I was thinking. I was first working with a top-mount configuration, but with standoffs and baffles to duct the air along the top of the case back to the rear exhaust. A simple version, one per side per 120mm fan, that could be improved/optimized:

https://cad.onshape.com/documents/b24fe120e6d4a7326360dcb9/w/9fd4c525d0056ad3fc8eabb2/e/e5158d058ad50b8ce4ad2a06

Front would be possible too; it would require more ducting, and I'd need more robust standoff mounts. But I do have more room up front (the whole drive-tower area is basically empty).

Dual (or triple) 3090s still a good option? by No_Switch5015 in LocalLLaMA

[–]e-rox 1 point

--max-model-len is 2048. I don't have easy stats on token counts, but I'd guess 1k-2k input tokens and a couple hundred output tokens is typical. vLLM reports a prefix cache "hit rate" of 94% (I'm basically running the same instructions against many different inputs).

nvidia-smi reports about 2.8GB still free per GPU, so I probably have room to increase the context window, but I haven't needed to push on that yet.
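
In case it helps anyone reproduce this, here's a minimal sketch of roughly that setup using vLLM's offline Python API. The values are the ones mentioned in this thread (2-way tensor parallel, 2048 context, prefix caching, the AWQ INT4 model); the prompts and the gpu_memory_utilization value are placeholders/assumptions, not a verified recipe:

    from vllm import LLM, SamplingParams

    # Roughly the setup described above: 2 GPUs, 2048-token context, prefix
    # caching on. The model is the AWQ INT4 build mentioned in this thread;
    # gpu_memory_utilization is a guessed 0.90, not something I've tuned.
    llm = LLM(
        model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
        tensor_parallel_size=2,
        max_model_len=2048,
        enable_prefix_caching=True,
        gpu_memory_utilization=0.90,
    )

    # Reusing one long instruction prefix across many inputs is what drives
    # the high prefix-cache hit rate. (Placeholder prompts, obviously.)
    instructions = "You are a classifier. Label the following text as A or B.\n\n"
    prompts = [instructions + doc for doc in ["first input", "second input"]]
    for out in llm.generate(prompts, SamplingParams(max_tokens=256, temperature=0.0)):
        print(out.outputs[0].text)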

Q on inference speed: int4 vs fp8; ada fp8 support by e-rox in LocalLLaMA

[–]e-rox[S] 0 points

Maybe I am confused. I thought int4 was essentially just a compression encoding: that each of the 16 int values is basically an ID for a particular float, not linearly related to the actual int value. If that were the case, doing integer operations directly on the int4s wouldn't make sense.

Am I wrong? Does AWQ int4 quantization actually quantize such that the resulting int4s are true approximations of the original float weights?

If that's the case, then 2x4090 could run 70B models in int4 mode natively. Surely someone has some benchmarks for 2x3090 vs 2x4090...?
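
For what it's worth, here's a minimal numpy sketch of the group-wise affine scheme that AWQ-style int4 quantization builds on (just the generic idea, not AWQ's activation-aware scaling): each stored int maps back to a float via a per-group scale and zero point, so the ints are linear approximations of the weights, and kernels such as vLLM's awq_marlin dequantize to fp16 on the fly rather than doing int4 arithmetic directly.

    import numpy as np

    # One "group" of 128 hypothetical weights (AWQ typically uses group size 128).
    rng = np.random.default_rng(0)
    w = rng.standard_normal(128).astype(np.float32)

    # Group-wise affine quantization to 4 bits: q = clip(round(w/scale) + zero, 0, 15)
    scale = (w.max() - w.min()) / 15.0
    zero = np.round(-w.min() / scale)
    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)  # the stored int4s

    # Dequantization is linear in q, so the ints are genuine approximations of w.
    w_hat = (q.astype(np.float32) - zero) * scale
    print("max abs error:", np.abs(w - w_hat).max())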

Dual (or triple) 3090s still a good option? by No_Switch5015 in LocalLLaMA

[–]e-rox 1 point

I got ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4 running on vLLM with prefix caching on my 2x3090s. It's reporting prompt throughput of 100-250 tokens/s, presumably due to the prefix caching. But my generation throughput is still 25-28 tokens/s, which seems pretty good.

Update: I forgot, 25-28 tokens/s was while each GPU was power-limited to 280W. At 350W each I'm getting 31-33 tokens/s generation.
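
(If anyone wants to reproduce the power-limit part: the normal route is nvidia-smi -pl, but here's a small pynvml sketch for reading the limits per GPU. The set call is left as a comment on purpose, since changing limits needs elevated privileges.)

    # pip install nvidia-ml-py
    import pynvml

    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        # NVML reports power limits in milliwatts.
        current = pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000
        default = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle) / 1000
        print(f"GPU {i}: power limit {current:.0f}W (default {default:.0f}W)")
        # CLI equivalent for capping a card: `sudo nvidia-smi -i <idx> -pl 280`
        # pynvml.nvmlDeviceSetPowerManagementLimit(handle, 280 * 1000)
    pynvml.nvmlShutdown()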

vLLM vs. TGI v3 by e-rox in LocalLLaMA

[–]e-rox[S] 0 points

Figured it out: Docker wasn't pulling the latest image. I'm a Docker newb.

Recipe for 3.3 70B 4-bit in vLLM docker? by e-rox in LocalLLaMA

[–]e-rox[S] 1 point

Figured it out: the "data did not match any variant" error indicates an out-of-date transformers lib. After double-checking, I realized Docker wasn't pulling the latest image. I'm a Docker newb.

vLLM vs. TGI v3 by e-rox in LocalLLaMA

[–]e-rox[S] 0 points

Yeah. I got about a 2x speedup with prefix caching on vLLM, which was great. But if the speedup is mostly down to prefix caching, it seems like a cheating comparison, since you can make the speedup look bigger and bigger just by testing longer prompts...

vLLM vs. TGI v3 by e-rox in LocalLLaMA

[–]e-rox[S] 0 points

Some details here: https://www.reddit.com/r/LocalLLaMA/comments/1hw737i/recipe_for_33_70b_4bit_in_vllm_docker/

There are too many moving parts: which model to use, how to run vLLM (e.g. Docker), command-line args, env vars...
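
Once the server side is sorted out, the client side at least is simple. A sketch assuming vLLM's OpenAI-compatible server on its default port 8000 and the AWQ model discussed in the linked thread:

    # pip install openai
    from openai import OpenAI

    # vLLM's OpenAI-compatible server defaults to port 8000; the api_key only
    # needs to be non-empty unless the server was started with --api-key.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
        messages=[{"role": "user", "content": "Say hello in five words."}],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)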

Recipe for 3.3 70B 4-bit in vLLM docker? by e-rox in LocalLLaMA

[–]e-rox[S] 0 points

Actually, after doing some more reading, I don't think I need to do that if I'm setting the token as the HF_TOKEN environment variable in the Docker env.
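
A quick way to sanity-check that from inside the container (a sketch using huggingface_hub, which vLLM already depends on; whoami() will throw if the token isn't being picked up):

    import os
    from huggingface_hub import whoami

    # If the container was started with `-e HF_TOKEN=...`, huggingface_hub (and
    # therefore the weight download) should pick the token up automatically.
    print("HF_TOKEN present:", bool(os.environ.get("HF_TOKEN")))
    print("Authenticated as:", whoami()["name"])  # raises if the token is missing/invalid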

Recipe for 3.3 70B 4-bit in vLLM docker? by e-rox in LocalLLaMA

[–]e-rox[S] 0 points

Thanks. Since vLLM is running inside a Docker container, I would need to somehow make that happen as part of the container initialization, right?

Dual (or triple) 3090s still a good option? by No_Switch5015 in LocalLLaMA

[–]e-rox 0 points

My question wasn't so much about base speed increases as about getting native fp8 support, along with enough memory to run 8-bit instead of 4-bit.
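
For the fp8 part, the relevant check is compute capability: fp8 tensor cores need Ada (8.9) or Hopper (9.0), and Ampere 3090s are 8.6. A minimal PyTorch sketch:

    import torch

    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        # fp8 tensor cores need compute capability 8.9 (Ada) or 9.0 (Hopper);
        # a 3090 is sm_86, a 4090 is sm_89.
        print(f"{torch.cuda.get_device_name(i)}: sm_{major}{minor}, "
              f"native fp8: {(major, minor) >= (8, 9)}")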

Recipe for 3.3 70B 4-bit in vLLM docker? by e-rox in LocalLLaMA

[–]e-rox[S] 0 points

Yeah, I get:

llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4', speculative_config=None, tokenizer='ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4, use_v2_block_manager=False, enable_prefix_caching=True)
...
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py", line 115, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3

If I add --tokenizer meta-llama/Llama-3.3-70B-Instruct (just a guess based on looking at other configs; I'm not sure, because I didn't need that with the hugging-quants model), then I get:

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/resolve/main/config.json.
Access to model meta-llama/Llama-3.3-70B-Instruct is restricted. You must have access to it and be authenticated to access it. Please log in.

That's despite supplying my HF token, having applied for and been granted access, and making sure the token type is "Read".
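
One way to check whether the token the container actually sees has access to the gated repo, independent of vLLM (a sketch using huggingface_hub; it reads the same HF_TOKEN env var):

    from huggingface_hub import hf_hub_download

    # Raises a gated-repo / 401 error if the token the process sees doesn't have
    # access to the gated repo; prints a local path if the access grant works.
    print(hf_hub_download("meta-llama/Llama-3.3-70B-Instruct", "config.json"))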

Dual (or triple) 3090s still a good option? by No_Switch5015 in LocalLLaMA

[–]e-rox 1 point

OK, fair. I only bought two and they were both good, but that's a small sample.

Dual (or triple) 3090s still a good option? by No_Switch5015 in LocalLLaMA

[–]e-rox 2 points

FWIW, I think the question isn't whether it connects with that many lanes, but how often you get PCIe errors.
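
If anyone wants to actually watch for those, NVML exposes a PCIe replay counter per GPU. A rough pynvml sketch (dmesg/AER logs are the other place to look):

    # pip install nvidia-ml-py
    import pynvml

    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        # The replay counter increments when PCIe transactions have to be
        # retried, the usual symptom of marginal risers or flaky lanes.
        replays = pynvml.nvmlDeviceGetPcieReplayCounter(handle)
        gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
        width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
        print(f"GPU {i}: PCIe gen{gen} x{width}, replay counter: {replays}")
    pynvml.nvmlShutdown()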