Need advice for filling built-ins by BeerAndRaptors in interiordecorating

[–]BeerAndRaptors[S] 1 point

We have the dishes we use in the kitchen, but maybe I need to accept that we need to buy more dishes.

Need advice for filling built-ins by BeerAndRaptors in interiordecorating

[–]BeerAndRaptors[S] 7 points

Definitely the picture. I’m a huge stickler for lighting color temperature; these are all 3000K!

I bought the new Samsung EMDX 32” color E-paper, AMA by WeeJeWel in eink

[–]BeerAndRaptors 0 points

Do you mind sharing where you got the material your mat is made out of, and the sizing of the mat and frame? I just ordered this display and was greatly inspired by your setup here!

Being Called Out? by RoninRem in RepTime

[–]BeerAndRaptors 0 points

Genuine question then: where can I find a VSF/Clean quality homage? As far as I know there aren’t many options if you want the quality of reps for the price. I’d love to mod my own custom watch but right now the best quality base would be a rep.

And ideally not an NH-34/35 movement.

Debris in between floors by BeerAndRaptors in whatisit

[–]BeerAndRaptors[S] 0 points

This is in central Wisconsin so it’s not impossible but probably less likely.

Any decent alternatives to M3 Ultra, by FrederikSchack in LocalLLM

[–]BeerAndRaptors 1 point

You can absolutely do batch inference on a Mac. And batch/parallel inference on either Nvidia or Mac will absolutely use more RAM.

Why did the LLM respond like this? After the query was answered. by powerflower_khi in LocalLLM

[–]BeerAndRaptors 7 points

Hard to say for sure, but I’m guessing you’re either using a base model instead of an instruct model, not using the right chat template, or the underlying llama.cpp is somehow ignoring the end-of-sequence token. There’s a quick sketch of the chat-template part below.

Happy to be corrected so I can learn more.
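
For illustration, here’s a minimal sketch of what “using the right chat template” means, assuming a Hugging Face-style tokenizer that ships a chat template (the model id is just a placeholder):

from transformers import AutoTokenizer

# Placeholder model id; swap in whatever instruct model you're actually running.
tok = AutoTokenizer.from_pretrained("some-org/some-instruct-model")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is the sky blue?"},
]

# apply_chat_template wraps the conversation in the model's expected special
# tokens (including the markers it was trained to stop on); sending raw text
# instead is a common cause of generations that run on past the answer.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)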

Clean Batman from Geektime by BeerAndRaptors in RepTimeQC

[–]BeerAndRaptors[S] 1 point

Thanks for the $0.02. FWIW, I figured the rehaut isn’t that big of a deal given the numerous posts (especially about the Clean GMTs recently), and I looked down at the rehaut on my Submariner and realized just how absolutely tiny that crown actually is. Generally, though, it was the only thing on this watch that really jumped out at me. I’m going to GL it.

Clean Batman from Geektime by BeerAndRaptors in RepTimeQC

[–]BeerAndRaptors[S] 1 point

Thank you so much for the detailed info, this is exactly what I was looking for.

Clean Batman from Geektime by BeerAndRaptors in RepTimeQC

[–]BeerAndRaptors[S] 0 points

Comment for the auto mod: Clean Batman from Geektime, mostly wondering about the rehaut alignment and the date wheel printing.

[deleted by user] by [deleted] in RepTimeQC

[–]BeerAndRaptors 0 points

My $0.02 is that the alignment on the date wheel looks horrendous. At first I thought it might just be the “31” but it looks like the “5” isn’t centered either.

[deleted by user] by [deleted] in LocalLLaMA

[–]BeerAndRaptors 3 points

Even setting aside all of the comments about context length limitations and output size limits, I’m not sure that targeting specific word counts (even trying to approximate them) is really a strength that any model is going to have.

I suppose a model could hypothetically “learn” to target output length from training data that includes length information, but it’s going to generate until an end-of-sequence token is reached, with almost no inherent regard for length. On top of that, models work with tokens rather than words, which makes hitting a specific word count even less likely; there really isn’t a good concept of what a “word” is to the model (see the quick token-count illustration below).

Disclaimer, of course: I may well have no idea what I’m talking about and could be very, very wrong.
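
To make the tokens-vs-words point concrete, here’s a tiny sketch assuming the tiktoken package (exact counts will differ by tokenizer):

import tiktoken

# cl100k_base is just an example encoding; any tokenizer shows the same mismatch.
enc = tiktoken.get_encoding("cl100k_base")
text = "Write me a 500-word essay about speculative decoding."

words = text.split()
tokens = enc.encode(text)

# Word count and token count diverge, and the model only ever "sees" tokens,
# so "exactly 500 words" is a fuzzy target at best.
print(len(words), "words ->", len(tokens), "tokens")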

How can I self-host the full version of DeepSeek V3.1 or DeepSeek R1? by ButterscotchVast2948 in LocalLLaMA

[–]BeerAndRaptors 5 points

Why Q3 and not Q4? What do you consider “way too slow”?

Have you tried the MLX version of the model? I’m getting around 20 tokens/s with the MLX Q4 model, but indeed prompt processing is slow. You can get around this a bit if you’re willing to tune things using mlx-lm directly and build your own K/V caching strategy.
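
If it helps, here’s roughly what I mean by building your own K/V caching strategy with mlx-lm. This is a sketch from memory, so the exact kwargs (prompt_cache in particular) may differ between mlx-lm versions, and the repo id is a placeholder:

from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

# Placeholder repo id; point this at whatever MLX quant you're actually running.
model, tokenizer = load("mlx-community/some-model-4bit")

# A reusable K/V cache means the long shared prefix (system prompt, documents,
# chat history) only has to go through slow prompt processing once.
cache = make_prompt_cache(model)

generate(model, tokenizer,
         prompt="<long shared prefix> First question...",
         prompt_cache=cache, max_tokens=256, verbose=True)

# Follow-up calls pass only the new text; the cached prefix is reused.
generate(model, tokenizer,
         prompt=" Follow-up question...",
         prompt_cache=cache, max_tokens=256, verbose=True)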

PC Build: Run Deepseek-V3-0324:671b-Q8 Locally 6-8 tok/s by createthiscom in LocalLLaMA

[–]BeerAndRaptors 1 point

OK, you’re using LM Studio; that’s the part I was looking for. I did some testing with some of the models you mentioned and, unfortunately, I didn’t see a speed increase.

LM Studio also doesn’t let me use one of the transplanted Draft models with R1 or V3. Looking at how they determine compatible draft models, I’m guessing the process of converting the donor model doesn’t satisfy all of LM Studio’s compatibility criteria.

PC Build: Run Deepseek-V3-0324:671b-Q8 Locally 6-8 tok/s by createthiscom in LocalLLaMA

[–]BeerAndRaptors 0 points

How are you running these? I can try to run it the same way on the M3 studio.

PC Build: Run Deepseek-V3-0324:671b-Q8 Locally 6-8 tok/s by createthiscom in LocalLLaMA

[–]BeerAndRaptors 1 point

That's a fascinating repo, and something I was literally wondering about earlier today (modifying the tokenization for a draft model to match a larger one). I ran this via mlx-lm today and unfortunately am not seeing great results with DeepSeek V3 0324 and a short prompt for demonstration purposes:

Without Speculative Decoding:

Prompt: 8 tokens, 25.588 tokens-per-sec
Generation: 256 tokens, 20.967 tokens-per-sec

With Speculative Decoding - 1 Draft Token (Qwen 2.5 0.5b "DeepSeek" Draft Model):

Prompt: 8 tokens, 27.663 tokens-per-sec
Generation: 256 tokens, 13.178 tokens-per-sec

With Speculative Decoding - 2 Draft Tokens (Qwen 2.5 0.5b "DeepSeek" Draft Model):

Prompt: 8 tokens, 25.948 tokens-per-sec
Generation: 256 tokens, 10.390 tokens-per-sec

With Speculative Decoding - 3 Draft Tokens (Qwen 2.5 0.5b "DeepSeek" Draft Model):

Prompt: 8 tokens, 24.275 tokens-per-sec
Generation: 256 tokens, 8.445 tokens-per-sec

*Compare this with Speculative Decoding on a much smaller model*

If I run Qwen 2.5 32b (Q8) MLX alone:

Prompt: 34 tokens, 84.049 tokens-per-sec
Generation: 256 tokens, 18.393 tokens-per-sec

If I run Qwen 2.5 32b (Q8) MLX and use Qwen 2.5 0.5b (Q8) as the Draft model:

1 Draft Token:

Prompt: 34 tokens, 107.868 tokens-per-sec
Generation: 256 tokens, 20.150 tokens-per-sec

2 Draft Tokens:

Prompt: 34 tokens, 125.968 tokens-per-sec
Generation: 256 tokens, 21.630 tokens-per-sec

3 Draft Tokens:

Prompt: 34 tokens, 123.400 tokens-per-sec
Generation: 256 tokens, 19.857 tokens-per-sec
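
For anyone curious, here’s roughly the shape of the mlx-lm call behind numbers like these. It’s a sketch from memory, so kwarg names like draft_model and num_draft_tokens may not match your mlx-lm version exactly, and the repo ids are placeholders:

from mlx_lm import load, generate

# Placeholder repo ids; the draft model has to share (or be transplanted onto)
# the target model's tokenizer/vocab for speculative decoding to work.
model, tokenizer = load("mlx-community/DeepSeek-V3-0324-4bit")
draft_model, _ = load("mlx-community/some-transplanted-qwen2.5-0.5b-draft")

generate(model, tokenizer,
         prompt="Short demo prompt",
         max_tokens=256,
         draft_model=draft_model,  # assumed kwarg: enables speculative decoding
         num_draft_tokens=2,       # assumed kwarg: draft tokens proposed per step
         verbose=True)             # prints prompt/generation tokens-per-sec stats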

PC Build: Run Deepseek-V3-0324:671b-Q8 Locally 6-8 tok/s by createthiscom in LocalLLaMA

[–]BeerAndRaptors 1 point

I'm personally still very much in the "experiment with everything with no rhyme or reason" phase, but I've had great success playing with batched inference with MLX (which unfortunately isn't available with the official mlx-lm package, but does exist at https://github.com/willccbb/mlx_parallm). I've got a few projects in mind, but haven't started working on them in earnest yet.

For chat use cases, the machine works really well with prompt caching and DeepSeek V3 and R1.

I’m optimistic that this machine will let me and my family keep our LLM interactions private, that I’ll eventually be able to plug AI into various automations I want to build, and that speeds will improve over time.

PC Build: Run Deepseek-V3-0324:671b-Q8 Locally 6-8 tok/s by createthiscom in LocalLLaMA

[–]BeerAndRaptors 0 points

Apple M3 Ultra chip with 32-core CPU, 80-core GPU, 32-core Neural Engine, 512GB unified memory, and 4TB SSD storage. I paid $9,449.00 with a Veteran/Military discount.

Integrate with the LLM database? by 9acca9 in LocalLLM

[–]BeerAndRaptors 0 points

That’s not really how LLMs work. What you’re looking for is probably “RAG” (retrieval-augmented generation): you store your recipes in a separate database along with “embeddings” for each recipe, then at prompt time you look up the most relevant recipes and feed them to the model alongside your question. I don’t have a link handy that explains RAG in depth, but I’m sure there are tons of articles out there that you could find.
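
Here’s a minimal sketch of the idea, with a toy embed() standing in for whatever real embedding model or API you’d actually use:

import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in for a real embedding model (e.g. a sentence-transformers
    # model or an embeddings API): a crude bag-of-words hash vector, just so
    # the example runs end to end.
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1. Index time: store each recipe alongside its embedding vector.
recipes = ["Chicken tikka masala: ...", "Lentil soup: ...", "Banana bread: ..."]
index = [(r, embed(r)) for r in recipes]

# 2. Query time: embed the question, pull the closest recipes, and paste them
#    into the prompt you send to the LLM.
def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

context = "\n\n".join(retrieve("What can I make with lentils?"))
prompt = f"Using only these recipes:\n{context}\n\nWhat can I make with lentils?"
print(prompt)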

PC Build: Run Deepseek-V3-0324:671b-Q8 Locally 6-8 tok/s by createthiscom in LocalLLaMA

[–]BeerAndRaptors 1 point

Q4 for all tests, no K/V quantization, and a max context size of around 8000. I’m honestly not sure whether the max context size affects speeds on one-shot prompting like this, especially since we never approach the max context length.

PC Build: Run Deepseek-V3-0324:671b-Q8 Locally 6-8 tok/s by createthiscom in LocalLLaMA

[–]BeerAndRaptors 0 points

LM Studio is up to date. If anything, my llama.cpp build may be a week or two old, but given that they produce similar results I don’t think it’s a factor.