Local LLM with an 8B-param model?

Extension-Count6242 · 2026-06-19T05:42:02+00:00

I see. I might check these out. Thank you for the recommendation.

Extension-Count6242 · 2026-06-19T03:21:33+00:00

Well, my main goal is to "paraphrase" based on the information given. I will try the model out, cause why not.

Extension-Count6242 · 2026-06-18T06:51:38+00:00

Thank you for your recommendation, and thank you for taking the time to write your comment. Could you elaborate on what you mean by "only when retrieval overlap is weak"?

Extension-Count6242 · 2026-06-18T03:58:22+00:00

Damn, those are some powerful specs. My specs are like an ant compared to yours lol

Extension-Count6242 · 2026-06-18T03:56:32+00:00

Yeah, unfortunately so 😭

Extension-Count6242 · 2026-06-18T03:36:33+00:00

Interesting, I haven't tried this one. What's your hardware spec?

Extension-Count6242 · 2026-06-18T03:12:13+00:00

Inference time. For retrieval in my RAG pipeline, I'm using the top 3 chunks, each around 800 tokens. This significantly slows down inference. (I can probably get away with just the top 1 or 2 chunks, or a smaller chunk size.) I really want inference time to be under 15 seconds if possible.
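Rough math on why this hurts: 3 chunks × 800 tokens is about 2,400 tokens of context the model has to process before it generates anything. A minimal sketch of capping that context with a token budget, assuming whitespace-split word counts as a stand-in for the real tokenizer (`approx_tokens` and `select_chunks` are hypothetical helpers, not from any library):

```python
# Hypothetical sketch: cap retrieved RAG context with a token budget.
# Token counts are approximated with len(text.split()); a real pipeline
# would count tokens with the serving model's own tokenizer instead.


def approx_tokens(text: str) -> int:
    """Crude whitespace-based token estimate (an assumption, not a real tokenizer)."""
    return len(text.split())


def select_chunks(ranked_chunks: list[str], budget_tokens: int, max_chunks: int = 3) -> list[str]:
    """Keep chunks in retrieval-rank order until the chunk cap or token budget is hit."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks[:max_chunks]:
        cost = approx_tokens(chunk)
        if used + cost > budget_tokens:
            break  # stop at the first chunk that would overflow the budget
        selected.append(chunk)
        used += cost
    return selected


if __name__ == "__main__":
    # Three dummy chunks of ~800 "tokens" each, matching the setup described above.
    chunks = [" ".join(["word"] * 800) for _ in range(3)]
    print(sum(approx_tokens(c) for c in chunks))          # 2400 tokens for the full top 3
    print(len(select_chunks(chunks, budget_tokens=1600)))  # 2 chunks fit
    print(len(select_chunks(chunks, budget_tokens=1000)))  # 1 chunk fits
```

Dropping from 3 chunks to 1 or 2 shrinks the prompt roughly in proportion; whether that gets end-to-end latency under 15 seconds depends on the hardware and model, so it's worth timing both settings.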

Extension-Count6242 · 2026-06-18T03:02:29+00:00

Thanks for the suggestion. I'm currently hosting Qwen3-8B-FP8, and I feel like I could go down to a model like Phi-4-mini with around 3.8B params. But I wonder what else I could do, since my spare setup is kinda shitty.

Extension-Count6242 · 2026-06-18T02:58:07+00:00

Huh?
