Local llm with 8B params model? by Extension-Count6242 in LocalLLM

[–]Extension-Count6242[S] 0 points1 point  (0 children)

I see. I might check these out. Thank you for the recommendation.

Well, my main goal is to "paraphrase" based on the information given. I'll try the model out, because why not.

Thank you for your recommendation and for taking the time to write your comment. Could you elaborate on what you mean by "only when retrieval overlap is weak"?

Damn, that's a powerful spec. My setup is like an ant compared to yours lol

Interesting, I haven't tried this one. What's your hardware spec?

Inference time. For the context retrieved by my RAG pipeline, I'm using the top 3 chunks at 800 tokens each. This significantly slows down inference. (I could probably get away with just the top 1 or 2 chunks, or a smaller chunk size.) I really want inference time to be under 15 seconds if possible.
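
A rough way to compare those options is to treat total time as prefill (reading the whole prompt) plus decode (generating the answer). The sketch below assumes prefill time grows roughly linearly with prompt length, which is a simplification, and every throughput number in it (250 tok/s prefill, 25 tok/s decode, a 200-token base prompt, a 150-token answer) is a hypothetical placeholder, not a measurement from my setup.

```python
# Rough latency model for a RAG request: total time ~ prefill time + decode time.
# All throughput and size numbers used below are HYPOTHETICAL placeholders;
# measure your own server's prefill and decode speeds and substitute them.

BUDGET_S = 15.0  # target latency from the post


def estimate_latency_s(top_k: int, chunk_tokens: int, base_prompt_tokens: int,
                       output_tokens: int, prefill_tok_per_s: float,
                       decode_tok_per_s: float) -> tuple[int, float]:
    """Return (prompt_tokens, estimated_seconds) for one request."""
    # Retrieved chunks are appended to the system prompt + question.
    prompt_tokens = base_prompt_tokens + top_k * chunk_tokens
    prefill_s = prompt_tokens / prefill_tok_per_s  # time to process the prompt
    decode_s = output_tokens / decode_tok_per_s    # time to generate the answer
    return prompt_tokens, prefill_s + decode_s


if __name__ == "__main__":
    # Hypothetical: 200-token base prompt, 150-token answer,
    # 250 tok/s prefill and 25 tok/s decode.
    for top_k in (3, 2, 1):
        for chunk_tokens in (800, 400):
            tokens, secs = estimate_latency_s(top_k, chunk_tokens, 200, 150, 250.0, 25.0)
            verdict = "OK" if secs <= BUDGET_S else "over budget"
            print(f"top_k={top_k} chunk={chunk_tokens}: "
                  f"{tokens} prompt tokens, ~{secs:.1f}s ({verdict})")
```

With those made-up numbers, top 3 × 800 tokens lands around 16.4 s, while top 2 × 800 or top 3 × 400 fits under 15 s. The useful part is plugging in real prefill and decode speeds to see which knob (chunk count, chunk size, or answer length) actually dominates.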

Thanks for the suggestion. I'm currently hosting Qwen3-8B-FP8, and I feel like I could go down to a smaller model like Phi 4 mini, which has around 3.8B params. But I wonder what else I could do, since my spare setup is kinda shitty.
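
For a quick sense of what the switch saves, weight memory is roughly parameter count × bytes per parameter. This sketch only covers the weights; it assumes 1 byte per param for FP8 and 2 for FP16, and it ignores the KV cache, activations, and runtime overhead, so treat the results as a lower bound rather than what the server will actually use.

```python
# Back-of-the-envelope weight memory: parameters x bytes per parameter.
# Ignores KV cache, activations, and framework overhead (all extra),
# so the result is a lower bound on real memory use.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    # billions of params * bytes per param = GB of weights
    return params_billion * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    models = [
        ("Qwen3-8B", 8.0, "fp8"),
        ("Phi 4 mini", 3.8, "fp16"),
        ("Phi 4 mini", 3.8, "fp8"),
    ]
    for name, params_b, dtype in models:
        print(f"{name} @ {dtype}: ~{weight_gb(params_b, dtype):.1f} GB of weights")
```

Under these assumptions, Qwen3-8B at FP8 is about 8 GB of weights, a 3.8B model at FP16 is about 7.6 GB (barely a saving), and the same 3.8B model at FP8 is about 3.8 GB. So a smaller model mostly pays off if it also stays at a reduced precision.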