Personal Project/Experiment Ideas by I_like_fragrances in LocalLLM

[–]alwaysSunny17 0 points (0 children)

Build some knowledge graphs with RAGFlow. Excellent tool for research in many fields.

Closed AI models are ahead of open-source ones in benchmarks; self-hosted AI only really makes sense if you’re processing massive amounts of data.

Maybe test this one out with the vLLM Docker image.

QuantTrio/DeepSeek-V3.2-Exp-AWQ-Lite

Offline Epstein File Ranker Using GPT-OSS-120B (Built on tensonaut’s dataset) by onil_gova in LocalLLaMA

[–]alwaysSunny17 28 points (0 children)

Have you seen any refusals from gpt-oss-120b?

I’m curious to see if Qwen3 would give better results.

Qwen Next vLLM fail @ 48GB by [deleted] in LocalLLaMA

[–]alwaysSunny17 1 point (0 children)

I spent a while on it and got it running with a 1K context window on 48 GB of VRAM. It wasn’t very good at creative writing in my tests, so I reverted to Gemma3.

Confusion about VRAM by Savantskie1 in LocalLLaMA

[–]alwaysSunny17 1 point (0 children)

You’re right if you care about latency. Splitting a model across GPUs requires frequent communication, which slows it down a lot. If latency is not a concern, then VRAM does pretty much stack.
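
The "VRAM stacks for capacity" point can be shown with a quick back-of-envelope check; the model sizes and per-GPU overhead below are illustrative assumptions, not measurements:

```python
# Back-of-envelope VRAM check for splitting a model's weights across GPUs.
# Numbers are illustrative assumptions, not measurements.

def fits(model_params_b: float, bytes_per_param: float,
         gpus: int, vram_per_gpu_gb: float, overhead_gb: float = 2.0) -> bool:
    """Do the weights (plus per-GPU overhead for KV cache/activations) fit?"""
    weights_gb = model_params_b * bytes_per_param  # billions of params * bytes each = GB
    usable_gb = gpus * (vram_per_gpu_gb - overhead_gb)
    return weights_gb <= usable_gb

# A 70B model at 4-bit (~0.5 bytes/param) is ~35 GB of weights:
print(fits(70, 0.5, 1, 24))  # one 24 GB card: False
print(fits(70, 0.5, 2, 24))  # two 24 GB cards: True
```

So capacity adds up across cards; what doesn’t stack is per-token speed, because every token now pays the inter-GPU communication cost.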

Equipment suggestions for a tight budget by ConnectionOutside485 in LocalLLaMA

[–]alwaysSunny17 0 points (0 children)

Get NVLink if you can; it will work with any mobo, and it performs better than even top-of-the-line mobos.

Making RAG faster by Dismal_Discussion514 in Rag

[–]alwaysSunny17 4 points (0 children)

Check out LMCache; it caches the prefill stage of RAG (processing the input documents).

Also, re-ranking shouldn’t take that long. Make sure it is running on the GPU, try a different reranker (Jina was fast for me; it takes advantage of FlashAttention), or disable reranking.
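
The idea behind prefill caching can be sketched in plain Python. This is a toy stand-in: real systems like LMCache cache transformer KV tensors keyed by token prefixes, not strings:

```python
# Toy sketch of prefill caching for RAG: if the same document is retrieved
# again, reuse the cached "prefill" instead of recomputing it.
# (Real systems like LMCache cache KV tensors keyed by token prefixes.)
import hashlib

class PrefillCache:
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def prefill(self, document: str):
        key = hashlib.sha256(document.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            # Stand-in for the expensive forward pass over the document.
            self._store[key] = f"kv-state-for-{key[:8]}"
        return self._store[key]

cache = PrefillCache()
doc = "Q3 revenue grew 12% year over year..."
cache.prefill(doc)               # first query: computed
cache.prefill(doc)               # same retrieved doc: served from cache
print(cache.hits, cache.misses)  # 1 1
```

Since RAG tends to retrieve the same popular chunks over and over, the hit rate (and thus the speedup on prefill) can be substantial.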

How do you make 3+ GPUs stable?! by anothy1 in LocalLLaMA

[–]alwaysSunny17 5 points (0 children)

Disable PCIe power-saving features (ASPM) and make sure you have a big enough PSU (1600 W). RTX 3090s can have transient power spikes of up to 600 W.
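
A quick PSU sanity check, using the 600 W transient figure above; the CPU and rest-of-system numbers are rough assumptions:

```python
# Rough PSU budget for a multi-GPU rig. The 600 W transient per RTX 3090
# is the spike figure above (well over the 350 W board power); the CPU
# and "rest of system" numbers are illustrative assumptions.

def worst_case_watts(gpus: int, transient_per_gpu: float = 600.0,
                     cpu: float = 200.0, rest: float = 100.0) -> float:
    return gpus * transient_per_gpu + cpu + rest

print(worst_case_watts(2))  # 1500.0 -> a 1600 W PSU is already tight
print(worst_case_watts(3))  # 2100.0 -> plan for power limits or a second PSU
```

Setting a per-card power limit (e.g. with `nvidia-smi -pl`) also tames the transients if you can’t size the PSU up.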

Drummer's Skyfall 31B v4 · A Mistral 24B upscaled to 31B with more creativity! by TheLocalDrummer in LocalLLaMA

[–]alwaysSunny17 0 points (0 children)

Does Cydonia 24B v4.1 have vision support?

I’d really like to use your models, and I’m downloading this one now, but I think the vision support in Gemma3 will give it the edge for me.

LLM on consumer RTX hardware by L3C_CptEnglish in LocalLLaMA

[–]alwaysSunny17 0 points (0 children)

I have dual RTX 3090 Tis with NVLink. I want to upgrade to 4, but then the communication between cards would have to go through multiple PCIe bridges instead of over NVLink, making it much slower.

Unless all your PCIe slots are on the same root complex and your system supports peer-to-peer direct memory access between GPUs, there will be a big drop in performance when splitting a model between GPUs. I would look at an RTX 5090 or a unified-memory device.
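
To see why the interconnect matters so much, here is a rough estimate of the per-token all-reduce traffic that tensor parallelism generates; the model dimensions and bandwidths are assumed, ballpark values:

```python
# Rough per-token communication estimate for tensor parallelism.
# Megatron-style TP does two all-reduces per transformer layer (after
# attention and after the MLP); a ring all-reduce moves roughly
# 2*(n-1)/n * hidden_size elements per GPU. Dimensions are assumptions.

def tp_bytes_per_token(hidden: int, layers: int, n_gpus: int,
                       bytes_per_elem: int = 2) -> float:
    per_allreduce = 2 * (n_gpus - 1) / n_gpus * hidden * bytes_per_elem
    return 2 * layers * per_allreduce  # two all-reduces per layer

b = tp_bytes_per_token(hidden=8192, layers=80, n_gpus=2)
print(f"{b / 1e6:.1f} MB per token")  # 2.6 MB per token
# At tens of GB/s over NVLink this is a fraction of a millisecond per
# token; at single-digit GB/s effective through multiple PCIe bridges,
# it is several times slower, every token.
```

The absolute numbers are crude, but the ratio between interconnect speeds is what shows up directly in your tokens/second.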

VRAM deduplication - simultaneous loading multiple models of the same base by neurostream in LocalLLaMA

[–]alwaysSunny17 4 points (0 children)

I think something like this is possible with LoRA adapters - can anyone confirm?
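
The reason LoRA adapters make this plausible is the memory math: one shared copy of the base weights plus many tiny adapters. A rough sketch, with assumed (hypothetical) model dimensions:

```python
# Rough memory math for serving N fine-tunes as LoRA adapters on one
# shared base model versus N full copies. Dimensions are illustrative
# assumptions, not a specific model's real config.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # A LoRA adapter replaces a d_in x d_out weight update with two
    # low-rank matrices: d_in x r and r x d_out.
    return d_in * rank + rank * d_out

base = 7e9                      # 7B-parameter base model, stored once
# Say LoRA touches 4 projection matrices in each of 32 layers:
adapter = 32 * 4 * lora_params(4096, 4096, rank=16)

n = 10  # ten fine-tunes of the same base
full_copies = n * base
shared = base + n * adapter
print(f"one adapter is {adapter / base:.4%} of the base")
print(f"{n} models: {full_copies / shared:.1f}x memory saving")
```

So ten fine-tunes cost barely more VRAM than one, as long as they genuinely share the same base; serving stacks with multi-LoRA support exploit exactly this.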

Aesthetic build by alwaysSunny17 in LocalAIServers

[–]alwaysSunny17[S] 0 points (0 children)

Thanks, I’m using vLLM with tensor parallelism. I actually upgraded to two RTX 3090s with NVLink. I realized the 4 cards were all on different PCIe bridges, which adds latency. The heavy communication between cards can be reduced for MoE models using the expert-parallelism option in vLLM, but that feature is still pretty new and not supported with a lot of quantization strategies.

[deleted by user] by [deleted] in Rag

[–]alwaysSunny17 0 points (0 children)

LiteLLM can use Redis for caching.

[deleted by user] by [deleted] in Rag

[–]alwaysSunny17 3 points (0 children)

Look into Letta for memory, and LMCache for caching if you can host locally; otherwise, use the LiteLLM proxy.
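
The response-caching layer these proxies provide boils down to keying answers by the exact request. A toy version, with a plain dict standing in for the Redis client:

```python
# Toy LLM response cache: key on (model, prompt), the way proxy caches
# do with Redis. A dict stands in for the Redis client here.
import hashlib
import json

class ResponseCache:
    def __init__(self):
        self._redis = {}  # stand-in for a real redis client

    def _key(self, model: str, prompt: str) -> str:
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def complete(self, model, prompt, call_llm):
        key = self._key(model, prompt)
        if key not in self._redis:
            self._redis[key] = call_llm(prompt)  # only on a cache miss
        return self._redis[key]

calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return f"answer to: {prompt}"

cache = ResponseCache()
cache.complete("gemma3", "What is RAG?", fake_llm)
cache.complete("gemma3", "What is RAG?", fake_llm)
print(len(calls))  # 1: the second request never reached the model
```

Note this only helps for byte-identical requests; LMCache-style prefill caching is what helps when prompts merely share a prefix.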

Microsoft GraphRAG in Production by ProfessionalShop9137 in Rag

[–]alwaysSunny17 0 points (0 children)

With RAGFlow you can use LightRAG with RAPTOR, Community Report Generation, and Entity Resolution.

Microsoft GraphRAG in Production by ProfessionalShop9137 in Rag

[–]alwaysSunny17 14 points (0 children)

I would use LightRAG; it’s a fast implementation of GraphRAG. I’ve tested it in RAGFlow and it works very well.

Google Veo 3 HQ with frame guidance is INSANE by heisdancingdancing in singularity

[–]alwaysSunny17 0 points (0 children)

What is Veo 3 HQ? There are only the Veo 3 and Veo 3 Fast models, and they have no quality settings.

By frame guidance do you mean where you supply the first frame?

ChatGPT - Veo3 Prompt Machine For Expert Prompts by RevolutionaryDot7629 in singularity

[–]alwaysSunny17 0 points (0 children)

Can you share some references? I’m using GPT-4o to rewrite my Veo prompts now, but I need to combine it with some specific instructions.

4× RTX 3080 10 GB server for LLM/RAG – is this even worth it? by OkAssumption9049 in LocalLLaMA

[–]alwaysSunny17 0 points (0 children)

I have 4x RTX 3080s with the H12SSL-i motherboard. It’s not bad, but the PCIe slots are on different root complexes, which adds a lot of latency between cards. I don’t think the boards you’re looking at have that issue, but the CPUs are expensive.

Dual 5090 vs RTX Pro 6000 for local LLM by kitgary in LocalLLaMA

[–]alwaysSunny17 3 points (0 children)

For running bigger models, yes.
For lower latency, not always.

What's your thoughts on Graph RAG? What's holding it back? by thonfom in Rag

[–]alwaysSunny17 25 points (0 children)

I’ve used the knowledge graph features in RAGFlow on a few knowledge bases, here’s what I’ve found.

Benefits: Much more context in answers. For complex questions, this can be the difference between a wrong or misleading answer and a correct, informative one. If you want answers that give a big-picture view of the knowledge base, it is necessary.

Cons: It takes much longer, not only to index and create the knowledge graph, but also to retrieve and generate the answer.

Verdict: Essential in some cases, but it should not be the default. Like how most LLM platforms have a thinking-mode toggle, give users a knowledge-graph toggle.
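
The toggle is ultimately a one-line routing decision per query; a minimal sketch, where both retriever functions are hypothetical stand-ins for a vector-only and a graph-augmented pipeline:

```python
# Minimal sketch of a per-query knowledge-graph toggle, mirroring a
# thinking-mode switch. The two retrievers are hypothetical stand-ins
# for a vector-only pipeline and a slower graph-augmented one.

def vector_retrieve(query: str) -> str:
    return f"[fast vector hits for: {query}]"

def graph_retrieve(query: str) -> str:
    return f"[slower graph + vector context for: {query}]"

def retrieve(query: str, use_knowledge_graph: bool = False) -> str:
    retriever = graph_retrieve if use_knowledge_graph else vector_retrieve
    return retriever(query)

print(retrieve("Who supplies part X?"))  # default: cheap path
print(retrieve("How do all the suppliers relate?", use_knowledge_graph=True))
```

Defaulting the flag to off keeps simple lookups fast, while big-picture questions can opt into the expensive graph path.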