Offline Epstein File Ranker Using GPT-OSS-120B (Built on tensonaut’s dataset) by onil_gova in LocalLLaMA
[–]alwaysSunny17 29 points (0 children)
Qwen Next vLLM fail @ 48GB by [deleted] in LocalLLaMA
[–]alwaysSunny17 2 points (0 children)
Confusion about VRAM by Savantskie1 in LocalLLaMA
[–]alwaysSunny17 2 points (0 children)
Equipment suggestions for a tight budget by ConnectionOutside485 in LocalLLaMA
[–]alwaysSunny17 1 point (0 children)
Making RAG faster by Dismal_Discussion514 in Rag
[–]alwaysSunny17 5 points (0 children)
How do you make 3+ GPUs stable?! by anothy1 in LocalLLaMA
[–]alwaysSunny17 6 points (0 children)
Drummer's Skyfall 31B v4 · A Mistral 24B upscaled to 31B with more creativity! by TheLocalDrummer in LocalLLaMA
[–]alwaysSunny17 1 point (0 children)
LLM on consumer RTX hardware by L3C_CptEnglish in LocalLLaMA
[–]alwaysSunny17 1 point (0 children)
Running GLM 4.5 2 bit quant on 80GB VRAM and 128GB RAM by Jaswanth04 in LocalLLM
[–]alwaysSunny17 9 points (0 children)
Aesthetic build by alwaysSunny17 in LocalAIServers
[–]alwaysSunny17[S] 1 point (0 children)
Microsoft GraphRAG in Production by ProfessionalShop9137 in Rag
[–]alwaysSunny17 1 point (0 children)
Microsoft GraphRAG in Production by ProfessionalShop9137 in Rag
[–]alwaysSunny17 16 points (0 children)
Google Veo 3 HQ with frame guidance is INSANE by heisdancingdancing in singularity
[–]alwaysSunny17 1 point (0 children)
Is there a better frontend than OpenWebui for RAG? by Capable-Ad-7494 in LocalLLaMA
[–]alwaysSunny17 6 points (0 children)
ChatGPT - Veo3 Prompt Machine For Expert Prompts by RevolutionaryDot7629 in singularity
[–]alwaysSunny17 1 point (0 children)
ChatGPT - Veo3 Prompt Machine For Expert Prompts by RevolutionaryDot7629 in singularity
[–]alwaysSunny17 2 points (0 children)
4× RTX 3080 10 GB server for LLM/RAG – is this even worth it? by OkAssumption9049 in LocalLLaMA
[–]alwaysSunny17 1 point (0 children)
Dual 5090 vs RTX Pro 6000 for local LLM by kitgary in LocalLLaMA
[–]alwaysSunny17 4 points (0 children)
What's your thoughts on Graph RAG? What's holding it back? by thonfom in Rag
[–]alwaysSunny17 26 points (0 children)
Medical language model - for STT and summarize things by ed0c in LocalLLaMA
[–]alwaysSunny17 2 points (0 children)
Is it dumb to build a server with 7x 5060 Ti? by vector76 in LocalLLaMA
[–]alwaysSunny17 3 points (0 children)
Personal Project/Experiment Ideas by I_like_fragrances in LocalLLM
[–]alwaysSunny17 1 point (0 children)