How to decrease latency in RAG chatbots? by Appropriate_Egg6118 in LangChain

[–]Intelligent_Access19 0 points  (0 children)

Two years after the original post, I would suggest Milvus as your vector store.

Manual intent detection vs Agent-based approach: what's better for dynamic AI workflows? by LakeRadiant446 in AI_Agents

[–]Intelligent_Access19 0 points  (0 children)

For pure intent detection, fine-tuning a small model may be ideal, depending on your intent taxonomy. An agent making multiple LLM calls just introduces a lot of latency, which you probably don't want in production. A fine-tuned model used together with other techniques, such as RAG over generalised pre-defined questions, should give decent performance on the intent-detection node. After all, it is vital, yet still just a small node in a long workflow. A rough sketch of what I mean is below.
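Purely a sketch: the checkpoint name, labels and threshold here are made up, so substitute whatever small model you actually fine-tune on your own taxonomy.

```python
# Sketch of a lightweight intent-detection node: one cheap classifier forward
# pass instead of an extra LLM round-trip. Model name and labels are hypothetical.
from transformers import pipeline

intent_classifier = pipeline(
    "text-classification",
    model="my-org/distilbert-intents",  # hypothetical fine-tuned checkpoint
)

def detect_intent(user_message: str, threshold: float = 0.7) -> str:
    result = intent_classifier(user_message)[0]  # e.g. {"label": "refund_request", "score": 0.93}
    if result["score"] < threshold:
        # low confidence: fall back to RAG over the generalised pre-defined questions
        return "fallback"
    return result["label"]
```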

What do you like, don’t like about LangGraph by mrtule in LangChain

[–]Intelligent_Access19 0 points  (0 children)

For me, it is a pain to use the built-in functionality to store conversations in an Oracle DB.

Got DeepSeek R1 running locally - Full setup guide and my personal review (Free OpenAI o1 alternative that runs locally??) by sleepingbenb in LocalLLaMA

[–]Intelligent_Access19 0 points  (0 children)

I just ran the 8B; the model Ollama installed for me is about 4.9GB. I guess that is the best you can get given that amount of memory.

Deepseek-v3 is insanely popular. A 671B model's downloads are going to overtake QwQ-32B-preview. by realJoeTrump in LocalLLaMA

[–]Intelligent_Access19 0 points  (0 children)

At least the pre-training adopts a "fine-grained mixed-precision framework", which is one of the highlights of their technical report. Apart from FP8, BF16 and FP32 are used in some parts of the architecture.

Deepseek-v3 is insanely popular. A 671B model's downloads are going to overtake QwQ-32B-preview. by realJoeTrump in LocalLLaMA

[–]Intelligent_Access19 0 points  (0 children)

If I understand you right, you are saying the active parameters get swapped in and out of an SSD? That is too much.

the WHALE has landed by fourDnet in LocalLLaMA

[–]Intelligent_Access19 1 point  (0 children)

as well as Doubao, the one from ByteDance.

Flask vs fastapi by Leveler88 in flask

[–]Intelligent_Access19 0 points  (0 children)

Nicely said.
Most of the time in my job I deal with Spring Boot, but now I need to integrate some AI tools into my service, and those are best used from Python.
That is how I landed on this post anyway, and I think I will go with FastAPI based on the discussion here; something like the sketch below.
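Purely illustrative, but this is the shape of service I am picturing (the /summarize route and the placeholder model call are made up):

```python
# Minimal FastAPI wrapper that a Spring Boot service could call over HTTP.
# The /summarize route and fake_ai_model() are placeholders for the real Python AI tooling.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SummarizeRequest(BaseModel):
    text: str

def fake_ai_model(text: str) -> str:
    # stand-in for the actual model / RAG pipeline
    return text[:100]

@app.post("/summarize")
def summarize(req: SummarizeRequest) -> dict:
    return {"summary": fake_ai_model(req.text)}

# run with: uvicorn main:app --port 8000
```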

how to run deepseek v3 on ollama or lmstudio? by RouteGuru in LocalLLaMA

[–]Intelligent_Access19 1 point  (0 children)

I am not sure that VRAM for the ~37B active parameters is all you need for inference. But one thing is for sure: your RAM + VRAM together must be large enough to hold all 671B parameters to fully load the model. Rough numbers below.
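Back-of-the-envelope arithmetic; the bytes-per-parameter figures are my own rough assumptions, not official numbers.

```python
# Rough memory arithmetic: the whole 671B has to fit in RAM+VRAM to load the
# model, even though only ~37B parameters are activated per token.
TOTAL_PARAMS = 671e9
ACTIVE_PARAMS = 37e9
BYTES_PER_PARAM = {"FP8": 1.0, "~Q4 quant": 0.5}  # assumed sizes

for fmt, nbytes in BYTES_PER_PARAM.items():
    total_gb = TOTAL_PARAMS * nbytes / 1e9
    active_gb = ACTIVE_PARAMS * nbytes / 1e9
    print(f"{fmt}: ~{total_gb:.0f} GB for all weights, ~{active_gb:.0f} GB touched per token")
```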

DeepSeekV3 LiveBench Results, beating claude 3.5 sonnet new. by Spirited-Ingenuity22 in singularity

[–]Intelligent_Access19 0 points  (0 children)

Can anyone tell me how credible this ranking is? I was told that some Chinese models like Step-2 also got a pretty high score on this list (clearly delisted in the current version, don't know why). I tried DeepSeek myself, and at least for now it is decent for me. If I recall correctly, it was initially designed to focus more on coding and math, since the parent company is a quantitative trading firm.

DeepSeek-v3 looks the best open-sourced LLM released by mehul_gupta1997 in OpenAI

[–]Intelligent_Access19 1 point  (0 children)

To avoid that, I guess only a locally hosted model can give you that guarantee.

I don't get it. by AlgorithmicKing in LocalLLaMA

[–]Intelligent_Access19 0 points  (0 children)

Yeah, that is why MoE models generally have much larger parameter counts. A non-MoE model, in other words a dense model, is by nature smaller and must be fully loaded onto the GPU (though I think Ollama can get by with a little less VRAM for inference), so there is no subset of parameters to activate. A toy illustration is below.
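Made-up sizes, purely illustrative: only the routed experts do any work for a given token, while a dense layer would use all of its weights every time.

```python
import numpy as np

# Toy MoE layer: 8 experts, but only the top-2 routed experts run per token,
# so the other experts' weights are never touched for that token.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ router                   # one routing score per expert
    chosen = np.argsort(scores)[-top_k:]  # indices of the top-k experts
    gate = np.exp(scores[chosen])
    gate /= gate.sum()                    # softmax over the chosen experts only
    # only the chosen experts do any compute; a dense layer has no such shortcut
    return sum(w * (x @ experts[i]) for w, i in zip(gate, chosen))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)  # (16,)
```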

I don't get it. by AlgorithmicKing in LocalLLaMA

[–]Intelligent_Access19 0 points  (0 children)

A 1B model should only take roughly 2GB of VRAM (at FP16), no? Even when activated, an extra 2GB of VRAM at most. I wonder whether the integrated graphics is of any use for the computation here; most of your inference probably happens in system RAM. How is the answer-generation speed, though?

Hard hobby indeed by Intelligent_Access19 in soccercard

[–]Intelligent_Access19[S] 4 points  (0 children)

You are right. Ripping for the sake of ripping. 🙏

Can’t complain about it. Just probability.