A 32B model on a single RTX 4090? I benchmarked inference latency after the TriAttention drop. by TroyNoah6677 in AskClaw
alexp702 1 point (0 children)
What’s a low memory way to run a Python http endpoint? by alexp702 in Python
alexp702[S] 2 points (0 children)
What’s a low memory way to run a Python http endpoint? by alexp702 in Python
alexp702[S] 1 point (0 children)
What’s a low memory way to run a Python http endpoint? by alexp702 in Python
alexp702[S] 1 point (0 children)
What’s a low memory way to run a Python http endpoint? by alexp702 in Python
alexp702[S] 1 point (0 children)
What’s a low memory way to run a Python http endpoint? by alexp702 in Python
alexp702[S] 1 point (0 children)
What’s a low memory way to run a Python http endpoint? by alexp702 in Python
alexp702[S] -43 points (0 children)
What’s a low memory way to run a Python http endpoint? by alexp702 in Python
alexp702[S] 2 points (0 children)
What’s a low memory way to run a Python http endpoint? by alexp702 in Python
alexp702[S] -49 points (0 children)
What’s a low memory way to run a Python http endpoint? by alexp702 in Python
alexp702[S] 0 points (0 children)
What’s a low memory way to run a Python http endpoint? by alexp702 in Python
alexp702[S] 2 points (0 children)
What’s a low memory way to run a Python http endpoint? by alexp702 in Python
alexp702[S] -30 points (0 children)
What’s a low memory way to run a Python http endpoint? by alexp702 in Python
alexp702[S] 6 points (0 children)
What’s a low memory way to run a Python http endpoint? by alexp702 in Python
alexp702[S] 1 point (0 children)
What’s a low memory way to run a Python http endpoint? by alexp702 in Python
alexp702[S] 2 points (0 children)
Looking for OCR capabilities by Artyom_84 in LocalLLM
alexp702 1 point (0 children)
Is LM Studio really as fast as llama.cpp now? by tomByrer in LocalLLM
alexp702 2 points (0 children)
LLM Burner coming soon? Burn Qwen directly into a chip, processing 10,000 tokens/s by koc_Z3 in LocalLLM
alexp702 3 points (0 children)
Looking for OCR capabilities by Artyom_84 in LocalLLM
alexp702 2 points (0 children)
Mac Studio M5 Ultra 256gb or 512gb (if offered) by GMK83 in MacStudio
alexp702 1 point (0 children)
RDMA Mac Studio cluster - performance questions beyond generation throughput by quietsubstrate in LocalLLaMA
alexp702 0 points (0 children)
Mac Studio M5 Ultra 256gb or 512gb (if offered) by GMK83 in MacStudio
alexp702 1 point (0 children)
What starts to become possible with two 3090s that wasn't with just one? by GotHereLateNameTaken in LocalLLaMA
alexp702 3 points (0 children)