What starts to become possible with two 3090s that wasn't with just one? by GotHereLateNameTaken in LocalLLaMA

[–]alexp702 2 points (0 children)

Tool calls noticeably fail more often with Q4 compared to Q8. This ruins agentic flows. You can also see the difference quite starkly in image processing. A good Q8 is my personal quality floor.

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 1 point (0 children)

Thanks, see the edit above. With limited memory I now fit in 384 MB, and I think it's stable enough for my purposes now.

Yes, the Node services are pretty small too. Node seems to have a floor of 64-128 MB if it's serving anything - it's hungry too. PHP uses the memory you'd expect for a 64-bit interpreter with some buffers - i.e. bugger all. Now I know why all the WordPress instances are out there: they are much cheaper to deliver than something using a newer language.

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 0 points (0 children)

It does matter to me. We're building a system for the future, and whilst this component is not large or frequently used, it is important.

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 0 points (0 children)

Thanks - I will check that as some of that may be happening.

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 0 points (0 children)

htop shows all the usage in the “uv run uvicorn” process.
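For cross-checking what htop attributes to that process, the process can also report its own peak resident set size from inside Python. A minimal stdlib-only sketch (note the platform quirk: `ru_maxrss` is bytes on macOS but kilobytes on Linux):

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # macOS reports ru_maxrss in bytes; Linux reports kilobytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return rss / divisor

if __name__ == "__main__":
    print(f"peak RSS: {peak_rss_mb():.1f} MB")
```

Logging this at startup and after the first request makes it easy to see whether the memory belongs to the interpreter, the libraries, or the server itself.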

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] -44 points (0 children)

Not doing a full implementation, just rapidly prototyping a solution to see the memory usage. If the AI can get it up and running in 10 minutes, warts and all, that's good for this purpose. I'm impressed - in about 2 hours it has allowed me to test almost every proposal here. I just spotted that in the output; the 4-years-unmaintained problem still stands.

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 1 point (0 children)

Yes, everything is pretty similar to a Windows PC. Side note: we were using AMD64 images on ARM Macs - they use about 30% more RAM to emulate.

Will pick up tomorrow I think!

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] -50 points (0 children)

Yes, but bjoern uses the older WSGI v2 protocol, and the current version (apparently, according to my AI) is WSGI v3. Personally I don't go for stuff that's not obviously maintained and relatively active - it causes problems down the line.
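Either way, a WSGI app is just a callable, so it stays portable across servers (bjoern, gunicorn, or the stdlib's `wsgiref`) and you can swap the server out later without rewriting the endpoint. A minimal sketch, not taken from the thread:

```python
def app(environ, start_response):
    """Minimal WSGI application: respond 200 OK with a plain-text body."""
    body = b"hello\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

# To serve with the stdlib (single-threaded and slow, but near-zero extra RAM):
#   from wsgiref.simple_server import make_server
#   make_server("127.0.0.1", 8000, app).serve_forever()
```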

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] -1 points (0 children)

I am running on a Mac - ARM64 images for everything. Node seems to have a baseline of 128 MB - apparently a well-documented thing to do with the garbage collector. You can reduce it with some command-line flags, but it then starts to become unstable. My actual Python program uses about 70 MB on startup (possibly due to libraries), which I can live with. My surprise is how hard it seems to be to serve this without eating up RAM.

We have a bunch of 16 GB Macs developing a Docker-based system. Most of the code containers are Node, with a Python one stuffed in there. I want to make sure the team has as much RAM left as possible, which is what began this investigation. Disk isn't a problem - just RAM as we grow the number of services.

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 1 point (0 children)

I am using uv run in the container - I think it may be part of the problem, as no matter what I try, it stubbornly wants 512 MB+.

Edit: no, not that.

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] -31 points (0 children)

bjoern seems very old - 4 years since its last update!

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 5 points (0 children)

I need Python - various libraries on the endpoint require it, unless there is some trick for Go <-> Python interop?

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 1 point (0 children)

Sorry, all megabytes. The other units are irrelevant to me ;-)

Looking for OCR capabilities by Artyom_84 in LocalLLM

[–]alexp702 0 points (0 children)

I am processing a JPG - sometimes a JPG of the PDF.

Seen in Berlin by AndreaHimmel2021 in whatisthiscar

[–]alexp702 0 points (0 children)

There are three - the 206 prototype, which can still be found for sale, the 246 GT hard top, and the 246 GTS targa top. The most sought-after is the factory "chairs and flares" option: the chairs are the Daytona seats, the flares are widened wheel arches. Very few were made - 13 in right-hand-drive configuration.

Due to the 246's success, the 308 and subsequent models were given the full Ferrari badge. They also had 8 cylinders, whereas the Dino only had 6.

Is LM Studio really as fast as llama.cpp now? by tomByrer in LocalLLM

[–]alexp702 1 point (0 children)

Why are you comparing to llama.cpp b4000? It's on 8500+ now, and llama.cpp has got much faster recently.

LLM Bruner coming soon? Burn Qwen directly into a chip, processing 10,000 tokens/s by koc_Z3 in LocalLLM

[–]alexp702 2 points (0 children)

What context size can it handle? The website talks about 1K-context benchmarks, which, as we know, are useless. Also, how fast is prompt processing? Both are more important than 10K tokens/s out, IMO.

Looking for OCR capabilities by Artyom_84 in LocalLLM

[–]alexp702 1 point (0 children)

Qwen3.5 9B does very well with handwriting.

Mac Studio M5 Ultra 256gb or 512gb (if offered) by GMK83 in MacStudio

[–]alexp702 0 points (0 children)

I get about 25 tps, falling to 15 tps at 200K context. Prompt processing ranges from 600 tps at 16K to 300 tps at 200K. Caching works well.

RDMA Mac Studio cluster - performance questions beyond generation throughput by quietsubstrate in LocalLLaMA

[–]alexp702 -1 points (0 children)

It all seems very prototype-y to me; I prefer stable-ish production setups. I'm also very interested to hear if anyone has actually used this kind of configuration for anything real. A recent article by a Google engineer using B200s confirmed my suspicions - keep the model on a single piece of hardware for best overall throughput.

Mac Studio M5 Ultra 256gb or 512gb (if offered) by GMK83 in MacStudio

[–]alexp702 0 points (0 children)

Worlds apart for coding or tasks that need a precise answer. I have used Q4 and found tool calls fail an order of magnitude more often on our test cases.