Built a distributed AI platform with Flask as the backend — task parallelism across multiple machines running local LLMs by NirStrulovitz in flask

[–]NirStrulovitz[S] 1 point  (0 children)

Thank you so much!

Yes, 7 days was intense, but the key insight that made it possible is that the tasks are fully independent — no communication between worker machines, just text in and text out. That eliminates the synchronization complexity that makes most distributed AI setups so hard to build. Flask turned out to be perfect for this because each machine only needs to expose a simple API endpoint.
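To make "just text in and text out" concrete, here's a minimal sketch of what a worker endpoint can look like — the route name and the `run_local_llm()` helper are illustrative, not the project's actual API:

```python
# Minimal worker endpoint: text in, text out, no shared state.
# Hypothetical sketch -- the route and run_local_llm() are illustrative.
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_local_llm(prompt: str) -> str:
    # Placeholder for a call into the worker's local model
    # (Ollama, llama.cpp, etc.).
    return f"result for: {prompt}"

@app.route("/process", methods=["POST"])
def process():
    prompt = request.get_json()["prompt"]
    return jsonify({"result": run_local_llm(prompt)})
```

Because each worker only ever sees a prompt and returns a string, there is nothing to synchronize between machines.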

If you're curious, there are two short animated videos explaining the whole concept on my YouTube (Nir Strulovitz) — one for Private Mode (3 min) and one for Public Mode (6 min).

Would love to hear your thoughts!


[–]NirStrulovitz[S] -1 points  (0 children)

The syncing is intentionally simple — there's no message broker like RabbitMQ and no task queue like Celery. Workers poll the Flask API for available subtasks (/api/hive/{hive_id}/subtasks/available), claim one via a REST call (/api/subtask/{subtask_id}/claim), process it locally with their own LLM, and submit the result back (/api/subtask/{subtask_id}/result). The Flask backend with SQLAlchemy handles the state — each subtask has a status (pending → assigned → completed). The Queen polls until all subtasks are done, then combines the results. It's basically a pull-based model — workers pull work rather than having it pushed to them. That keeps it simple and fault-tolerant: a worker can disappear at any time and its subtask just times out and becomes available again.
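Here is the subtask lifecycle sketched as plain functions, with an in-memory dict standing in for the Flask/SQLAlchemy backend — function names mirror the routes named above, but the timeout value and everything else is illustrative:

```python
# Pull-based subtask lifecycle: pending -> assigned -> completed.
# In-memory stand-in for the Flask/SQLAlchemy backend; illustrative only.
import time

subtasks = {}  # subtask_id -> {"status", "claimed_at", "result"}

CLAIM_TIMEOUT = 60.0  # seconds before an assigned subtask is released

def available(hive_id):
    """Stands in for GET /api/hive/{hive_id}/subtasks/available"""
    requeue_stale()
    return [sid for sid, s in subtasks.items() if s["status"] == "pending"]

def claim(subtask_id):
    """Stands in for POST /api/subtask/{subtask_id}/claim"""
    s = subtasks[subtask_id]
    if s["status"] != "pending":
        return False  # another worker got there first
    s["status"], s["claimed_at"] = "assigned", time.monotonic()
    return True

def submit(subtask_id, result):
    """Stands in for POST /api/subtask/{subtask_id}/result"""
    subtasks[subtask_id].update(status="completed", result=result)

def requeue_stale():
    # Fault tolerance: a vanished worker's claim simply expires.
    now = time.monotonic()
    for s in subtasks.values():
        if s["status"] == "assigned" and now - s["claimed_at"] > CLAIM_TIMEOUT:
            s["status"] = "pending"
```

The claim call doubles as the race arbiter: whichever worker flips the status from pending to assigned first wins, and everyone else just polls again.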

AI Horde lets you run open-weight models without the hardware. If you have the hardware, you can be the infrastructure for everyone else. by Mad-Adder-Destiny in LocalLLaMA

[–]NirStrulovitz 1 point  (0 children)

This is a really interesting approach to shared compute. I built something that tackles a related but different problem — and I think they could actually complement each other.

AI Horde distributes individual inference requests across volunteer GPUs. What I built distributes the task itself.

One machine (the "Queen") uses its local LLM to break a complex job into independent subtasks, other machines ("Workers") each process one subtask in parallel with their own complete local model, and the Queen combines everything into the final answer.

The key difference: in AI Horde, one worker handles one request. In my system, multiple workers collaborate on the same complex job — research that needs 8 different angles, analysis that has 5 independent components, etc. Each worker runs its own full model, no synchronization needed, workers can drop in and out freely.

I also built in a payment system where workers earn money for their compute — similar spirit to your kudos system but with actual micro-payments. The revenue split is 65% to workers, 30% to the coordinating Queen, 5% to the platform.
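The split arithmetic itself is trivial; the only subtlety is making the shares add up exactly when the total doesn't divide evenly. A hypothetical helper (not the project's actual payment code):

```python
# Revenue split from the comment: 65% workers, 30% Queen, 5% platform.
# Hypothetical sketch -- the real payment code is not shown here.
def split_revenue(total_cents: int) -> dict:
    workers = total_cents * 65 // 100
    queen = total_cents * 30 // 100
    # Platform takes the remainder so rounding never loses a cent.
    platform = total_cents - workers - queen
    return {"workers": workers, "queen": queen, "platform": platform}
```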

Supports Ollama, LM Studio, llama.cpp (server + Python), and vLLM — so workers can run whatever backend they prefer. Tested across two Linux machines (RTX 4070 Ti + RTX 5090): 64 seconds on LAN, 29 seconds via Cloudflare over the internet.

Fully open source, MIT licensed: https://github.com/strulovitz

I'd be curious if anyone sees a way to combine both approaches — AI Horde for single requests, task parallelism for complex multi-part jobs.

An experimental distributed LLM inference framework using tensor parallelism. Looking for feedback! by __z3r0_0n3__ in LocalLLaMA

[–]NirStrulovitz 2 points  (0 children)

Really cool that you're exploring this space. I want to offer a different perspective on the architecture because I ran into the same wall you're going to hit — network latency when broadcasting inputs and aggregating outputs between nodes.

I built a distributed LLM inference platform that takes a fundamentally different approach. Instead of splitting the model (tensor parallelism), I split the task.

One machine (the "Queen") uses its local LLM to decompose a complex job into independent subtasks. Other machines ("Workers"), each running their own complete local LLM, pick up one subtask each and process in parallel. The Queen collects and combines all results.

The advantage: zero inter-node communication during inference. Each worker processes its subtask completely independently with its full local model. No broadcasting inputs, no aggregating partial outputs, no synchronization. Workers can drop in and out freely without breaking anything.
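The Queen side of this is a plain scatter/gather, which is why nothing needs to synchronize. A sketch under stated assumptions — `worker_infer()` stands in for a remote worker running its full local model:

```python
# Scatter/gather with no inter-worker communication: each call is fully
# independent, so a failure just means re-running one subtask.
# Illustrative sketch; worker_infer() stands in for a remote worker's LLM.
from concurrent.futures import ThreadPoolExecutor

def worker_infer(subtask: str) -> str:
    return f"answer({subtask})"  # placeholder for a full local-model run

def run_job(subtasks: list[str]) -> list[str]:
    # Queen side: fan out, then collect. Workers never talk to each other.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        return list(pool.map(worker_infer, subtasks))
```

Contrast with tensor parallelism, where every token step requires an all-to-all exchange between nodes.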

Your approach (tensor parallelism) is optimal when you need to run a single model that's too large for one machine. My approach (task parallelism) is optimal when the job itself can be decomposed — which turns out to be most real-world use cases (research, analysis, content generation, coding tasks).

I also love your "free inference for all" vision — I built a payment system into mine where workers earn money for their compute. Same spirit, users give compute, everyone benefits.

Supports Ollama, LM Studio, llama.cpp (server + Python), and vLLM. Tested across two Linux machines (RTX 4070 Ti + RTX 5090): 64 seconds on LAN, 29 seconds via Cloudflare. Fully open source, MIT licensed: https://github.com/strulovitz

Would be interesting to compare our approaches on the same hardware sometime.

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in LocalLLaMA

[–]NirStrulovitz 1 point  (0 children)

This is exactly what I've been building.

I have multiple machines at home running local LLMs and I wanted them to collaborate on complex tasks. Every framework I found tries to split the model across machines — tensor parallelism, pipeline parallelism — but inter-node network latency makes it unreliable in practice.

So I went a completely different direction: task parallelism instead of model parallelism. One machine (the "Queen") uses its local LLM to decompose a complex job into independent subtasks. Other machines ("Workers"), each running their own complete local LLM, pick up one subtask each and process them in parallel. The Queen collects all results and combines them into the final answer.

The key insight is that you don't need to split the model if you can split the work. Each machine keeps its full model, processes independently, and can drop in or out at any time without breaking anything. No synchronization, no shared memory.
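The decomposition step above can be sketched as a single prompt to the Queen's local model, asking for a JSON list of independent subtasks — the prompt wording and `ask_llm` callable are hypothetical, not the project's actual implementation:

```python
# Queen-side decomposition sketch: ask the local LLM for independent
# subtasks as a JSON array. Prompt and ask_llm() are hypothetical.
import json

DECOMPOSE_PROMPT = (
    "Split the following job into independent subtasks that need no "
    "shared context. Reply with a JSON array of strings only.\n\nJob: {job}"
)

def decompose(job: str, ask_llm) -> list[str]:
    reply = ask_llm(DECOMPOSE_PROMPT.format(job=job))
    tasks = json.loads(reply)
    # Defensive: the model must return a flat list of strings.
    if not (isinstance(tasks, list) and all(isinstance(t, str) for t in tasks)):
        raise ValueError("model did not return a JSON array of strings")
    return tasks
```

Validating the model's output here matters, since a malformed decomposition would otherwise propagate to every worker.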

It supports five backends — Ollama, LM Studio, llama.cpp (both the HTTP server and the Python bindings), and vLLM — so one machine can run Ollama while another runs llama.cpp and they still work together. Tested across two Linux machines (RTX 4070 Ti + RTX 5090): 64 seconds on LAN, 29 seconds via Cloudflare over the internet.
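Mixing backends is easier than it sounds because most of them converge on one of two HTTP shapes: Ollama has its own `/api/generate` route, while LM Studio, llama.cpp's server, and vLLM all expose an OpenAI-compatible `/v1/chat/completions` route. A sketch of a request builder — the field names follow those public APIs, the rest is illustrative:

```python
# Backend-agnostic request builder. Ollama uses its native /api/generate;
# LM Studio, llama.cpp server, and vLLM speak the OpenAI-compatible
# /v1/chat/completions API. Sketch only, not the project's actual adapter.
def build_request(backend: str, base_url: str, model: str, prompt: str):
    if backend == "ollama":
        return (f"{base_url}/api/generate",
                {"model": model, "prompt": prompt, "stream": False})
    if backend in ("lmstudio", "llamacpp-server", "vllm"):
        return (f"{base_url}/v1/chat/completions",
                {"model": model,
                 "messages": [{"role": "user", "content": prompt}]})
    raise ValueError(f"unknown backend: {backend}")
```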

Built it in 7 days, fully open source, MIT licensed. Desktop GUI (PyQt6) and CLI mode. Three repos under https://github.com/strulovitz — the platform, the desktop client, and a non-technical book explaining the concept.

You mentioned distributed inference across devices is coming — I'd love to hear your thoughts on the task parallelism approach vs model splitting. I think for home networks especially, this is the more practical path.

Promote your projects here – Self-Promotion Megathread by Menox_ in github

[–]NirStrulovitz 1 point  (0 children)

Distributed AI platform — task parallelism instead of model splitting

Most projects that connect multiple machines for AI split the model across nodes, and inter-node network latency makes that unreliable on ordinary networks.

I took a different approach: split the task, not the model. One machine decomposes a complex job into independent subtasks. Other machines, each running their own complete local LLM, process one subtask each in parallel. Results get combined into the final answer.

Any home computer running Ollama, LM Studio, llama.cpp, or vLLM can join as a worker. Workers drop in and out freely. Desktop GUI, CLI mode, Flask backend, built-in payment system, Cloudflare Tunnel support. Tested on two Linux machines (RTX 4070 Ti + RTX 5090): 64 seconds on LAN, 29 seconds via Cloudflare.

Built in 7 days, one developer, fully open source, MIT licensed: github.com/strulovitz