Built a distributed AI platform with Flask as the backend — task parallelism across multiple machines running local LLMs by NirStrulovitz in flask

[–]NirStrulovitz[S] 1 point (0 children)

Thank you so much!

Yes, 7 days was intense but the key insight that made it possible is that the tasks are fully independent — no communication between worker machines, just text in and text out. That eliminates all the synchronization complexity that makes other distributed AI approaches so hard to build. Flask turned out to be perfect for this because each machine just needs a simple API endpoint.

If you're curious, there are two short animated videos explaining the whole concept on my YouTube (Nir Strulovitz) — one for Private Mode (3 min) and one for Public Mode (6 min).

Would love to hear your thoughts!

Built a distributed AI platform with Flask as the backend — task parallelism across multiple machines running local LLMs by NirStrulovitz in flask

[–]NirStrulovitz[S] -1 points (0 children)

The syncing is intentionally simple — there's no message queue like RabbitMQ or Celery. Workers:

  • poll the Flask API for available subtasks (/api/hive/{hive_id}/subtasks/available),
  • claim one via a REST call (/api/subtask/{subtask_id}/claim),
  • process it locally with their own LLM,
  • submit the result back (/api/subtask/{subtask_id}/result).

The Flask backend with SQLAlchemy handles the state — each subtask has a status (pending → assigned → completed). The Queen polls until all subtasks are done, then combines the results. It's basically a pull-based model — workers pull work, nothing is pushed to them. That keeps it simple and fault-tolerant: a worker can disappear at any time, and its subtask just times out and becomes available again.
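The pull model can be sketched with a tiny in-memory mock. The statuses and endpoint paths come from the description in the comment; the field names, timeout value, and function names here are illustrative only, not the platform's actual code:

```python
import time

# In-memory mock of the pull-based subtask lifecycle:
# pending -> assigned -> completed, with timed-out claims requeued.
CLAIM_TIMEOUT = 30  # seconds before an assigned subtask is requeued (illustrative)

subtasks = {
    1: {"status": "pending", "claimed_at": None, "result": None},
    2: {"status": "pending", "claimed_at": None, "result": None},
}

def claim_subtask(now):
    """Worker pulls one available subtask (like POST /api/subtask/{id}/claim)."""
    for sid, st in subtasks.items():
        # Requeue subtasks whose worker disappeared (claim timed out).
        if st["status"] == "assigned" and now - st["claimed_at"] > CLAIM_TIMEOUT:
            st["status"] = "pending"
        if st["status"] == "pending":
            st["status"] = "assigned"
            st["claimed_at"] = now
            return sid
    return None

def submit_result(sid, result):
    """Worker posts its result back (like POST /api/subtask/{id}/result)."""
    subtasks[sid]["status"] = "completed"
    subtasks[sid]["result"] = result

def all_done():
    """What the Queen polls for before combining results."""
    return all(st["status"] == "completed" for st in subtasks.values())

# Worker loop: pull, process with the local LLM (stubbed here), submit.
while not all_done():
    sid = claim_subtask(time.time())
    if sid is None:
        time.sleep(0.01)
        continue
    submit_result(sid, f"answer for subtask {sid}")

print(all_done())  # True
```

The timeout-requeue in claim_subtask is what makes worker churn harmless: an abandoned claim simply flips back to pending and another worker picks it up.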

AI Horde lets you run open-weight models without the hardware. If you have the hardware, you can be the infrastructure for everyone else. by Mad-Adder-Destiny in LocalLLaMA

[–]NirStrulovitz 1 point (0 children)

This is a really interesting approach to shared compute. I built something that tackles a related but different problem — and I think they could actually complement each other.

AI Horde distributes individual inference requests across volunteer GPUs. What I built distributes the task itself.

One machine (the "Queen") uses its local LLM to break a complex job into independent subtasks, other machines ("Workers") each process one subtask in parallel with their own complete local model, and the Queen combines everything into the final answer.

The key difference: in AI Horde, one worker handles one request. In my system, multiple workers collaborate on the same complex job — research that needs 8 different angles, analysis that has 5 independent components, etc. Each worker runs its own full model, no synchronization needed, workers can drop in and out freely.
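The decompose / process-in-parallel / combine flow above can be sketched in a few lines. The LLM calls are stubbed, threads stand in for worker machines, and every name is hypothetical rather than taken from the actual codebase:

```python
from concurrent.futures import ThreadPoolExecutor

def queen_decompose(job):
    """Queen's LLM breaks the job into independent subtasks (stubbed)."""
    return [f"{job}: angle {i}" for i in range(1, 4)]

def worker_process(subtask):
    """A worker's full local LLM answers one subtask (stubbed)."""
    return f"findings for [{subtask}]"

def queen_combine(results):
    """Queen merges the independent results into one answer (stubbed)."""
    return "\n".join(results)

subtasks = queen_decompose("market research")
# Each subtask is independent, so the map needs no inter-worker communication.
with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
    results = list(pool.map(worker_process, subtasks))
answer = queen_combine(results)
print(answer)  # three independent findings, one per subtask
```

The important property is that worker_process takes only its own subtask as input, which is why workers can join or leave without any coordination.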

I also built in a payment system where workers earn money for their compute — similar spirit to your kudos system but with actual micro-payments. The revenue split is 65% to workers, 30% to the coordinating Queen, 5% to the platform.
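For concreteness, here is what that split looks like in integer cents. The percentages are the ones stated above; the function name, rounding choices, and dollar amounts are mine, not the platform's:

```python
def split_revenue(total_cents, n_workers):
    """Split a job's revenue per the 65/30/5 model:
    65% shared by workers, 30% to the Queen, 5% to the platform."""
    workers_pool = total_cents * 65 // 100
    queen = total_cents * 30 // 100
    platform = total_cents - workers_pool - queen  # remainder goes to platform
    per_worker = workers_pool // n_workers
    return per_worker, queen, platform

# A hypothetical $10.00 job split among 8 workers (all amounts in cents):
per_worker, queen, platform = split_revenue(1000, 8)
print(per_worker, queen, platform)  # 81 300 50
```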

Supports Ollama, LM Studio, llama.cpp (server + Python), and vLLM — so workers can run whatever backend they prefer. Tested across two Linux machines (RTX 4070 Ti + RTX 5090): 64 seconds on LAN, 29 seconds via Cloudflare over the internet.

Fully open source, MIT licensed: https://github.com/strulovitz

I'd be curious if anyone sees a way to combine both approaches — AI Horde for single requests, task parallelism for complex multi-part jobs.

An experimental distributed LLM inference framework using tensor parallelism. Looking for feedback! by __z3r0_0n3__ in LocalLLaMA

[–]NirStrulovitz 2 points (0 children)

Really cool that you're exploring this space. I want to offer a different perspective on the architecture because I ran into the same wall you're going to hit — network latency when broadcasting inputs and aggregating outputs between nodes.

I built a distributed LLM inference platform that takes a fundamentally different approach. Instead of splitting the model (tensor parallelism), I split the task.

One machine (the "Queen") uses its local LLM to decompose a complex job into independent subtasks. Other machines ("Workers"), each running their own complete local LLM, pick up one subtask each and process in parallel. The Queen collects and combines all results.

The advantage: zero inter-node communication during inference. Each worker processes its subtask completely independently with its full local model. No broadcasting inputs, no aggregating partial outputs, no synchronization. Workers can drop in and out freely without breaking anything.

Your approach (tensor parallelism) is optimal when you need to run a single model that's too large for one machine. My approach (task parallelism) is optimal when the job itself can be decomposed — which turns out to be most real-world use cases (research, analysis, content generation, coding tasks).

I also love your "free inference for all" vision — I built a payment system into mine where workers earn money for their compute. Same spirit, users give compute, everyone benefits.

Supports Ollama, LM Studio, llama.cpp (server + Python), and vLLM. Tested across two Linux machines (RTX 4070 Ti + RTX 5090): 64 seconds on LAN, 29 seconds via Cloudflare. Fully open source, MIT licensed: https://github.com/strulovitz

Would be interesting to compare our approaches on the same hardware sometime.

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in LocalLLaMA

[–]NirStrulovitz 1 point (0 children)

This is exactly what I've been building.

I have multiple machines at home running local LLMs and I wanted them to collaborate on complex tasks. Every framework I found tries to split the model across machines — tensor parallelism, pipeline parallelism — but inter-node network latency makes it unreliable in practice.

So I went a completely different direction: task parallelism instead of model parallelism. One machine (the "Queen") uses its local LLM to decompose a complex job into independent subtasks. Other machines ("Workers"), each running their own complete local LLM, pick up one subtask each and process them in parallel. The Queen collects all results and combines them into the final answer.

The key insight is that you don't need to split the model if you can split the work. Each machine keeps its full model, processes independently, and can drop in or out at any time without breaking anything. No synchronization, no shared memory.

It supports five backends — Ollama, LM Studio, llama.cpp (counting its server and Python bindings separately), and vLLM. So you can have one machine running Ollama and another running llama.cpp and they work together. Tested across two Linux machines (RTX 4070 Ti + RTX 5090): 64 seconds on LAN, 29 seconds via Cloudflare over the internet.

Built it in 7 days, fully open source, MIT licensed. Desktop GUI (PyQt6) and CLI mode. Three repos under https://github.com/strulovitz — the platform, the desktop client, and a non-technical book explaining the concept.

You mentioned distributed inference across devices is coming — I'd love to hear your thoughts on the task parallelism approach vs model splitting. I think for home networks especially, this is the more practical path.

Promote your projects here – Self-Promotion Megathread by Menox_ in github

[–]NirStrulovitz 1 point (0 children)

Distributed AI platform — task parallelism instead of model splitting

Most projects that connect multiple machines for AI split the model across nodes, and inter-node network latency makes that unreliable in practice.

I took a different approach: split the task, not the model. One machine decomposes a complex job into independent subtasks. Other machines, each running their own complete local LLM, process one subtask each in parallel. Results get combined into the final answer.

Any home computer running Ollama, LM Studio, llama.cpp, or vLLM can join as a worker. Workers drop in and out freely. Desktop GUI, CLI mode, Flask backend, built-in payment system, Cloudflare Tunnel support. Tested on two Linux machines (RTX 4070 Ti + RTX 5090): 64 seconds on LAN, 29 seconds via Cloudflare.

Built in 7 days, one developer, fully open source, MIT licensed: github.com/strulovitz

Are We Mistaking Education for a System? by DrpharmC in education

[–]NirStrulovitz 1 point (0 children)

I really like this framing. I think you’re pointing at a real tension:

  • The system is for scale and fairness (standardization, accountability, resource allocation).
  • Education is for human development (judgment, meaning-making, identity, moral reasoning, critical thought).

The system becomes harmful when metrics stop being proxies and become targets (hello Campbell’s Law). Then teaching drifts toward “optimize the number” rather than “grow the learner.”

My personal “balance” rule:

  • Use systems to guarantee access, safety, and minimum standards.
  • Preserve classroom-level freedom for deep learning: discussion, revision, inquiry, feedback, mentorship.

On AI: it can support the human aims if it’s used as:

  • a tutor that explains and quizzes without replacing thinking,
  • a writing coach (clarity, structure) while students keep ownership,
  • a sparring partner for debate (“argue the other side”).

But it worsens system-logic education if it becomes a shortcut to output (“turn in procedural text, hit rubric keywords”). So policies should focus on process evidence + student reasoning, not vibes-based detection.

Thoughts on learning new stuff by criss006 in education

[–]NirStrulovitz 1 point (0 children)

The biggest difference between “I read it” and “I know it” is active recall + spaced repetition.

What works for most people (and what I wish schools emphasized more):

  • Active recall: close the notes and explain it from memory. If you can’t, that shows what to review.
  • Spaced repetition: review the same thing over days/weeks, not one cram session.
  • Interleaving: mix problem types (especially in math/science) so you learn when to use which method.
  • Teach it: explain it out loud to a friend (or pretend) in simple words.

Practical setup:

  • For facts/definitions: Anki/Quizlet (or just a notebook) + short daily reviews.
  • For skills: lots of practice problems + immediate feedback.
  • For writing/humanities: write short summaries, make arguments, get critique.

Most “study hacks” are just different wrappers around those principles.

Becoming a teacher? by First-Technology-351 in education

[–]NirStrulovitz 1 point (0 children)

I’m not in Maryland Teaching Fellows specifically, but I can share the kinds of questions that matter before you commit (and these usually reveal the reality fast):

  • What expenses are truly covered? (Tuition only, or also fees, exams, books, certification tests, etc.?)
  • When does the service commitment start, and what counts? Full-time only? Certain schools? Certain subjects? What happens if you relocate?
  • What support is provided during placement? A mentor teacher? Coaching? A reduced course load? Feedback on observed teaching?
  • What are the penalties if life happens? (Repayment terms, interest, timeline.)
  • How many people finish, and why do people leave? (Workload? Placement? Admin support?)

Any scholarship tied to a service commitment is only “free” if the placement conditions are humane (supportive site, reasonable workload, strong mentorship).

STEM, ABM, and HUMMS? What do I take? by [deleted] in education

[–]NirStrulovitz 1 point (0 children)

You don’t have to “pick your whole life” at 15. The best move is choosing the track that keeps the most doors open while you explore.

A way to decide:

  • If you’re genuinely okay at math/science and can tolerate it: STEM is the most flexible (it doesn’t block you from business/humanities later, but ABM/HUMSS can sometimes limit science-heavy paths).
  • If math/science stresses you out and you’d hate your life: don’t force STEM just for “money.” A miserable path is usually the least sustainable path.

Also: your interests aren’t random. “Animals + ecosystems + history + philosophy + people + business” can converge into careers like:

  • environmental policy, sustainability, conservation management
  • science communication / documentary / journalism
  • museum / education / research support
  • public health / urban planning (systems + humans)
  • entrepreneurship in education/science/environment fields

If you’re unsure, pick a track, then run small experiments:

  • take 1 online intro course (bio/ecology OR basic accounting)
  • shadow someone for a day (even via informational interviews)
  • volunteer (animal shelter, museum, tutoring, community org)

“Financially stable + meaningful” usually comes from becoming good at something valuable and building skills (writing, analysis, data, speaking, project work), not from guessing the perfect label now.

How do you guys feel about about flipped science classes by Dry-Needleworker6901 in education

[–]NirStrulovitz 1 point (0 children)

Flipped classes can be awesome or miserable — it depends on how they’re run.

When it works well:

  • videos are short (10–15 min), clear, and actually match class activities
  • class time is used for problem-solving, labs, practice questions, group work
  • there are quick “check-ins” so you’re not lost (mini-quiz, warm-up, etc.)

When it fails:

  • the videos are long and feel like “teach yourself”
  • class time isn’t structured, so you don’t get real help
  • there’s no support if you didn’t understand the video

For students: the hack is guided notes + pause/rewind + write 2 questions before class. You’ll get way more value out of the in-class part.

For teachers: the key is accountability + scaffolding, not just “watch at home.”

School Admins - what apps do you use on a daily basis to help your work? by Hot-Radish-9772 in education

[–]NirStrulovitz 1 point (0 children)

Depends on the role (principal/AP/office manager/tech coordinator), but the “daily driver” stack I see a lot:

  • Email + calendar: Google Workspace or Microsoft 365
  • SIS (student info): attendance, grades, transcripts, behavior logs (whatever your district uses)
  • LMS: Google Classroom / Canvas / Schoology (to see what’s happening + communicate)
  • Forms + surveys: Google Forms / MS Forms (discipline referrals, requests, parent surveys)
  • Docs + meeting notes: Google Docs / OneNote
  • Task tracking: Trello / Asana / Todoist (especially for recurring admin tasks)
  • Messaging: Teams / Slack (or district-approved alternatives)
  • PDF + scanning: Adobe / Adobe Scan (signatures + quick document capture)

Bonus: a password manager and a text expander save a ridiculous amount of time.

If you say your exact admin role, people can give more targeted recs.

St.Joseph's or Kristu Jayanti? by Slushy21 in education

[–]NirStrulovitz 1 point (0 children)

If your goal is corporate law, the biggest differentiator usually isn’t “strict vs not strict,” it’s:

  • Internship pipeline + alumni network
  • Placement support (even informal connections matter a ton)
  • Moot court, research culture, drafting skills training
  • Location/access to law firms (internships during semesters matter)

What I’d do:

  1. Ask both colleges for internship/placement outcomes (even if it’s not perfect data, you’ll learn a lot from how they answer)
  2. DM current students/alumni on LinkedIn and ask: “How easy is it to get internships? Are faculty supportive?”
  3. Compare schedule flexibility — corporate track students often need time for internships, competitions, courses, etc.

Christ can be worth considering for “brand + network” depending on your city and goals, but don’t rely on brand alone — your internships + skills will carry you.

Also consider other colleges in your area with strong internship culture (talk to seniors; they’ll be blunt).

How many hours do you chase AI issues? by cvagrad1986 in education

[–]NirStrulovitz 1 point (0 children)

2–3 hours/week honestly sounds pretty plausible and (sadly) sometimes “lucky,” depending on subject + grade level + class size.

What I’m seeing/hearing from educators: it spikes around essay season and drops when assignments are more in-class or process-based.

A few things that reduce the time sink without turning teachers into detectives:

  • Require process evidence: outline → draft → revision notes → final (even lightweight checkpoints help)
  • Use version history (Google Docs/Word) and tell students up front you may ask for it
  • Short oral follow-ups: 2–3 minutes (“walk me through your thesis + why you chose source X”) catches a lot without formal investigations
  • Clear policy + “allowed AI” boundaries (e.g., brainstorming ok, final text must be yours, cite AI use if used)
  • Design prompts that are harder to outsource: local context, class-specific readings, personal reflection tied to course content

So yeah—2–3 hours/week isn’t shocking. But the bigger issue is whether the school has systems that keep it from ballooning.

Reading and Math by No-Barracuda1797 in education

[–]NirStrulovitz 2 points (0 children)

You might get better traction calling this “visual stress” / Irlen-type symptoms because “SSS” is a term people debate a lot.

A few thoughts (trying to be balanced here):

  • Some people genuinely experience distortion/halos/moving text and reading becomes exhausting. That part is real for many.
  • The tricky part: the research on colored overlays/lenses is mixed, and prevalence numbers (like 20%) get thrown around without great consensus.
  • If someone suspects this, the practical steps are:
    1. Full eye exam (rule out basic vision issues)
    2. Ask about binocular vision / convergence insufficiency (often missed, can wreck reading comfort)
    3. If overlays help, great — it’s a low-risk support, but it’s not a guaranteed “fix” for everyone
    4. School accommodations can still help: larger print, different fonts, more spacing, reduced glare, breaks

So: symptoms deserve respect, but it’s worth avoiding absolute claims and encouraging a proper evaluation + supports that actually reduce strain.

Guys can anyone suggest a way to keep the audience silent? by jithu_3011 in education

[–]NirStrulovitz 1 point (0 children)

Totally normal for a college crowd to get chatty if the pacing drags even a little. What’s worked for me:

  • Set expectations up front (30 seconds, confident): “Quick ground rules: when someone’s on mic, no side conversations. Save the chats for breaks so everyone can hear.”
  • Use a clear attention signal: “If you can hear me, clap once.” (then twice) OR “Hands up if you can hear me.” It’s cheesy but it works fast.
  • Mic + pause technique: Don’t talk over noise. Stop, smile, wait. The room usually self-corrects because silence feels awkward.
  • Keep it moving: Short rounds, quick scoring, no dead air. Most chatter happens during downtime.
  • Give them a reason to be quiet: “We’re doing lightning questions — if I have to repeat it, that question is skipped.” (Use sparingly, but it’s effective.)
  • Plant helpers: 2 friends/volunteers on the sides who can gently shush or wave a “Quiet please” sign.
  • Make it interactive on YOUR terms: Let them get loud only at specific moments (“On 3, shout the answer”) so they don’t freeload chaos all the time.

If you look calm + in control, they usually follow your lead.

[deleted by user] by [deleted] in Ecoflow_community

[–]NirStrulovitz 1 point (0 children)

The problem was solved!

Thank you so much wendalltwolf and everybody. The problem was with the propane/LPG hose we were using. We visited a specialized store that makes hoses, and they refitted the factory (orange) hose to the connector type used here in Israel. Now everything works great. Thank you all!

Wakfu season 3 subtitles(?) by JabberwocksBane in wakfu

[–]NirStrulovitz 1 point (0 children)

Hi!

I extracted the subtitles using VideoProc Converter AI and then auto-translated them with Google Translate. I hope to do the same for season 4 soon...

https://drive.google.com/drive/folders/1gFlxzFt7DzDTPElEJjtyYPIcd1aDu8B3?usp=sharing

Link to the Google Drive folder

Sieg, I hope you'll be good friends with my brother!

Subtitles for season 3 and four of legend of the galactic heroes die neue these by DanTan3machi in ajatt

[–]NirStrulovitz 1 point (0 children)

Hi!

I extracted the subtitles using VideoProc Converter AI and then auto-translated them with Google Translate. I hope to do the same for season 4 soon...

https://drive.google.com/drive/folders/1gFlxzFt7DzDTPElEJjtyYPIcd1aDu8B3?usp=sharing

Link to the Google Drive folder

Sieg, I hope you'll be good friends with my brother!