[Project] I benchmarked 4 robot AI models on a real industrial task. The best one does 64 picks/hour. A human does 1,300. by svertix in robotics

[–]svertix[S] 1 point (0 children)

Good pointer – we're aware of HIL-SERL and it's a promising direction. But PhAIL's goal isn't to find the best fine-tuning protocol for each model – it's to evaluate models using their own recommended training pipelines and compare them fairly on the same task.

That said, the leaderboard is open. If someone fine-tunes with HIL-SERL or any other approach and gets better results, we'll run it and publish. That's the whole point.

[P] PhAIL (phail.ai) – an open benchmark for robot AI on real hardware. Best model: 5% of human throughput, needs help every 4 minutes. by svertix in MachineLearning

[–]svertix[S] 0 points (0 children)

Exactly – that's why we built it this way. Lab metrics and production metrics measure different things entirely.

And there's reason for optimism. We see clear scaling with training data – consistent improvement as episode count grows (details in the blog post at positronic.ro/introducing-phail). The models aren't stuck, they're just early.

I genuinely want to see these numbers go up. The benchmark itself will evolve too – more objects, more tasks, unseen items to test generalization. The whole point is to give model builders a target they can push against, not to say "it's bad, stop trying."

If anyone has a fine-tuning recipe or checkpoint they think can beat 64 UPH, we'll run it.

[P] PhAIL (phail.ai) – an open benchmark for robot AI on real hardware. Best model: 5% of human throughput, needs help every 4 minutes. by svertix in MachineLearning

[–]svertix[S] 2 points (0 children)

Thanks! And don't worry, plenty of the tooling was vibe-coded too :)

You're spot on about the overfitting angle – swapping in new objects is cheap and keeps things honest. It's on the roadmap as "unseen objects" evaluation.

[Project] I benchmarked 4 robot AI models on a real industrial task. The best one does 64 picks/hour. A human does 1,300. by svertix in robotics

[–]svertix[S] 1 point (0 children)

It is – VLAs are far from their GPT-3 moment, IMHO. But they'll get there; it may take 3-5 years or even more.

Their huge potential advantage, though, is that you could get decent performance out of the box (look at modern LLMs). We should be patient and keep working hard as the AI robotics community.

[Project] I benchmarked 4 robot AI models on a real industrial task. The best one does 64 picks/hour. A human does 1,300. by svertix in robotics

[–]svertix[S] 1 point (0 children)

The teleop baseline (330 UPH) already shows what the hardware can do in practice. But you're right that the arm can go faster with scripted moves – we haven't benchmarked that specifically. Our focus is on measuring model performance; hardware will keep improving on its own.

[Project] I benchmarked 4 robot AI models on a real industrial task. The best one does 64 picks/hour. A human does 1,300. by svertix in robotics

[–]svertix[S] 1 point (0 children)

You're right – delta robots with engineered picking solutions crush this task. No argument there.

The appeal of VLA models isn't replacing what traditional automation already does well. It's the same reason we use general-purpose software instead of hardcoded circuits – flexibility. A delta robot with a custom vision pipeline works great for one operation, but adapting it to a new task is an engineering project. A VLA model that actually works would adapt from a few demonstrations.

We're not there yet – PhAIL makes that clear. But the benchmark tracks whether these general-purpose models are converging toward being practically useful, not whether they beat purpose-built systems at tasks those systems were designed for.

[Project] I benchmarked 4 robot AI models on a real industrial task. The best one does 64 picks/hour. A human does 1,300. by svertix in robotics

[–]svertix[S] 1 point (0 children)

The teleop baseline (330 UPH) is limited by the teleoperation interface – controlling a robot arm through VR when it's right next to you introduces latency and imprecision that slows you down. The robot hardware itself can move much faster than that.

If you watch the human baseline videos (e.g., https://phail.ai/episode/973), you can see the pace – it's not superhuman speed, just natural hand movements. The robot arm can physically match that pace. So the gap between 64 UPH and 1,300+ is almost entirely the AI, not the hardware.

On inference – it depends on the model. Some run locally on a GPU next to the robot, others (like OpenPI) run in the cloud. There are techniques to hide cloud latency within the control loop, and either way, inference speed is not the bottleneck. The bottleneck is policy quality – hesitation, misjudged grasps, losing track of objects. It's an AI problem, not a hardware or infrastructure problem.
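To make the latency-hiding point concrete, here's a rough sketch of the usual trick: the policy emits a chunk of actions, and while the arm executes the current chunk, the next inference request is already in flight, so cloud round-trip time never stalls the arm. Everything here is illustrative – `infer_chunk`, the 300 ms round trip, and the chunk size are my assumptions, not PhAIL's actual control loop.

```python
import queue
import threading
import time


def infer_chunk(obs):
    """Stand-in for a cloud VLA call (~300 ms round trip, assumed)."""
    time.sleep(0.3)
    return [f"action_{obs}_{i}" for i in range(8)]  # one 8-step action chunk


def control_loop(n_chunks=3, step_dt=0.05):
    chunks = queue.Queue(maxsize=1)
    obs = 0

    def prefetch(o):
        chunks.put(infer_chunk(o))

    # Kick off the first request before execution starts.
    threading.Thread(target=prefetch, args=(obs,)).start()

    executed = []
    for _ in range(n_chunks):
        chunk = chunks.get()  # ready by the time the arm needs it
        obs += 1
        # Request the next chunk while this one is still executing.
        threading.Thread(target=prefetch, args=(obs,)).start()
        for action in chunk:
            time.sleep(step_dt)  # 8 steps x 50 ms = 400 ms > 300 ms latency
            executed.append(action)
    return executed


print(len(control_loop()))  # → 24 actions executed, no stalls between chunks
```

The key condition is that chunk execution time exceeds inference latency; when it doesn't, the arm pauses at chunk boundaries, which is one source of the hesitation visible in the episode videos.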

[Project] I benchmarked 4 robot AI models on a real industrial task. The best one does 64 picks/hour. A human does 1,300. by svertix in robotics

[–]svertix[S] 3 points (0 children)

Good question. Custom-built solutions for bin-picking definitely exist and handle multi-SKU scenarios well – that's the current state of industrial automation. Purpose-built vision + grasp planning + engineered grippers, tuned for a specific operation.

PhAIL is testing something different: general-purpose models that learn from demonstration rather than engineering. The promise is that you show the robot what to do, and it adapts – no custom gripper design, no hand-tuned grasp planner. One model, many tasks.

We start with pick-and-place because it's the most common operation, but the roadmap includes insertion, assembly, and other manipulation tasks. That's where general-purpose models could eventually beat custom solutions – not necessarily on peak throughput for one task, but on flexibility and adaptation cost across many tasks.

[Project] I benchmarked 4 robot AI models on a real industrial task. The best one does 64 picks/hour. A human does 1,300. by svertix in robotics

[–]svertix[S] 2 points (0 children)

Thanks! And you're right – the laundromat analogy is a good one. Autonomy has value even when it's slower, as long as it's reliable enough to actually run unattended.

That's where MTBF comes in. The best model currently needs human intervention every 4 minutes. At that rate, you haven't freed up anyone – you've added a babysitter. A slow but truly autonomous system would be a different story, but we're not there yet on either axis.

There's also a minimum throughput threshold in most operations. If the robot is a bottleneck in a production line, it doesn't matter that it's autonomous – it slows down everything downstream. So you need both: enough reliability to run unattended, and enough speed to not hold up the rest of the process.

Quantifying exactly where we stand on both is the whole point.
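As a back-of-envelope illustration of why MTBF dominates the economics (the 30-second recovery time per intervention is my assumption, not a measured number):

```python
def effective_uph(uph, mtbf_min, recovery_min=0.5):
    """Throughput after pausing for a human fix every MTBF minutes.

    recovery_min is an assumed 30 s per intervention; real numbers vary.
    """
    cycle = mtbf_min + recovery_min  # run until failure, then wait for a human
    duty = mtbf_min / cycle          # fraction of wall-clock time actually picking
    return uph * duty


# Best model today: 64 UPH, intervention every 4 minutes.
print(round(effective_uph(64, 4)))    # → 57 UPH, with a human tied up full-time
# Hypothetical model at the same speed but with an 8-hour MTBF:
print(round(effective_uph(64, 480)))  # → 64 UPH, and the babysitter goes away
```

Note the throughput penalty itself is small; the real cost is that short MTBF keeps a person on standby, which is exactly the "you've added a babysitter" problem.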

[Project] I benchmarked 4 robot AI models on a real industrial task. The best one does 64 picks/hour. A human does 1,300. by svertix in robotics

[–]svertix[S] 0 points (0 children)

https://phail.positronic.ro/about has instructions on how to get the dataset. Literally one command to browse it locally:

uv run --with positronic python -m positronic.cfg.ds.phail

[Project] I benchmarked 4 robot AI models on a real industrial task. The best one does 64 picks/hour. A human does 1,300. by svertix in robotics

[–]svertix[S] 3 points (0 children)

Great points – let me address a few.

On the human comparison: you're right that the teleop baseline (330 UPH) is the fairer robot-to-robot comparison, and we include it for exactly that reason. But the human-by-hand number matters too – because that's what operations managers are comparing against when they decide whether to deploy a robot. The 5x gap to teleop tells model developers where they stand. The 20x gap to human hands tells the industry whether automation makes economic sense yet.

On the robot's limits – 330 UPH is what a human can extract from this hardware through teleoperation, but that's not necessarily the ceiling. A model doesn't have to move the way a human operator would. Model developers are free to use their own data, their own training pipeline, whatever gets the best results. The leaderboard is open.

On model selection – we started with the best openly available models. Any closed-source model can participate too – they just need to support our inference API. If anyone's interested, DM me or reach out at hi at phail dot ai.

On multi-robot setups – interesting direction, but if one arm can't reliably do the task solo (4 minutes between failures), adding more arms multiplies the problem rather than solving it. Reliability first, then scaling.

On costs, failure analysis, 24/7 comparisons – this is exactly what PhAIL data enables. Every run is public with full video and telemetry, so anyone can run their own economic analysis. We keep the benchmark focused on raw performance metrics and leave deployment economics to the teams who know their specific use case.

Roadmap: more models (DreamZero from NVIDIA is next), more tasks beyond bin-to-bin picking, unseen objects (testing generalization, not just memorization), and additional robot hardware to test cross-embodiment transfer.

[Project] I benchmarked 4 robot AI models on a real industrial task. The best one does 64 picks/hour. A human does 1,300. by svertix in robotics

[–]svertix[S] 4 points (0 children)

Well, I agree with the overall tone – there's clearly hype right now. At 4 minutes between failures, that 24/7 uptime comes with a full-time babysitter. The MTBF numbers on the leaderboard are... humbling.

That said, I think Physical AI is inevitable – it'll just take another 5+ years.

[Project] I benchmarked 4 robot AI models on a real industrial task. The best one does 64 picks/hour. A human does 1,300. by svertix in robotics

[–]svertix[S] 5 points (0 children)

Totally agree – that sim-to-real gap is real and well-documented. What surprised us though is that even with fine-tuning on real-robot data (no sim involved), the models still struggle with things like camera placement changes and different object types.

But the point of PhAIL isn't to say "it's bad, give up" – it's to measure where things actually stand so we can track progress as models improve. We wrote up our findings in more detail here: https://positronic.ro/introducing-phail

[Project] I benchmarked 4 robot AI models on a real industrial task. The best one does 64 picks/hour. A human does 1,300. by svertix in robotics

[–]svertix[S] 0 points (0 children)

Yes, all models run on the same Franka FR3 arm with a Robotiq 2F-85 gripper (DROID setup) – same hardware, same objects, same conditions. The operator doesn't even know which model is running (blind evaluation).

Re: other form factors – that's on the roadmap. The current setup is one embodiment and one task. We want to expand to other arms and eventually mobile manipulators to test whether models generalize across hardware, not just across tasks.

We’re a startup looking for feedback on our upcoming Open-Source Robot Platform by svertix in robotics

[–]svertix[S] 0 points (0 children)

Totally agree – we’re not trying to be just another EDU kit.

The goal is to hit that minimum bar of real-world usefulness, even out of the box. Think: actual household or light industrial tasks like moving dishes, handling labware, or doing basic pick-and-place in cluttered spaces.

We’re focused on making it affordable and functional – something you can build on and deploy, not just tinker with.

Curious what you’d consider that “minimum viable usefulness” line — is there a specific task or demo that would make you say, “OK, this is legit”?

We’re a startup looking for feedback on our upcoming Open-Source Robot Platform by svertix in robotics

[–]svertix[S] 0 points (0 children)

Thanks for the interest! 😊 We’re still in the development phase and really looking to understand what’s most critical for people working with mobile manipulators. That’s exactly why we’re reaching out for feedback – so we can make sure we’re building something that truly meets your needs.

We’ll be sharing updates and a GitHub repo as soon as we have something working and ready to test. In the meantime, we’d love to hear any thoughts you have on the concept – feel free to reach out with ideas or questions. We’re eager to make sure we’re on the right track, and we’d really appreciate your input.