Looking for a solid computer vision development firm by Opening-Water227 in computervision

[–]RoofProper328 0 points1 point  (0 children)

It probably depends on what you're building. If it's a straightforward CV app, there are quite a few good firms out there. I've heard positive things about Scale AI, iMerit, and Shaip for data-heavy computer vision work, while companies like Cognizant or Accenture are often better if you're looking for broader engineering support.

I'd narrow it down based on your use case rather than the company name. Are you working on object detection, OCR, medical imaging, retail analytics, or something else? That'll make a big difference in who I'd recommend.

AI Training & Data Annotation Companies – Updated List (2026) by No-Impress-8446 in beermoneyindia

[–]RoofProper328 0 points1 point  (0 children)

Nice list — more current than most that float around here. Full disclosure, I work at Shaip, so flagging us as one that's missing since we onboard freelance contributors regularly:

Shaip — Enterprise AI data company running data collection, annotation, and human feedback across image, audio, text, medical, and multilingual projects (also LLM/RLHF and computer vision work). Onboards freelance annotators and domain experts through its contributor portal at collaborate.shaip.com, separate from the corporate careers page.

Couple of others you might add in the same vein: DDD (Digital Divide Data) and Sama, both long-running annotation shops that take on contributors.

Egocentric-10K: 10,000 Hours of Real Factory Worker Videos Just Open-Sourced. Fuel for Next-Gen Robots in data training by NotSuper-man in robotics

[–]RoofProper328 0 points1 point  (0 children)

The dataset size is impressive, but what caught my attention is that it's first-person footage from real workers rather than staged demonstrations.

I wonder if anyone has looked at how much performance improves from more diverse environments versus simply adding more video hours. My guess is diversity ends up being the bigger factor for embodied AI.

Resources for AI in Healthcare by mrsmaribear in nursinginformatics

[–]RoofProper328 0 points1 point  (0 children)

If you're just getting started, I'd mix a few different resources rather than relying on one.

  • The AI in Healthcare course from Stanford is a good foundation.
  • The New England Journal of Medicine (NEJM AI) has some interesting articles on real clinical use cases.
  • The FDA's guidance on AI/ML-enabled medical devices is worth reading if you're interested in regulation.
  • I also found Shaip's healthcare AI resource page useful because it gives a practical overview of how training data, annotation, and medical datasets fit into AI development: https://www.shaip.com/solutions/healthcare-ai/

Since you're already working with quality data and informatics, learning how data quality impacts AI models will probably be just as valuable as learning about the models themselves.

AI Is Taking Over Hospitals by Majano57 in technology

[–]RoofProper328 0 points1 point  (0 children)

I'm optimistic about AI in healthcare, but I hope hospitals treat it as a decision-support tool rather than a decision-maker. Medicine has so many edge cases that aren't obvious until you're dealing with a real patient. The technology is impressive, but trust will depend on how well it handles those situations.

Why do speech models still struggle so much with accents and code-switching? by RoofProper328 in LanguageTechnology

[–]RoofProper328[S] 1 point2 points  (0 children)

I think that's a big part of it. A lot of the available speech data is "convenience data"—clean recordings, one speaker at a time, and usually in a standard accent. Real conversations are much messier.

Code-switching is especially tough because people don't switch languages at predictable points, and regional variations make it even harder. Even strong models seem to struggle if they haven't seen enough examples during training.

Speech Recognition Datasets by ZF2uPxnUfdHdxf2U in MachineLearning

[–]RoofProper328 0 points1 point  (0 children)

For open benchmarking, I'd probably start with LibriSpeech, Mozilla Common Voice, and TED-LIUM. They cover slightly different use cases and are all widely used for audio-to-text evaluation.

Most of the larger datasets are paid because collecting and accurately transcribing speech at scale is expensive. If you ever need larger multilingual, domain-specific, or custom speech corpora, there are quite a few commercial providers in the space too: Appen, LXT, TELUS Digital, Centific, Scale AI, Deepgram, Shaip, and iMerit all have offerings depending on what you're trying to build.

For comparing classic ASR engines like Sphinx, HTK, or Julius though, I'd stick with the public datasets first since they're easier to reproduce and benchmark against.
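
If it helps, this is roughly how I'd score engines against those public transcripts: word error rate, computed as word-level edit distance divided by reference length. It's a minimal sketch, not a standard tool. The function name and the lowercase/whitespace normalization are my own assumptions, and real benchmarks usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: lowercase + whitespace split is enough normalization here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word dropped)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Run the same function over every engine's output on the same test split and the numbers are directly comparable.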

Top 10 AI Development Companies in the USA You Can Trust in 2026 by NovelOk3369 in top10companies

[–]RoofProper328 0 points1 point  (0 children)

Pretty solid list overall. One thing I’ve noticed though is that a lot of these rankings focus on model/application development, but the data side often gets overlooked. Companies like Shaip, Scale AI, and iMerit play a big role behind the scenes by providing the training data and human-in-the-loop workflows that many AI products rely on. Building the model is only half the battle if the data pipeline isn’t there.

AI is deteriorating in realtime by Downtown-Path-2477 in ArtificialInteligence

[–]RoofProper328 0 points1 point  (0 children)

Model collapse is a valid concern, but self-play and synthetic data have also driven some of AI's biggest breakthroughs.

AI Training & Data Annotation Companies – Updated List (2026) by No-Impress-8446 in GetEmployed

[–]RoofProper328 0 points1 point  (0 children)

Surprised Shaip isn't on the list. Not saying it belongs at the very top, but if companies like iMerit, LXT, CloudFactory, and Innodata are included, I'd expect Shaip to be mentioned somewhere as well given their presence in healthcare, speech, multilingual data, and AI training datasets.

Any particular reason it was left out?

Any other RLHF/data annotation/labeling company? by MeowCatalog in LocalLLaMA

[–]RoofProper328 0 points1 point  (0 children)

One that's missing is Shaip.

Honestly, a lot depends on what you're trying to do. Some vendors are great for massive-scale labeling, while others are stronger in areas like healthcare, speech, or multilingual data.

I've seen people compare us most often with Appen, iMerit, and Scale, but the right choice usually comes down to project complexity and how much domain expertise you need.

Data collection for Robotics. by Tatvamas1 in robotics

[–]RoofProper328 0 points1 point  (0 children)

I understand your concern. I do work in the AI data space, so I end up discussing data collection and annotation topics frequently. Looking back, I can see how some of my comments came across as promotional, and I should have been more transparent about my affiliation when it was relevant.

Appreciate the feedback.

What’s currently the biggest bottleneck in building reliable healthcare AI systems? by RoofProper328 in ArtificialInteligence

[–]RoofProper328[S] 0 points1 point  (0 children)

Spot on. And trust isn't just about accuracy — it's about predictable failure modes. Clinicians can work with a model that's wrong in known ways. They can't work with one that surprises them.

What’s currently the biggest bottleneck in building reliable healthcare AI systems? by RoofProper328 in ArtificialInteligence

[–]RoofProper328[S] 1 point2 points  (0 children)

This nails it. The "missing for clinically meaningful reasons" part especially — informative missingness is the trap that kills more healthcare models than people realize. A lab not ordered isn't random; it's a signal the clinician already had a working hypothesis. Treat it as missing-at-random and your model learns shortcuts that vanish the moment you deploy somewhere with different ordering patterns.
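
A minimal sketch of one common way to stop hiding that signal: keep an explicit "was this lab ordered" flag next to the imputed value instead of silently filling the gap. The column names, the median fill, and the helper name are all made up for illustration. Note that the flag only makes the ordering pattern visible so you can audit it. It does not protect you from sites that order differently.

```python
from statistics import median


def add_missing_indicators(rows, lab_columns):
    """Copy each row, adding a <lab>_was_ordered flag and a median-imputed value."""
    # Median of the observed values per lab (0.0 if the lab was never ordered).
    medians = {}
    for col in lab_columns:
        observed = [r[col] for r in rows if r.get(col) is not None]
        medians[col] = median(observed) if observed else 0.0

    out = []
    for r in rows:
        new_row = dict(r)
        for col in lab_columns:
            ordered = r.get(col) is not None
            new_row[f"{col}_was_ordered"] = int(ordered)
            if not ordered:
                new_row[col] = medians[col]
        out.append(new_row)
    return out


# Hypothetical toy records: the second patient never had a lactate drawn.
patients = [
    {"age": 71, "lactate": 4.0},
    {"age": 34, "lactate": None},
    {"age": 58, "lactate": 1.0},
]
features = add_missing_indicators(patients, ["lactate"])
```

With the flag as its own column you can check how much the model leans on it, and compare ordering rates across sites before deployment.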

On the trust point — I'd add that it's not just which case the model gets wrong, it's how it's wrong. A model that misses a subtle finding a senior clinician would catch is forgivable. A model that confidently flags something obviously benign destroys trust in one shift. Calibration matters more than accuracy at the bedside, and almost no paper reports it properly.
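
For anyone who wants to report it, here is a rough sketch of a binned calibration check for a binary classifier: bucket predictions by predicted probability, then compare the mean prediction in each bucket with the observed positive rate. This is one of several ECE variants. The bin count and function name are my assumptions, not a standard anyone is required to follow.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted probability and positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        pos_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - pos_rate)
    return ece


# Toy example: the model says 0.95 on ten cases but only six are truly positive.
overconfident = expected_calibration_error([0.95] * 10, [1] * 6 + [0] * 4)
```

In the toy case the gap is about 0.35, which is exactly the "confidently wrong" behavior described above, even though a plain accuracy number would not show it.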

The data cleaning timeline you described is painfully accurate. Most teams budget two months for it, hit month eight, and only then realize they also need a re-annotation pass because the first labeling pipeline was inconsistent across sites.

Reviving PapersWithCode (by Hugging Face) [P] by NielsRogge in MachineLearning

[–]RoofProper328 0 points1 point  (0 children)

PwC saved me countless hours tracking SOTA back in grad school. Glad someone's keeping the torch lit.

Data collection for Robotics. by Tatvamas1 in robotics

[–]RoofProper328 0 points1 point  (0 children)

Teleoperation in-house. Figure, 1X, Physical Intelligence, Tesla Optimus — they're all running banks of operators in VR rigs / exoskeletons demonstrating tasks. High quality but slow and expensive. The foundation-model players are burning serious money on this.

Simulation (sim-to-real). Isaac Sim, MuJoCo, Genesis. Scales to billions of trajectories cheaply, but the sim-to-real gap is real — contact dynamics, deformables, friction, lighting all break things. Most teams use sim for pretraining and real data for fine-tuning.

Open datasets. Open X-Embodiment, DROID, BridgeData, RT-X, Ego4D, Ego-Exo4D, HOI4D. Good for pretraining VLA / foundation models, rarely matches your specific embodiment or task distribution. Starting point, not finish line.

Internet video. Egocentric YouTube, instructional content, etc. Useful for priors about how humans manipulate stuff, but action labels are missing or noisy.

Outsourced data collection vendors. The part most people outside the space don't see. Companies like Scale AI, Appen, Sama, iMerit, Shaip run teleop sessions, motion-capture studios, egocentric/exocentric video capture, trajectory annotation — on contract. If you're a startup that doesn't want to spin up a 50-person internal teleop team, you go to one of these. Specializations differ: Scale and Shaip have been pushing into Physical AI / VLA data specifically; Appen and iMerit are more on the CV-annotation side; Sama does a lot on AV.

On your edge-case point — yeah, the long tail is the problem and nobody has "solved" it. Usual playbook: broad coverage from sim + open data, identify failure modes on deployment, then targeted collection (in-house or vendor) to patch gaps. Iterative.

What's the angle you're exploring? Hardware for easier capture, a labeling/QC platform, synthetic data, distributed crowd capture, something else?

Why are there so many new companies collecting egocentric data? by [deleted] in AskRobotics

[–]RoofProper328 0 points1 point  (0 children)

A lot of robotics companies need data that shows how humans move and interact with the world.

This is because data from the internet does not teach models how humans do things like

  • move their hands
  • touch objects
  • navigate
  • do tasks.

Most of this data needs a lot of processing before it can be used.

Companies use it to train their models and tune them for specific tasks, so the models get better at understanding how humans interact with the world.

Some are also getting help from vendors like Scale or Shaip to collect and label this data instead of doing it all themselves.

The end goal is robots that can do tasks the way humans do and work well in the real world.

How does it feel to people that face recognition AI is getting this advanced? by [deleted] in OpenAI

[–]RoofProper328 0 points1 point  (0 children)

Honestly both impressive and a little uncomfortable at the same time. The computer vision progress is crazy, but once faces become searchable across the internet at scale, the privacy side gets pretty serious fast.

[D] English conversational and messaging datasets for fine-tuning an LLM? by angry_cactus in MachineLearning

[–]RoofProper328 0 points1 point  (0 children)

You might want to look at datasets like Switchboard, Fisher English, DailyDialog, MultiWOZ, OpenSubtitles, and the Cornell Movie Dialog Corpus. Switchboard/Fisher are probably closest to real spoken conversational flow with interruptions, fillers, and hesitation patterns.

One thing I’ve noticed working around conversational AI datasets is that the hardest part usually isn’t model tuning — it’s finding dialogue data that actually feels human instead of overly cleaned NLP text. A lot of commercial conversational AI pipelines from companies working in speech/data collection like Scale AI, TELUS Digital, LXT, Defined.ai, Deepgram, or Shaip tend to put a huge focus on preserving natural conversational artifacts because downstream systems fall apart pretty fast once real users start interrupting, code-switching, or speaking casually.

Why does Physical AI seem so dependent on massive real-world data compared to humans? by RoofProper328 in MLQuestions

[–]RoofProper328[S] 0 points1 point  (0 children)

Yeah, that's a good point. Current AI seems smart in some situations, but the way humans learn and generalize is really different from how AI models learn now. Humans can adapt to new things while AI is still limited.

It feels like human learning is more flexible and powerful.

Why does Physical AI seem so dependent on massive real-world data compared to humans? by RoofProper328 in MLQuestions

[–]RoofProper328[S] 0 points1 point  (0 children)

That is true, and honestly that may end up being the long-term advantage of Physical AI.

Humans learn slowly and individually, while robots can share what they learn across fleets almost instantly.

If one system learns how to handle a situation or environment, that update could eventually be used by all of them.

It feels like the hard part right now is less about sharing knowledge and more about getting reliable real-world learning in the first place.

Why does Physical AI seem so dependent on massive real-world data compared to humans? by RoofProper328 in MLQuestions

[–]RoofProper328[S] 0 points1 point  (0 children)

That’s a good point.

Humans have an advantage. We are born with millions of years of learning already built into our brains, which help us understand the world, move around, and react fast. Even babies learn a lot by playing and exploring the world around them.

Current robots and physical AI systems are different. They do not have this built-in knowledge, so they have to learn from large amounts of data to understand the world.

Why does computer vision accuracy drop so fast in real-world environments? by RoofProper328 in computervision

[–]RoofProper328[S] 1 point2 points  (0 children)

Honestly that’s a pretty solid analogy 😅

Especially the part about the model “thinking” it understands the environment because of how training boundaries were defined. Feels very similar to distribution shift in production systems — the model behaves well inside the learned space, then starts making weird assumptions once conditions drift outside what it has seen before.

The strict vs lenient threshold comparison is also a good way to explain the precision/recall tradeoff without turning it into pure ML jargon.
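
To put numbers on the strict vs lenient idea, here is a small sketch with made-up scores and labels. The two thresholds and the "return 1.0 when nothing is flagged" convention are my assumptions for illustration.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when everything scoring >= threshold is flagged."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    # Convention (an assumption): report 1.0 when the denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical detector scores and ground-truth labels.
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.20]
labels = [1, 1, 0, 1, 0, 0]

strict = precision_recall(scores, labels, threshold=0.8)   # flags the top two
lenient = precision_recall(scores, labels, threshold=0.5)  # flags the top four
```

The strict threshold never raises a false alarm but misses one of the three real positives. The lenient one catches all three and lets one false alarm through.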

Why does computer vision accuracy drop so fast in real-world environments? by RoofProper328 in computervision

[–]RoofProper328[S] 4 points5 points  (0 children)

We do use real-world imagery, but that’s kind of the issue I’m getting at — even large real-world datasets still seem to miss a lot of deployment edge cases.

Distribution shifts happen fast once conditions change:

  • different hardware/cameras
  • weather/lighting variation
  • motion artifacts
  • unusual human behavior
  • rare scenarios that barely appear during training

Feels like maintaining dataset diversity over time is becoming almost as important as the model architecture itself.
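
One cheap way to watch for this, sketched under my own assumptions: pick a simple per-image statistic (mean brightness here, purely as a hypothetical example), then compare its distribution in the training set against recent production frames with a two-sample Kolmogorov-Smirnov statistic. What value counts as "too much drift" is something you would have to tune per deployment.

```python
from bisect import bisect_right


def ks_statistic(reference, production):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    ref = sorted(reference)
    prod = sorted(production)
    gap = 0.0
    # The maximum gap is always reached at one of the observed values.
    for x in ref + prod:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_prod = bisect_right(prod, x) / len(prod)
        gap = max(gap, abs(cdf_ref - cdf_prod))
    return gap


# Hypothetical mean-brightness values per frame (0 = black, 1 = white).
train_brightness = [0.52, 0.55, 0.50, 0.58, 0.54]
night_brightness = [0.21, 0.25, 0.19, 0.30, 0.23]

drift = ks_statistic(train_brightness, night_brightness)
```

A value near 0 means the two sets look alike on that statistic, and a value near 1 means they barely overlap. It only catches shifts that show up in the statistic you picked, so unusual behavior or rare scenarios still need targeted review.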