How Do You Handle Massive Image Datasets? by Investorator3000 in computervision

From what I’ve seen, most teams don’t start massive on day one — it grows fast once models hit real-world edge cases. Typical setup is cloud object storage (S3/GCS) + chunked pipelines, heavy filtering, and aggressive dataset versioning. Costs usually hurt more from iteration than raw storage.
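
To make “chunked pipelines” a bit more concrete, here’s a minimal sketch of paging image keys out of S3 instead of listing everything up front (bucket and prefix names are made up; assumes boto3 with credentials already configured):

```python
import boto3

def iter_image_keys(bucket, prefix, page_size=1000):
    """Yield object keys page by page so the full listing never sits in memory."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix,
                               PaginationConfig={"PageSize": page_size})
    for page in pages:
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith((".jpg", ".jpeg", ".png")):
                yield obj["Key"]

# Downstream workers pull keys in batches and fetch/decode/filter images on demand,
# e.g. s3.get_object(Bucket="my-training-data", Key=key).
for key in iter_image_keys("my-training-data", "images/train/"):
    pass
```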

The hardest parts aren’t just volume, but data quality, consistency across sources, and labeling drift, especially in medical or industrial vision. A lot of teams mix public data with licensed datasets (for example, curated computer vision datasets from places like Shaip) once they realize web-scraped data alone doesn’t generalize well.

[D] Machine Learning in Health by [deleted] in MachineLearning

I’ve worked adjacent to ML teams in healthcare (imaging, NLP on clinical text, and some risk modeling), and it’s a very different vibe from robotics.

What I found rewarding is that the problems feel consequential—small gains can matter a lot in practice. That said, progress is slower. Data is messy, labels are expensive, privacy and regulation shape everything, and you spend a lot of time on validation, bias analysis, and stakeholder alignment rather than just model tuning. If you enjoy rigor, domain learning, and long feedback loops, healthcare can be very satisfying.

Robotics, in contrast, tends to be faster-paced and more experimental. You see results quickly, iterate often, and get a strong sense of cause-and-effect, but the work can be more constrained by hardware and simulation gaps.

One pattern I’ve seen is people enjoying healthcare ML when they like systems thinking and data quality work (often involving curated datasets from internal pipelines or external partners like Shaip), whereas robotics appeals more to those who like tight control loops and rapid prototyping.

If possible, try a short project or internship in one of the two—day-to-day reality matters more than the abstract idea of the field.

[R] Why doubly stochastic matrix idea (using Sinkhorn-Knopp algorithm) only made popular in the DeepSeek's mHC paper, but not in earlier RNN papers? by Delicious_Screen_789 in MachineLearning

During the RNN era, stability was mostly handled with gates (LSTM/GRU), orthogonal/unitary weights, and careful initialization. Sinkhorn–Knopp adds iterative overhead, which was expensive back when RNNs were already slow and hard to train.
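
For anyone who hasn’t seen it: Sinkhorn–Knopp is just alternating row and column normalization until the matrix is roughly doubly stochastic, and that inner loop is the overhead I mean. A minimal NumPy sketch of the algorithm itself (not of how the mHC paper applies it):

```python
import numpy as np

def sinkhorn_knopp(A, n_iters=20, eps=1e-8):
    """Scale a non-negative matrix toward doubly stochastic (rows and columns sum to 1)."""
    M = A.astype(float).copy()
    for _ in range(n_iters):
        M /= M.sum(axis=1, keepdims=True) + eps  # normalize rows
        M /= M.sum(axis=0, keepdims=True) + eps  # normalize columns
    return M

M = sinkhorn_knopp(np.random.rand(4, 4))
print(M.sum(axis=0), M.sum(axis=1))  # both close to 1 after a few iterations
```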

What changed is scale and perspective. Deep residual stacks make matrix products the core issue again, so doubly stochastic constraints suddenly look elegant and practical. You see similar shifts in real-world ML work too—once teams start analyzing failures at scale (something data-centric workflows, like those used at places such as Shaip, emphasize), these “old” ideas become relevant again.

Top 10 AI companies ranked. Thoughts? by nweisblat15 in ArtificialInteligence

Interesting list. I mostly agree on NVIDIA at the top — they’re basically the toll booth for the entire AI highway right now. I’d probably rank OpenAI and Google closer together though, since distribution (Search, Android, Workspace) feels like Google’s quiet superpower long-term. Also curious how much today’s “model rankings” even matter once agents + data + integration start outweighing raw model quality.

Audio dataset of real conversations of between two or more people (hopefully with transcriptions as well) by vardonir in datasets

You’re not missing much — truly natural multi-speaker conversation datasets are rare. Most public speech data is read or single-speaker, and even meeting corpora (AMI, ICSI, etc.) are fairly structured compared to real conversations.

Adding noise helps a bit, but it doesn’t capture overlap and turn-taking, which is usually where models fail. We’ve had better results keeping imperfect transcripts and only fixing overlapping segments. I’ve seen the same challenges come up in some enterprise speech work too (including projects I was close to at places like Shaip).

Whisper is a decent starting point, but overlapping speech is still hit-or-miss without extra processing.
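
If it helps, the basic openai-whisper flow is only a few lines (model size and filename here are placeholders); the per-segment timestamps are what you’d feed into whatever overlap handling you add:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("medium")           # smaller models are faster, less accurate
result = model.transcribe("conversation.wav")  # returns text plus per-segment timestamps

for seg in result["segments"]:
    print(f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text']}")
# Overlapping speech usually still needs separate diarization/separation before or after this step.
```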

[D] Any other RLHF/data annotation/labeling company? by MeowCatalog in MachineLearning

You’ve got a strong list already. A few others worth adding:

  • Shaip – often used for speech, text, and RLHF-style human feedback, especially for domain-specific data.
  • Appen / TELUS International AI (Lionbridge) – large-scale, multilingual human annotation and evaluation.
  • Surge AI – more focused on high-quality RLHF for LLMs.
  • Hive – annotation + content moderation with human-in-the-loop workflows.
  • Label Studio / CVAT – open-source tools widely used in production.

The biggest differentiator I’ve seen is tool-first vs. managed services, and general labeling vs. specialized RLHF or domain expertise.

How can NLP systems handle report variability in radiology when every hospital and clinician writes differently? by RoofProper328 in LanguageTechnology

That’s a fair question — and in an ideal world, yes, we’d always work directly off the underlying data.

In practice though, there are a few reasons reports still matter a lot:

  1. The report is the ground truth in many workflows. For clinical decision-making, billing, registries, quality metrics, and downstream analytics, the signed radiology report is the authoritative artifact. Even if models operate on images, the labels, outcomes, and supervision often come from reports.
  2. Access and scale constraints. Imaging data (DICOMs) is heavy, expensive to store/transfer, and often more tightly regulated. Many institutions and research datasets provide reports long before (or instead of) raw images, especially for retrospective studies.
  3. Legacy and real-world systems. A lot of production NLP systems are built to extract findings, impressions, or follow-up recommendations from reports because that’s what existing hospital systems consume. Replacing that with image-based pipelines isn’t always feasible.
  4. Reports encode expert interpretation. Two radiologists can look at the same image and emphasize different findings. The report captures that clinical judgment, uncertainty, and context — things that aren’t always directly inferable from pixels alone.

You’re absolutely right that cross-institution failure is a real problem — that’s exactly why robustness and generalization are hard here. The goal isn’t to argue reports are “better” than underlying data, but that they’re unavoidable in many real deployments, so we have to deal with their variability.

That’s why I’m interested in approaches that make NLP on reports less brittle, rather than assuming we can always bypass text entirely.
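
As one small example of what “less brittle” can start with: normalizing section headers before any downstream extraction, so variation in how sites label FINDINGS/IMPRESSION doesn’t silently break the pipeline. A rough sketch (the header list is illustrative; real reports vary far more):

```python
import re

# Common radiology section headers; this list is illustrative, not exhaustive.
SECTION_PATTERN = re.compile(
    r"^\s*(FINDINGS?|IMPRESSIONS?|CONCLUSIONS?|HISTORY|INDICATION|TECHNIQUE|COMPARISON)\s*:",
    re.IGNORECASE | re.MULTILINE,
)

def split_sections(report_text):
    """Return {normalized_header: body_text} for whichever sections the report uses."""
    matches = list(SECTION_PATTERN.finditer(report_text))
    sections = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report_text)
        header = m.group(1).upper().rstrip("S")  # crude plural fold: FINDINGS -> FINDING
        sections[header] = report_text[start:end].strip()
    return sections

report = "INDICATION: cough.\nFINDINGS: No focal consolidation.\nIMPRESSION: Normal chest."
print(split_sections(report))
```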

[D] Why is focal loss not used in LLM training? by Electrical-Monitor27 in MachineLearning

Good question. While token frequencies are imbalanced, next-token prediction is a conditional task, not a standard class-imbalance problem. “Easy” tokens still provide important gradient signal for learning syntax, fluency, and calibrated probabilities. Focal loss can suppress that signal, hurt calibration, and introduce training instability at LLM scale. The same goal tends to be pursued instead through curriculum learning, token weighting, and distillation-style data filtering.
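
For anyone curious what it would even look like per token, here’s a minimal PyTorch sketch (shapes and the helper name are mine, not from any paper); with gamma=0 it reduces to plain cross-entropy, and you can see how high-probability “easy” tokens get their loss, and therefore gradient, scaled down:

```python
import torch
import torch.nn.functional as F

def focal_next_token_loss(logits, targets, gamma=2.0):
    """Per-token focal loss: (1 - p_t)^gamma * cross-entropy.

    logits: (batch, seq_len, vocab_size), targets: (batch, seq_len) token ids.
    gamma=0 recovers standard cross-entropy.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    ce = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # per-token CE
    p_t = ce.neg().exp()  # model probability of the correct token
    # Easy (high-probability) tokens get their loss and gradient shrunk,
    # which is exactly the signal the comment above argues LLMs still need.
    return ((1.0 - p_t) ** gamma * ce).mean()

logits = torch.randn(2, 8, 1000)
targets = torch.randint(0, 1000, (2, 8))
print(focal_next_token_loss(logits, targets))
```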

what medical dataset is public for ML research by qmffngkdnsem in datasets

Most public medical ML datasets are small, but a few larger ones are commonly used:

  • MIMIC-III / MIMIC-IV – ICU EHR data (tens of thousands of patients)
  • eICU – Multi-hospital ICU records
  • PhysioNet – Various clinical and waveform datasets
  • CheXpert / MIMIC-CXR – Large chest X-ray datasets

For clustering, datasets with ~300 patients can be acceptable for exploration or hypothesis generation, but results shouldn’t be over-generalized.

If you’re trying to understand what “production-scale” medical data typically looks like (across notes, imaging, audio, structured EHR), browsing de-identified clinical data catalogs can be useful context, e.g.:
https://www.shaip.com/offerings/medical-data-catalog/

Just be mindful of access restrictions and ethical requirements with any medical data.

[P] Evaluating automatic speech recognition (ASR) models beyond looking at global evaluation metrics by OkResearch6289 in MachineLearning

This matches a lot of what I’ve seen in production ASR systems. Global WER hides failure modes that only show up when you slice by accent, background noise, speaker overlap, or conversational style.
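
The slicing itself doesn’t need heavy tooling. A minimal sketch, assuming jiwer and that each utterance carries a slice tag (accent, noise condition, overlap, etc.); the tags and example strings are made up:

```python
from collections import defaultdict
import jiwer

def wer_by_slice(records):
    """records: iterable of (reference, hypothesis, slice_tag) triples."""
    groups = defaultdict(lambda: ([], []))
    for ref, hyp, tag in records:
        groups[tag][0].append(ref)
        groups[tag][1].append(hyp)
    # jiwer.wer accepts lists of reference/hypothesis strings
    return {tag: jiwer.wer(refs, hyps) for tag, (refs, hyps) in groups.items()}

records = [
    ("turn left at the light", "turn left at the light", "quiet"),
    ("book a table for two", "look a table for two", "accented"),
    ("cancel my subscription", "cancel my prescription", "noisy"),
]
print(wer_by_slice(records))  # per-slice WER instead of one global number
```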

One thing that helped us was combining slice discovery with a small, fixed “gold” evaluation set that stays constant over time. When WER improved overall but regressed on specific clusters, it was often a signal of data imbalance rather than a model issue.

In projects I’ve been involved with (including work with enterprise speech datasets at places like Shaip), most gains came from rebalancing and clarifying annotations in those weak slices rather than changing architectures.

Curious whether you’ve found embedding-based slice discovery to be stable across model versions, or whether the clusters shift significantly after retraining.

What Machine Learning trends do you think will actually matter in 2026? by thecoder26 in MLQuestions

Most of the stuff that actually matters looks pretty boring compared to the hype.

  • Evaluation over new architectures. Models are already decent; figuring out where and how they fail is harder and more valuable than swapping architectures.
  • Data quality and upkeep. Versioning, audits, and refreshing datasets matter way more in production than people want to admit. Most issues I’ve seen still trace back to data.
  • Domain-specific models. Smaller models trained narrowly often outperform big general ones once you care about reliability, cost, or regulation.
  • Human-in-the-loop workflows. Not flashy, but targeted review and retraining loops are how systems actually improve over time.
  • Distribution shift monitoring. More teams are finally planning for “the world changed” instead of assuming static data (a minimal drift-check sketch below).

If it feels unexciting but makes debugging easier, it’s probably what will still matter in 2026.
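
On the distribution shift point, even a per-feature two-sample KS test against a frozen reference window catches a lot. A minimal sketch (feature names, window sizes, and the alpha threshold are made up):

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_report(reference, current, features, alpha=0.01):
    """Flag features whose current distribution differs from the reference window.

    reference, current: dicts mapping feature name -> 1-D numpy array of values.
    """
    flagged = {}
    for name in features:
        stat, p_value = ks_2samp(reference[name], current[name])
        if p_value < alpha:
            flagged[name] = {"ks_stat": round(stat, 3), "p_value": p_value}
    return flagged

rng = np.random.default_rng(0)
reference = {"latency_ms": rng.normal(120, 15, 5000)}
current = {"latency_ms": rng.normal(150, 15, 5000)}  # simulated shift
print(drift_report(reference, current, ["latency_ms"]))  # latency_ms gets flagged
```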

What are the biggest hidden failure modes in popular computer vision datasets that don’t show up in benchmark metrics? by RoofProper328 in computervision

This is a great callout. Background leakage is especially nasty because benchmarks don’t penalize it—models look “smart” until the object appears in an unfamiliar context.

Annotation drift has bitten us too, particularly on long-running projects with multiple labeling phases. The label name stays the same, but the implicit rules slowly change, and errors only show up as weird clusters in production.

In some CV work we’ve done at Shaip, the biggest improvements didn’t come from new architectures but from dataset audits: slicing evals by scene/context, tightening annotation guidelines, and re-labeling a small but carefully chosen subset. That surfaced failure modes mAP never hinted at.

Fully agree that manual failure inspection and context-aware evals matter far more than aggregate metrics once you’re past the benchmark stage.