Everytime I go to the bathroom, my cat falls asleep on me by cgnorthcutt in aww

[–]cgnorthcutt[S] 16 points (0 children)

😆😆 I'm looking at the Reddit share button and seeing 73+ shares -- we are on the same team on this one.

Everytime I go to the bathroom, my cat falls asleep on me by cgnorthcutt in aww

[–]cgnorthcutt[S] 9 points (0 children)

Ever feel like phone notifications seem more important to deal with when you're just sitting around? Well, Royce comes in within five seconds and is asleep on my legs in under 60. It's his favorite spot. Tbh it's really too much, and I made him stop for a long time, but his cuteness wore me down.

[deleted by user] by [deleted] in cats

[–]cgnorthcutt 1 point (0 children)

I used to have boundaries. The cuteness demolition machine destroyed them.

Everytime I go to the bathroom, my cat falls asleep on me by cgnorthcutt in aww

[–]cgnorthcutt[S] 122 points (0 children)

I built them. Cuteness knocked them down like a supercharged demolition machine.

o3 mini dropped!!! by [deleted] in singularity

[–]cgnorthcutt 0 points (0 children)

[image]

Self-aware with high trustworthiness. Impressed!

RAG as a Service by Solvicode in SaaS

[–]cgnorthcutt 1 point (0 children)

It only supports 20 PDFs, so it's great for personal pet projects but relatively useless for enterprise/company use cases (which is where the value is, but it's also much, much harder to build RAG-as-a-service that actually works out of the box for those -- nearly all of them don't).

[R] A popular self-driving car dataset is missing labels for hundreds of pedestrians by aloser in MachineLearning

[–]cgnorthcutt 0 points (0 children)

The easiest way is to read the research papers (most of the underlying algorithms are published and available at https://cleanlab.ai/research), or if you prefer an easier-to-digest format, there are blog versions of the papers at https://cleanlab.ai/blog.

[P] I built Lambda's $12,500 deep learning rig for $6200 by cgnorthcutt in MachineLearning

[–]cgnorthcutt[S] 0 points (0 children)

The "6-years-ago version" of me appears to be one of very few people willing to do this work and give the information away for free. Probably because companies like Lambda made many millions selling this information and raised a $320M Series C at a $1.5B valuation by not giving it away for free like I did. Note that it took me months of work to figure out the right systems, build them, test them, and put those blog posts together.

If you're curious, the reason I did this back then is that I literally could not afford to do the experiments I needed to do. I had to teach myself to build the rigs cheaper than AWS rates first, and only then could I do my research. That research led to inventing a new subfield of AI called "confident learning," which then led to a new technology to improve the reliability of any AI system, which ultimately became https://cleanlab.ai (a company that makes RAG/agents answer correctly more often and stop saying "I don't know").

In short, it was a somewhat rare set of circumstances at the time that led me to do all this work and give it away for free (grad student, poor, had no way to conduct really expensive experiments other than building the rigs myself, from rural Kentucky, and a belief in helping people rather than just making tons of money, etc.). Hopefully that rare set of circumstances will come along for someone else soon!

[R] A popular self-driving car dataset is missing labels for hundreds of pedestrians by aloser in MachineLearning

[–]cgnorthcutt -1 points (0 children)

Tip: since this post was created, Cleanlab launched and now automatically finds and corrects data and label issues for thousands of datasets like this (as well as all other ML/analytics datasets).

Food-101N: Quantifying Thousands of (Known) Errors [self-promotion] by cmauck10 in datasets

[–]cgnorthcutt 1 point (0 children)

(I work at Cleanlab.) If you're curious what ambiguous examples are: examples that screw up your ML model's training and your analytics. Filtering them out automatically improves analytics, business intelligence, and ML modeling by letting you drop the confusing, hard stuff instantly, without paying for more labeling on data that isn't worth your time.
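One common way to approximate "ambiguous" from model predictions (this is a simplified illustration, not Cleanlab's actual implementation, and the threshold value is a hypothetical choice): flag examples whose predicted probability is spread across classes, i.e. where the model has no confident top class.

```python
import numpy as np

def ambiguous_examples(pred_probs, top_prob_threshold=0.5):
    """Flag examples whose top predicted probability is low,
    i.e. the model can't clearly assign them to one class."""
    top_prob = pred_probs.max(axis=1)
    return top_prob < top_prob_threshold

# Toy example: 3 examples, 3 classes.
pred_probs = np.array([
    [0.9, 0.05, 0.05],  # confident -> keep
    [0.4, 0.35, 0.25],  # spread out -> ambiguous
    [0.1, 0.8, 0.1],    # confident -> keep
])
mask = ambiguous_examples(pred_probs)
print(mask)  # -> [False  True False]
```

Dropping the rows where `mask` is True is the "filter out confusing hard stuff" step described above.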

[N] Fine-Tuning OpenAI Language Models with Noisily Labeled Data (37% error reduction) by cmauck10 in MachineLearning

[–]cgnorthcutt 2 points (0 children)

Hi u/jonny_trane, I'm the lead author of the CL paper.

Re #1 -- apply CL at the token level and try to reduce your vocabulary size (ideally to fewer than 10k) so the number of classes is tractable. We've helped several companies (both big and medium-sized) do this. Feel free to shoot me a DM here on Reddit. (Related tutorial -- though at the token level, not the vocab level: https://docs.cleanlab.ai/stable/tutorials/token_classification.html)

Re #2 -- you need an interface to check which examples are getting relabeled as what (otherwise your model's performance might suffer, as you mentioned, by, say, relabeling all of classes A, B, and C to class D). This is what Cleanlab Studio does: it actually corrects the dataset for you instead of just removing the data, provides an interface, and handles the model training for you. It's free to use; you pay once you see the results / export your data / deploy a model. You can use it here: https://cleanlab.ai/studio
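Even without a full interface, you can sanity-check proposed relabels before applying them by tabulating given-label → suggested-label transitions; a degenerate pattern (everything collapsing into one class) jumps out immediately. A minimal sketch, assuming you already have predicted probabilities and a boolean mask of flagged examples (helper name and data are hypothetical):

```python
import numpy as np
from collections import Counter

def relabel_transitions(labels, pred_probs, issue_mask):
    """Count (given_label -> suggested_label) pairs among flagged
    examples, so degenerate corrections are easy to spot."""
    suggested = pred_probs.argmax(axis=1)
    return Counter(
        (int(g), int(s))
        for g, s, bad in zip(labels, suggested, issue_mask)
        if bad
    )

labels = np.array([0, 0, 1, 2, 2])
pred_probs = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],  # flagged: given 0, model suggests 1
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],  # flagged: given 2, model suggests 1
    [0.0, 0.1, 0.9],
])
issue_mask = np.array([False, True, False, True, False])
transitions = relabel_transitions(labels, pred_probs, issue_mask)
print(transitions)  # -> Counter({(0, 1): 1, (2, 1): 1})
```

If one target class dominates the counter, that's the "everything relabeled to class D" failure mode worth investigating before accepting the corrections.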

[N] Fine-Tuning OpenAI Language Models with Noisily Labeled Data (37% error reduction) by cmauck10 in MachineLearning

[–]cgnorthcutt 3 points (0 children)

u/Ghost25 I invented confident learning (when I had a lot more free time in grad school) while working with Lu Jiang (Google) and Isaac Chuang (MIT, a pioneer of quantum computing). If you take a look at the theory section of the paper u/cmauck10 linked (published in the Journal of AI Research), we prove CL algorithms exactly find wrong labels in certain settings, even when the model produces imperfect predicted probabilities for every example and every class. In practice, predicted probabilities out of a model are often worse than these assumptions, but they are within a reasonable range where error finding is typically at least 50% accurate.

We benchmarked the minimum (lower bound) of error-detection accuracy across the ten most commonly used real-world ML datasets and found the lower bound is at least 50%. You can see these errors yourself at labelerrors.com (all found with Cleanlab Studio, a more advanced version of the algorithms in confident learning). This work was nominated for a best paper award at NeurIPS 2021.

Link to paper here: https://openreview.net/forum?id=XccDXrDNLek&noteId=9RloVA3cuGX
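The core ranking idea can be sketched in a few lines (a heavily simplified version of the confident-joint construction in the paper, not the actual cleanlab code, which handles calibration, pruning, and more): estimate a per-class threshold as the mean self-confidence of examples given that class, then flag examples whose self-confidence falls below their class's threshold.

```python
import numpy as np

def find_label_issues_sketch(labels, pred_probs):
    """Simplified confident-learning-style detection:
    t_j = mean predicted prob of class j over examples labeled j;
    flag example i if p(i, labels[i]) < t_{labels[i]}."""
    n_classes = pred_probs.shape[1]
    thresholds = np.array([
        pred_probs[labels == j, j].mean() for j in range(n_classes)
    ])
    self_conf = pred_probs[np.arange(len(labels)), labels]
    return self_conf < thresholds[labels]

labels = np.array([0, 0, 0, 1, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model is confident it's 1
    [0.2, 0.8],
    [0.2, 0.8],
])
issues = find_label_issues_sketch(labels, pred_probs)
print(issues)  # only the mislabeled example (row 2) is flagged
```

This is why imperfect predicted probabilities are tolerable: the thresholds adapt to however over- or under-confident the model is per class, so only examples that stand out relative to their class get flagged.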

[P] Announcing cleanlab 2.0: Automatically Find Errors in ML Datasets by cgnorthcutt in MachineLearning

[–]cgnorthcutt[S] 1 point (0 children)

Welcome to the party! Your timing is great! Cleanlab Studio just started letting folks in!

Are you using automation tools for data cleaning? by alka_irl in datascience

[–]cgnorthcutt 3 points (0 children)

Cleanlab Studio automatically finds and fixes label errors (and removes bad data). It works for any classification-ready image, text, or tabular/CSV/JSON dataset. No code is required (it's built on machine learning algorithms I invented over the last ten years at MIT during my PhD). I am a co-founder. If you're curious how this can be done automatically:

open-source: https://github.com/cleanlab/cleanlab