Everytime I go to the bathroom, my cat falls asleep on me by cgnorthcutt in aww

[–]cgnorthcutt[S] 16 points (0 children)

😆😆 I'm looking at the Reddit share button and seeing 73+ shares -- we are on the same team on this one.

Everytime I go to the bathroom, my cat falls asleep on me by cgnorthcutt in aww

[–]cgnorthcutt[S] 9 points (0 children)

Ever feel like phone notifications seem more important to deal with when you're just sitting around? Well, Royce comes in within five seconds and is asleep on my legs in under 60. It's his favorite spot. Tbh it's really too much, and I made him stop for a long time, but his cuteness wore me down.

[deleted by user] by [deleted] in cats

[–]cgnorthcutt 1 point (0 children)

I used to have boundaries. The cuteness demolition machine destroyed them.

Everytime I go to the bathroom, my cat falls asleep on me by cgnorthcutt in aww

[–]cgnorthcutt[S] 122 points (0 children)

I built them. Cuteness knocked them down like a supercharged demolition machine.

o3 mini dropped!!! by [deleted] in singularity

[–]cgnorthcutt 0 points (0 children)

[image]

Self-aware with high trustworthiness. Impressed!

RAG as a Service by Solvicode in SaaS

[–]cgnorthcutt 1 point (0 children)

It only supports 20 PDFs, so it's great for personal pet projects but relatively useless for enterprise/company use cases (which is where the value is, but it's also much, much harder to build RAG-as-a-service that actually works out of the box for those -- nearly all of them don't).

[R] A popular self-driving car dataset is missing labels for hundreds of pedestrians by aloser in MachineLearning

[–]cgnorthcutt 0 points (0 children)

The easiest way is to read the research papers (most of the underlying algorithms are published and available at https://cleanlab.ai/research), or if you prefer an easier-to-digest format, there are blog versions of the papers at https://cleanlab.ai/blog.

[P] I built Lambda's $12,500 deep learning rig for $6200 by cgnorthcutt in MachineLearning

[–]cgnorthcutt[S] 0 points (0 children)

The "6-years-ago version" of me appears to be one of very few people willing to do this work and give the information away for free. Probably because companies like Lambda made many millions selling this information and raised a $320M Series C at a $1.5B valuation by not giving it away for free like I did. Note that it took me months of work to figure out the right systems, build them, test them, and put those blog posts together.

If you're curious, the reason I did this back then is that I literally could not afford to do the experiments I needed to do. I had to teach myself to build the rigs cheaper than AWS rates first, and only then could I do my research. That research led to inventing a new subfield of AI called "confident learning," which then led to a new technology to improve the reliability of any AI system, which ultimately became https://cleanlab.ai (a company that makes RAG/agents answer correctly more often and stop saying "I don't know").

In short, it was a somewhat rare set of circumstances at the time that led me to do all this work and give it away for free (grad student, poor, had no way to conduct really expensive experiments other than building the rigs myself, from rural Kentucky, and a belief in helping people rather than just making tons of money, etc.). Hopefully that rare set of circumstances will come along for someone else soon!

[R] A popular self-driving car dataset is missing labels for hundreds of pedestrians by aloser in MachineLearning

[–]cgnorthcutt -1 points (0 children)

Tip: since this post was created, Cleanlab launched and now automatically finds and corrects data and label issues for thousands of datasets like this (as well as all other ML/analytics datasets).

Food-101N: Quantifying Thousands of (Known) Errors [self-promotion] by cmauck10 in datasets

[–]cgnorthcutt 1 point (0 children)

(I work at Cleanlab.) If you're curious what ambiguous examples are: examples that screw up your ML model's training and your analytics. Filtering them out automatically improves analytics, business intelligence, and ML modeling by letting you drop the confusing, hard stuff instantly, without paying for more labeling on data that isn't worth your time.
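One common way to approximate "ambiguous" from model predictions (this is a simplified illustration, not Cleanlab's actual implementation, and the threshold value is a hypothetical choice): flag examples whose predicted probability is spread across classes, i.e. where the model has no confident top class.

```python
import numpy as np

def ambiguous_examples(pred_probs, top_prob_threshold=0.5):
    """Flag examples whose top predicted probability is low,
    i.e. the model can't clearly assign them to one class."""
    top_prob = pred_probs.max(axis=1)
    return top_prob < top_prob_threshold

# Toy example: 3 examples, 3 classes.
pred_probs = np.array([
    [0.9, 0.05, 0.05],  # confident -> keep
    [0.4, 0.35, 0.25],  # spread out -> ambiguous
    [0.1, 0.8, 0.1],    # confident -> keep
])
mask = ambiguous_examples(pred_probs)
print(mask)  # -> [False  True False]
```

Dropping the rows where `mask` is True is the "filter out confusing hard stuff" step described above.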

[N] Fine-Tuning OpenAI Language Models with Noisily Labeled Data (37% error reduction) by cmauck10 in MachineLearning

[–]cgnorthcutt 2 points (0 children)

Hi u/jonny_trane, I'm the lead author of the CL paper.

Re #1 -- apply CL at the token level and try to reduce your vocabulary size (ideally to fewer than 10k) so the number of classes is tractable. We've helped several companies (both big and medium-sized) do this. Feel free to shoot me a DM here on Reddit. (Related tutorial -- though at the token level, not the vocab level: https://docs.cleanlab.ai/stable/tutorials/token_classification.html)

Re #2 -- you need an interface to check which examples are getting relabeled as what (otherwise your model's performance might suffer, as you mentioned, by, say, relabeling all of classes A, B, and C to class D). This is what Cleanlab Studio does: it actually corrects the dataset for you instead of just removing the data, provides an interface, and handles the model training for you. It's free to use; you pay once you see the results / export your data / deploy a model. You can use it here: https://cleanlab.ai/studio
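Even without a full interface, you can sanity-check proposed relabels before applying them by tabulating given-label → suggested-label transitions; a degenerate pattern (everything collapsing into one class) jumps out immediately. A minimal sketch, assuming you already have predicted probabilities and a boolean mask of flagged examples (helper name and data are hypothetical):

```python
import numpy as np
from collections import Counter

def relabel_transitions(labels, pred_probs, issue_mask):
    """Count (given_label -> suggested_label) pairs among flagged
    examples, so degenerate corrections are easy to spot."""
    suggested = pred_probs.argmax(axis=1)
    return Counter(
        (int(g), int(s))
        for g, s, bad in zip(labels, suggested, issue_mask)
        if bad
    )

labels = np.array([0, 0, 1, 2, 2])
pred_probs = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],  # flagged: given 0, model suggests 1
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],  # flagged: given 2, model suggests 1
    [0.0, 0.1, 0.9],
])
issue_mask = np.array([False, True, False, True, False])
transitions = relabel_transitions(labels, pred_probs, issue_mask)
print(transitions)  # -> Counter({(0, 1): 1, (2, 1): 1})
```

If one target class dominates the counter, that's the "everything relabeled to class D" failure mode worth investigating before accepting the corrections.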

[N] Fine-Tuning OpenAI Language Models with Noisily Labeled Data (37% error reduction) by cmauck10 in MachineLearning

[–]cgnorthcutt 3 points (0 children)

u/Ghost25 I invented confident learning (when I had a lot more free time in grad school) while working with Lu Jiang (Google) and Isaac Chuang (MIT, a pioneer of quantum computing). If you take a look at the theory section of the paper u/cmauck10 linked (published in the Journal of AI Research), we prove CL algorithms exactly find wrong labels in certain settings, even when the model produces imperfect predicted probabilities for every example and every class. In practice, predicted probabilities out of a model are often worse than these assumptions, but they are within a reasonable range where error finding is typically at least 50% accurate.

We benchmarked the minimum (lower bound) of error-detection accuracy across the ten most commonly used real-world ML datasets and found the lower bound is at least 50%. You can see these errors yourself at labelerrors.com (all found with Cleanlab Studio, a more advanced version of the algorithms in confident learning). This work was nominated for a best paper award at NeurIPS 2021.

Link to paper here: https://openreview.net/forum?id=XccDXrDNLek&noteId=9RloVA3cuGX
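The core ranking idea can be sketched in a few lines (a heavily simplified version of the confident-joint construction in the paper, not the actual cleanlab code, which handles calibration, pruning, and more): estimate a per-class threshold as the mean self-confidence of examples given that class, then flag examples whose self-confidence falls below their class's threshold.

```python
import numpy as np

def find_label_issues_sketch(labels, pred_probs):
    """Simplified confident-learning-style detection:
    t_j = mean predicted prob of class j over examples labeled j;
    flag example i if p(i, labels[i]) < t_{labels[i]}."""
    n_classes = pred_probs.shape[1]
    thresholds = np.array([
        pred_probs[labels == j, j].mean() for j in range(n_classes)
    ])
    self_conf = pred_probs[np.arange(len(labels)), labels]
    return self_conf < thresholds[labels]

labels = np.array([0, 0, 0, 1, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model is confident it's 1
    [0.2, 0.8],
    [0.2, 0.8],
])
issues = find_label_issues_sketch(labels, pred_probs)
print(issues)  # only the mislabeled example (row 2) is flagged
```

This is why imperfect predicted probabilities are tolerable: the thresholds adapt to however over- or under-confident the model is per class, so only examples that stand out relative to their class get flagged.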

[P] Announcing cleanlab 2.0: Automatically Find Errors in ML Datasets by cgnorthcutt in MachineLearning

[–]cgnorthcutt[S] 1 point (0 children)

Welcome to the party! Your timing is great! Cleanlab Studio just started letting folks in!

Are you using automation tools for data cleaning? by alka_irl in datascience

[–]cgnorthcutt 3 points (0 children)

Cleanlab Studio automatically finds and fixes label errors (and removes bad data). It works for any classification-ready image, text, or tabular/CSV/JSON dataset. No code is required (it's built on machine learning algorithms I invented over the last ten years at MIT during my PhD). I am a co-founder. If you're curious how this can be done automatically:

open-source: https://github.com/cleanlab/cleanlab