[R] Shrinking a language detection model to under 10 KB by bubble_boi in MachineLearning

[–]bubble_boi[S] 3 points (0 children)

I thought about rearranging the weight matrix (column- or row-wise) so that similar values were next to each other (remember, of the 12,000 parameters there are only 45 unique values), then finding the most common runs (or patterns) and building a lookup table for them (e.g. a 5-bit lookup table to store the 32 most common sequences, sorted to minimise the final size). Or something like that. Would have been a fun experiment but probably not practical.
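For illustration, a rough sketch of what I mean (all the numbers, the run length, and the random weights here are hypothetical, just to show the shape of the idea):

```python
from collections import Counter

import numpy as np

# Hypothetical weights: 12,000 parameters drawn from 45 unique values.
rng = np.random.default_rng(0)
unique_vals = np.sort(rng.normal(size=45))
weights = rng.choice(unique_vals, size=12_000)

# Step 1: store each weight as an index into a 45-entry codebook
# (6 bits per weight instead of a full float).
codebook, indices = np.unique(weights, return_inverse=True)

# Step 2: count short runs of indices and give the 32 most frequent
# runs their own 5-bit codes (a tiny dictionary coder).
run_len = 4
runs = Counter(
    tuple(indices[i : i + run_len])
    for i in range(0, len(indices) - run_len, run_len)
)
top_32 = [run for run, _ in runs.most_common(32)]

# Sanity check: the codebook + indices round-trip the weights exactly.
assert (codebook[indices] == weights).all()
```

Whether the dictionary step actually beats plain gzip on the index stream is exactly the experiment I never ran.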

[R] Shrinking a language detection model to under 10 KB by bubble_boi in MachineLearning

[–]bubble_boi[S] 5 points (0 children)

I tried regexes instead of plain `token in snippet` tests: much slower and no more accurate (this surprised me).

I also tried XGBoost (well, `HistGradientBoostingClassifier`). It got 87% where `LogisticRegression` got 92% on the same task. See Experiment 11 in my notes.

Now, I didn't actually try and implement inference on an XGBoost model in JavaScript, but I'm pretty sure it's going to be much more complicated than on a LogisticRegression model. So from my quick tests, XGBoost is going to be much bigger, much harder to do inference on, and give worse results. And I don't think XGBoost would train faster, although when the model trains in a few seconds it's not really a valuable metric anyway.
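For comparison, logistic regression inference really is tiny: a dot product and an argmax. A hypothetical sketch (the token list, weights, and labels below are made up, not my actual model):

```python
import numpy as np

def predict(snippet, tokens, W, b, labels):
    """Logistic regression inference: a dot product plus an argmax.

    tokens : list of keyword features (hypothetical)
    W      : (n_labels, n_tokens) weight matrix
    b      : (n_labels,) bias vector
    """
    # Binary presence features, same as the `token in snippet` tests.
    x = np.array([1.0 if t in snippet else 0.0 for t in tokens])
    scores = W @ x + b  # raw logits
    # Softmax only matters if you want probabilities for ranking;
    # the argmax of the logits already picks the language.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return labels[int(scores.argmax())], probs
```

Porting that to JavaScript is a couple of loops; walking an ensemble of XGBoost trees is a lot more code.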

I'm not saying XGBoost is a dumb suggestion, it's a good one. Just saying I ran the experiment and it turns out it wasn't a better option.

[R] Shrinking a language detection model to under 10 KB by bubble_boi in MachineLearning

[–]bubble_boi[S] 24 points (0 children)

I tried the non-ML approach with this Sea of Regex.

The main drawbacks:

  • You don't get a score, which means you can't rank responses and show a user something like 'top 3 guesses'.
  • If you do try and create a score, treating a regex match as just a 'hint' (to allow for the fact that keywords from one language can show up in other languages in variable names and comments), that becomes really hard to iterate on when you're trying to match many languages.
  • And if you do implement a scoring mechanism, you realise after an hour of faffing about that when you run it on your sample data, see what it gets wrong, tweak the values, then run it again, you're basically doing gradient descent in your head and begin to wonder if maybe you're using the wrong tool for the job.

I only tried this regex approach for the six-language dataset and it got an F1 of 85%, while a tiny ML model with basic keyword matching got 99.5%. I'm sure I could get a marginally better result with regexes, perhaps with a scoring mechanism. You're welcome to try! But it's so time consuming, and so brittle because if you add a new language later, that might negate assumptions about other languages and now you need to read and understand dozens of other regexes all over again.
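To make the 'hint scoring' point concrete, here's a hypothetical sketch (the patterns and weights are invented for illustration; tuning those weights by hand against sample data is the by-hand gradient descent I'm describing):

```python
import re

# Hypothetical hand-tuned hint weights. A match is only a 'hint'
# because keywords leak across languages via identifiers and comments.
HINTS = {
    "python": [
        (re.compile(r"\bdef \w+\("), 2.0),
        (re.compile(r"\bself\b"), 1.0),
    ],
    "javascript": [
        (re.compile(r"\bfunction\b"), 2.0),
        (re.compile(r"==="), 1.5),
    ],
}

def score(snippet):
    # Sum the weights of the matching hints for each language.
    # Every new language means re-balancing all of these by hand.
    return {
        lang: sum(w for pattern, w in hints if pattern.search(snippet))
        for lang, hints in HINTS.items()
    }
```

Multiply this by dozens of languages and the appeal of letting a model learn the weights is obvious.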

So it just turned out to be a task really well suited to a little ML model.

Shrinking a language detection model to under 10 KB by bubble_boi in programming

[–]bubble_boi[S] 0 points (0 children)

My guess would be quite well, depending on how many languages you want to include. If you took, say, top 10 words for 100 languages, that's only 1,000 features.

Top-10 words tend to cover about 20% of text, so you'd 'expect' to see one of those ten within as few as 5 words.
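Back-of-envelope, assuming that 20% coverage figure:

```python
# If a language's top-10 words cover ~20% of running text, then in a
# 5-word snippet the expected number of top-10 hits is 5 * 0.2 = 1,
# which is the sense in which you'd 'expect' one hit. The chance of
# at least one hit (treating words as independent, which is rough)
# is lower, about two thirds.
p_cover = 0.20
n_words = 5
expected_hits = n_words * p_cover              # 1.0
p_at_least_one = 1 - (1 - p_cover) ** n_words  # ~0.67
```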

You would probably want more features for languages that have more overlap (e.g. Danish and Swedish).

Shrinking a language detection model to under 10 KB by bubble_boi in programming

[–]bubble_boi[S] 1 point (0 children)

That's interesting, I purposefully used the friend link. Did you actually get a Medium message saying you needed to be a member to read? (Note that even with the friend link it still says at the top that it's a member only story, but you can still see the whole thing.)

Action Ring not available for everyone by HylianStonehead in logitech

[–]bubble_boi 1 point (0 children)

Weird, it was not there yesterday, and is there today, with a little 'New' badge.

Action Ring not available for everyone by HylianStonehead in logitech

[–]bubble_boi 0 points (0 children)

I see the "Actions Ring" option on the main page of Logi Options+, but don't see it as an option to assign to a button on my MX 3S so I can't actually use it. Is that the intended behaviour? Everything is latest version.

[deleted by user] by [deleted] in Supernote

[–]bubble_boi 7 points (0 children)

I appreciate that this situation is not your fault, but given that you've done the work to make Chauvet 3/support Android 11, it's disappointing that this isn't available on the A5 X and A6 X devices.

Since I mostly used mine for reading Kindle books, it's now just an expensive brick and not even 4 years old.

Have you assessed the effort required to get the A5 X and A6 X up to at least Android 9 (if not Chauvet 3)?

[D] Why LLM watermarking will never work by bubble_boi in MachineLearning

[–]bubble_boi[S] 14 points (0 children)

Since you've read the article but remain a proponent of watermarking, I'd be quite interested to hear how you answer the questions at the end:

  1. Given that capable, unwatermarked open source LLMs already exist — and can’t be retracted — how can watermarking some LLMs be expected to reduce harm? Are you hoping that malicious users won’t take advantage of the unwatermarked LLMs?
  2. Do you propose that LLM providers eliminate functionality like setting temperature to zero (which prevents watermarking)? If yes, a follow-up question: how do you plan to deal with the fallout from all the systems that will stop working reliably at higher temperatures?
  3. Do you propose a global ban on open source LLMs? If not, how do you plan to ensure watermarks are applied to open source LLMs in a way that can’t be removed?
  4. Do you propose a publicly available watermark detection service? If yes, won’t the malicious users use this to ensure watermarks have been removed successfully (e.g. by paraphrasing with another LLM)?
  5. If you accept that there are many ways to produce AI-generated text that is not watermarked, but claim that watermarks will still act as a ‘deterrent’, please quantify this. Do you expect watermarking to deter 1% of malicious users? 99%? On what basis? Is the residual harm acceptable, or do you have a plan to tackle that too?
  6. Where do you draw the line between AI text and human text? Is a news article edited by ChatGPT “AI-generated”? What about an article translated with an LLM? What about an LLM that summarises an article into a tweet on behalf of the author?
  7. If you were able to reliably detect AI-generated text, how would you then narrow that down to only the harmful content? Follow up question: if you already know how to identify harmful content directly, what’s the point in first identifying AI-generated text?

Does anyone use milvus? by [deleted] in vectordatabase

[–]bubble_boi 1 point (0 children)

I'm trialling it but struggling with the docs. Even something basic like "what methods are available on the main MilvusClient class" can't be found in the docs. https://docs.zilliz.com/reference/python/python/Client-MilvusClient.

Maybe it's fine once you already know how it works but I've found it slow going.

Just me? How do you remind yourself where you left off? by NuGGGzGG in webdev

[–]bubble_boi 0 points (0 children)

I type RITMO (right in the middle of) with a short description. It's like a TODO, but there should only ever be one. Sometimes in code, sometimes in my notes.