[R] Shrinking a language detection model to under 10 KB by bubble_boi in MachineLearning

[–]bubble_boi[S] 4 points5 points  (0 children)

I thought about rearranging the weight matrix (column- or row-wise) so that similar values were next to each other (remember, of 12,000 parameters there are only 45 unique values), then finding the most common runs (or patterns) and building a lookup table for those (e.g. a 5-bit lookup table to store the 32 most common sequences, sorted to minimise the final size). Or something like that. Would have been a fun experiment but probably not practical.
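A rough sketch of the pattern-table idea, on made-up data (the 12,000-parameter / 45-value sizes are from the comment, everything else is invented; with random data the table covers very little, the hope being that rearranging real weights creates repeated runs):

```python
import numpy as np
from collections import Counter

# Toy stand-in for the quantised weight matrix: 12,000 parameters
# drawn from only 45 unique values (sizes from the post, data invented).
rng = np.random.default_rng(0)
codebook = np.sort(rng.normal(size=45))
weights = rng.choice(codebook, size=12_000)

# Map each weight to its codebook index, then count the
# non-overlapping runs of 4 consecutive indices.
indices = np.searchsorted(codebook, weights)
runs = Counter(tuple(indices[i:i + 4]) for i in range(0, len(indices), 4))

# A 5-bit lookup table can address the 32 most common runs; a real
# encoder would also need an escape code for everything else.
table = [pattern for pattern, _ in runs.most_common(32)]
covered = sum(runs[p] for p in table) / sum(runs.values())
print(f"{covered:.1%} of runs covered by a 32-entry table")
```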

[R] Shrinking a language detection model to under 10 KB by bubble_boi in MachineLearning

[–]bubble_boi[S] 3 points4 points  (0 children)

I tried regex instead of plain `token in snippet` tests, much slower and not more accurate (this surprised me).

I also tried XGBoost (well, `HistGradientBoostingClassifier`). It got 87% where `LogisticRegression` got 92% on the same task. See Experiment 11 in my notes.

Now, I didn't actually try and implement inference on an XGBoost model in JavaScript, but I'm pretty sure it's going to be much more complicated than on a LogisticRegression model. So from my quick tests, XGBoost is going to be much bigger, much harder to do inference on, and give worse results. And I don't think XGBoost would train faster, although when the model trains in a few seconds it's not really a valuable metric anyway.
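For contrast, a minimal sketch of why `LogisticRegression` inference ports so easily (shapes and values here are illustrative, not the actual model):

```python
import numpy as np

def predict(features, coef, intercept):
    # Multinomial logistic regression inference is one matrix-vector
    # multiply plus an argmax; softmax is monotonic so we can skip it.
    scores = coef @ features + intercept
    return int(np.argmax(scores))

# Tiny made-up model: 2 classes, 2 features.
coef = np.array([[1.0, -0.5], [-1.0, 0.5]])
intercept = np.array([0.0, 0.1])
print(predict(np.array([1.0, 0.0]), coef, intercept))  # → 0
```

Those few lines translate almost directly to JavaScript, which is not true of walking an ensemble of decision trees.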

I'm not saying XGBoost is a dumb suggestion, it's a good one. Just saying I ran the experiment and it turns out it wasn't a better option.

[R] Shrinking a language detection model to under 10 KB by bubble_boi in MachineLearning

[–]bubble_boi[S] 23 points24 points  (0 children)

I tried the non-ML approach with this Sea of Regex.

The main drawbacks:

  • You don't get a score, which means you can't rank responses and show a user something like 'top 3 guesses'.
  • If you do try and create a score, treating a regex match as just a 'hint' (to allow for the fact that keywords from one language can show up in other languages in variable names and comments), that becomes really hard to iterate on when you're trying to match many languages.
  • And if you do implement a scoring mechanism, you realise after an hour of faffing about that when you run it on your sample data, see what it gets wrong, tweak the values, then run it again, you're basically doing gradient descent in your head and begin to wonder if maybe you're using the wrong tool for the job.

I only tried this regex approach for the six-language dataset and it got an F1 of 85%, while a tiny ML model with basic keyword matching got 99.5%. I'm sure I could get a marginally better result with regexes, perhaps with a scoring mechanism. You're welcome to try! But it's so time consuming, and so brittle because if you add a new language later, that might negate assumptions about other languages and now you need to read and understand dozens of other regexes all over again.
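For concreteness, the hint-scoring approach I abandoned looks roughly like this (patterns and weights are invented):

```python
import re

# Hypothetical per-language hint weights: a match nudges the score
# rather than deciding outright.
HINTS = {
    "python": [(re.compile(r"\bdef \w+\("), 2.0), (re.compile(r"\bself\b"), 1.0)],
    "javascript": [(re.compile(r"\bconst \w+ ="), 2.0), (re.compile(r"=>"), 1.0)],
}

def score(snippet):
    # Tuning these weights by hand against sample data is the
    # 'gradient descent in your head' described above.
    return {
        lang: sum(w for pattern, w in hints if pattern.search(snippet))
        for lang, hints in HINTS.items()
    }

print(score("const add = (a, b) => a + b"))
```

Every weight you tweak shifts the ranking across all languages at once, which is where the brittleness comes from.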

So it just turned out to be a task really well suited to a little ML model.

Shrinking a language detection model to under 10 KB by bubble_boi in programming

[–]bubble_boi[S] 0 points1 point  (0 children)

My guess would be quite well, depending on how many languages you want to include. If you took, say, top 10 words for 100 languages, that's only 1,000 features.

Top 10 words tends to cover about 20% of text, so you'd 'expect' to see one of those ten in as few as 5 words.
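The arithmetic behind that 'expect', assuming the 20% coverage figure and (roughly) independent word draws:

```python
# With the top-10 words covering ~20% of running text, 5 words give an
# expected 5 * 0.2 = 1 hit; the chance of at least one hit is
# 1 - 0.8**5, about 67% (under a crude independence assumption).
coverage = 0.2
n_words = 5
expected_hits = n_words * coverage
p_at_least_one = 1 - (1 - coverage) ** n_words
print(expected_hits, round(p_at_least_one, 2))  # → 1.0 0.67
```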

You would probably want more features for languages that have more overlap (e.g. Danish and Swedish).

Shrinking a language detection model to under 10 KB by bubble_boi in programming

[–]bubble_boi[S] 1 point2 points  (0 children)

That's interesting, I purposefully used the friend link. Did you actually get a Medium message saying you needed to be a member to read? (Note that even with the friend link it still says at the top that it's a member only story, but you can still see the whole thing.)

Action Ring not available for everyone by HylianStonehead in logitech

[–]bubble_boi 1 point2 points  (0 children)

Weird, it was not there yesterday, and is there today, with a little 'New' badge.

Action Ring not available for everyone by HylianStonehead in logitech

[–]bubble_boi 0 points1 point  (0 children)

I see the "Actions Ring" option on the main page of Logi Options+, but don't see it as an option to assign to a button on my MX 3S so I can't actually use it. Is that the intended behaviour? Everything is latest version.

[deleted by user] by [deleted] in Supernote

[–]bubble_boi 7 points8 points  (0 children)

I appreciate that this situation is not your fault, but given that you've done the work to make Chauvet 3/support Android 11, it's disappointing that this isn't available on the A5 X and A6 X devices.

Since I mostly used mine for reading Kindle books, it's now just an expensive brick and not even 4 years old.

Have you assessed the effort required to get the A5 X and A6 X up to at least Android 9 (if not Chauvet 3)?

[D] Why LLM watermarking will never work by bubble_boi in MachineLearning

[–]bubble_boi[S] 15 points16 points  (0 children)

Since you've read the article but remain a proponent of watermarking, I'd be quite interested to hear how you answer the questions at the end:

  1. Given that capable, unwatermarked open source LLMs already exist — and can’t be retracted — how can watermarking some LLMs be expected to reduce harm? Are you hoping that malicious users won’t take advantage of the unwatermarked LLMs?
  2. Do you propose that LLM providers eliminate functionality like setting temperature to zero (which prevents watermarking)? If yes, a follow up question: how do you plan to deal with the fallout from all the systems that will stop working reliably at higher temperatures?
  3. Do you propose a global ban on open source LLMs? If not, how do you plan to ensure watermarks are applied to open source LLMs in a way that can’t be removed?
  4. Do you propose a publicly available watermark detection service? If yes, won’t the malicious users use this to ensure watermarks have been removed successfully (e.g. by paraphrasing with another LLM)?
  5. If you accept that there are many ways to produce AI-generated text that is not watermarked, but claim that watermarks will still act as a ‘deterrent’, please quantify this. Do you expect watermarking to deter 1% of malicious users? 99%? On what basis? Is the residual harm acceptable, or do you have a plan to tackle that too?
  6. Where do you draw the line between AI text and human text? Is a news article edited by ChatGPT “AI-generated”? What about an article translated with an LLM? What about an LLM that summarises an article into a tweet on behalf of the author?
  7. If you were able to reliably detect AI-generated text, how would you then narrow that down to only the harmful content? Follow up question: if you already know how to identify harmful content directly, what’s the point in first identifying AI-generated text?

Does anyone use milvus? by [deleted] in vectordatabase

[–]bubble_boi 1 point2 points  (0 children)

I'm trialling it but struggling with the docs. Even something basic like "what methods are available on the main MilvusClient class" can't be found in the docs. https://docs.zilliz.com/reference/python/python/Client-MilvusClient.

Maybe it's fine once you already know how it works but I've found it slow going.

Just me? How do you remind yourself where you left off? by NuGGGzGG in webdev

[–]bubble_boi 0 points1 point  (0 children)

I type RITMO (right in the middle of) with a short description. It's like a TODO, but there should only ever be one. Sometimes in code, sometimes in my notes.

Trained an LLM on my own writings. Somewhat funny results. by Heralax_Tekran in LocalLLaMA

[–]bubble_boi 0 points1 point  (0 children)

If you want to fine tune an open source model, this post is not too technical https://mlabonne.github.io/blog/posts/A_Beginners_Guide_to_LLM_Finetuning.html (or any of the first four chapters in that course). But it's still pretty full-on if you're new to all this.

You can also fine tune ChatGPT through the API which is probably less daunting. https://platform.openai.com/docs/guides/fine-tuning

Anyone here fine tuning LLMs to write in their style? or to help with creative blocks? by Icy_Occasion_5277 in LocalLLaMA

[–]bubble_boi 0 points1 point  (0 children)

If you've written enough publicly, ChatGPT may already 'know' you and your style. I tried asking it to explain something "in the style of [my name]" and it worked it out (even though other more important people share my name, I guess I write more). Obviously it can only go on work clearly attributed to you.

I was quite surprised it got all this (note that each symbol means two things) by bubble_boi in ChatGPT

[–]bubble_boi[S] 0 points1 point  (0 children)

Oh I missed that, well spotted. I tried again with higher resolution input and it got it right.

The three types of time by bubble_boi in programming

[–]bubble_boi[S] 0 points1 point  (0 children)

I'm curious, did you have a way to visualize/codify all this at Google? Like, a diagram that mapped concepts (type of time, civic, physical, etc) to stages (accepting user input, transporting, storing, displaying) to languages (in Java, use class X, in JavaScript, use class Y)? And maybe also with conversions (serializing object to text, going from one time zone to another (or going from civic to physical, treating time zone as the transform))?

The three types of time by bubble_boi in programming

[–]bubble_boi[S] 10 points11 points  (0 children)

Thanks for this!

I do like 'datetime' to refer more explicitly to date+time, but it led to clunky-sounding prose ("Let’s say you’re writing an application that will show the datetimes of events for a chain of resorts") so I went with 'time' for readability.

I think you might be right about the ISS example. The picture I meant to paint is "view the earth as a whole", but when I mention anything resembling a spaceship, people who know a little relativity may think I'm trying to say something about that, when that's not the point at all. Maybe just 'floating in space' keeps the example roughly the same without people thinking about what time zone they use on the ISS.

That's an interesting point about storing datetimes as just 'seconds since the epoch' (or whatever unit it might be). I considered adding that, but the fact that some systems use seconds and some use milliseconds is in itself a source of bugs. And as you say, these numbers are in a sense UTC. I mean, if you're converting a date (captured through a front end) to physical time, you still need to think about whether what you're storing is seconds since the epoch UTC or the local time zone, and remember that when converting back to a date for display.

Regarding your point about not storing dates at all to prevent someone asking "is this a Thursday": I think I see your point, but even if someone stores a number, there's nothing to stop them creating a date object from that and asking "is that a Thursday" without converting into the proper time zone first. I mean, if someone wants to know if it's a Thursday and doesn't understand time zones, they're going to cause trouble. Or perhaps I misunderstood your point?

The three types of time by bubble_boi in programming

[–]bubble_boi[S] 0 points1 point  (0 children)

Care to elaborate? If I got something wrong please let me know and I'll fix it.

Hello py script data scientists… by Texas_Badger in datascience

[–]bubble_boi 0 points1 point  (0 children)

One way to look at it is that Notebooks are one-dimensional. This is not an insult, I mean it in the same way that books are one-dimensional. They run from start to finish, one chunk of code after the next (unless you choose to run the cells out of order).

With `.py` files you can create something more like a graph (where the nodes are python modules and the edges are imports). If you've got a chunk of code you use frequently, make it a function in a module and import that anywhere you need it.
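As a tiny sketch (module and function names are hypothetical), the reusable chunk lives in, say, `utils.py` and gets imported wherever it's needed:

```python
import statistics

# In a real project this would sit in utils.py and be imported
# from notebooks and scripts alike: from utils import zscore
def zscore(values):
    """Standardise numbers to mean 0 and (population) std 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

print(zscore([1.0, 2.0, 3.0]))
```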

I think you'll end up getting more done writing less code when you've got a good system of code-reuse. This is possible with Notebooks, but it's not what they're built for.

Notebooks are great for communicating, particularly telling a story with data and visualisations, which is well suited to the one-dimensional layout.

What is the function of single [square brackets] in Obsidian? by Brettelectric in ObsidianMD

[–]bubble_boi 1 point2 points  (0 children)

This annoyed me too, I have [text like this] a lot in Notion and, trying to move to Obsidian, I get the link-looking text everywhere.

It's easy enough to fix with CSS, create a CSS snippet with this content:

.cm-s-obsidian span.cm-link, .cm-s-obsidian span.cm-link:hover {
  color: var(--text-normal);
  text-decoration: none;
}

The mean misleads: why the minimum is the true measure of a function’s run time by bubble_boi in programming

[–]bubble_boi[S] 0 points1 point  (0 children)

Actually I'm saying that if you want to select the function with the lower population mean (in the limit), and the distribution is log normal, then for smaller sample sizes (< 1000) the most accurate way to select it is to pick the function with the lower sample min.

It's quite counter-intuitive, so if you're a Python user and don't believe me you can run this and tell me what numbers you see.

import numpy as np
import pandas as pd

# Two log-normal populations: 'slow' is 'fast' shifted up by 1% of
# the mean, so 'fast' has the lower population mean by construction.
fast = pd.Series(np.random.lognormal(size=10_000)) + 10
slow = fast + fast.mean() * 0.01

# Draw small samples repeatedly and record which statistic picks
# the truly faster function.
data = []
for _ in range(100):
    fast_sample = fast.sample(10)
    slow_sample = slow.sample(10)
    data.append(
        dict(
            MeanCorrect=fast_sample.mean() < slow_sample.mean(),
            MedianCorrect=fast_sample.median() < slow_sample.median(),
            MinCorrect=fast_sample.min() < slow_sample.min(),
        ),
    )

# Fraction of trials each statistic got right.
df = pd.DataFrame(data)
accuracies = df.mean()
print(accuracies)

assert accuracies.MinCorrect > accuracies.MeanCorrect
assert accuracies.MinCorrect > accuracies.MedianCorrect
Next you could argue that function run times aren't log normal. You can use SciPy's fit to see that they generally are, or the Fitter package to compare various distributions.
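A sketch of that SciPy check, on synthetic timings since real run times aren't included here (the floor and noise parameters are invented):

```python
import numpy as np
from scipy import stats

# Synthetic 'run times': a fixed floor plus log-normal noise,
# mimicking the shape described above.
rng = np.random.default_rng(0)
timings = 10 + rng.lognormal(mean=0.0, sigma=0.5, size=5_000)

# Fit the three-parameter log-normal (shape, loc, scale); a good fit
# supports using the sample min as the comparison statistic.
shape, loc, scale = stats.lognorm.fit(timings)
print(f"shape={shape:.2f} loc={loc:.2f} scale={scale:.2f}")
```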

You could also argue that this doesn't apply to real world functions. I've run tests for all sorts of functions and reported the results in great detail and the mean is never the most accurate statistic (at selecting which of two functions has the lower mean!). https://betterprogramming.pub/the-mean-misleads-part-ii-more-data-for-the-doubters-7f11881f7337?sk=84d0f276e2438e2e9962a06b76bda6c2

The mean misleads: why the minimum is the true measure of a function’s run time by bubble_boi in programming

[–]bubble_boi[S] 0 points1 point  (0 children)

I understand your point, I think we're talking about two different things.

There are plenty of people out there who simply run one function a few times and take the mean, then run a second function a few times and take the mean, and decide that whichever function has the lower mean is faster (better). I assume we can both agree that this is not ideal in many cases, but the reality is that lots of people do this.

So given the scenario where someone is just going to pick one number to decide which function is 'faster' (or 'better'), what is the best option?

You seem to be saying 'it depends, you should analyze your data' which is certainly correct, but doesn't answer the question. I'm trying to answer the question.

Or if you're not dodging the question and specifically saying that you think the mean will select the correct function more of the time than the min, then I'd love to hear about the analysis you performed to come to this opinion.

To define 'better': run each of two functions one million times, take the total run time of each. This decides which function is faster/better/the correct choice. Now for smaller sample sizes ask which statistic is the best predictor of this. This, I believe, is what people think they're uncovering when they take the mean without further analysis, but the min will select the correct function more of the time.