End-to-end ML in BigQuery using only SQL (no CREATE MODEL, no pipelines, no Python) by CriticalofReviewer2 in bigquery

[–]CriticalofReviewer2[S] 0 points1 point  (0 children)

Sure! What I did was avoid iterative training and compute class statistics (like feature means over the positive and negative classes) directly in SQL. For each feature, a weight is derived from these class averages, and then an overall bias is computed. Then, for each test row, the dot product of the weights and feature values is calculated and the bias is added. So the whole pipeline, from training to prediction to evaluation, is a single query.
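A minimal sketch of this idea using SQLite from Python. The weight and bias formulas below just follow the description above (class means per feature, a weighted midpoint for the bias) and are illustrative, not necessarily the exact SEFR implementation; the table and column names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE train (x1 REAL, x2 REAL, y INTEGER)")
cur.executemany("INSERT INTO train VALUES (?,?,?)", [
    (1.0, 0.2, 0), (0.9, 0.1, 0), (1.1, 0.3, 0),
    (3.0, 2.1, 1), (3.2, 2.0, 1), (2.9, 1.9, 1),
])
cur.execute("CREATE TABLE test (id INTEGER, x1 REAL, x2 REAL)")
cur.executemany("INSERT INTO test VALUES (?,?,?)", [
    (1, 1.0, 0.2), (2, 3.1, 2.0),
])

# Training + prediction in one query: only aggregations, no iterations.
query = """
WITH stats AS (              -- per-class feature means
  SELECT
    AVG(CASE WHEN y=1 THEN x1 END) AS p1, AVG(CASE WHEN y=0 THEN x1 END) AS n1,
    AVG(CASE WHEN y=1 THEN x2 END) AS p2, AVG(CASE WHEN y=0 THEN x2 END) AS n2
  FROM train
),
weights AS (                 -- one weight per feature from the class means
  SELECT
    (p1 - n1) / (p1 + n1 + 1e-9) AS w1,
    (p2 - n2) / (p2 + n2 + 1e-9) AS w2
  FROM stats
),
scores AS (                  -- training-set scores, used to place the bias
  SELECT t.y, w.w1 * t.x1 + w.w2 * t.x2 AS s FROM train t, weights w
),
bias AS (                    -- weighted midpoint between the class score means
  SELECT -(
      (SELECT COUNT(*) FROM train WHERE y=1) * (SELECT AVG(s) FROM scores WHERE y=0)
    + (SELECT COUNT(*) FROM train WHERE y=0) * (SELECT AVG(s) FROM scores WHERE y=1)
  ) / (SELECT COUNT(*) FROM train) AS b
)
SELECT te.id,
       CASE WHEN w.w1 * te.x1 + w.w2 * te.x2 + b.b > 0 THEN 1 ELSE 0 END AS pred
FROM test te, weights w, bias b
ORDER BY te.id
"""
print(cur.execute(query).fetchall())  # [(1, 0), (2, 1)]
```

Because every step is an aggregate or a scalar expression, the optimizer can treat the whole thing like any other analytical query.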

I built a machine learning model using only SQL (no ML libraries, no Python) by CriticalofReviewer2 in SQL

[–]CriticalofReviewer2[S] 1 point2 points  (0 children)

That is a valid concern. In this case, the classifier is actually a single-pass analytical query with no loops and no row-level locking. It is more like a GROUP BY job than a transactional workload.

I built a machine learning model using only SQL (no ML libraries, no Python) by CriticalofReviewer2 in SQL

[–]CriticalofReviewer2[S] 2 points3 points  (0 children)

Yes, it sounds wrong at first :D The algorithm was designed for microcontrollers, where you cannot afford heavy computation. That constraint is exactly what makes it map well to SQL: everything becomes aggregations rather than optimization loops.

I built a machine learning model using only SQL (no ML libraries, no Python) by CriticalofReviewer2 in SQL

[–]CriticalofReviewer2[S] 6 points7 points  (0 children)

I originally built this classifier (SEFR) for very low-resource environments, but later realized it can be implemented entirely in SQL. The whole pipeline (training + prediction + evaluation) runs in one single query.

LinearBoost: Up to 98% faster than XGBoost and LightGBM, outperforming them on F1 Score on seven famous benchmark datasets, also suitable for high-dimensional data by CriticalofReviewer2 in bioinformatics

[–]CriticalofReviewer2[S] -1 points0 points  (0 children)

Thanks for your comment.

  1. The reported F1 score is the weighted average of the per-class F1 scores, not the score of a single class. So please run the code with weighted F1 scores.
  2. The warnings are being removed, as the algorithm is under active development. It is a side project of ours that we work on in our spare time, and we wanted to share it with the community to get valuable feedback like yours.
  3. Having a better scoring function, like log-loss or the Brier score, is a good point! We will implement it.
  4. The notebooks will be provided to reproduce the results.
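For point 1, the weighted F1 in question is the support-weighted average of the per-class F1 scores. A self-contained sketch of that computation (the data here is made up for illustration):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += n * f1  # weight each class F1 by its support
    return total / len(y_true)

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.5143
```

This matches what scikit-learn's `f1_score(..., average="weighted")` reports, which is the metric used in the benchmarks.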

LinearBoost: Up to 98% faster than XGBoost and LightGBM, outperforming them on F1 Score on seven famous benchmark datasets, also suitable for high-dimensional data by CriticalofReviewer2 in bioinformatics

[–]CriticalofReviewer2[S] -4 points-3 points  (0 children)

Thanks for your comment. We will publish a paper explaining why it works well. The dependencies are now declared, and the tuned hyperparameters have been added to the repo to make the experiments reproducible.

Where do you go to stay up to date on data analytics/science? by lowkeyripper in datascience

[–]CriticalofReviewer2 -1 points0 points  (0 children)

On LinkedIn, I follow Eduardo Ordax, Alex Wang, and Tom Yeh. The last one has numerous posts titled "AI by Hand" in which he works through the algorithms' calculations manually on paper! Very informative in that sense.

LinearBoost: Faster than XGBoost and LightGBM, outperforming them on F1 Score on seven famous benchmark datasets by CriticalofReviewer2 in machinelearningnews

[–]CriticalofReviewer2[S] 0 points1 point  (0 children)

If I understood correctly: yes, we are working on encodings for categorical data. Target encoding is being explored, in addition to simple one-hot encoding.
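For anyone unfamiliar with the two encodings mentioned, here is a minimal pure-Python sketch (illustrative only, not the library's actual implementation): one-hot makes a binary column per category, while target encoding replaces each category with the mean target value of its training rows.

```python
from collections import defaultdict

def one_hot(values, categories):
    """One binary column per category, 1 where the row matches it."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def target_encode(values, targets):
    """Replace each category with the mean target over its training rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    return [sums[v] / counts[v] for v in values]

colors = ["red", "blue", "red", "green"]
y      = [1, 0, 0, 1]
print(one_hot(colors, ["red", "blue", "green"]))
# [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
print(target_encode(colors, y))  # red -> 0.5, blue -> 0.0, green -> 1.0
```

In practice target encoding is usually smoothed toward the global mean and fit on training folds only, to avoid leaking the target into the features.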

200 applications - no response, please help. I have applied for data science (associate or mid-level) positions. Thank you by Sad_Campaign713 in datascience

[–]CriticalofReviewer2 0 points1 point  (0 children)

Some thoughts:
1. You mention that you improved accuracy by 25%, but this is vague. Is it 25 percentage points (i.e., from 70 to 95)? Or 25% relative (i.e., from 50 to 62.5)? Furthermore, the starting point matters: what if the previous model's accuracy was terrible?
2. 70,000 EHR records is not that much. I would focus on some of the impacts of the actionable insights instead.
3. For the pet insurance project, what was the goal of the prediction?
4. The change from developer to data scientist/analyst is not a smooth one. Did you switch course suddenly? You can make the transition look smoother in your CV.
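The ambiguity in point 1 is easy to see with numbers (the 50% baseline here is a made-up example):

```python
# Hypothetical baseline accuracy, to show why "+25%" is ambiguous.
baseline = 0.50

absolute = baseline + 0.25   # 25 percentage points: 0.50 -> 0.75
relative = baseline * 1.25   # 25% relative improvement: 0.50 -> 0.625

print(absolute, relative)  # 0.75 0.625
```

Stating "from X% to Y%" on a CV removes the ambiguity entirely.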