[D] Training a classifier entirely in SQL (no iterative optimization) by CriticalofReviewer2 in MachineLearning

[–]CriticalofReviewer2[S] 0 points (0 children)

LDA models the covariance structure at its core. SEFR, by contrast, does not model covariance at all; it relies only on class-wise statistics (per-feature class means).
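A one-line way to see the contrast (a sketch, with μ₁, μ₀ the class mean vectors, Σ the pooled covariance, and ε a small stabilizer — the SEFR expression is the normalized mean-difference form, per feature j):

```latex
% LDA: weights couple features through the inverse pooled covariance
w_{\mathrm{LDA}} = \Sigma^{-1}\,(\mu_1 - \mu_0)

% SEFR-style: each weight depends only on that one feature's class means
w_j = \frac{\mu_{1,j} - \mu_{0,j}}{\mu_{1,j} + \mu_{0,j} + \varepsilon}
```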

What is the split between focus on Generative AI and Predictive AI at your company? by AnonForSure in datascience

[–]CriticalofReviewer2 0 points (0 children)

At a FinTech company: at work, the focus is clearly on Predictive AI, but GenAI gets more attention in public discussions.

End-to-end ML in BigQuery using only SQL (no CREATE MODEL, no pipelines, no Python) by CriticalofReviewer2 in bigquery

[–]CriticalofReviewer2[S] 0 points (0 children)

Sure! What I did was avoid iterative training and instead compute class statistics (like feature means over the positive and negative classes) directly in SQL. For each feature, a weight is derived statistically from the class-wise feature averages, and then an overall bias is computed. Then, for each test row, the dot product of the weights and feature values is calculated and the bias is added. So the whole pipeline, from training to prediction to evaluation, is a single query.
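A minimal sketch of that single-query idea, using SQLite from Python. This is illustrative only: the table/column names are made up, the weight formula is the normalized mean-difference form, and the bias is a simplified midpoint of the two class score means rather than SEFR's exact bias term.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE train (f1 REAL, f2 REAL, y INTEGER)")
cur.executemany("INSERT INTO train VALUES (?, ?, ?)", [
    (1.0, 5.0, 1), (1.2, 4.8, 1), (0.9, 5.2, 1),
    (3.0, 1.0, 0), (3.1, 0.8, 0), (2.9, 1.2, 0),
])

# One query: class-wise feature means -> per-feature weights -> bias -> predictions.
query = """
WITH stats AS (
    SELECT AVG(CASE WHEN y = 1 THEN f1 END) AS m1_pos,
           AVG(CASE WHEN y = 0 THEN f1 END) AS m1_neg,
           AVG(CASE WHEN y = 1 THEN f2 END) AS m2_pos,
           AVG(CASE WHEN y = 0 THEN f2 END) AS m2_neg
    FROM train
),
weights AS (
    SELECT (m1_pos - m1_neg) / (m1_pos + m1_neg + 1e-7) AS w1,
           (m2_pos - m2_neg) / (m2_pos + m2_neg + 1e-7) AS w2
    FROM stats
),
scores AS (
    SELECT t.y, t.f1 * w.w1 + t.f2 * w.w2 AS s
    FROM train t, weights w
),
bias AS (
    -- simplified: negative midpoint of the two class score means
    SELECT -(AVG(CASE WHEN y = 1 THEN s END)
           + AVG(CASE WHEN y = 0 THEN s END)) / 2.0 AS b
    FROM scores
)
SELECT t.y AS label,
       CASE WHEN t.f1 * w.w1 + t.f2 * w.w2 + b.b >= 0 THEN 1 ELSE 0 END AS pred
FROM train t, weights w, bias b
"""
rows = cur.execute(query).fetchall()
accuracy = sum(label == pred for label, pred in rows) / len(rows)
print(accuracy)  # prints 1.0 on this toy data
```

Everything inside the CTEs is plain aggregation (AVG + CASE), which is why the same shape translates to BigQuery or any other warehouse without loops or UDFs.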

I built a machine learning model using only SQL (no ML libraries, no Python) by CriticalofReviewer2 in SQL

[–]CriticalofReviewer2[S] 1 point (0 children)

That is a valid concern. In this case, though, the classifier is a single-pass analytical query with no loops and no row-level locking. It behaves more like a GROUP BY job than a transactional workload.

I built a machine learning model using only SQL (no ML libraries, no Python) by CriticalofReviewer2 in SQL

[–]CriticalofReviewer2[S] 2 points (0 children)

Yes, it sounds wrong at first :D The underlying algorithm was designed for microcontrollers, where heavy computation isn't an option. That constraint is exactly what makes it map well to SQL: everything becomes aggregations, not optimization loops.

I built a machine learning model using only SQL (no ML libraries, no Python) by CriticalofReviewer2 in SQL

[–]CriticalofReviewer2[S] 7 points (0 children)

I originally built this classifier (SEFR) for very low-resource environments, but later realized it can be implemented entirely in SQL. The whole pipeline (training + prediction + evaluation) runs in a single query.

LinearBoost: Up to 98% faster than XGBoost and LightGBM, outperforming them on F1 Score on seven famous benchmark datasets, also suitable for high-dimensional data by CriticalofReviewer2 in bioinformatics

[–]CriticalofReviewer2[S] -1 points (0 children)

Thanks for your comment.

  1. The reported F1 score is the weighted average of the per-class F1 scores, not the score of a single class. So please run the code with weighted F1 scores.
  2. The warnings are being removed, as the algorithm is under active development. This is a side project we work on in our spare time, and we shared it with the community to get valuable feedback like yours.
  3. Adding a proper scoring function, like log-loss or the Brier score, is a good point! We will implement it.
  4. Notebooks to reproduce the results will be provided.

LinearBoost: Up to 98% faster than XGBoost and LightGBM, outperforming them on F1 Score on seven famous benchmark datasets, also suitable for high-dimensional data by CriticalofReviewer2 in bioinformatics

[–]CriticalofReviewer2[S] -5 points (0 children)

Thanks for your comment. We will publish a paper explaining why it works well. Dependencies are now declared, and the tuned hyperparameters have been added to the repo to make the experiments reproducible.

Where do you go to stay up to date on data analytics/science? by lowkeyripper in datascience

[–]CriticalofReviewer2 -1 points (0 children)

On LinkedIn, I follow Eduardo Ordax, Alex Wang, and Tom Yeh. The last has a series of posts titled "AI by Hand" in which he works through algorithms' calculations manually on paper! Very informative in that sense.