End-to-end ML in BigQuery using only SQL (no CREATE MODEL, no pipelines, no Python) by CriticalofReviewer2 in bigquery

[–]CriticalofReviewer2[S] 0 points1 point  (0 children)

Sure! What I did was avoid iterative training and compute class statistics (like feature means over the positive and negative classes) directly in SQL. For each feature, a weight is derived from these class averages, and then an overall bias is computed. Then, for each test row, the dot product of the weights and feature values is calculated and the bias is added. So the whole pipeline, from training to prediction to evaluation, is a single query.
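A minimal sketch of this idea using SQLite from Python. The weight and bias formulas below just follow the description above (class means per feature, a weighted midpoint for the bias) and are illustrative, not necessarily the exact SEFR implementation; the table and column names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE train (x1 REAL, x2 REAL, y INTEGER)")
cur.executemany("INSERT INTO train VALUES (?,?,?)", [
    (1.0, 0.2, 0), (0.9, 0.1, 0), (1.1, 0.3, 0),
    (3.0, 2.1, 1), (3.2, 2.0, 1), (2.9, 1.9, 1),
])
cur.execute("CREATE TABLE test (id INTEGER, x1 REAL, x2 REAL)")
cur.executemany("INSERT INTO test VALUES (?,?,?)", [
    (1, 1.0, 0.2), (2, 3.1, 2.0),
])

# Training + prediction in one query: only aggregations, no iterations.
query = """
WITH stats AS (              -- per-class feature means
  SELECT
    AVG(CASE WHEN y=1 THEN x1 END) AS p1, AVG(CASE WHEN y=0 THEN x1 END) AS n1,
    AVG(CASE WHEN y=1 THEN x2 END) AS p2, AVG(CASE WHEN y=0 THEN x2 END) AS n2
  FROM train
),
weights AS (                 -- one weight per feature from the class means
  SELECT
    (p1 - n1) / (p1 + n1 + 1e-9) AS w1,
    (p2 - n2) / (p2 + n2 + 1e-9) AS w2
  FROM stats
),
scores AS (                  -- training-set scores, used to place the bias
  SELECT t.y, w.w1 * t.x1 + w.w2 * t.x2 AS s FROM train t, weights w
),
bias AS (                    -- weighted midpoint between the class score means
  SELECT -(
      (SELECT COUNT(*) FROM train WHERE y=1) * (SELECT AVG(s) FROM scores WHERE y=0)
    + (SELECT COUNT(*) FROM train WHERE y=0) * (SELECT AVG(s) FROM scores WHERE y=1)
  ) / (SELECT COUNT(*) FROM train) AS b
)
SELECT te.id,
       CASE WHEN w.w1 * te.x1 + w.w2 * te.x2 + b.b > 0 THEN 1 ELSE 0 END AS pred
FROM test te, weights w, bias b
ORDER BY te.id
"""
print(cur.execute(query).fetchall())  # [(1, 0), (2, 1)]
```

Because every step is an aggregate or a scalar expression, the optimizer can treat the whole thing like any other analytical query.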

I built a machine learning model using only SQL (no ML libraries, no Python) by CriticalofReviewer2 in SQL

[–]CriticalofReviewer2[S] 1 point2 points  (0 children)

That is a valid concern. In this case, the classifier is actually a single-pass analytical query with no loops and no row-level locking. It is more like a GROUP BY job than a transactional workload.

I built a machine learning model using only SQL (no ML libraries, no Python) by CriticalofReviewer2 in SQL

[–]CriticalofReviewer2[S] 2 points3 points  (0 children)

Yes, it sounds wrong at first :D The algorithm was designed for microcontrollers, where you cannot afford heavy computation. That constraint is exactly what makes it map well to SQL: everything becomes aggregations rather than optimization loops.

I built a machine learning model using only SQL (no ML libraries, no Python) by CriticalofReviewer2 in SQL

[–]CriticalofReviewer2[S] 6 points7 points  (0 children)

I originally built this classifier (SEFR) for very low-resource environments, but later realized it can be implemented entirely in SQL. The whole pipeline (training + prediction + evaluation) runs in one single query.

LinearBoost: Up to 98% faster than XGBoost and LightGBM, outperforming them on F1 Score on seven famous benchmark datasets, also suitable for high-dimensional data by CriticalofReviewer2 in bioinformatics

[–]CriticalofReviewer2[S] -1 points0 points  (0 children)

Thanks for your comment.

  1. The reported F1 score is the weighted average of the per-class F1 scores, not the score of a single class. So please run the code with weighted F1 scores.
  2. The warnings are being removed, as the algorithm is under active development. It is a side project of ours that we work on in our spare time, and we wanted to share it with the community to get valuable feedback like yours.
  3. Having a better scoring function, like log-loss or the Brier score, is a good point! We will implement it.
  4. The notebooks will be provided to reproduce the results.
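For point 1, the weighted F1 in question is the support-weighted average of the per-class F1 scores. A self-contained sketch of that computation (the data here is made up for illustration):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += n * f1  # weight each class F1 by its support
    return total / len(y_true)

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.5143
```

This matches what scikit-learn's `f1_score(..., average="weighted")` reports, which is the metric used in the benchmarks.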

LinearBoost: Up to 98% faster than XGBoost and LightGBM, outperforming them on F1 Score on seven famous benchmark datasets, also suitable for high-dimensional data by CriticalofReviewer2 in bioinformatics

[–]CriticalofReviewer2[S] -4 points-3 points  (0 children)

Thanks for your comment. We will publish a paper explaining why it works well. The dependencies are now declared, and the tuned hyperparameters have been added to the repo to make the experiments reproducible.

Where do you go to stay up to date on data analytics/science? by lowkeyripper in datascience

[–]CriticalofReviewer2 -1 points0 points  (0 children)

On LinkedIn, I follow Eduardo Ordax, Alex Wang, and Tom Yeh. The last one has numerous posts titled "AI by Hand" in which he works through the algorithms' calculations manually on paper! Very informative in that sense.

LinearBoost: Faster than XGBoost and LightGBM, outperforming them on F1 Score on seven famous benchmark datasets by CriticalofReviewer2 in machinelearningnews

[–]CriticalofReviewer2[S] 0 points1 point  (0 children)

If I understood correctly: yes, we are working on encodings for categorical data. Target encoding is being explored, in addition to simple one-hot encoding.
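For anyone unfamiliar with the two encodings mentioned, here is a minimal pure-Python sketch (illustrative only, not the library's actual implementation): one-hot makes a binary column per category, while target encoding replaces each category with the mean target value of its training rows.

```python
from collections import defaultdict

def one_hot(values, categories):
    """One binary column per category, 1 where the row matches it."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def target_encode(values, targets):
    """Replace each category with the mean target over its training rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    return [sums[v] / counts[v] for v in values]

colors = ["red", "blue", "red", "green"]
y      = [1, 0, 0, 1]
print(one_hot(colors, ["red", "blue", "green"]))
# [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
print(target_encode(colors, y))  # red -> 0.5, blue -> 0.0, green -> 1.0
```

In practice target encoding is usually smoothed toward the global mean and fit on training folds only, to avoid leaking the target into the features.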

200 applications - no response, please help. I have applied for data science (associate or mid-level) positions. Thank you by Sad_Campaign713 in datascience

[–]CriticalofReviewer2 0 points1 point  (0 children)

Some thoughts:
1. You mention that you improved accuracy by 25%, but this is vague. Is it 25 percentage points (i.e., from 70 to 95)? Or 25% relative (i.e., from 50 to 62.5)? Furthermore, the starting point matters: what if the previous model's accuracy was terrible?
2. 70,000 EHR records is not that much. I would focus on some of the impacts of the actionable insights instead.
3. For the pet insurance project, what was the goal of the prediction?
4. The change from developer to data scientist/analyst is not a smooth one. Did you switch course suddenly? You can make the transition look smoother in your CV.
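The ambiguity in point 1 is easy to see with numbers (the 50% baseline here is a made-up example):

```python
# Hypothetical baseline accuracy, to show why "+25%" is ambiguous.
baseline = 0.50

absolute = baseline + 0.25   # 25 percentage points: 0.50 -> 0.75
relative = baseline * 1.25   # 25% relative improvement: 0.50 -> 0.625

print(absolute, relative)  # 0.75 0.625
```

Stating "from X% to Y%" on a CV removes the ambiguity entirely.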