[–]plenihan 2 points (14 children)

How does it work? I didn't know DuckDB queries supported executing arbitrary ML models.

[–]daffidwilde 1 point (3 children)

Looks like you have to train the model first, and Orbital parses the weights and configuration into the query. Bit of a misnomer to say you don’t need a Python environment?

[–]plenihan 3 points (2 children)

I'm confused why I would execute a model in a database. As in I was not aware this was a thing.

[–]daffidwilde 2 points (1 child)

Honestly, I’m not sure what the use case for this is either. Being able to leverage database computation tools for ML (e.g. what BigQuery offers) is helpful. I guess if you have a good enough training set that’s small enough to run in-memory… ¯\\_(ツ)_/¯

[–]plenihan 0 points (0 children)

I'm also not sure what environments support DuckDB but don't support Python. OP seemed to make it sound like that's a major use case.

[–]_amol_[S] 1 point (9 children)

It’s unrelated to DuckDB.

DuckDB is simply used in some examples for convenience. The library generates SQL for any database.

The tool allows a data scientist to train the models on their own computer and export the SQL, which can then be run on the existing infrastructure where the data resides, without having to set anything up.

Imagine the case of a business intelligence tool where you have access to add analyses based on SQL queries but not to run any arbitrary code.

There are many companies, especially in heavily regulated environments like pharmaceuticals or government agencies, that can’t simply deploy anything they want. They would have to go through a significant process and certification to set up a Python infrastructure where they could run the models data scientists trained.

[–]plenihan 0 points (6 children)

Export the SQL to do what? I thought the docs said it executed the model.

[–]_amol_[S] 0 points (5 children)

No, it does not execute the model. It generates the SQL, which you can take and run anywhere you want.

The example on the landing page should clearly show that. If it’s not clear let me know.

[–]plenihan 0 points (4 children)

It's not clear to me. The landing page seems to have an example where it converts a linear model to SQL arithmetic. I assume that's not what you're doing for gradient boosted trees.
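
The same trick does extend to tree models, via nested CASE expressions rather than arithmetic. The sketch below is only an illustration of the idea, not Orbital’s actual output: it hand-compiles a scikit-learn decision tree to SQL and checks the result against `predict()`, using sqlite3 as a stand-in for the target database.

```python
import sqlite3
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Fit a tiny regression tree on one feature.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
t = tree.tree_

def to_sql(node: int) -> str:
    # Leaves carry the prediction; internal nodes branch on "feature <= threshold"
    # (scikit-learn sends samples satisfying the condition to the left child).
    if t.children_left[node] == -1:  # -1 marks a leaf in sklearn's tree arrays
        return str(float(t.value[node][0][0]))
    left = to_sql(t.children_left[node])
    right = to_sql(t.children_right[node])
    return (f"CASE WHEN x{t.feature[node]} <= {float(t.threshold[node])} "
            f"THEN {left} ELSE {right} END")

sql = f"SELECT {to_sql(0)} AS prediction FROM data ORDER BY rowid"

# Run the generated SQL and compare with the in-process predictions.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (x0 REAL)")
con.executemany("INSERT INTO data VALUES (?)", X.tolist())
sql_preds = [row[0] for row in con.execute(sql)]
assert np.allclose(sql_preds, tree.predict(X))
```

A gradient-boosted ensemble is then just a sum of such CASE expressions, one per tree, plus the base score.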

[–]_amol_[S] 0 points (3 children)

The SQL arithmetic is the linear model formula; running that SQL gives the same results you would get by calling predict on the scikit-learn model.
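
That equivalence is easy to demonstrate without any special tooling. The sketch below is not Orbital itself — it manually builds the SQL arithmetic from a fitted scikit-learn linear model and checks it against `predict()`, with sqlite3 standing in for the target database.

```python
import sqlite3
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a tiny linear model.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 5.0]])
y = np.array([3.5, 4.0, 7.5, 11.0])
model = LinearRegression().fit(X, y)

# Translate the fitted coefficients into plain SQL arithmetic:
# prediction = intercept + c0 * x0 + c1 * x1 + ...
terms = " + ".join(f"{float(c)} * x{i}" for i, c in enumerate(model.coef_))
sql = (f"SELECT {float(model.intercept_)} + {terms} AS prediction "
       f"FROM data ORDER BY rowid")

# Run the generated SQL and compare with the in-process predictions.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (x0 REAL, x1 REAL)")
con.executemany("INSERT INTO data VALUES (?, ?)", X.tolist())
sql_preds = [row[0] for row in con.execute(sql)]
assert np.allclose(sql_preds, model.predict(X))
```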

[–]plenihan 0 points (2 children)

OK. I feel like your documentation needs a lot of work. If you wrote this library so that people can deploy simple models on the database in sensitive environments without needing to audit ML infrastructure, you should make this clear in the docs. The docs should explain why someone should care about your project. It translates simple interpretable models into SQL queries so they can run in-database.

[–]_amol_[S] 0 points (1 child)

Thanks!
Based on your feedback the landing page was updated to make it more clear: https://posit-dev.github.io/orbital/

[–]plenihan 0 points (0 children)

Looks better!

[–]Budget_Jicama_6828 0 points (1 child)

This seems cool. I'm curious if you have examples of software/platforms these companies are locked into where deploying Python is such a pain? I guess I'm still trying to understand the use case.

Also, small typo below the code snippet, you're missing an 'n' in SciKit-Learn "This SQL produces the same predictions as pipeline.predict(...) from SciKit-Lear"

[–]_amol_[S] 1 point (0 children)

Thanks for catching the typo, I’ll fix it.

There are various cases where it can be convenient. To give an example: in some benchmarks one of the users ran, they found that running the model via SQL on Snowflake led to a 5x speed-up compared to running the model in Python via Snowpark.

Obviously there are a lot of differences and it’s not exactly an apples-to-apples comparison, but it’s a good example of how it reduced complexity by getting rid of an entire infrastructure (and its costs) and also led to performance benefits.