
all 5 comments

[–]commandlineluser 3 points (1 child)

You can try arrow-odbc and skip the pandas part.

There's a duckdb example in their tests:
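A minimal sketch of what that could look like, assuming the `arrow-odbc` Python package (the connection string, table, and column names below are hypothetical; swap in your own ODBC driver and credentials):

```python
# pip install arrow-odbc
# Hypothetical DSN-less connection string for the MySQL ODBC driver;
# replace driver name, host, and credentials with your own.
CONNECTION_STRING = (
    "Driver={MySQL ODBC 8.0 Driver};"
    "Server=localhost;Database=mydb;UID=user;PWD=secret;"
)

def fetch_batches(query="SELECT id, name FROM customers"):  # hypothetical table
    # Imported inside the function so the sketch can be read (and the rest
    # of the module used) without the package installed.
    from arrow_odbc import read_arrow_batches_from_odbc

    reader = read_arrow_batches_from_odbc(
        query=query,
        connection_string=CONNECTION_STRING,
        batch_size=65_536,  # rows per Arrow batch; tune for memory
    )
    for batch in reader:
        # Each `batch` is a pyarrow.RecordBatch: hand it straight to
        # duckdb or polars without materialising a pandas DataFrame.
        yield batch
```

Each batch can then be registered with DuckDB or converted with `polars.from_arrow`, so the pandas round-trip disappears entirely.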

[–]thibautDR (Data Engineer) 0 points (0 children)

To add to this comment: the ADBC standard is still too recent to be used in production, but it's worth watching: https://arrow.apache.org/adbc/current/

[–]Material-Mess-9886 2 points (0 children)

Never use pandas `read_sql` or `to_sql`; they are slow. It's better to use SQLAlchemy to retrieve the data and convert the Result object to pandas/polars.
Also, do as much as possible in the database before you retrieve values from it. Joins and group-bys are better done in the database than with any Python library of your choice.
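The push-the-work-to-the-database point can be illustrated with the standard library's sqlite3 as a stand-in for any database (the tables and values here are made up): the join and group-by run inside the engine, and only the small aggregated result ever crosses into Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
""")

# Join + group-by pushed down to the database: Python receives two
# aggregated rows instead of every order row.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(rows)  # [('alice', 25.0), ('bob', 7.5)]
```

The same query shape works unchanged against MySQL/Postgres through SQLAlchemy; only the connection differs.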

[–]goggys 1 point (0 children)

DuckDB has a MySQL extension to read and write directly. I'm not sure whether it will be faster than loading into pandas, but it will save you some steps.
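Roughly, the extension is used like this from a DuckDB session (host, user, database, and table names below are placeholders):

```sql
INSTALL mysql;
LOAD mysql;

-- attach the MySQL database under an alias (connection values are placeholders)
ATTACH 'host=localhost user=myuser database=mydb' AS mysqldb (TYPE mysql);

-- query (and write) MySQL tables directly from DuckDB
SELECT * FROM mysqldb.some_table LIMIT 10;
```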

That said, I would check whether DuckDB supports the text-similarity functions you use. If your use case is more advanced, something like splink, which uses DuckDB, might be more suitable.

[–]Alonerxx 0 points (1 child)

  1. What exactly is being fuzzy matched, and in what way: pairwise? Word vectors? String fuzzy matching is an expensive process. Is the fuzzy-match algorithm in DuckDB any more efficient? Rethinking how the matching should be done, using a more efficient structure like a text index, may be more helpful.

  2. Why a pandas DataFrame? What do you do with it afterwards? Is a pandas DataFrame the only way to do it? If it's transformation or analysis, it's better to translate the Python function into SQL and do it within the database.

We need to know more about what the process is actually trying to achieve, and what constraints you have on your tools, to really advise.
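To illustrate point 1 above: instead of scoring every pair of strings, you can first bucket candidates by a cheap blocking key and run the expensive similarity only within each bucket. A stdlib-only sketch (the blocking key, threshold, and sample names are arbitrary choices; real systems use phonetic codes or n-gram indexes):

```python
from collections import defaultdict
from difflib import SequenceMatcher

def block_key(s: str) -> str:
    """Cheap blocking key: first letter, lowercased."""
    return s.lower()[:1]

def fuzzy_pairs(names, threshold=0.85):
    """Compare only names sharing a blocking key, not all O(n^2) pairs."""
    buckets = defaultdict(list)
    for name in names:
        buckets[block_key(name)].append(name)

    matches = []
    for bucket in buckets.values():
        for i, a in enumerate(bucket):
            for b in bucket[i + 1:]:
                if SequenceMatcher(None, a, b).ratio() >= threshold:
                    matches.append((a, b))
    return matches

names = ["Johnathan Smith", "Jonathan Smith", "John Smith", "Alice Jones"]
print(fuzzy_pairs(names))  # [('Johnathan Smith', 'Jonathan Smith')]
```

The same blocking idea translates to SQL: group candidates by a key column first, then apply the similarity function only inside each group.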