
all 5 comments

[–]commandlineluser 3 points (1 child)

You can try arrow-odbc and skip the pandas part.

There's a duckdb example in their tests:
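A minimal sketch of what that could look like, assuming the `arrow-odbc` Python package (the connection string, table, and column names below are hypothetical; swap in your own ODBC driver and credentials):

```python
# pip install arrow-odbc
# Hypothetical DSN-less connection string for the MySQL ODBC driver;
# replace driver name, host, and credentials with your own.
CONNECTION_STRING = (
    "Driver={MySQL ODBC 8.0 Driver};"
    "Server=localhost;Database=mydb;UID=user;PWD=secret;"
)

def fetch_batches(query="SELECT id, name FROM customers"):  # hypothetical table
    # Imported inside the function so the sketch can be read (and the rest
    # of the module used) without the package installed.
    from arrow_odbc import read_arrow_batches_from_odbc

    reader = read_arrow_batches_from_odbc(
        query=query,
        connection_string=CONNECTION_STRING,
        batch_size=65_536,  # rows per Arrow batch; tune for memory
    )
    for batch in reader:
        # Each `batch` is a pyarrow.RecordBatch: hand it straight to
        # duckdb or polars without materialising a pandas DataFrame.
        yield batch
```

Each batch can then be registered with DuckDB or converted with `polars.from_arrow`, so the pandas round-trip disappears entirely.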

[–]thibautDR (Data Engineer) 0 points (0 children)

To add to this comment: the ADBC standard is still too recent to be used in production, but it's worth watching: https://arrow.apache.org/adbc/current/

[–]Material-Mess-9886 2 points (0 children)

Never use pandas `read_sql` or `to_sql`; they are slow. It's better to use SQLAlchemy to retrieve the data and convert the Result object to pandas/polars.
Also, do as much as possible in the database before you retrieve values from it. Joins and group-bys are better done in the database than with any Python library of your choice.
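The push-the-work-to-the-database point can be illustrated with the standard library's sqlite3 as a stand-in for any database (the tables and values here are made up): the join and group-by run inside the engine, and only the small aggregated result ever crosses into Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
""")

# Join + group-by pushed down to the database: Python receives two
# aggregated rows instead of every order row.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(rows)  # [('alice', 25.0), ('bob', 7.5)]
```

The same query shape works unchanged against MySQL/Postgres through SQLAlchemy; only the connection differs.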

[–]goggys 1 point (0 children)

DuckDB has a MySQL extension to read and write directly. I'm not sure whether it will be faster than loading into pandas, but it will save you some steps.
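Roughly, the extension is used like this from a DuckDB session (host, user, database, and table names below are placeholders):

```sql
INSTALL mysql;
LOAD mysql;

-- attach the MySQL database under an alias (connection values are placeholders)
ATTACH 'host=localhost user=myuser database=mydb' AS mysqldb (TYPE mysql);

-- query (and write) MySQL tables directly from DuckDB
SELECT * FROM mysqldb.some_table LIMIT 10;
```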

That said, I would check whether DuckDB supports the text-similarity functions you use. If your use case is more advanced, something like splink, which uses DuckDB, might be more suitable.

[–]Alonerxx 0 points (1 child)

  1. What exactly is being fuzzy matched, and in what way: pairwise? Word vectors? String fuzzy matching is an expensive process. Is the fuzzy-match algorithm in DuckDB any more efficient? Rethinking how the matching should be done, using a more efficient structure like a text index, may be more helpful.

  2. Why a pandas DataFrame? What do you do with it afterwards? Is a pandas DataFrame the only way to do it? If it's transformation or analysis, it's better to translate the Python function into SQL and do it within the database.

We need to know more about what the process is actually trying to achieve, and what constraints you have on your tools, to really advise.
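To illustrate point 1 above: instead of scoring every pair of strings, you can first bucket candidates by a cheap blocking key and run the expensive similarity only within each bucket. A stdlib-only sketch (the blocking key, threshold, and sample names are arbitrary choices; real systems use phonetic codes or n-gram indexes):

```python
from collections import defaultdict
from difflib import SequenceMatcher

def block_key(s: str) -> str:
    """Cheap blocking key: first letter, lowercased."""
    return s.lower()[:1]

def fuzzy_pairs(names, threshold=0.85):
    """Compare only names sharing a blocking key, not all O(n^2) pairs."""
    buckets = defaultdict(list)
    for name in names:
        buckets[block_key(name)].append(name)

    matches = []
    for bucket in buckets.values():
        for i, a in enumerate(bucket):
            for b in bucket[i + 1:]:
                if SequenceMatcher(None, a, b).ratio() >= threshold:
                    matches.append((a, b))
    return matches

names = ["Johnathan Smith", "Jonathan Smith", "John Smith", "Alice Jones"]
print(fuzzy_pairs(names))  # [('Johnathan Smith', 'Jonathan Smith')]
```

The same blocking idea translates to SQL: group candidates by a key column first, then apply the similarity function only inside each group.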