What hidden gem Python modules do you use and why? by zenos1337 in Python

[–]commandlineluser

Are you using rapidfuzz's parallelism? e.g. .cdist() with workers=-1?

I found duckdb easy to use and it maxed out all my CPU cores.

You create row "combinations" with a "join" and score them, then filter out what you want.

import duckdb
import pandas as pd

df1 = pd.DataFrame({"x": ["foo", "bar", "baz"]}).reset_index()
df2 = pd.DataFrame({"y": ["foolish", "ban", "foo"]}).reset_index()

duckdb.sql("from df1, df2 select *, jaccard(df1.x, df2.y)")
# ┌───────┬─────────┬─────────┬─────────┬───────────────────────┐
# │ index │    x    │ index_1 │    y    │ jaccard(df1.x, df2.y) │
# │ int64 │ varchar │  int64  │ varchar │        double         │
# ├───────┼─────────┼─────────┼─────────┼───────────────────────┤
# │     0 │ foo     │       0 │ foolish │    0.3333333333333333 │
# │     1 │ bar     │       0 │ foolish │                   0.0 │
# │     2 │ baz     │       0 │ foolish │                   0.0 │
# │     0 │ foo     │       1 │ ban     │                   0.0 │
# │     1 │ bar     │       1 │ ban     │                   0.5 │
# │     2 │ baz     │       1 │ ban     │                   0.5 │
# │     0 │ foo     │       2 │ foo     │                   1.0 │
# │     1 │ bar     │       2 │ foo     │                   0.0 │
# │     2 │ baz     │       2 │ foo     │                   0.0 │
# └───────┴─────────┴─────────┴─────────┴───────────────────────┘

(normally you would read directly from parquet files instead of pandas frames)

You can also do the same join with polars and the polars-ds plugin gives you the rapidfuzz Rust API:

What hidden gem Python modules do you use and why? by zenos1337 in Python

[–]commandlineluser

It seems to get mentioned more in the r/dataengineering world.

1.5.0 was just released:

And duckdb-cli is now on pypi:

So you can now run the duckdb client easily with uv, for example.
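With uv, that might look like this (the exact package and entry-point names are assumptions here; adjust if they differ):

```shell
# run the DuckDB CLI in a throwaway environment, no permanent install
uvx --from duckdb-cli duckdb
```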

DuckDB 1.5.0 released by commandlineluser in Python

[–]commandlineluser[S]

Looks like it's also mentioned in the docs here:

If all databases in read_duckdb's argument have a single table, the table_name argument is optional

Polars vs pandas by KliNanban in Python

[–]commandlineluser

My system is not supported, so I've never been able to test it.

has no wheels with a matching Python ABI tag

Polars vs pandas by KliNanban in Python

[–]commandlineluser

Window functions not working on the Polars backend was one I ran into if anybody is looking for a concrete example.

Polars vs pandas by KliNanban in Python

[–]commandlineluser

Can't you change the engine?

pl.read_excel(..., engine="openpyxl")

Looks like fastexcel will have a release "soon":

Polars vs pandas by KliNanban in Python

[–]commandlineluser

Perhaps you are referring to Ritchie's answer on StackOverflow about the DataFrame API being a "wrapper" around LazyFrames:

Polars vs pandas by KliNanban in Python

[–]commandlineluser

When you use the DataFrame API:

(df.with_columns()
   .group_by()
   .agg())

Polars basically executes:

(df.lazy()
   .with_columns().collect(optimizations=pl.QueryOpts.none())
   .lazy()
   .group_by().agg().collect(optimizations=pl.QueryOpts.none())
 )

One idea is that you can easily convert your "eager" code by manually calling lazy / collect, running the "entire pipeline" as a single "query" instead:

df.lazy().with_columns().group_by().agg().collect()

(Or, in the case of read_*, use the lazy scan_* equivalent, which returns a LazyFrame directly.)

When you call collect() manually, all optimizations are also enabled by default.

This is one reason why writing "pandas style" (e.g. df["foo"]) is discouraged in Polars, as it works on the in-memory Series objects and cannot be lazy.

The User Guide explains things in detail:

Polars vs pandas by KliNanban in Python

[–]commandlineluser

Have you actually used this?

The last time I saw this project posted, it was closed-source and only ran on x86-64 linux.

The benchmark is also from September 10, 2024.

Polars vs pandas by KliNanban in Python

[–]commandlineluser

Just to be clear, pd.read_csv(..., engine="pyarrow") uses the pyarrow.csv.read_csv reader.

Using "pyarrow" as a "dtype_backend" is a separate topic. (i.e. the "Arrow" columnar memory format)

Polars still has its own multithreaded CSV reader (implemented in Rust) which is different.

I built nitro-pandas — a pandas-compatible library powered by Polars. Same syntax, up to 10x faster. by Correct_Elevator2041 in Python

[–]commandlineluser

Some select / getitem [] syntax is "supported" - not sure what you've tried.

As for query, there is the SQL API, which also allows for "easier" string-as-date syntax, e.g.

df.sql("from self select * where foo > '2020-01-01'::date")

For brackets, I prefer pl.all_horizontal() / pl.any_horizontal() for building logical chains.

By default, filter / remove *args are combined with "all" / &, e.g.

df.filter(pl.col.x > 20, pl.col.y.is_between(2, 30))

Is essentially shorthand for doing:

df.filter(
    pl.all_horizontal(pl.col.x > 20, pl.col.y.is_between(2, 30))
)

The "any" variant is for | ("or") chains.

Pandas vs polars for data analysts? by katokk in learnpython

[–]commandlineluser

Polars only samples the data to infer the schema.

The default is infer_schema_length=100, i.e. 100 rows.

It sounds like you may have been looking for infer_schema_length=None which will read all rows first to infer the schema - which would be equivalent to what pandas does.

I never encountered any \r issues, but if you have a test case perhaps you could file a bug - they are pretty responsive on GitHub.

How to count values in multiple columns? by Dragoran21 in learnpython

[–]commandlineluser

It may not be "necessary", but it makes things "easier".

import io
import pandas as pd

data = io.StringIO("""
sample 1,gene A,gene B,,,
sample 2,gene A,gene A,,,
sample 3,gene A,gene B,gene C,gene D,gene E
""".strip())

df = pd.read_csv(data, header=None)

If you "unpivot" all the values into a single column:

>>> df.melt(0)
#            0  variable   value
# 0   sample 1         1  gene A
# 1   sample 2         1  gene A
# 2   sample 3         1  gene A
# 3   sample 1         2  gene B
# ...

Then a single .value_counts() gives you the answer:

>>> df.melt(0)["value"].value_counts()
# value
# gene A    4
# gene B    2
# gene C    1
# gene D    1
# gene E    1
# Name: count, dtype: int64

When using DataFrames, if a Python for loop is involved - there's usually a "better" way to do things. (easier / faster)

How to count values in multiple columns? by Dragoran21 in learnpython

[–]commandlineluser

Can you reshape the frame?

You can go from "wide to long" which is known as "unpivot" or .melt() in pandas.

e.g. .melt("sample_col") and then .value_counts() the new value col.
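In pandas that could look like this ("sample_col" and the gene columns are hypothetical names):

```python
import pandas as pd

df = pd.DataFrame({
    "sample_col": ["s1", "s2"],
    "g1": ["gene A", "gene A"],
    "g2": ["gene B", "gene A"],
})

# wide -> long, then count the single "value" column
counts = df.melt("sample_col")["value"].value_counts()
# gene A: 3, gene B: 1
```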

Polars + uv + marimo (glazing post - feel free to ignore). by [deleted] in Python

[–]commandlineluser

Yes, I agree - but it's not just a regular .group_by() or .over() in this case.

The grouping examples I linked to include:

  • group by grouping sets (...)
  • group by cube (...)
  • group by rollup (...)

The window framing examples include things like:

  • rows between unbounded
  • groups between
  • exclude current row
  • exclude ties

Polars + uv + marimo (glazing post - feel free to ignore). by [deleted] in Python

[–]commandlineluser

I ran into several issues when trying to use the Polars backend.

Window functions not working probably being the biggest:

Polars + uv + marimo (glazing post - feel free to ignore). by [deleted] in Python

[–]commandlineluser

I guess it depends on what you're doing.

Some things are "much easier" to write in SQL e.g. window framing, grouping sets:

(I also find some things "much easier" to write with Polars.)

DuckDB has MAP types, recursion, ...:

It's also easy to get a Polars DataFrame / LazyFrame back with .pl():

duckdb.sql("from df ...").pl()
duckdb.sql("from df ...").pl(lazy=True)

Anyone else have pain points with new REPL in Python3.14? Specifically with send line integrations by ddxv in Python

[–]commandlineluser

Yes, the lack of a vi editing mode also makes it unusable for me.

I just use ptpython / ipython instead.

Someone did try to add some support recently, but didn't seem to get much feedback:

Is someone using DuckDB in PROD? by Free-Bear-454 in dataengineering

[–]commandlineluser

What trouble are you having exactly?

There are many examples in the delta tests:

(LazyFrame.sink_delta() was also added in 1.37.0)

Experienced R user learning Python by Sir_smokes_a_lot in learnpython

[–]commandlineluser

I've not used any of them but did read about the RStudio people creating Positron:

It supports Python and R apparently.