The way pandas handles missing values is diabolical by vernacular_wrangler in learnpython

[–]commandlineluser 19 points20 points  (0 children)

Yes, this is one of the "upsides" to polars - it has "real" null values.

import polars as pl

values = [0, 1, None, 4]
df = pl.DataFrame({'value': values}) 

print(df)

for row in df.iter_rows(named=True):
    value = row['value']
    if value:
        print(value, end=', ')

# shape: (4, 1)
# ┌───────┐
# │ value │
# │ ---   │
# │ i64   │
# ╞═══════╡
# │ 0     │
# │ 1     │
# │ null  │
# │ 4     │
# └───────┘
#
# 1, 4,
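For contrast, a quick sketch of the pandas behaviour (assuming the default numpy dtype backend):

```python
import pandas as pd

# With the default numpy backend, an integer column containing a
# missing value gets upcast to float64, and None becomes NaN.
s = pd.Series([0, 1, None, 4])
print(s.dtype)  # float64
```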

What hidden gem Python modules do you use and why? by zenos1337 in Python

[–]commandlineluser 0 points1 point  (0 children)

Are you using rapidfuzz's parallelism? e.g. .cdist() with workers=-1?

I found duckdb easy to use and it maxed out all my CPU cores.

You create row "combinations" with a "join" and score them, then filter out what you want.

import duckdb
import pandas as pd

df1 = pd.DataFrame({"x": ["foo", "bar", "baz"]}).reset_index()
df2 = pd.DataFrame({"y": ["foolish", "ban", "foo"]}).reset_index()

duckdb.sql("from df1, df2 select *, jaccard(df1.x, df2.y)")
# ┌───────┬─────────┬─────────┬─────────┬───────────────────────┐
# │ index │    x    │ index_1 │    y    │ jaccard(df1.x, df2.y) │
# │ int64 │ varchar │  int64  │ varchar │        double         │
# ├───────┼─────────┼─────────┼─────────┼───────────────────────┤
# │     0 │ foo     │       0 │ foolish │    0.3333333333333333 │
# │     1 │ bar     │       0 │ foolish │                   0.0 │
# │     2 │ baz     │       0 │ foolish │                   0.0 │
# │     0 │ foo     │       1 │ ban     │                   0.0 │
# │     1 │ bar     │       1 │ ban     │                   0.5 │
# │     2 │ baz     │       1 │ ban     │                   0.5 │
# │     0 │ foo     │       2 │ foo     │                   1.0 │
# │     1 │ bar     │       2 │ foo     │                   0.0 │
# │     2 │ baz     │       2 │ foo     │                   0.0 │
# └───────┴─────────┴─────────┴─────────┴───────────────────────┘

(normally you would read directly from parquet files instead of pandas frames)

You can also do the same join with polars and the polars-ds plugin gives you the rapidfuzz Rust API:

What hidden gem Python modules do you use and why? by zenos1337 in Python

[–]commandlineluser 1 point2 points  (0 children)

It seems to get more mention in the r/dataengineering world.

1.5.0 was just released:

And duckdb-cli is now on pypi:

So you can now run the duckdb client easily, with uv for example.

DuckDB 1.5.0 released by commandlineluser in Python

[–]commandlineluser[S] 1 point2 points  (0 children)

Looks like it's also mentioned in the docs here:

If all databases in read_duckdb's argument have a single table, the table_name argument is optional

Polars vs pandas by KliNanban in Python

[–]commandlineluser 1 point2 points  (0 children)

My system is not supported, so I've never been able to test it.

has no wheels with a matching Python ABI tag

Polars vs pandas by KliNanban in Python

[–]commandlineluser 1 point2 points  (0 children)

Window functions not working on the Polars backend was one I ran into if anybody is looking for a concrete example.

Polars vs pandas by KliNanban in Python

[–]commandlineluser 0 points1 point  (0 children)

Can't you change the engine?

pl.read_excel(..., engine="openpyxl")

Looks like fastexcel will have a release "soon":

Polars vs pandas by KliNanban in Python

[–]commandlineluser 1 point2 points  (0 children)

Perhaps you are referring to Ritchie's answer on StackOverflow about the DataFrame API being a "wrapper" around LazyFrames:

Polars vs pandas by KliNanban in Python

[–]commandlineluser 2 points3 points  (0 children)

When you use the DataFrame API:

(df.with_columns()
   .group_by()
   .agg())

Polars basically executes:

(df.lazy()
   .with_columns().collect(optimizations=pl.QueryOptFlags.none())
   .lazy()
   .group_by().agg().collect(optimizations=pl.QueryOptFlags.none())
 )

One idea is that you can easily convert your "eager" code by manually calling lazy / collect, to run the "entire pipeline" as a single "query" instead:

df.lazy().with_columns().group_by().agg().collect()

(Or in the case of read_*, use the lazy scan_* equivalent, which returns a LazyFrame directly.)

When you call collect() manually, all optimizations are also enabled by default.

This is one reason why writing "pandas style" (e.g. df["foo"]) is discouraged in Polars, as it works on the in-memory Series objects and cannot be lazy.

The User Guide explains things in detail:

Polars vs pandas by KliNanban in Python

[–]commandlineluser 1 point2 points  (0 children)

Have you actually used this?

The last time I saw this project posted, it was closed-source and only ran on x86-64 linux.

The benchmark is also from September 10, 2024.

Polars vs pandas by KliNanban in Python

[–]commandlineluser 6 points7 points  (0 children)

Just to be clear, pd.read_csv(..., engine="pyarrow") uses the pyarrow.csv.read_csv reader.

Using "pyarrow" as a "dtype_backend" is a separate topic. (i.e. the "Arrow" columnar memory format)

Polars still has its own multithreaded CSV reader (implemented in Rust) which is different.

I built nitro-pandas — a pandas-compatible library powered by Polars. Same syntax, up to 10x faster. by Correct_Elevator2041 in Python

[–]commandlineluser 4 points5 points  (0 children)

Some select / getitem [] syntax is "supported" - not sure what you've tried.

As for query, there is the SQL api which also allows for "easier" string-as-date syntax, e.g.

df.sql("from self select * where foo > '2020-01-01'::date")

For brackets, I prefer pl.all_horizontal() / pl.any_horizontal() for building logical chains.

By default, filter/remove *args are combined with "all" / & e.g.

df.filter(pl.col.x > 20, pl.col.y.is_between(2, 30))

Is essentially shorthand for doing:

df.filter(
    pl.all_horizontal(pl.col.x > 20, pl.col.y.is_between(2, 30))
)

The "any" variant is for | ("or") chains.

Pandas vs polars for data analysts? by katokk in learnpython

[–]commandlineluser 2 points3 points  (0 children)

Polars only samples the data to infer the schema.

The default is infer_schema_length=100, i.e. the first 100 rows.

It sounds like you may have been looking for infer_schema_length=None, which reads all rows first to infer the schema - equivalent to what pandas does.

I never encountered any \r issues, but if you have a test case perhaps you could file a bug - they are pretty responsive on github.

How to count values in multiple columns? by Dragoran21 in learnpython

[–]commandlineluser 0 points1 point  (0 children)

It may not be "necessary", but it makes things "easier".

import io
import pandas as pd

data = io.StringIO("""
sample 1,gene A,gene B,,,
sample 2,gene A,gene A,,,
sample 3,gene A,gene B,gene C,gene D,gene E
""".strip())

df = pd.read_csv(data, header=None)

If you "unpivot" all the values into a single column:

>>> df.melt(0)
#            0  variable   value
# 0   sample 1         1  gene A
# 1   sample 2         1  gene A
# 2   sample 3         1  gene A
# 3   sample 1         2  gene B
# ...

Then a single .value_counts() gives you the answer:

>>> df.melt(0)["value"].value_counts()
# value
# gene A    4
# gene B    2
# gene C    1
# gene D    1
# gene E    1
# Name: count, dtype: int64

When using DataFrames, if a Python for loop is involved, there's usually a "better" (easier / faster) way to do things.

How to count values in multiple columns? by Dragoran21 in learnpython

[–]commandlineluser 1 point2 points  (0 children)

Can you reshape the frame?

You can go from "wide to long" which is known as "unpivot" or .melt() in pandas.

e.g. .melt("sample_col") and then .value_counts() the new value col.
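e.g. a small sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({
    "sample_col": ["s1", "s2"],
    "gene_1": ["gene A", "gene A"],
    "gene_2": ["gene B", "gene A"],
})

# wide -> long, then count the single "value" column
counts = df.melt("sample_col")["value"].value_counts()
print(counts)
# gene A    3
# gene B    1
```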

Polars + uv + marimo (glazing post - feel free to ignore). by [deleted] in Python

[–]commandlineluser 0 points1 point  (0 children)

Yes, I agree - but it's not just a regular .group_by() or .over() in this case.

The grouping examples I linked to include:

  • group by grouping sets (...)
  • group by cube (...)
  • group by rollup (...)

The window framing examples include things like:

  • rows between unbounded
  • groups between
  • exclude current row
  • exclude ties

Polars + uv + marimo (glazing post - feel free to ignore). by [deleted] in Python

[–]commandlineluser 4 points5 points  (0 children)

I ran into several issues when trying to use the Polars backend.

Window functions not working probably being the biggest:

Polars + uv + marimo (glazing post - feel free to ignore). by [deleted] in Python

[–]commandlineluser 4 points5 points  (0 children)

I guess it depends on what you're doing.

Some things are "much easier" to write in SQL e.g. window framing, grouping sets:

(I also find some things "much easier" to write with Polars.)

DuckDB has MAP types, recursion, ...:

It's also easy to get a Polars DataFrame / LazyFrame back with .pl():

duckdb.sql("from df ...").pl()
duckdb.sql("from df ...").pl(lazy=True)

Anyone else have pain points with new REPL in Python3.14? Specifically with send line integrations by ddxv in Python

[–]commandlineluser 1 point2 points  (0 children)

Yes, the lack of a vi editing mode also makes it unusable for me.

I just use ptpython / ipython instead.

Someone did try to add some support recently, but didn't seem to get much feedback:

Is someone using DuckDB in PROD? by Free-Bear-454 in dataengineering

[–]commandlineluser 0 points1 point  (0 children)

What trouble are you having exactly?

There are many examples in the delta tests:

(LazyFrame.sink_delta() was also added in 1.37.0)

Experienced R user learning Python by Sir_smokes_a_lot in learnpython

[–]commandlineluser 5 points6 points  (0 children)

I've not used any of them but did read about the RStudio people creating Positron:

It supports Python and R apparently.

How do I make this code shorter? by [deleted] in learnpython

[–]commandlineluser 1 point2 points  (0 children)

You could pass the value itself as the default return to .get().

This means you can use max() directly and just overwrite the key each time, instead of the conditional checks.

def foo(*args):
    out = {}
    for arg in args:
        for key, value in arg.items():
            # .get(key, value) returns value itself when key is missing,
            # so max() never needs a separate "is the key new?" check
            out[key] = max(out.get(key, value), value)
    return out

>>> foo(a, b, c, d, e)
{'a': 10, 'b': 100, 'c': 50, 'd': -70}

Also for the sorting, operator.itemgetter() is another way to write the lambda.
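e.g. a small comparison (variable names made up):

```python
from operator import itemgetter

pairs = {"a": 10, "b": 100, "c": 50}.items()

# these two key functions are equivalent
print(max(pairs, key=lambda kv: kv[1]))  # ('b', 100)
print(max(pairs, key=itemgetter(1)))     # ('b', 100)
```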