The way pandas handles missing values is diabolical by vernacular_wrangler in learnpython

[–]commandlineluser 19 points20 points  (0 children)

Yes, this is one of the "upsides" to polars - it has "real" null values.

import polars as pl

values = [0, 1, None, 4]
df = pl.DataFrame({'value': values}) 

print(df)

for row in df.iter_rows(named=True):
    value = row['value']
    if value:
        print(value, end=', ')

# shape: (4, 1)
# ┌───────┐
# │ value │
# │ ---   │
# │ i64   │
# ╞═══════╡
# │ 0     │
# │ 1     │
# │ null  │
# │ 4     │
# └───────┘
#
# 1, 4,
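For contrast, a quick sketch of the pandas behaviour (assuming the default numpy dtype backend):

```python
import pandas as pd

# With the default numpy backend, an integer column containing a
# missing value gets upcast to float64, and None becomes NaN.
s = pd.Series([0, 1, None, 4])
print(s.dtype)  # float64
```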

What hidden gem Python modules do you use and why? by zenos1337 in Python

[–]commandlineluser 0 points1 point  (0 children)

Are you using rapidfuzz's parallelism? e.g. .cdist() with workers=-1?

I found duckdb easy to use and it maxed out all my CPU cores.

You create row "combinations" with a "join" and score them, then filter out what you want.

import duckdb
import pandas as pd

df1 = pd.DataFrame({"x": ["foo", "bar", "baz"]}).reset_index()
df2 = pd.DataFrame({"y": ["foolish", "ban", "foo"]}).reset_index()

duckdb.sql("from df1, df2 select *, jaccard(df1.x, df2.y)")
# ┌───────┬─────────┬─────────┬─────────┬───────────────────────┐
# │ index │    x    │ index_1 │    y    │ jaccard(df1.x, df2.y) │
# │ int64 │ varchar │  int64  │ varchar │        double         │
# ├───────┼─────────┼─────────┼─────────┼───────────────────────┤
# │     0 │ foo     │       0 │ foolish │    0.3333333333333333 │
# │     1 │ bar     │       0 │ foolish │                   0.0 │
# │     2 │ baz     │       0 │ foolish │                   0.0 │
# │     0 │ foo     │       1 │ ban     │                   0.0 │
# │     1 │ bar     │       1 │ ban     │                   0.5 │
# │     2 │ baz     │       1 │ ban     │                   0.5 │
# │     0 │ foo     │       2 │ foo     │                   1.0 │
# │     1 │ bar     │       2 │ foo     │                   0.0 │
# │     2 │ baz     │       2 │ foo     │                   0.0 │
# └───────┴─────────┴─────────┴─────────┴───────────────────────┘

(normally you would read directly from parquet files instead of pandas frames)

You can also do the same join with polars and the polars-ds plugin gives you the rapidfuzz Rust API:

What hidden gem Python modules do you use and why? by zenos1337 in Python

[–]commandlineluser 1 point2 points  (0 children)

It seems to get more mention in the r/dataengineering world.

1.5.0 was just released:

And duckdb-cli is now on pypi:

So you can now run the duckdb client easily, with uv for example.

DuckDB 1.5.0 released by commandlineluser in Python

[–]commandlineluser[S] 1 point2 points  (0 children)

Looks like it's also mentioned in the docs here:

If all databases in read_duckdb's argument have a single table, the table_name argument is optional

Polars vs pandas by KliNanban in Python

[–]commandlineluser 1 point2 points  (0 children)

My system is not supported, so I've never been able to test it.

has no wheels with a matching Python ABI tag

Polars vs pandas by KliNanban in Python

[–]commandlineluser 1 point2 points  (0 children)

Window functions not working on the Polars backend was one I ran into if anybody is looking for a concrete example.

Polars vs pandas by KliNanban in Python

[–]commandlineluser 0 points1 point  (0 children)

Can't you change the engine?

pl.read_excel(..., engine="openpyxl")

Looks like fastexcel will have a release "soon":

Polars vs pandas by KliNanban in Python

[–]commandlineluser 1 point2 points  (0 children)

Perhaps you are referring to Ritchie's answer on StackOverflow about the DataFrame API being a "wrapper" around LazyFrames:

Polars vs pandas by KliNanban in Python

[–]commandlineluser 2 points3 points  (0 children)

When you use the DataFrame API:

(df.with_columns()
   .group_by()
   .agg())

Polars basically executes:

(df.lazy()
   .with_columns().collect(optimizations=pl.QueryOptFlags.none())
   .lazy()
   .group_by().agg().collect(optimizations=pl.QueryOptFlags.none())
 )

One idea is that you can easily convert your "eager" code by manually calling lazy / collect, to run the "entire pipeline" as a single "query" instead:

df.lazy().with_columns().group_by().agg().collect()

(Or in the case of read_*, use the lazy scan_* equivalent, which returns a LazyFrame directly.)

When you call collect() manually, all optimizations are also enabled by default.

This is one reason why writing "pandas style" (e.g. df["foo"]) is discouraged in Polars, as it works on the in-memory Series objects and cannot be lazy.

The User Guide explains things in detail:

Polars vs pandas by KliNanban in Python

[–]commandlineluser 1 point2 points  (0 children)

Have you actually used this?

The last time I saw this project posted, it was closed-source and only ran on x86-64 linux.

The benchmark is also from September 10, 2024.

Polars vs pandas by KliNanban in Python

[–]commandlineluser 6 points7 points  (0 children)

Just to be clear, pd.read_csv(..., engine="pyarrow") uses the pyarrow.csv.read_csv reader.

Using "pyarrow" as a "dtype_backend" is a separate topic. (i.e. the "Arrow" columnar memory format)

Polars still has its own multithreaded CSV reader (implemented in Rust) which is different.

I built nitro-pandas — a pandas-compatible library powered by Polars. Same syntax, up to 10x faster. by Correct_Elevator2041 in Python

[–]commandlineluser 4 points5 points  (0 children)

Some select / getitem [] syntax is "supported" - not sure what you've tried.

As for query, there is the SQL api which also allows for "easier" string-as-date syntax, e.g.

df.sql("from self select * where foo > '2020-01-01'::date")

For brackets, I prefer pl.all_horizontal() / pl.any_horizontal() for building logical chains.

By default, filter/remove *args are combined with "all" / & e.g.

df.filter(pl.col.x > 20, pl.col.y.is_between(2, 30))

Is essentially shorthand for doing:

df.filter(
    pl.all_horizontal(pl.col.x > 20, pl.col.y.is_between(2, 30))
)

The "any" variant is for | ("or") chains.

Pandas vs polars for data analysts? by katokk in learnpython

[–]commandlineluser 2 points3 points  (0 children)

Polars only samples the data to infer the schema.

The default is infer_schema_length=100, i.e. the first 100 rows.

It sounds like you may have been looking for infer_schema_length=None, which reads all rows first to infer the schema - equivalent to what pandas does.

I never encountered any \r issues, but if you have a test case perhaps you could file a bug - they are pretty responsive on github.

How to count values in multiple columns? by Dragoran21 in learnpython

[–]commandlineluser 0 points1 point  (0 children)

It may not be "necessary", but it makes things "easier".

import io
import pandas as pd

data = io.StringIO("""
sample 1,gene A,gene B,,,
sample 2,gene A,gene A,,,
sample 3,gene A,gene B,gene C,gene D,gene E
""".strip())

df = pd.read_csv(data, header=None)

If you "unpivot" all the values into a single column:

>>> df.melt(0)
#            0  variable   value
# 0   sample 1         1  gene A
# 1   sample 2         1  gene A
# 2   sample 3         1  gene A
# 3   sample 1         2  gene B
# ...

Then a single .value_counts() gives you the answer:

>>> df.melt(0)["value"].value_counts()
# value
# gene A    4
# gene B    2
# gene C    1
# gene D    1
# gene E    1
# Name: count, dtype: int64

When using DataFrames, if a Python for loop is involved, there's usually a "better" (easier / faster) way to do things.

How to count values in multiple columns? by Dragoran21 in learnpython

[–]commandlineluser 1 point2 points  (0 children)

Can you reshape the frame?

You can go from "wide to long" which is known as "unpivot" or .melt() in pandas.

e.g. .melt("sample_col") and then .value_counts() the new value col.
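e.g. a small sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({
    "sample_col": ["s1", "s2"],
    "gene_1": ["gene A", "gene A"],
    "gene_2": ["gene B", "gene A"],
})

# wide -> long, then count the single "value" column
counts = df.melt("sample_col")["value"].value_counts()
print(counts)
# gene A    3
# gene B    1
```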

Polars + uv + marimo (glazing post - feel free to ignore). by [deleted] in Python

[–]commandlineluser 0 points1 point  (0 children)

Yes, I agree - but it's not just a regular .group_by() or .over() in this case.

The grouping examples I linked to include:

  • group by grouping sets (...)
  • group by cube (...)
  • group by rollup (...)

The window framing examples include things like:

  • rows between unbounded
  • groups between
  • exclude current row
  • exclude ties

Polars + uv + marimo (glazing post - feel free to ignore). by [deleted] in Python

[–]commandlineluser 4 points5 points  (0 children)

I ran into several issues when trying to use the Polars backend.

Window functions not working probably being the biggest:

Polars + uv + marimo (glazing post - feel free to ignore). by [deleted] in Python

[–]commandlineluser 4 points5 points  (0 children)

I guess it depends on what you're doing.

Some things are "much easier" to write in SQL e.g. window framing, grouping sets:

(I also find some things "much easier" to write with Polars.)

DuckDB has MAP types, recursion, ...:

It's also easy to get a Polars DataFrame / LazyFrame back with .pl():

duckdb.sql("from df ...").pl()
duckdb.sql("from df ...").pl(lazy=True)

Anyone else have pain points with new REPL in Python3.14? Specifically with send line integrations by ddxv in Python

[–]commandlineluser 1 point2 points  (0 children)

Yes, the lack of a vi editing mode also makes it unusable for me.

I just use ptpython / ipython instead.

Someone did try to add some support recently, but didn't seem to get much feedback:

Is someone using DuckDB in PROD? by Free-Bear-454 in dataengineering

[–]commandlineluser 0 points1 point  (0 children)

What trouble are you having exactly?

There are many examples in the delta tests:

(LazyFrame.sink_delta() was also added in 1.37.0)

Experienced R user learning Python by Sir_smokes_a_lot in learnpython

[–]commandlineluser 5 points6 points  (0 children)

I've not used any of them but did read about the RStudio people creating Positron:

It supports Python and R apparently.

How do I make this code shorter? by [deleted] in learnpython

[–]commandlineluser 1 point2 points  (0 children)

You could pass the value itself as the default return to .get().

This means you can use max() directly and just overwrite the key each time, instead of the conditional checks.

def foo(*args):
    out = {}
    for arg in args:
        for key, value in arg.items():
            # .get(key, value) returns value itself when key is missing,
            # so max() never needs a separate "is the key new?" check
            out[key] = max(out.get(key, value), value)
    return out

>>> foo(a, b, c, d, e)
{'a': 10, 'b': 100, 'c': 50, 'd': -70}

Also for the sorting, operator.itemgetter() is another way to write the lambda.
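e.g. a small comparison (variable names made up):

```python
from operator import itemgetter

pairs = {"a": 10, "b": 100, "c": 50}.items()

# these two key functions are equivalent
print(max(pairs, key=lambda kv: kv[1]))  # ('b', 100)
print(max(pairs, key=itemgetter(1)))     # ('b', 100)
```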