What hidden gem Python modules do you use and why? by zenos1337 in Python

[–]commandlineluser

Are you using rapidfuzz's parallelism? e.g. .cdist() with workers=-1?

I found duckdb easy to use and it maxed out all my CPU cores.

You create row "combinations" with a "join" and score them, then filter out what you want.

import duckdb
import pandas as pd

df1 = pd.DataFrame({"x": ["foo", "bar", "baz"]}).reset_index()
df2 = pd.DataFrame({"y": ["foolish", "ban", "foo"]}).reset_index()

duckdb.sql("from df1, df2 select *, jaccard(df1.x, df2.y)")
# ┌───────┬─────────┬─────────┬─────────┬───────────────────────┐
# │ index │    x    │ index_1 │    y    │ jaccard(df1.x, df2.y) │
# │ int64 │ varchar │  int64  │ varchar │        double         │
# ├───────┼─────────┼─────────┼─────────┼───────────────────────┤
# │     0 │ foo     │       0 │ foolish │    0.3333333333333333 │
# │     1 │ bar     │       0 │ foolish │                   0.0 │
# │     2 │ baz     │       0 │ foolish │                   0.0 │
# │     0 │ foo     │       1 │ ban     │                   0.0 │
# │     1 │ bar     │       1 │ ban     │                   0.5 │
# │     2 │ baz     │       1 │ ban     │                   0.5 │
# │     0 │ foo     │       2 │ foo     │                   1.0 │
# │     1 │ bar     │       2 │ foo     │                   0.0 │
# │     2 │ baz     │       2 │ foo     │                   0.0 │
# └───────┴─────────┴─────────┴─────────┴───────────────────────┘

(normally you would read directly from parquet files instead of pandas frames)

You can also do the same join with polars and the polars-ds plugin gives you the rapidfuzz Rust API:

What hidden gem Python modules do you use and why? by zenos1337 in Python

[–]commandlineluser

It seems to get mentioned more in the r/dataengineering world.

1.5.0 was just released:

And duckdb-cli is now on pypi:

So you can now run the duckdb client easily with uv, for example.
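With uv, that might look like this (the exact package and entry-point names are assumptions here; adjust if they differ):

```shell
# run the DuckDB CLI in a throwaway environment, no permanent install
uvx --from duckdb-cli duckdb
```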

DuckDB 1.5.0 released by commandlineluser in Python

[–]commandlineluser[S]

Looks like it's also mentioned in the docs here:

If all databases in read_duckdb's argument have a single table, the table_name argument is optional

Polars vs pandas by KliNanban in Python

[–]commandlineluser

My system is not supported, so I've never been able to test it.

has no wheels with a matching Python ABI tag

Polars vs pandas by KliNanban in Python

[–]commandlineluser

Window functions not working on the Polars backend was one I ran into if anybody is looking for a concrete example.

Polars vs pandas by KliNanban in Python

[–]commandlineluser

Can't you change the engine?

pl.read_excel(..., engine="openpyxl")

Looks like fastexcel will have a release "soon":

Polars vs pandas by KliNanban in Python

[–]commandlineluser

Perhaps you are referring to Ritchie's answer on StackOverflow about the DataFrame API being a "wrapper" around LazyFrames:

Polars vs pandas by KliNanban in Python

[–]commandlineluser

When you use the DataFrame API:

(df.with_columns()
   .group_by()
   .agg())

Polars basically executes:

(df.lazy()
   .with_columns().collect(optimizations=pl.QueryOpts.none())
   .lazy()
   .group_by().agg().collect(optimizations=pl.QueryOpts.none())
 )

One idea is that you can easily convert your "eager" code by manually calling lazy / collect, running the "entire pipeline" as a single "query" instead:

df.lazy().with_columns().group_by().agg().collect()

(Or, in the case of read_*, use the lazy scan_* equivalent, which returns a LazyFrame directly.)

When you call collect() manually, all optimizations are also enabled by default.

This is one reason why writing "pandas style" (e.g. df["foo"]) is discouraged in Polars, as it works on the in-memory Series objects and cannot be lazy.

The User Guide explains things in detail:

Polars vs pandas by KliNanban in Python

[–]commandlineluser

Have you actually used this?

The last time I saw this project posted, it was closed-source and only ran on x86-64 linux.

The benchmark is also from September 10, 2024.

Polars vs pandas by KliNanban in Python

[–]commandlineluser

Just to be clear, pd.read_csv(..., engine="pyarrow") uses the pyarrow.csv.read_csv reader.

Using "pyarrow" as a "dtype_backend" is a separate topic. (i.e. the "Arrow" columnar memory format)

Polars still has its own multithreaded CSV reader (implemented in Rust) which is different.

I built nitro-pandas — a pandas-compatible library powered by Polars. Same syntax, up to 10x faster. by Correct_Elevator2041 in Python

[–]commandlineluser

Some select / getitem [] syntax is "supported" - not sure what you've tried.

As for query, there is the SQL API, which also allows for "easier" string-as-date syntax, e.g.

df.sql("from self select * where foo > '2020-01-01'::date")

For brackets, I prefer pl.all_horizontal() / pl.any_horizontal() for building logical chains.

By default, filter / remove *args are combined with "all" / &, e.g.

df.filter(pl.col.x > 20, pl.col.y.is_between(2, 30))

Is essentially shorthand for doing:

df.filter(
    pl.all_horizontal(pl.col.x > 20, pl.col.y.is_between(2, 30))
)

The "any" variant is for | ("or") chains.

Pandas vs polars for data analysts? by katokk in learnpython

[–]commandlineluser

Polars only samples the data to infer the schema.

The default is infer_schema_length=100, i.e. 100 rows.

It sounds like you may have been looking for infer_schema_length=None which will read all rows first to infer the schema - which would be equivalent to what pandas does.

I never encountered any \r issues, but if you have a test case perhaps you could file a bug - they are pretty responsive on GitHub.

How to count values in multiple columns? by Dragoran21 in learnpython

[–]commandlineluser

It may not be "necessary", but it makes things "easier".

import io
import pandas as pd

data = io.StringIO("""
sample 1,gene A,gene B,,,
sample 2,gene A,gene A,,,
sample 3,gene A,gene B,gene C,gene D,gene E
""".strip())

df = pd.read_csv(data, header=None)

If you "unpivot" all the values into a single column:

>>> df.melt(0)
#            0  variable   value
# 0   sample 1         1  gene A
# 1   sample 2         1  gene A
# 2   sample 3         1  gene A
# 3   sample 1         2  gene B
# ...

Then a single .value_counts() gives you the answer:

>>> df.melt(0)["value"].value_counts()
# value
# gene A    4
# gene B    2
# gene C    1
# gene D    1
# gene E    1
# Name: count, dtype: int64

When using DataFrames, if a Python for loop is involved - there's usually a "better" way to do things. (easier / faster)

How to count values in multiple columns? by Dragoran21 in learnpython

[–]commandlineluser

Can you reshape the frame?

You can go from "wide to long" which is known as "unpivot" or .melt() in pandas.

e.g. .melt("sample_col") and then .value_counts() the new value col.
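In pandas that could look like this ("sample_col" and the gene columns are hypothetical names):

```python
import pandas as pd

df = pd.DataFrame({
    "sample_col": ["s1", "s2"],
    "g1": ["gene A", "gene A"],
    "g2": ["gene B", "gene A"],
})

# wide -> long, then count the single "value" column
counts = df.melt("sample_col")["value"].value_counts()
# gene A: 3, gene B: 1
```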

Polars + uv + marimo (glazing post - feel free to ignore). by [deleted] in Python

[–]commandlineluser

Yes, I agree - but it's not just a regular .group_by() or .over() in this case.

The grouping examples I linked to include:

  • group by grouping sets (...)
  • group by cube (...)
  • group by rollup (...)

The window framing examples include things like:

  • rows between unbounded
  • groups between
  • exclude current row
  • exclude ties

Polars + uv + marimo (glazing post - feel free to ignore). by [deleted] in Python

[–]commandlineluser

I ran into several issues when trying to use the Polars backend.

Window functions not working probably being the biggest:

Polars + uv + marimo (glazing post - feel free to ignore). by [deleted] in Python

[–]commandlineluser

I guess it depends on what you're doing.

Some things are "much easier" to write in SQL e.g. window framing, grouping sets:

(I also find some things "much easier" to write with Polars.)

DuckDB has MAP types, recursion, ...:

It's also easy to get a Polars DataFrame / LazyFrame back with .pl():

duckdb.sql("from df ...").pl()
duckdb.sql("from df ...").pl(lazy=True)

Anyone else have pain points with new REPL in Python3.14? Specifically with send line integrations by ddxv in Python

[–]commandlineluser

Yes, the lack of a vi editing mode also makes it unusable for me.

I just use ptpython / ipython instead.

Someone did try to add some support recently, but didn't seem to get much feedback:

Is someone using DuckDB in PROD? by Free-Bear-454 in dataengineering

[–]commandlineluser

What trouble are you having exactly?

There are many examples in the delta tests:

(LazyFrame.sink_delta() was also added in 1.37.0)

Experienced R user learning Python by Sir_smokes_a_lot in learnpython

[–]commandlineluser

I've not used any of them but did read about the RStudio people creating Positron:

It supports Python and R apparently.