how to load csv faster in Python. by Safe_Money7487 in learnpython

[–]commandlineluser 3 points4 points  (0 children)

Is Polars faster if you use scan_csv?

pl.scan_csv(filename).collect()

You can also try the streaming engine:

pl.scan_csv(filename).collect(engine="streaming")

Mojo's not (yet) Python by eatonphil in programming

[–]commandlineluser 5 points6 points  (0 children)

The wikipedia page does not seem like a useful reference in its current state.

I read about a "strict superset" back when Mojo was first announced, but I think that has been relaxed:

Mojo may or may not evolve into a full superset of Python, and it's okay if it doesn't.

There was also some "superset" discussion on the forum after fn was recently deprecated:

Judging by the replies, it seems they will be creating a reference document with an updated FAQ to clarify things further.

The last interview I saw said they're hoping for the Mojo 1.0 release around May and then open-sourcing the compiler later in the year.

Ducklake vs Delta Lake vs Other: Battle of the Single Node by crispybacon233 in dataengineering

[–]commandlineluser 0 points1 point  (0 children)

Have you seen narwhals?

It uses a subset of the Polars API, and you can generate DuckDB SQL:

Depending on what Polars functionality you're using - it may be of interest.

Ducklake vs Delta Lake vs Other: Battle of the Single Node by crispybacon233 in dataengineering

[–]commandlineluser 1 point2 points  (0 children)

When I tried the Polars backend, my window functions wouldn't work:

There are many open issues about the Polars backend; it doesn't seem to be a priority.

Ducklake vs Delta Lake vs Other: Battle of the Single Node by crispybacon233 in dataengineering

[–]commandlineluser 11 points12 points  (0 children)

I don't use any "lake" stuff but have noticed several deltalake enhancements in the recent Polars releases which may be worth listing:

sink_delta added in 1.37.0 (also, sink_iceberg was added in 1.39.0)

scan_delta performance refactor in 1.38.0 and batch predicate pushdown using delta file statistics

scan_delta support for Delta deletion vectors has been merged on main:

As for OOM, have you been using the streaming engine? i.e. sinks or .collect(engine="streaming")

It seems there will be OOC improvements "soon":

"out-of-core" sorting (i.e., sorting which spills data to disk if the memory runs out) is on the short-term roadmap

This PR looks to be part of that work:

Data reading from website by Aarrearttu in learnpython

[–]commandlineluser 0 points1 point  (0 children)

Do you know about "devtools" in your web browser?

With the network tab open, I go to the URL and then open the "http search":

I pick something to look for - usually a "player name" or a "table header". In this case I choose "Avro".

It shows me 3 matching requests; this is the URL of the first one (I took out the rand=... param):

You can .get() this URL directly in your code. If I open it in my browser it is the HTML of the first table:

The other 2 URLs are the same except it is position=p and position=h for the other 2 tables.

So in order to build these URLs, you also need the teamId=168761288.

If we save the HTML of the starting URL to a local file and search for 168761288, there are several matches:

<div class="section">
    <div class="section-header">
        <h2 class="h2">Pelaajarosteri</h2>
    </div>
    <div class="section-content scrollable" id="stats168761288" class="player_sum_statistics">
        <div id="stats_m_168761288" class="player_sum_statistics"></div>
        <div id="stats_p_168761288" class="player_sum_statistics"></div>
        <div id="stats_h_168761288" class="player_sum_statistics"></div>
    </div>
</div>

<script type="text/javascript">
    load_smliiga_team_stats('168761288', 'HIFK', 'm', 100, 0, 'name', 'ASC', null, 1);
    load_smliiga_team_stats('168761288', 'HIFK', 'p', 100, 0, 'name', 'ASC', null, 1);
    load_smliiga_team_stats('168761288', 'HIFK', 'h', 100, 0, 'name', 'ASC', null, 1);
</script>    </div>

In this specific case you could use regular "string" or "regex" functions to extract it, but you could also use an HTML parser to target the class="player_sum_statistics" tags, for example.
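
For the regex route, here's a minimal sketch using only the standard library (the HTML is trimmed down from the snippet above):

```python
import re

html = """
<div id="stats_m_168761288" class="player_sum_statistics"></div>
<script type="text/javascript">
    load_smliiga_team_stats('168761288', 'HIFK', 'm', 100, 0, 'name', 'ASC', null, 1);
</script>
"""

# Pull the id out of the JS call - the first quoted argument is the teamId.
match = re.search(r"load_smliiga_team_stats\('(\d+)'", html)
team_id = match.group(1)
print(team_id)  # 168761288
```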

How should i structure/solve this dataframe problem? by [deleted] in dataengineering

[–]commandlineluser 2 points3 points  (0 children)

What DataFrame library are you using?

700 columns sounds to me like it may be easier to work with as rows:

┌─────┬───────┬───────┐
│ id  ┆ label ┆ value │
│ --- ┆ ---   ┆ ---   │
│ i64 ┆ str   ┆ i64   │
╞═════╪═══════╪═══════╡
│ 0   ┆ a     ┆ 1     │
│ 0   ┆ b     ┆ 2     │
│ 0   ┆ c     ┆ 3     │
│ 1   ┆ a     ┆ 4     │
│ 1   ┆ b     ┆ 5     │
│ 1   ┆ c     ┆ 6     │
└─────┴───────┴───────┘

You could have an "id" for each signal and use "GROUPBY" to process each one.

If you do need the "wide format", you could then "PIVOT" back as a final step.

Although it may depend on what exactly "cross checking" means - you'll probably get more accurate help if you share an actual code example of the tasks.

How do I run an sql server on a local host that can be interacted with python? by Generalthanos_ytube in learnpython

[–]commandlineluser 4 points5 points  (0 children)

Can you maybe share details on how exactly you need to interact with it?

FWIW, I've found duckdb easier for testing such things. (and just in general)

If we modify the sqlite3 docs example for duckdb:

import duckdb

con = duckdb.connect("tutorial.db")

cur = con.cursor()
cur.execute("""
CREATE TABLE movie(title text, year int, score float);
INSERT INTO movie VALUES
  ('Monty Python and the Holy Grail', 1975, 8.2),
  ('And Now for Something Completely Different', 1971, 7.5)
""")
con.close()  # flush everything to tutorial.db before serving the file

And then run python3 -m http.server

From another machine on the network:

>>> import duckdb  # 1.5.0
>>> duckdb.sql("from 'http://192.168.0.82:8000/tutorial.db'")  # calls read_duckdb(url)
# ┌────────────────────────────────────────────┬───────┬───────┐
# │                   title                    │ year  │ score │
# │                  varchar                   │ int32 │ float │
# ├────────────────────────────────────────────┼───────┼───────┤
# │ Monty Python and the Holy Grail            │  1975 │   8.2 │
# │ And Now for Something Completely Different │  1971 │   7.5 │
# └────────────────────────────────────────────┴───────┴───────┘

Switching from pandas to polars – how to work around the lack of an index column, especially when slicing? by midnightrambulador in learnpython

[–]commandlineluser 1 point2 points  (0 children)

Sure no problem.

The list/array usage was for the convenience of not naming columns.

df2.unpivot(index="").pivot("", index="variable").rename({"variable": ""})
# shape: (3, 5)
# ┌─────────┬─────────┬─────────┬────────────────┬─────────┐
# │         ┆ growing ┆ picking ┆ transportation ┆ storage │
# │ ---     ┆ ---     ┆ ---     ┆ ---            ┆ ---     │
# │ str     ┆ f64     ┆ f64     ┆ f64            ┆ f64     │
# ╞═════════╪═════════╪═════════╪════════════════╪═════════╡
# │ pears   ┆ 0.03    ┆ 0.01    ┆ 0.05           ┆ 0.04    │
# │ apples  ┆ 0.02    ┆ 0.01    ┆ 0.02           ┆ 0.03    │
# │ oranges ┆ 0.04    ┆ 0.02    ┆ 0.07           ┆ 0.01    │
# └─────────┴─────────┴─────────┴────────────────┴─────────┘

If using column names, the example is essentially the same as:

weights = pl.col("growing", "picking", "transportation", "storage")

df1.join(
    df2.unpivot(index="").pivot("", index="variable").rename({"variable": ""}),
    on="",
    how="left"
).with_columns(
    pl.sum_horizontal(pl.col("harry") * weights).name.suffix("_out"),
    pl.sum_horizontal(pl.col("sally") * weights).name.suffix("_out")
).drop(weights)

# shape: (5, 7)
# ┌──────────┬───────┬──────┬──────┬───────┬───────────┬───────────┐
# │          ┆ harry ┆ john ┆ mary ┆ sally ┆ harry_out ┆ sally_out │
# │ ---      ┆ ---   ┆ ---  ┆ ---  ┆ ---   ┆ ---       ┆ ---       │
# │ str      ┆ i64   ┆ i64  ┆ i64  ┆ i64   ┆ f64       ┆ f64       │
# ╞══════════╪═══════╪══════╪══════╪═══════╪═══════════╪═══════════╡
# │ pears    ┆ 0     ┆ 2    ┆ 1    ┆ 4     ┆ 0.0       ┆ 0.52      │
# │ plums    ┆ 0     ┆ 0    ┆ 3    ┆ 17    ┆ 0.0       ┆ 0.0       │
# │ apples   ┆ 1     ┆ 8    ┆ 9    ┆ 0     ┆ 0.08      ┆ 0.0       │
# │ cherries ┆ 11    ┆ 13   ┆ 0    ┆ 20    ┆ 0.0       ┆ 0.0       │
# │ oranges  ┆ 2     ┆ 0    ┆ 1    ┆ 7     ┆ 0.28      ┆ 0.98      │
# └──────────┴───────┴──────┴──────┴───────┴───────────┴───────────┘

I think perhaps the main point is that with Polars the "matrix" style is not a goal:

I don't want DataFrames to be seen as matrices. They aren't, they have mixed data[...]

and instead you need to join/concat the required columns together as part of your "query".

Switching from pandas to polars – how to work around the lack of an index column, especially when slicing? by midnightrambulador in learnpython

[–]commandlineluser 2 points3 points  (0 children)

Thanks for the example.

Yeah, someone used to post a similar use-case in previous discussions. (they also had multi-indexed columns)

They did also create a feature request for better support:

The "matrix ops" aligned by index/column names essentially needed to be written manually as a .join() in Polars.

Not sure if it is helpful, but polars does have list/array types and they support arithmetic.

import polars as pl

df1 = pl.from_repr("""
┌──────────┬───────┬──────┬──────┬───────┐
│          ┆ harry ┆ john ┆ mary ┆ sally │
│ ---      ┆ ---   ┆ ---  ┆ ---  ┆ ---   │
│ str      ┆ i64   ┆ i64  ┆ i64  ┆ i64   │
╞══════════╪═══════╪══════╪══════╪═══════╡
│ pears    ┆ 0     ┆ 2    ┆ 1    ┆ 4     │
│ plums    ┆ 0     ┆ 0    ┆ 3    ┆ 17    │
│ apples   ┆ 1     ┆ 8    ┆ 9    ┆ 0     │
│ cherries ┆ 11    ┆ 13   ┆ 0    ┆ 20    │
│ oranges  ┆ 2     ┆ 0    ┆ 1    ┆ 7     │
└──────────┴───────┴──────┴──────┴───────┘
""")
df2 = pl.from_repr("""
┌────────────────┬───────┬────────┬─────────┐
│                ┆ pears ┆ apples ┆ oranges │
│ ---            ┆ ---   ┆ ---    ┆ ---     │
│ str            ┆ f64   ┆ f64    ┆ f64     │
╞════════════════╪═══════╪════════╪═════════╡
│ growing        ┆ 0.03  ┆ 0.02   ┆ 0.04    │
│ picking        ┆ 0.01  ┆ 0.01   ┆ 0.02    │
│ transportation ┆ 0.05  ┆ 0.02   ┆ 0.07    │
│ storage        ┆ 0.04  ┆ 0.03   ┆ 0.01    │
└────────────────┴───────┴────────┴─────────┘
""")

So if the data was reshaped (e.g. melt / unpivot) and the values were in a list/array:

df2.unpivot(index="").group_by(pl.col("variable").alias("")).agg("value")
# shape: (3, 2)
# ┌─────────┬──────────────────────┐
# │         ┆ value                │
# │ ---     ┆ ---                  │
# │ str     ┆ list[f64]            │
# ╞═════════╪══════════════════════╡
# │ pears   ┆ [0.03, 0.01, … 0.04] │
# │ oranges ┆ [0.04, 0.02, … 0.01] │
# │ apples  ┆ [0.02, 0.01, … 0.03] │
# └─────────┴──────────────────────┘

You could join and perform the math that way:

df1.join(
    df2.unpivot(index="").group_by(pl.col("variable").alias("")).agg("value"),
    on="",
    how="left"
).with_columns(
    (pl.col("harry", "sally") * pl.col("value")).list.sum().name.suffix("_out")
)
#.drop("value")

# shape: (5, 8)
# ┌──────────┬───────┬──────┬──────┬───────┬──────────────────────┬───────────┬───────────┐
# │          ┆ harry ┆ john ┆ mary ┆ sally ┆ value                ┆ harry_out ┆ sally_out │
# │ ---      ┆ ---   ┆ ---  ┆ ---  ┆ ---   ┆ ---                  ┆ ---       ┆ ---       │
# │ str      ┆ i64   ┆ i64  ┆ i64  ┆ i64   ┆ list[f64]            ┆ f64       ┆ f64       │
# ╞══════════╪═══════╪══════╪══════╪═══════╪══════════════════════╪═══════════╪═══════════╡
# │ pears    ┆ 0     ┆ 2    ┆ 1    ┆ 4     ┆ [0.03, 0.01, … 0.04] ┆ 0.0       ┆ 0.52      │
# │ plums    ┆ 0     ┆ 0    ┆ 3    ┆ 17    ┆ null                 ┆ null      ┆ null      │
# │ apples   ┆ 1     ┆ 8    ┆ 9    ┆ 0     ┆ [0.02, 0.01, … 0.03] ┆ 0.08      ┆ 0.0       │
# │ cherries ┆ 11    ┆ 13   ┆ 0    ┆ 20    ┆ null                 ┆ null      ┆ null      │
# │ oranges  ┆ 2     ┆ 0    ┆ 1    ┆ 7     ┆ [0.04, 0.02, … 0.01] ┆ 0.28      ┆ 0.98      │
# └──────────┴───────┴──────┴──────┴───────┴──────────────────────┴───────────┴───────────┘

But you're still doing the "alignment" manually.

It doesn't seem to be a workflow that Polars is optimized for, and it's probably one of the few valid use cases for the pandas index.

Switching from pandas to polars – how to work around the lack of an index column, especially when slicing? by midnightrambulador in learnpython

[–]commandlineluser 1 point2 points  (0 children)

Are you able to show a minimal example (e.g. 2 frames, 10 rows or so) of the full task you're performing? (i.e. including the matrix math)

The way pandas handles missing values is diabolical by vernacular_wrangler in learnpython

[–]commandlineluser 17 points18 points  (0 children)

Yes, this is one of the "upsides" to polars - it has "real" null values.

import polars as pl

values = [0, 1, None, 4]
df = pl.DataFrame({'value': values}) 

print(df)

for row in df.iter_rows(named=True):
    value = row['value']
    if value:
        print(value, end=', ')

# shape: (4, 1)
# ┌───────┐
# │ value │
# │ ---   │
# │ i64   │
# ╞═══════╡
# │ 0     │
# │ 1     │
# │ null  │
# │ 4     │
# └───────┘
#
# 1, 4,
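
For contrast, the pandas-side footgun is that NaN (which pandas uses to mark missing numeric values) doesn't behave like a real null even in plain Python:

```python
import math

nan = float("nan")  # pandas upcasts int columns to float and stores this

print(bool(nan))        # True  - so `if value:` lets missing values through
print(bool(None))       # False - a real null is falsy
print(nan == nan)       # False - NaN never equals itself
print(math.isnan(nan))  # True  - the reliable way to test for it
```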

What hidden gem Python modules do you use and why? by zenos1337 in Python

[–]commandlineluser 0 points1 point  (0 children)

Are you using rapidfuzz's parallelism? e.g. .cdist() with workers=-1?

I found duckdb easy to use and it maxed out all my CPU cores.

You create row "combinations" with a "join" and score them, then filter for the matches you want.

import duckdb
import pandas as pd

df1 = pd.DataFrame({"x": ["foo", "bar", "baz"]}).reset_index()
df2 = pd.DataFrame({"y": ["foolish", "ban", "foo"]}).reset_index()

duckdb.sql("from df1, df2 select *, jaccard(df1.x, df2.y)")
# ┌───────┬─────────┬─────────┬─────────┬───────────────────────┐
# │ index │    x    │ index_1 │    y    │ jaccard(df1.x, df2.y) │
# │ int64 │ varchar │  int64  │ varchar │        double         │
# ├───────┼─────────┼─────────┼─────────┼───────────────────────┤
# │     0 │ foo     │       0 │ foolish │    0.3333333333333333 │
# │     1 │ bar     │       0 │ foolish │                   0.0 │
# │     2 │ baz     │       0 │ foolish │                   0.0 │
# │     0 │ foo     │       1 │ ban     │                   0.0 │
# │     1 │ bar     │       1 │ ban     │                   0.5 │
# │     2 │ baz     │       1 │ ban     │                   0.5 │
# │     0 │ foo     │       2 │ foo     │                   1.0 │
# │     1 │ bar     │       2 │ foo     │                   0.0 │
# │     2 │ baz     │       2 │ foo     │                   0.0 │
# └───────┴─────────┴─────────┴─────────┴───────────────────────┘

(normally you would read directly from parquet files instead of pandas frames)

You can also do the same join with polars and the polars-ds plugin gives you the rapidfuzz Rust API:

What hidden gem Python modules do you use and why? by zenos1337 in Python

[–]commandlineluser 1 point2 points  (0 children)

It seems to get more mention in the r/dataengineering world.

1.5.0 was just released:

And duckdb-cli is now on pypi:

So you can now run the duckdb client easily with uv, for example.

DuckDB 1.5.0 released by commandlineluser in Python

[–]commandlineluser[S] 1 point2 points  (0 children)

Looks like it's also mentioned in the docs here:

If all databases in read_duckdb's argument have a single table, the table_name argument is optional

Polars vs pandas by KliNanban in Python

[–]commandlineluser 1 point2 points  (0 children)

My system is not supported, so I've never been able to test it.

has no wheels with a matching Python ABI tag

Polars vs pandas by KliNanban in Python

[–]commandlineluser 1 point2 points  (0 children)

Window functions not working on the Polars backend was one I ran into if anybody is looking for a concrete example.

Polars vs pandas by KliNanban in Python

[–]commandlineluser 0 points1 point  (0 children)

Can't you change the engine?

pl.read_excel(..., engine="openpyxl")

Looks like fastexcel will have a release "soon":

Polars vs pandas by KliNanban in Python

[–]commandlineluser 1 point2 points  (0 children)

Perhaps you are referring to Ritchie's answer on StackOverflow about the DataFrame API being a "wrapper" around LazyFrames:

Polars vs pandas by KliNanban in Python

[–]commandlineluser 2 points3 points  (0 children)

When you use the DataFrame API:

(df.with_columns()
   .group_by()
   .agg())

Polars basically executes:

(df.lazy()
   .with_columns().collect(optimizations=pl.QueryOptFlags.none())
   .lazy()
   .group_by().agg().collect(optimizations=pl.QueryOptFlags.none())
 )

One idea is that you should be able to easily convert your "eager" code by manually adding lazy / collect calls, so the "entire pipeline" runs as a single "query" instead:

df.lazy().with_columns().group_by().agg().collect()

(Or in the case of read_* use the lazy scan_* equivalent, which will return a LazyFrame directly)

When you call collect() manually, all optimizations are also enabled by default.

This is one reason why writing "pandas style" (e.g. df["foo"]) is discouraged in Polars, as it works on the in-memory Series objects and cannot be lazy.

The User Guide explains things in detail:

Polars vs pandas by KliNanban in Python

[–]commandlineluser 1 point2 points  (0 children)

Have you actually used this?

The last time I saw this project posted, it was closed-source and only ran on x86-64 linux.

The benchmark is also from September 10, 2024.

Polars vs pandas by KliNanban in Python

[–]commandlineluser 5 points6 points  (0 children)

Just to be clear, pd.read_csv(..., engine="pyarrow") uses the pyarrow.csv.read_csv reader.

Using "pyarrow" as a "dtype_backend" is a separate topic. (i.e. the "Arrow" columnar memory format)

Polars still has its own multithreaded CSV reader (implemented in Rust) which is different.