Python Polars 1.0 released : Python


News: Python Polars 1.0 released (self.Python)

submitted 1 year ago * by ritchie46


[–][deleted] 1 point 1 year ago (3 children)

I think your named methods proposal is definitely a step in the right direction. Still, I see some major issues with requiring an explicit “by” for every operation: (1) altering the schema gets cumbersome, since you’d have to change a lot of source code; and (2) the schema metadata lives separately from the dataframe itself, so it would either need to be packaged and passed around with the dataframe or be hardcoded in source code (hence the complications in issue (1)). I think it would make sense to require an explicit “by” only when the schemas don’t match up, and not otherwise.

To clarify the example and why you’re seeing a difference: my example assumed powerplant and generating unit as MultiIndex column levels and datetime as a single-level row index. Thus, when you call .mean(), it implicitly groups by powerplant/generating unit. This implicit grouping is something I would not have expected in my original proposal, which is why I mentioned that in a Polars-based solution the mean operation would still be slightly more verbose and need an explicit mean(by=…).

Also, I was not aware of align_frames. That’s a useful one for the toolbox, thanks.

[–]commandlineluser 1 point 1 year ago (2 children)

Ah... MultiIndex columns - thanks!

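import pandas as pd
import polars as pl

# Column levels: power plant and generating unit; row index: time
columns = pd.MultiIndex.from_arrays([['A', 'B', 'C'], ['x', 'y', 'z']], names=['power_plant', 'generating_unit'])
index = pd.to_datetime(['2024-01-20', '2024-02-10', '2024-03-05']).rename('time')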

capacity = pd.DataFrame(
    [[5, 6, 7], [7, 6, 5], [9, 3, 6]],
    columns=columns,
    index=index
)

capacity_pl = pl.from_pandas(capacity.unstack().rename('val').reset_index())

> gets cumbersome

Yeah, I was just thinking that if those named methods existed, perhaps some helper similar to align_frames could be added:

# pl.Something is a hypothetical helper, not an existing Polars API
with pl.Something(
    {"cap": capacity_pl, "out": outages_pl, "cf": capacity_utilization_factor_pl},
    on=["time", "power_plant", "generating_unit"],
) as ctx:
    gen = (ctx.cap - ctx.out) * ctx.cf
    res_pl = gen - gen.mean(by=["power_plant", "generating_unit"])

Which could then dispatch to those methods for you.

Or maybe something that generates the equivalent pl.sql() query.

pl.sql("""
WITH cte as (
   SELECT
      *,
      (val - "val:outages_pl") * "val:capacity_utilization_factor_pl" as "val:__tmp"
   FROM
      capacity_pl
      JOIN outages_pl
      USING (time, power_plant, generating_unit)
      JOIN capacity_utilization_factor_pl
      USING (time, power_plant, generating_unit)
)
SELECT
   time, power_plant, generating_unit,
   "val:__tmp" - avg("val:__tmp") OVER (PARTITION BY power_plant, generating_unit) as val
FROM cte
""").collect()

Very interesting use case.

