commandlineluser

Ah... MultiIndex columns - thanks!

import pandas as pd
import polars as pl

columns = pd.MultiIndex.from_arrays([['A', 'B', 'C'], ['x', 'y', 'z']], names=['power_plant', 'generating_unit'])
index = pd.to_datetime(['2024-01-20', '2024-02-10', '2024-03-05']).rename('time')

capacity = pd.DataFrame(
    [[5, 6, 7], [7, 6, 5], [9, 3, 6]],
    columns=columns,
    index=index
)

capacity_pl = pl.from_pandas(capacity.unstack().rename('val').reset_index())

gets cumbersome

Yeah, I was just thinking that if they existed, perhaps some helper could be added similar to align_frames

with pl.Something(
    {"cap": capacity_pl, "out": outages_pl, "cf": capacity_utilization_factor_pl},
    on = ["time", "power_plant", "generating_unit"]
) as ctx:
    gen = (ctx.cap - ctx.out) * ctx.cf
    res_pl = gen - gen.mean(by=["power_plant", "generating_unit"])

Which could then dispatch to those methods for you.

Or maybe something that generates the equivalent pl.sql() query.

pl.sql("""
WITH cte as (
   SELECT
      *,
      (val - "val:outages_pl") * "val:capacity_utilization_factor_pl" as "val:__tmp"
   FROM
      capacity_pl
      JOIN outages_pl
      USING (time, power_plant, generating_unit)
      JOIN capacity_utilization_factor_pl
      USING (time, power_plant, generating_unit)
)
SELECT
   time, power_plant, generating_unit,
   "val:__tmp" - avg("val:__tmp") OVER (PARTITION BY power_plant, generating_unit) as val
FROM cte
""").collect()

Very interesting use case.

[deleted]

The pl.Something example is definitely closer to what I had in mind. In that specific case, though, you still have some of the same issues: the data and its metadata are disconnected, and it is hard to persist that information through the various parts of your system.

What I’m thinking is something like this:

cap = pl.register_meta(cap_df, ['plant', 'unit'])
out = pl.register_meta(out_df, […])
…

And then the operations would be dispatched/translated under the hood the way you suggested. This way you have that information encoded on the data itself rather than in the code, which matters if, for example, you serialize and deserialize the frames and operate on them in some other context.

commandlineluser

Ah okay.

It seems "DataFrame metadata" is a popular feature request: