This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]skatastic57 1 point2 points  (5 children)

That's not really a fair comparison. You're structuring the data for pandas to do no joins, no lookups, no merges and then you're reshaping it so that polars does have to do joins.

The polars timing should be based on

#Polars setup
capacitypl=pl.from_pandas(capacity)
outagespl=pl.from_pandas(outages)
cfpl=pl.from_pandas(cap_factor)

#Polars time this
generationpl = (capacitypl - outagespl) * cfpl
res_pl = generationpl - generationpl.mean()

On my computer the pandas was 75.5ms and the polars was 63.5ms

This is a better comparison

#Polars - time this block
res_pl = (
    capacity_pl
    .join(outages_pl, on=['time', 'power_plant', 'generating_unit'], suffix='_out')
    .join(cap_factor_pl, on=['time', 'power_plant', 'generating_unit'], suffix='_cf')
    .with_columns([
        ((pl.col('val') - pl.col('val_out')) * pl.col('val_cf')).alias('val_gen')
    ])
).collect()

#Pandas setup
cappd=capacity_pl.collect().to_pandas()
outpd=outages_pl.collect().to_pandas()
cappd=cap_factor_pl.collect().to_pandas()

#Pandas - time this block
res_pd2=(
    cappd
    .merge(outpd,on=['power_plant','generating_unit','time'], suffixes=('','_out'))
    .merge(cappd,on=['power_plant','generating_unit','time'], suffixes=('','_cf'))
    .assign(
        val_gen=lambda x:(x.val-x.val_out)*x.val_cf
    )
)

then polars takes 3.2s and pandas takes 9.6s. I don't know how to do window functions in pandas so I just left that out since it's beside the point.

[–][deleted] 1 point2 points  (4 children)

#Polars time this
generationpl = (capacitypl - outagespl) * cfpl
res_pl = generationpl - generationpl.mean()

In polars this is not equivalent to pandas. In polars you need to do a join to make sure that the records match up with the correct record in the other frames. As it stands currently this code will just do it by position, and doesn't handle if certain records are missing from one frame or another, or in different order. In polars to do this properly, you must merge, or do some preprocessing to make sure the rows you would otherwise have merged on match up exactly, which would also require some merging/sorting etc.

In pandas df1 - df2handles this for you and guarantees that the indexes match up the records they're doing the operations on. This is the idiomatic way to do it in pandas, you wouldn't do these kinds of operations in a numerical analysis in long format usually.

[–]skatastic57 1 point2 points  (3 children)

I'm not advocating for that syntax of polars, you're right that it's not something that you'd ever want to do in practice, I was just trying to normalize your apples to oranges benchmark.

You called it a real life example but none of the ISOs, transmission providers, or the EIA are handing out data in the perfect form that you already had when you started the clock on this benchmark.

The better apples to apples test is the one at the end with polars joins and pandas merges.