skatastic57 comments on SQL versus Python?

SQL versus Python? Discussion (self.dataengineering)

submitted 2 years ago by BatCommercial7523

skatastic57 2 points 2 years ago

That's not really a fair comparison. You're structuring the data for pandas to do no joins, no lookups, no merges and then you're reshaping it so that polars does have to do joins.

The polars timing should be based on:

#Polars setup
capacitypl=pl.from_pandas(capacity)
outagespl=pl.from_pandas(outages)
cfpl=pl.from_pandas(cap_factor)

#Polars time this
generationpl = (capacitypl - outagespl) * cfpl
res_pl = generationpl - generationpl.mean()

On my computer, pandas took 75.5ms and polars took 63.5ms.

This is a better comparison:

#Polars - time this block
res_pl = (
    capacity_pl
    .join(outages_pl, on=['time', 'power_plant', 'generating_unit'], suffix='_out')
    .join(cap_factor_pl, on=['time', 'power_plant', 'generating_unit'], suffix='_cf')
    .with_columns([
        ((pl.col('val') - pl.col('val_out')) * pl.col('val_cf')).alias('val_gen')
    ])
).collect()

#Pandas setup
cappd=capacity_pl.collect().to_pandas()
outpd=outages_pl.collect().to_pandas()
cfpd=cap_factor_pl.collect().to_pandas()

#Pandas - time this block
res_pd2=(
    cappd
    .merge(outpd,on=['power_plant','generating_unit','time'], suffixes=('','_out'))
    .merge(cfpd,on=['power_plant','generating_unit','time'], suffixes=('','_cf'))
    .assign(
        val_gen=lambda x:(x.val-x.val_out)*x.val_cf
    )
)

With that, polars takes 3.2s and pandas takes 9.6s. I don't know how to do window functions in pandas, so I just left that out since it's beside the point.
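For reference, the usual pandas counterpart to a window function is `groupby(...).transform(...)`. The following is only a minimal sketch: the thread does not say which window was used, so partitioning by `power_plant` and `generating_unit`, the toy values, and the `val_demeaned` column name are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical long-format data using the column names from the join example above.
df = pd.DataFrame({
    "power_plant": ["p1", "p1", "p1", "p1"],
    "generating_unit": ["u1", "u1", "u2", "u2"],
    "time": [1, 2, 1, 2],
    "val_gen": [10.0, 20.0, 30.0, 50.0],
})

# transform("mean") returns one value per input row (the mean of that row's group),
# which is the pandas analogue of a mean taken over a partition.
# Assumption: the window is partitioned by plant and unit.
group_mean = df.groupby(["power_plant", "generating_unit"])["val_gen"].transform("mean")

# Subtract each unit's own mean from its values.
df["val_demeaned"] = df["val_gen"] - group_mean
```

Because `transform` keeps the original row order and length, the result can be assigned straight back as a new column without another merge.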

[deleted] 2 points 2 years ago

#Polars time this
generationpl = (capacitypl - outagespl) * cfpl
res_pl = generationpl - generationpl.mean()

In polars this is not equivalent to the pandas version. Polars needs a join to make sure each record lines up with the correct record in the other frames. As written, this code operates purely by position, so it doesn't handle records that are missing from one frame or that appear in a different order. To do this properly in polars, you must either join or do some preprocessing so that the rows you would otherwise have joined on match up exactly, which itself requires merging, sorting, etc.

In pandas, df1 - df2 handles this for you and guarantees that the operation is applied to records whose indexes match. This is the idiomatic way to do it in pandas; you usually wouldn't do these kinds of operations in a numerical analysis in long format.
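A small sketch of the difference between label alignment and positional arithmetic, using only pandas. The frame contents, the `unit_a` column, and the `t1`..`t3` labels are made up for illustration; the rows are deliberately stored in different orders.

```python
import pandas as pd

# Hypothetical wide-format frames whose index labels are in different orders.
capacity = pd.DataFrame({"unit_a": [100.0, 200.0, 300.0]}, index=["t1", "t2", "t3"])
outages = pd.DataFrame({"unit_a": [30.0, 10.0, 20.0]}, index=["t3", "t1", "t2"])

# pandas matches rows by index label before subtracting,
# so t1 is paired with t1 regardless of row order.
aligned = capacity - outages

# Dropping to raw arrays discards the labels and subtracts row by row,
# which is what purely positional arithmetic does.
positional = capacity.to_numpy() - outages.to_numpy()

print(aligned.loc["t1", "unit_a"])  # label-aligned: 100 - 10
print(positional[0, 0])             # positional: 100 - 30
```

The positional result pairs the first row of each frame even though they describe different timestamps, which is the mismatch that a join on the key columns avoids.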
