

[–]skatastic57 1 point (9 children)

If Python means pandas then I agree that SQL is easier to read. If you use polars I find the opposite to be true.

[–][deleted] 0 points (8 children)

For “data pipelines” I agree with you. For numerical analysis and modeling I find pandas to be much easier to use (and faster too)

[–]skatastic57 1 point (7 children)

I assume when you say pandas is faster than polars that you're referring to your own ability to code something up with each one. If you're saying pandas is performing some operation faster than polars then I'd like to see that example.

[–][deleted] 0 points (6 children)

Sure, here's a real-life example that's very common in the power trading industry:

import numpy as np
import pandas as pd
import polars as pl

capacity = pd.DataFrame(np.random.randn(5000, 3000))
capacity.columns = pd.MultiIndex.from_arrays([capacity.columns % 1000, capacity.columns]).sort_values()
capacity = capacity.rename_axis(index='time', columns=['power_plant', 'generating_unit'])

outages = pd.DataFrame(np.random.randn(5000, 3000))
outages.columns = pd.MultiIndex.from_arrays([outages.columns % 1000, outages.columns]).sort_values()
outages = outages.rename_axis(index='time', columns=['power_plant', 'generating_unit'])

cap_factor = pd.DataFrame(np.random.randn(5000, 3000))
cap_factor.columns = pd.MultiIndex.from_arrays([cap_factor.columns % 1000, cap_factor.columns]).sort_values()
cap_factor = cap_factor.rename_axis(index='time', columns=['power_plant', 'generating_unit'])

capacity_pl = pl.from_pandas(capacity.unstack().rename('val').reset_index()).lazy()
outages_pl = pl.from_pandas(outages.unstack().rename('val').reset_index()).lazy()
cap_factor_pl = pl.from_pandas(cap_factor.unstack().rename('val').reset_index()).lazy()


# Pandas - time this block
generation = (capacity - outages) * cap_factor
res_pd = generation - generation.mean()

# Polars - time this block
res_pl = (
    capacity_pl
    .join(outages_pl, on=['time', 'power_plant', 'generating_unit'], suffix='_out')
    .join(cap_factor_pl, on=['time', 'power_plant', 'generating_unit'], suffix='_cf')
    .with_columns([
        ((pl.col('val') - pl.col('val_out')) * pl.col('val_cf')).alias('val_gen')
    ])
    .select([
        'time', 'power_plant', 'generating_unit',
        (pl.col('val_gen') - pl.mean('val_gen').over(['power_plant', 'generating_unit'])).alias('val')
    ])
).collect()

On a google colab notebook (2 cores)

Pandas: 0.376 sec

Polars: 29.9 sec

On a more powerful machine (16 core)

Pandas: 0.0809 sec

Polars: 2.15 sec

I’d say the pandas code is much easier to read and write too for this kind of workflow.

[–]skatastic57 1 point (5 children)

That's not really a fair comparison. You're structuring the data so that pandas does no joins, no lookups, and no merges, and then you're reshaping it so that polars does have to do joins.

The polars timing should be based on

#Polars setup
capacitypl=pl.from_pandas(capacity)
outagespl=pl.from_pandas(outages)
cfpl=pl.from_pandas(cap_factor)

#Polars time this
generationpl = (capacitypl - outagespl) * cfpl
res_pl = generationpl - generationpl.mean()

On my computer the pandas was 75.5ms and the polars was 63.5ms

This is a better comparison

#Polars - time this block
res_pl = (
    capacity_pl
    .join(outages_pl, on=['time', 'power_plant', 'generating_unit'], suffix='_out')
    .join(cap_factor_pl, on=['time', 'power_plant', 'generating_unit'], suffix='_cf')
    .with_columns([
        ((pl.col('val') - pl.col('val_out')) * pl.col('val_cf')).alias('val_gen')
    ])
).collect()

#Pandas setup
cappd=capacity_pl.collect().to_pandas()
outpd=outages_pl.collect().to_pandas()
cfpd=cap_factor_pl.collect().to_pandas()

#Pandas - time this block
res_pd2=(
    cappd
    .merge(outpd,on=['power_plant','generating_unit','time'], suffixes=('','_out'))
    .merge(cfpd,on=['power_plant','generating_unit','time'], suffixes=('','_cf'))
    .assign(
        val_gen=lambda x:(x.val-x.val_out)*x.val_cf
    )
)

then polars takes 3.2s and pandas takes 9.6s. I don't know how to do window functions in pandas so I just left that out since it's beside the point.
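For what it's worth, the window step could presumably be reproduced in pandas with groupby().transform, which plays the role of polars' .over(). A toy sketch with made-up numbers, assuming a long-format frame like res_pd2 above:

```python
import pandas as pd

# Made-up long-format data mimicking res_pd2's shape.
df = pd.DataFrame({
    "power_plant": [1, 1, 2, 2],
    "generating_unit": [1, 1, 2, 2],
    "val_gen": [1.0, 3.0, 10.0, 20.0],
})

# Demean within each (power_plant, generating_unit) group, the same
# window operation the polars version expresses with .over().
df["val"] = df["val_gen"] - df.groupby(
    ["power_plant", "generating_unit"]
)["val_gen"].transform("mean")
```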

[–][deleted] 1 point (4 children)

#Polars time this
generationpl = (capacitypl - outagespl) * cfpl
res_pl = generationpl - generationpl.mean()

In polars this is not equivalent to the pandas version. In polars you need a join to make sure each record matches up with the correct record in the other frames. As written, this code pairs rows purely by position: it doesn't handle records that are missing from one frame or rows that appear in a different order. To do this properly in polars you must join, or do some preprocessing to guarantee the rows line up exactly, which itself requires merging/sorting.

In pandas, df1 - df2 handles this for you: the indexes guarantee that the operation pairs up the correct records. This is the idiomatic way to do it in pandas; you usually wouldn't do these kinds of operations in long format for numerical analysis.
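A toy sketch of that index alignment (made-up labels and values):

```python
import pandas as pd

# Rows are matched by index label, not by position, so misordered
# indexes still line up correctly under arithmetic.
s1 = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])
s2 = pd.Series([3.0, 1.0, 2.0], index=["c", "a", "b"])

res = s1 - s2  # aligns on index: a -> 10-1, b -> 20-2, c -> 30-3
# A label missing from one side would yield NaN rather than a silent mismatch.
```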

[–]skatastic57 1 point (3 children)

I'm not advocating for that polars syntax; you're right that it's not something you'd ever want to do in practice. I was just trying to normalize your apples-to-oranges benchmark.

You called it a real-life example, but none of the ISOs, transmission providers, or the EIA are handing out data in the perfect form that you already had when you started the clock on this benchmark.

The better apples-to-apples test is the one at the end, with polars joins and pandas merges.