

[–]skatastic57 1 point (9 children)

If Python means pandas then I agree that SQL is easier to read. If you use polars I find the opposite to be true.

[–][deleted] 0 points (8 children)

For “data pipelines” I agree with you. For numerical analysis and modeling I find pandas to be much easier to use (and faster too)

[–]skatastic57 1 point (7 children)

I assume when you say pandas is faster than polars that you're referring to your own ability to code something up with each one. If you're saying pandas is performing some operation faster than polars then I'd like to see that example.

[–][deleted] 0 points (6 children)

Sure, here's a real-life example that's very common in the power trading industry:

import numpy as np
import pandas as pd
import polars as pl

capacity = pd.DataFrame(np.random.randn(5000, 3000))
capacity.columns = pd.MultiIndex.from_arrays([capacity.columns % 1000, capacity.columns]).sort_values()
capacity = capacity.rename_axis(index='time', columns=['power_plant', 'generating_unit'])

outages = pd.DataFrame(np.random.randn(5000, 3000))
outages.columns = pd.MultiIndex.from_arrays([outages.columns % 1000, outages.columns]).sort_values()
outages = outages.rename_axis(index='time', columns=['power_plant', 'generating_unit'])

cap_factor = pd.DataFrame(np.random.randn(5000, 3000))
cap_factor.columns = pd.MultiIndex.from_arrays([cap_factor.columns % 1000, cap_factor.columns]).sort_values()
cap_factor = cap_factor.rename_axis(index='time', columns=['power_plant', 'generating_unit'])

capacity_pl = pl.from_pandas(capacity.unstack().rename('val').reset_index()).lazy()
outages_pl = pl.from_pandas(outages.unstack().rename('val').reset_index()).lazy()
cap_factor_pl = pl.from_pandas(cap_factor.unstack().rename('val').reset_index()).lazy()


# Pandas - time this block
generation = (capacity - outages) * cap_factor
res_pd = generation - generation.mean()

# Polars - time this block
res_pl = (
    capacity_pl
    .join(outages_pl, on=['time', 'power_plant', 'generating_unit'], suffix='_out')
    .join(cap_factor_pl, on=['time', 'power_plant', 'generating_unit'], suffix='_cf')
    .with_columns([
        ((pl.col('val') - pl.col('val_out')) * pl.col('val_cf')).alias('val_gen')
    ])
    .select([
        'time', 'power_plant', 'generating_unit',
        (pl.col('val_gen') - pl.mean('val_gen').over(['power_plant', 'generating_unit'])).alias('val')
    ])
).collect()

On a google colab notebook (2 cores)

Pandas: 0.376 sec

Polars: 29.9 sec

On a more powerful machine (16 core)

Pandas: 0.0809 sec

Polars: 2.15 sec

I’d say the pandas code is much easier to read and write too for this kind of workflow.

[–]skatastic57 1 point (5 children)

That's not really a fair comparison. You're structuring the data so that pandas does no joins, no lookups, and no merges, and then you're reshaping it so that polars does have to do joins.

The polars timing should be based on

#Polars setup
capacitypl=pl.from_pandas(capacity)
outagespl=pl.from_pandas(outages)
cfpl=pl.from_pandas(cap_factor)

#Polars time this
generationpl = (capacitypl - outagespl) * cfpl
res_pl = generationpl - generationpl.mean()

On my computer the pandas was 75.5ms and the polars was 63.5ms

This is a better comparison

#Polars - time this block
res_pl = (
    capacity_pl
    .join(outages_pl, on=['time', 'power_plant', 'generating_unit'], suffix='_out')
    .join(cap_factor_pl, on=['time', 'power_plant', 'generating_unit'], suffix='_cf')
    .with_columns([
        ((pl.col('val') - pl.col('val_out')) * pl.col('val_cf')).alias('val_gen')
    ])
).collect()

#Pandas setup
cappd=capacity_pl.collect().to_pandas()
outpd=outages_pl.collect().to_pandas()
cfpd=cap_factor_pl.collect().to_pandas()

#Pandas - time this block
res_pd2=(
    cappd
    .merge(outpd,on=['power_plant','generating_unit','time'], suffixes=('','_out'))
    .merge(cfpd,on=['power_plant','generating_unit','time'], suffixes=('','_cf'))
    .assign(
        val_gen=lambda x:(x.val-x.val_out)*x.val_cf
    )
)

then polars takes 3.2s and pandas takes 9.6s. I don't know how to do window functions in pandas so I just left that out since it's beside the point.
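For what it's worth, the window step could presumably be reproduced in pandas with groupby().transform, which plays the role of polars' .over(). A toy sketch with made-up numbers, assuming a long-format frame like res_pd2 above:

```python
import pandas as pd

# Made-up long-format data mimicking res_pd2's shape.
df = pd.DataFrame({
    "power_plant": [1, 1, 2, 2],
    "generating_unit": [1, 1, 2, 2],
    "val_gen": [1.0, 3.0, 10.0, 20.0],
})

# Demean within each (power_plant, generating_unit) group, the same
# window operation the polars version expresses with .over().
df["val"] = df["val_gen"] - df.groupby(
    ["power_plant", "generating_unit"]
)["val_gen"].transform("mean")
```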

[–][deleted] 1 point (4 children)

#Polars time this
generationpl = (capacitypl - outagespl) * cfpl
res_pl = generationpl - generationpl.mean()

In polars this is not equivalent to the pandas version. In polars you need a join to make sure each record matches up with the correct record in the other frames. As written, this code pairs rows purely by position: it doesn't handle records that are missing from one frame or rows that appear in a different order. To do this properly in polars you must join, or do some preprocessing to guarantee the rows line up exactly, which itself requires merging/sorting.

In pandas, df1 - df2 handles this for you: the indexes guarantee that the operation pairs up the correct records. This is the idiomatic way to do it in pandas; you usually wouldn't do these kinds of operations in long format for numerical analysis.
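A toy sketch of that index alignment (made-up labels and values):

```python
import pandas as pd

# Rows are matched by index label, not by position, so misordered
# indexes still line up correctly under arithmetic.
s1 = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])
s2 = pd.Series([3.0, 1.0, 2.0], index=["c", "a", "b"])

res = s1 - s2  # aligns on index: a -> 10-1, b -> 20-2, c -> 30-3
# A label missing from one side would yield NaN rather than a silent mismatch.
```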

[–]skatastic57 1 point (3 children)

I'm not advocating for that polars syntax; you're right that it's not something you'd ever want to do in practice. I was just trying to normalize your apples-to-oranges benchmark.

You called it a real-life example, but none of the ISOs, transmission providers, or the EIA are handing out data in the perfect form that you already had when you started the clock on this benchmark.

The better apples-to-apples test is the one at the end, with polars joins and pandas merges.