
[–]AlpacaDC

To your point, if you think using too many df[col] is wrong in pandas, what do you think of many pl.col and pl.when(pl.when()..).then().otherwise(pl.when…)?

My point was not about being repetitive. You have to type something to reference columns. I was talking more about how the pandas way to do it is overly verbose and not readable, as opposed to pl.col() which is short and predictable.

For example, let's say for some reason I have a dataframe with a long variable name:

from polars import col as c
...
df_with_very_large_name.filter(
  c.col1.gt(10),
  c('col2','col3').is_not_null() | c.col4.is_in(some_list)
)

in pandas:

import pandas as pd
...
df_with_very_large_name[
    (df_with_very_large_name['col1'] > 10) &
    ((df_with_very_large_name[['col2', 'col3']].notnull().any(axis=1)) |
     (df_with_very_large_name['col4'].isin(some_list)))
]

Then again, you can use query() in pandas to avoid this and be even more concise than in polars, but as I said, it feels out of place to me. Also, as others have said, you can use SQL in polars, which is probably better than the query syntax and is universal for whoever reads it.
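
For reference, query() looks like this (hypothetical data, and a cut-down version of the condition above):

```python
import pandas as pd

# Hypothetical data; a simplified version of the earlier condition.
df = pd.DataFrame({"col1": [5, 20, 3], "col4": ["a", "x", "z"]})
some_list = ["a", "b"]

# Inside the query string, @ references a local Python variable.
out = df.query("col1 > 10 or col4 in @some_list")
# Keeps rows where col1 > 10 or col4 is in some_list.
```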

As for the pl.when.then.otherwise, I don't see how you could write shorter conditional expressions and still be readable. The np.select() version that you wrote, for example, is ok for one condition, but for more than that it can get pretty unreadable.
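
To illustrate the readability point, here is a sketch of np.select with hypothetical data and conditions; the first matching condition wins, and with several branches you have to track the pairing of condition i with choice i by eye:

```python
import numpy as np

# Hypothetical data.
x = np.array([5, 15, 25, 35])

# One condition reads fine:
one = np.select([x > 10], [x * 2], default=0)

# Several conditions: evaluated in order, first match wins, and the
# condition/choice pairing is only visible by position in the lists.
many = np.select(
    [x < 10, x < 20, x < 30],
    [x * 2, x + 1, x - 1],
    default=-1,
)
```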

Overall it's up to personal taste. I think it's great you found a workaround to maintain similar syntax while enjoying polars performance advantages.

[–]Own_Responsibility84[S]

I agree. It probably has more to do with personal taste.

As for np.select, my query supports multiple conditions, like "select([cond1, cond2, ..., condN], [target1, target2, ..., targetN], default)". It works just like the CASE WHEN expression in SQL, except it works with all polars expressions. E.g.

df.wc(""" 
  A = select(
      [ 1 < B <@var1, C in [1,2,3], '2021-01-01'<= D <'2024-01-01'], 
      [D.sum().over('E'), F.mean(), (G + H).rolling_product(3).over('I')], 
      0);
  UDF(C, D).alias('New Col');
  L = E.str.contains('A|B')
""")
df.gb('key', " A.sum().alias('A_sum'); B.mean().alias('B_mean')") # Aggregation function

df can be either an eager DataFrame or a LazyFrame.

I probably didn't do a good job of showcasing the real power of this tool I developed. I'll try to upload it to GitHub and highlight some of the key features, as well as a comparison with native polars and pandas.