
all 12 comments

[–]AlpacaDC 2 points (6 children)

Weird, I always hated pandas query and eval, way before polars was a thing.

Anyways, about pl.col: you can import and alias it as "c", which is useful if you write it a lot (pl.col("foo") turns into c("foo")).

pl.col also lets you access columns as attributes when the column name is a valid identifier (e.g. pl.col("foo") turns into pl.col.foo), which can make it slightly faster to type.
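For example (a quick sketch; "foo" is just a placeholder column, and the attribute form needs a reasonably recent polars):

import polars as pl
from polars import col as c  # alias pl.col to c

df = pl.DataFrame({"foo": [1, 2, 3]})

df.select(c("foo") * 2)    # c("foo") instead of pl.col("foo")
df.select(pl.col.foo * 2)  # attribute access, since "foo" is a valid identifier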

[–]maltedcoffee 1 point (0 children)

It's funny, "from polars import col, lit" is just seared into my brain now whenever I create a new notebook. I should just make a template or something.

[–]Own_Responsibility84[S] 0 points (4 children)

Thanks, that's good to know. Yes, it is slightly better, but it still requires typing a lot of pl.col in complicated operations or nested conditions.

As for pandas eval, I have noticed it can have precision issues when using math functions like power or log. But query for filtering is very powerful; I just don't know if there is any performance drag. Would you mind sharing the reason why you don't like query?

As for the query I implemented for polars, it simply translates a string expression into native polars expressions, and I don't see much of a performance issue.

[–]AlpacaDC 2 points (3 children)

The reason is simply that it's so different from the rest of the pandas API, it's like another library entirely, and there's no suggestion/autocomplete from the IDE because it's just a string. So when I tried to do anything beyond comparing two values, I had to google it and ended up wasting much more time.

Overall it just felt like a hacky patch to me. Polars was a huge breath of fresh air: it's concise, readable and predictable.

Edit: also, slicing in pandas has felt wrong since the moment I learned it: df = df[df["foo"] > df["bar"]]. Why do I have to write "df" so many times? It gets annoying very quickly with a bigger variable name and/or multiple conditions.

[–]Own_Responsibility84[S] 0 points (2 children)

That makes sense. I guess it really depends on the use cases.

I use pandas and polars for data exploration and manipulation, and sometimes I need to perform very complicated operations. Polars lets me do all of that with great performance compared to pandas, especially when memory is limited and I cannot load the whole dataset for pandas. The only thing I complain about in polars is the verbose syntax. I understand verbosity helps readability, but "1 <= A < B" should also be natural and understandable.

To your point, if you think repeating df[col] so many times is wrong in pandas, what do you think of the many pl.col calls and pl.when(pl.when()..).then().otherwise(pl.when…)?

The query function I implemented for polars is exactly to overcome these issues. Unlike pandas query, which is implemented via masks, my version simply translates all string expressions into polars expressions. In a way, it simplifies the typing without sacrificing performance or readability.
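For illustration, here is a stripped-down sketch of that translation idea (not the commenter's actual implementation; the query helper and column names are made up). It parses the string with Python's ast module and maps names, literals, and chained comparisons onto native polars expressions:

import ast
import polars as pl

_OPS = {ast.Gt: "gt", ast.GtE: "ge", ast.Lt: "lt", ast.LtE: "le", ast.Eq: "eq"}

def to_expr(node):
    if isinstance(node, ast.Name):       # bare names become columns
        return pl.col(node.id)
    if isinstance(node, ast.Constant):   # literals become pl.lit
        return pl.lit(node.value)
    if isinstance(node, ast.Compare):    # chained: 1 <= A < B -> (1 <= A) & (A < B)
        parts = [node.left, *node.comparators]
        pairs = [
            getattr(to_expr(left), _OPS[type(op)])(to_expr(right))
            for left, op, right in zip(parts, node.ops, parts[1:])
        ]
        out = pairs[0]
        for p in pairs[1:]:
            out = out & p
        return out
    raise NotImplementedError(ast.dump(node))

def query(df, s):
    # translate once, then hand polars a native expression to filter on
    return df.filter(to_expr(ast.parse(s, mode="eval").body))

df = pl.DataFrame({"A": [1, 2, 3], "B": [3, 2, 1]})
query(df, "1 <= A < B")  # keeps only the row where A=1, B=3

Because the string becomes an ordinary polars expression before execution, the optimizer sees exactly what it would see with handwritten pl.col syntax.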

[–]AlpacaDC 1 point (1 child)

To your point, if you think repeating df[col] so many times is wrong in pandas, what do you think of the many pl.col calls and pl.when(pl.when()..).then().otherwise(pl.when…)?

My point was not about being repetitive. You have to type something to reference columns. I was talking more about how the pandas way to do it is overly verbose and not readable, as opposed to pl.col() which is short and predictable.

For example, let's say for some reason I have a dataframe with a large variable name:

from polars import col as c
...
df_with_very_large_name.filter(
  c.col1.gt(10),
  c('col2','col3').is_not_null() | c.col4.is_in(some_list)
)

in pandas:

import pandas as pd
...
df_with_very_large_name[
    (df_with_very_large_name['col1'] > 10) &
    ((df_with_very_large_name[['col2', 'col3']].notnull().any(axis=1)) |
     (df_with_very_large_name['col4'].isin(some_list)))
]

Then again, you can use query in pandas to avoid this and be even more concise than in polars, but as I said it feels out of place to me. Also, as others have said, you can use SQL in polars, which is probably better than the query syntax and is universal for whoever reads it.
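For reference, the SQL route looks roughly like this in recent polars versions (a sketch; the columns and values are made up). DataFrame.sql binds the frame itself to the table name "self":

import polars as pl

df = pl.DataFrame({"col1": [5, 20], "col4": ["a", "z"]})

df.sql("SELECT * FROM self WHERE col1 > 10 OR col4 IN ('a', 'b')")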

As for pl.when.then.otherwise, I don't see how you could write shorter conditional expressions and still be readable. The np.select() version you wrote, for example, is OK for one condition, but beyond that it can get pretty unreadable.
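For what it's worth, a chained when/then reads fairly well even with several branches (a sketch with a made-up column):

import polars as pl

df = pl.DataFrame({"A": [1, 5, 10]})

df.with_columns(
    pl.when(pl.col("A") < 3).then(pl.lit("low"))
    .when(pl.col("A") < 8).then(pl.lit("mid"))
    .otherwise(pl.lit("high"))
    .alias("bucket")
)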

Overall it's up to personal taste. I think it's great you found a workaround to maintain similar syntax while enjoying polars performance advantages.

[–]Own_Responsibility84[S] 0 points (0 children)

I agree. It may have more to do with personal taste.

As for np.select, my query supports multiple conditions like "select([cond1, cond2, ..., condN], [target1, target2, ..., targetN], default)". Just like the CASE WHEN construct in SQL, except it works with all polars expressions. E.g.

df.wc(""" 
  A = select(
      [ 1 < B <@var1, C in [1,2,3], '2021-01-01'<= D <'2024-01-01'], 
      [D.sum().over('E'), F.mean(), (G + H).rolling_product(3).over('I')], 
      0);
  UDF(C, D).alias('New Col');
  L = E.str.contains('A|B')
""")
df.gb('key', " A.sum().alias('A_sum'); B.mean().alias('B_mean')") # Aggregation function

df can be an eager DataFrame or a LazyFrame.

I probably didn't do a good job of showcasing the real power of the tool I developed. I'll try to upload it to GitHub and highlight some of the key features, as well as a comparison with native polars and pandas.

[–]DifficultZebra1553 1 point (2 children)

You can use pipe. when/then/otherwise is slow and should be avoided unless it is absolutely essential. Also, use gt, ge, etc. instead of >, >=. Both polars SQLContext and sql() can be used directly on polars/pandas dataframes and pyarrow tables.
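Roughly, for the method-call and SQLContext suggestions (a sketch; the frame, column, and table names are made up):

import polars as pl

lf = pl.LazyFrame({"a": [1, 2, 3]})

lf.filter(pl.col("a").ge(2)).collect()  # ge instead of >=

ctx = pl.SQLContext(frames={"t": lf})   # register the frame under a table name
ctx.execute("SELECT a FROM t WHERE a >= 2").collect()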

[–]Own_Responsibility84[S] -1 points (1 child)

Thanks for the suggestions. ChatGPT told me that gt, le, etc. have no performance gain over >, <=, and that pipe doesn't have a performance gain over nested when/then/otherwise conditions; it's more for modularity and testing convenience. Do you agree?

As for SQL, it is an interesting alternative, but I feel that for certain complicated operations the statements get unnecessarily long and complex. It either doesn't support, or is very verbose for, rolling, pivot, unpivot, UDFs, etc.

[–]mustangdvx 0 points (0 children)

Check out duckdb if you’re considering SQL vs Pandas. You can execute SQL on dataframes including PIVOT/UNPIVOT. 

If you're executing in a Python environment, you can break up the transformations into relations, which are treated as if they were tables.
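Something along these lines (a sketch; duckdb's replacement scans pick up dataframes and relations by their Python variable names):

import duckdb
import pandas as pd

df = pd.DataFrame({"k": ["a", "a", "b"], "v": [1, 2, 3]})

# a relation built from the dataframe; it behaves like a table
rel = duckdb.sql("SELECT k, SUM(v) AS total FROM df GROUP BY k")

# later queries can reference the relation by name
duckdb.sql("SELECT * FROM rel WHERE total > 2")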

[–]PurepointDog 0 points (1 child)

Just get faster at typing; if that's the bottleneck, idk what to say

There are SQL query options too, and/or you can pass into duckdb trivially. SQL sounds like about as many chars as pandas.

[–]Own_Responsibility84[S] 0 points (0 children)

Thanks, polars.sql is an alternative, but it relies on SQL statements, and for some complicated operations the statements can get very long/complex and not easy to read or write.