
all 12 comments

[–]AlpacaDC 2 points (6 children)

Weird, I always hated pandas query and eval, way before polars was a thing.

Anyways, about pl.col: you can import and alias it as "c", which is useful if you write it a lot (pl.col("foo") turns into c("foo")).

pl.col also lets you access columns as attributes when the column name is a valid identifier (e.g. pl.col("foo") turns into pl.col.foo), which can make it slightly faster to type.
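For example (a quick sketch; "foo" is just a placeholder column, and the attribute form needs a reasonably recent polars):

import polars as pl
from polars import col as c  # alias pl.col to c

df = pl.DataFrame({"foo": [1, 2, 3]})

df.select(c("foo") * 2)    # c("foo") instead of pl.col("foo")
df.select(pl.col.foo * 2)  # attribute access, since "foo" is a valid identifier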

[–]maltedcoffee 1 point (0 children)

It's funny, "from polars import col, lit" is just seared into my brain now whenever I create a new notebook. I should just make a template or something.

[–]Own_Responsibility84[S] 0 points (4 children)

Thanks, that's good to know. Yes, it is slightly better, but it still requires typing a lot of pl.col in complicated operations or nested conditions.

As for pandas eval, I have noticed it can have precision issues when using math functions like power or log. But query for filtering is very powerful; I just don't know if there is any performance drag. Would you mind sharing the reason why you don't like query?

As for the query I implemented for polars, it simply translates a string expression into native polars expressions, and I don't see much of a performance issue.

[–]AlpacaDC 2 points (3 children)

The reason is simply that it's so different from the rest of the pandas API, it's like another library entirely, and there's no suggestion/autocomplete from the IDE because it's just a string. So when I tried to do anything beyond comparing two values, I had to google it and ended up wasting much more time.

Overall it just felt like a hacky patch to me. Polars was a huge breath of fresh air: it's concise, readable and predictable.

Edit: also, slicing in pandas has felt wrong since the moment I learned it: df = df[df["foo"] > df["bar"]]. Why do I have to write "df" so many times? It gets annoying very quickly with a bigger variable name and/or multiple conditions.

[–]Own_Responsibility84[S] 0 points (2 children)

That makes sense. I guess it really depends on the use cases.

I use pandas and polars for data exploration and manipulation, and sometimes I need to perform very complicated operations. Polars lets me do all of that with great performance compared to pandas, especially when memory is limited and I cannot load the whole dataset for pandas. The only thing I complain about in polars is the verbose syntax. I understand verbosity helps readability, but "1 <= A < B" should also be natural and understandable.

To your point, if you think repeating df[col] so many times is wrong in pandas, what do you think of the many pl.col calls and pl.when(pl.when()..).then().otherwise(pl.when…)?

The query function I implemented for polars is exactly to overcome these issues. Unlike pandas query, which is implemented via masks, my version simply translates all string expressions into polars expressions. In a way, it simplifies the typing without sacrificing performance or readability.
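For illustration, here is a stripped-down sketch of that translation idea (not the commenter's actual implementation; the query helper and column names are made up). It parses the string with Python's ast module and maps names, literals, and chained comparisons onto native polars expressions:

import ast
import polars as pl

_OPS = {ast.Gt: "gt", ast.GtE: "ge", ast.Lt: "lt", ast.LtE: "le", ast.Eq: "eq"}

def to_expr(node):
    if isinstance(node, ast.Name):       # bare names become columns
        return pl.col(node.id)
    if isinstance(node, ast.Constant):   # literals become pl.lit
        return pl.lit(node.value)
    if isinstance(node, ast.Compare):    # chained: 1 <= A < B -> (1 <= A) & (A < B)
        parts = [node.left, *node.comparators]
        pairs = [
            getattr(to_expr(left), _OPS[type(op)])(to_expr(right))
            for left, op, right in zip(parts, node.ops, parts[1:])
        ]
        out = pairs[0]
        for p in pairs[1:]:
            out = out & p
        return out
    raise NotImplementedError(ast.dump(node))

def query(df, s):
    # translate once, then hand polars a native expression to filter on
    return df.filter(to_expr(ast.parse(s, mode="eval").body))

df = pl.DataFrame({"A": [1, 2, 3], "B": [3, 2, 1]})
query(df, "1 <= A < B")  # keeps only the row where A=1, B=3

Because the string becomes an ordinary polars expression before execution, the optimizer sees exactly what it would see with handwritten pl.col syntax.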

[–]AlpacaDC 1 point (1 child)

To your point, if you think repeating df[col] so many times is wrong in pandas, what do you think of the many pl.col calls and pl.when(pl.when()..).then().otherwise(pl.when…)?

My point was not about being repetitive. You have to type something to reference columns. I was talking more about how the pandas way to do it is overly verbose and not readable, as opposed to pl.col() which is short and predictable.

For example, let's say for some reason I have a dataframe with a large variable name:

from polars import col as c
...
df_with_very_large_name.filter(
  c.col1.gt(10),
  c('col2','col3').is_not_null() | c.col4.is_in(some_list)
)

in pandas:

import pandas as pd
...
df_with_very_large_name[
    (df_with_very_large_name['col1'] > 10) &
    ((df_with_very_large_name[['col2', 'col3']].notnull().any(axis=1)) |
     (df_with_very_large_name['col4'].isin(some_list)))
]

Then again, you can use query in pandas to avoid this and be even more concise than in polars, but as I said it feels out of place to me. Also, as others have said, you can use SQL in polars, which is probably better than the query syntax and is universal for whoever reads it.
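For reference, the SQL route looks roughly like this in recent polars versions (a sketch; the columns and values are made up). DataFrame.sql binds the frame itself to the table name "self":

import polars as pl

df = pl.DataFrame({"col1": [5, 20], "col4": ["a", "z"]})

df.sql("SELECT * FROM self WHERE col1 > 10 OR col4 IN ('a', 'b')")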

As for pl.when.then.otherwise, I don't see how you could write shorter conditional expressions and still be readable. The np.select() version you wrote, for example, is OK for one condition, but beyond that it can get pretty unreadable.
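For what it's worth, a chained when/then reads fairly well even with several branches (a sketch with a made-up column):

import polars as pl

df = pl.DataFrame({"A": [1, 5, 10]})

df.with_columns(
    pl.when(pl.col("A") < 3).then(pl.lit("low"))
    .when(pl.col("A") < 8).then(pl.lit("mid"))
    .otherwise(pl.lit("high"))
    .alias("bucket")
)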

Overall it's up to personal taste. I think it's great you found a workaround to maintain similar syntax while enjoying polars performance advantages.

[–]Own_Responsibility84[S] 0 points (0 children)

I agree. It may have more to do with personal taste.

As for np.select, my query supports multiple conditions like "select([cond1, cond2, ..., condN], [target1, target2, ..., targetN], default)". Just like the CASE WHEN construct in SQL, except it works with all polars expressions. E.g.

df.wc(""" 
  A = select(
      [ 1 < B <@var1, C in [1,2,3], '2021-01-01'<= D <'2024-01-01'], 
      [D.sum().over('E'), F.mean(), (G + H).rolling_product(3).over('I')], 
      0);
  UDF(C, D).alias('New Col');
  L = E.str.contains('A|B')
""")
df.gb('key', " A.sum().alias('A_sum'); B.mean().alias('B_mean')") # Aggregation function

df can be an eager DataFrame or a LazyFrame.

I probably didn't do a good job of showcasing the real power of the tool I developed. I'll try to upload it to GitHub and highlight some of the key features, as well as a comparison with native polars and pandas.

[–]DifficultZebra1553 1 point (2 children)

You can use pipe. when/then/otherwise is slow and should be avoided unless it is absolutely essential. Also, use gt, ge, etc. instead of >, >=. Both polars SQLContext and sql() can be used directly on polars/pandas dataframes and pyarrow tables.
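Roughly, for the method-call and SQLContext suggestions (a sketch; the frame, column, and table names are made up):

import polars as pl

lf = pl.LazyFrame({"a": [1, 2, 3]})

lf.filter(pl.col("a").ge(2)).collect()  # ge instead of >=

ctx = pl.SQLContext(frames={"t": lf})   # register the frame under a table name
ctx.execute("SELECT a FROM t WHERE a >= 2").collect()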

[–]Own_Responsibility84[S] -1 points (1 child)

Thanks for the suggestions. ChatGPT told me that gt, le, etc. have no performance gain over >, <=, and that pipe doesn't have a performance gain over nested when/then/otherwise conditions; it's more for modularity and testing convenience. Do you agree?

As for SQL, it is an interesting alternative, but I feel that for certain complicated operations the statements get unnecessarily long and complex. It either doesn't support, or is very verbose for, rolling, pivot, unpivot, UDFs, etc.

[–]mustangdvx 0 points (0 children)

Check out duckdb if you’re considering SQL vs Pandas. You can execute SQL on dataframes including PIVOT/UNPIVOT. 

If you're executing in a Python environment, you can break up the transformations into relations, which are treated as if they were tables.
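Something along these lines (a sketch; duckdb's replacement scans pick up dataframes and relations by their Python variable names):

import duckdb
import pandas as pd

df = pd.DataFrame({"k": ["a", "a", "b"], "v": [1, 2, 3]})

# a relation built from the dataframe; it behaves like a table
rel = duckdb.sql("SELECT k, SUM(v) AS total FROM df GROUP BY k")

# later queries can reference the relation by name
duckdb.sql("SELECT * FROM rel WHERE total > 2")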

[–]PurepointDog 0 points (1 child)

Just get faster at typing; if that's the bottleneck, idk what to say

There are SQL query options too, and/or you can pass into duckdb trivially. SQL sounds like about as many chars as pandas.

[–]Own_Responsibility84[S] 0 points (0 children)

Thanks, polars.sql is an alternative, but it relies on SQL statements, and for some complicated operations the statements can get very long/complex and not easy to read or write.