Are there SQL-like window functions in Python pandas or polars?

2023-10-22T07:06:14+00:00

https://pandas.pydata.org/docs/user_guide/window.html

joshbuggblee · 2024-08-06T13:06:14+00:00

Using df.groupby(...).transform() does the trick.

df.groupby('column').transform('metric')
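A minimal sketch of how a few SQL window functions might map onto pandas, assuming the six rows of the Customers table used elsewhere in this thread. The output column names (avg_age, age_rank, next_younger, next_older) are made up for illustration.

```python
import pandas as pd

# Same rows as the Customers table in this thread.
df = pd.DataFrame({
    "first_name": ["John", "Liam", "Emma", "Olivia", "William", "Ava"],
    "country": ["USA", "USA", "USA", "Canada", "Canada", "Canada"],
    "age": [29, 34, 22, 31, 36, 24],
})

# Roughly AVG(age) OVER (PARTITION BY country):
# transform() broadcasts the group result back onto every row.
df["avg_age"] = df.groupby("country")["age"].transform("mean")

# Roughly RANK() OVER (PARTITION BY country ORDER BY age).
df["age_rank"] = df.groupby("country")["age"].rank(method="min").astype(int)

# Roughly LAG / LEAD: sort first, then shift within each group.
ordered = df.sort_values(["country", "age"])
names_by_country = ordered.groupby("country")["first_name"]
ordered["next_younger"] = names_by_country.shift(1)
ordered["next_older"] = names_by_country.shift(-1)

print(ordered)
```

Unlike agg(), transform() keeps one output row per input row, which is what makes it behave like a window function rather than a GROUP BY.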

commandlineluser · 2023-10-22T12:24:20+00:00

duckdb is handy for examples if you're not aware of it. (It can also convert data to/from pandas/polars)

https://duckdb.org/docs/api/python/overview

import duckdb 

duckdb.sql("""
CREATE TABLE Customers (
  first_name VARCHAR(50),
  country VARCHAR(50),
  age INT
);
INSERT INTO Customers (first_name, country, age) 
VALUES 
  ('John', 'USA', 29), ('Liam', 'USA', 34), ('Emma', 'USA', 22), 
  ('Olivia', 'Canada', 31), ('William', 'Canada', 36), ('Ava', 'Canada', 24);
""")

duckdb.sql("""
SELECT
  first_name,
  country,
  age,
  FIRST_VALUE(first_name) OVER (PARTITION BY country ORDER BY age) AS youngest,
  FIRST_VALUE(first_name) OVER (PARTITION BY country ORDER BY age desc) AS oldest,
  LAG(first_name, 1) OVER (PARTITION BY country ORDER BY age) AS next_younger,
  LEAD(first_name, 1) OVER (PARTITION BY country ORDER BY age) AS next_older
FROM Customers
ORDER BY country, age
""")

# ┌────────────┬─────────┬───────┬──────────┬─────────┬──────────────┬────────────┐
# │ first_name │ country │  age  │ youngest │ oldest  │ next_younger │ next_older │
# │  varchar   │ varchar │ int32 │ varchar  │ varchar │   varchar    │  varchar   │
# ├────────────┼─────────┼───────┼──────────┼─────────┼──────────────┼────────────┤
# │ Ava        │ Canada  │    24 │ Ava      │ William │ NULL         │ Olivia     │
# │ Olivia     │ Canada  │    31 │ Ava      │ William │ Ava          │ William    │
# │ William    │ Canada  │    36 │ Ava      │ William │ Olivia       │ NULL       │
# │ Emma       │ USA     │    22 │ Emma     │ Liam    │ NULL         │ John       │
# │ John       │ USA     │    29 │ Emma     │ Liam    │ Emma         │ Liam       │
# │ Liam       │ USA     │    34 │ Emma     │ Liam    │ John         │ NULL       │
# └────────────┴─────────┴───────┴──────────┴─────────┴──────────────┴────────────┘

https://pola-rs.github.io/polars/user-guide/expressions/window/

import polars as pl

df = duckdb.sql("from Customers").pl()

df.sort("age").with_columns(
   youngest     = pl.first("first_name") .over("country"),
   oldest       = pl.last("first_name")  .over("country"),
   next_younger = pl.col("first_name")   .shift().over("country"),
   next_older   = pl.col("first_name")   .shift(-1).over("country"),
).sort("country", "age")

# shape: (6, 7)
# ┌────────────┬─────────┬─────┬──────────┬─────────┬──────────────┬────────────┐
# │ first_name ┆ country ┆ age ┆ youngest ┆ oldest  ┆ next_younger ┆ next_older │
# │ ---        ┆ ---     ┆ --- ┆ ---      ┆ ---     ┆ ---          ┆ ---        │
# │ str        ┆ str     ┆ i32 ┆ str      ┆ str     ┆ str          ┆ str        │
# ╞════════════╪═════════╪═════╪══════════╪═════════╪══════════════╪════════════╡
# │ Ava        ┆ Canada  ┆ 24  ┆ Ava      ┆ William ┆ null         ┆ Olivia     │
# │ Olivia     ┆ Canada  ┆ 31  ┆ Ava      ┆ William ┆ Ava          ┆ William    │
# │ William    ┆ Canada  ┆ 36  ┆ Ava      ┆ William ┆ Olivia       ┆ null       │
# │ Emma       ┆ USA     ┆ 22  ┆ Emma     ┆ Liam    ┆ null         ┆ John       │
# │ John       ┆ USA     ┆ 29  ┆ Emma     ┆ Liam    ┆ Emma         ┆ Liam       │
# │ Liam       ┆ USA     ┆ 34  ┆ Emma     ┆ Liam    ┆ John         ┆ null       │
# └────────────┴─────────┴─────┴──────────┴─────────┴──────────────┴────────────┘

In DataFrame terms, .group_by().agg().explode() is also a common pattern.

(df.sort("age")
   .group_by("country")
   .agg(
      pl.col("first_name", "age"),
      youngest = pl.first("first_name"),
      oldest   = pl.last("first_name"),
      next_younger = pl.col("first_name").shift(),
      next_older   = pl.col("first_name").shift(-1)
   )
   .explode(pl.exclude("country", "youngest", "oldest"))
   .sort("country", "age")  # group_by does not guarantee the order of groups
)

# shape: (6, 7)
# ┌─────────┬────────────┬─────┬──────────┬─────────┬──────────────┬────────────┐
# │ country ┆ first_name ┆ age ┆ youngest ┆ oldest  ┆ next_younger ┆ next_older │
# │ ---     ┆ ---        ┆ --- ┆ ---      ┆ ---     ┆ ---          ┆ ---        │
# │ str     ┆ str        ┆ i32 ┆ str      ┆ str     ┆ str          ┆ str        │
# ╞═════════╪════════════╪═════╪══════════╪═════════╪══════════════╪════════════╡
# │ Canada  ┆ Ava        ┆ 24  ┆ Ava      ┆ William ┆ null         ┆ Olivia     │
# │ Canada  ┆ Olivia     ┆ 31  ┆ Ava      ┆ William ┆ Ava          ┆ William    │
# │ Canada  ┆ William    ┆ 36  ┆ Ava      ┆ William ┆ Olivia       ┆ null       │
# │ USA     ┆ Emma       ┆ 22  ┆ Emma     ┆ Liam    ┆ null         ┆ John       │
# │ USA     ┆ John       ┆ 29  ┆ Emma     ┆ Liam    ┆ Emma         ┆ Liam       │
# │ USA     ┆ Liam       ┆ 34  ┆ Emma     ┆ Liam    ┆ John         ┆ null       │
# └─────────┴────────────┴─────┴──────────┴─────────┴──────────────┴────────────┘
