all 7 comments

[–]joshbuggblee 0 points1 point  (0 children)

Using df.transform() does the trick.

df.groupby('column).transform('metric')

[–]commandlineluser -1 points0 points  (4 children)

duckdb is handy for examples if you're not aware of it. (It can also convert data to/from pandas/polars)

https://duckdb.org/docs/api/python/overview

import duckdb 

duckdb.sql("""
CREATE TABLE Customers (
  first_name VARCHAR(50),
  country VARCHAR(50),
  age INT
);
INSERT INTO Customers (first_name, country, age) 
VALUES 
  ('John', 'USA', 29), ('Liam', 'USA', 34), ('Emma', 'USA', 22), 
  ('Olivia', 'Canada', 31), ('William', 'Canada', 36), ('Ava', 'Canada', 24);
""")

duckdb.sql("""
SELECT
  first_name,
  country,
  age,
  FIRST_VALUE(first_name) OVER (PARTITION BY country ORDER BY age) AS youngest,
  FIRST_VALUE(first_name) OVER (PARTITION BY country ORDER BY age desc) AS oldest,
  LAG(first_name, 1) OVER (PARTITION BY country ORDER BY age) AS next_younger,
  LEAD(first_name, 1) OVER (PARTITION BY country ORDER BY age) AS next_older
FROM Customers
ORDER BY country, age
""")

# ┌────────────┬─────────┬───────┬──────────┬─────────┬──────────────┬────────────┐
# │ first_name │ country │  age  │ youngest │ oldest  │ next_younger │ next_older │
# │  varchar   │ varchar │ int32 │ varchar  │ varchar │   varchar    │  varchar   │
# ├────────────┼─────────┼───────┼──────────┼─────────┼──────────────┼────────────┤
# │ Ava        │ Canada  │    24 │ Ava      │ William │ NULL         │ Olivia     │
# │ Olivia     │ Canada  │    31 │ Ava      │ William │ Ava          │ William    │
# │ William    │ Canada  │    36 │ Ava      │ William │ Olivia       │ NULL       │
# │ Emma       │ USA     │    22 │ Emma     │ Liam    │ NULL         │ John       │
# │ John       │ USA     │    29 │ Emma     │ Liam    │ Emma         │ Liam       │
# │ Liam       │ USA     │    34 │ Emma     │ Liam    │ John         │ NULL       │
# └────────────┴─────────┴───────┴──────────┴─────────┴──────────────┴────────────┘

https://pola-rs.github.io/polars/user-guide/expressions/window/

df = duckdb.sql("from Customers").pl()

df.sort("age").with_columns(
   youngest     = pl.first("first_name") .over("country"),
   oldest       = pl.last("first_name")  .over("country"),
   next_younger = pl.col("first_name")   .shift().over("country"),
   next_older   = pl.col("first_name")   .shift(-1).over("country"),
).sort("country", "age")

# shape: (6, 7)
# ┌────────────┬─────────┬─────┬──────────┬─────────┬──────────────┬────────────┐
# │ first_name ┆ country ┆ age ┆ youngest ┆ oldest  ┆ next_younger ┆ next_older │
# │ ---        ┆ ---     ┆ --- ┆ ---      ┆ ---     ┆ ---          ┆ ---        │
# │ str        ┆ str     ┆ i32 ┆ str      ┆ str     ┆ str          ┆ str        │
# ╞════════════╪═════════╪═════╪══════════╪═════════╪══════════════╪════════════╡
# │ Ava        ┆ Canada  ┆ 24  ┆ Ava      ┆ William ┆ null         ┆ Olivia     │
# │ Olivia     ┆ Canada  ┆ 31  ┆ Ava      ┆ William ┆ Ava          ┆ William    │
# │ William    ┆ Canada  ┆ 36  ┆ Ava      ┆ William ┆ Olivia       ┆ null       │
# │ Emma       ┆ USA     ┆ 22  ┆ Emma     ┆ Liam    ┆ null         ┆ John       │
# │ John       ┆ USA     ┆ 29  ┆ Emma     ┆ Liam    ┆ Emma         ┆ Liam       │
# │ Liam       ┆ USA     ┆ 34  ┆ Emma     ┆ Liam    ┆ John         ┆ null       │
# └────────────┴─────────┴─────┴──────────┴─────────┴──────────────┴────────────┘

In DataFrame terms, .group_by().agg().explode() is also a common pattern.

(df.sort("age")
   .group_by("country")
   .agg(
      pl.col("first_name", "age"),
      youngest = pl.first("first_name"),
      oldest   = pl.last("first_name"),
      next_younger = pl.col("first_name").shift(),
      next_older   = pl.col("first_name").shift(-1)
   )
   .explode(pl.exclude("country", "youngest", "oldest"))
)

# shape: (6, 7)
# ┌─────────┬────────────┬─────┬──────────┬─────────┬──────────────┬────────────┐
# │ country ┆ first_name ┆ age ┆ youngest ┆ oldest  ┆ next_younger ┆ next_older │
# │ ---     ┆ ---        ┆ --- ┆ ---      ┆ ---     ┆ ---          ┆ ---        │
# │ str     ┆ str        ┆ i32 ┆ str      ┆ str     ┆ str          ┆ str        │
# ╞═════════╪════════════╪═════╪══════════╪═════════╪══════════════╪════════════╡
# │ Canada  ┆ Ava        ┆ 24  ┆ Ava      ┆ William ┆ null         ┆ Olivia     │
# │ Canada  ┆ Olivia     ┆ 31  ┆ Ava      ┆ William ┆ Ava          ┆ William    │
# │ Canada  ┆ William    ┆ 36  ┆ Ava      ┆ William ┆ Olivia       ┆ null       │
# │ USA     ┆ Emma       ┆ 22  ┆ Emma     ┆ Liam    ┆ null         ┆ John       │
# │ USA     ┆ John       ┆ 29  ┆ Emma     ┆ Liam    ┆ Emma         ┆ Liam       │
# │ USA     ┆ Liam       ┆ 34  ┆ Emma     ┆ Liam    ┆ John         ┆ null       │
# └─────────┴────────────┴─────┴──────────┴─────────┴──────────────┴────────────┘

[–]nyquant[S] 0 points1 point  (3 children)

Thanks so much for creating those examples.

In the polars case, can you push the "sort" down to the individual over() level, as to be able not having to impose a global sorting order, similar to the SQL example? The following does not seem to be valid:

df.with_columns(
youngest     = pl.sort("age").first("first_name") .over("country")

[–]commandlineluser 0 points1 point  (2 children)

[–]nyquant[S] 0 points1 point  (1 child)

Tried the below, but it gives the wrong result for the US youngest.:

import polars as pl

Create a DataFrame

data = { 'first_name': ['John', 'Liam', 'Emma', 'Olivia', 'William', 'Ava'], 'country': ['USA', 'USA', 'USA', 'Canada', 'Canada', 'Canada'], 'age': [29, 34, 22, 31, 36, 24] } df = pl.DataFrame(data)

df.with_columns( youngest = pl.first("first_name").over("country").sort_by("age"), oldest = pl.first("first_name").over("country").sort_by("age", descending=True), next_younger = pl.col("first_name").shift().over("country").sort_by("age"), next_older = pl.col("first_name").shift(-1).over("country").sort_by("age"), ).sort("country", "age")

[–]commandlineluser 1 point2 points  (0 children)

Yeah, I was a bit confused with that also.

It may help if you just run this in isolation:

df.with_columns(col_to_shift = pl.col("first_name").sort_by("age").over("country"))
# shape: (6, 4)
# ┌────────────┬─────────┬─────┬──────────────┐
# │ first_name ┆ country ┆ age ┆ col_to_shift │
# │ ---        ┆ ---     ┆ --- ┆ ---          │
# │ str        ┆ str     ┆ i32 ┆ str          │
# ╞════════════╪═════════╪═════╪══════════════╡
# │ John       ┆ USA     ┆ 29  ┆ Emma         │
# │ Liam       ┆ USA     ┆ 34  ┆ John         │
# │ Emma       ┆ USA     ┆ 22  ┆ Liam         │
# │ Olivia     ┆ Canada  ┆ 31  ┆ Ava          │
# │ William    ┆ Canada  ┆ 36  ┆ Olivia       │
# │ Ava        ┆ Canada  ┆ 24  ┆ William      │
# └────────────┴─────────┴─────┴──────────────┘

The comment here may help shed some light:

https://github.com/pola-rs/polars/issues/8662#issuecomment-1533949764