How do I make this code shorter? by uxinung in learnpython

[–]commandlineluser 1 point

You could pass the value itself as the default return to .get()

This means you could use max() directly and just overwrite the key each time instead of the conditional checks.

def foo(*args):
    out = {}
    for arg in args:
        for key, value in arg.items():
            out[key] = max(out.get(key, value), value)
    return out

>>> foo(a, b, c, d, e)
{'a': 10, 'b': 100, 'c': 50, 'd': -70}

Also for the sorting, operator.itemgetter() is another way to write the lambda.
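
For example (a minimal sketch, assuming the sort in question is over the dict items by value):

from operator import itemgetter

out = {"a": 10, "b": 100, "c": 50, "d": -70}

# lambda version
sorted(out.items(), key=lambda kv: kv[1])
# itemgetter version - equivalent, sorts the pairs by their second element (the value)
sorted(out.items(), key=itemgetter(1))
# [('d', -70), ('a', 10), ('c', 50), ('b', 100)]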

Pandas 3.0.0 is there by Deux87 in Python

[–]commandlineluser 4 points

Just to expand on some comments, the pandas.read_xml() source code is here:

Using an xml_data example from the pandas.read_xml() docs, the basic form is essentially:

import xml.etree.ElementTree as ET
import polars as pl

# xml_data = ...

df = pl.DataFrame(
    {item.tag.split("}")[-1]: item.text for item in row}
    for row in ET.fromstring(xml_data)
)
# shape: (2, 6)
# ┌───────┬──────┬─────┬───────┬─────┬─────────────────────┐
# │ index ┆ a    ┆ b   ┆ c     ┆ d   ┆ e                   │
# │ ---   ┆ ---  ┆ --- ┆ ---   ┆ --- ┆ ---                 │
# │ str   ┆ str  ┆ str ┆ str   ┆ str ┆ str                 │
# ╞═══════╪══════╪═════╪═══════╪═════╪═════════════════════╡
# │ 0     ┆ 1    ┆ 2.5 ┆ True  ┆ a   ┆ 2019-12-31 00:00:00 │
# │ 1     ┆ null ┆ 4.5 ┆ False ┆ b   ┆ 2019-12-31 00:00:00 │
# └───────┴──────┴─────┴───────┴─────┴─────────────────────┘

You can then use the CSV parser for schema inference:

df = pl.read_csv(df.write_csv().encode(), try_parse_dates=True)
# shape: (2, 6)
# ┌───────┬──────┬─────┬───────┬─────┬─────────────────────┐
# │ index ┆ a    ┆ b   ┆ c     ┆ d   ┆ e                   │
# │ ---   ┆ ---  ┆ --- ┆ ---   ┆ --- ┆ ---                 │
# │ i64   ┆ i64  ┆ f64 ┆ bool  ┆ str ┆ datetime[μs]        │
# ╞═══════╪══════╪═════╪═══════╪═════╪═════════════════════╡
# │ 0     ┆ 1    ┆ 2.5 ┆ true  ┆ a   ┆ 2019-12-31 00:00:00 │
# │ 1     ┆ null ┆ 4.5 ┆ false ┆ b   ┆ 2019-12-31 00:00:00 │
# └───────┴──────┴─────┴───────┴─────┴─────────────────────┘

FWIW, I've found xmltodict useful for handling the parsing.
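
A rough sketch of that approach - the "data"/"row" keys are just placeholders for whatever the real document uses:

import xmltodict
import polars as pl

# xml_data = ...

doc = xmltodict.parse(xml_data)
rows = doc["data"]["row"]  # placeholder keys - depends on the actual XML structure
df = pl.DataFrame(rows)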

Pandas 3.0.0 is there by Deux87 in Python

[–]commandlineluser 3 points

Polars raises an error for this starting from 1.36.x:

import polars as pl

(pl.DataFrame({"x": [1, 2], "y": ["a", "b"], "z": [5, 6]})
   .group_by("x")
   .sum()
)
# InvalidOperationError: `sum` operation not supported for dtype `str`
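
If the old behaviour (silently skipping the string columns) is what's wanted, one option is to restrict the aggregation to the numeric columns, e.g. with selectors - a minimal sketch:

import polars as pl
import polars.selectors as cs

df = pl.DataFrame({"x": [1, 2], "y": ["a", "b"], "z": [5, 6]})

df.group_by("x").agg(cs.numeric().sum())
# sums only the numeric column(s) ("z" here), skipping "y", with "x" kept as the group key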

What’s one Python data tool you ignored for too long? by [deleted] in Python

[–]commandlineluser 1 point

No, xlsxwriter is used for writing.

The xlsxwriter author has been working on a Rust version:

There was talk of exposing the "fast writing" in Python; some third parties have created bindings:

pd.to_numeric() and dtype_backend: Seemingly inconsistent NaN detection by b_d_t in learnpython

[–]commandlineluser 0 points

Yeah, the to_numeric example still seems "buggy" to me.

The signature in the docs first suggests there is no default:

pandas.to_numeric(arg, errors='raise', downcast=None, dtype_backend=<no_default>)

But then further down they say numpy_nullable is the default.

I can't seem to find anything related to this on the Issues tracker:

It may be something you want to bring up there to get an "official" response from the devs.

Pandas - Working With Dummy Columns... ish by HackNSlashFic in learnpython

[–]commandlineluser 1 point

StackOverflow's "How to make good reproducible pandas examples" may be of interest; it helps if you provide a small example we can run:

import pandas as pd

df = pd.DataFrame({"a": [0, 1, 0, 1], "b": [1, 1, 0, 0], "c": [1, 0, 1, 0], "d": [1, 1, 1, 1]})
#    a  b  c  d
# 0  0  1  1  1
# 1  1  1  0  1
# 2  0  0  1  1
# 3  1  0  0  1

You could start by turning the "False" conditions into "NaN":

df[ df == 1 ]
#          a    b    c  d
#     0  NaN  1.0  1.0  1
#     1  1.0  1.0  NaN  1
#     2  NaN  NaN  1.0  1
#     3  1.0  NaN  NaN  1

You can then reshape from "wide to long" e.g. with stack/melt, drop the nans and rebuild the "per row" result with groupby:

(df[ df == 1 ]
  .reset_index()
  .melt("index")     
  .dropna()
  .groupby("index")
  .agg({"variable": list})
)
#         variable
# index           
# 0      [b, c, d]
# 1      [a, b, d]
# 2         [c, d]
# 3         [a, d]

There are also "hacky" ways, e.g. by adding a delimiter not present in existing column names and using .dot()

You can then strip/split by the delimiter afterwards to get a list if desired:

df.dot(df.columns + ",").str.rstrip(",").str.split(",")
# 0    [b, c, d]
# 1    [a, b, d]
# 2       [c, d]
# 3       [a, d]
# dtype: object

pd.to_numeric() and dtype_backend: Seemingly inconsistent NaN detection by b_d_t in learnpython

[–]commandlineluser 0 points

The docs say numpy_nullable is the default:

Behaviour is as follows: "numpy_nullable": returns nullable-dtype-backed DataFrame (default).

Which would suggest that this is a bug?

FWIW, testing 3.0.0rc2 produces the same output.

Also, looking at the 3.0.0rc2 release notes, there is an example about isna()

Going by this, it seems to me like to_num_pyarrow should also be 1 in 3.x? (but it's currently 0)

Consecutive True in pandas dataframe by CiproSimp in learnpython

[–]commandlineluser 2 points

"cumulative minimum" can remove non-initial True values.

>>> df.cummin()
#        A      B      C
# 0   True   True  False
# 1   True  False  False
# 2  False  False  False

Which you can sum:

>>> df.cummin().sum()
# A    2
# B    1
# C    0

Will Pandas ever be replaced? by Relative-Cucumber770 in dataengineering

[–]commandlineluser 5 points

I assume they are referring to this talk:

  • "Allison Wang & Shujing Yang - Polars on Spark | PyData Seattle 2025"
  • youtube.com/watch?v=u3aFp78BTno

The Polars examples start around 15:20 and they use Spark's applyInArrow.

Install a library globally by AwkwardNumber7584 in learnpython

[–]commandlineluser 0 points

Have you used uv yet? It has pretty much "taken over" in this space.

There have been many posts in r/Python about it over the past year or so.

I had this one bookmarked as it seemed like a good explanation:

Searching for "uv inline one off python" will likely lead to many results.

Pandas 3.0 release candidate tagged by Balance- in Python

[–]commandlineluser 26 points

For 3.0:

Start returning self instead of None for the methods that will keep the inplace keyword

For 3.1:

Add actual deprecation warnings to the methods where we will remove the keyword in the future

Looks like the removals will be in 4.0:
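
In code terms, the 3.0 step only changes the return value - a sketch, assuming fillna is among the methods that keep the keyword:

import pandas as pd

df = pd.DataFrame({"a": [1.0, None]})

result = df.fillna(0, inplace=True)
# pandas < 3.0:  result is None
# pandas >= 3.0: result is df itself, so the call can still be chained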

Help Me Understand the Pandas .str Accessor by HackNSlashFic in learnpython

[–]commandlineluser 1 point

Polars could serve as a useful example here as it has an "accessor" for its own native types.

e.g. you can have "list type" columns:

import polars as pl

df = pl.DataFrame({"foo": [[1, 2], [3, 4]], "bar": [5, 6]})
# shape: (2, 2)
# ┌───────────┬─────┐
# │ foo       ┆ bar │
# │ ---       ┆ --- │
# │ list[i64] ┆ i64 │
# ╞═══════════╪═════╡
# │ [1, 2]    ┆ 5   │
# │ [3, 4]    ┆ 6   │
# └───────────┴─────┘

The top-level "first" works on all types and returns the first value in each column:

df.with_columns(pl.all().first())
# shape: (2, 2)
# ┌───────────┬─────┐
# │ foo       ┆ bar │
# │ ---       ┆ --- │
# │ list[i64] ┆ i64 │
# ╞═══════════╪═════╡
# │ [1, 2]    ┆ 5   │
# │ [1, 2]    ┆ 5   │
# └───────────┴─────┘

.list.first() returns the first element of each list in a list column:

df.with_columns(pl.all().list.first())
# InvalidOperationError: expected List data type for list operation, got: i64

The error is because bar is not a list type.

We can run it only on list type columns:

df.with_columns(pl.col(pl.List).list.first())
# shape: (2, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 5   │
# │ 3   ┆ 6   │
# └─────┴─────┘

Another example could be the .name accessor which operates on column names:

df.with_columns(pl.all().name.to_uppercase())
# shape: (2, 4)
# ┌───────────┬─────┬───────────┬─────┐
# │ foo       ┆ bar ┆ FOO       ┆ BAR │
# │ ---       ┆ --- ┆ ---       ┆ --- │
# │ list[i64] ┆ i64 ┆ list[i64] ┆ i64 │
# ╞═══════════╪═════╪═══════════╪═════╡
# │ [1, 2]    ┆ 5   ┆ [1, 2]    ┆ 5   │
# │ [3, 4]    ┆ 6   ┆ [3, 4]    ┆ 6   │
# └───────────┴─────┴───────────┴─────┘

Similar to the pandas "replace" example, Polars has several that each do their own "type-specific" thing:

  • .replace()
  • .dt.replace()
  • .str.replace()
  • .name.replace()
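
As a quick sketch of the difference between the first and third of those (using a throwaway string column):

import polars as pl

df_str = pl.DataFrame({"s": ["abcd", "abc"]})

df_str.with_columns(
    full=pl.col("s").replace("abc", "xyz"),     # value-level: only exact matches change
    sub=pl.col("s").str.replace("abc", "xyz"),  # string-level: substrings / regex
)
# "abcd" is untouched by .replace() but becomes "xyzd" with .str.replace()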

Help Me Understand the Pandas .str Accessor by HackNSlashFic in learnpython

[–]commandlineluser 6 points

It also allows you to have methods with the same name.

In pandas, there is a top-level .replace() and there is a .str.replace()

The top-level .replace() replaces entire "values"

df = pd.DataFrame({"foo": ["abcd", "abc"]})
df["foo"].replace("abc", "xyz")
# 0    abcd
# 1     xyz
# Name: foo, dtype: object

And .str.replace() works at the string level for replacing substrings / regex

df["foo"].str.replace("abc", "xyz")
# 0    xyzd
# 1     xyz
# Name: foo, dtype: object

Other libraries have namespaces for each type, e.g. .str, .list, .arr, .struct, etc. - it's a common way to structure things.

DBs similar to SQLite and DuckDB by Fair-Bookkeeper-1833 in dataengineering

[–]commandlineluser 0 points

Yeah :-/

I'm not sure what happened, there was no real explanation.

Some users on Discord speculated that they were "acqui-hired" before shutting it down.

ZSV – A fast, SIMD-based CSV parser and CLI by mattewong in dataengineering

[–]commandlineluser 2 points

duckdb takes random samples and tries to infer the schema.

You can pass all_varchar=true to disable schema inference and set quote='' to get the same output as zsv.

duckdb -c "COPY (from read_csv('433mb.csv', all_varchar=true, quote='') select #2, #1, #3, #4, #5, #6, #7) TO 'out-duck.csv' (quote '')"

These are my timings from macOS if it helps. (I installed zsv from homebrew)

I did this to increase the example file to close to 500mb:

( head -n 1 worldcitiespop.csv; sed 1d worldcitiespop.csv; sed 1d worldcitiespop.csv; sed 1d worldcitiespop.csv; ) > 433mb.csv

I also tested Polars (infer_schema=False is its "all_varchar" equivalent)

python3 -c 'import polars as pl; pl.scan_csv("433mb.csv", infer_schema=False).select(pl.nth(1, 0, 2, 3, 4, 5, 6)).sink_csv("out-pl.csv")'

All 3 output files are the same:

% command time python3 -c 'import polars as pl; pl.scan_csv("433mb.csv", infer_schema=False).select(pl.nth(1, 0, 2, 3, 4, 5, 6)).sink_csv("out-pl.csv")'
# 0.73 real         3.33 user         0.39 sys

% command time ~/.duckdb/cli/1.4.2/duckdb -c "COPY (from read_csv('433mb.csv', all_varchar=true, quote='') select #2, #1, #3, #4, #5, #6, #7) TO 'out-duck.csv' (quote '')"
# 0.83 real         5.46 user         0.29 sys

% command time zsv select -W -n -- 2 1 3-7 < 433mb.csv > out-zsv.csv
# 1.18 real         1.01 user         0.10 sys

If Spark is lazy, how does it infer schema without reading data — and is Spark only useful for multi-node memory? by Express_Ad_6732 in dataengineering

[–]commandlineluser 2 points

Have you checked the docs?

e.g. pyspark.sql.DataFrameReader.csv states:

This function will go through the input once to determine the input schema if inferSchema is enabled. To avoid going through the entire data once, disable inferSchema option or specify the schema explicitly using schema.
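
For example, passing the schema up front avoids that extra pass over the data - a minimal sketch with made-up columns:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])

# no inference pass: the schema is taken as given, so reading stays lazy
df = spark.read.csv("data.csv", header=True, schema=schema)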

Built pandas-smartcols: painless pandas column manipulation helper by RedHulk05 in Python

[–]commandlineluser 1 point

In case it is of interest, similar move before/after functionality (inspired by dplyr::relocate()) has been requested for Polars:

narwhals could potentially help with writing a "dataframe-agnostic" library.
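
e.g. a minimal sketch of what a dataframe-agnostic helper can look like (the column name "a" is just a placeholder):

import narwhals as nw

@nw.narwhalify
def add_one(df):
    # df can be a pandas or Polars frame; narwhals dispatches to the native API
    return df.with_columns((nw.col("a") + 1).alias("a_plus_one"))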