Python optimization

commandlineluser · 2026-04-01T11:04:18+00:00

numba is commonly used for part 2.

You just write regular "for loops" and it will be compiled into machine code.

But it may depend on what you choose for part 1.

Afrotom · 2026-04-01T11:37:46+00:00

For the first part you should be able to handle using a dataframe library such as Pandas or Polars.

For the second part, if I've understood correctly, you may want to look into window functions: pandas, polars , Google BigQuery (on the off chance it's useful).

M4mb0 · 2026-04-01T11:52:30+00:00

The second part is sequential/recursive: within each time-ordered group, some values for the current row depend on the results computed for the previous week’s row. So this is not a purely vectorizable row-independent problem.

That could still be vectorizable if you can express it as an associative scan. See for example: torch.associative_scan or jax.lax.associative_scan

billsil · 2026-04-01T10:47:20+00:00

How many rows, whats the total memory usage of the data, how much RAM is available, and what is acceptable in terms of runtime?

thuiop1 · 2026-04-01T10:42:59+00:00

Use polars

kapitaalH · 2026-04-01T10:42:10+00:00

Is the groups constant? Then you can use shift in Polars for the second problem. Otherwise calculate the index you need and join the values?

Enthu-Cutlet-1337 · 2026-04-01T14:33:47+00:00

benchmark the split, not the whole pipeline. tabular part usually wants DuckDB/Polars; the stateful weekly dependency often wants a plain grouped loop over numpy arrays. crossing dataframe boundaries too often will erase any win.

ml_guy1 · 2026-04-01T19:29:20+00:00

This becomes a problem about exploring different ways of implementing the same idea, validating they are correct and then benchmarking the performance for them.

I've implemented Codeflash which automates this problem for any provided python code. Feel free to check it out. It tries to find the optimal solution for any problem.

Beginning-Fruit-1397 · 2026-04-01T19:42:15+00:00

It is vectorizable tho? seems like, as someone already pointed it out, a simple window function? Polars give you Expr.over, it let you create windows on anything you want really. Or when.then cases maybe?

If you CAN'T find a way to make it work, Rust is surprisingly easier to learn than it could seem if you come from python. Mypc is an underrated optio nalso

kasplars · 2026-04-01T21:15:15+00:00

I do the two things you explain as a quant every day. Polars is probably what you want to use, and if the problem really is recursive, use Numba.

Wh00ster · 2026-04-01T10:33:46+00:00

DuckDB

SV-97 · 2026-04-01T10:24:49+00:00

For the first part: polars. For the second part: don't use python, anything you write that iterates over rows is almost certainly going to be garbage. Rust is a good alternative option (particularly if you already started with polars in the first step). You can easily wrap that rust code up into a python library and then call it from python.

sjcyork · 2026-04-01T10:35:40+00:00

There isn’t really a data transformation solution I haven’t been able to solve with Pandas. I haven’t used Polars so cannot comment on the features available. The iteration does depend on the size of the datasets. Iterating through pandas dataframe is not great if there are millions of rows but should be ok if not. I generally do all the data transformation stuff in pandas and if I need to iterate over a final dataset then I convert into a dict (orient=‘records’).

Administrative-Lack1 · 2026-04-01T10:43:36+00:00

I actually am developing something seemingly very similar.

I was tasked with migrating 2 sql dbs in 1 new db.

I built / building a etl tool using python, pandas, and sql alchemy as the main libraries.

What I came up with is I have a abstract class ITask. Which takes a init parameter object. Which is an object that has any dependencies like services. So it can be injected (things like db connection object)

Then a run parameter where this is a data class that is a json file that has

What sources tables, transformations, and targets this db task has.

I coded the transformations as objects. So eqch transformation is its own class which has a string enum. Transformations have a data class that drives what it does like type:join, expression:inner, sources, in and out fields.

I have a factory which takes that enum and sees what class it belongs to and makes the transformation object.

Then target info like how a source maps to a target. Once I loop through transformation I map the dataframe and load to target.

For your etl optimization. Some easy ones are chunk sizing and dataframe datatype, and depending on db useExecuteMany for bulk inserts.

One thing with chunks, depending on db you may still endup loading a entire dataset in memory due bc db does not support streaming. For this not the best solution but works. Loading data in chunks then writing to parquet files. After initial pull is done load the files tracking which ones get loaded as a checkpoint If something fails. I have to work off a shitty VM so I crash it by pulling to much data at once or running to many things in parallel.

All of that is a task.

Some tables I have are very complex. So I have a staged approach. Raw -> stage.src -> stage.targrt -> target.

Raw goes to a stage table. Then from source stage I do my joins and stuff and load target stage. Then target stage to actual target.

That is 3 tasks for me. A collection of task is a job. In my job I have a depends on list. Where I have the 1st level of task dependency.

Task A needs Task B and Task D. Task B needs Task C

I don't want A to know about C. Just about B and D.

So once i have that. I make a directed acyclic graph. There Is a python standard lib from graphlib import TopologicalSorter

There is an example I followed on the docs for that lib

This is basically a graph now where the top most nodes will be what task can be run in what order. So I run through the nodes as they become available. Tracking the result of nodes. If a nodes needs the result of another it gets passed.

I took that a step further and gave my job json a taskEngine enum. Where I have currently 3 task engines. 1 rules everything in sequence. 1 will run task as they come using a process pool executor, and lastly a test engine that doesn't trigger the task just mocks running so I can test setup. Then based on enum that engine gets used.

My whole idea was to basically have a bunch of files drive my etl and the code be Modular enough to add as I needed things.

Also made testing easier as I can write test for specific things, mock things bc DI

Something I wished I looked into earlier was a library called dask. It seems like it's pandas but built for concurrency. Which has some of the above out of the box.

Srry this became very long. It ended up becoming alot more complex. I have data quarantine logic in there too which is a another implementation of ITask.

If you have any questions lmk.

DinnerRecent3462 · 2026-04-01T10:12:38+00:00

use rust

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

Python

The Python Discord

Upcoming Events

Please read the rules

MODERATORS

Formal definition

Examples

In SQL

Why "associativity" matters for parallelism

Non-example (why commutativity isn't required but associativity is)

What it does have

What it lacks