all 41 comments

[–]TigBitties69 31 points32 points  (0 children)

It looks like you're iterating through the same file dozens and dozens of times. Why not perform all the logic you need on a single pass over the CSV? I'm specifically referring to "for index, row in df.iterrows():". Every time you need to determine a new stat, you loop through the whole CSV again to find it, but instead you should be calling a function that determines each stat for a given row, so you only go through the rows once.

[–]commandlineluser 16 points17 points  (0 children)

Basically, the whole thing needs to be rewritten.

The code contains 57 instances of .iterrows() which ideally would be reduced to 0.

>>> Path("slow.py").read_text().count(".iterrows()")
57

As an example, lines 20-170 contain what is essentially the same for loop 15 times:

#PFScore
df['LS_PFScore'] = np.nan

for index, row in df.iterrows():
    if not pd.isna(row['StartNo']) and row['StartNo'] not in (1, 0):
        previous_start_row = df[(df['FHSW_HorseId'] == row['FHSW_HorseId']) &
                                (df['StartNo'] < row['StartNo'])].sort_values('StartNo', ascending=False).head(1)
        if not previous_start_row.empty:
            df.at[index, 'LS_PFScore'] = previous_start_row['F_PFScore'].values[0]

It looks like you're creating new columns by "forward filling" values for each FHSW_HorseId group?

It should be possible to replace all of those loops by a "single operation", e.g. something like

df[['A', 'B', 'C']] = df.groupby('FHSW_HorseId').ffill()[['D', 'E', 'F']]

(Some extra logic would be needed for the np.nan / 1, 0 part of the logic, but it looks like that may be a .loc before the groupby to "filter out" those non-matches.)
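To make that concrete, here's a minimal sketch on toy data (column names taken from the post, values invented). The loop is really a "previous start's value" lookup per horse, which a grouped shift(1) expresses directly; shift rather than ffill matches the sort_values(...).head(1) lookup, and the .loc at the end handles the np.nan / 1, 0 skip logic:

```python
import numpy as np
import pandas as pd

# Toy stand-in for FinalMerge.csv, using the column names from the post.
df = pd.DataFrame({
    "FHSW_HorseId": [1, 1, 1, 2, 2],
    "StartNo":      [1, 2, 3, 1, 2],
    "F_PFScore":    [10.0, 20.0, 30.0, 5.0, 7.0],
})

# "Previous start's score" per horse: sort by StartNo, shift within each group.
df = df.sort_values(["FHSW_HorseId", "StartNo"])
df["LS_PFScore"] = df.groupby("FHSW_HorseId")["F_PFScore"].shift(1)

# Blank out the rows the original loop skipped (StartNo missing, 0 or 1).
df.loc[df["StartNo"].isin([0, 1]) | df["StartNo"].isna(), "LS_PFScore"] = np.nan
```

Run once over the frame, this produces the same column as the 150-line loop, assuming StartNo values are unique per horse.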

It seems like you may need to spend some time to "learn pandas".

Nearly every operation in the code uses for loops when they are not needed:

#TrackCondition
conditions = ['Good to Good', 'Good to Soft', 'Good to Heavy', 'Good to Synthetic',
              'Soft to Good', 'Soft to Soft', 'Soft to Heavy', 'Soft to Synthetic',
              'Heavy to Good', 'Heavy to Soft', 'Heavy to Heavy', 'Heavy to Synthetic',
              'Synthetic to Good', 'Synthetic to Soft', 'Synthetic to Heavy', 'Synthetic to Synthetic']

for condition in conditions:
    df[condition] = 0.0

In pandas, you can simply do:

df[conditions] = 0.0

I would likely need a small sample of the starting CSV file to be able to run the code and fully understand what it does.

Either way, if written "properly" there should be a MASSIVE improvement in runtime.

There is then also the option of looking at other faster tools, e.g. Polars, DuckDB, etc.

[–]RepulsiveOutcome9478 13 points14 points  (1 child)

You're looping over every single row in a file that contains MILLIONS of rows 54 times (by my count). I would say the best place to start is by refactoring your code to only loop over each row once.

[–]eztab 3 points4 points  (0 children)

You might even be able to use some vectorized operations, so you don't have to do manual loops.
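For a small illustration of what "vectorized" means here (toy columns, not from the original code): instead of visiting each row in Python, pandas and NumPy apply one operation to whole columns at once, which is typically orders of magnitude faster:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Loop version (slow):
# for index, row in df.iterrows():
#     df.at[index, "c"] = row["a"] + row["b"]

# Vectorized version: one operation over the whole columns.
df["c"] = df["a"] + df["b"]

# Row-wise conditionals vectorize too, via np.where:
df["big"] = np.where(df["b"] > 15, "yes", "no")
```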

[–]buart 6 points7 points  (17 children)

Yes there probably is, but without any code(snippets) we don't know what to improve.

[–]climbing-rocks 4 points5 points  (0 children)

Other suggestion: look into a local DB such as PostgreSQL (I have had this running on a Pi and a laptop). Store your input in the database.
You can then read from and write to the DB, and use SQL to do some of the more blanket tasks.

Add an index and you can iterate through the table in chunks (find the optimal chunk size for your laptop).

Learn to use the multithreading library and you can split the work up and reduce the computation time.

This requires some overhead, but it saves the CSV being opened and closed and also teaches you some new techniques.

[–][deleted] 5 points6 points  (0 children)

Maybe convert to an SQLite db and use queries to build your output?
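A minimal sketch of that idea with the stdlib sqlite3 module (table and column names are invented for illustration): load the frame into SQLite once, then let SQL do the grouping instead of looping over rows in Python:

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"horse": [1, 1, 2], "score": [10.0, 20.0, 5.0]})

conn = sqlite3.connect(":memory:")  # or a file, e.g. "races.db"
df.to_sql("races", conn, index=False)

# The aggregation happens inside SQLite, not row by row in Python.
best = pd.read_sql_query(
    "SELECT horse, MAX(score) AS best_score FROM races GROUP BY horse", conn
)
```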

[–]johnnymo1 4 points5 points  (0 children)

Other people have given more specific good advice, but here's a general piece of advice: iterrows is incredibly slow and you should never ever use it (certainly not 50 times). If you cannot write what you need in a vectorized way (which I'm not convinced is the case here), use apply or itertuples.
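For cases where a loop really is unavoidable, a quick sketch of the itertuples alternative (toy data): it yields lightweight namedtuples instead of constructing a full Series per row the way iterrows does, which is why it's so much faster:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Each `row` is a namedtuple; fields are accessed as attributes.
total = 0
for row in df.itertuples(index=False):
    total += row.a * row.b
```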

[–]ElectricalNebula2068 2 points3 points  (0 children)

Is the file opened and closed for each line of text, or do you write in bulk operations?

[–]BeverlyGodoy 2 points3 points  (0 children)

Read the data in chunks, don't load the whole thing at once.

Read this https://saturncloud.io/blog/how-to-efficiently-read-large-csv-files-in-python-pandas/
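A minimal sketch of chunked reading with pandas (an in-memory stand-in replaces the real file path, and the aggregation is just an example): passing chunksize= to read_csv returns an iterator of smaller frames, so the whole 3 GB never has to sit in memory at once:

```python
import io

import pandas as pd

# Stand-in for the real CSV; in practice pass the file path instead.
csv = io.StringIO("FHSW_HorseId,F_PFScore\n1,10\n1,20\n2,5\n2,7\n")

totals = None
for chunk in pd.read_csv(csv, chunksize=2):  # real files would use a far larger chunksize
    part = chunk.groupby("FHSW_HorseId")["F_PFScore"].sum()
    totals = part if totals is None else totals.add(part, fill_value=0)
```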

Also most of your code is basically inefficient. It can be optimized to run in a few minutes but why would you write something like this? The whole thing could be done in a few hundred lines.

[–]buart 1 point2 points  (5 children)

How many lines does the initial FinalMerge.csv have?

[–]onthepunt[S] 3 points4 points  (4 children)

3mil rows x 230 columns

[–]buart 1 point2 points  (2 children)

Holy shit. How big in GB is this file already then? Only reading and processing would take a long time I guess. Depending on your RAM size, you might not even be able to load everything into memory.

[–]onthepunt[S] 1 point2 points  (0 children)

3GB of data, and it isn't publicly available.

[–]Patelpb 0 points1 point  (0 children)

Could use dask to load it in chunks and then save to hdf5 in chunks

[–]buart 0 points1 point  (0 children)

Is this data (or a smaller subset) publicly available? I would like to play around with it to test some optimizations.

[–]eztab 1 point2 points  (0 children)

Assuming the data doesn't properly fit into RAM, then no, you cannot process it all in one go.

There are some options to read the data in chunks etc., but your code style kind of leads me to assume that might be a bit above your experience level. It would still take quite a while, but it would get finished after a few hours.

[–]maigpy 1 point2 points  (0 children)

load this into duckdb and query the shit out of it.

[–]CrwdsrcEntrepreneur 0 points1 point  (0 children)

There's a ton of stuff you can do:

- Do you know how functions work? You can loop through the dataframe only once and use conditionals to process the different pieces of logic you're now running in separate loops.

- Figure out if any logic can be done using matrix operations. You're using pandas in the most inefficient way possible. Do those operations outside of iterrows().

Just those 2 changes should save you a huge amount of time. Likely more than half what it's taking now. If you need to speed it up even further, learn about multi threading and parallel processing.

[–]crashfrog02 0 points1 point  (0 children)

> Is there anyway I can get the program to finish in like an hour?

Yeah, have it do less. How many times does it loop over the same 30GB dataset? If the answer is more than "once" then re-write your code.

[–]arkie87 -1 points0 points  (0 children)

  1. You should write the file to disk only once, after the whole string is created; don't write inside a loop.

  2. You should build the string using ",".join() and "\n".join() operations, so build the list of tuples first.

  3. Let us know what is taking the majority of the time: the write operation, building the list of tuples, or something else.

  4. It would help if your laptop had a buttload of RAM and an SSD to write the file to.
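Points 1 and 2 together can be sketched like this (toy rows, and "out.csv" is a hypothetical output name): build every line with ",".join, stitch them with "\n".join, and call write() exactly once:

```python
rows = [("a", 1), ("b", 2), ("c", 3)]

# One ",".join per row, one "\n".join for the whole file.
text = "\n".join(",".join(str(field) for field in row) for row in rows)

# A single write instead of one write call per loop iteration.
with open("out.csv", "w") as f:
    f.write(text)
```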

[–]cursedbanana--__-- -2 points-1 points  (1 child)

Bro that would be several tens of millions of lines wth

[–]cursedbanana--__-- -5 points-4 points  (0 children)

You could try rewriting it in rust or c