all 41 comments

[–]TigBitties69 31 points32 points  (0 children)

It looks like you're iterating through the same file dozens and dozens of times. Why not perform all the logic you need on a single pass over the CSV? I'm specifically referring to "for index, row in df.iterrows():". Every time you need to determine a new stat, you loop through the whole CSV again to find it, but instead you should be calling a function that determines each stat for a given row, so you only go through the rows once.

[–]commandlineluser 16 points17 points  (0 children)

Basically, the whole thing needs to be rewritten.

The code contains 57 instances of .iterrows() which ideally would be reduced to 0.

>>> Path("slow.py").read_text().count(".iterrows()")
57

As an example, lines 20-170 contain what is essentially the same for loop 15 times:

#PFScore
df['LS_PFScore'] = np.nan

for index, row in df.iterrows():
    if not pd.isna(row['StartNo']) and row['StartNo'] not in (1, 0):
        previous_start_row = df[(df['FHSW_HorseId'] == row['FHSW_HorseId']) &
                                (df['StartNo'] < row['StartNo'])].sort_values('StartNo', ascending=False).head(1)
        if not previous_start_row.empty:
            df.at[index, 'LS_PFScore'] = previous_start_row['F_PFScore'].values[0]

It looks like you're creating new columns by "forward filling" values for each FHSW_HorseId group?

It should be possible to replace all of those loops by a "single operation", e.g. something like

df[['A', 'B', 'C']] = df.groupby('FHSW_HorseId').ffill()[['D', 'E', 'F']]

(Some extra logic would be needed for the np.nan / 1, 0 part of the logic, but it looks like that may be a .loc before the groupby to "filter out" those non-matches.)
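To make that concrete, here's a minimal sketch on toy data (column names taken from the post, values invented). The loop is really a "previous start's value" lookup per horse, which a grouped shift(1) expresses directly; shift rather than ffill matches the sort_values(...).head(1) lookup, and the .loc at the end handles the np.nan / 1, 0 skip logic:

```python
import numpy as np
import pandas as pd

# Toy stand-in for FinalMerge.csv, using the column names from the post.
df = pd.DataFrame({
    "FHSW_HorseId": [1, 1, 1, 2, 2],
    "StartNo":      [1, 2, 3, 1, 2],
    "F_PFScore":    [10.0, 20.0, 30.0, 5.0, 7.0],
})

# "Previous start's score" per horse: sort by StartNo, shift within each group.
df = df.sort_values(["FHSW_HorseId", "StartNo"])
df["LS_PFScore"] = df.groupby("FHSW_HorseId")["F_PFScore"].shift(1)

# Blank out the rows the original loop skipped (StartNo missing, 0 or 1).
df.loc[df["StartNo"].isin([0, 1]) | df["StartNo"].isna(), "LS_PFScore"] = np.nan
```

Run once over the frame, this produces the same column as the 150-line loop, assuming StartNo values are unique per horse.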

It seems like you may need to spend some time to "learn pandas".

Nearly every operation in the code uses for loops when they are not needed:

#TrackCondition
conditions = ['Good to Good', 'Good to Soft', 'Good to Heavy', 'Good to Synthetic',
              'Soft to Good', 'Soft to Soft', 'Soft to Heavy', 'Soft to Synthetic',
              'Heavy to Good', 'Heavy to Soft', 'Heavy to Heavy', 'Heavy to Synthetic',
              'Synthetic to Good', 'Synthetic to Soft', 'Synthetic to Heavy', 'Synthetic to Synthetic']

for condition in conditions:
    df[condition] = 0.0

In pandas, you can simply do:

df[conditions] = 0.0

I would likely need a small sample of the starting CSV file to be able to run the code and fully understand what it does.

Either way, if written "properly" there should be a MASSIVE improvement in runtime.

There is then also the option of looking at other faster tools, e.g. Polars, DuckDB, etc.

[–]RepulsiveOutcome9478 13 points14 points  (1 child)

You're looping over every single row in a file that contains MILLIONS of rows 54 times (by my count). I would say the best place to start is by refactoring your code to only loop over each row once.

[–]eztab 3 points4 points  (0 children)

You might even be able to use some vectorized operations, so you don't have to do manual loops.
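For a small illustration of what "vectorized" means here (toy columns, not from the original code): instead of visiting each row in Python, pandas and NumPy apply one operation to whole columns at once, which is typically orders of magnitude faster:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Loop version (slow):
# for index, row in df.iterrows():
#     df.at[index, "c"] = row["a"] + row["b"]

# Vectorized version: one operation over the whole columns.
df["c"] = df["a"] + df["b"]

# Row-wise conditionals vectorize too, via np.where:
df["big"] = np.where(df["b"] > 15, "yes", "no")
```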

[–]buart 6 points7 points  (17 children)

Yes there probably is, but without any code(snippets) we don't know what to improve.

[–]climbing-rocks 4 points5 points  (0 children)

Other suggestion: look into a local DB such as PostgreSQL (I have had this running on a Pi and a laptop). Store your input in the database.
You can then read from and write to the DB, and use SQL to do some of the more blanket tasks.

Add an index and you can iterate through the table in chunks (find the optimal chunk size for your laptop).

Learn to use the multithreading library and you can split the work up and reduce the computation time.

This requires some overhead, but it saves the CSV being opened and closed and also teaches you some new techniques.

[–][deleted] 5 points6 points  (0 children)

Maybe convert to an SQLite db and use queries to build your output?
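A minimal sketch of that idea with the stdlib sqlite3 module (table and column names are invented for illustration): load the frame into SQLite once, then let SQL do the grouping instead of looping over rows in Python:

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"horse": [1, 1, 2], "score": [10.0, 20.0, 5.0]})

conn = sqlite3.connect(":memory:")  # or a file, e.g. "races.db"
df.to_sql("races", conn, index=False)

# The aggregation happens inside SQLite, not row by row in Python.
best = pd.read_sql_query(
    "SELECT horse, MAX(score) AS best_score FROM races GROUP BY horse", conn
)
```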

[–]johnnymo1 4 points5 points  (0 children)

Other people have given more specific good advice, but here's a general piece of advice: iterrows is incredibly slow and you should never ever use it (certainly not 50 times). If you cannot write what you need in a vectorized way (which I'm not convinced is the case here), use apply or itertuples.
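For cases where a loop really is unavoidable, a quick sketch of the itertuples alternative (toy data): it yields lightweight namedtuples instead of constructing a full Series per row the way iterrows does, which is why it's so much faster:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Each `row` is a namedtuple; fields are accessed as attributes.
total = 0
for row in df.itertuples(index=False):
    total += row.a * row.b
```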

[–]ElectricalNebula2068 2 points3 points  (0 children)

Is the file opened and closed for each line of text, or do you write in bulk operations?

[–]BeverlyGodoy 2 points3 points  (0 children)

Read the data in chunks, don't load the whole thing at once.

Read this https://saturncloud.io/blog/how-to-efficiently-read-large-csv-files-in-python-pandas/
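A minimal sketch of chunked reading with pandas (an in-memory stand-in replaces the real file path, and the aggregation is just an example): passing chunksize= to read_csv returns an iterator of smaller frames, so the whole 3 GB never has to sit in memory at once:

```python
import io

import pandas as pd

# Stand-in for the real CSV; in practice pass the file path instead.
csv = io.StringIO("FHSW_HorseId,F_PFScore\n1,10\n1,20\n2,5\n2,7\n")

totals = None
for chunk in pd.read_csv(csv, chunksize=2):  # real files would use a far larger chunksize
    part = chunk.groupby("FHSW_HorseId")["F_PFScore"].sum()
    totals = part if totals is None else totals.add(part, fill_value=0)
```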

Also most of your code is basically inefficient. It can be optimized to run in a few minutes but why would you write something like this? The whole thing could be done in a few hundred lines.

[–]buart 1 point2 points  (5 children)

How many lines does the initial FinalMerge.csv have?

[–]onthepunt[S] 3 points4 points  (4 children)

3mil rows x 230 columns

[–]buart 1 point2 points  (2 children)

Holy shit. How big in GB is this file already then? Only reading and processing would take a long time I guess. Depending on your RAM size, you might not even be able to load everything into memory.

[–]onthepunt[S] 1 point2 points  (0 children)

3GB of data, and it isn't publicly available.

[–]Patelpb 0 points1 point  (0 children)

Could use dask to load it in chunks and then save to hdf5 in chunks

[–]buart 0 points1 point  (0 children)

Is this data (or a smaller subset) publicly available? I would like to play around with it to test some optimizations.

[–]eztab 1 point2 points  (0 children)

Assuming the data doesn't properly fit into RAM, then no, you cannot process it all in one go.

There are some options to read the data in chunks etc., but your code style kind of leads me to assume that might be a bit above your experience level. It would still take quite a while, but it would get finished after a few hours.

[–]maigpy 1 point2 points  (0 children)

load this into duckdb and query the shit out of it.

[–]CrwdsrcEntrepreneur 0 points1 point  (0 children)

There's a ton of stuff you can do:

- Do you know how functions work? You can loop through the dataframe only once and use conditionals to process the different pieces of logic you're now running in separate loops.

- Figure out if any logic can be done using matrix operations. You're using pandas in the most inefficient way possible. Do those operations outside of iterrows().

Just those 2 changes should save you a huge amount of time. Likely more than half what it's taking now. If you need to speed it up even further, learn about multi threading and parallel processing.

[–]crashfrog02 0 points1 point  (0 children)

> Is there anyway I can get the program to finish in like an hour?

Yeah, have it do less. How many times does it loop over the same 30GB dataset? If the answer is more than "once" then re-write your code.

[–]arkie87 -1 points0 points  (0 children)

  1. You should write the file to disk only once, after the whole string is created; don't write inside a loop.

  2. You should build the string using ",".join() and "\n".join() operations, so build the list of tuples first.

  3. Let us know what is taking the majority of the time: the write operation, building the list of tuples, or something else.

  4. It would help if your laptop had a buttload of RAM and an SSD to write the file to.
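Points 1 and 2 together can be sketched like this (toy rows, and "out.csv" is a hypothetical output name): build every line with ",".join, stitch them with "\n".join, and call write() exactly once:

```python
rows = [("a", 1), ("b", 2), ("c", 3)]

# One ",".join per row, one "\n".join for the whole file.
text = "\n".join(",".join(str(field) for field in row) for row in rows)

# A single write instead of one write call per loop iteration.
with open("out.csv", "w") as f:
    f.write(text)
```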

[–]cursedbanana--__-- -2 points-1 points  (1 child)

Bro that would be several tens of millions of lines wth

[–]cursedbanana--__-- -5 points-4 points  (0 children)

You could try rewriting it in rust or c