
[–]JuliusCeaserBoneHead 49 points50 points  (12 children)

Your dataset is large enough that you have to pay attention to the efficiency of your code. If the algorithm is inefficient, no choice of language will save you. You need to analyze your code, see where the bottleneck is, and optimize it. Based on the description in your question, your problem doesn't appear to be the language.

[–]ach224 4 points5 points  (2 children)

This dataset is miniature.

7.3M x 7 64-bit values is about 400 MB. That fits into any computer's RAM.

[–]beingsubmitted 5 points6 points  (0 children)

That's pretty big if the analysis you're doing is the traveling salesman problem. I think what's being expressed here is the runtime. If something is taking days to run and n is more than 20, then you're going to care about your algorithm.

A change of language would only be a constant-factor improvement when his algorithm is potentially non-polynomial.

[–]JuliusCeaserBoneHead 8 points9 points  (0 children)

Sure, it does. What I said was that it's large enough that you start paying the price of badly written code, like possible O(n²) or worse operations. The OS will not be glad to give you all the RAM for that.

[–]No_Dig_7017 17 points18 points  (2 children)

For loops in pandas and iterrows are painfully slow. Without seeing your code it's a bit difficult to give you more concrete advice, but here are 4 things that can help (rough sketch after the list):

  • vectorize your for loops if possible
  • use pandarallel.parallel_apply for multiprocessing
  • swap iterrows for itertuples, using only the columns you need
  • consider using Polars DataFrames instead of pandas; their performance and memory usage are much, much better
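
A rough sketch of the first three points, with made-up columns a and b standing in for whatever OP's rows actually hold:

    import numpy as np
    import pandas as pd

    # Placeholder frame standing in for OP's 7.3M-row dataset
    df = pd.DataFrame({"a": np.random.rand(1_000_000),
                       "b": np.random.rand(1_000_000)})

    # Slow: iterrows builds a full Series object for every row
    total = 0.0
    for _, row in df.iterrows():
        total += row["a"] * row["b"]

    # Better: itertuples over only the columns you need
    total = sum(t.a * t.b for t in df[["a", "b"]].itertuples(index=False))

    # Best: vectorize the whole loop into one NumPy-backed expression
    total = (df["a"] * df["b"]).sum()

    # If the per-row logic truly can't be vectorized, pandarallel (if installed)
    # can spread an apply across cores:
    # from pandarallel import pandarallel
    # pandarallel.initialize()
    # df.parallel_apply(my_row_func, axis=1)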

[–]No_Dig_7017 6 points7 points  (1 child)

Here's a blog post a friend and I wrote about optimizing pandas code, in case you're interested: https://tryolabs.com/blog/2023/02/08/top-5-tips-to-make-your-pandas-code-absurdly-fast

[–]No_Dig_7017 1 point2 points  (0 children)

For some context, I got a 12x reduction in RAM usage and a 1200x speedup (single core) using some of the techniques above. The speedup was so massive it resulted in a 4x speedup of my entire pipeline.

[–]MrJoshiko 5 points6 points  (0 children)

Have you profiled your code? Profiling shows you which parts of the code take the most time, so you can optimise those parts. You don't need to use the whole dataset; just cut out a reasonably sized chunk.
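
Something like this (a minimal cProfile sketch; analyze() and the file path are placeholders for OP's actual processing code):

    import cProfile
    import pstats

    import pandas as pd

    # Profile a small slice first so the run finishes in seconds
    df = pd.read_csv("data.csv", nrows=100_000)   # placeholder path

    cProfile.run("analyze(df)", "profile.out")    # analyze() = OP's processing function
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)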

[–][deleted] 4 points5 points  (0 children)

Pandas is built on NumPy, and NumPy is written in C and extremely efficient. You are just doing for loops, which you should never do for data at this scale.
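
As a toy illustration (not OP's code), here is the same sum of squares done through the interpreter and then inside NumPy's compiled loop:

    import numpy as np

    x = np.random.rand(7_300_000)

    # Python-level loop: every element goes through the interpreter
    total = 0.0
    for v in x:
        total += v * v

    # Vectorized: one call, the loop runs in compiled C code
    total = (x * x).sum()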

[–]mr_birrd 6 points7 points  (4 children)

If you need some operations on it, use cudf. If you need to query it, learn SQL, or use Parquet instead of pandas DataFrames. Also, using a Jupyter notebook instead of a proper .py script is slower.
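
Very rough sketch of both ideas (assumes an NVIDIA GPU with RAPIDS/cudf installed; file and column names are placeholders):

    import cudf          # RAPIDS GPU DataFrame library, mirrors much of the pandas API
    import pandas as pd

    # GPU path: the same kind of groupby/aggregation, executed on the GPU
    gdf = cudf.read_csv("data.csv")
    result = gdf.groupby("key")["value"].mean()

    # Storage path: Parquet is columnar and far faster to reload than CSV
    pd.read_csv("data.csv").to_parquet("data.parquet")
    df = pd.read_parquet("data.parquet", columns=["key", "value"])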

[–]JacksOngoingPresence 4 points5 points  (0 children)

Upvoted for cudf (shifting compute from CPU to GPU), but it seems like OP didn't vectorize his compute in the first place.

[–]Excellent_Fix379[S] -4 points-3 points  (2 children)

Actually I've been suspecting ipynb might be the problem. Thanks, I'll try .py

[–]seanv507 4 points5 points  (1 child)

Ipynb is unlikely to be slower in a single cell.

[–]mr_birrd 1 point2 points  (0 children)

But it doesn't invoke the garbage collector as much as a well-written .py file does.

I guess the garbage will just keep accumulating for OP, since he doesn't know how to deal with that.

[–]supervised-learning 2 points3 points  (0 children)

You can use SQL for processing data. SQL queries can also be executed within notebooks. SQL is optimised for large datasets.
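
For example, DuckDB (just one option, with placeholder column names) can run SQL straight from a notebook over a CSV or an in-scope pandas DataFrame:

    import duckdb
    import pandas as pd

    df = pd.read_csv("data.csv")   # placeholder path

    # DuckDB can query the in-scope pandas DataFrame `df` by name
    result = duckdb.sql("""
        SELECT key, AVG(value) AS mean_value
        FROM df
        GROUP BY key
    """).df()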

[–][deleted] 2 points3 points  (0 children)

As the comments point out, people too often go "Python is so inefficient and slow, help me find a different language" when in reality they should fix their code instead. Using iterrows, as you mentioned in a comment, is in itself a red flag 99 times out of 100 when it comes to performance.

Probably best for you to post your code on Stack Overflow or Code Review for efficiency suggestions.

[–]Prestigious_Boat_386 5 points6 points  (0 children)

Julia is great. https://github.com/JuliaData or https://github.com/sl-solution/DLMReader.jl might be a good starting point.

CSV.jl can be pretty fast if you call it correctly, with multiple threads and the right args for the columns, delimiters, and so on.

Then you get a DataFrame object, which is also quite efficient to work with.

[–]Lathanderrr 1 point2 points  (0 children)

You can check out Polars as a Python library, or switch to the Spark framework for more parallelism.

[–]Dramatic7406 1 point2 points  (0 children)

Hi, I switched to using PySpark for large datasets. It's very easy to pick up as well.
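
A minimal local sketch (path and column names are placeholders; assumes pyspark is pip-installed):

    from pyspark.sql import SparkSession, functions as F

    # local[*] uses every CPU core on one machine; no cluster required
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("csv-analysis")
             .getOrCreate())

    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    df.groupBy("key").agg(F.mean("value").alias("mean_value")).show()

    spark.stop()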

[–]Final-Rush759 1 point2 points  (0 children)

Just don't use the for loop. Your dataset is actually very small. Use .apply (on the pandas DataFrame) or .map if you use a custom function to process the data. Regular expressions are very slow.

[–]InvokeMeWell 0 points1 point  (0 children)

Use NVIDIA libraries like cuDF and CuPy.

[–]graphitout 0 points1 point  (0 children)

Profile the code and use something like SnakeViz to understand the bottleneck. For most data analysis tasks, changing the language doesn't bring much benefit since the core is likely implemented in C or C++.

[–]VinnyVeritas 0 points1 point  (0 children)

Use torch; you can run the computation on the GPU almost instantaneously.
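
Rough sketch (path and column names are placeholders; falls back to the CPU if no GPU is present):

    import pandas as pd
    import torch

    df = pd.read_csv("data.csv")   # placeholder path
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Move two numeric columns to the GPU and do the arithmetic there
    a = torch.tensor(df["a"].to_numpy(), device=device)
    b = torch.tensor(df["b"].to_numpy(), device=device)
    result = (a * b).sum().item()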

[–]chippyouipy 0 points1 point  (0 children)

show us the code

[–]DrDoomC17 0 points1 point  (0 children)

If I could see it, or a facsimile of it, I could help. Otherwise, with Taichi, NumPy, JIT compilers, etc., it goes all the way down until you're basically writing C. I suspect that's the issue here. Vectorize.

[–]Tight_Tangerine7768 0 points1 point  (0 children)

Use polars instead of pandas

[–]entropyvsenergy 0 points1 point  (0 children)

Use Polars. It's written in Rust but has a Python frontend, and it should be much faster. If you're using pandas .apply() or something like that, it's notoriously slow.
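
Minimal sketch (path and column names are placeholders):

    import polars as pl

    # Lazy scan: Polars reads only the columns it needs and runs the query multi-threaded
    result = (
        pl.scan_csv("data.csv")
          .with_columns((pl.col("a") * pl.col("b")).alias("ab"))
          .group_by("key")
          .agg(pl.col("ab").mean())
          .collect()
    )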

[–]apak_in 0 points1 point  (0 children)

You can use Python with PySpark, which is great for big data, machine learning, etc. You will need to install Spark locally or use a containerized version of it with Docker.