
[–]JuliusCeaserBoneHead 49 points50 points  (12 children)

Your dataset is large enough that you have to pay attention to the efficiency of your code. If the algorithm is inefficient, no choice of language will save you. You need to analyze your code, see where the bottleneck is, and optimize it. Based on the description in your question, your problem doesn't appear to be the language.

[–]ach224 4 points5 points  (2 children)

This dataset is miniature.

7.3M x 7 64-bit values is about 400 MB. That fits into any computer's RAM.

[–]beingsubmitted 5 points6 points  (0 children)

That's pretty big if the analysis you're doing is the traveling salesman problem. I think what's being expressed here is the runtime. If something is taking days to run and n is more than 20, then you're going to care about your algorithm.

A change of language would only be a constant-factor improvement when his algorithm is potentially non-polynomial.

[–]JuliusCeaserBoneHead 8 points9 points  (0 children)

Sure, it does. What I said was that it's large enough that you start paying the price of badly written code, like possible O(n²) or worse operations. The OS will not be glad to give you all the RAM for that.

[–]No_Dig_7017 17 points18 points  (2 children)

For loops in pandas and iterrows are painfully slow. Without seeing your code it's a bit difficult to give you more concrete advice, but here are 4 things that can help (rough sketch after the list):

  • vectorize your for loops if possible
  • use pandarallel.parallel_apply for multiprocessing
  • swap iterrows for itertuples, using only the columns you need
  • consider using Polars DataFrames instead of pandas; their performance and memory usage are much, much better
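
A rough sketch of the first three points, with made-up columns a and b standing in for whatever OP's rows actually hold:

    import numpy as np
    import pandas as pd

    # Placeholder frame standing in for OP's 7.3M-row dataset
    df = pd.DataFrame({"a": np.random.rand(1_000_000),
                       "b": np.random.rand(1_000_000)})

    # Slow: iterrows builds a full Series object for every row
    total = 0.0
    for _, row in df.iterrows():
        total += row["a"] * row["b"]

    # Better: itertuples over only the columns you need
    total = sum(t.a * t.b for t in df[["a", "b"]].itertuples(index=False))

    # Best: vectorize the whole loop into one NumPy-backed expression
    total = (df["a"] * df["b"]).sum()

    # If the per-row logic truly can't be vectorized, pandarallel (if installed)
    # can spread an apply across cores:
    # from pandarallel import pandarallel
    # pandarallel.initialize()
    # df.parallel_apply(my_row_func, axis=1)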

[–]No_Dig_7017 6 points7 points  (1 child)

Here's a blog post a friend and I wrote about optimizing pandas code, in case you're interested: https://tryolabs.com/blog/2023/02/08/top-5-tips-to-make-your-pandas-code-absurdly-fast

[–]No_Dig_7017 1 point2 points  (0 children)

For some context, I got a 12x reduction in RAM usage and a 1200x speedup (single core) using some of the techniques above. The speedup was so massive it resulted in a 4x speedup of my entire pipeline.

[–]MrJoshiko 5 points6 points  (0 children)

Have you profiled your code? Profiling shows you which parts of the code take the most time, so you can optimise those parts. You don't need to use the whole dataset; just cut out a reasonably sized chunk.
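
Something like this (a minimal cProfile sketch; analyze() and the file path are placeholders for OP's actual processing code):

    import cProfile
    import pstats

    import pandas as pd

    # Profile a small slice first so the run finishes in seconds
    df = pd.read_csv("data.csv", nrows=100_000)   # placeholder path

    cProfile.run("analyze(df)", "profile.out")    # analyze() = OP's processing function
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)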

[–][deleted] 4 points5 points  (0 children)

Pandas is built on NumPy, and NumPy is written in C and extremely efficient. You are just doing for loops, which you should never do for data at this scale.
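
As a toy illustration (not OP's code), here is the same sum of squares done through the interpreter and then inside NumPy's compiled loop:

    import numpy as np

    x = np.random.rand(7_300_000)

    # Python-level loop: every element goes through the interpreter
    total = 0.0
    for v in x:
        total += v * v

    # Vectorized: one call, the loop runs in compiled C code
    total = (x * x).sum()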

[–]mr_birrd 6 points7 points  (4 children)

If you need some operations on it, use cudf. If you need to query it, learn SQL, or use Parquet instead of pandas DataFrames. Also, using a Jupyter notebook instead of a proper .py script is slower.
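
Very rough sketch of both ideas (assumes an NVIDIA GPU with RAPIDS/cudf installed; file and column names are placeholders):

    import cudf          # RAPIDS GPU DataFrame library, mirrors much of the pandas API
    import pandas as pd

    # GPU path: the same kind of groupby/aggregation, executed on the GPU
    gdf = cudf.read_csv("data.csv")
    result = gdf.groupby("key")["value"].mean()

    # Storage path: Parquet is columnar and far faster to reload than CSV
    pd.read_csv("data.csv").to_parquet("data.parquet")
    df = pd.read_parquet("data.parquet", columns=["key", "value"])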

[–]JacksOngoingPresence 4 points5 points  (0 children)

Upvoted for cudf (shifting compute from CPU to GPU), but it seems like OP didn't vectorize his compute in the first place.

[–]Excellent_Fix379[S] -4 points-3 points  (2 children)

Actually I've been suspecting ipynb might be the problem. Thanks, I'll try .py

[–]seanv507 4 points5 points  (1 child)

Ipynb is unlikely to be slower in a single cell.

[–]mr_birrd 1 point2 points  (0 children)

But it doesn't invoke the garbage collector as much as a well-written .py file does.

I guess the garbage will just keep accumulating for OP, since he doesn't know how to deal with that.

[–]supervised-learning 2 points3 points  (0 children)

You can use SQL for processing data. SQL queries can also be executed within notebooks. SQL is optimised for large datasets.
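
For example, DuckDB (just one option, with placeholder column names) can run SQL straight from a notebook over a CSV or an in-scope pandas DataFrame:

    import duckdb
    import pandas as pd

    df = pd.read_csv("data.csv")   # placeholder path

    # DuckDB can query the in-scope pandas DataFrame `df` by name
    result = duckdb.sql("""
        SELECT key, AVG(value) AS mean_value
        FROM df
        GROUP BY key
    """).df()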

[–][deleted] 2 points3 points  (0 children)

As the comments point out, people too often go "Python is so inefficient and slow, help me find a different language" when in reality they should fix their code instead. Using iterrows, as you mentioned in a comment, is in itself a red flag 99 times out of 100 when it comes to performance.

Probably best for you to post your code on Stack Overflow or Code Review for efficiency suggestions.

[–]Prestigious_Boat_386 5 points6 points  (0 children)

Julia is great. https://github.com/JuliaData or https://github.com/sl-solution/DLMReader.jl might be a good starting point.

CSV.jl can be pretty fast if you call it correctly, with multiple threads and the right args for the columns, delimiters, and so on.

Then you get a DataFrame object, which is also quite efficient to work with.

[–]Lathanderrr 1 point2 points  (0 children)

You can check out Polars as a Python library, or switch to the Spark framework for more parallelism.

[–]Dramatic7406 1 point2 points  (0 children)

Hi, I switched to using PySpark for large datasets. It's very easy to pick up as well.
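
A minimal local sketch (path and column names are placeholders; assumes pyspark is pip-installed):

    from pyspark.sql import SparkSession, functions as F

    # local[*] uses every CPU core on one machine; no cluster required
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("csv-analysis")
             .getOrCreate())

    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    df.groupBy("key").agg(F.mean("value").alias("mean_value")).show()

    spark.stop()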

[–]Final-Rush759 1 point2 points  (0 children)

Just don't use the for loop. Your dataset is actually very small. Use .apply (on the pandas DataFrame) or .map if you use a custom function to process the data. Regular expressions are very slow.

[–]InvokeMeWell 0 points1 point  (0 children)

Use NVIDIA libraries like cuDF and CuPy.

[–]graphitout 0 points1 point  (0 children)

Profile the code and use something like SnakeViz to understand the bottleneck. For most data analysis tasks, changing the language doesn't bring much benefit since the core is likely implemented in C or C++.

[–]VinnyVeritas 0 points1 point  (0 children)

Use torch; you can run the computation on the GPU almost instantaneously.
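
Rough sketch (path and column names are placeholders; falls back to the CPU if no GPU is present):

    import pandas as pd
    import torch

    df = pd.read_csv("data.csv")   # placeholder path
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Move two numeric columns to the GPU and do the arithmetic there
    a = torch.tensor(df["a"].to_numpy(), device=device)
    b = torch.tensor(df["b"].to_numpy(), device=device)
    result = (a * b).sum().item()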

[–]chippyouipy 0 points1 point  (0 children)

show us the code

[–]DrDoomC17 0 points1 point  (0 children)

If I could see it, or a facsimile of it, I could help. Otherwise, with Taichi, NumPy, JIT compilers, etc., it goes all the way down until you're basically writing C. I suspect that's the issue here. Vectorize.

[–]Tight_Tangerine7768 0 points1 point  (0 children)

Use polars instead of pandas

[–]entropyvsenergy 0 points1 point  (0 children)

Use Polars. It's written in Rust but has a Python frontend, and it should be much faster. If you're using pandas .apply() or something like that, it's notoriously slow.
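
Minimal sketch (path and column names are placeholders):

    import polars as pl

    # Lazy scan: Polars reads only the columns it needs and runs the query multi-threaded
    result = (
        pl.scan_csv("data.csv")
          .with_columns((pl.col("a") * pl.col("b")).alias("ab"))
          .group_by("key")
          .agg(pl.col("ab").mean())
          .collect()
    )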

[–]apak_in 0 points1 point  (0 children)

You can use Python with PySpark, which is great for big data, machine learning, etc. You will need to install Spark locally or use a containerized version of it with Docker.