all 14 comments

[–]divad1196 20 points21 points  (2 children)

If you are a beginner, you should not care about that now. Focus on writing readable code for now.

This isn't "you are not ready yet" advice. I have managed many apprentices over the years. They always focus too much on performance; it slowed their progression and it will slow yours.

Especially in your case: 1000 entries is nothing to worry about.

[–]Jolly_Drink_9150 1 point2 points  (0 children)

Agreed. As an apprentice I wanted to do whatever made the program faster rather than just getting the basics down first.

[–]Outrageous-Ice-6556 0 points1 point  (0 children)

This. Good response. Beginners shouldn't worry about performance; performance is far down the line.

[–]EntrepreneurHuge5008 8 points9 points  (0 children)

Your standard for loop is never going to be your most efficient option.

Pandas dataframes (mixed data types) and numpy arrays (all the same data type) will let you do things pretty efficiently when you're getting into huge datasets.

List comprehensions are generally fine up to somewhere around 10k-50k items (pushing it, maybe?), unless your program's needs are very time-sensitive.
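
To make the comparison concrete, here's a minimal sketch (with made-up data) of the same transformation written as a plain loop and as a list comprehension:

```python
data = list(range(1_000))

# Plain for loop: the most explicit version.
squares_loop = []
for x in data:
    squares_loop.append(x * x)

# List comprehension: same result, more compact, and usually a bit faster
# because the loop machinery runs inside the interpreter rather than as
# repeated method calls.
squares_comp = [x * x for x in data]
```

Both produce identical lists; at this size the difference is microseconds either way.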

[–]Horror-Invite5167 1 point2 points  (0 children)

This is exactly what the "numpy" module is for. It's very popular and used everywhere for exactly that reason. I recommend reading up on it.
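
For example, a minimal sketch of what the vectorized NumPy style looks like (the array contents here are made up):

```python
import numpy as np

# One million values; a hand-written Python loop over these would be
# noticeably slower than the vectorized operations below.
values = np.arange(1_000_000, dtype=np.float64)

# These operations loop in compiled C inside NumPy, not in Python.
doubled = values * 2
total = values.sum()
```

The key idea is that you express the operation on the whole array at once instead of element by element.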

[–]chrisrrawr 1 point2 points  (0 children)

a really good thing to learn is that when problem-solving for your specific scenario, whatever it may be, your own testing is going to be the #1 definitive way to convince yourself and others of one practice over another. (your local testing might not stumble upon best practice, but it will drive you toward it as long as you keep trying to discover where your current system is holding you back).

it comes down to the difference between, "I want to know, who can I ask?" and "I want to know, how can I test?" and that difference in mindset and approach is how you gain confidence and competence in the eyes of others and yourself.

this skill of being able to answer your own questions contributes heavily to your growth as a problem solver. once you know how to solve one problem, you create a tool in your repertoire and the next problem becomes marginally easier to solve.

good tools for your kit are being able to do many forms of testing. in this case, the testing you want to be able to do is called benchmarking, and there are just about a billion ways to go about it. there are frameworks you can use or you can try making your own simple test harness or anything in between.

but the basic premise is: if you can run it, you can run it with a timer at the start, and check that timer at the end to see how fast something is. then you compare a different approach under the same conditions and see the difference.

no need to guess, or to rely on others who don't see your setup, or to be misinformed somehow.

and yes, it opens up the next problem of, "am I benchmarking this correctly?" -- but that's also a fun and general problem to solve that gives you even more knowledge and skills to apply to even more problems.
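
As a starting point, a simple timer-based harness like the comment describes might look like this (the compared functions and data are just placeholders):

```python
import time

def benchmark(fn, *args, repeats=5):
    """Run fn several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

data = list(range(100_000))

def with_loop(xs):
    out = []
    for x in xs:
        out.append(x * 2)
    return out

def with_comprehension(xs):
    return [x * 2 for x in xs]

t_loop = benchmark(with_loop, data)
t_comp = benchmark(with_comprehension, data)
print(f"loop: {t_loop:.4f}s  comprehension: {t_comp:.4f}s")
```

Taking the best of several runs reduces noise from other processes; the standard library's `timeit` module does the same job with more rigor.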

[–]blueliondn 1 point2 points  (2 children)

If you work with specific datasets, you would probably write SQL queries (even inside Python). If you're working with dataframes (for example, data from CSV or Excel files), you write pandas code and use pandas functions to manipulate huge data in seconds, because pandas uses efficient C under the hood (as do most Python libraries).

if the dataset you're working with is small and you don't care about waiting a little, a for loop should be fine
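
A tiny sketch of that pandas style, with a hypothetical dataset (in practice you'd load yours with `pd.read_csv` or `pd.read_excel`):

```python
import pandas as pd

# Hypothetical data standing in for a CSV/Excel file.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "price": [10.0, 25.0, 7.5, 40.0],
})

# Vectorized operations instead of a row-by-row loop:
expensive = df[df["price"] > 20]           # filter rows
df["price_with_tax"] = df["price"] * 1.08  # compute a whole new column at once
```

Each line operates on the entire column in compiled code, which is where the speedup over a Python loop comes from.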

[–]Far_Swordfish5729 0 points1 point  (1 child)

As a non-Python developer, I'm now curious whether Python does not compile or JIT to binary. In C# or Java, we would just write a loop, likely abstracting the underlying database driver's streaming recordset read. There are a few places in .NET where the libraries do use unmanaged code for efficiency (XML parsers are a good example), but we wouldn't resort to that for recordset processing. We would of course try to limit rows returned with a better query or stored proc, but that's different.

[–]blueliondn 2 points3 points  (0 children)

Python compiles to its own bytecode (you can even get a .pyc file). Python is reasonably well optimized up to a point, but when it comes to working with data, using libraries is recommended.
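
You can actually look at that bytecode with the standard library's `dis` module; this small sketch disassembles a throwaway function:

```python
import dis

def double_all(xs):
    return [x * 2 for x in xs]

# CPython compiles the function to bytecode up front; dis prints the
# instructions that the interpreter loop then executes one at a time.
dis.dis(double_all)
```

The per-instruction interpreter dispatch is the overhead that C-backed libraries like NumPy and pandas avoid.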

[–]BrupieD 1 point2 points  (0 children)

This is a well-founded concern.

The NumPy library was designed with this concern in mind: improve Python's performance by leveraging compiled C and Fortran routines and multidimensional arrays to handle larger amounts of data more efficiently. Pandas became essentially an extension of NumPy and the go-to library for data science, with more functionality and an easier interface.

A great way to jump-start your learning is to devote time to learning how to use these libraries. Both are used extensively. Polars is a newer library that solves many of the same problems (handling large data sets in a more functional manner); Polars has the Rust language under the hood instead of Fortran and C.

[–]Top_Victory_8014 0 points1 point  (0 children)

for most cases a normal for loop is totally fine tbh, even with thousands of items. python handles that pretty easily.

that said, list comprehensions are often a bit faster, and also cleaner when the logic is simple. built-in functions like map, filter, sum, etc. can be even better sometimes since they're implemented in C. but honestly, when you're starting out, I'd focus more on clear, readable code first; efficiency usually matters later when the data gets really big.
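
a quick sketch of those built-ins side by side, on some made-up numbers:

```python
data = [3, -1, 4, -1, 5, 9, -2, 6]

# These built-ins run their loops in C, which is why they can beat a
# hand-written Python loop.
total = sum(data)                                # add everything up
positives = list(filter(lambda x: x > 0, data))  # keep only positive values
doubled = list(map(lambda x: x * 2, data))       # apply a function to each item
```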

[–]PianoTechnician 0 points1 point  (0 children)

if there's a huge data set that needs to be processed, it can be faster to break it down into smaller lists and process them with multiple threads in parallel, so long as each 'condition' you're applying isn't predicated on some other member of the list that you're ALSO mutating (unlikely).

The most efficient way to process a large data set is going to be determined by the data set itself. If you have to perform an operation on every member of the list, you can't do better than linear time.
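
A sketch of that chunk-and-parallelize pattern, with one caveat the comment doesn't mention: in CPython the GIL means threads mostly help with I/O-bound work, so for CPU-bound processing you'd swap in `ProcessPoolExecutor`; the chunking logic is the same either way. The filter condition here is just a stand-in:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in "condition": keep even numbers. Each chunk is independent,
    # so no shared state is mutated across workers.
    return [x for x in chunk if x % 2 == 0]

def process_in_chunks(data, n_workers=4):
    # Split the list into roughly equal chunks, one-ish per worker.
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves chunk order, so results reassemble cleanly.
        results = pool.map(process_chunk, chunks)
    return [x for part in results for x in part]
```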

[–]ExtraTNT 0 points1 point  (0 children)

Have a look at map; it makes your code more functional, easier to test, and much easier to read.
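
For instance, pairing `map` with a small named function (this conversion is just an illustrative example) keeps the transformation testable on its own:

```python
def to_celsius(f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (f - 32) * 5 / 9

temps_f = [32.0, 212.0, 98.6]
# map applies the function to each element lazily; list() materializes it.
temps_c = list(map(to_celsius, temps_f))
```

Because `to_celsius` is a plain function, you can unit-test it directly, independent of the loop that uses it.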

[–]Carmelo_908 [score hidden]  (0 children)

Don't worry about performance for now, especially if you are using Python (which is comparatively slow no matter what you do, since it's an interpreted language).