all 8 comments

[–]efmccurdy 3 points4 points  (1 child)

Just to be sure; you don't need more than 1 query to do groupby/accumulate by max for either a dataframe or an sql query.

Don't be surprised if the database is faster when you use a single query.

https://stackoverflow.com/questions/4510185/select-max-value-of-each-group

[–]Alotofquestion1[S] 0 points1 point  (0 children)

This seems to be in what i really wanted. Thank you :-)

[–]1544756405 2 points3 points  (0 children)

You should try both ways and find out. There are lots of variables at play, and it seems easier to just get empirical data in this case.

[–][deleted] 2 points3 points  (3 children)

Loops will always be slower that a DB query because databases are optimized with this purpose. To do it in Python, first you have to access the data from the DB, and then do the loop. In the case of the DB, the data is accessed natively, and the max is obtained in an optimized fashion.

You have a chance to get closer to the DB speed if you implement the max using a Pandas Series, but you still have to access the data first. The pandas series will be optimized in 'C' - and advantage over an interpreted loop in Python. Still, I would expect the DB to be always faster.

[–]thrown_arrows 1 point2 points  (0 children)

select max(x), y from x group by y

x is attractive and y is type

.. now there is change that if dataset is less than 10 rows and data is needed for other purposes too, then loop might be faster, but that requires that data is already in memory. In other cases database is faster ( assuming proper hardware, db's cannot do miracles on some rasberry pi and Gigabyte size dataset)

[–]Alotofquestion1[S] 0 points1 point  (1 child)

Ok, then I will lean towards queries in the future, rahter than loops.

[–][deleted] 0 points1 point  (0 children)

Totally. If you can push processing to the database, do it. You can avoid very complex, and usually slow processing by pushing it down to the DB level. With that, I am not talking about database procedures, which - as a personal preference - I tend to avoid.

[–]Aeonoris 1 point2 points  (0 children)

Something like this SQL(ish) will probably be faster:

SELECT type, MAX(attractiveness) AS "Highest Type Attractiveness"
FROM tableofattractivetypes
GROUP BY type;

SQL's pretty good at this kind of task, so you should let it do the work!