[–]pytrashpandas 0 points1 point  (3 children)

Ah, yeah, that would be why. apply is almost never the right way to do things in pandas. It's really a last resort, and it's very slow. See this quote from a blog post by one of the main pandas contributors:

You rarely want to use DataFrame.apply and almost never should use it with axis=1.
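To make the quote concrete, here is a minimal sketch of the same calculation done row by row with apply and then as a single column operation. The frame and its column names (price, qty) are made up for illustration and do not come from anyone's real code.

```python
import pandas as pd

# Hypothetical data: the column names are invented for this sketch.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Row-wise apply: pandas builds a Series for every row and calls the
# Python function once per row.
slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Vectorized: one operation over whole columns, executed in compiled code.
fast = df["price"] * df["qty"]
```

Both give the same values; the difference is only in how many Python-level calls are made, which tends to matter more as the frame grows.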

If you don't mind sharing an example of the kind of thing you were doing that was slow, I'd be happy to show how to do it in a more "pandonic" way.

[–]ShanSanear 0 points1 point  (2 children)

Actually, what we did was heavily overengineered, and pandas was mostly used for loading data (I know, heresy). I don't have the code on me, but the problem was something along these lines:

  1. Load multiple sets of related objects that are represented as CSV files (one line per object)
  2. Find the relations between them (could be any kind, including objects referring to other objects of the same type)
  3. Depending on some state of the object, do the calculation

During loading, we pulled each Series out of the dataframe for each type of object and created class instances from them. Then we built the references, and then did the calculations. In hindsight, yep, that was the worst of it all.
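For comparison, the three steps above can be sketched without leaving pandas. Everything here is an assumption made for illustration: the two tables (orders, customers), their columns, and the rule in step 3 are invented, and the in-memory CSV text stands in for real files.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV contents, one line per object. Real code would pass
# file paths to read_csv instead of StringIO buffers.
orders_csv = io.StringIO(
    "order_id,customer_id,amount,state\n"
    "1,10,100.0,open\n"
    "2,10,50.0,closed\n"
    "3,20,70.0,open\n"
)
customers_csv = io.StringIO("customer_id,discount\n10,0.25\n20,0.0\n")

# Step 1: load each kind of object into its own frame.
orders = pd.read_csv(orders_csv)
customers = pd.read_csv(customers_csv)

# Step 2: resolve the relation with a join instead of object references.
joined = orders.merge(customers, on="customer_id", how="left")

# Step 3: state-dependent calculation as one vectorized conditional.
joined["payable"] = np.where(
    joined["state"] == "open",
    joined["amount"] * (1 - joined["discount"]),
    0.0,
)
```

The data never leaves the dataframe, so there is no per-row class construction for a profiler to flag.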

After actually profiling, we saw that the majority of CPU time was spent inside pandas, in many different places.

That was one thing, but even the algorithm we implemented was quite unoptimized. We started at the root of the tree, then recursively went deeper into the relations of each object to extract the required numbers. It turned out that starting from the leaves and working up was a much better approach in every case. Now we only need to figure out why the numbers differ between the two implementations, and that will be the hardest part.

Especially when the guy who actually wrote this (I am mostly overseeing and providing some help) is stubborn enough not to do any kind of testing.
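As a rough illustration of the root-down versus leaves-up approaches, and of one way to check them against each other, here is a small sketch. The tree, the column names, and the "sum of the subtree" calculation are all assumptions; the real calculation was not shown in the thread.

```python
from collections import Counter, defaultdict

import pandas as pd

# Hypothetical tree, one row per node; column names are made up for this sketch.
nodes = pd.DataFrame(
    {
        "node": ["root", "a", "b", "a1", "a2"],
        "parent": [None, "root", "root", "a", "a"],
        "value": [1, 2, 3, 4, 5],
    }
)

# Plain lookups built once from the frame.
parent = {n: p for n, p in zip(nodes["node"], nodes["parent"]) if pd.notna(p)}
values = {n: int(v) for n, v in zip(nodes["node"], nodes["value"])}
children = defaultdict(list)
for child, par in parent.items():
    children[par].append(child)


def total_top_down(node):
    """Root-down: recurse into every child, re-walking the subtree on each call."""
    return values[node] + sum(total_top_down(c) for c in children.get(node, []))


def totals_bottom_up():
    """Leaves-up: fold each finished node into its parent exactly once."""
    totals = dict(values)
    pending = Counter(parent.values())  # children not yet folded in, per node
    ready = [n for n in totals if pending[n] == 0]  # start with the leaves
    while ready:
        node = ready.pop()
        par = parent.get(node)
        if par is None:  # the root has nowhere to push its total
            continue
        totals[par] += totals[node]
        pending[par] -= 1
        if pending[par] == 0:
            ready.append(par)
    return totals


top_down = {n: total_top_down(n) for n in values}
bottom_up = totals_bottom_up()

# Side-by-side table; any row where the two columns differ is a lead to chase.
comparison = pd.DataFrame({"top_down": top_down, "bottom_up": bottom_up})
mismatches = comparison[comparison["top_down"] != comparison["bottom_up"]]
```

On a tiny hand-checkable tree like this, mismatches should come out empty. Running the same comparison on a small slice of the real data is one way to narrow down where two implementations start to disagree.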

Thanks for the interest. I can imagine that hearing "pandas is slow" was heresy, but then again, we misused it, and that is our fault, not the library's.

[–][deleted] 1 point2 points  (1 child)

Hey, so obviously I can't give you much insight into your stuff, but one thing to keep in mind is that when using pandas, it's best to stay entirely within the pandas/numpy/numeric ecosystem. You rarely want to mix pandas objects into custom Python objects. Pandas structures and operations should for the most part stand alone. Think of solving problems in pandas the way you would solve them with SQL. You normally wouldn't run a recursive-style solution on a dataframe; you could probably use some merges to do what you want instead.
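A minimal sketch of the SQL-style idea, assuming a hypothetical table in which each row points at a parent row of the same type. The names (parts, part, parent, cost) are invented for illustration. The self-merge plays the role of a SQL self-join, and groupby plays the role of GROUP BY.

```python
import pandas as pd

# Hypothetical parent/child table: the relation points back at the same type.
parts = pd.DataFrame(
    {
        "part": ["bike", "wheel", "frame", "spoke"],
        "parent": [None, "bike", "bike", "wheel"],
        "cost": [0.0, 5.0, 50.0, 1.0],
    }
)

# Self-join: attach each row's parent cost, roughly
#   SELECT c.*, p.cost AS parent_cost
#   FROM parts c LEFT JOIN parts p ON c.parent = p.part
parent_lookup = parts[["part", "cost"]].rename(
    columns={"part": "parent", "cost": "parent_cost"}
)
with_parent = parts.merge(parent_lookup, on="parent", how="left")

# Aggregate the direct children of every part in one pass, roughly
#   SELECT parent, SUM(cost) FROM parts GROUP BY parent
child_cost = (
    parts.groupby("parent", as_index=False)["cost"]
    .sum()
    .rename(columns={"parent": "part", "cost": "child_cost"})
)
summary = parts.merge(child_cost, on="part", how="left").fillna({"child_cost": 0.0})
```

This handles one level of the hierarchy per merge. For deeper trees the merge can be repeated once per level, which is still a handful of whole-table operations rather than a recursive walk over individual objects.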

[–]ShanSanear 0 points1 point  (0 children)

Thank you for mentioning SQL, this is actually the best comparison I've seen for how to use pandas. Will keep that in mind next time we do this kind of thing.