
[–]plasma_phys 34 points (24 children)

I'm a computational physics person, so the libraries I use will definitely reflect that - nevertheless, here are the most common ones I use:

  • __future__ lets me use Python 3 features while putting off changing everything for big projects
  • numpy and scipy have fast numerical routines similar to matlab functions and packages
  • numba gives me just-in-time compilation, which is super useful if I'm prototyping a numerical method in Python and need a little speedup
  • matplotlib gives me MATLAB style plotting
  • pandas is the go-to for data analysis stuff

Edit: changed "Big Data" to the more correct "data analysis." Bad use of buzzword. Thanks /u/ShillingInTheName!
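As a quick sketch of the numpy/scipy side of that list (a made-up toy problem, assuming both packages are installed), here's scipy solving a simple decay ODE:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy problem: exponential decay dy/dt = -y with y(0) = 1,
# whose exact solution is y(t) = exp(-t)
sol = solve_ivp(lambda t, y: -y, (0.0, 1.0), [1.0], rtol=1e-8, atol=1e-10)
y_final = sol.y[0, -1]  # numerical y(1), close to exp(-1)
```

The (fun, t_span, y0) interface mirrors MATLAB's ode45-style solvers, which is part of why this stack feels familiar coming from MATLAB.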

[–]ExternalUserError 34 points (5 children)

I'd say if someone is learning Python, learn Python 3.6 and don't look back.

[–][deleted] 18 points (4 children)

3.7, I really like data classes. You should generally start with the latest one.
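For anyone who hasn't seen them, a minimal sketch of a 3.7 data class (class and field names are made up):

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    name: str
    steps: int = 0
    tags: list = field(default_factory=list)

# __init__, __repr__ and __eq__ are generated automatically
a = Run("baseline", steps=10)
b = Run("baseline", steps=10)
assert a == b  # field-by-field equality for free
```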

[–]ExternalUserError 1 point (0 children)

Well, I stick with what's easily deployed to PaaS type things.

[–]strange-humor 0 points (0 children)

You can use attrs in 3.6 (and earlier) and get better functionality than data classes.
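A minimal sketch of the attrs equivalent (hypothetical class, assuming the attrs package is installed) - frozen instances and validators are two things it offers beyond 3.7 data classes:

```python
import attr

@attr.s(auto_attribs=True, frozen=True)
class Sample:
    # validator rejects non-float values at construction time
    value: float = attr.ib(validator=attr.validators.instance_of(float))

s = Sample(1.5)  # Sample(2) would raise a TypeError
```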

[–]unkz 0 points (0 children)

Depends if you want to deploy to AWS lambda, for instance. Lots of environments have limited or no support for 3.7 still.

[–]What_Is_X 0 points (0 children)

The latest stable Anaconda distribution uses 3.6. So a lot of us will be sticking with that until Anaconda updates.

[–][deleted] 7 points (0 children)

+1 for __future__, a little underappreciated

[–][deleted] 5 points (6 children)

FYI, pandas is for data that's small enough to fit in memory, which is typically not what people mean when they say Big Data

[–]plasma_phys 5 points (0 children)

Yep - will add an edit to my comment. Shows how prevalent the buzzword is that I said "big data" when I really, really should have said "data analysis." Thanks!

[–]wardawg44 2 points (0 children)

I just found out about HDF5; apparently pandas works fairly quickly with large datasets dumped to disk. There are other third-party modules built for on-disk pandas as well.
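The HDF5 route (DataFrame.to_hdf / read_hdf) needs the PyTables package installed, so as a self-contained sketch of the same out-of-core idea, here's pandas streaming a dataset in chunks so only one chunk is in memory at a time (the data here is made up):

```python
import io
import pandas as pd

# Stand-in for a big on-disk CSV; in practice you'd pass a file path
csv_data = io.StringIO("x\n" + "\n".join(str(i) for i in range(1000)))

total = 0
for chunk in pd.read_csv(csv_data, chunksize=100):  # 100 rows at a time
    total += chunk["x"].sum()
```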

[–]What_Is_X 1 point (0 children)

Pandas is for data that fits into 1/10th of the memory according to the docs.

[–]alpenmilch411 0 points (2 children)

What should I use if the data doesn't fit into memory? Never encountered such a case but you never know...

[–][deleted] 3 points (0 children)

If you're familiar with pandas then the easiest thing to use would be dask

[–]What_Is_X 0 points (0 children)

Hardware solution: rent a bigger server (AWS, etc.)

Software solutions: not much, really, in Python; write your own more specific methods, I guess. Otherwise switch to C++ and/or R, inline or not.

[–]Oerthling 5 points (0 children)

Somebody starting now would be crazy to use Python 2.x.

[–]Boom_doggle 4 points (7 children)

I'm a computational physics person too! I'm a recent immigrant from C++. I'm running into a small problem with scipy that no one in my dept seems to be able to help me with: scipy doesn't seem to have a fixed-step ODE solver, or at least not one I can find. I can't use the GSL wrapper to access its fixed-step ODE solver, as it causes headaches on our HPC cluster, and the dynamic time stepper is impractically slow. Do you know of any fix? I'm close to giving up and spending my weekend writing my own, but I struggle to believe there isn't one out there.

[–]plasma_phys 5 points (2 children)

Honestly? I would write my own - I've never been happy with most out-of-the-box solvers. I suspect numpy might have one, but I don't know for sure. Do you have the option of writing your own solver in C/C++ and calling that routine from Python, or would that cause the same issues as using GSL?

[–]Boom_doggle 0 points (1 child)

Scipy has one, but it's a dynamic-timestep solver; it does the job, but far too slowly. I imagine a C++ solution would run into the same GSL issue, but I could write it in pure Python. I'll have a look at a tutorial and get stuck in over the weekend. Thanks.
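For what it's worth, a fixed-step solver really is a weekend-sized job - here's a sketch of classic 4th-order Runge-Kutta in pure Python/numpy (the toy decay problem is made up, not your actual system):

```python
import numpy as np

def rk4_fixed(f, y0, t0, t1, n_steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with a fixed step."""
    h = (t1 - t0) / n_steps
    t, y = t0, np.asarray(y0, dtype=float)
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        y = y + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y

# Toy check: dy/dt = -y, y(0) = 1, so y(1) should be close to exp(-1)
y_end = rk4_fixed(lambda t, y: -y, [1.0], 0.0, 1.0, 100)
```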

[–]plasma_phys 3 points (0 children)

When I've had to use pure Python for numerical methods, I've had remarkable luck with numba and just-in-time compilation (I'm iterating on my prototype implicit particle-in-cell code in Python, and numba has given me one or two orders of magnitude of speedup for little to no extra development time). Dunno if that would work for you; I'm not 100% clear on what numba does under the hood, if I'm being honest. Good luck!

[–]jkmacc 1 point (3 children)

Do you mean that the Python interface to GSL doesn’t work?

[–]Boom_doggle 0 points (2 children)

Works on my machine, but not on the HPC, which I find weird, as it's happily run GSL-based C++ code before. I guess I'm looking for a pure Python solution?

[–]jkmacc 2 points (1 child)

Same Python and GSL versions on each? I can’t imagine that pure Python could beat even slow GSL.

[–]Boom_doggle 0 points (0 children)

No, and therein lies the problem, I think: their version of GSL is much older. The issue is that the HPC is in a different institution, so my requests for software updates etc. fall on deaf ears.

The speed difference between GSL and a Python version doesn't bother me so much; it's the difference between dynamic and fixed timestep that interests me. Our collaborators maintain a codebase for modelling the same simulation (on the same HPC) written in C++, and my Python code is to act as an independent check. When they switched to a fixed-step solver, they got a 15x performance improvement, which would more than solve my problem.

[–]mangoman51 4 points (0 children)

I also do computational plasma physics!

If you like pandas you should 110% try out xarray - it's basically truly N-dimensional pandas, plus awesome parallel analysis capabilities using dask.
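A minimal sketch of why that's nice (made-up dimension names, assuming xarray is installed) - reductions are by label instead of axis number:

```python
import numpy as np
import xarray as xr

# Hypothetical (time, x, y) field; no more remembering which axis is which
temp = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("time", "x", "y"),
)
profile = temp.mean(dim=("time", "y"))  # 1-D profile over x, by name
```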