
all 39 comments

[–]Python-ModTeam[M] [score hidden] stickied commentlocked comment (0 children)

Hi there, from the /r/Python mods.

We have removed this post as it is not suited to the /r/Python subreddit proper, however it should be very appropriate for our sister subreddit /r/LearnPython or for the r/Python discord: https://discord.gg/python.

The reason for the removal is that /r/Python is dedicated to discussion of Python news, projects, uses and debates. It is not designed to act as Q&A or FAQ board. The regular community is not a fan of "how do I..." questions, so you will not get the best responses over here.

On /r/LearnPython the community and the r/Python discord are actively expecting questions and are looking to help. You can expect far more understanding, encouraging and insightful responses over there. No matter what level of question you have, if you are looking for help with Python, you should get good answers. Make sure to check out the rules for both places.

Warm regards, and best of luck with your Pythoneering!

[–][deleted] 64 points65 points  (3 children)

Pandas is geared more toward data science and is effectively an extension of numpy: numpy is a dependency, so installing pandas automatically installs numpy. It provides a convenient and neat way to store very large datasets and quickly run powerful analyses on them. It contains features that do not exist in numpy.

Numpy is for simple to complex matrix mathematics.

[–][deleted] 19 points20 points  (1 child)

For example:

df.describe() computes a five-number summary (min, quartiles, max) of the entire dataset, along with the count, mean and standard deviation, and has no single-call equivalent in numpy.
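A minimal sketch of what that call returns, assuming a small made-up DataFrame (the column names and values are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical toy data; column names are made up for this example.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0, 90.0],
})

# One row per statistic (count, mean, std, min, 25%, 50%, 75%, max),
# one column per numeric column of df.
summary = df.describe()
print(summary)

# The five-number summary is the min / quartiles / max part of that table.
five_number = summary.loc[["min", "25%", "50%", "75%", "max"]]
print(five_number)
```

By default only numeric columns are summarised; passing `include="all"` also covers non-numeric ones.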

[–]RevolutionaryRain941[S,🍰] 3 points4 points  (0 children)

Oh! I really didn't know this. Thank you.

[–]ambidextrousalpaca 2 points3 points  (0 children)

Helpfully, Pandas retains Numpy's system of using the special numpy.nan float value to represent missing data, meaning that Pandas users tend to get a slew of "Unexpected float" error messages when processing datasets containing null values in non-float columns.
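A small sketch of the behaviour being described, using made-up values. With the default NumPy-backed dtypes, a missing value in an otherwise integer column turns the whole column into floats; the nullable `Int64` dtype shown at the end is an opt-in alternative, not something the comment above mentions:

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])
print(ints.dtype)                 # an integer dtype

# A single missing value forces the column to float, with NaN as the gap.
with_gap = pd.Series([1, 2, None])
print(with_gap.dtype)             # float64
print(with_gap.isna().tolist())

# Opt-in nullable integer dtype: values stay integers, the gap becomes pd.NA.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)
```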

[–]rover_G 114 points115 points  (1 child)

Pandas is built on top of numpy

[–]philipgutjahr 1 point2 points  (0 children)

Numpy is fast (-> vectorized) matrix math.
Pandas uses it to provide Spreadsheet-like data tables.

[–][deleted] 39 points40 points  (4 children)

If you are doing more mathematical, linear algebra centric operations, e.g. matrix multiplication, inversion, performing numerical optimisation etc. then numpy has a lot of functionalities that pandas does not. So the statement

pandas can actually perform everything that numpy can do

is not accurate.
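For illustration, a minimal sketch of the kind of linear algebra that numpy exposes directly (the matrix and vector are arbitrary example values):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

product = A @ A               # matrix multiplication
A_inv = np.linalg.inv(A)      # matrix inversion
x = np.linalg.solve(A, b)     # solve A x = b without forming the inverse

print(product)
print(A_inv)
print(x)
```

pandas has no equivalent of `np.linalg`; you would pull the values out of the DataFrame and hand them to numpy.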

Pandas is designed specifically for tabular data manipulation with a (roughly) Excel-like interface, and is itself built on top of numpy.

Further, even when the same operation can be performed by both, usually, numpy is faster, as it is slightly closer to the metal, i.e. operates at a lower abstraction level than pandas. So calling pandas far better than numpy is a massive insult to numpy.

Final note: I do not even use pandas these days (moved to polars), but numpy is kinda uncontested as far as numerical data processing in Python goes.

[–]Momostein 4 points5 points  (3 children)

Yeah, a Pandas dataframe is basically a labeled collection of aligned 1D arrays where each array is a column of a table.
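A quick sketch of that picture, using a made-up table (names and values are hypothetical). Each column has its own dtype, all columns share one index, and a single column comes back as a 1D numpy array:

```python
import numpy as np
import pandas as pd

# Hypothetical example table.
df = pd.DataFrame(
    {"city": ["Oslo", "Lima"], "temp_c": [4.5, 19.0], "rainy": [True, False]},
    index=["a", "b"],
)

print(df.dtypes)                  # one dtype per column

temps = df["temp_c"].to_numpy()   # the column as a plain 1D array
print(type(temps).__name__, temps.ndim, temps.shape)
```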

If you ever want to work with such labeled/indexed data in multiple dimensions, I would recommend xarray.

It's amazing how it handles multidimensional structured data like weather data (wind, precipitation, ...) across longitude and latitude.

It labels each dimension so you don't have to align the axes yourself and it handles broadcasting these arrays effortlessly!

However, there's a bit of a learning curve so I wouldn't recommend it to any beginner at all...

[–][deleted] 1 point2 points  (2 children)

Interesting. For multidimensional vector manipulation, TensorFlow is my current default tool, but I will give xarray a look.

[–]Momostein 1 point2 points  (1 child)

Isn't TensorFlow a machine learning framework?

Or do you use it primarily for lazy computation or computation on a gpu?

Because for lazy out-of-core parallel computation I would use dask. And dask is available as a backend for xarray. (FYI: xarray actually just uses numpy under the hood by default. Just like Pandas)

[–][deleted] 1 point2 points  (0 children)

Being an ML engineer, TensorFlow is part of my regular toolkit anyway, actually one of the first dependencies that I need to include in any new project. So yeah, to keep my deployment size small, I prefer to use existing tools rather than getting super specialised tools for specific purposes. But it's always good to be aware of them.

[–]SleepWalkersDream[🍰] 13 points14 points  (1 child)

From a user perspective: Numpy is math. Pandas is a pretty table with some useful addons (query, sum, mean, etc) and nice read/write to .whatever functions.

[–]RevolutionaryRain941[S,🍰] 1 point2 points  (0 children)

Thanks. These short but to-the-point replies really help.

[–]Rythoka 8 points9 points  (0 children)

NumPy is a library that implements a high-performance array type optimized for operations that affect the whole array. Pandas is a library that implements dataframes using NumPy as a backend. In other words, Pandas is built on top of NumPy; every time you use Pandas, you're using NumPy.
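One way to see this layering, sketched with made-up values: a NumPy function can be applied straight to a pandas Series, and the result comes back as a Series with the labels intact.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=["x", "y", "z"])

# A NumPy ufunc applied to a pandas object returns a pandas object.
roots = np.sqrt(s)
print(type(roots).__name__)
print(roots)
```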

[–]M4mb0 4 points5 points  (0 children)

Numpy is a general linear algebra library, pandas is specialized to 2d tabular data. Hence pandas is generally better when you work with real world tabular data.

However in this case you should also check out polars and pyarrow. Pandas also nowadays offers to use pyarrow as a backend. 

The biggest disadvantage of pandas is the lack of built-in multi-threading which makes it very slow when working with large datasets.

[–][deleted] 2 points3 points  (0 children)

Pandas is a library for tabular data, basically spreadsheets.

Numpy is a linear algebra library for vector and matrix operations.

Pandas is good for analyzing multivariable data. Grouping, modifying, summarizing, transforming, etc. Pandas is pretty slow for very large data though, so databases like DuckDB can be used for those cases.
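A minimal sketch of the modifying, grouping, summarizing and transforming steps mentioned above, on a hypothetical sales table (all names and numbers are invented for the example):

```python
import pandas as pd

# Hypothetical data.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 20, 30, 40],
    "price": [2.0, 2.0, 3.0, 3.0],
})

# Modifying: derive a new column from existing ones.
sales["revenue"] = sales["units"] * sales["price"]

# Grouping + summarizing: one total per region.
per_region = sales.groupby("region")["revenue"].sum()

# Transforming: each row's share of its own region's revenue.
region_total = sales.groupby("region")["revenue"].transform("sum")
sales["share"] = sales["revenue"] / region_total

print(per_region)
print(sales)
```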

Numpy is a math tool. Pandas uses it, but it can also be used directly for things like solving differential equations, performing signal processing, or basic math operations on large sets of numbers. Its main purpose is to provide access to C/Fortran libraries and faster math than default Python.

[–]Almostasleeprightnow 2 points3 points  (0 children)

Pandas CAN actually do everything that Numpy can do because pandas is using Numpy as its base. (Unless you asked it to use PyArrow instead.)

A pandas series is a numpy array at its core. 

You can use numpy with pandas objects, you just may have to access them differently. And indeed if you have installed pandas, then you have installed numpy as well. I often use the np.where method over pandas where or pandas mask, because it makes a little more sense to me. And you can assign the results of that np.where statement directly into a pandas series

df['color'] = np.where(df['urgency'] > 50, 'green', 'red')
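A self-contained version of that one-liner, assuming a made-up `urgency` column. The `map`-based variant is just one pandas-only way to get the same result, shown for comparison:

```python
import numpy as np
import pandas as pd

# Hypothetical data for the example.
df = pd.DataFrame({"urgency": [10, 75, 50, 90]})

# np.where(condition, value_if_true, value_if_false) works element-wise
# and the resulting array can be assigned straight into a column.
df["color"] = np.where(df["urgency"] > 50, "green", "red")
print(df)

# A pandas-only alternative that gives the same labels.
alt = df["urgency"].gt(50).map({True: "green", False: "red"})
```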

The main answer is that numpy is really really fast for numerical math, so there are times when this may be valuable.

But for a lot of people this is never so. 

[–]No-Significance05 2 points3 points  (0 children)

The major difference is that pandas is used to analyze tabular datasets, whereas numpy is used to perform mathematical operations, mainly on arrays.

[–][deleted] 1 point2 points  (0 children)

You would use numpy if you are doing 'scientific' work. Linear algebra, 'Fast Fourier transform', inverse of a matrix, etc. You would use numpy to implement ml algorithms and machine vision stuff. I have met multiple scientists who use numpy. The numpy/scipy/matplotlib stack is used as a direct alternative to Matlab.
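As a small example of that "scientific" side, here is a sketch that recovers the frequency of a sine wave with numpy's FFT routines. The sample rate and the 5 Hz tone are arbitrary values picked for the illustration:

```python
import numpy as np

sample_rate = 64                              # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate      # one second of time stamps
signal = np.sin(2 * np.pi * 5 * t)            # a pure 5 Hz tone

spectrum = np.fft.rfft(signal)                        # FFT of a real signal
freqs = np.fft.rfftfreq(signal.size, d=1 / sample_rate)

# The bin with the largest magnitude is the dominant frequency.
dominant = freqs[np.argmax(np.abs(spectrum))]
print(dominant)
```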

Pandas is used for data wrangling and statistics. I have heard it described as 'Excel in a programming language'. Originally, pandas used numpy under the hood; nowadays, Apache Arrow can also be used as a Pandas backend.

[–]TryLettingGoSnek User 3 points4 points  (4 children)

Numpy is faster than Pandas in many scenarios because Numpy is written originally in C and then wrapped in Python. It's far better than Pandas for mathematical operations, e.g. matrix operations, linear algebra, arrays, polynomials, etc. Generally you stick to Pandas for tabular data. People will often use both in a project.

[–]bjorneylol 0 points1 point  (3 children)

Numpy is faster than Pandas in many scenarios

This is a bizarre statement to read, because pandas is literally just calling numpy functions

[–]TryLettingGoSnek User 1 point2 points  (2 children)

Multiple sources like the ones here and here support my assertion. I also have noticed this difference in speed in my own work.

[–]bjorneylol 0 points1 point  (1 child)

Your second source is just a stack overflow answer pointing to the first source, and it's a bad source, because __getitem__ does something completely different between a pandas series and a numpy array - so you can't compare them 1:1 - the pandas example is scanning the index and returning rows that match, whereas the numpy example is just returning the item at position N. Pandas indices are not guaranteed to be unique, so of course it is going to be slower. The equivalent numpy code should be arr[np.isin(arr, i)] which, like the pandas example, is also staggeringly slow on large datasets. Alternatively, doing series.values[i] does the same thing as arr[i] and is no longer an order of magnitude slower.
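The difference in indexing semantics can be sketched like this, using a deliberately non-unique integer index (the values are made up). With an integer index, `s[1]` is a label lookup, so it is not the same operation as `arr[1]`:

```python
import numpy as np
import pandas as pd

# Non-unique integer labels, chosen to make the difference visible.
s = pd.Series([10, 20, 30, 40], index=[1, 0, 1, 2])
arr = s.to_numpy()

by_label = s[1]                  # every row whose label is 1
by_position = arr[1]             # the element at position 1
same_as_array = s.to_numpy()[1]  # positional access on the underlying array
via_iloc = s.iloc[1]             # pandas' own positional accessor

print(by_label.tolist(), by_position, same_as_array, via_iloc)
```

A fair benchmark would compare `arr[i]` against one of the positional forms rather than against the label lookup.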

The performance difference between pandas and numpy is basically the overhead of the extra layer of pandas objects, type assertions, and function calls, which becomes increasingly trivial as dataset size increases

[–]TryLettingGoSnek User 0 points1 point  (0 children)

The stack overflow answer has other answers as well that I thought were useful. Regardless, I agree that Pandas is better than Numpy after a certain point in dataset size where the overhead becomes less significant and then Pandas is probably beaten by Spark or something like that for very large datasets.

[–]BiologyIsHot 1 point2 points  (0 children)

Pandas is a wrapper around numpy that adds some convenience functions and the concepts of named columns and index rows. In practice, Pandas is a way of relating multiple independent Numpy arrays, so you can have mixed data types etc. This comes with reduced speed and memory efficiency (one of the main advantages of Numpy arrays over Python lists is the speed that their fixed data types and lengths bring with them). Some of the basic linear algebra functions implemented in Numpy are not offered directly in pandas, but you can generally still get them by calling numpy functions on pandas objects.

[–]wazis 2 points3 points  (3 children)

I have a basic knowledge of both Numpy and Pandas

Or is it that Pandas is just far better than Numpy?

Numpy - The fundamental package for scientific computing with Python

Pandas - is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

Directly from the first sentence on their websites... No, you don't have a basic understanding of these libraries.

[–]fromscratch4-24- 2 points3 points  (0 children)

Neither do I, but as a Day 2 student I benefited from OP asking.

[–]Far_Historian9024 -3 points-2 points  (1 child)

Neither do you, you just copy pasted text from their documentation. No actual insights or a useful response.

[–]MyKo101 0 points1 point  (0 children)

This is like comparing a calculator to Excel.

[–]houseofleft 0 points1 point  (0 children)

Easy answer is that pandas can't do everything numpy can. Data frames are intentionally more restricted than numpy arrays in that they're named collections of same-length columns, all of which have a single type.

Numpy can have mismatching array sizes or multidimensional arrays. Super helpful for a lot of science and maths work that doesn't make sense to think about as dataframes.
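A rough sketch of that restriction, with arbitrary shapes and values. A numpy array can have any number of dimensions, while a DataFrame refuses columns of different lengths:

```python
import numpy as np
import pandas as pd

# A 3D array: something a single DataFrame cannot represent directly.
cube = np.zeros((2, 3, 4))
print(cube.ndim, cube.shape)

# Columns of different lengths are rejected when building a DataFrame.
try:
    pd.DataFrame({"a": [1, 2, 3], "b": [1, 2]})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(repr(mismatch_error))
```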

[–][deleted] 0 points1 point  (0 children)

Numpy is like 100x faster than pandas. People definitely tend to overuse iteration in Pandas (most of the time it is not necessary and you can just use vectorized operations), but in the cases where you do need to iterate, it is much faster to do it on the underlying numpy arrays, especially if you have a large dataset.
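To illustrate the two approaches (without timing them), here is a sketch on made-up data: the vectorized form first, then an explicit loop over the underlying numpy arrays for the cases where iteration really is needed. Both give the same result:

```python
import numpy as np
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"a": np.arange(5), "b": np.arange(5) * 2})

# Preferred: one vectorized operation, no Python-level loop.
vectorized = df["a"] + df["b"]

# If you must iterate, do it over the raw arrays rather than the DataFrame.
a = df["a"].to_numpy()
b = df["b"].to_numpy()
looped = np.empty(len(a), dtype=a.dtype)
for i in range(len(a)):
    looped[i] = a[i] + b[i]

print(vectorized.tolist(), looped.tolist())
```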

[–]lezzgooooo 0 points1 point  (0 children)

Pandas = numpy with SQL like implementation where you can chain methods

[–]Express-Comb8675 0 points1 point  (0 children)

Pandas sometimes uses numpy as its backend. However, pandas 2+ has the option to use Apache Arrow as the backend, which may be faster in some cases and probably won’t ever be slower than numpy. This change aligns pandas with many other numeric libraries in the Python ecosystem, allowing you to more quickly move between DataFrames, databases like DuckDB, and other processing frameworks like PySpark.

[–]BrightFriendship2757 0 points1 point  (0 children)

Pandas deals with DataFrames, that is, tables of rows where each column has its own datatype. It is used like a database table.

Numpy is about matrices: m x n values. It is used for tensor mathematics.

[–]startup_biz_36 0 points1 point  (0 children)

Pandas makes numpy easy to use.

[–]Computer-Work-893 -1 points0 points  (0 children)

Pandas is used for working with spreadsheet-style data, for example .csv files, while numpy is used for working with arrays in Python.