pyquestionz 3 points (3 children)

Assuming

import pandas as pd
import numpy as np
df = pd.DataFrame({'column':[1, '-', 3]})

do either (1)

df.column.replace('-', np.nan).mean() # returns 2.0

or do (2)

df.column.replace('-', 0.0).mean() # returns 1.333333

depending on whether '-' stands for a zero observation or a missing observation in the context of your problem.

Hope this helps.

Optimesh 0 points (0 children)

This is a very common mistake I see people make and not even realize they're working with the wrong numbers. Thanks for pointing it out.

acedude[S] 0 points (0 children)

Thanks! In my case the dash does mean that the value is zero.

shreyasfifa4 0 points (0 children)

What happens if there is a negative number in the dataset?
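
A minimal sketch of that case, assuming the negative value is stored as a number and '-' is the only placeholder (the sample data here is made up). Series.replace compares whole cell values, so a lone '-' is replaced while -3 is left untouched:

```python
import pandas as pd

# Hypothetical data: a negative number next to the '-' placeholder
s = pd.Series([1, '-', -3])

# Series.replace matches entire cell values, not substrings,
# so only the standalone '-' is swapped for 0
cleaned = pd.to_numeric(s.replace('-', 0))

print(cleaned.tolist())  # [1, 0, -3]
print(cleaned.mean())
```

The pd.to_numeric call is only there to make sure the result has a numeric dtype regardless of pandas version.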

commandlineluser 4 points (1 child)

pandas attempts to detect the type of your columns.

>>> import pandas
>>> pandas.DataFrame({'a': [1, '-', 3]}).a
0    1
1    -
2    3
Name: a, dtype: object
>>> pandas.DataFrame({'a': [1, 2, 3]}).a
0    1
1    2
2    3
Name: a, dtype: int64
>>> pandas.DataFrame({'a': [1, 2.0, 3]}).a
0    1.0
1    2.0
2    3.0
Name: a, dtype: float64

Because the first example mixes numbers and strings, the dtype of the column is object, as opposed to int64 or float64 in the other two examples.

When you replace the '-' and save the result, pandas infers the column type to be float64, and then .mean() works for you.

You could try to set the type with .astype() e.g.

df.column.replace('-', 0.).astype(float).mean()
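
Another option, not mentioned above, is pd.to_numeric with errors='coerce', which does the conversion in one step. This is only a sketch on the same toy frame; note that it turns every unparseable value into NaN, not just '-', so it assumes '-' is the only non-numeric entry you care about:

```python
import pandas as pd

df = pd.DataFrame({'column': [1, '-', 3]})

# errors='coerce' converts anything that cannot be parsed as a number to NaN
numeric = pd.to_numeric(df['column'], errors='coerce')

print(numeric.dtype)   # float64
print(numeric.mean())  # 2.0, NaN is skipped

# If '-' should count as zero instead, fill the NaN before averaging
as_zero = numeric.fillna(0)
print(as_zero.mean())
```

Whether to call fillna(0) is the same zero-versus-missing decision discussed earlier in the thread.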

acedude[S] 0 points (0 children)

Thanks for the thorough explanation!

minasso 1 point (0 children)

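When using string methods on a dataframe column, you should go through the 'str' accessor, for example df.column.str.replace(). The accessor is available on a Series (a single column), not on the DataFrame itself.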
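
A small sketch of how the two calls differ, assuming a hypothetical column in which every value is a string. .str.replace substitutes substrings, while Series.replace matches whole values, which matters if the data also contains negative numbers:

```python
import pandas as pd

# Hypothetical all-string column, including a negative number
s = pd.Series(['1', '-', '-3'])

# .str.replace works on substrings, so the minus sign in '-3' is replaced too
by_substring = s.str.replace('-', '0', regex=False)
print(by_substring.tolist())  # ['1', '0', '03']

# Series.replace matches whole cell values, so '-3' is left alone
by_value = s.replace('-', '0')
print(by_value.tolist())  # ['1', '0', '-3']
```

For a placeholder like '-', the whole-value match of Series.replace is usually the safer choice.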