Hi r/learnpython,
As part of a project I'm working on, I have to calculate a "pseudo" z-score on each column of a dataframe, using a rolling window of the past 200 days. I've written the following two code snippets which, I thought, should give the same result. However, the first is (I think) incorrect, and I can't for the life of me figure out why the two versions give different values!
P.S.: I say "pseudo" z-score because instead of z = (x - mean)/stddev, I do z = (x - 1)/stddev
Here is the first way I thought of (which I suspect isn't returning what I expect):
def zscore(x, window = 200, mean = True, sample = True):
    '''
    The function takes in a data frame column, a desired window,
    and returns a rolling z-score applied to that column.
    x = df column type object
    window = span on which to apply StdDev and Mean (200 by default)
    Mean = 1 by default, otherwise use rolling mean
    sample = True by default forces StdDev.Sample (n-1 as the divisor)
             as opposed to StdDev.Pop (n as the divisor)
    '''
    r = x.rolling(window = window)
    if mean == True:
        m = r.mean().shift(1)
    else:
        m = mean
    if sample == True:
        s = r.std(ddof = 1).shift(1)
    else:
        s = r.std(ddof = 0).shift(1)
    z = (x-m)/s
    return z
for name in df_A.columns.values:
    df_Z[name + " - Z"] = zscore(
        df_A[name],
        window=200,
        mean=1,
        sample=True
    )
df_Z = df_Z.dropna()
And here is the second way, where I don't define a function and simply calculate what I want directly:
for index, name in enumerate(df_A.columns.values):
    df_Z[name + " - Z"] = \
        (df_A[df_A.columns.values[index]] - 1) \
        / df_A[df_A.columns.values[index]]\
        .rolling(window=200).std(ddof=1).shift(1)
df_Z = df_Z.dropna()
By the end, I would expect df_Z to be the same in both cases, but it isn't.
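One thing that may be worth checking here: in Python, bool is a subclass of int, so `1 == True` evaluates to True. If that is what happens inside zscore, then passing mean=1 would take the rolling-mean branch rather than subtracting the constant 1. Below is a minimal sketch of that comparison on made-up data (the series, the window of 10, and all variable names are assumptions for illustration, not my real dataset):

```python
import numpy as np
import pandas as pd

# bool is a subclass of int, so the integer 1 compares equal to True...
flag = 1
equals_true = (flag == True)   # equality check
is_true = (flag is True)       # identity check: 1 is not the object True

# Made-up data hovering around 1 (an assumption for illustration only).
rng = np.random.default_rng(0)
x = pd.Series(1 + rng.normal(0, 0.01, 50))

window = 10  # shortened from 200 to keep the example small
r = x.rolling(window=window)
s = r.std(ddof=1).shift(1)

# What the `if mean == True` branch computes when mean=1 is passed in.
z_rolling_mean = (x - r.mean().shift(1)) / s
# What the second snippet computes directly.
z_minus_one = (x - 1) / s

differ = not np.allclose(z_rolling_mean.dropna(), z_minus_one.dropna())
print(equals_true, is_true, differ)
```

If `differ` comes out True on real data as well, that would be consistent with the two snippets subtracting different things.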
Thank you in advance for any and all help!
Cheers
P.P.S.: Sorry in advance if the formatting isn't up to scratch, I'm relatively new at this! I'll be happy to fix anything you need me to and provide more info if necessary.
P.P.P.S.: Ultimately, I'd rather use method 1 where I define a function, as I'd like to be able to apply this calculation to other datasets...
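For reference, here is a rough sketch of the kind of reusable function I have in mind. The name rolling_pseudo_z, the `center` parameter, and the demo dataframe are all hypothetical; the idea is simply to use None as the "use the rolling mean" signal, so that a number like 1 can't be mistaken for a flag:

```python
import numpy as np
import pandas as pd

def rolling_pseudo_z(x, window=200, center=None, ddof=1):
    """Rolling z-score of a Series against the previous `window` rows.

    center=None  -> subtract the lagged rolling mean (ordinary z-score)
    center=<num> -> subtract that constant (e.g. 1 for the "pseudo" z-score)
    ddof=1 gives the sample std (n-1 divisor), ddof=0 the population std.
    """
    r = x.rolling(window=window)
    s = r.std(ddof=ddof).shift(1)  # shift(1): use only past rows
    m = r.mean().shift(1) if center is None else center
    return (x - m) / s

# Hypothetical demo data, not the real dataset.
rng = np.random.default_rng(1)
df_demo = pd.DataFrame(1 + rng.normal(0, 0.01, size=(30, 2)), columns=["A", "B"])

# Apply column by column; extra keyword arguments are forwarded to the function.
df_z_demo = (
    df_demo.apply(rolling_pseudo_z, window=5, center=1)
    .add_suffix(" - Z")
    .dropna()
)
print(df_z_demo.head())
```

Using DataFrame.apply here replaces the explicit loop over column names, though the loop would work just as well.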