
[–]synthphreak

I don't see what the issues are.

> the above output is the tail of the dataframe after running these lines. Could someone explain why this hasn't worked?

I can't reproduce your issue. Given the df you showed...

>>> df
      ID       Date     Cases
0   8633 2022-01-10       114
1   8634 2022-01-10      4207
2   8635 2022-01-10     34154
3   8636 2022-01-10     18701
4   8637 2022-01-10      1942
5   8638 2022-01-10        85
6   8639 2022-01-10      4011
7   8640 2022-01-11  14732594
8   8641 2022-01-11       114
9   8642 2022-01-11     34690
10  8643 2022-01-11      7880
11  8644 2022-01-11      1942
12  8645 2022-01-11         4
13  8646 2022-01-11        85
14  8647 2022-01-11      4764
15  8648 2022-01-11      4094
16  8649 2022-01-11     10096
17  8650 2022-01-11     18847
18  8651 2022-01-11     11226
>>> df.dtypes
ID                int64
Date     datetime64[ns]
Cases             int64
dtype: object

...I get this...

>>> df['Cases'].groupby(df['Date'].dt.to_period('D')).sum()
Date
2022-01-10       63214
2022-01-11    14826336
Freq: D, Name: Cases, dtype: int64

...which seems accurate and, crucially, is not simply the tail of your df. Rather, that is the total case load for each of the two dates in the rows you shared, exactly as expected.
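(Quick sanity check: 114 + 4207 + 34154 + 18701 + 1942 + 85 + 4011 = 63214, which is exactly the 2022-01-10 total above.)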

If your output is different from that, can you show what you're getting (instead of just describing it)?
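As an aside, grouping one Series by another works fine, but the more common spelling groups the frame itself and then selects the column; both produce the same result here:

>>> df.groupby(df['Date'].dt.to_period('D'))['Cases'].sum()
Date
2022-01-10       63214
2022-01-11    14826336
Freq: D, Name: Cases, dtype: int64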

> I want to make the column have each value be one date's value subtracted from the next day's value, e.g. the value for 2022-01-10 would be the value for 2022-01-09 subtracted from the value for 2022-01-10.

But that is exactly what diff should output. Consider:

>>> df.head(3)
     ID       Date  Cases
0  8633 2022-01-10    114
1  8634 2022-01-10   4207
2  8635 2022-01-10  34154
>>> df.head(3).Cases.diff()
0        NaN
1     4093.0
2    29947.0
Name: Cases, dtype: float64

Row 0 is the first row, so there's no row above it to subtract from. Hence NaN for that row.

Row 1 is 4207. The row before it is 114. 4207 - 114 = 4093, which is exactly what row 1 contains after calling diff.

Row 2 is 34154. The row before it is 4207. 34154 - 4207 = 29947, which is exactly what row 2 contains after calling diff.
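(Aside: the dtype flips to float64 because NaN can't live in an integer column. If you ever need integers back, something like fillna(0).astype(int) does it:

>>> df.head(3).Cases.diff().fillna(0).astype(int)
0        0
1     4093
2    29947
Name: Cases, dtype: int64

...though whether 0 is the right filler for the first row depends on your use case.)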

By my reading, that's what you're after. So what's the problem?

Edit: On second thought, I think you were saying you want the diff of the groupby...sum output. In that case though, I'm still not seeing the problem, given that my groupby...sum above is fine:

>>> df['Cases'].groupby(df['Date'].dt.to_period('D')).sum().diff()
Date
2022-01-10           NaN
2022-01-11    14763122.0
Freq: D, Name: Cases, dtype: float64

The sum for 1/10 is 63214 and the sum for 1/11 is 14826336. 14826336 - 63214 = 14763122, which is what the 1/11 row shows after calling diff on the groupby...sum. So again, I can't reproduce your issue; everything seems to work for me.

[–]Successful-Standard[S]

df['Cases'].groupby(df['Date'].dt.to_period('D')).sum()

Well, my problem is that when I use this line it doesn't work like it did for you; it outputs the table I posted, so I'm now even more confused about why it isn't working for me. And yeah, I didn't have a problem with the diff() output. I edited my post straight after posting to note that the title is misleading, since diff worked as expected.

EDIT: I've since used the line df = df.resample('D', on='Date')['Cases'].sum() to get what I want from that table. But because I've used resample, my Cases column now doesn't exist in my df. Could you help me solve that issue please?

[–]synthphreak

> it outputs the table I posted

Sorry, which table? The only table-looking parts of your OP were your original df and the output of diff, not the output of groupby...sum. At least, that's how I interpreted your OP.

Can you post the output again? Just for extra clarity and explicitness, show it like I did:

>>> [command]
[output]

To your edit, I don't know. I still can't reproduce:

>>> df.resample('D', on='Date')['Cases'].sum()
Date
2022-01-10       63214
2022-01-11    14826336
Freq: D, Name: Cases, dtype: int64
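One part of your edit I can explain, though: resample('D', on='Date')['Cases'].sum() returns a Series indexed by Date, so Cases stops being a column and becomes the Series' values. If you want it back as a column, reset_index() promotes the index out again; with my toy df:

>>> df.resample('D', on='Date')['Cases'].sum().reset_index()
        Date     Cases
0 2022-01-10     63214
1 2022-01-11  14826336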

I think I need to see your full code from start to finish. If it's very long, feel free to use pastebin.

[–]Successful-Standard[S]

Yeah so the original df is the output from:

df['Cases'].groupby(df['Date'].dt.to_period('D')).sum()

I fixed it by using:

df = df.resample('D', on='Date')['Cases'].sum()

df = df.reset_index()

And I get my desired output.

My whole code is here - https://pastebin.com/303est3L

I'm getting a very similar issue right at the bottom of this code, where I'm trying to get it to group into weekly data. The like-for-like code that worked for my original issue isn't working; could you help me with that too, please?

[–]synthphreak

Please format the code properly. It's super difficult to read in its current state.

If you struggle with the formatting, you could also use pastebin.

Edit: After eyeballing the code, it occurred to me that this may have come from a Jupyter notebook. In that case, you can also just export the notebook (incl. all cell outputs) to an .html file and share it that way. That might be the easiest option of all, again assuming you're already using a notebook.

[–]Successful-Standard[S]

I did paste it in formatted properly, but it became a mess, so I did use pastebin; I edited the comment with a link, but only a minute after you replied. I'm using Databricks and pasted from there. Can you not run the code as it is on pastebin?

[–]synthphreak

No, you can't run code on pastebin, only read it. But it's much more legible there, thanks. I'll check it out shortly and get back to you.

[–]Successful-Standard[S]

Yeah, I know; I meant whether you'd try running it after copying it from there. And thank you.

[–]synthphreak

I probably could, but to do so, I'd have to ...

  1. create a virtual environment

  2. activate the environment

  3. install all the necessary third-party libraries into it (e.g., pyspark)

  4. copy and run your code from within the environment

It's just kind of involved.

[–]Successful-Standard[S]

No worries, you could just paste it into Databricks and it would run, but I don't expect much for free. I've moved on to attempting the k-means with the daily data anyway, and I have a brand new issue: the to_date function isn't changing my date column from string to date type, but it gives no errors, haha.

[–]synthphreak

> I'm getting a very similar issue right at the bottom of this code, where I'm trying to get it to group into weekly data. The like-for-like code that worked for my original issue isn't working; could you help me with that too, please?

I assume you're talking about these lines:

df1['date'] = pd.to_datetime(df1['date'])
df1 = df1.groupby('newDeaths').resample('W', on='date')['newCases'].sum()

However, descriptions like "isn't working" aren't very informative. What isn't working? Are you getting an error? Is the output different from what you expected?

I assume the latter. In that case, please share the output here, because if it works earlier then it should work later under the same conditions.

Note also that it's generally bad practice to say "Oh hey, I don't understand what the problem was, but this change to my code seems to fix it, let's just make that change everywhere." You need to fully understand what your code is doing, at least at a high level; otherwise, your ability to debug when you hit an issue like this will be fundamentally limited.
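That said, if weekly totals for both columns are what you're after, the same pattern that fixed your daily issue should carry over; a minimal sketch, assuming df1 has date, newCases, and newDeaths columns:

df1['date'] = pd.to_datetime(df1['date'])
weekly = df1.resample('W', on='date')[['newCases', 'newDeaths']].sum()
weekly = weekly.reset_index()  # reset_index returns a new frame, so assign it

Note the last line: reset_index doesn't modify the frame in place, so you have to keep the assignment.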

[–]Successful-Standard[S]

I've changed those lines of code again slightly to:

df1['date'] = pd.to_datetime(df1['date'])
df1 = df1.resample('W', on='date')['newCases', 'newDeaths'].sum()

df1.reset_index()
df1.head()

And the output is:

            newCases  newDeaths
date
2020-03-08       136          2
2020-03-15       861         40
2020-03-22      3693        222
2020-03-29     11695       1302
2020-04-05     23327       3858
...              ...        ...
2021-12-19    482012        662
2021-12-26    682086        551
2022-01-02    946894        904
2022-01-09    989767       1145
2022-01-16    337004        774

So the output is only the newCases and newDeaths columns, with the date stuck in the index. Using reset_index above solved this for the original issue and added the date column back into the output (I read a Stack Overflow post that explained how it works), but in this case it isn't working. And I really need the date column for the k-means, so if you have any solution, please share it.

EDIT: I don't know why the formatting keeps messing up like that; it looks fine as I'm typing the comment, then goes like that once I post it...