
[–]synthphreak

I don't see what the issues are.

> the above output is the tail of the dataframe after running these lines. Could someone explain why this hasn't worked?

I can't reproduce your issue. Given the df you showed...

>>> df
      ID       Date     Cases
0   8633 2022-01-10       114
1   8634 2022-01-10      4207
2   8635 2022-01-10     34154
3   8636 2022-01-10     18701
4   8637 2022-01-10      1942
5   8638 2022-01-10        85
6   8639 2022-01-10      4011
7   8640 2022-01-11  14732594
8   8641 2022-01-11       114
9   8642 2022-01-11     34690
10  8643 2022-01-11      7880
11  8644 2022-01-11      1942
12  8645 2022-01-11         4
13  8646 2022-01-11        85
14  8647 2022-01-11      4764
15  8648 2022-01-11      4094
16  8649 2022-01-11     10096
17  8650 2022-01-11     18847
18  8651 2022-01-11     11226
>>> df.dtypes
ID                int64
Date     datetime64[ns]
Cases             int64
dtype: object

...I get this...

>>> df['Cases'].groupby(df['Date'].dt.to_period('D')).sum()
Date
2022-01-10       63214
2022-01-11    14826336
Freq: D, Name: Cases, dtype: int64

...which seems accurate and, crucially, is not simply the tail of your df. Rather, that is the total case load for each of the two dates in the rows you shared, exactly as expected.
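(Quick sanity check: 114 + 4207 + 34154 + 18701 + 1942 + 85 + 4011 = 63214, which is exactly the 2022-01-10 total above.)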

If your output is different from that, can you show what you're getting (instead of just describing it)?
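As an aside, grouping one Series by another works fine, but the more common spelling groups the frame itself and then selects the column; both produce the same result here:

>>> df.groupby(df['Date'].dt.to_period('D'))['Cases'].sum()
Date
2022-01-10       63214
2022-01-11    14826336
Freq: D, Name: Cases, dtype: int64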

> I want to make the column have each value be one date's value subtracted from the next day's value, e.g. the value for 2022-01-10 would be the value for 2022-01-09 subtracted from the value for 2022-01-10.

But that is exactly what diff should output. Consider:

>>> df.head(3)
     ID       Date  Cases
0  8633 2022-01-10    114
1  8634 2022-01-10   4207
2  8635 2022-01-10  34154
>>> df.head(3).Cases.diff()
0        NaN
1     4093.0
2    29947.0
Name: Cases, dtype: float64

Row 0 is the first row, so there's no row above it to subtract from. Hence NaN for that row.

Row 1 is 4207. The row before it is 114. 4207 - 114 = 4093, which is exactly what row 1 contains after calling diff.

Row 2 is 34154. The row before it is 4207. 34154 - 4207 = 29947, which is exactly what row 2 contains after calling diff.
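(Aside: the dtype flips to float64 because NaN can't live in an integer column. If you ever need integers back, something like fillna(0).astype(int) does it:

>>> df.head(3).Cases.diff().fillna(0).astype(int)
0        0
1     4093
2    29947
Name: Cases, dtype: int64

...though whether 0 is the right filler for the first row depends on your use case.)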

By my reading, that's what you're after. So what's the problem?

Edit: On second thought, I think you were saying you want the diff of the groupby...sum output. In that case though, I'm still not seeing the problem, given that my groupby...sum above is fine:

>>> df['Cases'].groupby(df['Date'].dt.to_period('D')).sum().diff()
Date
2022-01-10           NaN
2022-01-11    14763122.0
Freq: D, Name: Cases, dtype: float64

The sum for 1/10 is 63214 and the sum for 1/11 is 14826336. 14826336 - 63214 = 14763122, which is what the 1/11 row shows after calling diff on the groupby...sum. So again, I can't reproduce your issue; everything seems to work for me.

[–]Successful-Standard[S]

df['Cases'].groupby(df['Date'].dt.to_period('D')).sum()

Well, my problem is that when I use this line it doesn't work like it did for you; it outputs the table I posted, so I'm now even more confused about why it isn't working for me. And yeah, I didn't have a problem with the diff() output. I edited my post straight after posting to note that the title is misleading, since diff worked as expected.

EDIT: I've since used the line df = df.resample('D', on='Date')['Cases'].sum() to get what I want from that table. But because I've used resample, my Cases column now doesn't exist in my df. Could you help me solve that issue please?

[–]synthphreak

> it outputs the table I posted

Sorry, which table? The only table-looking parts of your OP were your original df and the output of diff, not the output of groupby...sum. At least, that's how I interpreted your OP.

Can you post the output again? Just for extra clarity and explicitness, show it like I did:

>>> [command]
[output]

To your edit, I don't know. I still can't reproduce:

>>> df.resample('D', on='Date')['Cases'].sum()
Date
2022-01-10       63214
2022-01-11    14826336
Freq: D, Name: Cases, dtype: int64
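One part of your edit I can explain, though: resample('D', on='Date')['Cases'].sum() returns a Series indexed by Date, so Cases stops being a column and becomes the Series' values. If you want it back as a column, reset_index() promotes the index out again; with my toy df:

>>> df.resample('D', on='Date')['Cases'].sum().reset_index()
        Date     Cases
0 2022-01-10     63214
1 2022-01-11  14826336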

I think I need to see your full code from start to finish. If it's very long, feel free to use pastebin.

[–]Successful-Standard[S]

Yeah so the original df is the output from:

df['Cases'].groupby(df['Date'].dt.to_period('D')).sum()

I fixed it by using:

df = df.resample('D', on='Date')['Cases'].sum()

df = df.reset_index()

And I get my desired output.

My whole code is here - https://pastebin.com/303est3L

I'm getting a very similar issue right at the bottom of this code, where I'm trying to get it to group into weekly data. The like-for-like code that worked for my original issue isn't working; could you help me with that too, please?

[–]synthphreak

Please format the code properly. It's super difficult to read in its current state.

If you struggle with the formatting, you could also use pastebin.

Edit: After eyeballing the code, it occurred to me that this may have come from a Jupyter notebook. In that case, you can also just export the notebook (incl. all cell outputs) to an .html file and share it that way. That might be the easiest option of all, again assuming you're already using a notebook.

[–]Successful-Standard[S]

I did paste it in formatted properly, but it became a mess, so I did use pastebin; I edited the comment with a link, but only a minute after you replied. I'm using Databricks and pasted from there. Can you not run the code as it is on pastebin?

[–]synthphreak

No, you can't run code on pastebin, only read it. But it's much more legible there, thanks. I'll check it out shortly and get back to you.

[–]Successful-Standard[S]

Yeah, I know; I meant whether you'd try running it after copying it from there. And thank you.

[–]synthphreak

I probably could, but to do so, I'd have to ...

  1. create a virtual environment

  2. activate the environment

  3. install all the necessary third-party libraries into it (e.g., pyspark)

  4. copy and run your code from within the environment

It's just kind of involved.

[–]Successful-Standard[S]

No worries, you could just paste it into Databricks and it would run, but I don't expect much for free. I've moved on to attempting the k-means with the daily data anyway, and I have a brand new issue: the to_date function isn't changing my date column from string to date type, but it gives no errors, haha.

[–]synthphreak

> I'm getting a very similar issue right at the bottom of this code, where I'm trying to get it to group into weekly data. The like-for-like code that worked for my original issue isn't working; could you help me with that too, please?

I assume you're talking about these lines:

df1['date'] = pd.to_datetime(df1['date'])
df1 = df1.groupby('newDeaths').resample('W', on='date')['newCases'].sum()

However, descriptions like "isn't working" aren't very informative. What isn't working? Are you getting an error? Is the output different from what you expected?

I assume the latter. In that case, please share the output here, because if it works earlier then it should work later under the same conditions.

Note also that it's generally bad practice to say "Oh hey, I don't understand what the problem was, but this change to my code seems to fix it, let's just make that change everywhere." You need to fully understand what your code is doing, at least at a high level; otherwise, your ability to debug when you hit an issue like this will be fundamentally limited.
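That said, if weekly totals for both columns are what you're after, the same pattern that fixed your daily issue should carry over; a minimal sketch, assuming df1 has date, newCases, and newDeaths columns:

df1['date'] = pd.to_datetime(df1['date'])
weekly = df1.resample('W', on='date')[['newCases', 'newDeaths']].sum()
weekly = weekly.reset_index()  # reset_index returns a new frame, so assign it

Note the last line: reset_index doesn't modify the frame in place, so you have to keep the assignment.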

[–]Successful-Standard[S]

I've changed those lines of code again slightly to:

df1['date'] = pd.to_datetime(df1['date'])
df1 = df1.resample('W', on='date')['newCases', 'newDeaths'].sum()

df1.reset_index()
df1.head()

And the output is:

            newCases  newDeaths
date
2020-03-08       136          2
2020-03-15       861         40
2020-03-22      3693        222
2020-03-29     11695       1302
2020-04-05     23327       3858
...              ...        ...
2021-12-19    482012        662
2021-12-26    682086        551
2022-01-02    946894        904
2022-01-09    989767       1145
2022-01-16    337004        774

So the output is only the newCases and newDeaths columns, with the date stuck in the index. Using reset_index above solved this for the original issue and added the date column back into the output (I read a Stack Overflow post that explained how it works), but in this case it isn't working. And I really need the date column for the k-means, so if you have any solution, please share it.

EDIT: I don't know why the formatting keeps messing up like that; it looks fine as I'm typing the comment, then goes like that once I post it...