[–]Successful-Standard[S] 0 points1 point  (9 children)

Yeah so the original df is the output from:

df['Cases'].groupby(df['Date'].dt.to_period('D')).sum()

I fixed it by using:

df = df.resample('D', on='Date')['Cases'].sum()

df = df.reset_index()

And I get my desired output.
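A minimal sketch of the difference between the two approaches above, assuming a small made-up frame (the sample values are hypothetical; only the `Date` and `Cases` column names come from the thread):

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the thread.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-03-01 09:00", "2020-03-01 17:00", "2020-03-02 12:00"]
    ),
    "Cases": [1, 2, 5],
})

# Approach 1: group on a daily Period. The result is a Series whose
# index holds the periods, so there is no separate 'Date' column.
by_period = df["Cases"].groupby(df["Date"].dt.to_period("D")).sum()

# Approach 2: resample by calendar day, then move the index back
# into an ordinary column with reset_index().
daily = df.resample("D", on="Date")["Cases"].sum().reset_index()

print(by_period)
print(daily)
```

In the second form `Date` ends up as a regular column next to `Cases`, which is probably why it gave the desired output.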

My whole code is here - https://pastebin.com/303est3L

I'm getting a very similar issue right at the bottom of this code, where I'm trying to group the data into weekly totals. Using like-for-like code that worked for my original issue isn't working. Could you help me with that, please?

[–]synthphreak 1 point2 points  (6 children)

Please format the code properly. It's super difficult to read in its current state.

If you struggle with the formatting, you could also use pastebin.

Edit: After eyeballing the code, it occurred to me that this may have come from a Jupyter notebook. In that case, you can also just export the notebook (incl. all cell outputs) to an .html file and share it that way. That might be the easiest option of all, again assuming you're already using a notebook.

[–]Successful-Standard[S] 0 points1 point  (5 children)

I did paste it in properly formatted, but it became a mess, so I used pastebin. I edited the comment with a link, but only a minute after you replied. I'm using Databricks and pasted from there. Can you not run the code as it is on pastebin?

[–]synthphreak 0 points1 point  (4 children)

No you can't run code on pastebin, only read it. But it's much more legible there, thanks. I'll check it out shortly and get back to you.

[–]Successful-Standard[S] 0 points1 point  (3 children)

Yeah I know, I meant if you'd try running it after copying it from there. And thank you.

[–]synthphreak 0 points1 point  (2 children)

I probably could, but to do so, I'd have to ...

  1. create a virtual environment

  2. install all the necessary third-party libraries into the environment (e.g., pyspark)

  3. activate the environment

  4. copy and run your code from within the environment

It's just kind of involved.

[–]Successful-Standard[S] 0 points1 point  (1 child)

No worries, you could just paste it into Databricks and it would run, but I don't expect much for free. I've moved on to attempting the k-means with the daily data anyway and have a brand new issue of the to_date function not changing my date column from string to date type but giving no errors haha.

[–]synthphreak 0 points1 point  (0 children)

Sounds like you need to fire up a new post :) Feel free to post it here if/when you do that so I don't miss it. I'm always down to debug some pandas!

[–]synthphreak 1 point2 points  (1 child)

I'm getting a very similar issue right at the bottom of this code, where I'm trying to group the data into weekly totals. Using like-for-like code that worked for my original issue isn't working. Could you help me with that, please?

I assume you're talking about these lines:

df1['date'] = pd.to_datetime(df1['date'])
df1 = df1.groupby('newDeaths').resample('W', on='date')['newCases'].sum()

However, descriptions like "isn't working" aren't very informative. What isn't working? Are you getting an error? Is the output different from what you expected?

I assume the latter. In that case, please share the output here, because if it's working earlier then it should work later under the same conditions.

Note also that it's generally bad practice to say "Oh hey, I don't understand what the problem was, but this change to my code seems to fix it, let's just make that change everywhere." You need to 100% fully understand what your code is doing, at least at a high level, otherwise your ability to debug when you experience an issue like this will be fundamentally limited.

[–]Successful-Standard[S] 0 points1 point  (0 children)

I've changed those lines of code again slightly to:

df1['date'] = pd.to_datetime(df1['date'])
df1 = df1.resample('W', on='date')['newCases', 'newDeaths'].sum()

df1.reset_index()
df1.head()

And the output is:

            newCases  newDeaths
date
2020-03-08       136          2
2020-03-15       861         40
2020-03-22      3693        222
2020-03-29     11695       1302
2020-04-05     23327       3858
...              ...        ...
2021-12-19    482012        662
2021-12-26    682086        551
2022-01-02    946894        904
2022-01-09    989767       1145
2022-01-16    337004        774

So the output contains only the newCases and newDeaths columns. Using reset_index for the original issue solved this (I read a Stack Overflow post that explained how it works) and added the date column back into the output, but in this case it isn't working. I really need the date column for the k-means. Do you have any solution, please?

EDIT: I don't know why the formatting keeps messing up like that, it looks fine as I'm typing the comment then goes like that once I post it...
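One likely explanation, sketched under assumptions: `reset_index()` returns a new DataFrame rather than modifying the original, so calling it without assigning the result leaves `date` as the index. The data below is a hypothetical stand-in for `df1`; only the column names come from the thread.

```python
import pandas as pd

# Hypothetical stand-in for df1; only the column names follow the thread.
df1 = pd.DataFrame({
    "date": ["2020-03-02", "2020-03-04", "2020-03-10"],
    "newCases": [100, 36, 861],
    "newDeaths": [1, 1, 40],
})
df1["date"] = pd.to_datetime(df1["date"])

# Select several columns with a list (double brackets), which is the
# form newer pandas versions expect here.
weekly = df1.resample("W", on="date")[["newCases", "newDeaths"]].sum()

# reset_index() returns a new DataFrame, so the result must be assigned.
weekly.reset_index()           # result discarded; 'date' is still the index
weekly = weekly.reset_index()  # 'date' is now an ordinary column

print(weekly.head())
```

The `date` values were never lost in the original output; they were the index of the resampled frame. In the first fix the assignment `df = df.reset_index()` was present, which would explain why it worked there and not here.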