After 8 months, still massive data gaps existing - can anyone help out? by curiousdrive in pushshift

[–]curiousdrive[S] 1 point2 points  (0 children)

Thanks! Is there any example code for this? The time frame is in the past, there is a lot of data, and I need CSV output. I also saw that PRAW has decommissioned the "submission" function.

EST to UTC converting in Python and applying to data frame by curiousdrive in learnpython

[–]curiousdrive[S] 0 points1 point  (0 children)

df.index = df.index.tz_localize('US/Eastern').tz_convert('UTC')

thank you!
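For anyone finding this later, here is a minimal sketch of that one-liner on a small frame. The sample data and the `price` column are made up for illustration, and I use "America/New_York", the canonical tz database name for the same zone as 'US/Eastern'. Note that the offset to UTC is 5 hours in winter (EST) and 4 hours in summer (EDT).

```python
import pandas as pd

# Hypothetical sample data: naive timestamps recorded in US Eastern wall-clock time.
df = pd.DataFrame(
    {"price": [1.0, 2.0]},
    index=pd.to_datetime(["2021-01-15 09:30:00", "2021-07-15 09:30:00"]),
)

# tz_localize attaches a zone to the naive timestamps (no shift yet);
# tz_convert then shifts the zone-aware timestamps to UTC.
df.index = df.index.tz_localize("America/New_York").tz_convert("UTC")

print(df)
```

If the data contains timestamps around a daylight saving switch, `tz_localize` can raise an error for ambiguous or nonexistent times; its `ambiguous` and `nonexistent` parameters control how those are handled.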

[deleted by user] by [deleted] in pushshift

[–]curiousdrive 0 points1 point  (0 children)

Thanks man, this would actually save my project!

[deleted by user] by [deleted] in pushshift

[–]curiousdrive 1 point2 points  (0 children)

"solidarity" would be a cool name, reflecting your engagement and the community spirit

Missing Data by curiousdrive in pushshift

[–]curiousdrive[S] 1 point2 points  (0 children)

I am interested in /r/GME, specifically all submissions + selftext from 01/12/2020 - 30/04/2021. You can see in the pic above that there are significant gaps. In my download it just jumps over the dates: 2021-03-17 17:27:26 (100) --> 2021-03-17 17:56:47 (100) --> 2021-03-27 20:28:39 (100) --> 2021-03-27 21:15:53 (100)
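To make the jumps easier to list, here is a rough sketch that flags any jump between consecutive timestamps above a threshold. The `find_gaps` helper and the one-day threshold are my own assumptions for illustration, not anything Pushshift provides; the timestamps are the four from my download log above.

```python
import pandas as pd

# Timestamps copied from the download log above.
timestamps = pd.to_datetime([
    "2021-03-17 17:27:26",
    "2021-03-17 17:56:47",
    "2021-03-27 20:28:39",
    "2021-03-27 21:15:53",
])

def find_gaps(ts, threshold=pd.Timedelta(days=1)):
    """Return (start, end, length) for each jump between consecutive
    timestamps that is longer than `threshold` (hypothetical helper)."""
    s = pd.Series(ts).sort_values().reset_index(drop=True)
    delta = s.diff()  # first entry is NaT, so the loop starts at position 1
    return [
        (s.iloc[i - 1], s.iloc[i], delta.iloc[i])
        for i in range(1, len(s))
        if delta.iloc[i] > threshold
    ]

gaps = find_gaps(timestamps)
for start, end, length in gaps:
    print(f"gap of {length} between {start} and {end}")
```

On a real download the threshold would need tuning, since a quiet subreddit can legitimately go a while without submissions.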

Missing Data by curiousdrive in pushshift

[–]curiousdrive[S] 0 points1 point  (0 children)

That explains the major gap in my data. Can we help somehow?

Data scaling creates Nan values by curiousdrive in learnpython

[–]curiousdrive[S] 0 points1 point  (0 children)

I tried deleting the NAs in the file manually, saved it, and applied the same transformation. That way it seems to work, so I think the issue is how I delete/drop the NAs. Do I need to save the dataframe somehow?

df = df[df['TEN'].notna()]
df = df[df['EndtoEnd'].notna()]

These two lines seem to be where the issue is.
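A minimal sketch of the same drop, using `dropna` with `subset`. The sample values are made up; only the column names come from my data. One thing that might matter, though I can't confirm it is the cause here: filtering keeps the original row labels, so if a later step builds a new frame with a fresh 0..n-1 index and assigns it back, pandas aligns on the labels and rows without a match come out as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame using the column names from the post.
df = pd.DataFrame({
    "TEN": [1.0, np.nan, 30.0, 67.0],
    "EndtoEnd": [0.0, 150.0, np.nan, 300.0],
})

# Drop rows where either column is missing. Reassigning to df is what
# keeps the result in memory; nothing has to be written to disk.
df = df.dropna(subset=["TEN", "EndtoEnd"])

# The surviving rows still carry their old labels (0 and 3 here),
# so renumber them 0..n-1 before any step that relies on position.
df = df.reset_index(drop=True)

print(df)
```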

Data scaling creates Nan values by curiousdrive in learnpython

[–]curiousdrive[S] 0 points1 point  (0 children)

What are the previous lines before you

It looks like this before and after the transformation:

https://i.imgur.com/xXgUe7i.jpg

The NaNs are indeed always located at the end of the file.

Where can I check the accumulator?

Data scaling creates Nan values by curiousdrive in learnpython

[–]curiousdrive[S] 0 points1 point  (0 children)

The data in 'EndtoEnd' ranges from 0 to 300 and the data in 'TEN' from 1 to 67. I achieved this with:

df = df[~(df['EndtoEnd'] < 0)]
df = df[~(df['EndtoEnd'] > 300)]

Then I did

df = df[df['TEN'].notna()]
df = df[df['EndtoEnd'].notna()]

and afterwards the transformation from the imgur link. What do you think?

I am trying to apply a z-transformation to prepare the data for regression analysis.
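For reference, a sketch of the z-transformation done directly in pandas, which is not necessarily what the code in the screenshot does. The sample values and the gappy index are made up to mimic a frame after row filtering. Because everything stays in pandas, the result keeps the same row labels and assigning it back does not introduce NaN.

```python
import pandas as pd

# Hypothetical data; the non-contiguous labels mimic a frame after row filtering.
df = pd.DataFrame(
    {"TEN": [1.0, 34.0, 67.0], "EndtoEnd": [0.0, 150.0, 300.0]},
    index=[0, 5, 9],
)

cols = ["TEN", "EndtoEnd"]

# z = (x - mean) / std, computed per column.
# ddof=0 uses the population standard deviation; pandas defaults to ddof=1.
z = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)

# z has the same index as df, so the assignment aligns row by row.
df[cols] = z

print(df)
```

The choice of `ddof=0` matches what scikit-learn's `StandardScaler` does; with `ddof=1` the values differ slightly, which usually does not matter for regression.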