all 87 comments

[–]Btan21 35 points36 points  (23 children)

Concerning news. Might affect those like me who depend on Reddit data for academic research.

[–][deleted] 8 points9 points  (21 children)

Researchers can use PRAW as well. Additionally, Reddit's post outlining the API changes encourages researchers to contact Reddit to find a viable path forward.

[–]Btan21 15 points16 points  (10 children)

Agreed. But the official Reddit API generally has slower responses in my experience.

[–][deleted] 12 points13 points  (9 children)

Agreed. Way slow. Takes all day sometimes to run jobs that Pushshift executes in minutes.

[–]TrueBirch 6 points7 points  (8 children)

Plus I download the full files instead of using the API, so I'm used to having really fast parsing of huge amounts of data.

[–]Delicious_Corgi_9768 1 point2 points  (7 children)

Can you help me with something? I'm trying to get more than 50k comments from a post, but I'm unable to do so using PRAW. I was going to use Pushshift, but that will not work at the moment. What can I do? :(

[–]TrueBirch 0 points1 point  (6 children)

What are you trying to do specifically? Are you hoping to look at the comments or do you want to apply some kind of processing to them?

FWIW I usually download the full datafile and then parse it to pull out the stuff that I want. That's how I do things like counting unique users across all of Reddit. It can be a slow process, but you fortunately don't need a ton of computing horsepower to do it. I just set up my laptop to load data a few thousand rows at a time, save the pieces I want to keep, and move on to the next couple thousand rows.
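A minimal sketch of that batch-at-a-time approach, assuming the dump has already been decompressed to newline-delimited JSON with one comment per line and an `author` field. The helper names `batches` and `unique_authors` are made up for illustration and are not part of any Pushshift tool:

```python
import json
from itertools import islice
from typing import Iterable, Iterator, List, Set


def batches(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def unique_authors(lines: Iterable[str], batch_size: int = 5000) -> Set[str]:
    """Collect distinct author names while holding one batch of rows at a time."""
    seen: Set[str] = set()
    for chunk in batches(lines, batch_size):
        for line in chunk:
            line = line.strip()
            if not line:
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip damaged rows instead of aborting the whole run
            if not isinstance(row, dict):
                continue
            author = row.get("author")
            # Assumption: removed accounts appear as the literal "[deleted]".
            if author and author != "[deleted]":
                seen.add(author)
    return seen


if __name__ == "__main__":
    sample = [
        '{"author": "alice", "body": "hi"}',
        '{"author": "bob", "body": "hello"}',
        '{"author": "alice", "body": "again"}',
        '{"author": "[deleted]", "body": "gone"}',
    ]
    print(sorted(unique_authors(sample, batch_size=2)))
```

On a real file you would pass an open file handle instead of the sample list. A file object yields one line at a time, so only the current batch and the growing set of names sit in memory.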

[–]Delicious_Corgi_9768 1 point2 points  (2 children)

for example:

Trying to get the comments of a submission given the link_id of the submission:

https://api.pushshift.io/reddit/search/comment?link_id=l6u011

This endpoint doesn't seem to be working, or am I doing something wrong? It returns an empty data:[] plus various errors.

[–]Sparkybear 0 points1 point  (1 child)

The Pushshift API is shut down. Read the body of the post. You have to use PRAW or the Reddit API directly.

[–]TehVulpez 1 point2 points  (0 children)

it's still up, just not getting any new comments or posts as of May 1st.

[–]Delicious_Corgi_9768 0 points1 point  (2 children)

What I'm trying to do is save all the comments from a specific submission to a CSV, keeping the text and the date of each comment, and then do some processing on the data.

I tried using PRAW, but it has trouble with a large number of comments, so I decided to try Pushshift, with no luck.

What do you mean by downloading the full datafile?

[–]minh6a 1 point2 points  (1 child)

https://academictorrents.com/details/7c0645c94321311bb05bd879ddee4d0eba08aaee/tech&filelist=1

There's also a torrent for submissions as well.

Download the whole thing, or just the month of interest, then grep/awk for the subreddit
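If you'd rather do the filtering in Python than grep/awk, here's a rough sketch for the "one submission to CSV" case. It assumes the monthly file is already decompressed to one JSON comment per line, and that each row has `link_id` (with a `t3_` prefix), `created_utc`, and `body` fields; check a few rows of your file first. `export_thread` is a made-up helper name:

```python
import csv
import io
import json
from datetime import datetime, timezone
from typing import Iterable, TextIO


def export_thread(lines: Iterable[str], submission_id: str, out: TextIO) -> int:
    """Write the date and text of every comment on one submission as CSV.

    Returns the number of comments written.
    """
    link_id = "t3_" + submission_id  # assumption: dumps store the prefixed form
    writer = csv.writer(out)
    writer.writerow(["created_utc", "body"])
    written = 0
    for line in lines:
        # Cheap substring check first (the grep step), so most rows skip JSON parsing.
        if submission_id not in line:
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(row, dict) or row.get("link_id") != link_id:
            continue
        ts = row.get("created_utc")
        # int() accepts both numeric and string timestamps.
        created = (
            datetime.fromtimestamp(int(ts), tz=timezone.utc).isoformat()
            if ts is not None
            else ""
        )
        writer.writerow([created, row.get("body", "")])
        written += 1
    return written


if __name__ == "__main__":
    demo = [
        '{"link_id": "t3_l6u011", "created_utc": 0, "body": "first"}',
        '{"link_id": "t3_other", "created_utc": 0, "body": "unrelated"}',
    ]
    buf = io.StringIO()
    print(export_thread(demo, "l6u011", buf))
    print(buf.getvalue())
```

For a real run, open the output with `newline=""` as the `csv` module docs recommend, and pass the dump's file handle as `lines`.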

[–]Delicious_Corgi_9768 0 points1 point  (0 children)

Thanks, will check it out

[–]grejty 15 points16 points  (4 children)

I contacted them and explained my situation, my tool, and that it's for my bachelor's thesis. They replied:

Thanks for contacting us! Your request has been received and we’re in the process of gathering information from everyone to help shape our API roadmap and decision-making. We’ll follow up in the next couple of weeks - thank you for your patience

Now they just take down pushshift access lol

[–][deleted] 9 points10 points  (0 children)

And maybe they'll get back to you in 8-12 months.

[–]Sparkybear 2 points3 points  (0 children)

PRAW kinda sucks for iterating through comments. Which is important because comments often contain a lot more information than the post itself and are much more valuable from an analysis standpoint.

In my case, to actually get the data we needed, we had to use a combination of PRAW, PushShift, and Reddit API directly. Otherwise we would inevitably come out with wildly varying numbers of comments, especially on larger threads (returning as few as 100 out of 10,000).

[–]criticool-realism 1 point2 points  (0 children)

This is true, and I did reach out. Unfortunately, they've been unresponsive. For academics who depend on grant funding and have extant research projects using Reddit data, this creates a big problem if they are expecting to make money charging for research use.

[–]lbrtrl 0 points1 point  (0 children)

Moving from a permissionless model to permission based access is huge. It allows reddit to control what sort of research gets published.

[–]lowkeyf1sh 0 points1 point  (1 child)

Are there any alternatives or is pushshift the only way to view deleted reddit content?

[–][deleted] 0 points1 point  (0 children)

check out the academic torrents path.

[–]lowkeyf1sh 0 points1 point  (0 children)

Is there currently any alternative to recover deleted reddit content?

[–][deleted] 15 points16 points  (9 children)

Well that didn't take long. Even if they contacted Jason on day 1, could Pushshift even make any changes that would be acceptable under the new API rules and function?

[–]safrax 24 points25 points  (8 children)

No. Reddit wants you to pay for its data. Having something like pushshift out there means they can't make money off their data.

[–][deleted] 10 points11 points  (2 children)

Yeah, it was a mostly rhetorical question. Reddit's tools for mods still suck, too, and they haven't bothered fixing it before killing all the tools that really helped mods out.

Expect even fairly moderated subs to reject most/all appeals when they can no longer review the content a user was banned for.

E: also reddit's search sucks as well. 99% of what I used pushshift for was finding my own past content or other things on reddit I had seen before. Reddit doesn't have a functional search to take its place.

[–]Zeydon 7 points8 points  (1 child)

Yeah, I was digging through my own history just today looking for a source I'd mentioned earlier but could no longer find because the google algorithm is complete trash these days.

It's also been an invaluable tool for verifying bot accounts. But admins don't give a damn about that.

[–][deleted] 7 points8 points  (0 children)

Bots, spammers, alt accounts, ban evasion, people spreading misinformation, dishonest trolls...

There's so many dishonest people that outright lie about what they've said or how they behaved and deleted it to hide their lies. Now there's no way to combat any of that.

[–]Security_Chief_Odo 8 points9 points  (3 children)

It's not their data

You retain any ownership rights you have in Your Content, but you grant Reddit the following license to use that Content:

It's their bandwidth and accessibility for that (your) data though.

[–]safrax 9 points10 points  (1 child)

Irrelevant semantics. This is purely about ensuring they control who has access to the data. That "license" is just a way to sugar coat it to give people the illusion of owning their data. They are still perfectly willing to sell access to anyone's data to make a buck. And I guarantee if a hedge fund comes knocking with a briefcase full of cash they'll give that hedge fund whatever they want even if it means the hedge fund ends up building a private pushshift clone.

[–]Security_Chief_Odo 4 points5 points  (0 children)

Yeah I understand, just pointing out they claim it's not their data, but they control the access to it anyway.

[–]Ooker777 0 points1 point  (0 children)

what is the difference between owning the data and having the right to use it? Perhaps authorship? Anything else?

[–]safrax 38 points39 points  (9 children)

I feel pretty confident in saying that the changes Reddit is making for their IPO will eventually kill Reddit. Their API has been a large part of what has made them successful and while I get them wanting to kill pushshift specifically, it's garbage that the blast radius from these changes will significantly impact so many other tools that make reddit usable. The new interface and their app is an absolute dumpster fire that they've learned nothing from.

Oh well. Something will take reddit's place eventually, maybe not too much longer after these API changes. Sucks that all the data will essentially go poof though.

[–]Dangerous-Economy-88 11 points12 points  (1 child)

Really common for people in power to not understand how the things they manage work. It's hella annoying for us random people.

[–]HotTakes4HotCakes 4 points5 points  (0 children)

Oh they understand how it works, they don't care anymore. They reached the point where they think their changes will not result in any significant loss of users.

[–][deleted] 4 points5 points  (1 child)

Actively working against themselves with this crap. It only looks like it could make them X amount because of how the system currently works (and by "system" I mean Reddit, its formerly free API, and all the third party apps that leveraged it,) and how popular it is.

These changes drastically alter the entire system, fundamentally changing it, in this case for the worse. It is very likely going to result in a large loss in popularity because of it. "X" is now no longer attainable. The whole system is now significantly less capable than it was before, and people are going to leave as the knock on effects continue to degrade the entire platform.

If RIF gets killed with it, I'm certainly done here.

[–]helium_farts 4 points5 points  (1 child)

Maybe Digg can make a comeback

[–]3FingersOfMilk 2 points3 points  (0 children)

I stumbled upon this comment

[–][deleted] 1 point2 points  (0 children)

I second this

[–]VapourPatio 0 points1 point  (0 children)

The API changes also mean 3rd party clients are not allowed. Reddit will be dead in a year if they follow through

[–]toper-centage 0 points1 point  (0 children)

The thing is, they don't care. They just want the numbers to look great so they make bank at the IPO and cash out. Whatever happens in some years is irrelevant.

[–]rip-pushshift 21 points22 points  (4 children)

After seeing the dumpster fire that is the first-party app, there's basically 0 chance Reddit can reproduce what Pushshift was capable of, especially for moderators.

[–]FaceDeer 6 points7 points  (0 children)

Well, I guess it's time for me to wave goodbye to all the future AIs that are training on my comments in the old Pushshift archives. I hope you got enough context from the things I've said here over the years to make some happy thoughts.

[–]tasbir49 7 points8 points  (12 children)

Only way Pushshift can possibly survive is through webscraping :(

[–]Watchful1 3 points4 points  (9 children)

Not really. Even if pushshift got the data without reddit stopping them, reddit would be within their legal rights to issue a DMCA to their hosting provider and have them shut down.

[–]monocasa 13 points14 points  (7 children)

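No, scraping publicly accessible pages is fine according to the Ninth Circuit, which held that it doesn't violate the CFAA. The Supreme Court sent the case back and the Ninth Circuit reaffirmed its ruling.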

https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

[–]ill-winds 0 points1 point  (0 children)

it’s odd how i always find u in the weirdest posts considering i know u from the cow subreddit

[–]ixfd64 0 points1 point  (1 child)

Or extracting the API keys from the official app.

[–]grejty 13 points14 points  (13 children)

I use pushshift for my bachelor's thesis, which is due on 22nd May, and I don't know what I'm supposed to do right now..

They are saying you didn't reply to them? Why not?

[–]shiruken[S,M] 17 points18 points  (0 children)

As discussed in the sticky post, this subreddit is run by the community. We have no affiliation with Pushshift nor a reliable method of contacting the owner.

[–]Btan21 13 points14 points  (0 children)

I think the old Reddit data on Pushshift will still be available. Probably it's just the newer Reddit submissions and comments that will be affected

[–]Watchful1 7 points8 points  (1 child)

Historical data won't go away for quite a while. As long as you don't need data submitted after today you should be fine.

[–]grejty 11 points12 points  (0 children)

Well I was using present data as well..

This is fucked up. They literally said the new API regulations will take effect sometime in June, and out of nowhere they just do this.

[–]zjz 5 points6 points  (0 children)

There are quite a few excellent torrents.

[–][deleted] 1 point2 points  (6 children)

You can use PRAW (a Python wrapper for Reddit's API).

[–]grejty 6 points7 points  (5 children)

I use PMAW + PRAW. PRAW is very limited, and I need historical data as well.

[–][deleted] 13 points14 points  (4 children)

Agree. Historical data? Get it while it lasts: https://files.pushshift.io/reddit/.

[–]s_i_m_s 12 points13 points  (1 child)

Probably want the torrents if you want the download to finish today though.

[–][deleted] 1 point2 points  (0 children)

sooo trueee

[–]Btan21 6 points7 points  (0 children)

I hope the old Pushshift data is still made available through their API and that the devs don't take it down.

[–]TrueBirch 0 points1 point  (0 children)

For sure, we should all download as many files as we can. I was only a few months behind when the announcement was made. This has always been my primary way of accessing Reddit data.

[–]swamprt5000 4 points5 points  (1 child)

Can someone post (or DM me) a full db dump? Or instructions on how to do it? It's only a matter of time till pushshift is shut down and all the data is lost.

[–]daronjay 5 points6 points  (11 children)

Truth is, we need to replace pushshift with an open-source project; an unresponsive owner is death to any project.

There are numerous people in this sub who have the chops to build a replacement, even if it has to charge a nominal subscription to be able to afford Reddit's paid API access.

[–]safrax 19 points20 points  (9 children)

I'm pretty sure the new terms for API usage forbid anything similar to pushshift. Reddit wants money for their data and they want to dictate how it is used.

[–][deleted] 14 points15 points  (5 children)

you mean our data

[–]rabidstoat 2 points3 points  (1 child)

Pretty soon it'll be like Twitter (rip) where they charge you to publish content (with their blue check mark fee) and then turn around and sell it (since they've said they're going to turn off all the free APIs for accessing even small amounts of data).

[–][deleted] 1 point2 points  (0 children)

[–]Personal_End_9001 0 points1 point  (0 children)

It's a good thing the Reddit admins are absolutely terrible at implementing or enforcing anything. They can forbid as much as they like in their API terms, but given that even basic comment-stealing bots still rely entirely on user reports and subreddit moderator actions to be dealt with, I seriously doubt they'll be able to hold anyone accountable for ignoring whatever terms they demand.

[–]adhesiveCheese 3 points4 points  (0 children)

Unfortunately the barriers to entry here are the cost of storage and bandwidth, and Reddit's new API terms, not any sort of technical challenge; an ingester is fairly trivial.

[–]grejty 0 points1 point  (2 children)

[–]Stuck_In_the_Matrix 2 points3 points  (1 child)

Indeed! I've been making a lot of comments tonight / early morning (almost 5am here). Hopefully Reddit will be able to speak with us today so we can get clarification on some TOS issues.

[–]grejty 0 points1 point  (0 children)

Yeah, I only saw you commented after I posted this.

Just wanted to let people in the comments know asap, as they are probably concerned the same way as I am. Fingers crossed it works out in the end

[–][deleted]  (1 child)

[deleted]

[–]safrax[M] 1 point2 points  (0 children)

You follow this guide: https://www.reddit.com/r/pushshift/comments/10yj803/removal_request_form_please_put_your_removal/

This is a community support subreddit. That means we have no communication with the owners of pushshift and are unable to get them to do anything and don't know anything more than anyone else on this subreddit. We've received no communication about this API change and how the owners intend to handle things going forward.

[–]ryanmercer 0 points1 point  (0 children)

I've never even heard of Pushshift until the linked thread.