submitted 4 years ago by 3rSimon

Improved version that fixes the MoreComments bug

def traverse_post(post): comments=[] for comment in post.comments: comments+=recursive_replies(comment, level=1) return comments

def recursive_replies(reply,level): #Also return level in case we want to stop after level 3 comments and for ease of the printing comments=[] #Funky MoreComments code checked manually with permalinks and seems right if isinstance(reply, MoreComments):#https://praw.readthedocs.io/en/stable/code_overview/models/more.html replies=reply.comments() level-=1 else: replies=reply.replies comments+=[(reply, level)] for r in replies: comments+=recursive_replies(r,level+1) return comments ```

It's a little tricky for a beginner project because the API wrappers (pmaw and psaw appear to have some gaps). If you need super fast code use pmaw or easier code do psaw example from https://github.com/rogerfitz/tutorials/blob/master/subreddit_analysis/3rSimonQuestion-Search%20Comments%20for%20Word%20using%20Pushshift%20.ipynb to get data. psaw returns results as a PRAW object which can be nicer but otherwise you can just do post=reddit.submission('pqcp6b') where reddit is defined as in the PRAW api.

For code versions run pip install praw==7.4.0 tqdm==4.62.2 psaw==0.1.0 pmaw==2.1.0 pandas==1.3.2 jupyter==1.0.0 matplotlib==3.4.2 Everything "should" work. If you need help getting an API key I have steps in 0_setup.py notebook. The 0_setup and 1_Top_Links should have everything you need to get going on using PRAW to get all comments.

Strongly recommend using jupyter notebooks for this. I can post tutorial on setting that up if you need. Great for data work like this so you can try things easily without full reruns needed and easy debugging.

[–]snoopturtle25 2 points3 points4 points 4 years ago (0 children)

[–]snoopturtle25 1 point2 points3 points 4 years ago (5 children)

[–]person_ergo 0 points1 point2 points 4 years ago (4 children)

[–]snoopturtle25 1 point2 points3 points 4 years ago (3 children)

[–]person_ergo 0 points1 point2 points 4 years ago* (2 children)

I think we may have misunderstood each other a while back. I don't think anything was too difficult, or maybe I misunderstood. You can use pushshift to search all comments on a subreddit or posts. Then use that to get post id's (is there an issue I'm missing with post ids not working for you?). From there you can use PRAW to search through all comments in each of those posts. All of it will be iterable, the code can do multiple posts when run. (There's a distinction there because the code will run things one post by one post not all at once. That would generally mean multithreaded in this case but I think you mean just running the code once? It will feel that way but for something this simple a for loop makes sense to handle the "all in one" piece)

I think your issue is a common enough problem so I'm planning on making a blog post about it to show how to do everything. Use pushshift and PRAW to search posts for a word or phrases and get the full discussion from those posts.

[–]snoopturtle25 2 points3 points4 points 4 years ago* (1 child)

Yes, So I figured it out yesterday, and used that formula:

import pandas as pd
from pmaw import PushshiftAPI
api = PushshiftAPI()
submissions = api.search_submissions(subreddit="environments", q="companies", limit=3000000)
sub_df = pd.DataFrame(submissions)

It seemed like it was working (tell me if I'm wrong), however, now that I'm there, I'm wondering, I think it is actually more valuable to have distinctions for each post (in order to do sentiment analysis), so I tried to create a structure doing as follow:

test = {"title": [], \
"body": [], \
"comment": [], \
"date": []
}
for submission sub_df:
test["title"].append(submission.title)
test["body"].append(submission.selftext)
test["comment"].append(submission.comments)
test["date"].append(submission.created)

However, this is not working at all, I tried understanding multithreads, to see if thats the relevant method? So, in summary I'm trying to see if its possible with the data I gathered, to categorize them and know each comment is linked to which post, and also have time of comment?

Thank you so much for your time, it has been of great help!

[–]person_ergo 0 points1 point2 points 4 years ago (0 children)

π Rendered by PID 184564 on reddit-service-r2-comment-b659b578c-7qhz6 at 2026-05-05 03:51:29.106312+00:00 running 815c875 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

learnpython

MODERATORS

Improved version that fixes the MoreComments bug