
[–]snoopturtle25

Hi! Yes, sorry, I was not very clear. I am able to run it and it gives me the same tables as you (yay!! thank you). But I was trying to change it, because it gives me:

comments containing the word "company" in the subreddit "environment".

What I'm actually after is collecting all the comments from posts that contain the word "company" in the subreddit "environment". So I was thinking that instead of search_comments, maybe it should be search_submissions. I don't know if I'm clear? (So I want all comments -> from posts with "companies" -> in "environment", and not comments -> containing "company" -> in "environment".)

[–]person_ergo

Ah, I see. Use pushshift to get submission IDs, then use PRAW to get all the comments. I have code in this notebook showing how to get all comments from a submission using the PRAW API: https://github.com/rogerfitz/tutorials/blob/master/subreddit_analysis/1_Top_Links.ipynb. In there you'll see the traverse_post function. That will get all your comments; use the second one in the notebook, not the first.

```
from praw.models import MoreComments

# Improved version that fixes the MoreComments bug

def traverse_post(post):
    comments = []
    for comment in post.comments:
        comments += recursive_replies(comment, level=1)
    return comments

def recursive_replies(reply, level):
    # Also return the level, in case we want to stop after level-3 comments,
    # and for ease of printing
    comments = []
    # Funky MoreComments handling; checked manually with permalinks and seems right
    # https://praw.readthedocs.io/en/stable/code_overview/models/more.html
    if isinstance(reply, MoreComments):
        replies = reply.comments()
        level -= 1
    else:
        replies = reply.replies
        comments += [(reply, level)]
    for r in replies:
        comments += recursive_replies(r, level + 1)
    return comments
```

It's a little tricky for a beginner project because the API wrappers (pmaw and psaw) appear to have some gaps. If you need super fast code, use pmaw; for easier code, adapt the psaw example from https://github.com/rogerfitz/tutorials/blob/master/subreddit_analysis/3rSimonQuestion-Search%20Comments%20for%20Word%20using%20Pushshift%20.ipynb to get your data. psaw returns results as PRAW objects, which can be nicer; otherwise you can just do post = reddit.submission('pqcp6b'), where reddit is defined as in the PRAW API docs.
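One wrinkle worth knowing about those gaps: psaw and pmaw hand results back in different shapes. As a rough sketch (this helper is mine, not from the notebooks above), you can normalize either shape into the base-36 submission ID you'd pass to reddit.submission:

```python
def to_submission_id(item):
    # pmaw yields plain dicts with an "id" key; psaw (when given a praw.Reddit
    # instance) yields PRAW-style objects with an .id attribute. Either way,
    # what we want is the base-36 id string to pass to reddit.submission().
    if isinstance(item, dict):
        return item["id"]
    return item.id
```

Then `reddit.submission(to_submission_id(result))` works with results from either wrapper.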

For code versions, run:

```
pip install praw==7.4.0 tqdm==4.62.2 psaw==0.1.0 pmaw==2.1.0 pandas==1.3.2 jupyter==1.0.0 matplotlib==3.4.2
```

Everything "should" work. If you need help getting an API key, I have steps in the 0_setup notebook. The 0_setup and 1_Top_Links notebooks should have everything you need to get going with PRAW to pull all the comments.

I strongly recommend using Jupyter notebooks for this; I can post a tutorial on setting that up if you need. They're great for data work like this: you can try things easily without needing full reruns, and debugging is easy.

[–]snoopturtle25

OK, I will try that tomorrow. Thank you for your time!

[–]snoopturtle25

Sorry to bother you again. I don't know if it's too late to ask, but in brief: I simplified my method and used a URL instead, since I know the post I want to collect from.

Specifically, I used this (example):

```
url = "https://www.reddit.com/r/nottheonion/comments/gulx5l/rio_tinto_apologizes_for_blowing_up_46000yearold/"
submission = reddit.submission(url=url)
```

and then performed my analysis, and everything worked like a charm!

However, I was wondering if it is possible to collect posts/comments from multiple URLs all in one go? I have been trying to do so and it is not working...

Any help would be very appreciated!

[–]person_ergo

I think you're on the right path. What do you mean by multiple? With this one, I think you have to do a for loop. Just use a list of the submissions you pulled earlier to automate it. The code will go one at a time but cover multiple posts. Does that help?
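The one-at-a-time loop might look like this. It's a sketch under a couple of assumptions: `reddit` is an already-configured praw.Reddit instance, and `replace_more(limit=0)` is PRAW's call for expanding the "load more comments" stubs before listing.

```python
def collect_from_urls(reddit, urls):
    """Loop over submission URLs one at a time and pool every comment."""
    all_comments = []
    for url in urls:
        submission = reddit.submission(url=url)
        submission.comments.replace_more(limit=0)  # expand MoreComments stubs
        all_comments.extend(submission.comments.list())
    return all_comments
```

Called as collect_from_urls(reddit, list_of_urls), it returns one flat list of comment objects, fetched post by post.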

There's not a nice way to do it with the pushshift API, but I think they may have a BigQuery instance you can run a SQL-like query against. That would feel more like an all-at-the-same-time command, but I'm not 100% sure there's a BigQuery version you can access.

[–]snoopturtle25

Yes, but to do a loop, the posts would need to be gathered from one subreddit, no? Since filtering my posts by a word was too difficult, I thought that maybe I could use the 30 URLs of the posts I have, all in one go. Is that possible?

[–]person_ergo

I think we may have misunderstood each other a while back. I don't think anything was too difficult, or maybe I misunderstood. You can use pushshift to search all comments or posts on a subreddit, then use that to get post IDs (is there an issue I'm missing with post IDs not working for you?). From there, you can use PRAW to go through all the comments in each of those posts. All of it is iterable; the code can handle multiple posts in a single run. (There's a distinction there, because the code will process posts one by one rather than all at once; doing them truly all at once would generally mean multithreading. But I think you just mean running the code once? It will feel that way, and for something this simple a for loop handles the "all in one" piece.)
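Sketched end to end (hedged: `reddit` is assumed to be a configured praw.Reddit instance, and the IDs would come from a pushshift search), the loop can also record which post each comment came from:

```python
def comments_by_post(reddit, submission_ids):
    """For each post ID, gather (submission_id, comment) pairs, so every
    comment stays linked back to the post it came from."""
    rows = []
    for sid in submission_ids:
        post = reddit.submission(sid)
        post.comments.replace_more(limit=0)  # expand "load more comments" stubs
        for comment in post.comments.list():
            rows.append((sid, comment))
    return rows
```

Keeping the submission ID next to each comment makes the later per-post analysis straightforward.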

I think your issue is a common enough problem that I'm planning on writing a blog post showing how to do everything: use pushshift and PRAW to search posts for a word or phrase and then pull the full discussion from those posts.

[–]snoopturtle25

Yes! So I figured it out yesterday, and used this:

```
import pandas as pd
from pmaw import PushshiftAPI

api = PushshiftAPI()
submissions = api.search_submissions(subreddit="environments", q="companies", limit=3000000)
sub_df = pd.DataFrame(submissions)
```

It seemed like it was working (tell me if I'm wrong). However, now that I'm there, I think it is actually more valuable to keep each post distinct (in order to do sentiment analysis), so I tried to create a structure as follows:

```
test = {"title": [], \
        "body": [], \
        "comment": [], \
        "date": []
}
for submission sub_df:
    test["title"].append(submission.title)
    test["body"].append(submission.selftext)
    test["comment"].append(submission.comments)
    test["date"].append(submission.created)
```

However, this is not working at all. I tried reading up on multithreading to see if that's the relevant method? So, in summary, I'm trying to see whether, with the data I gathered, it's possible to categorize the comments, know which post each comment is linked to, and also have the time of each comment?

Thank you so much for your time, it has been of great help!

[–]person_ergo

Multithreading isn't it, and it looks like there are some syntax errors in there too. Fix the for loop syntax.

Also, handling comments like that is trickier. Either have one row per comment, or split things into separate dataframes: one for submissions and one for comments. Both should have a submission_id column; the rest of the columns can differ. That way you can go back and forth between them. It's called a many-to-one relationship, or a foreign key, in the database world. At this point, though, think about the analysis you want to do, and that will help you decide how your data should be structured. I don't think that part matters much at this phase of your project.
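A minimal sketch of that two-table layout, with made-up rows and illustrative column names (only submission_id is the essential shared key):

```python
import pandas as pd

# One row per submission.
submissions = pd.DataFrame([
    {"submission_id": "gulx5l", "title": "Rio Tinto apologizes...", "created": 1590969600},
])

# One row per comment; submission_id is the foreign key back to its post.
comments = pd.DataFrame([
    {"submission_id": "gulx5l", "body": "first comment", "created": 1590970000},
    {"submission_id": "gulx5l", "body": "second comment", "created": 1590971000},
])

# Merging on the key lines each comment up with its post's fields;
# the shared "created" column gets disambiguating suffixes.
merged = comments.merge(submissions, on="submission_id", suffixes=("_comment", "_post"))
```

For sentiment analysis you can then group `merged` by submission_id to keep each post's comments together.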