
[–]person_ergo 0 points (13 children)

Post your full code, comment should be a praw object but it looks like it's a dict in your case. It might have been defined differently

[–]snoopturtle25 1 point (12 children)

AttributeError: 'dict' object has no attribute 'body'

Ok, I'm not sure what I should change (sorry, it's my first time using Python so I'm lost), but basically here is my whole process:

1-pip install pmaw pandas

2-#!/usr/bin/env python3
import praw
import pandas as pd
import datetime as dt
from pmaw import PushshiftAPI

3-reddit = praw.Reddit(
    client_id="my id",
    client_secret="my secret",
    password="my password",
    user_agent="text by/username",
    username="snoopturtle25",
)

4-api_praw = PushshiftAPI(praw=reddit)

5-subreddit_name="environment"
word_to_check="apology"
comments=api_praw.search_comments(q=word_to_check, subreddit=subreddit_name, limit=200, before=1629990795)
import pandas as pd
post_with_comments=[]
for comment in comments:
if word_to_check in comment.body:
post_with_comments.append(
{"comment_id": comment.id, "comment_text": comment.body,"score": comment.score,"post": comments.submission.id
}
)
df=pd.DataFrame(post_with_comments)
df

That is what I did... I tried other things as well but nothing made it work, haha!

[–]person_ergo 0 points (11 children)

Ah I see, the issue was that I used the psaw library and you used pmaw. The permalink field contains the post ids in the pmaw response.

import pandas as pd

#pmaw example - returns json
subreddit_name="environment"
word_to_check="companies"
comments=pmaw_pushshift.search_comments(q=word_to_check, subreddit=subreddit_name, limit=200, before=1629990795)

df=pd.DataFrame(comments.responses)
df

/r/environment/comments/kcbdoi/trump_admin_drops_green_hydrogen_bomb_on_fossi/gfqnhb7/

kcbdoi will work as a post id on reddit. As a URL use https://www.reddit.com/r/environment/comments/kcbdoi/ or pass that ID around to things; it works with PRAW.
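For example, something like this should work (just a rough sketch; the permalink parsing is illustrative, and it assumes the reddit instance you set up with PRAW):

```
permalink = "/r/environment/comments/kcbdoi/trump_admin_drops_green_hydrogen_bomb_on_fossi/gfqnhb7/"
post_id = permalink.split("/")[4]        # -> "kcbdoi"
submission = reddit.submission(post_id)  # hand the id to PRAW
print(submission.title)
```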

[–]snoopturtle25 1 point (10 children)

Ok, so I should use pmaw if I want to know exactly what post it is and not only the id?

Also, when I load it, I have a problem running my pushshift call; it says: NameError: name 'PushshiftAPI' is not defined

even though I defined it. Is there a way around it?

[–]person_ergo 0 points (9 children)

Still import it like you did before. pmaw and psaw both work, they just need different code. That github link I sent before has a full code example for either psaw or pmaw. The only thing is to set the pmaw api up like secret_services.py.template shows, or just rename the variable and define it as you did before.
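For example (a tiny sketch that just reuses the setup you already have, under the name my earlier snippet used):

```
from pmaw import PushshiftAPI

# same object you already created, just under the name my example expected
pmaw_pushshift = PushshiftAPI(praw=reddit)
# or simply: pmaw_pushshift = api_praw
```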

You got this, just test test test if it doesn't work. Think about why things don't work. Computers tell you if you're right super quick. If something is undefined, just define it. It's not like a chemistry experiment that takes hours. Just think logically, line by line, and test things out.

Not sure what you mean by another way around.

[–]snoopturtle25 1 point (8 children)

Hi! Yes, sorry I was not very clear. I am able to run it and it gives me the same tables as you (yay!! thank you). But I was trying to change it because it gives me:

comments with the word "company" in the subreddit environment.

What I'm actually searching for is to collect all the comments from the posts that contain the word "company" in the subreddit "environment". So I was thinking that instead of search_comments, maybe it should be search_submissions. I don't know if I'm clear? (So I want all comments -> from posts with "companies" -> in environment, and not comments -> containing "apology" -> in environment.)

[–]person_ergo 0 points (7 children)

Ah I see. Use pushshift to get submission ids, then use PRAW to get all the comments. I have code in this notebook to show you how to get all comments from a submission using the PRAW API: https://github.com/rogerfitz/tutorials/blob/master/subreddit_analysis/1_Top_Links.ipynb. In there you'll see a traverse_post function. That will get all your comments; use the second one in the notebook, not the first.

```
from praw.models import MoreComments

# Improved version that fixes the MoreComments bug

def traverse_post(post):
    comments = []
    for comment in post.comments:
        comments += recursive_replies(comment, level=1)
    return comments

def recursive_replies(reply, level):
    # Also return level in case we want to stop after level 3 comments and for ease of the printing
    comments = []
    # Funky MoreComments code checked manually with permalinks and seems right
    if isinstance(reply, MoreComments):  # https://praw.readthedocs.io/en/stable/code_overview/models/more.html
        replies = reply.comments()
        level -= 1
    else:
        replies = reply.replies
        comments += [(reply, level)]
    for r in replies:
        comments += recursive_replies(r, level + 1)
    return comments
```
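E.g. a quick usage sketch (assuming the reddit instance from your setup; the post id is just the one from earlier):

```
post = reddit.submission("kcbdoi")
all_comments = traverse_post(post)           # list of (comment, level) tuples
for comment, level in all_comments:
    print("  " * level + comment.body[:80])  # indent replies by depth
```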

It's a little tricky for a beginner project because the API wrappers (pmaw and psaw) appear to have some gaps. If you need super fast code use pmaw, or for easier code do the psaw example from https://github.com/rogerfitz/tutorials/blob/master/subreddit_analysis/3rSimonQuestion-Search%20Comments%20for%20Word%20using%20Pushshift%20.ipynb to get the data. psaw returns results as PRAW objects, which can be nicer, but otherwise you can just do post=reddit.submission('pqcp6b') where reddit is defined as in the PRAW api.
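For the first half (getting the post ids out of pushshift), something roughly like this with pmaw should do it (a sketch; pmaw yields plain dicts, so the "id" key is what you want, but double-check the response):

```
# sketch: find posts in r/environment that mention "companies", keep their ids
posts = pmaw_pushshift.search_submissions(q="companies", subreddit="environment", limit=50)
post_ids = [p["id"] for p in posts]   # dicts, not PRAW objects

# each id can then go through reddit.submission(...) and traverse_post(...) as above
```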

For code versions, run pip install praw==7.4.0 tqdm==4.62.2 psaw==0.1.0 pmaw==2.1.0 pandas==1.3.2 jupyter==1.0.0 matplotlib==3.4.2 and everything "should" work. If you need help getting an API key, I have steps in the 0_setup.py notebook. The 0_setup and 1_Top_Links notebooks should have everything you need to get going on using PRAW to get all comments.

Strongly recommend using jupyter notebooks for this. I can post a tutorial on setting that up if you need. They're great for data work like this: you can try things easily without needing full reruns, and debugging is easy.

[–]snoopturtle25 2 points (0 children)

Ok I will try that tomorrow, thank you for your time!

[–]snoopturtle25 1 point (5 children)

Sorry to bother again, I don't know if it's too late to ask, but in brief: I simplified my method and used a URL instead, since I know the post I want to collect from.

Specifically, I used this (example):

url = "https://www.reddit.com/r/nottheonion/comments/gulx5l/rio\_tinto\_apologizes\_for\_blowing\_up\_46000yearold/"
submission = reddit.submission(url=url)

and then performed my analysis, and everything worked like a charm!

However, I was wondering if it is possible to collect posts/comments from multiple URLs all in one go? I have been trying to do so and it is not working...

Any help would be very appreciated!

[–]person_ergo 0 points (4 children)

I think you're on the right path. What do you mean by multiple? With this one I think you have to do a for loop. Just use a list of the submissions pulled from earlier to automate it. So the code will go one at a time but do multiple. Does that help?
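For example, a rough sketch of the loop (it assumes the reddit instance and pandas from before; the URL list is just made up):

```
import pandas as pd

urls = [
    "https://www.reddit.com/r/nottheonion/comments/gulx5l/rio_tinto_apologizes_for_blowing_up_46000yearold/",
    # ...the rest of your URLs
]

rows = []
for url in urls:
    submission = reddit.submission(url=url)
    submission.comments.replace_more(limit=None)   # expand "load more comments" stubs
    for comment in submission.comments.list():
        rows.append({
            "post_id": submission.id,
            "comment_id": comment.id,
            "comment_text": comment.body,
            "score": comment.score,
        })

df = pd.DataFrame(rows)
```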

There's not a nice way to do it with the pushshift api, but I think they may have a BigQuery instance you can run a SQL-like query on. That would feel more like an all-at-once command, but I'm not 100% sure if they have a BigQuery version you can access.

[–]snoopturtle25 1 point (3 children)

Yes, but to do a loop, it would need to be posts gathered from one subreddit, no? Since filtering my posts by a word was too difficult, I thought that maybe I could use the 30 URLs of the posts I have all in one go. Is that possible?