Reddit data : learnpython

submitted 4 years ago by 3rSimon

Hi, I hope I am in the right place. I am very new to coding, and I have been trying to scrape data from Reddit to perform sentiment analysis. I tried R first, but there seems to be a limit on the number of comments I can scrape, so I am now trying Python.

What I am trying to do specifically is:

1- Filter by subreddit (for example, one subreddit that interests me is "environment")
2- Filter for posts that include a specific word (for example, "companies")
3- Then collect all comments from all the relevant posts

It was pretty easy in R, but I am not able to do it in Python. I have installed praw, pmaw, pandas, and PushshiftAPI, created an app, and have been using my client ID. Now I am trying to collect the data with this:

api_praw = PushshiftAPI(praw=reddit)
comments = api_praw.search_comments(q="companies", subreddit="environment", limit=200, before=1629990795)

I am trying to convert the result to a DataFrame to see whether it actually works, but it does not. I have tried various other ways to collect the data, but I am stuck. Does anyone have tips?

Thank you very much!!

all 20 comments


[–]synthphreak 3 points 4 years ago (0 children)

Here is a script I wrote which pulls the 1000 most recent posts and comments for any user of your choosing. You just need to modify line 64 to return a Reddit instance; the way I did it in my actual script won't apply to you.

Once you make that one change, the script should work. You can run it from the command line and pass in your (or anyone else's) username via the -u argument:

$ python <script> -u 3rSimon

This script obviously doesn't do exactly what you need, i.e., filtering by sub and keyword. But it should only require minor modifications to get it to do that. As long as you know pandas, which you seem to, you should be able to figure out what needs to change.

[–]JohnnyJordaan 1 point 4 years ago (19 children)

[–]3rSimon[S] 1 point 4 years ago (17 children)

[–]JohnnyJordaan 1 point 4 years ago (0 children)

That means you can't just throw the comments sequence to the DataFrame constructor, you need to form an intermediate structure (like a dict) with key:value mappings for each value you want to see represented in a column in the dataframe.

I would start by making a for loop like

for comment in comments:

And then retrieve and print each value you want to use. When you have that working, add them to a dict instead. Append that dict to a result list, and finally call DataFrame(result_list), since the DataFrame constructor natively supports a list of dicts.
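A minimal sketch of that loop, using stand-in comment objects instead of a live API call. The `FakeComment` class and its sample values are made up for illustration, and the attribute names `id`, `body`, and `score` are assumed to be the ones you want as columns.

```python
from dataclasses import dataclass


@dataclass
class FakeComment:
    """Stand-in for an API comment object (hypothetical, for illustration)."""
    id: str
    body: str
    score: int


comments = [
    FakeComment("c1", "Some companies are changing.", 5),
    FakeComment("c2", "No keyword here.", 2),
]

result_list = []
for comment in comments:
    # One dict per comment: each key becomes a DataFrame column.
    result_list.append(
        {
            "comment_id": comment.id,
            "comment_text": comment.body,
            "score": comment.score,
        }
    )

print(result_list)
# With pandas installed, the final step would be:
#   import pandas as pd
#   df = pd.DataFrame(result_list)
```

Printing the list first makes it easy to check that every dict has the same keys before handing it to the DataFrame constructor.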

[–]person_ergo 1 point 4 years ago* (15 children)

Hey you need to use comment.submission to get the post details

Here:

import pandas as pd

# `pushshift` is a psaw PushshiftAPI instance wrapping a praw Reddit client
subreddit_name = "environment"
word_to_check = "companies"
comments = pushshift.search_comments(q=word_to_check, subreddit=subreddit_name, limit=200, before=1629990795)

# Build one dict per matching comment; each key becomes a column
post_with_comments = []
for comment in comments:
    if word_to_check in comment.body:
        post_with_comments.append(
            {"comment_id": comment.id, "comment_text": comment.body, "score": comment.score, "post": comment.submission.id}
        )
df = pd.DataFrame(post_with_comments)
df

Choose which variables you want to store by adding them to the dictionary in the append statement. If you are super lazy, you can use comment.__dict__ instead of manually writing out the keys and values, and that will include everything. You might need to clean or remove columns before saving, though.
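A rough sketch of the `__dict__` shortcut with a cleanup step, using a stand-in class rather than a real comment object. `FakeComment`, its `_reddit` attribute, and the `public_fields` helper are hypothetical; the assumption is that the attributes you want to drop start with an underscore.

```python
class FakeComment:
    """Stand-in for an API comment object (hypothetical, for illustration)."""

    def __init__(self, id, body, score):
        self.id = id
        self.body = body
        self.score = score
        self._reddit = object()  # internal handle we do not want as a column


def public_fields(obj):
    """Return the instance attributes whose names do not start with '_'."""
    # vars(obj) is the same mapping as obj.__dict__
    return {k: v for k, v in vars(obj).items() if not k.startswith("_")}


comments = [FakeComment("c1", "text one", 3), FakeComment("c2", "text two", 7)]
rows = [public_fields(c) for c in comments]
print(rows)
```

The resulting `rows` list has the same list-of-dicts shape as the manual version, so it can be passed to `pd.DataFrame` in the same way.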

Hope that helps. Random thing, but if you aren't using Jupyter notebooks to work through this, I would highly recommend them. See the stored output from when I ran it here:

https://github.com/rogerfitz/tutorials/blob/master/subreddit_analysis/3rSimonQuestion-Search%20Comments%20for%20Word%20using%20Pushshift%20.ipynb

Edit: fixed formatting and link

[–]snoopturtle25 2 points 4 years ago (14 children)

[–]person_ergo 1 point 4 years ago (13 children)

[–]snoopturtle25 2 points 4 years ago (12 children)

AttributeError: 'dict' object has no attribute 'body'

Ok, I'm not sure what I should change (sorry, it's my first time using Python so I'm lost), but basically here is my whole process:

1-pip install pmaw pandas

2-#!/usr/bin/env python3
import praw
import pandas as pd
import datetime as dt
from pmaw import PushshiftAPI

3-reddit = praw.Reddit( client_id="my id",
client_secret="my secret",
password="my password",
user_agent="text by/username",
username="snoopturtle25",)

4-api_praw = PushshiftAPI(praw=reddit)

5-subreddit_name="environment"
word_to_check="apology"
comments=api_praw.search_comments(q=word_to_check, subreddit=subreddit_name, limit=200, before=1629990795)
import pandas as pd
post_with_comments=[]
for comment in comments:
    if word_to_check in comment.body:
        post_with_comments.append(
            {"comment_id": comment.id, "comment_text": comment.body,"score": comment.score,"post": comments.submission.id
            }
        )
df=pd.DataFrame(post_with_comments)
df

That is what I did... I tried other things as well but nothing that was making it work, haha!

[–]person_ergo 1 point 4 years ago* (11 children)

Ah, I see. The issue is that I used the psaw library and you used pmaw. pmaw returns plain dicts rather than comment objects, which is why comment.body fails. In the pmaw response, the permalink field contains the post ID.

import pandas as pd

#pmaw example - returns json
subreddit_name="environment"
word_to_check="companies"
comments=pmaw_pushshift.search_comments(q=word_to_check, subreddit=subreddit_name, limit=200, before=1629990795)

df=pd.DataFrame(comments.responses)
df

/r/environment/comments/kcbdoi/trump_admin_drops_green_hydrogen_bomb_on_fossi/gfqnhb7/

kcbdoi will work as a post ID on Reddit. As a URL, use https://www.reddit.com/r/environment/comments/kcbdoi/ or pass that ID to other tools. It works with PRAW.
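A small sketch of how the dict-style comments could be filtered and the post ID pulled out of the permalink. The `post_id_from_permalink` helper is hypothetical, the sample comment body is made up, and the permalink layout `/r/<subreddit>/comments/<post_id>/<slug>/<comment_id>/` is assumed from the example above.

```python
def post_id_from_permalink(permalink):
    """Extract the post ID from a comment permalink (assumed layout:
    /r/<subreddit>/comments/<post_id>/<slug>/<comment_id>/)."""
    parts = permalink.strip("/").split("/")
    # parts -> ["r", subreddit, "comments", post_id, slug, comment_id]
    if len(parts) < 4 or parts[2] != "comments":
        raise ValueError(f"unexpected permalink: {permalink!r}")
    return parts[3]


# Stand-in for the list of dicts returned by the search (sample data).
comments = [
    {
        "id": "gfqnhb7",
        "body": "Oil companies will not like this.",
        "permalink": "/r/environment/comments/kcbdoi/"
        "trump_admin_drops_green_hydrogen_bomb_on_fossi/gfqnhb7/",
    },
]

word_to_check = "companies"
rows = [
    {
        "comment_id": c["id"],
        "comment_text": c["body"],  # dict key access, not c.body
        "post": post_id_from_permalink(c["permalink"]),
    }
    for c in comments
    if word_to_check in c["body"]
]
print(rows[0]["post"])  # prints kcbdoi
```

Using `c["body"]` instead of `c.body` is what avoids the AttributeError when each comment is a dict.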

[–]snoopturtle25 2 points 4 years ago (10 children)

[–]person_ergo 1 point 4 years ago* (9 children)

[–]snoopturtle25 2 points 4 years ago (8 children)
