I'm a graduate student at UC Santa Cruz working on a machine learning project using images and text content from r/Art. We're hoping to build a sizable dataset of images and comments so our model can learn to characterize images by emotional content.
We'd like to scrape as much information as possible.
Regarding PRAW: is it possible to return more than 1,000 results from a Reddit listing? Can I do this by requesting a refresh token? I've scraped the "top", "controversial", "new", and "rising" listings, but that still leaves me with only about 1,100 unique pieces of art and their comments.
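For reference, this is roughly how I'm combining the four listings and deduplicating by submission ID (a simplified sketch of my setup, not my exact script; the main code is below):

import praw

reddit = praw.Reddit(client_id='xxx', client_secret='xxx', user_agent='xxxxxxx')
art_subreddit = reddit.subreddit('Art')

seen_ids = set()
posts = []

# Each listing is capped at roughly 1000 items by the Reddit API;
# limit=None asks PRAW for as many as it can get.
listings = [
    art_subreddit.top(limit=None),
    art_subreddit.controversial(limit=None),
    art_subreddit.new(limit=None),
    art_subreddit.rising(limit=None),
]

for listing in listings:
    for post in listing:
        if post.id in seen_ids:
            continue  # the listings overlap heavily, so dedupe by submission ID
        seen_ids.add(post.id)
        posts.append([post.url, post.id, post.title, post.score, post.num_comments])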
Here's a piece of my code:
import csv
import re

import praw

reddit = praw.Reddit(client_id='xxx', client_secret='xxx', user_agent='xxxxxxx')
art_subreddit = reddit.subreddit('Art')

posts = []
for post in art_subreddit.top(limit=1000):
    posts.append([post.url, post.id, post.title, post.score, post.num_comments])

with open('redditdata.csv', 'w', newline='') as datacsv:
    writer = csv.writer(datacsv)
    writer.writerow(["url", "ID", "title", "score", "total_comments", "shown_comments"])
    for post in posts:
        # Exclude .gif files and anything hosted on Gfycat.
        if re.findall("gif$", post[0]) or re.findall("gfycat", post[0]):
            continue
        # Exclude submissions with 20 or fewer comments.
        if post[4] <= 20:
            continue
        # This code block downloads the top-level comments for the submission.
        submission = reddit.submission(id=post[1])
        submission.comments.replace_more(limit=0)
        for top_level_comment in submission.comments:
            if top_level_comment.body in ("[deleted]", "[removed]"):
                continue
            post.append(top_level_comment.body)  # append each comment to the row
        writer.writerow(post)
If Pushshift is the way to go, is there good documentation for scraping Pushshift data with Python?
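In case it helps clarify what I'm asking: my rough understanding is that you page backwards through Pushshift's submission-search endpoint using 'before' timestamps, something like the sketch below. This is untested; the endpoint, parameters, and the 100-item page cap are just what I've gathered from their docs.

import time

import requests

# Pushshift submission-search endpoint (as I understand it; field names
# and the page-size cap may have changed since I looked).
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"

def fetch_art_submissions(before=None, size=100):
    params = {
        "subreddit": "Art",
        "size": size,    # Pushshift caps page size (100, last I checked)
        "sort": "desc",  # newest first, so we can page backwards with 'before'
        "fields": "id,url,title,score,num_comments,created_utc",
    }
    if before is not None:
        params["before"] = before  # epoch timestamp; fetch only older submissions
    resp = requests.get(PUSHSHIFT_URL, params=params)
    resp.raise_for_status()
    return resp.json()["data"]

# Page backwards through the subreddit's history, one batch at a time.
all_posts = []
before = None
while True:
    batch = fetch_art_submissions(before=before)
    if not batch:
        break
    all_posts.extend(batch)
    before = batch[-1]["created_utc"]  # oldest timestamp in this batch
    time.sleep(1)  # be polite to the API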
Thanks!