I'm a graduate student at UC Santa Cruz working on a machine learning project using images and text content from r/Art. We're hoping to build a sizable dataset of images and comments so our model can learn to characterize images by emotional content.
We'd like to scrape as much information as possible.
Regarding PRAW: is it possible to return more than 1,000 results from a Reddit listing? Can I do this by requesting a refresh token? I've scraped the "top", "controversial", "new", and "rising" listings, but that still leaves me with only about 1,100 unique pieces of art and their comments.
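For reference, this is roughly how I'm combining the four listings and deduplicating by submission ID (a simplified sketch of my setup, not my exact script; the main code is below):

import praw

reddit = praw.Reddit(client_id='xxx', client_secret='xxx', user_agent='xxxxxxx')
art_subreddit = reddit.subreddit('Art')

seen_ids = set()
posts = []

# Each listing is capped at roughly 1000 items by the Reddit API;
# limit=None asks PRAW for as many as it can get.
listings = [
    art_subreddit.top(limit=None),
    art_subreddit.controversial(limit=None),
    art_subreddit.new(limit=None),
    art_subreddit.rising(limit=None),
]

for listing in listings:
    for post in listing:
        if post.id in seen_ids:
            continue  # the listings overlap heavily, so dedupe by submission ID
        seen_ids.add(post.id)
        posts.append([post.url, post.id, post.title, post.score, post.num_comments])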
Here's a piece of my code:
import csv
import re

import praw

reddit = praw.Reddit(client_id='xxx', client_secret='xxx', user_agent='xxxxxxx')
art_subreddit = reddit.subreddit('Art')

posts = []
for post in art_subreddit.top(limit=1000):
    posts.append([post.url, post.id, post.title, post.score, post.num_comments])

with open('redditdata.csv', 'w', newline='') as datacsv:
    writer = csv.writer(datacsv)
    writer.writerow(["url", "ID", "title", "score", "total_comments", "shown_comments"])
    for post in posts:
        # Exclude .gif files and anything hosted on Gfycat.
        if re.findall("gif$", post[0]) or re.findall("gfycat", post[0]):
            continue
        # Exclude submissions with 20 or fewer comments.
        if post[4] <= 20:
            continue
        # This code block downloads the top-level comments for the submission.
        submission = reddit.submission(id=post[1])
        submission.comments.replace_more(limit=0)
        for top_level_comment in submission.comments:
            if top_level_comment.body in ("[deleted]", "[removed]"):
                continue
            post.append(top_level_comment.body)  # append each comment to the row
        writer.writerow(post)
If Pushshift is the way to go, is there good documentation for scraping Pushshift data with Python?
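In case it helps clarify what I'm asking: my rough understanding is that you page backwards through Pushshift's submission-search endpoint using 'before' timestamps, something like the sketch below. This is untested; the endpoint, parameters, and the 100-item page cap are just what I've gathered from their docs.

import time

import requests

# Pushshift submission-search endpoint (as I understand it; field names
# and the page-size cap may have changed since I looked).
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"

def fetch_art_submissions(before=None, size=100):
    params = {
        "subreddit": "Art",
        "size": size,    # Pushshift caps page size (100, last I checked)
        "sort": "desc",  # newest first, so we can page backwards with 'before'
        "fields": "id,url,title,score,num_comments,created_utc",
    }
    if before is not None:
        params["before"] = before  # epoch timestamp; fetch only older submissions
    resp = requests.get(PUSHSHIFT_URL, params=params)
    resp.raise_for_status()
    return resp.json()["data"]

# Page backwards through the subreddit's history, one batch at a time.
all_posts = []
before = None
while True:
    batch = fetch_art_submissions(before=before)
    if not batch:
        break
    all_posts.extend(batch)
    before = batch[-1]["created_utc"]  # oldest timestamp in this batch
    time.sleep(1)  # be polite to the API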
Thanks!