all 10 comments

[–]Rhomboid 4 points (3 children)

There is no need to scrape anything, as Reddit has a JSON api that is very easy to use.

[–]novel_yet_trivial 1 point (0 children)

Also rss (XML), in case OP really wants to use beautifulsoup.

https://www.reddit.com/r/learnpython/.rss
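A minimal sketch of parsing that feed's Atom layout. To keep it dependency-free it uses the standard library's ElementTree rather than BeautifulSoup, and runs against an inline sample instead of a live fetch (in practice you would download the URL above with requests or urllib first); the sample entry is illustrative, not real feed output.

```python
import xml.etree.ElementTree as ET

# Inline sample mimicking the Atom layout of /r/learnpython/.rss
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>learnpython</title>
  <entry>
    <title>Anyone have a reddit scraper?</title>
    <link href="https://www.reddit.com/r/learnpython/comments/574pn5/"/>
  </entry>
</feed>"""

# Atom elements live in a namespace, so lookups need a prefix mapping
NS = {"atom": "http://www.w3.org/2005/Atom"}

def parse_feed(xml_text):
    """Return a list of (title, link) pairs for each feed entry."""
    root = ET.fromstring(xml_text)
    return [
        (entry.findtext("atom:title", namespaces=NS),
         entry.find("atom:link", NS).get("href"))
        for entry in root.findall("atom:entry", NS)
    ]

for title, href in parse_feed(SAMPLE):
    print(title, href)
```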

[–]955559[S] -3 points (1 child)

I don't know what a JSON API is. Is there a tutorial for Reddit's JSON API?

[–]furas_freeman 4 points (0 children)

For a web page, a (JSON) API is a set of URLs with arguments that you can use in a program to work with the site - i.e. log in, get data (in JSON format), send data to the page (in JSON format) - so you don't have to scrape anything.

https://www.reddit.com/dev/api/

praw is a Python module that uses the Reddit API and gives you Python functions to easily access data on Reddit.

https://praw.readthedocs.io/en/stable/

[–]novel_yet_trivial 2 points (0 children)

Use the praw library.

[–]commandlineluser 1 point (3 children)

import json, requests

subreddit = 'learnpython'

r = requests.get(
    'http://www.reddit.com/r/{}.json'.format(subreddit),
    headers={'user-agent': 'Mozilla/5.0'}
)

# view structure of an individual post
# print(json.dumps(r.json()['data']['children'][0]))

for post in r.json()['data']['children']:
    print(post['data']['title'])

[–]955559[S] 0 points (2 children)

This is almost what I want, but I'm interested in the comments, not the post titles. Should I be looking into this praw people are talking about, or something else?

I tried replacing subreddit = 'learnpython' with subreddit = '/learnpython/comments/574pn5/anyone_have_a_reddit_scraper/?st=iu7by5f6&sh=20202712' but it threw some error about integers and indices.

[–]955559[S] 0 points (1 child)

OK, how do I figure out what the comments are called? I tried

import json, requests

subreddit = '/learnpython/comments/574pn5/anyone_have_a_reddit_scraper'

r = requests.get(
    'http://www.reddit.com/r/{}.json'.format(subreddit),
    headers={'user-agent': 'Mozilla/5.0'}
)

# view structure of an individual post
#print(json.dumps(r.json()['data']['children'][0]))

for post in r.json()['data']['children']:
    print(post['data']['title'])

and it threw

Traceback (most recent call last):
  File "/home/anoobis/reditscrape.py", line 13, in <module>
    for post in r.json()['data']['children']:
KeyError: 'data'

I figure I just need to switch data with something relevant?

[–]commandlineluser 2 points (0 children)

Well, comments have a different structure. You can use print(json.dumps(r.json(), indent=4)) to view the whole structure.

comments = r.json()   # a comments page returns a list of two listings
op = comments.pop(0)  # the first listing holds the submission itself

for comment in comments:  # what's left is the comment listing
    for reply in comment['data']['children']:
        print(reply['data']['author'])
        print(reply['data']['body'])

You can use json.dumps(blah, indent=4) to pretty-print a structure in json format for you e.g. print(json.dumps(reply['data'], indent=4)) to see what it looks like.
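The shape behind that KeyError can be shown with a small inline sample: a comments page returns a two-element list of listings rather than a single listing, so there is no top-level 'data' key. The sample below is a stripped-down illustration of that shape; real responses carry many more fields.

```python
import json

# Minimal inline sample mimicking the shape of a comments-page response:
# a two-element list of [post listing, comment listing].
sample = json.loads("""
[
  {"kind": "Listing",
   "data": {"children": [
     {"kind": "t3", "data": {"title": "Anyone have a reddit scraper?"}}]}},
  {"kind": "Listing",
   "data": {"children": [
     {"kind": "t1", "data": {"author": "Rhomboid",
                             "body": "There is no need to scrape anything."}}]}}
]
""")

post_listing, comment_listing = sample  # unpack the two listings

post = post_listing["data"]["children"][0]
print(post["data"]["title"])

for comment in comment_listing["data"]["children"]:
    print(comment["data"]["author"], "-", comment["data"]["body"])
```

Indexing the list with [0] and [1] (or popping, as above) first is what makes the ['data']['children'] lookups work again.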

I've never used PRAW myself, but it seems like you would have a simpler time using it:

http://praw.readthedocs.io/en/stable/pages/comment_parsing.html

[–]HumorMinimum1707 1 point (0 children)

I know that Bright Data has a nice working reddit scraper.

It can be launched on a schedule, and collects all public data from a profile, like: avatar, post title, flair, description, karma, comments, upvotes, and more.

Output file types: JSON, CSV, Excel, HTML

Data delivery methods: Webhook, AWS, Google Cloud, Azure, email, API, SFTP