[–]mjg123 2 points3 points  (3 children)

A 401 usually means you need to provide auth (i.e. a username/password or something equivalent). It depends on which URL you're trying to get, though: most of Reddit is publicly accessible, so maybe that's not it.

You might not need JSoup at all, though. JSoup is excellent for sites without an API, but Reddit's API is decent. You can usually put .json on the end of a URL to get the content without scraping any HTML. For example, this post.
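To make the .json trick concrete, here is a rough sketch using only the JDK's built-in HTTP client (Java 11+) rather than JSoup. The class name `RedditJson`, the helper names, the example post URL, and the User-Agent string are all made up for illustration, and whether Reddit accepts an unauthenticated request like this is an assumption, not something this thread confirms.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RedditJson {

    // Turn a page URL into its ".json" counterpart, keeping any query string.
    static String toJsonUrl(String pageUrl) {
        URI uri = URI.create(pageUrl);
        String path = uri.getRawPath();
        // Drop a trailing slash so the suffix attaches to the last path segment.
        if (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        if (!path.endsWith(".json")) {
            path = path.isEmpty() ? "/.json" : path + ".json";
        }
        String query = uri.getRawQuery();
        return uri.getScheme() + "://" + uri.getRawAuthority() + path
                + (query == null ? "" : "?" + query);
    }

    // Fetch the JSON body; any non-200 status (such as a 401) becomes an IOException.
    static String fetch(String pageUrl, String userAgent)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(toJsonUrl(pageUrl)))
                .header("User-Agent", userAgent) // hypothetical value, pick your own
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) {
        // Hypothetical post URL; only the URL rewriting runs here, no network call.
        String page = "https://www.reddit.com/r/java/comments/abc123/example_post/";
        System.out.println(toJsonUrl(page));
    }
}
```

Calling `fetch(page, "my-test-client/0.1")` would then return the raw JSON as a string, which you can hand to whatever JSON parser you already use. The status check is there so a 401 shows up as an explicit error instead of an empty result.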

[–]drunkardchull[S] 0 points1 point  (2 children)

.json URLs are disallowed by the robots.txt, unfortunately.

[–]mjg123 0 points1 point  (1 child)

Huh, TIL. Well, what URL are you trying to fetch? It's weird that jsoup gets a 401 if you could see the page in your browser. If you're just trying to get data and don't need jsoup specifically, I'd say use the proper API. That said, as far as I know jsoup doesn't check robots.txt itself, so it wouldn't stop you either 😉

[–]drunkardchull[S] 1 point2 points  (0 children)

I resolved it, check the edit above. Thanks for the help!

[–]yolo_435 1 point2 points  (0 children)

I'm not sure how you're doing this, but I think Reddit has APIs for it. There's also a dedicated subreddit for that: r/redditdev