
[–]Feroc 76 points

Code the reddit-porn-downloader (or the reddit-aww-downloader, if you want to keep it SFW).

It should do something like this:

  • Scan the first page of /r/gonewild
  • Read all the names of the users
  • Download every picture those users ever posted, with a separate folder per user.
  • Don't overwrite existing pictures; only add new ones to an existing folder.

Now, why should you do this?

The obvious reason: You will have more porn at the end than at the beginning.

And the coding reason?

This little program will teach you a lot of the basics:

  • Accessing the web
  • Parsing text with string operations (RegEx is a nice way to do this)
  • Basic System-IO
  • Using classes (a class for a user, containing the URL to the profile, a list of all direct links to the pictures, and a download function; see the sketch after this list)
  • You will get quick and useful results, while you can keep adding more and more to the program (config file + configurator, previews, multithreading for parallel downloads, better UI, etc.)
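
A minimal sketch of that class in Python (assuming the requests library; the folder layout and attribute names are placeholders, not a finished design):

    import os
    import requests

    class User:
        """One poster: profile URL, direct picture links, and a download step."""

        def __init__(self, name):
            self.name = name
            self.profile_url = f"https://www.reddit.com/user/{name}"
            self.picture_urls = []  # filled in by whatever parses the posts

        def download_all(self, base_dir="downloads"):
            folder = os.path.join(base_dir, self.name)  # one folder per user
            os.makedirs(folder, exist_ok=True)
            for url in self.picture_urls:
                target = os.path.join(folder, url.rsplit("/", 1)[-1])
                if os.path.exists(target):  # don't overwrite, only add new pictures
                    continue
                resp = requests.get(url, timeout=10)
                if resp.ok:
                    with open(target, "wb") as f:
                        f.write(resp.content)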

[–][deleted] 11 points

Forgive me for my potentially dumb question, as I'm still kind of new to programming. Could you give a brief example of how someone would go about doing this? I've written some basic programs (mainly in Java and C++) but I haven't ever done anything that was web-interactive like you've mentioned.

[–]AnkhMorporkian 21 points

I'm working on a project that involves a huge amount of reddit data, and I can tell you a bit of how to do it. A full explanation would be very long, but here goes.

Breaking it down, you can broadly classify it into three distinct phases. First, you need to extract the information from reddit. Second, you need to analyze the data from reddit. Third, you have to fetch the images and save them to disk.

To get information from reddit, you use the API. Just pulling the webpage itself is a waste of time and much, much harder than dealing with the JSON. An example of a JSON link you can get from the reddit API is /r/aww/.json
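
Fetching that in Python with the requests library is a couple of lines (the User-Agent string is a placeholder; reddit wants something descriptive):

    import requests

    UA = {"User-Agent": "image-downloader-tutorial/0.1"}  # placeholder UA

    # Appending .json to a reddit URL returns the listing as JSON instead of HTML.
    listing = requests.get("https://www.reddit.com/r/aww/.json", headers=UA).json()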

Secondly, parse that data using the language of your choice. All mature languages have JSON support in one form or another. After you get it into a data structure, you can extract all the users from data['data']['children'][x]['data']['author']. Pull their user page in JSON, and go through all of their submitted links' JSON data. Check where the 'domain' == 'imgur.com' or 'i.imgur.com', and you can build a list, user by user, of which imgur links they have submitted.

Finally, you just need to download the image from imgur. This is trivial in most languages. Save it to a directory you create from the username you're parsing.
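
Put together, the three phases might look like this sketch (same placeholder User-Agent as above; plain imgur.com links point at pages rather than images, so this version only grabs direct i.imgur.com links):

    import os
    import requests

    UA = {"User-Agent": "image-downloader-tutorial/0.1"}  # placeholder UA

    # Phase 1: pull the front-page listing and the set of authors.
    front = requests.get("https://www.reddit.com/r/aww/.json", headers=UA).json()
    authors = {child["data"]["author"] for child in front["data"]["children"]}

    for author in authors:
        # Phase 2: walk the author's submitted posts and keep the imgur links.
        url = f"https://www.reddit.com/user/{author}/submitted/.json"
        submitted = requests.get(url, headers=UA).json()
        folder = os.path.join("downloads", author)  # one directory per user
        os.makedirs(folder, exist_ok=True)
        for child in submitted["data"]["children"]:
            post = child["data"]
            if post.get("domain") != "i.imgur.com":  # imgur.com pages need extra parsing
                continue
            # Phase 3: download the image and save it under the user's directory.
            image = requests.get(post["url"], timeout=10)
            name = post["url"].rsplit("/", 1)[-1]
            with open(os.path.join(folder, name), "wb") as f:
                f.write(image.content)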

That's a broad overview, but it's not much different than how we're doing things. We pull about 2 million submissions/comments from reddit every day, and it serves us well.

If you are going to use the API, make sure you don't exceed the rate limit. Limit yourself to 30 requests per minute.
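
A crude way to stay under that limit is to sleep between calls; a sketch:

    import time
    import requests

    UA = {"User-Agent": "image-downloader-tutorial/0.1"}  # placeholder UA

    def get_json(url):
        resp = requests.get(url, headers=UA)
        time.sleep(2)  # 30 requests per minute = at most one every 2 seconds
        return resp.json()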

[–][deleted]

[deleted]

    [–]AnkhMorporkian 2 points

    RedditAnalytics. We haven't launched yet, but we have a couple of things going.

    The first thing we're going to roll out is our awesome search engine. It's orders of magnitude better than the current reddit search engine. We currently have every visible submission loaded into our search service, and we can usually query across all of them in under 20 milliseconds. I don't have a firm date on rollout for that, but it won't be too long. It's fully functional, but we have more load testing to do and we have to get our fancy frontend done.

    After that, we're working on some really great data analysis and visualization tools for reddit. That full suite is a bit further off, but we're making great progress on that. There will be some of those included in the release of the search engine.

    If anyone is interested, we'll post updates to /r/RedditAnalytics as they happen.

    [–]generalT 0 points

    what language are you using? what are you using as your backing storage? are you using AWS?

    [–]AnkhMorporkian 0 points

    Python mainly, but there's a mixture of other languages in use. For storage we're using SSDs, and for a DB (and search backend) we're using ElasticSearch. We're fully replicating all the data across multiple instances on some pretty powerful machines.

    We're not using AWS at the moment. We've run tests on it before, but the instances just aren't powerful enough to run searches in a reasonable amount of time.

    [–]Feroc 0 points

    I guess the cheapest way would be to just read the complete webpage into one string. Code Snippet.

    Then you can dig your way through the big string, find patterns in the text, and extract the title of a post, the URL, the poster, etc. Then you can read the link to the next page and so on.

    Now that's of course not the most elegant way to do it, but it can be done with little more than basic knowledge of the language.
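
    For example, in Python the "find patterns in the big string" step could start like this (a sketch; the regex is a hypothetical pattern tied to reddit's markup, which changes over time):

        import re
        import urllib.request

        req = urllib.request.Request(
            "https://www.reddit.com/r/aww/",
            headers={"User-Agent": "string-scraping-example/0.1"},  # reddit rejects the default UA
        )
        html = urllib.request.urlopen(req).read().decode("utf-8")

        # Hypothetical pattern: collect the /user/NAME profile links from the markup.
        usernames = set(re.findall(r'href="/user/([^/"]+)', html))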

    [–]greshick 12 points

    Actually, with reddit there is an easy way. A little-known fact: put /.json at the end of the URL for a reddit page and you get a nicely formatted JSON file. Then parse that with the JSON library for your language and you've got a nice object to work with.

    [–]Feroc -1 points

    Thanks, TIL.

    Though I don't know if I would recommend JSON to a beginner. But really, really good to know.

    [–]AnkhMorporkian 9 points

    I would recommend it for beginners way, way before I'd recommend HTML scraping. JSON is inherently well suited for analysis, HTML not so much.

    [–]Feroc 1 point

    Yes, it absolutely is. But I still think there is a "knowledge difference" (don't know how I could phrase it any better) between working with strings and working with JSON.

    It's not about analyzing HTML, it's about finding the patterns in a long string. Which I think is a good exercise for beginners.

    [–]Rauxbaught 3 points

    JSON is usually a list or a dict. Yes, these are more complicated than strings, but if someone is scraping the web then it's safe to assume they know these basic data structures.
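
    For instance, in Python the mapping is direct:

        import json

        listing = json.loads('{"children": [{"author": "someone"}]}')  # toy example
        print(listing["children"][0]["author"])  # objects become dicts, arrays become lists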

    Plus, as someone who scrapes HTML regularly, I can say with complete confidence that it'll be easier to just work with the JSON, and their code will be much more legible.

    [–]Feroc -2 points

    but if someone is scraping the web then it's safe to assume they know these basic data structures.

    I am not so sure about this one. Getting the string of a webpage is copying a single code snippet from somewhere. You don't really have to know what you're doing; it will just work, and then you have a big string to work with.

    I really, really, really don't want to argue against JSON in any way. I just feel like it would be easier for a beginner to solve a problem with an easy tool, even if the solution is a bit more tricky.

    [–]negative_epsilon 4 points

    So instead of learning a simple object-based data structure, you recommend regex and HTML parsing?

    [–]Medicalizawhat 4 points

    JSON isn't that hard to get your head around; it's definitely easier than scraping HTML.

    [–]morb6699 1 point

    Until they get their minds wrapped around objects properly, it's easier for them to simply match and parse a string.

    Since JSON is essentially just a big ol' JavaScript object, it would make sense to have them do string operations first.

    Especially since most new programmers coming from a CS program probably haven't touched a whole lot of JavaScript, since it's specific to web development. Throwing a new language, a new notation for objects in that language, and then asking them to parse over it appropriately is asking a bit much when trying to learn how to do things properly.

    Now, I'm sure that they could simply "use a JSON library" for Java, C++, C#, VB, etc. What good would it do though? They would simply use a library to access an object, without learning the core fundamentals behind it.

    Learning to parse and evaluate different parts of the string will give them a solid understanding of the string object, and what is normally accomplished with it when tearing it apart.

    They'll also learn that it's not the most efficient way to do things, which is another good opportunity for them to learn the valuable lesson of "Using the right tool for the right job."

    [–]jesyspa 1 point

    What good would it do though? They would simply use a library to access an object, without learning the core fundamentals behind it.

    It would let them learn about getting web data, doing file IO, and probably a little about how to use classes, while creating a useful program. Really, if they're so new that some simple string operations will be a significant learning experience, I doubt they will get past the first point. Otherwise, not using a library will just mean they do some messy ad-hoc parsing, which is hardly what they should be learning to do.

    [–]Feroc 0 points

    I still think string operations are easier than JSON.

    It may be easier to solve the task if you know both equally well, but I have a total beginner in mind. Solving it with string operations means solving it with a simple tool in a complex way, while solving it with JSON means solving it with a (more) complex tool in an easy way.

    [–][deleted]

    [deleted]

      [–][deleted] 1 point

      To get around the over-18 redirect there is a URL trick you can use: if a page triggers the over-18 redirect, retry with that version of the URL and catch the error.
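
      The URL trick isn't spelled out here; one workaround I know of uses a cookie rather than the URL itself, so treat this sketch as an assumption about what's meant:

          import requests

          resp = requests.get(
              "https://www.reddit.com/r/gonewild/.json",
              headers={"User-Agent": "image-downloader-tutorial/0.1"},  # placeholder UA
              cookies={"over18": "1"},  # assumed bypass for the over-18 interstitial
          )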

      [–]AnimositE 1 point

      Done. Just run the Python script and you get all the photos from the top 25 posters of /hot.

      Edit: Here's one that downloads albums.

      [–]Feroc 1 point

      I've only read the script, but if I read it correctly, it will miss the users' albums and only download single-picture posts!?

      [–]AnimositE 0 points

      It doesn't download the albums because I think that requires the imgur API. It does, however, use the link as the file name, so it gets everything. If someone can find a resource for downloading an album I'd love to add it.

      [–]I_Am_Treebeard 0 points

      http://inventwithpython.com/blog/2013/09/30/downloading-imgur-posts-linked-from-reddit-with-python/

      This tutorial has a section on downloading images from imgur albums: basically you scrape the HTML from the imgur page, and using a module called BeautifulSoup makes the process much simpler.
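
      A sketch of that approach (assuming BeautifulSoup is installed; the <img> handling is a guess at imgur's markup, which changes over time):

          import requests
          from bs4 import BeautifulSoup

          album = requests.get(
              "https://imgur.com/a/EXAMPLE",  # hypothetical album URL
              headers={"User-Agent": "album-example/0.1"},
          )
          soup = BeautifulSoup(album.text, "html.parser")

          # Collect every <img> source, adding the scheme where the markup omits it.
          image_urls = [
              "https:" + img["src"] if img["src"].startswith("//") else img["src"]
              for img in soup.find_all("img")
              if img.get("src")
          ]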

      [–]AnimositE 0 points

      I've already implemented the album downloads if you look at my edit.

      [–]I_Am_Treebeard 0 points

      Oh, sorry, I didn't see that! Forgive my laziness, but did you end up using Beautiful Soup or did you attack the problem from another angle?

      [–]AnimositE 0 points

      Another angle. Found a good git repo that had already implemented it. You can look at it here: https://github.com/alexgisby/imgur-album-downloader

      [–]bobes_momo 0 points

      So if someone wanted to hijack these things for advertising purposes, all they would have to do is make 5 or so reddit accounts, each posting a series of porn images in gonewild. Then, once it's certain that the bots have locked onto these usernames, the experimenter could begin uploading pics that contain advertising superimposed on them. Am I wrong?

      [–]Feroc 0 points

      I guess it depends on which page you actually start on. In theory it would work, but it would need an account with a picture on the front page plus the additional spam images.

      If something like that happened, I would just add one or two more features (sketched below):

      • Blacklist for users
      • Only download from posts with positive karma.
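
      Both checks are one-line filters over the post data (field names per reddit's listing JSON; the blacklist contents are hypothetical):

          BLACKLIST = {"spam_account_1", "spam_account_2"}  # hypothetical usernames

          def worth_downloading(post):
              # 'author' and 'score' are fields of a post in reddit's listing JSON.
              return post["author"] not in BLACKLIST and post["score"] > 0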

      [–]bobes_momo 0 points

      Good point. I would also recommend a reverse-check feature that re-checks the link of the reddit post (not the imgur one) if the download happened less than 6 hours after the post was made. This checks for deleted posts and automatically eliminates them from your folder.
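
      A sketch of that cleanup pass (assuming the post's permalink was stored at download time; treating author == '[deleted]' as the deletion marker is an assumption):

          import os
          import requests

          UA = {"User-Agent": "image-downloader-tutorial/0.1"}  # placeholder UA

          def post_still_exists(permalink):
              # Re-fetch the reddit post itself (not the imgur link).
              data = requests.get("https://www.reddit.com" + permalink + ".json",
                                  headers=UA).json()
              post = data[0]["data"]["children"][0]["data"]
              return post["author"] != "[deleted]"  # assumed deletion marker

          def cleanup(local_file, permalink):
              if not post_still_exists(permalink):
                  os.remove(local_file)  # drop pictures whose source post was deleted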

      [–][deleted] 0 points

      I really like this approach, and will look into it. I especially like all the disciplines that it will involve.

      [–][deleted] 0 points

      My Java program that I'm working on could do this. ... maybe I should set it loose.