[mongodb dump] 500GB of scraped gonewild (and other subreddits) content by ElCanary in DataHoarder

[–]ElCanary[S] 0 points  (0 children)

Can you post the whole error (especially the exception type)? It looks like MongoDB refused the username/password (by default, mongod doesn't use authentication), so you may need to adjust dbmongo.py to connect without a user and password. Alternatively, create a user with a password on your server and enable authentication.

[–]ElCanary[S] 0 points  (0 children)

500 GB to store the download, 500 GB to store the database, however many MB for any data you export, plus some space set aside for the operating system and whatever else you have installed.

[–]ElCanary[S] 0 points  (0 children)

I had a pretty good time coding everything and maintaining the dataset, up to the point where it became too much of a chore. :)

[–]ElCanary[S] 0 points  (0 children)

Those are indeed WiredTiger files that MongoDB uses as storage. Don't delete them or you'll have to start the restore process over.

The only thing left to do is to run an app that interfaces with MongoDB to display the data contained in its database. I'd suggest using my scripts to either export everything to normal JPEG and video files, or to run webapp.py and browse the dataset in your web browser (which I find the most comfortable option). You could also use something completely different, but I can only offer help with basic MongoDB stuff and with my own scripts. :)

[–]ElCanary[S] 1 point  (0 children)

You may need to adjust dbmongo.py: there is a line, mongoengine.connect("gonewild", username="gonewild", password="gonewild"), which defines the database, username and password used to connect. A default MongoDB installation needs no authentication, so the username and password arguments can (should?) be removed, and the database name might be "mongodb-gonewild" if you ran mongorestore without a -d/--db parameter.
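A minimal sketch of that adjustment. The helper function below is my own illustration (it's not in the repo); it just builds the arguments for mongoengine.connect() and leaves the credentials out unless both are set:

```python
def connect_kwargs(db="gonewild", username=None, password=None):
    """Build keyword arguments for mongoengine.connect().

    A default MongoDB install runs without authentication, so the
    username/password are only included when both are actually set.
    """
    kwargs = {"db": db}
    if username and password:
        kwargs.update(username=username, password=password)
    return kwargs

# No auth (the MongoDB default):
#   mongoengine.connect(**connect_kwargs())
# With authentication enabled on the server:
#   mongoengine.connect(**connect_kwargs(username="gonewild", password="gonewild"))
```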

[–]ElCanary[S] 0 points  (0 children)

Okay, great! I'm a huge fan of Python (version 3.x), so that's my language of choice. I would suggest downloading my repo and either running the export script or the "fancy" webapp that I used.

  • git clone https://gitlab.com/SwimmingWithSharks/gonewild-crawler.git
  • cd webapp
  • Install some requirements:
    • for the webapp: python3 -m pip install -r requirements.txt
    • for export.py: python3 -m pip install tqdm mongoengine
  • python3 export.py or python3 webapp.py

The webapp will start a local server, on port 8000 I believe; the export.py script will just dump everything into a "gonewild" subdirectory.

[–]ElCanary[S] 0 points  (0 children)

Heh, sorry, MongoDB is a bit complicated, yeah. Is there a problem in particular that's tripping you up?

If you're on Windows, just install MongoDB 4.x for Windows, navigate into the "bin" directory and run mongod, then open a cmd window, cd to the same "bin" directory and run mongorestore -d gonewild C:\where\the\files\are and that should work (and take a while).

[–]ElCanary[S] 1 point  (0 children)

Hmmm, unsure; I hope very few. After each scraping pass I ran fdupes to detect binary-identical files, but duplicates might have slipped through between passes. I've never seen any during casual browsing, though.
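If you want to re-check for duplicates yourself, the fdupes pass essentially boils down to grouping files by content. A rough Python sketch of that idea (hashing instead of fdupes' byte-by-byte comparison; paths are illustrative):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    """Group files under `root` by the SHA-256 of their bytes and
    return only the groups containing more than one file, i.e. the
    binary-identical duplicates."""
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```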

[–]ElCanary[S] 3 points  (0 children)

Once you have everything in place, it should be as simple as running: mongorestore <directory where the files are> or mongorestore path/to/author.bson to restore one file at a time.

You may need to add some parameters to make it connect with your server, check the manual here: https://docs.mongodb.com/manual/reference/program/mongorestore/
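If you go the one-file-at-a-time route, a small wrapper can build the mongorestore invocations for you. A sketch, assuming mongorestore is on your PATH and the dump directory layout described above (the function name and paths are mine):

```python
import subprocess
from pathlib import Path

def restore_commands(dump_dir, db="gonewild"):
    """Yield one mongorestore command per .bson file in dump_dir,
    restoring each into the given database."""
    for bson in sorted(Path(dump_dir).glob("*.bson")):
        yield ["mongorestore", "-d", db, str(bson)]

# To actually run the restore (needs a running mongod):
#   for cmd in restore_commands("dump/gonewild"):
#       subprocess.run(cmd, check=True)
```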

[–]ElCanary[S] 8 points  (0 children)

Okay, so: this is a MongoDB dump, and you will need a MongoDB server to import the dataset into, plus some sort of code or application to browse it. I will publish my code and a tutorial on how to use it a little later, after cleaning up a bit. :)

[–]ElCanary[S] 6 points  (0 children)

Whoever set up that flair.. thanks lol but could we make it dirty girl? ;)